Rust AI Inference Architecture
Design patterns and system architecture for building scalable AI inference services in Rust. Covers model serving, request routing, batching pipelines, and multi-model orchestration.
Overview
A production AI inference system in Rust typically has four layers: ingestion (the client-facing layer), scheduling, execution, and response. Keeping these layers separate lets each one be scaled and tuned on its own, which matters more and more as load grows from 100 to 100,000 requests per second.
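As a rough illustration of the scheduling layer, the sketch below groups incoming requests into batches bounded by a maximum size and a maximum wait time. It uses only the standard library so it runs on its own; `Request`, `collect_batch`, and the batch limits are hypothetical names and values chosen for this example, and a Tokio-based service would more likely use an async channel.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical request type: an id plus a flat feature vector.
struct Request {
    id: u64,
    input: Vec<f32>,
}

/// Collect up to `max_batch` requests, waiting at most `max_wait` after the
/// first request arrives. Returns `None` once the channel is closed and empty.
fn collect_batch(
    rx: &Receiver<Request>,
    max_batch: usize,
    max_wait: Duration,
) -> Option<Vec<Request>> {
    // Block until the first request arrives (or all senders are dropped).
    let first = rx.recv().ok()?;
    let deadline = Instant::now() + max_wait;
    let mut batch = vec![first];

    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(req) => batch.push(req),
            // Deadline reached or channel closed: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Producer: stands in for the ingestion layer.
    let producer = thread::spawn(move || {
        for id in 0..10 {
            tx.send(Request { id, input: vec![id as f32; 4] }).unwrap();
        }
        // `tx` is dropped here, which closes the channel.
    });
    producer.join().unwrap();

    // Consumer: stands in for the batcher feeding the execution layer.
    while let Some(batch) = collect_batch(&rx, 4, Duration::from_millis(5)) {
        let ids: Vec<u64> = batch.iter().map(|r| r.id).collect();
        let values: usize = batch.iter().map(|r| r.input.len()).sum();
        println!("batch of {} requests {:?} ({} values)", batch.len(), ids, values);
    }
}
```

The size cap bounds memory and per-batch latency, while the wait cap keeps a lone request from stalling until the batch fills. Both limits would need tuning against the actual model and hardware.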
Core architecture diagram
┌────────────────────────────────────────────────────────┐
│                      Client Layer                      │
│               HTTP/gRPC ──► Axum / Tonic               │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                    Scheduling Layer                    │
│   Priority Queue ──► Batcher ──► Concurrency Limiter   │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                    Execution Layer                     │
│        Model Registry ──► Inference Worker Pool        │
│        (candle / ort / tch) ──► GPU / CPU dispatch     │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                     Response Layer                     │
│         Result Cache ──► Serializer ──► Client         │
└────────────────────────────────────────────────────────┘
Runnable example: multi-model registry
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Trait that all inference models implement.
trait Model: Send + Sync {
    fn name(&self) -> &str;
    fn infer(&self, input: &[f32]) -> Vec<f32>;
}

/// Dummy sentiment model.
struct SentimentModel;

impl Model for SentimentModel {
    fn name(&self) -> &str { "sentiment-v1" }

    fn infer(&self, input: &[f32]) -> Vec<f32> {
        // Simplified: return [positive_score, negative_score].
        // Empty input would divide by zero and yield NaN, so return an
        // arbitrary neutral score instead.
        if input.is_empty() {
            return vec![0.5, 0.5];
        }
        let mean = input.iter().sum::<f32>() / input.len() as f32;
        // Clamp so both scores stay within [0, 1].
        let positive = mean.abs().min(1.0);
        vec![positive, 1.0 - positive]
    }
}

/// Dummy embedding model.
struct EmbeddingModel { dim: usize }

impl Model for EmbeddingModel {
    fn name(&self) -> &str { "embed-v2" }

    fn infer(&self, input: &[f32]) -> Vec<f32> {
        // Pad with zeros or truncate to the embedding dimension.
        let mut out = input.to_vec();
        out.resize(self.dim, 0.0);
        out
    }
}

/// Thread-safe model registry.
struct ModelRegistry {
    models: RwLock<HashMap<String, Arc<dyn Model>>>,
}

impl ModelRegistry {
    fn new() -> Self {
        Self { models: RwLock::new(HashMap::new()) }
    }

    fn register(&self, model: Arc<dyn Model>) {
        let mut map = self.models.write().unwrap();
        map.insert(model.name().to_string(), model);
    }

    fn infer(&self, model_name: &str, input: &[f32]) -> Option<Vec<f32>> {
        // Clone the Arc so the read lock is released before inference runs;
        // otherwise a slow model would block `register` for its whole duration.
        let model = self.models.read().unwrap().get(model_name).cloned()?;
        Some(model.infer(input))
    }

    fn list_models(&self) -> Vec<String> {
        self.models.read().unwrap().keys().cloned().collect()
    }
}

fn main() {
    let registry = ModelRegistry::new();
    registry.register(Arc::new(SentimentModel));
    registry.register(Arc::new(EmbeddingModel { dim: 8 }));

    // HashMap iteration order is unspecified, so the listing order may vary.
    println!("Registered models: {:?}", registry.list_models());

    let input = vec![0.1, 0.5, -0.3, 0.8];
    if let Some(scores) = registry.infer("sentiment-v1", &input) {
        println!("Sentiment scores: {:?}", scores);
    }
    if let Some(embed) = registry.infer("embed-v2", &input) {
        println!("Embedding (dim=8): {:?}", embed);
    }
}
Architecture decisions
When to use async vs sync for inference
| Scenario | Recommendation |
|----------|----------------|
| I/O-bound (HTTP, DB lookups) | async with Tokio |
| CPU-bound (matrix ops) | rayon thread pool or dedicated OS thread |
| GPU-bound (CUDA ops) | Blocking thread + channel to async runtime |
| Mixed (tokenize + infer + decode) | Pipeline stages with channels |
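The "Mixed" row can be sketched as one thread per stage, connected by channels. To stay self-contained, this example uses `std::thread` and `std::sync::mpsc` rather than Tokio or rayon. The `tokenize`, `infer`, and `decode` functions and the 0.4 threshold are toy stand-ins invented for illustration, not real model code.

```rust
use std::sync::mpsc;
use std::thread;

/// Toy tokenizer: maps each byte to a float in [0, 1].
fn tokenize(text: &str) -> Vec<f32> {
    text.bytes().map(|b| b as f32 / 255.0).collect()
}

/// Toy "model": returns the mean of the tokens (0.0 for empty input).
fn infer(tokens: &[f32]) -> f32 {
    if tokens.is_empty() {
        0.0
    } else {
        tokens.iter().sum::<f32>() / tokens.len() as f32
    }
}

/// Toy decoder: turns the score into a label.
fn decode(score: f32) -> &'static str {
    if score >= 0.4 { "high" } else { "low" }
}

/// Run tokenize -> infer -> decode as three stages joined by channels.
fn run_pipeline(inputs: Vec<String>) -> Vec<String> {
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let (tok_tx, tok_rx) = mpsc::channel::<Vec<f32>>();
    let (score_tx, score_rx) = mpsc::channel::<f32>();

    // Stage 1: tokenization.
    let tokenizer = thread::spawn(move || {
        for text in text_rx {
            if tok_tx.send(tokenize(&text)).is_err() {
                break;
            }
        }
    });

    // Stage 2: inference, the compute-heavy stage, on its own thread.
    let worker = thread::spawn(move || {
        for tokens in tok_rx {
            if score_tx.send(infer(&tokens)).is_err() {
                break;
            }
        }
    });

    for text in inputs {
        text_tx.send(text).unwrap();
    }
    // Closing the first channel shuts the stages down in order.
    drop(text_tx);

    // Stage 3: decoding, run on the calling thread.
    let labels: Vec<String> = score_rx.iter().map(|s| decode(s).to_string()).collect();

    tokenizer.join().unwrap();
    worker.join().unwrap();
    labels
}

fn main() {
    let labels = run_pipeline(vec!["hello".to_string(), "\u{1}\u{2}".to_string()]);
    println!("{:?}", labels);
}
```

Because each stage is a single thread reading from a FIFO channel, results come out in input order. Running several workers per stage would require tagging each item with an id to restore that order.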
Model loading strategy
use std::sync::{Arc, OnceLock};

// Load the model once, on first access, and share it via Arc across all worker threads.
// `Model` and `SentimentModel` are the types from the registry example above.
static MODEL: OnceLock<Arc<dyn Model>> = OnceLock::new();

fn get_model() -> &'static Arc<dyn Model> {
    MODEL.get_or_init(|| Arc::new(SentimentModel))
}
Request routing for multi-tenant serving
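#[derive(Debug)]
struct RoutingKey {
    model_id: String,
    version: u32,
    // Not used by the rules below; available for per-tenant policies.
    tenant_id: String,
}

/// Map a request to a worker pool by model family and version.
fn route_request(key: &RoutingKey) -> &'static str {
    match (key.model_id.as_str(), key.version) {
        // "gpt" version 4 and later goes to pool A; older versions go to pool B.
        ("gpt", v) if v >= 4 => "gpu-pool-a",
        ("gpt", _) => "gpu-pool-b",
        ("embed", _) => "cpu-pool",
        _ => "default-pool",
    }
}
Deployment topology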
- Single node: one Tokio runtime, `rayon` pool for compute, shared model cache.
- Multi-node: stateless inference workers behind a load balancer; model weights on shared NFS or downloaded at startup from object storage.
- GPU cluster: one worker per GPU, coordinated by a central scheduler using `tokio::sync::mpsc` channels.
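A minimal sketch of the GPU-cluster layout: a scheduler that owns one channel per worker and dispatches jobs round-robin. Plain threads and `std::sync::mpsc` stand in for GPU workers and Tokio channels so the example runs without dependencies. `Job`, `spawn_worker`, `run_cluster`, and the round-robin policy are assumptions made for illustration; a real scheduler would likely also weigh queue depth and device memory.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical unit of work plus a channel for sending the result back.
struct Job {
    id: u64,
    input: Vec<f32>,
    reply: Sender<(u64, usize, f32)>, // (job id, worker id, result)
}

/// Spawn one worker that owns one (simulated) device and drains its queue.
fn spawn_worker(worker_id: usize) -> (Sender<Job>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let handle = thread::spawn(move || {
        for job in rx {
            // Placeholder for the real forward pass on this worker's GPU.
            let result: f32 = job.input.iter().sum();
            // Ignore send errors: the requester may have gone away.
            let _ = job.reply.send((job.id, worker_id, result));
        }
    });
    (tx, handle)
}

/// Dispatch jobs round-robin over `num_workers`; results are sorted by job id.
fn run_cluster(num_workers: usize, inputs: Vec<Vec<f32>>) -> Vec<(u64, usize, f32)> {
    assert!(num_workers > 0, "need at least one worker");
    let (senders, handles): (Vec<_>, Vec<_>) = (0..num_workers).map(spawn_worker).unzip();
    let (reply_tx, reply_rx) = mpsc::channel();

    for (i, input) in inputs.into_iter().enumerate() {
        let job = Job { id: i as u64, input, reply: reply_tx.clone() };
        senders[i % num_workers].send(job).unwrap();
    }

    // Drop all senders so the workers exit and the reply channel closes.
    drop(senders);
    drop(reply_tx);

    let mut results: Vec<_> = reply_rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results.sort_by_key(|r| r.0);
    results
}

fn main() {
    let inputs = vec![vec![1.0, 2.0], vec![3.0], vec![4.0, 5.0], vec![6.0]];
    for (job, worker, result) in run_cluster(2, inputs) {
        println!("job {} ran on worker {}: {}", job, worker, result);
    }
}
```

Giving each worker its own queue keeps a device's work serialized without locks. The trade-off is that a slow job delays everything queued behind it on that worker.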