Rust AI Inference Architecture

Overview

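A production AI inference system in Rust typically has four layers: ingestion (the client layer in the diagram below), scheduling, execution, and response. Keeping these layers separate lets you size and tune each one independently, which is often the difference between a service that handles hundreds of requests per second and one that handles many thousands.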

Core architecture diagram

text

┌────────────────────────────────────────────────────────┐
│                    Client Layer                         │
│          HTTP/gRPC  ──►  Axum / Tonic                  │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                 Scheduling Layer                        │
│   Priority Queue  ──►  Batcher  ──►  Concurrency Limiter│
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                 Execution Layer                         │
│   Model Registry  ──►  Inference Worker Pool           │
│   (candle / ort / tch)  ──►  GPU / CPU dispatch        │
└────────────────────────────┬───────────────────────────┘
                             │
┌────────────────────────────▼───────────────────────────┐
│                  Response Layer                         │
│     Result Cache  ──►  Serializer  ──►  Client         │
└────────────────────────────────────────────────────────┘
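
The Batcher box in the scheduling layer can be sketched with the standard library alone. This is a minimal illustration, not a production scheduler: the names `next_batch` and `batch_sizes`, the 5 ms wait, and the use of `std::sync::mpsc` are all assumptions made for the example. The idea is to block for the first request, then keep collecting until the batch is full or a short deadline passes.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect up to `max_batch` items, waiting at most `max_wait` after the
/// first item arrives. Returns `None` once the channel is closed and drained.
/// (Hypothetical helper, not part of any library.)
fn next_batch<T>(rx: &Receiver<T>, max_batch: usize, max_wait: Duration) -> Option<Vec<T>> {
    // Block until the first request arrives, or stop if every sender is gone.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    let deadline = Instant::now() + max_wait;

    while batch.len() < max_batch {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or all senders dropped: ship what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}

/// Queue `n_requests` dummy requests, then drain them in batches and
/// return the size of each batch.
fn batch_sizes(n_requests: usize, max_batch: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    for i in 0..n_requests {
        tx.send(i).unwrap();
    }
    drop(tx); // close the channel so the loop below terminates

    let mut sizes = Vec::new();
    while let Some(batch) = next_batch(&rx, max_batch, Duration::from_millis(5)) {
        sizes.push(batch.len());
    }
    sizes
}
```

Calling `batch_sizes(5, 2)` queues five requests and drains them with a batch limit of two. In a real service the batch limit and wait time would be tuned against the model's latency budget.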

Runnable example — multi-model registry

rust

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Trait that all inference models implement
trait Model: Send + Sync {
    fn name(&self) -> &str;
    fn infer(&self, input: &[f32]) -> Vec<f32>;
}

/// Dummy sentiment model
struct SentimentModel;
impl Model for SentimentModel {
    fn name(&self) -> &str { "sentiment-v1" }
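    fn infer(&self, input: &[f32]) -> Vec<f32> {
        // Simplified: return [positive_score, negative_score].
        // Guard against empty input (0.0 / 0.0 would yield NaN).
        if input.is_empty() {
            return vec![0.0, 1.0];
        }
        let mean = input.iter().sum::<f32>() / input.len() as f32;
        // Clamp so both scores stay within [0, 1].
        let positive = mean.abs().min(1.0);
        vec![positive, 1.0 - positive]
    }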
}

/// Dummy embedding model
struct EmbeddingModel { dim: usize }
impl Model for EmbeddingModel {
    fn name(&self) -> &str { "embed-v2" }
    fn infer(&self, input: &[f32]) -> Vec<f32> {
        // Pad or truncate to embedding dim
        let mut out = input.to_vec();
        out.resize(self.dim, 0.0);
        out
    }
}

/// Thread-safe model registry
struct ModelRegistry {
    models: RwLock<HashMap<String, Arc<dyn Model>>>,
}

impl ModelRegistry {
    fn new() -> Self {
        Self { models: RwLock::new(HashMap::new()) }
    }

    fn register(&self, model: Arc<dyn Model>) {
        let mut map = self.models.write().unwrap();
        map.insert(model.name().to_string(), model);
    }

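    fn infer(&self, model_name: &str, input: &[f32]) -> Option<Vec<f32>> {
        // Clone the Arc and release the read lock before running inference,
        // so a slow model call never blocks `register`.
        let model = self.models.read().unwrap().get(model_name).cloned()?;
        Some(model.infer(input))
    }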

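    fn list_models(&self) -> Vec<String> {
        // HashMap iteration order is unspecified; sort for stable output.
        let mut names: Vec<String> = self.models.read().unwrap().keys().cloned().collect();
        names.sort();
        names
    }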
}

fn main() {
    let registry = ModelRegistry::new();

    registry.register(Arc::new(SentimentModel));
    registry.register(Arc::new(EmbeddingModel { dim: 8 }));

    println!("Registered models: {:?}", registry.list_models());

    let input = vec![0.1, 0.5, -0.3, 0.8];

    if let Some(scores) = registry.infer("sentiment-v1", &input) {
        println!("Sentiment scores: {:?}", scores);
    }

    if let Some(embed) = registry.infer("embed-v2", &input) {
        println!("Embedding (dim=8): {:?}", embed);
    }
}
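
The registry above is exercised from a single thread. The sketch below shows one way it might be shared across worker threads. It is a trimmed-down variant under stated assumptions: the `Doubler` model, the `Registry` type alias, and the `run_concurrently` helper are hypothetical names introduced for illustration. The lookup clones the `Arc` out of the map so the read lock is not held while inference runs.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

trait Model: Send + Sync {
    fn infer(&self, input: &[f32]) -> Vec<f32>;
}

/// Hypothetical stand-in model: doubles every input value.
struct Doubler;
impl Model for Doubler {
    fn infer(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|x| x * 2.0).collect()
    }
}

type Registry = RwLock<HashMap<String, Arc<dyn Model>>>;

/// Look up a model, release the lock, then run inference.
fn infer(registry: &Registry, name: &str, input: &[f32]) -> Option<Vec<f32>> {
    let model = registry.read().unwrap().get(name).cloned()?;
    Some(model.infer(input))
}

/// Send the same request from `n_threads` threads against one shared registry.
fn run_concurrently(n_threads: usize) -> Vec<Option<Vec<f32>>> {
    let registry: Arc<Registry> = Arc::new(RwLock::new(HashMap::new()));
    registry
        .write()
        .unwrap()
        .insert("doubler".to_string(), Arc::new(Doubler) as Arc<dyn Model>);

    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let registry = Arc::clone(&registry);
            thread::spawn(move || infer(&registry, "doubler", &[1.0, 2.5]))
        })
        .collect();

    // Join in spawn order so results line up with thread indices.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because each request holds its own `Arc` to the model, a model can be replaced in the map while in-flight requests finish against the old instance.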

Architecture decisions

When to use async vs sync for inference

| Scenario | Recommendation |
|----------|----------------|
| I/O-bound (HTTP, DB lookups) | async with Tokio |
| CPU-bound (matrix ops) | rayon thread pool or dedicated OS thread |
| GPU-bound (CUDA ops) | Blocking thread + channel to async runtime |
| Mixed (tokenize + infer + decode) | Pipeline stages with channels |
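
The "blocking thread + channel" row can be illustrated without any async runtime. The sketch below uses only the standard library: a dedicated OS thread owns the compute and each job carries its own reply channel. `Job`, `blocking_infer`, `spawn_worker`, and `submit` are hypothetical names for this example. In a Tokio service the reply side would more likely be a `tokio::sync::oneshot` channel awaited by the request handler.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// One unit of work plus a channel on which to send the result back.
struct Job {
    input: Vec<f32>,
    reply: Sender<Vec<f32>>,
}

/// Hypothetical stand-in for a blocking CPU or GPU call: squares each value.
fn blocking_infer(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * x).collect()
}

/// Spawn a dedicated OS thread that serves jobs in arrival order.
fn spawn_worker() -> (Sender<Job>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Job>();
    let handle = thread::spawn(move || {
        // The loop ends when every `Sender<Job>` has been dropped.
        for job in rx {
            let output = blocking_infer(&job.input);
            // Ignore the error if the caller stopped waiting for the reply.
            let _ = job.reply.send(output);
        }
    });
    (tx, handle)
}

/// Submit one request and block until its reply arrives.
fn submit(worker: &Sender<Job>, input: Vec<f32>) -> Option<Vec<f32>> {
    let (reply_tx, reply_rx) = mpsc::channel();
    worker.send(Job { input, reply: reply_tx }).ok()?;
    reply_rx.recv().ok()
}
```

A caller would create the worker once with `spawn_worker()`, pass the `Sender<Job>` to request handlers, and drop every sender at shutdown so the worker thread exits and can be joined.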

Model loading strategy

rust

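use std::sync::{Arc, OnceLock};

// `Model` and `SentimentModel` are the types from the registry example above.
// The model is loaded once, on first access, and then shared via Arc across
// all worker threads. Call `get_model()` during startup to load it eagerly.
static MODEL: OnceLock<Arc<dyn Model>> = OnceLock::new();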

fn get_model() -> &'static Arc<dyn Model> {
    MODEL.get_or_init(|| Arc::new(SentimentModel))
}
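
To see that the `OnceLock` pattern runs the loader only once under concurrent access, here is a small self-contained sketch. The `Weights` struct, the `LOAD_COUNT` counter, and the helper names are assumptions added for the demonstration. They stand in for a real model and its loading code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Hypothetical weights container standing in for a real model.
struct Weights {
    values: Vec<f32>,
}

/// Counts how many times the loader actually runs.
static LOAD_COUNT: AtomicUsize = AtomicUsize::new(0);
static WEIGHTS: OnceLock<Arc<Weights>> = OnceLock::new();

/// Pretend to do an expensive load.
fn load_weights() -> Arc<Weights> {
    LOAD_COUNT.fetch_add(1, Ordering::SeqCst);
    Arc::new(Weights { values: vec![0.5; 4] })
}

/// First caller triggers the load; later callers get the cached value.
fn get_weights() -> &'static Arc<Weights> {
    WEIGHTS.get_or_init(load_weights)
}

/// Touch the weights from several threads and report how many loads happened.
fn loads_after_concurrent_access(n_threads: usize) -> usize {
    let handles: Vec<_> = (0..n_threads)
        .map(|_| thread::spawn(|| get_weights().values.len()))
        .collect();
    for h in handles {
        // Every thread sees the same fully initialized weights.
        assert_eq!(h.join().unwrap(), 4);
    }
    LOAD_COUNT.load(Ordering::SeqCst)
}
```

`OnceLock::get_or_init` guarantees that only one initializer runs even when several threads race on the first access. The other threads block until the value is ready.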

Request routing for multi-tenant serving

rust

#[derive(Debug)]
struct RoutingKey {
    model_id: String,
    version: u32,
    tenant_id: String,
}

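/// Pick a worker pool from the model id and version.
/// `tenant_id` is not consulted here; it is available for per-tenant policy.
fn route_request(key: &RoutingKey) -> &'static str {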
    match (key.model_id.as_str(), key.version) {
        ("gpt", v) if v >= 4 => "gpu-pool-a",
        ("gpt", _) => "gpu-pool-b",
        ("embed", _) => "cpu-pool",
        _ => "default-pool",
    }
}
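
The routing function above carries a `tenant_id` but routes on model and version only. One possible extension, shown here purely as a sketch, checks a per-tenant override table first and falls back to model-based routing. The `pinned` map, the pool name `dedicated-pool-acme`, and the helpers `route_with_tenant` and `key` are hypothetical.

```rust
use std::collections::HashMap;

struct RoutingKey {
    model_id: String,
    version: u32,
    tenant_id: String,
}

/// Model-based routing with the same rules as the example above.
fn route_by_model(key: &RoutingKey) -> &'static str {
    match (key.model_id.as_str(), key.version) {
        ("gpt", v) if v >= 4 => "gpu-pool-a",
        ("gpt", _) => "gpu-pool-b",
        ("embed", _) => "cpu-pool",
        _ => "default-pool",
    }
}

/// Hypothetical tenant override: a pinned tenant always gets its own pool,
/// everyone else falls through to model-based routing.
fn route_with_tenant(key: &RoutingKey, pinned: &HashMap<String, &'static str>) -> &'static str {
    pinned
        .get(key.tenant_id.as_str())
        .copied()
        .unwrap_or_else(|| route_by_model(key))
}

/// Convenience constructor for examples and tests.
fn key(model_id: &str, version: u32, tenant_id: &str) -> RoutingKey {
    RoutingKey {
        model_id: model_id.to_string(),
        version,
        tenant_id: tenant_id.to_string(),
    }
}
```

Keeping the override table separate from the match arms means tenant pinning can change at runtime without touching the model routing rules.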

Deployment topology

Single node: one Tokio runtime, rayon pool for compute, shared model cache.
Multi-node: stateless inference workers behind a load balancer; model weights on shared NFS or downloaded at startup from object storage.
GPU cluster: one worker per GPU, coordinated by a central scheduler using tokio::sync::mpsc channels.
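
The GPU cluster topology can be sketched in miniature with threads standing in for GPU workers. This example uses `std::sync::mpsc` rather than Tokio channels to stay dependency-free. The `Request` struct, the `run_cluster` function, and the round-robin policy are assumptions for illustration, and the "GPU work" is a placeholder sum.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// A request tagged with an id so results can be matched up later.
struct Request {
    id: usize,
    input: Vec<f32>,
}

/// Spawn one worker per device, dispatch requests round-robin, and return
/// sorted `(request_id, device)` pairs showing where each request ran.
fn run_cluster(n_devices: usize, n_requests: usize) -> Vec<(usize, usize)> {
    assert!(n_devices > 0, "need at least one device");
    let (result_tx, result_rx) = mpsc::channel();
    let mut queues: Vec<Sender<Request>> = Vec::new();
    let mut handles = Vec::new();

    for device in 0..n_devices {
        let (tx, rx) = mpsc::channel::<Request>();
        let result_tx = result_tx.clone();
        queues.push(tx);
        handles.push(thread::spawn(move || {
            for req in rx {
                // Placeholder for real work on `device`.
                let _sum: f32 = req.input.iter().sum();
                result_tx.send((req.id, device)).unwrap();
            }
        }));
    }
    drop(result_tx); // only the workers hold result senders now

    // Central scheduler: round-robin over the per-device queues.
    for id in 0..n_requests {
        let request = Request { id, input: vec![1.0; 4] };
        queues[id % n_devices].send(request).unwrap();
    }
    drop(queues); // close every queue so the workers exit

    // Ends once every worker has finished and dropped its sender.
    let mut assignments: Vec<(usize, usize)> = result_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    assignments.sort();
    assignments
}
```

A real scheduler would likely pick the least-loaded device instead of strict round-robin, and would use bounded channels so a slow GPU applies backpressure.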

