Rust AI Inference Migration Guide

Why migrate to Rust?

| Metric | Python | Rust | Improvement |
|---|---|---|---|
| p99 latency (small model) | 25–80ms | 2–8ms | 5–10x |
| Memory per instance | 400–800MB | 40–80MB | 10x |
| Requests/sec (single core) | 2,000–5,000 | 50,000–200,000 | 20–40x |
| Cold start time | 2–5s | 0.1–0.5s | 10x |
| Binary size | N/A (Python) | 5–20MB | N/A |

Migration strategy: strangler fig pattern

Phase 1: Dual-run
  Python server still handles all traffic
  Rust server mirrors traffic for validation (shadow mode)

Phase 2: Canary
  5% → 25% → 50% → 100% of traffic to Rust
  Monitor metrics at each step

Phase 3: Full migration
  Python server decommissioned
  Rust server handles 100%
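
The canary percentages in Phase 2 need a routing rule. One possible sketch, assuming every request carries a stable ID: hash the ID into a bucket from 0 to 99 and send it to Rust when the bucket falls below the current percentage. The names `Backend`, `route`, and `request_id` are hypothetical, and FNV-1a is used here only because it is small and gives the same result on every run; it is not something the original setup prescribes.

```rust
/// FNV-1a 64-bit hash: small, and stable across runs and platforms.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Python,
    Rust,
}

/// Route a request to Rust when its bucket (0..100) is below `rust_percent`.
/// The same ID always lands in the same bucket, so raising the percentage
/// only ever moves requests from Python to Rust, never back.
fn route(request_id: &str, rust_percent: u64) -> Backend {
    let bucket = fnv1a(request_id.as_bytes()) % 100;
    if bucket < rust_percent {
        Backend::Rust
    } else {
        Backend::Python
    }
}

fn main() {
    // Walk through the canary steps with synthetic request IDs.
    for percent in [5, 25, 50, 100] {
        let to_rust = (0..10_000)
            .filter(|i| route(&format!("req-{i}"), percent) == Backend::Rust)
            .count();
        println!("{percent}% target -> {to_rust} of 10000 requests routed to Rust");
    }
}
```

Deterministic bucketing keeps a given client on one backend within a step, which makes per-request comparisons and rollbacks easier to reason about than random sampling.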

Step 1: Port the inference logic

Python original:

python

# Python FastAPI version
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferRequest(BaseModel):
    input: list[float]

@app.post("/infer")
def infer(req: InferRequest) -> dict:
    arr = np.array(req.input)
    result = arr * 2.0 + 0.1  # Simplified model
    return {"output": result.tolist()}

Rust equivalent:

rust

use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct InferRequest {
    input: Vec<f32>,
}

#[derive(Serialize)]
struct InferResponse {
    output: Vec<f32>,
}

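/// Simplified model, mirroring `arr * 2.0 + 0.1` in the Python version.
/// NumPy computes this in float64, while this port uses f32, so the two
/// outputs agree only to within f32 precision.
fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}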

// In an axum handler (spawn_blocking keeps CPU-bound inference off the async worker threads):
// async fn infer(Json(req): Json<InferRequest>) -> Json<InferResponse> {
//     let output = tokio::task::spawn_blocking(move || run_inference(&req.input))
//         .await.unwrap();
//     Json(InferResponse { output })
// }

fn main() {
    let req = InferRequest { input: vec![1.0, 2.0, 3.0] };
    let resp = InferResponse { output: run_inference(&req.input) };
    println!("{}", serde_json::to_string(&resp).unwrap());
}
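
Because the Python version works in float64 and the Rust port in f32, the two will not be bit-identical, so a ported model is usually checked against recorded Python outputs with a tolerance. The following is a minimal sketch of such a parity check; the fixture values and the `max_abs_diff` helper are hypothetical, and the `1e-4` threshold is simply borrowed from the shadow-mode comparison later in this guide.

```rust
/// Same simplified model as the Rust port above.
fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

/// Largest absolute difference between f32 outputs and f64 reference values.
/// Returns None when the lengths differ.
fn max_abs_diff(actual: &[f32], expected: &[f64]) -> Option<f64> {
    if actual.len() != expected.len() {
        return None;
    }
    Some(
        actual
            .iter()
            .zip(expected)
            .map(|(a, e)| (f64::from(*a) - e).abs())
            .fold(0.0, f64::max),
    )
}

fn main() {
    // Hypothetical fixture: an input and the float64 output recorded from the Python service.
    let input = [1.0_f32, 2.0, 3.0];
    let recorded_python = [2.1_f64, 4.1, 6.1];

    let rust_out = run_inference(&input);
    match max_abs_diff(&rust_out, &recorded_python) {
        Some(diff) if diff < 1e-4 => println!("parity ok, max abs diff = {diff:e}"),
        Some(diff) => println!("parity FAILED, max abs diff = {diff:e}"),
        None => println!("parity FAILED, length mismatch"),
    }
}
```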

Step 2: Port input/output schemas

rust

use serde::{Deserialize, Serialize};

/// Match your existing Python/OpenAPI schema exactly during migration
/// so that clients need no changes.
#[derive(Debug, Deserialize)]
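// Rejects requests carrying unknown fields, which surfaces schema drift.
// Pydantic ignores extra fields by default, so remove this attribute if
// existing clients send fields the Python service silently dropped.
#[serde(deny_unknown_fields)]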
struct LegacyInferRequest {
    /// Model name — must match your Python API field names exactly
    model: String,
    /// Input tensor as flat array
    inputs: Vec<f32>,
    /// Optional top-k parameter
    #[serde(default = "default_top_k")]
    top_k: u32,
    /// Optional temperature (LLM inference)
    #[serde(default = "default_temperature")]
    temperature: f32,
}

fn default_top_k() -> u32 { 5 }
fn default_temperature() -> f32 { 1.0 }

#[derive(Debug, Serialize)]
struct LegacyInferResponse {
    model: String,
    outputs: Vec<f32>,
    /// Preserve all legacy fields to avoid breaking consumers
    latency_ms: f64,
    version: String,
}

fn main() {
    let json = r#"{
        "model": "resnet-50",
        "inputs": [0.1, 0.2, 0.3],
        "top_k": 3
    }"#;

    let req: LegacyInferRequest = serde_json::from_str(json).unwrap();
    println!("Parsed request: model={} inputs={:?}", req.model, req.inputs);

    let resp = LegacyInferResponse {
        model: req.model.clone(),
        outputs: req.inputs.iter().map(|x| x * 2.0).collect(),
        latency_ms: 1.5,
        version: "rust-v1".to_string(),
    };
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}
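
The example above hard-codes `latency_ms`. A real handler would have to measure it, and one way to do that with only the standard library is sketched below. The `timed` helper is hypothetical, and the sketch assumes the legacy field reports wall-clock milliseconds for the inference call alone; check what the Python service actually measures before matching it.

```rust
use std::time::Instant;

/// Run `f` and return its result together with the elapsed wall-clock time in milliseconds.
fn timed<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    let value = f();
    let latency_ms = start.elapsed().as_secs_f64() * 1000.0;
    (value, latency_ms)
}

fn main() {
    let inputs = vec![0.1_f32, 0.2, 0.3];
    // Time only the model computation, not parsing or serialization.
    let (outputs, latency_ms) =
        timed(|| inputs.iter().map(|x| x * 2.0).collect::<Vec<f32>>());
    println!("outputs={outputs:?} latency_ms={latency_ms:.3}");
}
```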

Step 3: Shadow mode validation

rust

use tokio::sync::mpsc;

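/// Shadow mode: run each request through both the old and the new implementation,
/// compare the outputs, and report discrepancies on `shadow_tx`.
/// This sketch returns the Rust result; in Phase 1 the Python response is the one served.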
async fn shadow_infer(
    input: Vec<f32>,
    shadow_tx: mpsc::Sender<(Vec<f32>, Vec<f32>)>,
) -> Vec<f32> {
    // Run new (Rust) implementation
    let rust_result: Vec<f32> = input.iter().map(|x| x * 2.0 + 0.1).collect();

    // Run old (Python) implementation via HTTP (fire-and-forget)
    let input_clone = input.clone();
    let rust_clone = rust_result.clone();
    tokio::spawn(async move {
        let python_result = call_python_inference(&input_clone).await;
        if !outputs_match(&rust_clone, &python_result, 1e-4) {
            let _ = shadow_tx.send((rust_clone, python_result)).await;
        }
    });

    rust_result
}

async fn call_python_inference(input: &[f32]) -> Vec<f32> {
    // Simulate Python API call
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

fn outputs_match(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() < tol)
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(Vec<f32>, Vec<f32>)>(100);

    let result = shadow_infer(vec![1.0, 2.0, 3.0], tx).await;
    println!("Inference result: {:?}", result);

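    // Drain discrepancies (would be logged/alerted in production). `recv` waits for the
    // spawned comparison task and returns `None` once it finishes and drops the sender,
    // whereas `try_recv` could run before the comparison completes and miss its report.
    while let Some((rust, python)) = rx.recv().await {
        eprintln!("⚠️  Output mismatch: rust={:?} python={:?}", rust, python);
    }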
}
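
Logging individual mismatches is only half of shadow validation; the cutover decision depends on the overall discrepancy rate. Below is a minimal sketch of a counter that could be shared across comparison tasks (for example behind an `Arc`). The `ShadowStats` type and its methods are hypothetical, and the simulated outcomes are made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts shadow comparisons and mismatches using lock-free counters.
#[derive(Default)]
struct ShadowStats {
    compared: AtomicU64,
    mismatched: AtomicU64,
}

impl ShadowStats {
    /// Record the outcome of one Rust-vs-Python comparison.
    fn record(&self, matched: bool) {
        self.compared.fetch_add(1, Ordering::Relaxed);
        if !matched {
            self.mismatched.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Fraction of comparisons that mismatched (0.0 when nothing was compared).
    fn discrepancy_rate(&self) -> f64 {
        let compared = self.compared.load(Ordering::Relaxed);
        if compared == 0 {
            return 0.0;
        }
        self.mismatched.load(Ordering::Relaxed) as f64 / compared as f64
    }
}

fn main() {
    let stats = ShadowStats::default();
    // Simulated outcomes: one mismatch in 2,000 comparisons.
    for i in 0..2_000 {
        stats.record(i != 0);
    }
    let rate = stats.discrepancy_rate();
    println!("discrepancy rate = {:.4}%", rate * 100.0);
    println!("below 0.1% threshold: {}", rate < 0.001);
}
```

In the shadow-mode code above, `record` would be called with the result of `outputs_match` for every comparison, not only for the mismatches sent over the channel.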

Migration checklist

[ ] Schema compatibility: Rust API accepts all existing request formats.
[ ] Shadow mode running for 48h with < 0.1% discrepancy rate.
[ ] Latency benchmarks show improvement in all percentiles.
[ ] Error codes and messages match the Python service.
[ ] Health check endpoints identical (/healthz, /readyz).
[ ] Logging format matches (for existing dashboards/alerts).
[ ] Load test at 2x production traffic before cutover.
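
The latency item in the checklist can be turned into an automated gate. The sketch below computes nearest-rank percentiles over two sets of latency samples and checks that the candidate is faster at every percentile of interest. The function names and the sample values are hypothetical; a real check would read samples from the load-test output.

```rust
/// Nearest-rank percentile of a sample set; `p` is expected in (0, 100].
/// Returns None for an empty sample set.
fn percentile(samples_ms: &[f64], p: f64) -> Option<f64> {
    if samples_ms.is_empty() {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: rank = ceil(p / 100 * n), clamped to 1..=n.
    let n = sorted.len();
    let rank = ((p / 100.0) * n as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, n) - 1])
}

/// True when the candidate is strictly faster than the baseline at every listed percentile.
fn improves_at_all(baseline_ms: &[f64], candidate_ms: &[f64], percentiles: &[f64]) -> bool {
    percentiles.iter().all(|&p| {
        match (percentile(baseline_ms, p), percentile(candidate_ms, p)) {
            (Some(b), Some(c)) => c < b,
            _ => false,
        }
    })
}

fn main() {
    // Made-up latency samples in milliseconds, for illustration only.
    let python = [30.0, 42.0, 35.0, 80.0, 28.0];
    let rust = [3.0, 4.5, 2.0, 8.0, 2.5];
    for p in [50.0, 95.0, 99.0] {
        println!(
            "p{p}: python={:?} rust={:?}",
            percentile(&python, p),
            percentile(&rust, p)
        );
    }
    println!(
        "improves at all percentiles: {}",
        improves_at_all(&python, &rust, &[50.0, 95.0, 99.0])
    );
}
```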
