Rust AI Inference Migration Guide
How to migrate your AI inference service to Rust from Python (FastAPI, Flask), Go, or Java. Step-by-step migration strategy, API compatibility, and performance comparison.
Why migrate to Rust?
| Metric | Python (FastAPI) | Rust (Axum) | Improvement |
|---|---|---|---|
| p99 latency (small model) | 25–80ms | 2–8ms | 5–10x |
| Memory per instance | 400–800MB | 40–80MB | 10x |
| Requests/sec (single core) | 2,000–5,000 | 50,000–200,000 | 20–40x |
| Cold start time | 2–5s | 0.1–0.5s | 10x |
| Binary size | N/A (Python) | 5–20MB | N/A |
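Figures like these depend heavily on model size, hardware, and workload, so it is worth re-measuring them on your own service before and after the migration. The following is a minimal sketch of how p50/p99 latency could be computed from recorded samples using the nearest-rank method. The names `percentile`, `summarize`, and `LatencySummary` are illustrative and not part of any library.

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples.
/// `p` must be in (0, 100]. Returns `None` for an empty slice or an out-of-range `p`.
fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest-rank: the smallest sample with at least p% of all samples at or below it.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Hypothetical summary of one benchmark run.
#[derive(Debug, PartialEq)]
struct LatencySummary {
    p50: Duration,
    p99: Duration,
    max: Duration,
}

/// Summarize a run; returns `None` when no samples were recorded.
fn summarize(samples: &[Duration]) -> Option<LatencySummary> {
    Some(LatencySummary {
        p50: percentile(samples, 50.0)?,
        p99: percentile(samples, 99.0)?,
        max: *samples.iter().max()?,
    })
}
```

Feeding both services the same recorded request set and comparing the two `LatencySummary` values gives a like-for-like comparison; a single average would hide the tail behavior that the p99 row describes.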
Migration strategy: strangler fig pattern
Phase 1: Dual-run
Python server still handles all traffic
Rust server mirrors traffic for validation (shadow mode)
Phase 2: Canary
5% → 25% → 50% → 100% of traffic to Rust
Monitor metrics at each step
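One possible way to implement the canary percentages is to hash a stable request key into a bucket from 0 to 99 and send buckets below the current percentage to Rust. This is only a sketch: the `Backend` enum, `bucket`, and `route` are hypothetical names, and in practice the split is often done in a load balancer or service mesh instead of application code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Which implementation should serve a request (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Python,
    Rust,
}

/// Map a request key (for example a user or tenant ID) to a bucket in 0..100.
/// `DefaultHasher::new()` uses fixed keys, so the result is stable within one
/// build, but the algorithm is not guaranteed to stay the same across Rust
/// releases. A pinned hash function would be a safer choice in production.
fn bucket(request_key: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    request_key.hash(&mut hasher);
    hasher.finish() % 100
}

/// Route to Rust when the key's bucket falls below `rust_percent` (capped at 100).
fn route(request_key: &str, rust_percent: u64) -> Backend {
    if bucket(request_key) < rust_percent.min(100) {
        Backend::Rust
    } else {
        Backend::Python
    }
}
```

Because the bucket depends only on the key, a caller that was routed to Rust at 5% stays on Rust at 25% and 50%, which keeps per-client behavior consistent while the rollout widens.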
Phase 3: Full migration
Python server decommissioned
Rust server handles 100% of traffic
Step 1: Port the inference logic
Python original:
# Python FastAPI version
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferRequest(BaseModel):
    input: list[float]

@app.post("/infer")
def infer(req: InferRequest) -> dict:
    arr = np.array(req.input)
    result = arr * 2.0 + 0.1  # Simplified model
    return {"output": result.tolist()}

Rust equivalent:
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct InferRequest {
    input: Vec<f32>,
}

#[derive(Serialize)]
struct InferResponse {
    output: Vec<f32>,
}

/// Same simplified model as the Python version: y = 2x + 0.1
fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

// In an axum handler:
// async fn infer(Json(req): Json<InferRequest>) -> Json<InferResponse> {
//     let output = tokio::task::spawn_blocking(move || run_inference(&req.input))
//         .await.unwrap();
//     Json(InferResponse { output })
// }

fn main() {
    let req = InferRequest { input: vec![1.0, 2.0, 3.0] };
    let resp = InferResponse { output: run_inference(&req.input) };
    println!("{}", serde_json::to_string(&resp).unwrap());
}

Step 2: Port input/output schemas
use serde::{Deserialize, Serialize};

/// Match your existing Python/OpenAPI schema exactly during migration
/// so that no client changes are needed
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)] // Catch schema drift during migration
struct LegacyInferRequest {
    /// Model name: must match your Python API field names exactly
    model: String,
    /// Input tensor as flat array
    inputs: Vec<f32>,
    /// Optional top-k parameter
    #[serde(default = "default_top_k")]
    top_k: u32,
    /// Optional temperature (LLM inference)
    #[serde(default = "default_temperature")]
    temperature: f32,
}

fn default_top_k() -> u32 { 5 }
fn default_temperature() -> f32 { 1.0 }

#[derive(Debug, Serialize)]
struct LegacyInferResponse {
    model: String,
    outputs: Vec<f32>,
    /// Preserve all legacy fields to avoid breaking consumers
    latency_ms: f64,
    version: String,
}

fn main() {
    // `temperature` is omitted here, so it falls back to its default
    let json = r#"{
        "model": "resnet-50",
        "inputs": [0.1, 0.2, 0.3],
        "top_k": 3
    }"#;
    let req: LegacyInferRequest = serde_json::from_str(json).unwrap();
    println!("Parsed request: model={} inputs={:?}", req.model, req.inputs);
    let resp = LegacyInferResponse {
        model: req.model.clone(),
        outputs: req.inputs.iter().map(|x| x * 2.0).collect(),
        latency_ms: 1.5,
        version: "rust-v1".to_string(),
    };
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}

Step 3: Shadow mode validation
use tokio::sync::mpsc;

/// Shadow mode: send each request to both old and new implementation,
/// compare outputs, log discrepancies
async fn shadow_infer(
    input: Vec<f32>,
    shadow_tx: mpsc::Sender<(Vec<f32>, Vec<f32>)>,
) -> Vec<f32> {
    // Run new (Rust) implementation
    let rust_result: Vec<f32> = input.iter().map(|x| x * 2.0 + 0.1).collect();
    // Run old (Python) implementation in a background task so the comparison
    // never adds latency to the response (fire-and-forget)
    let input_clone = input.clone();
    let rust_clone = rust_result.clone();
    tokio::spawn(async move {
        let python_result = call_python_inference(&input_clone).await;
        if !outputs_match(&rust_clone, &python_result, 1e-4) {
            let _ = shadow_tx.send((rust_clone, python_result)).await;
        }
    });
    rust_result
}

async fn call_python_inference(input: &[f32]) -> Vec<f32> {
    // Simulate the Python API call; a real deployment would make an HTTP request here
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

/// Element-wise comparison with an absolute tolerance.
/// A NaN on either side counts as a mismatch.
fn outputs_match(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() < tol)
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(Vec<f32>, Vec<f32>)>(100);
    let result = shadow_infer(vec![1.0, 2.0, 3.0], tx).await;
    println!("Inference result: {:?}", result);
    // The sender was moved into the background comparison task, so `recv`
    // returns `None` once that task finishes. Using `recv().await` instead of
    // `try_recv` ensures the comparison has completed before main exits.
    // In production, discrepancies would be logged/alerted by a long-running task.
    while let Some((rust, python)) = rx.recv().await {
        eprintln!("⚠️ Output mismatch: rust={:?} python={:?}", rust, python);
    }
}

Migration checklist
- [ ] Schema compatibility: Rust API accepts all existing request formats.
- [ ] Shadow mode running for 48h with < 0.1% discrepancy rate.
- [ ] Latency benchmarks show improvement in all percentiles.
- [ ] Error codes and messages match the Python service.
- [ ] Health check endpoints identical (/healthz, /readyz).
- [ ] Logging format matches (for existing dashboards/alerts).
- [ ] Load test at 2x production traffic before cutover.
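
The discrepancy-rate item in the checklist can be turned into an automated gate. The sketch below assumes the shadow comparison reports each result to a shared counter; `ShadowStats`, `record`, and `passes_gate` are hypothetical names, and the 48h window itself is not modeled here (it would come from when the counters are reset or exported).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Thread-safe counters for shadow-mode comparisons (hypothetical type).
#[derive(Default)]
struct ShadowStats {
    compared: AtomicU64,
    mismatched: AtomicU64,
}

impl ShadowStats {
    /// Record the outcome of one Rust-vs-Python comparison.
    fn record(&self, matched: bool) {
        self.compared.fetch_add(1, Ordering::Relaxed);
        if !matched {
            self.mismatched.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Fraction of compared requests whose outputs differed,
    /// or `None` if nothing has been compared yet.
    fn discrepancy_rate(&self) -> Option<f64> {
        let compared = self.compared.load(Ordering::Relaxed);
        if compared == 0 {
            return None;
        }
        Some(self.mismatched.load(Ordering::Relaxed) as f64 / compared as f64)
    }

    /// True only when at least `min_samples` comparisons were made and the
    /// discrepancy rate is strictly below `max_rate` (0.001 for 0.1%).
    fn passes_gate(&self, max_rate: f64, min_samples: u64) -> bool {
        self.compared.load(Ordering::Relaxed) >= min_samples
            && self.discrepancy_rate().map_or(false, |rate| rate < max_rate)
    }
}
```

Requiring a minimum sample count keeps a quiet period with a handful of requests from passing the gate by accident. Wrapping the struct in an `Arc` lets every request handler share one instance.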