Rust By Example

Rust AI Inference Migration Guide

How to migrate your AI inference service to Rust from Python (FastAPI, Flask), Go, or Java. Step-by-step migration strategy, API compatibility, and performance comparison.

Topic: AI Inference

Why migrate to Rust?

| Metric | Python (FastAPI) | Rust (Axum) | Improvement |
|---|---|---|---|
| p99 latency (small model) | 25–80 ms | 2–8 ms | 5–10x |
| Memory per instance | 400–800 MB | 40–80 MB | 10x |
| Requests/sec (single core) | 2,000–5,000 | 50,000–200,000 | 20–40x |
| Cold start time | 2–5 s | 0.1–0.5 s | 10x |
| Binary size | N/A (Python) | 5–20 MB | N/A |

Migration strategy: strangler fig pattern

text
Phase 1: Dual-run
  Python server still handles all traffic
  Rust server mirrors traffic for validation (shadow mode)

Phase 2: Canary
  Shift 5% → 25% → 50% → 100% of traffic to Rust
  Monitor metrics at each step

Phase 3: Full migration
  Python server decommissioned
  Rust server handles 100%
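
The canary phase needs some rule for deciding which requests go to the Rust server at each step. One possible approach, sketched below, is to hash a stable routing key into one of 100 buckets, so a client that was routed to Rust at 5% stays on Rust at 25% and beyond. The names `route_to_rust` and `fnv1a`, and the `client-N` keys, are hypothetical and only for illustration; in practice this split often lives in the load balancer rather than in application code.

```rust
/// FNV-1a hash: small, deterministic, and stable across builds, so a given
/// routing key always lands in the same bucket.
fn fnv1a(key: &str) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for byte in key.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
    }
    hash
}

/// Returns true when the request should be served by the Rust backend.
/// `rust_percent` is the current canary step (for example 5, 25, 50 or 100).
fn route_to_rust(routing_key: &str, rust_percent: u64) -> bool {
    fnv1a(routing_key) % 100 < rust_percent
}

fn main() {
    // Hypothetical client IDs, used only to show how the split behaves.
    for percent in [5, 25, 50, 100] {
        let routed = (0..10_000)
            .filter(|i| route_to_rust(&format!("client-{i}"), percent))
            .count();
        println!("{percent:>3}% step: {routed} of 10000 sample keys go to Rust");
    }
}
```

Because the bucket depends only on the key, raising the percentage only ever adds clients to the Rust side, which keeps per-client behavior stable while metrics are compared between steps.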

Step 1: Port the inference logic

Python original:

python
# Python FastAPI version
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferRequest(BaseModel):
    input: list[float]

@app.post("/infer")
def infer(req: InferRequest) -> dict:
    arr = np.array(req.input)
    result = arr * 2.0 + 0.1  # Simplified model
    return {"output": result.tolist()}

Rust equivalent:

rust
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct InferRequest {
    input: Vec<f32>,
}

#[derive(Serialize)]
struct InferResponse {
    output: Vec<f32>,
}

fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

// In an axum handler:
// async fn infer(Json(req): Json<InferRequest>) -> Json<InferResponse> {
//     let output = tokio::task::spawn_blocking(move || run_inference(&req.input))
//         .await.unwrap();
//     Json(InferResponse { output })
// }

fn main() {
    let req = InferRequest { input: vec![1.0, 2.0, 3.0] };
    let resp = InferResponse { output: run_inference(&req.input) };
    println!("{}", serde_json::to_string(&resp).unwrap());
}
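
One detail worth checking when porting: `np.array` built from Python floats defaults to float64, while the Rust version above computes in `f32`. The two will rarely agree bit for bit, so comparisons need a tolerance. The sketch below estimates how large the gap is for the simplified model; `reference_f64` and `max_abs_diff` are hypothetical helper names, and the f64 function stands in for the Python output rather than calling the real service.

```rust
/// Reference computation in f64, mirroring NumPy's default float64 dtype.
fn reference_f64(input: &[f64]) -> Vec<f64> {
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

/// The ported f32 computation from the Rust example above.
fn run_inference(input: &[f32]) -> Vec<f32> {
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

/// Largest absolute difference between the f32 output and the f64 reference.
/// Assumes both slices have the same length; extra elements are ignored.
fn max_abs_diff(rust: &[f32], reference: &[f64]) -> f64 {
    rust.iter()
        .zip(reference)
        .map(|(r, p)| (f64::from(*r) - p).abs())
        .fold(0.0, f64::max)
}

fn main() {
    let input = [1.0_f32, 2.0, 3.0];
    let input_f64: Vec<f64> = input.iter().map(|x| f64::from(*x)).collect();
    let diff = max_abs_diff(&run_inference(&input), &reference_f64(&input_f64));
    println!("max abs diff between f32 port and f64 reference: {diff:e}");
}
```

If the measured difference is too large for downstream consumers, switching the Rust structs to `Vec<f64>` is the simplest way to match the Python numerics, at the cost of more memory per request.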

Step 2: Port input/output schemas

rust
use serde::{Deserialize, Serialize};

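/// Match your existing Python/OpenAPI schema exactly during migration
/// so that clients need no changes.
#[derive(Debug, Deserialize)]
// Catches schema drift during migration. Note that pydantic ignores unknown
// fields by default, so drop this attribute if existing clients send extras.
#[serde(deny_unknown_fields)]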
struct LegacyInferRequest {
    /// Model name — must match your Python API field names exactly
    model: String,
    /// Input tensor as flat array
    inputs: Vec<f32>,
    /// Optional top-k parameter
    #[serde(default = "default_top_k")]
    top_k: u32,
    /// Optional temperature (LLM inference)
    #[serde(default = "default_temperature")]
    temperature: f32,
}

fn default_top_k() -> u32 { 5 }
fn default_temperature() -> f32 { 1.0 }

#[derive(Debug, Serialize)]
struct LegacyInferResponse {
    model: String,
    outputs: Vec<f32>,
    /// Preserve all legacy fields to avoid breaking consumers
    latency_ms: f64,
    version: String,
}

fn main() {
    let json = r#"{
        "model": "resnet-50",
        "inputs": [0.1, 0.2, 0.3],
        "top_k": 3
    }"#;

    let req: LegacyInferRequest = serde_json::from_str(json).unwrap();
    println!("Parsed request: model={} inputs={:?}", req.model, req.inputs);

    let resp = LegacyInferResponse {
        model: req.model.clone(),
        outputs: req.inputs.iter().map(|x| x * 2.0).collect(),
        latency_ms: 1.5,
        version: "rust-v1".to_string(),
    };
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}
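
The example above hard-codes `latency_ms: 1.5`. In a real handler that field would be measured. A minimal sketch using only the standard library follows; `timed` is a hypothetical helper name, and whether the legacy Python service measured the same span (model call only, or the whole request) is something to confirm before comparing numbers.

```rust
use std::time::Instant;

/// Runs `f` and returns its result together with the elapsed wall-clock
/// time in milliseconds, ready to place in a `latency_ms` field.
fn timed<T>(f: impl FnOnce() -> T) -> (T, f64) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed().as_secs_f64() * 1000.0)
}

fn main() {
    let inputs = vec![0.1_f32, 0.2, 0.3];
    // Time only the model computation, not parsing or serialization.
    let (outputs, latency_ms) =
        timed(|| inputs.iter().map(|x| x * 2.0).collect::<Vec<f32>>());
    println!("outputs={outputs:?} latency_ms={latency_ms:.3}");
}
```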

Step 3: Shadow mode validation

rust
use tokio::sync::mpsc;

/// Shadow mode: send each request to both old and new implementation,
/// compare outputs, log discrepancies
async fn shadow_infer(
    input: Vec<f32>,
    shadow_tx: mpsc::Sender<(Vec<f32>, Vec<f32>)>,
) -> Vec<f32> {
    // Run new (Rust) implementation
    let rust_result: Vec<f32> = input.iter().map(|x| x * 2.0 + 0.1).collect();

    // Run old (Python) implementation via HTTP (fire-and-forget)
    let input_clone = input.clone();
    let rust_clone = rust_result.clone();
    tokio::spawn(async move {
        let python_result = call_python_inference(&input_clone).await;
        if !outputs_match(&rust_clone, &python_result, 1e-4) {
            let _ = shadow_tx.send((rust_clone, python_result)).await;
        }
    });

    rust_result
}

async fn call_python_inference(input: &[f32]) -> Vec<f32> {
    // Simulate Python API call
    input.iter().map(|x| x * 2.0 + 0.1).collect()
}

fn outputs_match(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| (x - y).abs() < tol)
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<(Vec<f32>, Vec<f32>)>(100);

    let result = shadow_infer(vec![1.0, 2.0, 3.0], tx).await;
    println!("Inference result: {:?}", result);

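    // Drain discrepancies (would be logged/alerted in production).
    // recv() returns None once the spawned comparison task finishes and
    // drops its sender, so this loop also waits for the comparison to run.
    // try_recv() here could return before the spawned task had executed.
    while let Some((rust, python)) = rx.recv().await {
        eprintln!("⚠️  Output mismatch: rust={:?} python={:?}", rust, python);
    }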
}
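
Logging individual mismatches is useful for debugging, but the cutover decision depends on the overall discrepancy rate. One way to track it, assuming a pair of shared counters is enough, is sketched below. `ShadowStats` is a hypothetical type, not part of tokio or any other library, and the simulated mismatch pattern is made up for the demonstration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counters that can be shared across tasks (for example behind an Arc).
#[derive(Default)]
struct ShadowStats {
    compared: AtomicU64,
    mismatched: AtomicU64,
}

impl ShadowStats {
    /// Record the outcome of one Rust-vs-Python comparison.
    fn record(&self, matched: bool) {
        self.compared.fetch_add(1, Ordering::Relaxed);
        if !matched {
            self.mismatched.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Fraction of compared requests whose outputs differed
    /// (0.0 when nothing has been compared yet).
    fn discrepancy_rate(&self) -> f64 {
        let compared = self.compared.load(Ordering::Relaxed);
        if compared == 0 {
            return 0.0;
        }
        self.mismatched.load(Ordering::Relaxed) as f64 / compared as f64
    }
}

fn main() {
    let stats = ShadowStats::default();
    for i in 0..10_000 {
        stats.record(i % 2_500 != 0); // simulate 4 mismatches in 10,000
    }
    let rate = stats.discrepancy_rate();
    println!("discrepancy rate: {:.4}% (target: below 0.1%)", rate * 100.0);
}
```

In the shadow handler, `record` would be called with the result of `outputs_match` for every comparison, not only for mismatches, so that the denominator is correct.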

Migration checklist

  • [ ] Schema compatibility: Rust API accepts all existing request formats.
  • [ ] Shadow mode running for 48h with < 0.1% discrepancy rate.
  • [ ] Latency benchmarks show improvement at all percentiles (p50, p95, p99).
  • [ ] Error codes and messages match the Python service.
  • [ ] Health check endpoints identical (/healthz, /readyz).
  • [ ] Logging format matches (for existing dashboards/alerts).
  • [ ] Load test at 2x production traffic before cutover.
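
The latency item in the checklist can be checked mechanically once samples from both services are collected. The sketch below uses the nearest-rank percentile definition with integer arithmetic. The function names and the sample values (in microseconds) are hypothetical; real samples would come from your load-testing tool.

```rust
/// Nearest-rank percentile over latency samples sorted in ascending order.
/// Returns None for an empty sample set.
fn percentile(sorted_us: &[u64], p: usize) -> Option<u64> {
    if sorted_us.is_empty() {
        return None;
    }
    // rank = ceil(p / 100 * n), computed in integers and clamped to 1..=n
    let rank = (p * sorted_us.len() + 99) / 100;
    let rank = rank.clamp(1, sorted_us.len());
    Some(sorted_us[rank - 1])
}

/// True only if the candidate is strictly faster at p50, p90, p95 and p99.
/// Sorts both sample sets in place before comparing.
fn improves_everywhere(baseline: &mut [u64], candidate: &mut [u64]) -> bool {
    baseline.sort_unstable();
    candidate.sort_unstable();
    let (base, cand) = (&*baseline, &*candidate);
    [50, 90, 95, 99].iter().all(|&p| {
        match (percentile(base, p), percentile(cand, p)) {
            (Some(b), Some(c)) => c < b,
            _ => false, // no data on either side counts as a failure
        }
    })
}

fn main() {
    // Made-up latency samples in microseconds, for illustration only.
    let mut python_us: Vec<u64> = (1..=100).map(|i| i * 800).collect();
    let mut rust_us: Vec<u64> = (1..=100).map(|i| i * 80).collect();
    println!(
        "improves at p50/p90/p95/p99: {}",
        improves_everywhere(&mut python_us, &mut rust_us)
    );
}
```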
