Rust AI Inference Review Checklist

Use this checklist during code review of any Rust AI inference service.

✅ Correctness

rust

// CHECK: Input validation happens before any processing
fn check_correctness_example(input: &[f32]) -> Result<Vec<f32>, String> {
    // ✅ Validate first
    if input.is_empty() { return Err("empty input".into()); }
    if input.iter().any(|v| !v.is_finite()) { return Err("non-finite input".into()); }

    // ✅ Then compute
    Ok(input.iter().map(|x| x * 2.0).collect())
}

[ ] Input size, dtype, and value range validated before inference.
[ ] Output shape and value range validated after inference.
[ ] Softmax outputs sum to ~1.0 for classification tasks.
[ ] No silent default values that could mask wrong inputs.
[ ] Model version and config logged at startup.
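
The output-side checks above can be sketched the same way as the input check. This is a minimal illustration, not a prescribed implementation: `validate_softmax` and its `tolerance` parameter are hypothetical names, and the right tolerance depends on the model's numeric precision.

```rust
/// Hypothetical helper: checks that `probs` looks like a softmax output.
/// `tolerance` is an assumed parameter; choose it to match your model's precision.
fn validate_softmax(probs: &[f32], tolerance: f32) -> Result<(), String> {
    if probs.is_empty() {
        return Err("empty output".into());
    }
    // Reject NaN, infinities, and values outside [0, 1].
    if probs.iter().any(|p| !p.is_finite() || *p < 0.0 || *p > 1.0) {
        return Err("probability non-finite or outside [0, 1]".into());
    }
    // Accumulate in f64 so rounding error stays small on long vectors.
    let sum: f64 = probs.iter().map(|&p| f64::from(p)).sum();
    if (sum - 1.0).abs() > f64::from(tolerance) {
        return Err(format!("probabilities sum to {}, expected ~1.0", sum));
    }
    Ok(())
}
```

Calling `validate_softmax(&output, 1e-4)` right after inference turns a silently wrong distribution into an explicit error.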

✅ Performance

rust

use std::sync::Arc;

// CHECK: Model weights are behind Arc, not cloned
struct GoodService {
    weights: Arc<Vec<f32>>,  // ✅ Shared reference
}

struct BadService {
    weights: Vec<f32>,       // ❌ Owned: sharing it across request tasks forces a full copy
}

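// CHECK: Heavy compute is in spawn_blocking, not in the async fn body
async fn good_handler(
    weights: Arc<Vec<f32>>,
    input: Vec<f32>,
) -> Result<Vec<f32>, tokio::task::JoinError> {
    tokio::task::spawn_blocking(move || {  // ✅
        // zip stops at the shorter slice, so validate lengths before calling
        input.iter().zip(weights.iter()).map(|(x, w)| x * w).collect()
    })
    .await  // Err(JoinError) means the blocking task panicked or was cancelled
}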

[ ] No per-request clone() of the weight data (share it via Arc; cloning the Arc itself is cheap).
[ ] CPU-heavy inference is in spawn_blocking.
[ ] Request queue is bounded (not unbounded_channel).
[ ] Buffer pool used for tensor allocations in hot paths.
[ ] Release build confirmed (opt-level = 3).
[ ] Large allocations in hot paths use fallible APIs (e.g. Vec::try_reserve); ordinary allocation failure aborts the process rather than returning an error.
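
The bounded-queue item can be illustrated with the standard library alone. The sketch below uses `std::sync::mpsc::sync_channel` so that it runs without an async runtime; in a Tokio service the analogous pieces are `tokio::sync::mpsc::channel(capacity)` and `try_send`. `InferRequest`, `Enqueue`, `bounded_queue`, and `try_enqueue` are hypothetical names introduced for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical request type for illustration.
struct InferRequest {
    input: Vec<f32>,
}

/// Outcome of trying to enqueue a request.
#[derive(Debug, PartialEq)]
enum Enqueue {
    Accepted,
    /// Queue is full: the caller should shed load (for example, respond 503).
    Rejected,
    /// The worker side has shut down.
    Closed,
}

/// Creates a queue that buffers at most `capacity` pending requests.
fn bounded_queue(capacity: usize) -> (SyncSender<InferRequest>, Receiver<InferRequest>) {
    sync_channel(capacity)
}

/// Non-blocking enqueue: it never waits, so a slow model cannot grow memory without limit.
fn try_enqueue(tx: &SyncSender<InferRequest>, req: InferRequest) -> Enqueue {
    match tx.try_send(req) {
        Ok(()) => Enqueue::Accepted,
        Err(TrySendError::Full(_)) => Enqueue::Rejected,
        Err(TrySendError::Disconnected(_)) => Enqueue::Closed,
    }
}
```

Rejecting at the queue boundary keeps latency predictable: a request either gets a slot immediately or fails fast, instead of waiting behind an ever-growing backlog.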

✅ Concurrency

rust

use tokio::sync::Semaphore;
use std::sync::Arc;

// CHECK: Concurrency limit prevents thread exhaustion
struct InferenceService {
    semaphore: Arc<Semaphore>,  // ✅ Explicit concurrency control
}

// CHECK: No std::sync::Mutex held across await points
// ❌ BAD:
// async fn bad(counter: &std::sync::Mutex<u32>) {
//     let guard = counter.lock().unwrap();
//     tokio::time::sleep(std::time::Duration::from_secs(1)).await;  // guard is still held here!
//     drop(guard);
// }

[ ] Concurrency is explicitly limited (Semaphore or bounded channel).
[ ] No std::sync::Mutex held across .await points.
[ ] A timeout bounds the wait for a concurrency permit, not only the inference call itself.
[ ] No Arc cycles that could prevent model deallocation during reload.
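
The acquisition-timeout item is sketched below with only the standard library, so the idea is visible without a runtime. `Limiter` is a hypothetical type, not a library API. In a Tokio service the same effect is usually achieved by wrapping `Semaphore::acquire` in `tokio::time::timeout`.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical std-only concurrency limiter, for illustration.
struct Limiter {
    available: Mutex<usize>,
    freed: Condvar,
}

impl Limiter {
    fn new(permits: usize) -> Self {
        Limiter { available: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Waits up to `timeout` for a permit. Returns false if none became free in time.
    fn acquire_timeout(&self, timeout: Duration) -> bool {
        let guard = self.available.lock().unwrap_or_else(|e| e.into_inner());
        // Sleep while no permit is free, but never longer than `timeout`.
        let (mut guard, _) = self
            .freed
            .wait_timeout_while(guard, timeout, |free| *free == 0)
            .unwrap_or_else(|e| e.into_inner());
        if *guard == 0 {
            return false; // timed out: the caller should reject the request
        }
        *guard -= 1;
        true
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        let mut guard = self.available.lock().unwrap_or_else(|e| e.into_inner());
        *guard += 1;
        self.freed.notify_one();
    }
}
```

A production version would likely return an RAII guard that calls `release` on drop, so a permit cannot leak on an early return.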

✅ Error handling

rust

// CHECK: Errors are typed, not stringly-typed
#[derive(Debug)]
enum InferError {
    Timeout,
    InvalidInput(String),
    ModelError(String),
}

// ✅ Typed error
fn infer(input: &[f32]) -> Result<Vec<f32>, InferError> {
    if input.is_empty() { return Err(InferError::InvalidInput("empty".into())); }
    Ok(input.iter().map(|x| x * 2.0).collect())
}

// ❌ Stringly-typed error
fn infer_bad(input: &[f32]) -> Result<Vec<f32>, String> {
    if input.is_empty() { return Err("empty input".into()); }
    Ok(input.iter().map(|x| x * 2.0).collect())
}

[ ] Errors are typed enums, not String or Box<dyn Error>, in hot paths.
[ ] Every error maps to a specific HTTP status code.
[ ] Retryable vs non-retryable errors are distinguishable.
[ ] No unwrap() or expect() in request-handling paths.
[ ] Panic handler installed to convert panics to 500 responses.
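
The status-code and retryability items can live on the error type itself. The sketch below repeats the `InferError` enum from above so that it stands alone; the specific status codes and the choice of which variants are retryable are assumptions for illustration, not a fixed convention.

```rust
use std::fmt;

#[derive(Debug)]
enum InferError {
    Timeout,
    InvalidInput(String),
    ModelError(String),
}

impl InferError {
    /// Assumed mapping: adjust to your API's conventions.
    fn status_code(&self) -> u16 {
        match self {
            InferError::Timeout => 504,
            InferError::InvalidInput(_) => 400,
            InferError::ModelError(_) => 500,
        }
    }

    /// Only timeouts are treated as retryable in this sketch.
    fn is_retryable(&self) -> bool {
        matches!(self, InferError::Timeout)
    }
}

impl fmt::Display for InferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferError::Timeout => write!(f, "inference timed out"),
            InferError::InvalidInput(msg) => write!(f, "invalid input: {}", msg),
            InferError::ModelError(msg) => write!(f, "model error: {}", msg),
        }
    }
}
```

Because both `match` expressions are exhaustive, adding a new variant fails compilation until its status code and message are decided.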

✅ Observability

rust

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

static REQUESTS_TOTAL: AtomicU64 = AtomicU64::new(0);
static ERRORS_TOTAL: AtomicU64 = AtomicU64::new(0);

fn with_metrics<F: FnOnce() -> Result<Vec<f32>, String>>(f: F) -> Result<Vec<f32>, String> {
    REQUESTS_TOTAL.fetch_add(1, Ordering::Relaxed);
    let start = Instant::now();
    let result = f();
    if result.is_err() { ERRORS_TOTAL.fetch_add(1, Ordering::Relaxed); }
    let _duration = start.elapsed();
    // In production: record `_duration` in a latency histogram, e.g. "inference_duration_ms"
    result
}

[ ] Request count, error count, and latency histograms exported.
[ ] Queue depth metric exported.
[ ] Model version included in metrics labels.
[ ] Structured logging with request ID on every log line.
[ ] /metrics endpoint exists and is Prometheus-compatible.
[ ] /healthz (liveness) and /readyz (readiness) endpoints present.
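
As a rough illustration of what a Prometheus-compatible /metrics body looks like, the sketch below renders two counters in the text exposition format. `render_metrics` is a hypothetical helper and the metric and label names are assumptions. A real service would more likely use an exporter crate, and would need to escape quotes and backslashes in label values.

```rust
/// Hypothetical helper: renders two counters in the Prometheus text format.
/// Assumes `model_version` contains no characters that need escaping.
fn render_metrics(requests: u64, errors: u64, model_version: &str) -> String {
    format!(
        "# TYPE inference_requests_total counter\n\
         inference_requests_total{{model_version=\"{0}\"}} {1}\n\
         # TYPE inference_errors_total counter\n\
         inference_errors_total{{model_version=\"{0}\"}} {2}\n",
        model_version, requests, errors
    )
}
```

The handler for /metrics would pass in `REQUESTS_TOTAL.load(Ordering::Relaxed)` and `ERRORS_TOTAL.load(Ordering::Relaxed)` from the counters above.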

✅ Operational readiness

[ ] Graceful shutdown: drains in-flight requests before exiting.
[ ] Model files checksummed at startup; startup fails if mismatch.
[ ] Warm-up inference runs before accepting live traffic.
[ ] Resource limits (memory, file descriptors) documented and set.
[ ] Runbook exists for: high latency, OOM, wrong outputs, model rollback.
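
The warm-up item pairs naturally with the readiness endpoint: the service reports ready only after one inference has succeeded. The sketch below is one way to wire that up. `Readiness` and `warm_up` are hypothetical names, and the model is represented by a plain closure.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical readiness flag that a /readyz handler would read.
struct Readiness {
    ready: AtomicBool,
}

impl Readiness {
    fn new() -> Self {
        Readiness { ready: AtomicBool::new(false) }
    }

    fn is_ready(&self) -> bool {
        self.ready.load(Ordering::Acquire)
    }

    /// Runs `infer` once on `sample`. Marks the service ready only if the
    /// call succeeds and the output is non-empty and finite.
    fn warm_up<F>(&self, infer: F, sample: &[f32]) -> Result<(), String>
    where
        F: Fn(&[f32]) -> Result<Vec<f32>, String>,
    {
        let output = infer(sample)?;
        if output.is_empty() || output.iter().any(|v| !v.is_finite()) {
            return Err("warm-up produced empty or non-finite output".into());
        }
        self.ready.store(true, Ordering::Release);
        Ok(())
    }
}
```

If `warm_up` returns an error at startup, the process can exit or keep reporting not-ready, so the orchestrator never routes live traffic to a model that cannot answer.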
