Rust By Example

Rust AI Inference Review Checklist

A code review checklist for Rust AI inference services. It covers correctness, performance, concurrency, error handling, observability, and operational readiness for production AI serving systems.

Use this checklist during code review of any Rust AI inference service.

✅ Correctness

rust
// CHECK: Input validation happens before any processing
fn check_correctness_example(input: &[f32]) -> Result<Vec<f32>, String> {
    // ✅ Validate first
    if input.is_empty() { return Err("empty input".into()); }
    if input.iter().any(|v| !v.is_finite()) { return Err("non-finite input".into()); }

    // ✅ Then compute
    Ok(input.iter().map(|x| x * 2.0).collect())
}
  • [ ] Input size, dtype, and value range validated before inference.
  • [ ] Output shape and value range validated after inference.
  • [ ] Softmax outputs sum to ~1.0 for classification tasks.
  • [ ] No silent default values that could mask wrong inputs.
  • [ ] Model version and config logged at startup.
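
The output-side items above can be sketched in the same style as the input check. This is a minimal illustration, not a definitive implementation: the function name `validate_probabilities` and the parameters `expected_len` and `tolerance` are hypothetical, and it assumes the model emits one probability per class.

```rust
/// Hypothetical post-inference check. `expected_len` and `tolerance` are
/// illustrative parameters, not part of any library API.
fn validate_probabilities(
    output: &[f32],
    expected_len: usize,
    tolerance: f32,
) -> Result<(), String> {
    // Shape check: a classifier should emit one score per class.
    if output.len() != expected_len {
        return Err(format!("expected {} outputs, got {}", expected_len, output.len()));
    }
    // Range check: every probability must be finite and within [0, 1].
    if output.iter().any(|p| !p.is_finite() || *p < 0.0 || *p > 1.0) {
        return Err("probability outside [0, 1]".into());
    }
    // Softmax check: probabilities should sum to roughly 1.0.
    let sum: f32 = output.iter().sum();
    if (sum - 1.0).abs() > tolerance {
        return Err(format!("probabilities sum to {}, expected ~1.0", sum));
    }
    Ok(())
}
```

A tolerance around 1e-3 is a plausible starting point for f32 outputs, but the right value depends on the model and the number of classes.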

✅ Performance

rust
use std::sync::Arc;

// CHECK: Model weights are behind Arc, not cloned
struct GoodService {
    weights: Arc<Vec<f32>>,  // ✅ Shared reference
}

struct BadService {
    weights: Vec<f32>,       // ❌ Owned copy: sharing it across handlers forces a clone per request
}

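// CHECK: Heavy compute is in spawn_blocking, not run directly in the async fn
async fn good_handler(
    weights: Arc<Vec<f32>>,
    input: Vec<f32>,
) -> Result<Vec<f32>, tokio::task::JoinError> {
    // ✅ Runs on the blocking pool; a panic in the closure surfaces as Err, not unwrap()
    tokio::task::spawn_blocking(move || {
        input.iter().zip(weights.iter()).map(|(x, w)| x * w).collect::<Vec<f32>>()
    })
    .await
}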
  • [ ] No clone() on model weights per request (use Arc).
  • [ ] CPU-heavy inference is in spawn_blocking.
  • [ ] Request queue is bounded (not unbounded_channel).
  • [ ] Buffer pool used for tensor allocations in hot paths.
  • [ ] Release build confirmed (opt-level = 3).
  • [ ] No .unwrap() on fallible allocations (e.g. Vec::try_reserve) that could fail under memory pressure.
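
To illustrate the bounded-queue item, here is a small sketch using only the standard library. The `Request` type and the helper names `bounded_queue` and `try_enqueue` are hypothetical; an async service would more likely use a bounded async channel, but the load-shedding idea is the same: a full queue is reported to the caller instead of growing without limit.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical request type for illustration.
struct Request {
    input: Vec<f32>,
}

/// Creates a queue that buffers at most `capacity` pending requests.
fn bounded_queue(capacity: usize) -> (SyncSender<Request>, Receiver<Request>) {
    sync_channel(capacity)
}

/// Tries to enqueue without blocking. A full queue is reported to the caller
/// so it can shed load (for example with an HTTP 429 or 503).
fn try_enqueue(tx: &SyncSender<Request>, req: Request) -> Result<(), &'static str> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err("queue full"),
        Err(TrySendError::Disconnected(_)) => Err("worker stopped"),
    }
}
```

Rejecting early keeps tail latency predictable: requests that would have waited past their deadline never enter the queue.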

✅ Concurrency

rust
use tokio::sync::Semaphore;
use std::sync::Arc;

// CHECK: Concurrency limit prevents thread exhaustion
struct InferenceService {
    semaphore: Arc<Semaphore>,  // ✅ Explicit concurrency control
}

// CHECK: No std::sync::Mutex held across await points
// ❌ BAD:
// async fn bad(counter: &std::sync::Mutex<u64>) {
//     let guard = counter.lock().unwrap();
//     tokio::time::sleep(std::time::Duration::from_secs(1)).await;  // holding mutex!
//     drop(guard);
// }
  • [ ] Concurrency is explicitly limited (Semaphore or bounded channel).
  • [ ] No std::sync::Mutex held across .await points.
  • [ ] Timeout set for concurrency acquisition (not just for inference).
  • [ ] No Arc cycles that could prevent model deallocation during reload.
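
The acquisition-timeout item can be sketched with a blocking limiter built on the standard library. This is an illustration under stated assumptions: `Limiter`, `acquire_timeout`, and `release` are hypothetical names, and the code blocks a thread rather than awaiting. In a tokio service the same idea would typically wrap the semaphore acquisition in a timeout instead.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical blocking limiter, used here only to illustrate the pattern.
struct Limiter {
    available: Mutex<usize>,
    freed: Condvar,
}

impl Limiter {
    fn new(permits: usize) -> Self {
        Self { available: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Waits up to `timeout` for a permit. Returns false if none became free,
    /// so the caller can reject the request instead of queueing forever.
    fn acquire_timeout(&self, timeout: Duration) -> bool {
        // Recover the guard even if another thread panicked while holding it.
        let guard = self.available.lock().unwrap_or_else(|e| e.into_inner());
        let (mut guard, wait) = self
            .freed
            .wait_timeout_while(guard, timeout, |n| *n == 0)
            .unwrap_or_else(|e| e.into_inner());
        if wait.timed_out() {
            return false;
        }
        *guard -= 1;
        true
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        let mut guard = self.available.lock().unwrap_or_else(|e| e.into_inner());
        *guard += 1;
        self.freed.notify_one();
    }
}
```

A production version would return an RAII guard that calls `release` on drop, so a permit cannot leak on an early return.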

✅ Error handling

rust
// CHECK: Errors are typed, not stringly-typed
#[derive(Debug)]
enum InferError {
    Timeout,
    InvalidInput(String),
    ModelError(String),
}

// ✅ Typed error
fn infer(input: &[f32]) -> Result<Vec<f32>, InferError> {
    if input.is_empty() { return Err(InferError::InvalidInput("empty".into())); }
    Ok(input.iter().map(|x| x * 2.0).collect())
}

// ❌ Stringly-typed error
fn infer_bad(input: &[f32]) -> Result<Vec<f32>, String> {
    if input.is_empty() { return Err("empty input".into()); }
    Ok(input.iter().map(|x| x * 2.0).collect())
}
  • [ ] Errors are typed enums, not String or Box<dyn Error>, in hot paths.
  • [ ] Every error maps to a specific HTTP status code.
  • [ ] Retryable vs non-retryable errors are distinguishable.
  • [ ] No unwrap() or expect() in request-handling paths.
  • [ ] Panic handler installed to convert panics to 500 responses.
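
One way to satisfy the status-code and retryability items is to hang both decisions off the error enum itself. The sketch below repeats the `InferError` enum from the example above so it stands alone; the methods `status_code` and `is_retryable` and the specific codes chosen are assumptions for illustration, not a prescribed mapping.

```rust
#[derive(Debug)]
enum InferError {
    Timeout,
    InvalidInput(String),
    ModelError(String),
}

impl InferError {
    /// HTTP status for each variant. The specific codes are an assumed
    /// convention; adjust them to your API contract.
    fn status_code(&self) -> u16 {
        match self {
            InferError::Timeout => 504,
            InferError::InvalidInput(_) => 400,
            InferError::ModelError(_) => 500,
        }
    }

    /// Whether a client may reasonably retry the same request unchanged.
    fn is_retryable(&self) -> bool {
        matches!(self, InferError::Timeout)
    }
}
```

Because the `match` is exhaustive, adding a new variant fails to compile until someone decides which status it maps to.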

✅ Observability

rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

static REQUESTS_TOTAL: AtomicU64 = AtomicU64::new(0);
static ERRORS_TOTAL: AtomicU64 = AtomicU64::new(0);

fn with_metrics<F: FnOnce() -> Result<Vec<f32>, String>>(f: F) -> Result<Vec<f32>, String> {
    REQUESTS_TOTAL.fetch_add(1, Ordering::Relaxed);
    let start = Instant::now();
    let result = f();
    if result.is_err() { ERRORS_TOTAL.fetch_add(1, Ordering::Relaxed); }
    let _duration = start.elapsed();
    // In production: record `_duration` in a latency histogram instead of discarding it
    result
}
  • [ ] Request count, error count, and latency histograms exported.
  • [ ] Queue depth metric exported.
  • [ ] Model version included in metrics labels.
  • [ ] Structured logging with request ID on every log line.
  • [ ] /metrics endpoint exists and is Prometheus-compatible.
  • [ ] /healthz (liveness) and /readyz (readiness) endpoints present.
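
For the /metrics item, the body of the response is plain text in the Prometheus exposition format. The sketch below renders two counters by hand to show the shape of that text. The function name `render_metrics`, the metric names, and the `model_version` label are illustrative, and a real service would more likely use a metrics library than format strings directly.

```rust
/// Renders two counters in the Prometheus text exposition format.
/// Assumes `model_version` contains no quotes, backslashes, or newlines,
/// which would need escaping in a label value.
fn render_metrics(model_version: &str, requests: u64, errors: u64) -> String {
    format!(
        "# TYPE inference_requests_total counter\n\
         inference_requests_total{{model_version=\"{v}\"}} {requests}\n\
         # TYPE inference_errors_total counter\n\
         inference_errors_total{{model_version=\"{v}\"}} {errors}\n",
        v = model_version,
        requests = requests,
        errors = errors,
    )
}
```

The values passed in would come from counters such as `REQUESTS_TOTAL.load(Ordering::Relaxed)` in the example above.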

✅ Operational readiness

  • [ ] Graceful shutdown: drains in-flight requests before exiting.
  • [ ] Model files checksummed at startup; startup fails on a mismatch.
  • [ ] Warm-up inference runs before accepting live traffic.
  • [ ] Resource limits (memory, file descriptors) documented and set.
  • [ ] Runbook exists for: high latency, OOM, wrong outputs, model rollback.
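
The warm-up item pairs naturally with the readiness endpoint: the service reports ready only after one synthetic inference has succeeded. The following is a minimal sketch under assumptions. `Readiness`, `warm_up`, and the stand-in `infer` function are hypothetical, and a real /readyz handler would call `is_ready` to choose its response.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical readiness gate that a /readyz handler would consult.
struct Readiness {
    ready: AtomicBool,
}

/// Stand-in for the real model call.
fn infer(input: &[f32]) -> Result<Vec<f32>, String> {
    if input.is_empty() {
        return Err("empty input".into());
    }
    Ok(input.iter().map(|x| x * 2.0).collect())
}

impl Readiness {
    fn new() -> Self {
        Self { ready: AtomicBool::new(false) }
    }

    /// Runs one synthetic inference and marks the service ready only if it
    /// succeeds and produces finite output.
    fn warm_up(&self, sample: &[f32]) -> Result<(), String> {
        let output = infer(sample)?;
        if output.iter().any(|v| !v.is_finite()) {
            return Err("warm-up produced non-finite output".into());
        }
        self.ready.store(true, Ordering::Release);
        Ok(())
    }

    fn is_ready(&self) -> bool {
        self.ready.load(Ordering::Acquire)
    }
}
```

Failing the warm-up leaves the flag unset, so the orchestrator keeps traffic away until the problem is fixed or the instance is replaced.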
