Rust AI Inference Review Checklist
Code review checklist for Rust AI inference services. Covers correctness, performance, concurrency, error handling, observability, and operational readiness for production AI serving systems.
Use this checklist during code review of any Rust AI inference service.
✅ Correctness
// CHECK: Input validation happens before any processing
fn check_correctness_example(input: &[f32]) -> Result<Vec<f32>, String> {
    // ✅ Validate first
    if input.is_empty() { return Err("empty input".into()); }
    if input.iter().any(|v| !v.is_finite()) { return Err("non-finite input".into()); }
    // ✅ Then compute
    Ok(input.iter().map(|x| x * 2.0).collect())
}

- [ ] Input size, dtype, and value range validated before inference.
- [ ] Output shape and value range validated after inference.
- [ ] Softmax outputs sum to ~1.0 for classification tasks.
- [ ] No silent default values that could mask wrong inputs.
- [ ] Model version and config logged at startup.
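
The output-side checks in this list can be sketched in a few lines. The following is a minimal illustration, not a definitive implementation: `softmax`, `validate_probabilities`, and the `tol` parameter are hypothetical names chosen for this example, and the `String` error type simply mirrors the example above.

```rust
/// Numerically stable softmax: subtract the max logit before exponentiating.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical post-inference check: expected length, finite values in
/// [0, 1], and a sum within `tol` of 1.0.
pub fn validate_probabilities(
    probs: &[f32],
    expected_len: usize,
    tol: f32,
) -> Result<(), String> {
    if probs.len() != expected_len {
        return Err(format!("expected {} outputs, got {}", expected_len, probs.len()));
    }
    if probs.iter().any(|p| !p.is_finite() || *p < 0.0 || *p > 1.0) {
        return Err("probability is non-finite or outside [0, 1]".to_string());
    }
    let sum: f32 = probs.iter().sum();
    if (sum - 1.0).abs() > tol {
        return Err(format!("probabilities sum to {}, expected ~1.0", sum));
    }
    Ok(())
}
```

A reviewer would expect a call such as `validate_probabilities(&probs, num_classes, 1e-5)` right after the model returns and before the response is serialized. The tolerance is an assumption and should be tuned to the model's numeric precision.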
✅ Performance
use std::sync::Arc;

// CHECK: Model weights are behind Arc, not cloned
struct GoodService {
    weights: Arc<Vec<f32>>, // ✅ Shared reference; cloning the Arc only bumps a counter
}
struct BadService {
    weights: Vec<f32>, // ❌ Owned; handing it to each request task forces a full copy
}
// CHECK: Heavy compute is in spawn_blocking, not run directly in the async fn
async fn good_handler(
    weights: Arc<Vec<f32>>,
    input: Vec<f32>,
) -> Result<Vec<f32>, tokio::task::JoinError> {
    tokio::task::spawn_blocking(move || { // ✅
        input.iter().zip(&*weights).map(|(x, w)| x * w).collect()
    })
    .await // propagate a panicked or cancelled task instead of unwrapping
}

- [ ] No `clone()` on model weights per request (use `Arc`).
- [ ] CPU-heavy inference is in `spawn_blocking`.
- [ ] Request queue is bounded (not `unbounded_channel`).
- [ ] Buffer pool used for tensor allocations in hot paths.
- [ ] Release build confirmed (`opt-level = 3`).
- [ ] No `.unwrap()` on allocations that could fail under memory pressure.
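
To make the bounded-queue item concrete, here is a minimal std-only sketch. It assumes a synchronous worker fed by `std::sync::mpsc::sync_channel`; a Tokio service would more likely use a bounded `tokio::sync::mpsc::channel`. The `Request`, `EnqueueError`, `bounded_queue`, and `try_enqueue` names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical request type, for illustration only.
pub struct Request {
    pub id: u64,
    pub input: Vec<f32>,
}

#[derive(Debug, PartialEq)]
pub enum EnqueueError {
    /// The queue is at capacity; the caller should shed load.
    QueueFull,
    /// The receiving worker has shut down.
    WorkerGone,
}

/// Creates a queue that holds at most `capacity` pending requests.
pub fn bounded_queue(capacity: usize) -> (SyncSender<Request>, Receiver<Request>) {
    sync_channel(capacity)
}

/// Non-blocking enqueue: fails fast instead of letting the backlog grow.
pub fn try_enqueue(tx: &SyncSender<Request>, req: Request) -> Result<(), EnqueueError> {
    match tx.try_send(req) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err(EnqueueError::QueueFull),
        Err(TrySendError::Disconnected(_)) => Err(EnqueueError::WorkerGone),
    }
}
```

Rejecting at the queue boundary keeps memory use and tail latency bounded under overload. Which HTTP status a `QueueFull` maps to (429 or 503, for example) is a service-level decision this sketch leaves open.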
✅ Concurrency
use tokio::sync::Semaphore;
use std::sync::Arc;

// CHECK: Concurrency limit prevents thread exhaustion
struct InferenceService {
    semaphore: Arc<Semaphore>, // ✅ Explicit concurrency control
}

// CHECK: No std::sync::Mutex held across await points
// ❌ BAD:
// async fn bad(state: &std::sync::Mutex<u32>) {
//     let guard = state.lock().unwrap();
//     tokio::time::sleep(std::time::Duration::from_secs(1)).await; // holding mutex!
//     drop(guard);
// }

- [ ] Concurrency is explicitly limited (`Semaphore` or bounded channel).
- [ ] No `std::sync::Mutex` held across `.await` points.
- [ ] Timeout set for concurrency acquisition (not just for inference).
- [ ] No `Arc` cycles that could prevent model deallocation during reload.
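
The "timeout on acquisition" item is easy to overlook, so a small sketch may help. The code below is a std-only counting limiter built on `Mutex` and `Condvar`; it is an illustration under assumptions, not the recommended implementation. In a Tokio service the usual shape would be a `tokio::sync::Semaphore` acquisition wrapped in `tokio::time::timeout`. `Limiter`, `Permit`, and `acquire_timeout` are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical counting limiter with a bounded wait for permits.
pub struct Limiter {
    available: Mutex<usize>,
    freed: Condvar,
}

/// Returned on successful acquisition; gives the permit back when dropped.
pub struct Permit<'a> {
    limiter: &'a Limiter,
}

impl Limiter {
    pub fn new(permits: usize) -> Self {
        Limiter { available: Mutex::new(permits), freed: Condvar::new() }
    }

    /// Waits up to `timeout` for a permit. Returns `None` on timeout
    /// (or if the lock is poisoned), so callers can reject the request.
    pub fn acquire_timeout(&self, timeout: Duration) -> Option<Permit<'_>> {
        let guard = self.available.lock().ok()?;
        let (mut guard, _wait_result) = self
            .freed
            .wait_timeout_while(guard, timeout, |n| *n == 0)
            .ok()?;
        if *guard == 0 {
            return None; // timed out while every permit was in use
        }
        *guard -= 1;
        Some(Permit { limiter: self })
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        if let Ok(mut n) = self.limiter.available.lock() {
            *n += 1;
            self.limiter.freed.notify_one();
        }
    }
}
```

The point for review is that waiting for a slot has its own deadline, separate from the inference deadline, so a saturated service fails fast rather than accumulating waiters.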
✅ Error handling
// CHECK: Errors are typed, not stringly-typed
#[derive(Debug)]
enum InferError {
    Timeout,
    InvalidInput(String),
    ModelError(String),
}

// ✅ Typed error
fn infer(input: &[f32]) -> Result<Vec<f32>, InferError> {
    if input.is_empty() { return Err(InferError::InvalidInput("empty".into())); }
    Ok(input.iter().map(|x| x * 2.0).collect())
}

// ❌ Stringly-typed error
fn infer_bad(input: &[f32]) -> Result<Vec<f32>, String> {
    if input.is_empty() { return Err("empty input".into()); }
    Ok(input.iter().map(|x| x * 2.0).collect())
}

- [ ] Errors are typed enums, not `String` or `Box<dyn Error>` in hot paths.
- [ ] Every error maps to a specific HTTP status code.
- [ ] Retryable vs non-retryable errors are distinguishable.
- [ ] No `unwrap()` or `expect()` in request-handling paths.
- [ ] Panic handler installed to convert panics to 500 responses.
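
One possible way to satisfy the status-code, retryability, and panic items together is sketched below. The specific status codes (400, 504, 500), the choice to treat only timeouts as retryable, and the `status_code`, `is_retryable`, and `guarded` names are all assumptions made for illustration; a real service would pick its own mapping. The sketch also relies on the default unwinding panic strategy, since `catch_unwind` cannot intercept panics when the binary is built with `panic = "abort"`.

```rust
use std::fmt;
use std::panic::{catch_unwind, UnwindSafe};

#[derive(Debug, PartialEq)]
pub enum InferError {
    Timeout,
    InvalidInput(String),
    ModelError(String),
}

impl fmt::Display for InferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InferError::Timeout => write!(f, "inference timed out"),
            InferError::InvalidInput(msg) => write!(f, "invalid input: {}", msg),
            InferError::ModelError(msg) => write!(f, "model error: {}", msg),
        }
    }
}

impl std::error::Error for InferError {}

impl InferError {
    /// Assumed mapping from error variant to HTTP status code.
    pub fn status_code(&self) -> u16 {
        match self {
            InferError::InvalidInput(_) => 400,
            InferError::Timeout => 504,
            InferError::ModelError(_) => 500,
        }
    }

    /// Assumed policy: only timeouts are worth retrying.
    pub fn is_retryable(&self) -> bool {
        matches!(self, InferError::Timeout)
    }
}

/// Runs `f` and converts a panic into a typed error that maps to a 500.
pub fn guarded<F>(f: F) -> Result<Vec<f32>, InferError>
where
    F: FnOnce() -> Result<Vec<f32>, InferError> + UnwindSafe,
{
    match catch_unwind(f) {
        Ok(result) => result,
        Err(_) => Err(InferError::ModelError("inference panicked".to_string())),
    }
}
```

Because the mapping lives in exhaustive `match` expressions on the enum, adding a new variant forces a compile error until its status code is chosen. The retryability policy uses `matches!`, which defaults new variants to non-retryable, so that decision still needs an explicit look in review.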
✅ Observability
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

static REQUESTS_TOTAL: AtomicU64 = AtomicU64::new(0);
static ERRORS_TOTAL: AtomicU64 = AtomicU64::new(0);

fn with_metrics<F: FnOnce() -> Result<Vec<f32>, String>>(f: F) -> Result<Vec<f32>, String> {
    REQUESTS_TOTAL.fetch_add(1, Ordering::Relaxed);
    let start = Instant::now();
    let result = f();
    if result.is_err() { ERRORS_TOTAL.fetch_add(1, Ordering::Relaxed); }
    let _duration = start.elapsed();
    // In production: record _duration.as_millis() in an "inference_duration_ms" histogram
    result
}

- [ ] Request count, error count, and latency histograms exported.
- [ ] Queue depth metric exported.
- [ ] Model version included in metrics labels.
- [ ] Structured logging with request ID on every log line.
- [ ] `/metrics` endpoint exists and is Prometheus-compatible.
- [ ] `/healthz` (liveness) and `/readyz` (readiness) endpoints present.
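
As a rough illustration of what a Prometheus-compatible `/metrics` body could contain, the sketch below renders counters and a gauge in the Prometheus text exposition format, with the model version as a label. The `Metrics` struct, the metric names, and `render` are hypothetical. The code assumes the version string contains no characters that need escaping in a label value, and it omits the latency histogram for brevity. Production services typically use a metrics library rather than hand-formatting.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical in-process metrics registry for illustration.
#[derive(Default)]
pub struct Metrics {
    pub requests_total: AtomicU64,
    pub errors_total: AtomicU64,
    pub queue_depth: AtomicU64,
}

impl Metrics {
    /// Renders all series in Prometheus text format, labeled with the
    /// model version. Assumes `model_version` needs no label escaping.
    pub fn render(&self, model_version: &str) -> String {
        let series = [
            ("inference_requests_total", "counter", &self.requests_total),
            ("inference_errors_total", "counter", &self.errors_total),
            ("inference_queue_depth", "gauge", &self.queue_depth),
        ];
        let mut out = String::new();
        for (name, kind, value) in series.iter() {
            out.push_str(&format!("# TYPE {} {}\n", name, kind));
            out.push_str(&format!(
                "{}{{model_version=\"{}\"}} {}\n",
                name,
                model_version,
                value.load(Ordering::Relaxed)
            ));
        }
        out
    }
}
```

An HTTP handler for `/metrics` would return this string with a `text/plain` content type. Labeling every series with the model version lets dashboards separate behavior before and after a model rollout.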
✅ Operational readiness
- [ ] Graceful shutdown: drains in-flight requests before exiting.
- [ ] Model files checksummed at startup; startup fails on mismatch.
- [ ] Warm-up inference runs before accepting live traffic.
- [ ] Resource limits (memory, file descriptors) documented and set.
- [ ] Runbook exists for: high latency, OOM, wrong outputs, model rollback.
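
The warm-up item pairs naturally with the `/readyz` endpoint: readiness should flip to true only after a warm-up inference succeeds. The following is a minimal sketch under that assumption. `Readiness` and `warm_up` are hypothetical names, and the inference function is passed in as a closure so the example stays independent of any particular model runtime.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical readiness flag that a `/readyz` handler could consult.
pub struct Readiness {
    ready: AtomicBool,
}

impl Readiness {
    pub const fn new() -> Self {
        Readiness { ready: AtomicBool::new(false) }
    }

    pub fn is_ready(&self) -> bool {
        self.ready.load(Ordering::Acquire)
    }
}

/// Runs one inference on a dummy input and marks the service ready only
/// if the call succeeds and the output is non-empty and finite.
pub fn warm_up<F>(infer: F, dummy_input: &[f32], readiness: &Readiness) -> Result<(), String>
where
    F: Fn(&[f32]) -> Result<Vec<f32>, String>,
{
    let output = infer(dummy_input)?;
    if output.is_empty() || output.iter().any(|v| !v.is_finite()) {
        return Err("warm-up produced empty or non-finite output".to_string());
    }
    readiness.ready.store(true, Ordering::Release);
    Ok(())
}
```

With this shape, a failed warm-up leaves the flag false, so an orchestrator that polls readiness never routes traffic to an instance whose model did not load or produced invalid output.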