Rust AI Inference Debug Checklist
Step-by-step debug checklist for AI inference issues in Rust. Use this checklist when your inference server is slow, crashing, or producing incorrect results.
Use this checklist systematically when diagnosing any issue with a Rust-based AI inference service.
Step 1: Confirm the build profile
# Always profile and debug with release mode
cargo build --release
# Debug builds can be 10–50x slower — never benchmark debug builds
cargo run --release -- --port 8080
- [ ] Running a `--release` build, not debug.
- [ ] Confirm `opt-level = 3` in `Cargo.toml` under `[profile.release]`.
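One way to make the first item self-checking is to have the binary report its own build profile at startup. The sketch below uses the standard `cfg!(debug_assertions)` macro as a proxy for the profile. That is an assumption: a release profile can enable debug assertions explicitly, so treat the result as a hint. `build_profile` is a hypothetical helper name.

```rust
/// Hypothetical helper: guesses the build profile from `debug_assertions`.
/// Debug assertions are on by default in dev builds and off in release builds,
/// but a custom profile can override that, so this is only a proxy.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

fn main() {
    let profile = build_profile();
    if profile == "debug" {
        eprintln!("warning: debug build detected; latency numbers will not be representative");
    }
    println!("build profile: {profile}");
}
```

Logging this once at startup makes it easy to spot a debug binary in production logs.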
Step 2: Check request lifecycle
use std::time::Instant;
use tracing::{info, instrument};
#[instrument(skip(input))]
async fn traced_infer(request_id: u64, input: Vec<f32>) -> Vec<f32> {
    let t0 = Instant::now();
    info!(request_id, input_len = input.len(), "inference started");
    // Placeholder compute; it runs on the blocking pool so it cannot stall the async runtime.
    let result = tokio::task::spawn_blocking(move || {
        input.iter().map(|x| x * 2.0).collect::<Vec<_>>()
    })
    .await
    .expect("inference task panicked");
    info!(
        request_id,
        duration_ms = t0.elapsed().as_millis() as u64,
        output_len = result.len(),
        "inference complete"
    );
    result
}
#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    let r = traced_infer(1, vec![1.0, 2.0, 3.0]).await;
    println!("{:?}", r);
}
- [ ] Every request has a unique ID that appears in all log lines.
- [ ] Log timestamps at: enqueue, dequeue, inference start, inference end, response sent.
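The two items above can be sketched with the standard library alone. The example below is a minimal illustration, assuming a process-wide atomic counter is an acceptable source of request IDs (they are unique only within one process). `Timeline`, `next_request_id`, and the field names are hypothetical and do not come from any particular framework.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

// Process-wide counter; IDs are unique within one server process only.
static NEXT_REQUEST_ID: AtomicU64 = AtomicU64::new(1);

fn next_request_id() -> u64 {
    NEXT_REQUEST_ID.fetch_add(1, Ordering::Relaxed)
}

/// Hypothetical per-request record of the five lifecycle points.
#[derive(Debug)]
struct Timeline {
    request_id: u64,
    enqueued: Instant,
    dequeued: Option<Instant>,
    infer_start: Option<Instant>,
    infer_end: Option<Instant>,
    responded: Option<Instant>,
}

impl Timeline {
    fn new() -> Self {
        Timeline {
            request_id: next_request_id(),
            enqueued: Instant::now(),
            dequeued: None,
            infer_start: None,
            infer_end: None,
            responded: None,
        }
    }

    /// Time spent waiting in the queue, if the request has been dequeued.
    fn queue_wait(&self) -> Option<Duration> {
        self.dequeued.map(|t| t.duration_since(self.enqueued))
    }

    /// Time spent in the model call, if both ends were recorded.
    fn infer_time(&self) -> Option<Duration> {
        Some(self.infer_end?.duration_since(self.infer_start?))
    }

    /// End-to-end time from enqueue to response, if the response was sent.
    fn total(&self) -> Option<Duration> {
        self.responded.map(|t| t.duration_since(self.enqueued))
    }
}

fn main() {
    let mut t = Timeline::new();
    t.dequeued = Some(Instant::now());
    t.infer_start = Some(Instant::now());
    t.infer_end = Some(Instant::now());
    t.responded = Some(Instant::now());
    println!(
        "request_id={} queue_wait={:?} infer_time={:?} total={:?}",
        t.request_id,
        t.queue_wait(),
        t.infer_time(),
        t.total()
    );
}
```

Comparing `queue_wait` against `infer_time` shows quickly whether a slow request was waiting for a worker or was slow inside the model call.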
Step 3: Validate input before inference
fn check_input(input: &[f32]) -> Result<(), String> {
    if input.is_empty() { return Err("empty input".into()); }
    if input.len() > 4096 { return Err(format!("too large: {}", input.len())); }
    if input.iter().any(|v| !v.is_finite()) { return Err("contains NaN or Inf".into()); }
    Ok(())
}
fn main() {
    let cases = [
        vec![1.0f32, 2.0, 3.0],
        vec![f32::NAN],
        vec![],
    ];
    for c in &cases {
        println!("{:?} → {:?}", &c[..c.len().min(2)], check_input(c));
    }
}
- [ ] Input is not empty.
- [ ] Input length is within expected bounds.
- [ ] No NaN or Inf values in input.
- [ ] Input dtype matches model expectation (f32 vs f16 vs i64).
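The dtype item is not covered by `check_input`, which only accepts `f32` slices. A rough sketch of one way to check it is shown below, assuming the input arrives as an enum whose variant carries the element type. `DType`, `TensorInput`, and `check_dtype` are hypothetical names for illustration, not part of any inference library.

```rust
/// Element types a model might expect (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DType {
    F32,
    F16,
    I64,
}

/// Hypothetical input wrapper: the variant records the element type.
enum TensorInput {
    F32(Vec<f32>),
    I64(Vec<i64>),
}

impl TensorInput {
    fn dtype(&self) -> DType {
        match self {
            TensorInput::F32(_) => DType::F32,
            TensorInput::I64(_) => DType::I64,
        }
    }

    fn len(&self) -> usize {
        match self {
            TensorInput::F32(v) => v.len(),
            TensorInput::I64(v) => v.len(),
        }
    }
}

/// Rejects the input before inference if its dtype differs from the expected one.
fn check_dtype(input: &TensorInput, expected: DType) -> Result<(), String> {
    let actual = input.dtype();
    if actual == expected {
        Ok(())
    } else {
        Err(format!("dtype mismatch: expected {:?}, got {:?}", expected, actual))
    }
}

fn main() {
    let embeddings = TensorInput::F32(vec![0.1, 0.2]);
    let tokens = TensorInput::I64(vec![101, 2023]);
    println!("len={} {:?}", embeddings.len(), check_dtype(&embeddings, DType::F32));
    println!("len={} {:?}", tokens.len(), check_dtype(&tokens, DType::F16));
}
```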
Step 4: Check for blocking calls in async context
// Find any of these patterns in your async code:
// std::thread::sleep(...) → replace with tokio::time::sleep(...).await
// std::sync::Mutex guard held across an .await → use tokio::sync::Mutex or RwLock, or drop the guard before awaiting
// Heavy CPU loops → wrap in tokio::task::spawn_blocking
// Quick audit: search your codebase for these
// grep -r "std::thread::sleep" src/
// grep -r "\.lock()" src/ | grep -v "spawn_blocking"
- [ ] No `std::thread::sleep` inside `async fn`.
- [ ] No `std::sync::Mutex::lock()` inside `async fn` where the guard can be long-held or held across an `.await`.
- [ ] All heavy compute is in `spawn_blocking`.
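A `std::sync::Mutex` is usually harmless in async code when the guard is released quickly. The sketch below shows that pattern with the standard library only: copy the data out inside a small scope, then do the heavy work after the guard is dropped. `sum_of_squares` is a hypothetical stand-in for real compute, and cloning the data is an assumption that fits small buffers better than large ones.

```rust
use std::sync::Mutex;

/// Hypothetical compute step that reads shared state.
/// The lock is held only long enough to clone the data.
fn sum_of_squares(shared: &Mutex<Vec<f32>>) -> f32 {
    let snapshot = {
        let guard = shared.lock().expect("mutex poisoned");
        guard.clone()
    }; // guard is dropped here, before the expensive part

    // Heavy work happens without holding the lock.
    snapshot.iter().map(|x| x * x).sum()
}

fn main() {
    let shared = Mutex::new(vec![1.0f32, 2.0, 3.0]);
    println!("sum of squares = {}", sum_of_squares(&shared));
    // The lock is free again once the function returns.
    println!("lock available: {}", shared.try_lock().is_ok());
}
```

In an async handler the same scoping rule applies: end the block that owns the guard before the next `.await`.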
Step 5: Inspect queue and concurrency metrics
use std::sync::atomic::{AtomicUsize, Ordering};
static QUEUE_DEPTH: AtomicUsize = AtomicUsize::new(0);
static IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);
fn report_metrics() {
    println!(
        "queue_depth={} in_flight={}",
        QUEUE_DEPTH.load(Ordering::Relaxed),
        IN_FLIGHT.load(Ordering::Relaxed),
    );
}
fn main() { report_metrics(); }
- [ ] Queue depth stays under 100 under normal load.
- [ ] In-flight requests don't exceed the `max_concurrent` setting.
- [ ] No requests stuck in queue for > 5 seconds.
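Counters like these are only useful if they are decremented on every exit path, including early returns and panics that unwind. One possible approach, sketched below, is an RAII guard that increments on creation and decrements in `Drop`. `CountGuard` and `MAX_CONCURRENT` are hypothetical names, and the limit of 4 is an arbitrary value for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Increments the counter when created and decrements it when dropped.
struct CountGuard<'a>(&'a AtomicUsize);

impl<'a> CountGuard<'a> {
    fn enter(counter: &'a AtomicUsize) -> Self {
        counter.fetch_add(1, Ordering::Relaxed);
        CountGuard(counter)
    }
}

impl Drop for CountGuard<'_> {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}

static IN_FLIGHT: AtomicUsize = AtomicUsize::new(0);
const MAX_CONCURRENT: usize = 4; // hypothetical limit for this sketch

/// Returns the in-flight count observed while handling, or an error if over the limit.
fn handle_request() -> Result<usize, String> {
    let _guard = CountGuard::enter(&IN_FLIGHT);
    let now = IN_FLIGHT.load(Ordering::Relaxed);
    if now > MAX_CONCURRENT {
        // Early return: the guard still decrements the counter.
        return Err(format!("in_flight={now} exceeds max_concurrent={MAX_CONCURRENT}"));
    }
    Ok(now)
}

fn main() {
    println!("during: {:?}", handle_request());
    println!("after: in_flight={}", IN_FLIGHT.load(Ordering::Relaxed));
}
```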
Step 6: Verify output sanity
fn output_ok(output: &[f32]) -> bool {
    !output.is_empty()
        && output.iter().all(|v| v.is_finite())
        && output.iter().any(|&v| v != 0.0) // not all zeros
}
fn main() {
    let good = vec![0.1f32, 0.9, 0.5];
    let bad = vec![0.0f32; 8];
    println!("good: {}", output_ok(&good));
    println!("bad (all zeros): {}", output_ok(&bad));
}
- [ ] Output is not all-zeros.
- [ ] Output contains no NaN/Inf values.
- [ ] Output shape matches expected shape.
- [ ] Softmax outputs sum to ~1.0 (classification models).
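The last two items are not covered by `output_ok`. A minimal sketch of both checks follows, assuming the output is a flat `f32` buffer and the expected shape is known as a list of dimensions. `shape_matches`, `sums_to_one`, and the tolerance value are illustrative choices, not fixed recommendations.

```rust
/// True if the flat buffer holds exactly as many elements as the shape implies.
fn shape_matches(output: &[f32], expected_shape: &[usize]) -> bool {
    output.len() == expected_shape.iter().product::<usize>()
}

/// True if the values sum to 1.0 within `tol` (for softmax outputs).
fn sums_to_one(probs: &[f32], tol: f32) -> bool {
    let sum: f32 = probs.iter().sum();
    (sum - 1.0).abs() <= tol
}

fn main() {
    let probs = vec![0.1f32, 0.7, 0.2];
    let logits = vec![2.5f32, -1.0, 0.3];
    println!("probs shape ok: {}", shape_matches(&probs, &[1, 3]));
    println!("probs sum ok: {}", sums_to_one(&probs, 1e-3));
    println!("logits sum ok: {}", sums_to_one(&logits, 1e-3));
}
```

A failed sum check on a classification model often means the server is returning raw logits instead of probabilities.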
Step 7: Memory health check
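# Watch RSS over time
# (-f matches the full command line, since pgrep otherwise compares only the first 15 characters of the process name; -o picks the oldest match, so start the server before running watch)
watch -n 5 'grep VmRSS /proc/$(pgrep -of my-inference-server)/status'
# Check for leaks with valgrind (much slower, but reports unfreed allocations with stack traces)
valgrind --leak-check=full --track-origins=yes ./target/debug/my-server
- [ ] RSS is stable after warm-up period.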
- [ ] No allocation spike on batch boundaries.
- [ ] Buffer pool is returning buffers after each request.
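The same RSS figure can be read from inside the process and logged next to the other metrics. The sketch below assumes Linux, where `/proc/self/status` contains a `VmRSS:` line reported in kB. `parse_vm_rss_kb` is a hypothetical helper, and on other platforms the read simply fails and is reported.

```rust
use std::fs;

/// Extracts the VmRSS value in kB from the text of a /proc/<pid>/status file.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|value| value.parse().ok())
}

fn main() {
    // Linux-only path; other platforms take the error branch.
    match fs::read_to_string("/proc/self/status") {
        Ok(status) => match parse_vm_rss_kb(&status) {
            Some(kb) => println!("rss_kb={kb}"),
            None => println!("VmRSS line not found"),
        },
        Err(e) => eprintln!("could not read /proc/self/status: {e}"),
    }
}
```

Sampling this at a fixed interval after warm-up gives a series that can be checked for steady growth.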