Rust AI Inference Troubleshooting

Quick diagnosis flowchart

rust

Request slow or failed?
├── p99 latency spike → check queue depth, batch fill rate
├── OOM crash → check model size, buffer pool leak, tensor cloning
├── Wrong output → check input preprocessing, dtype, normalization
├── Tokio task stall → check for blocking calls in async context
└── GPU error → check CUDA context, driver version, memory fragmentation

Issue 1: High tail latency (p99 >> p50)

Symptoms: Average latency is fine, but p99 is 10x worse.

Common causes:

Queue backed up during traffic bursts.
GC-like allocation pauses from frequent Vec::new().
Lock contention on model weight access.

Diagnosis:

rust

use std::time::Instant;

struct LatencyTracker {
    samples: Vec<u64>, // microseconds
}

impl LatencyTracker {
    fn new() -> Self { Self { samples: Vec::new() } }

    fn record(&mut self, start: Instant) {
        self.samples.push(start.elapsed().as_micros() as u64);
    }

    fn percentile(&mut self, p: f64) -> u64 {
        self.samples.sort_unstable();
        let idx = (self.samples.len() as f64 * p / 100.0) as usize;
        self.samples[idx.min(self.samples.len() - 1)]
    }
}

fn main() {
    let mut tracker = LatencyTracker::new();

    for i in 0..1000u64 {
        let start = Instant::now();
        // Simulate variable latency
        let work: u64 = (0..=i % 100).sum();
        std::hint::black_box(work);
        tracker.record(start);
    }

    println!("p50:  {}µs", tracker.percentile(50.0));
    println!("p95:  {}µs", tracker.percentile(95.0));
    println!("p99:  {}µs", tracker.percentile(99.0));
    println!("p99.9:{}µs", tracker.percentile(99.9));
}

Fix: Use a buffer pool (see performance-tuning guide), set queue size limits, and switch from Mutex to Arc (read-only during inference).

---

Issue 2: Memory grows unboundedly

Symptoms: RSS climbs over hours; server eventually OOM-killed.

Common causes:

Inference results cached without eviction.
Request objects not dropped after sending response.
Arc cycles preventing deallocation.

Diagnosis:

rust

// Add periodic memory reporting
fn log_memory_usage() {
    // On Linux, read /proc/self/status
    if let Ok(status) = std::fs::read_to_string("/proc/self/status") {
        for line in status.lines() {
            if line.starts_with("VmRSS:") || line.starts_with("VmPeak:") {
                println!("[mem] {}", line.trim());
            }
        }
    }
}

fn main() {
    log_memory_usage();
}

Fix: Use WeakRef to break Arc cycles; implement LRU cache with TTL for result caches; run under valgrind --tool=massif to find the leak source.

---

Issue 3: Wrong inference outputs

Symptoms: Model returns garbage or all-zero outputs.

Common causes:

Input not normalized to the range the model expects.
Float precision issues (f64 input to f32 model).
Input tensor transposed incorrectly.

Diagnosis:

rust

fn validate_output(output: &[f32], min: f32, max: f32) -> bool {
    if output.iter().all(|&v| v == 0.0) {
        eprintln!("⚠️  Output is all zeros — check input normalization");
        return false;
    }
    if output.iter().any(|v| !v.is_finite()) {
        eprintln!("⚠️  Output contains NaN/Inf — check for division by zero");
        return false;
    }
    if output.iter().any(|&v| v < min || v > max) {
        eprintln!("⚠️  Output out of expected range [{}, {}]", min, max);
        return false;
    }
    true
}

/// Normalize input to [-1, 1]
fn normalize(input: &mut Vec<f32>) {
    let max = input.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let min = input.iter().copied().fold(f32::INFINITY, f32::min);
    let range = max - min;
    if range > 1e-6 {
        for v in input.iter_mut() {
            *v = (*v - min) / range * 2.0 - 1.0;
        }
    }
}

fn main() {
    let mut input = vec![100.0f32, 200.0, 50.0, 300.0];
    normalize(&mut input);
    println!("Normalized: {:?}", input);

    let output = input.iter().map(|x| x * 0.5).collect::<Vec<_>>();
    validate_output(&output, -1.0, 1.0);
    println!("Output valid: {:?}", &output[..2]);
}

---

Issue 4: Tokio runtime stall

Symptoms: All requests hang indefinitely; CPU at 0%.

Diagnosis:

rust

// Enable tokio-console for runtime introspection:
// In Cargo.toml:
//   tokio = { features = ["full", "tracing"] }
//   console-subscriber = "0.3"
//
// At startup:
//   console_subscriber::init();
//
// Then run: tokio-console

// Quick check: add timeout to every inference call
use tokio::time::{timeout, Duration};

async fn safe_infer(input: Vec<f32>) -> Result<Vec<f32>, &'static str> {
    timeout(Duration::from_secs(5), async move {
        tokio::task::spawn_blocking(move || {
            input.iter().map(|x| x * 2.0).collect::<Vec<_>>()
        })
        .await
        .map_err(|_| "task panic")
    })
    .await
    .map_err(|_| "inference timeout")?
}

#[tokio::main]
async fn main() {
    let result = safe_infer(vec![1.0, 2.0, 3.0]).await;
    println!("{:?}", result);
}

---

Troubleshooting checklist

[ ] Add tracing spans around every inference call.
[ ] Log queue depth every 10 seconds.
[ ] Set memory limit alerts at 70% and 90% of container limit.
[ ] Validate model output range after every deploy.
[ ] Test with wrk or hey under sustained load to surface tail latency.
[ ] Run cargo build --release — debug builds are 10–50x slower.

Rust AI Inference Troubleshooting

Rust AI Inference Troubleshooting

Quick diagnosis flowchart

Issue 1: High tail latency (p99 >> p50)

Issue 2: Memory grows unboundedly

Issue 3: Wrong inference outputs

Issue 4: Tokio runtime stall

Troubleshooting checklist

Related reading

Related Guides

Rust AI Inference Anti-Patterns

Rust AI Inference Debug Checklist

Continue in This Topic

Rust AI Inference Testing Strategy

More Rust Guides

Building LLM Applications with Rust

LLM API Gateway in Rust

LLM Rust Anti-Patterns

LLM Rust Benchmarking

LLM Rust Decision Matrix

LLM Rust Interview Q&A