Rust By Example

Rust AI Inference Troubleshooting

Diagnose and fix common AI inference issues in Rust: high latency, OOM crashes, model output corruption, GPU errors, and async runtime stalls. Step-by-step troubleshooting playbook.

Topic: AI Inference

Quick diagnosis flowchart

text
Request slow or failed?
├── p99 latency spike → check queue depth, batch fill rate
├── OOM crash → check model size, buffer pool leak, tensor cloning
├── Wrong output → check input preprocessing, dtype, normalization
├── Tokio task stall → check for blocking calls in async context
└── GPU error → check CUDA context, driver version, memory fragmentation

Issue 1: High tail latency (p99 >> p50)

Symptoms: Average latency is fine, but p99 is 10x worse.

Common causes:

  • Queue backed up during traffic bursts.
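  • Allocator pauses from allocating fresh buffers on every request (Vec::new() itself does not allocate; the cost comes from growing and freeing large Vecs in the hot path).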
  • Lock contention on model weight access.

Diagnosis:

rust
use std::time::Instant;

struct LatencyTracker {
    samples: Vec<u64>, // microseconds
}

impl LatencyTracker {
    fn new() -> Self { Self { samples: Vec::new() } }

    fn record(&mut self, start: Instant) {
        self.samples.push(start.elapsed().as_micros() as u64);
    }

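    /// Returns the sample at the p-th percentile position, or 0 if no samples exist.
    fn percentile(&mut self, p: f64) -> u64 {
        if self.samples.is_empty() {
            return 0; // avoids the `len() - 1` underflow on an empty tracker
        }
        self.samples.sort_unstable();
        let idx = (self.samples.len() as f64 * p / 100.0) as usize;
        self.samples[idx.min(self.samples.len() - 1)]
    }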
}

fn main() {
    let mut tracker = LatencyTracker::new();

    for i in 0..1000u64 {
        let start = Instant::now();
        // Simulate variable latency
        let work: u64 = (0..=i % 100).sum();
        std::hint::black_box(work);
        tracker.record(start);
    }

    println!("p50:  {}µs", tracker.percentile(50.0));
    println!("p95:  {}µs", tracker.percentile(95.0));
    println!("p99:  {}µs", tracker.percentile(99.0));
    println!("p99.9:{}µs", tracker.percentile(99.9));
}

Fix: Use a buffer pool (see performance-tuning guide), set queue size limits, and replace Arc<Mutex<Model>> with a plain Arc<Model> when the weights are read-only during inference, so workers never wait on a lock.
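
The lock-free sharing mentioned in the fix can be sketched as follows. This is a minimal illustration, not a real inference engine: `Model`, its `weights` field, and `infer_parallel` are hypothetical names, and the "inference" is just a dot product. The point it shows is that a method taking `&self` can be called from many threads through an `Arc` without any `Mutex`.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for a loaded model; a real one would hold tensors.
struct Model {
    weights: Vec<f32>,
}

impl Model {
    // Takes &self: the weights are never mutated, so no lock is required.
    fn infer(&self, input: &[f32]) -> f32 {
        input.iter().zip(&self.weights).map(|(x, w)| x * w).sum()
    }
}

// Runs one inference per input on its own thread, all sharing one model.
fn infer_parallel(model: Arc<Model>, inputs: Vec<Vec<f32>>) -> Vec<f32> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| {
            // Cloning the Arc bumps a reference count; no weights are copied.
            let model = Arc::clone(&model);
            thread::spawn(move || model.infer(&input))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let model = Arc::new(Model { weights: vec![0.5, 2.0, 1.0] });
    let inputs = vec![vec![1.0, 1.0, 1.0], vec![2.0, 0.0, 4.0]];
    println!("{:?}", infer_parallel(model, inputs));
}
```

If the model does need occasional mutation (for example, hot-swapping weights), an `RwLock` is a possible middle ground, at the cost of reintroducing some contention.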

---

Issue 2: Memory grows unboundedly

Symptoms: RSS climbs over hours; server eventually OOM-killed.

Common causes:

  • Inference results cached without eviction.
  • Request objects not dropped after sending response.
  • Arc cycles preventing deallocation.

Diagnosis:

rust
// Add periodic memory reporting
fn log_memory_usage() {
    // On Linux, read /proc/self/status.
    // VmRSS is the current resident set size; VmHWM is its peak ("high water mark").
    if let Ok(status) = std::fs::read_to_string("/proc/self/status") {
        for line in status.lines() {
            if line.starts_with("VmRSS:") || line.starts_with("VmHWM:") {
                println!("[mem] {}", line.trim());
            }
        }
    }
}

fn main() {
    log_memory_usage();
}

Fix: Use std::sync::Weak for back-references to break Arc cycles; give result caches an LRU eviction policy with a TTL; run under valgrind --tool=massif to find the leak source.
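
As a rough sketch of the cycle-breaking part of that fix, the example below uses hypothetical `Session` and `Request` types: the session owns its requests through `Arc`, while each request points back through a `Weak`. Because the back-reference is weak, dropping the last external handle to the session frees everything.

```rust
use std::sync::{Arc, Mutex, Weak};

// Hypothetical types: a session owns its requests; each request points back.
struct Session {
    requests: Mutex<Vec<Arc<Request>>>,
}

struct Request {
    // Weak back-reference: it does not keep the Session alive.
    // With Arc<Session> here instead, the two would keep each other alive forever.
    session: Weak<Session>,
}

fn build_session(n: usize) -> Arc<Session> {
    let session = Arc::new(Session { requests: Mutex::new(Vec::new()) });
    for _ in 0..n {
        let req = Arc::new(Request { session: Arc::downgrade(&session) });
        session.requests.lock().unwrap().push(req);
    }
    session
}

fn main() {
    let session = build_session(3);
    println!(
        "strong = {}, weak = {}",
        Arc::strong_count(&session),
        Arc::weak_count(&session)
    );

    // A request can still reach its session while the session is alive.
    let first = Arc::clone(&session.requests.lock().unwrap()[0]);
    println!("request sees session: {}", first.session.upgrade().is_some());

    // Dropping the only strong handle frees the session despite the back-references.
    let probe = Arc::downgrade(&session);
    drop(session);
    println!("session freed: {}", probe.upgrade().is_none());
}
```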

---

Issue 3: Wrong inference outputs

Symptoms: Model returns garbage or all-zero outputs.

Common causes:

  • Input not normalized to the range the model expects.
  • Dtype mismatch (for example, f64 input passed to a model that expects f32).
  • Input tensor transposed incorrectly.

Diagnosis:

rust
fn validate_output(output: &[f32], min: f32, max: f32) -> bool {
    if output.iter().all(|&v| v == 0.0) {
        eprintln!("⚠️  Output is all zeros — check input normalization");
        return false;
    }
    if output.iter().any(|v| !v.is_finite()) {
        eprintln!("⚠️  Output contains NaN/Inf — check for division by zero");
        return false;
    }
    if output.iter().any(|&v| v < min || v > max) {
        eprintln!("⚠️  Output out of expected range [{}, {}]", min, max);
        return false;
    }
    true
}

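/// Min-max normalize the input to [-1, 1] in place.
/// Inputs that are (nearly) constant are left unchanged to avoid dividing by ~0.
fn normalize(input: &mut [f32]) {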
    let max = input.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let min = input.iter().copied().fold(f32::INFINITY, f32::min);
    let range = max - min;
    if range > 1e-6 {
        for v in input.iter_mut() {
            *v = (*v - min) / range * 2.0 - 1.0;
        }
    }
}

fn main() {
    let mut input = vec![100.0f32, 200.0, 50.0, 300.0];
    normalize(&mut input);
    println!("Normalized: {:?}", input);

    // Stand-in for a real model call.
    let output = input.iter().map(|x| x * 0.5).collect::<Vec<_>>();
    let valid = validate_output(&output, -1.0, 1.0);
    println!("Output valid: {}", valid);
}
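
For the transposition cause listed above, a common case is image data arriving in interleaved height-width-channel (HWC) order while the model expects planar channel-height-width (CHW). The sketch below assumes that specific pair of layouts, which the original text does not state; `hwc_to_chw` is a hypothetical helper name. Checking the buffer length against the declared shape catches many layout mistakes before they reach the model.

```rust
/// Converts an interleaved HWC buffer into planar CHW order.
/// Returns None if the buffer length does not match h * w * c.
fn hwc_to_chw(input: &[f32], h: usize, w: usize, c: usize) -> Option<Vec<f32>> {
    if input.len() != h * w * c {
        return None;
    }
    let mut out = vec![0.0f32; input.len()];
    for y in 0..h {
        for x in 0..w {
            for ch in 0..c {
                // Source index: pixel-major. Destination index: channel-major.
                out[ch * h * w + y * w + x] = input[(y * w + x) * c + ch];
            }
        }
    }
    Some(out)
}

fn main() {
    // Two pixels with three channels each: (1, 2, 3) and (4, 5, 6).
    let hwc = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let chw = hwc_to_chw(&hwc, 1, 2, 3).expect("shape mismatch");
    println!("{:?}", chw); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]

    // A wrong shape is rejected instead of silently producing garbage.
    println!("{:?}", hwc_to_chw(&hwc, 2, 2, 3));
}
```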

---

Issue 4: Tokio runtime stall

Symptoms: All requests hang indefinitely; CPU at 0%.

Diagnosis:

rust
// Enable tokio-console for runtime introspection:
// In Cargo.toml:
//   tokio = { version = "1", features = ["full", "tracing"] }
//   console-subscriber = "0.3"
//
// Build with: RUSTFLAGS="--cfg tokio_unstable" cargo build
//
// At startup:
//   console_subscriber::init();
//
// Then run: tokio-console

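// Quick check: add a timeout to every inference call.
// Note: the timeout only stops the caller from waiting. A closure already
// running inside spawn_blocking cannot be cancelled and runs to completion.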
use tokio::time::{timeout, Duration};

async fn safe_infer(input: Vec<f32>) -> Result<Vec<f32>, &'static str> {
    timeout(Duration::from_secs(5), async move {
        tokio::task::spawn_blocking(move || {
            input.iter().map(|x| x * 2.0).collect::<Vec<_>>()
        })
        .await
        .map_err(|_| "task panic")
    })
    .await
    .map_err(|_| "inference timeout")?
}

#[tokio::main]
async fn main() {
    let result = safe_infer(vec![1.0, 2.0, 3.0]).await;
    println!("{:?}", result);
}

---

Troubleshooting checklist

  • [ ] Add tracing spans around every inference call.
  • [ ] Log queue depth every 10 seconds.
  • [ ] Set memory limit alerts at 70% and 90% of container limit.
  • [ ] Validate model output range after every deploy.
  • [ ] Test with wrk or hey under sustained load to surface tail latency.
  • [ ] Build with cargo build --release; debug builds skip optimizations and can be 10-50x slower for numeric code.
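
The queue-related items above (a size limit plus a depth gauge to log) can be sketched with the standard library alone. The helper names `try_enqueue` and `dequeue` and the capacity of 2 are illustrative assumptions; a real server would read the gauge from a timer every 10 seconds rather than printing it inline.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Tries to enqueue without blocking; returns false if the queue is full
/// (or the receiver is gone), so the caller can shed load instead of waiting.
fn try_enqueue(tx: &SyncSender<Vec<f32>>, depth: &AtomicUsize, req: Vec<f32>) -> bool {
    // Count first, then undo on failure, so a concurrent consumer can never
    // decrement the gauge below zero.
    depth.fetch_add(1, Ordering::Relaxed);
    if tx.try_send(req).is_err() {
        depth.fetch_sub(1, Ordering::Relaxed);
        return false;
    }
    true
}

/// Takes one request off the queue, if any, and updates the gauge.
fn dequeue(rx: &Receiver<Vec<f32>>, depth: &AtomicUsize) -> Option<Vec<f32>> {
    let req = rx.try_recv().ok()?;
    depth.fetch_sub(1, Ordering::Relaxed);
    Some(req)
}

fn main() {
    // Bounded queue: at most 2 requests may wait at once (illustrative value).
    let (tx, rx) = sync_channel::<Vec<f32>>(2);
    let depth = AtomicUsize::new(0);

    for i in 0..3 {
        let accepted = try_enqueue(&tx, &depth, vec![i as f32]);
        println!(
            "request {}: accepted = {}, depth = {}",
            i,
            accepted,
            depth.load(Ordering::Relaxed)
        );
    }

    dequeue(&rx, &depth);
    println!("after one dequeue: depth = {}", depth.load(Ordering::Relaxed));
}
```

Rejecting the third request immediately keeps tail latency bounded during bursts; the trade-off is that callers must handle the rejection, for example by returning an overload error to the client.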
