RRust By Example

Rust AI Inference Performance Tuning

Deep dive into performance tuning techniques for AI inference in Rust. Covers batching, SIMD, memory layout, thread pinning, and GPU dispatch strategies to minimize latency.

Topic: Ai Inference

Search intent: High-intent search: "rust ai inference performance"

Rust AI Inference Performance Tuning

Problem

AI inference is one of the most latency-sensitive workloads. A 10ms improvement in p99 latency can mean the difference between a great user experience and timeouts. This guide covers the systematic techniques that Rust enables uniquely well.

Performance tuning hierarchy

Apply optimizations in this order — earlier steps typically yield larger wins:

1. Algorithmic — batch requests, reduce unnecessary copies.

2. Data layout — contiguous memory, avoid pointer indirection.

3. Parallelism — saturate all CPU cores; pipeline stages.

4. SIMD — use packed_simd or autovectorization-friendly code.

5. Allocation — reuse buffers; use arena allocators.

6. I/O — async request handling; zero-copy serialization.

Runnable example — zero-copy tensor buffer pool

rust
use std::sync::{Arc, Mutex};
use std::collections::VecDeque;

/// Reusable fixed-size f32 buffer to avoid repeated allocation
struct TensorBuffer {
    data: Vec<f32>,
    capacity: usize,
}

impl TensorBuffer {
    fn new(capacity: usize) -> Self {
        Self {
            data: vec![0.0f32; capacity],
            capacity,
        }
    }

    fn fill(&mut self, src: &[f32]) {
        let len = src.len().min(self.capacity);
        self.data[..len].copy_from_slice(&src[..len]);
        // Zero-fill remainder
        for v in &mut self.data[len..] { *v = 0.0; }
    }

    fn as_slice(&self) -> &[f32] { &self.data }
}

/// Pool of reusable buffers
struct BufferPool {
    pool: Mutex<VecDeque<TensorBuffer>>,
    capacity: usize,
}

impl BufferPool {
    fn new(pool_size: usize, buffer_capacity: usize) -> Self {
        let mut pool = VecDeque::with_capacity(pool_size);
        for _ in 0..pool_size {
            pool.push_back(TensorBuffer::new(buffer_capacity));
        }
        Self { pool: Mutex::new(pool), capacity: buffer_capacity }
    }

    fn acquire(&self) -> TensorBuffer {
        let mut pool = self.pool.lock().unwrap();
        pool.pop_front().unwrap_or_else(|| TensorBuffer::new(self.capacity))
    }

    fn release(&self, buf: TensorBuffer) {
        let mut pool = self.pool.lock().unwrap();
        if pool.len() < 64 {
            pool.push_back(buf);
        }
        // Drop excess buffers automatically
    }
}

fn simulate_inference(buf: &TensorBuffer) -> f32 {
    // Tight loop — CPU-friendly contiguous access
    buf.as_slice().iter().copied().sum::<f32>() / buf.as_slice().len() as f32
}

fn main() {
    let pool = Arc::new(BufferPool::new(16, 512));

    let handles: Vec<_> = (0..8).map(|i| {
        let pool = pool.clone();
        std::thread::spawn(move || {
            let mut buf = pool.acquire();
            let data: Vec<f32> = (0..256).map(|x| x as f32 * 0.01 + i as f32).collect();
            buf.fill(&data);
            let result = simulate_inference(&buf);
            pool.release(buf);
            println!("Thread {} result: {:.4}", i, result);
        })
    }).collect();

    for h in handles { h.join().unwrap(); }
}

SIMD-friendly matrix multiply hint

rust
/// Ensure data is 32-byte aligned for AVX2 autovectorization
#[repr(align(32))]
struct AlignedMatrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl AlignedMatrix {
    fn new(rows: usize, cols: usize) -> Self {
        Self { rows, cols, data: vec![0.0; rows * cols] }
    }

    /// Row-major element access
    #[inline(always)]
    fn get(&self, r: usize, c: usize) -> f32 {
        self.data[r * self.cols + c]
    }

    #[inline(always)]
    fn set(&mut self, r: usize, c: usize, v: f32) {
        self.data[r * self.cols + c] = v;
    }

    /// Naive matmul — rustc will autovectorize the inner loop
    fn matmul(&self, other: &AlignedMatrix) -> AlignedMatrix {
        assert_eq!(self.cols, other.rows);
        let mut out = AlignedMatrix::new(self.rows, other.cols);
        for i in 0..self.rows {
            for k in 0..self.cols {
                let a = self.get(i, k);
                for j in 0..other.cols {
                    let cur = out.get(i, j);
                    out.set(i, j, cur + a * other.get(k, j));
                }
            }
        }
        out
    }
}

fn main() {
    let a = AlignedMatrix { rows: 4, cols: 4, data: (0..16).map(|x| x as f32).collect() };
    let b = AlignedMatrix { rows: 4, cols: 4, data: vec![1.0; 16] };
    let c = a.matmul(&b);
    println!("Result[0][0] = {}", c.get(0, 0));
}

Thread pinning for NUMA systems

rust
// In Cargo.toml: core_affinity = "0.8"
// use core_affinity;
//
// fn pin_worker_to_core(core_id: usize) {
//     let cores = core_affinity::get_core_ids().unwrap();
//     if let Some(core) = cores.get(core_id) {
//         core_affinity::set_for_current(*core);
//     }
// }

Benchmarking checklist

  • [ ] Use criterion for micro-benchmarks; always warm up.
  • [ ] Disable CPU frequency scaling (cpupower frequency-set -g performance).
  • [ ] Compare p50, p95, p99 — not just mean.
  • [ ] Profile with perf stat to find IPC and cache miss rates.
  • [ ] Check LLVM IR with cargo rustc -- --emit=llvm-ir to verify vectorization.

Related reading

Related Guides

Continue in This Topic

More Rust Guides