Rust AI Inference Performance Tuning
Deep dive into performance tuning techniques for AI inference in Rust. Covers batching, SIMD, memory layout, thread pinning, and GPU dispatch strategies to minimize latency.
Topic: Ai Inference
Search intent: High-intent search: "rust ai inference performance"
Rust AI Inference Performance Tuning
Problem
AI inference is one of the most latency-sensitive workloads. A 10ms improvement in p99 latency can mean the difference between a great user experience and timeouts. This guide covers the systematic techniques that Rust enables uniquely well.
Performance tuning hierarchy
Apply optimizations in this order — earlier steps typically yield larger wins:
1. Algorithmic — batch requests, reduce unnecessary copies.
2. Data layout — contiguous memory, avoid pointer indirection.
3. Parallelism — saturate all CPU cores; pipeline stages.
4. SIMD — use packed_simd or autovectorization-friendly code.
5. Allocation — reuse buffers; use arena allocators.
6. I/O — async request handling; zero-copy serialization.
Runnable example — zero-copy tensor buffer pool
use std::sync::{Arc, Mutex};
use std::collections::VecDeque;
/// Reusable fixed-size f32 buffer to avoid repeated allocation
struct TensorBuffer {
data: Vec<f32>,
capacity: usize,
}
impl TensorBuffer {
fn new(capacity: usize) -> Self {
Self {
data: vec![0.0f32; capacity],
capacity,
}
}
fn fill(&mut self, src: &[f32]) {
let len = src.len().min(self.capacity);
self.data[..len].copy_from_slice(&src[..len]);
// Zero-fill remainder
for v in &mut self.data[len..] { *v = 0.0; }
}
fn as_slice(&self) -> &[f32] { &self.data }
}
/// Pool of reusable buffers
struct BufferPool {
pool: Mutex<VecDeque<TensorBuffer>>,
capacity: usize,
}
impl BufferPool {
fn new(pool_size: usize, buffer_capacity: usize) -> Self {
let mut pool = VecDeque::with_capacity(pool_size);
for _ in 0..pool_size {
pool.push_back(TensorBuffer::new(buffer_capacity));
}
Self { pool: Mutex::new(pool), capacity: buffer_capacity }
}
fn acquire(&self) -> TensorBuffer {
let mut pool = self.pool.lock().unwrap();
pool.pop_front().unwrap_or_else(|| TensorBuffer::new(self.capacity))
}
fn release(&self, buf: TensorBuffer) {
let mut pool = self.pool.lock().unwrap();
if pool.len() < 64 {
pool.push_back(buf);
}
// Drop excess buffers automatically
}
}
fn simulate_inference(buf: &TensorBuffer) -> f32 {
// Tight loop — CPU-friendly contiguous access
buf.as_slice().iter().copied().sum::<f32>() / buf.as_slice().len() as f32
}
fn main() {
let pool = Arc::new(BufferPool::new(16, 512));
let handles: Vec<_> = (0..8).map(|i| {
let pool = pool.clone();
std::thread::spawn(move || {
let mut buf = pool.acquire();
let data: Vec<f32> = (0..256).map(|x| x as f32 * 0.01 + i as f32).collect();
buf.fill(&data);
let result = simulate_inference(&buf);
pool.release(buf);
println!("Thread {} result: {:.4}", i, result);
})
}).collect();
for h in handles { h.join().unwrap(); }
}SIMD-friendly matrix multiply hint
/// Ensure data is 32-byte aligned for AVX2 autovectorization
#[repr(align(32))]
struct AlignedMatrix {
rows: usize,
cols: usize,
data: Vec<f32>,
}
impl AlignedMatrix {
fn new(rows: usize, cols: usize) -> Self {
Self { rows, cols, data: vec![0.0; rows * cols] }
}
/// Row-major element access
#[inline(always)]
fn get(&self, r: usize, c: usize) -> f32 {
self.data[r * self.cols + c]
}
#[inline(always)]
fn set(&mut self, r: usize, c: usize, v: f32) {
self.data[r * self.cols + c] = v;
}
/// Naive matmul — rustc will autovectorize the inner loop
fn matmul(&self, other: &AlignedMatrix) -> AlignedMatrix {
assert_eq!(self.cols, other.rows);
let mut out = AlignedMatrix::new(self.rows, other.cols);
for i in 0..self.rows {
for k in 0..self.cols {
let a = self.get(i, k);
for j in 0..other.cols {
let cur = out.get(i, j);
out.set(i, j, cur + a * other.get(k, j));
}
}
}
out
}
}
fn main() {
let a = AlignedMatrix { rows: 4, cols: 4, data: (0..16).map(|x| x as f32).collect() };
let b = AlignedMatrix { rows: 4, cols: 4, data: vec![1.0; 16] };
let c = a.matmul(&b);
println!("Result[0][0] = {}", c.get(0, 0));
}Thread pinning for NUMA systems
// In Cargo.toml: core_affinity = "0.8"
// use core_affinity;
//
// fn pin_worker_to_core(core_id: usize) {
// let cores = core_affinity::get_core_ids().unwrap();
// if let Some(core) = cores.get(core_id) {
// core_affinity::set_for_current(*core);
// }
// }Benchmarking checklist
- [ ] Use
criterionfor micro-benchmarks; always warm up. - [ ] Disable CPU frequency scaling (
cpupower frequency-set -g performance). - [ ] Compare p50, p95, p99 — not just mean.
- [ ] Profile with
perf statto find IPC and cache miss rates. - [ ] Check LLVM IR with
cargo rustc -- --emit=llvm-irto verify vectorization.