LLM Rust Interview Q&A
Top interview questions and answers about building LLM applications in Rust. Covers streaming APIs, prompt design, context management, rate limiting, and system architecture for AI engineers.
Q1: How do you implement a streaming LLM response proxy in Rust?
Answer: Use Axum's SSE support combined with a reqwest async stream. The key is to forward bytes as they arrive without buffering the full response:
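
    use std::io::Write;
    use std::time::Duration;
    use tokio::sync::mpsc;

    /// Proxy pattern: forward tokens from the LLM to the client as they arrive.
    /// `_prompt` is unused here because the upstream call is simulated.
    async fn stream_proxy(_prompt: String, client_tx: mpsc::Sender<String>) {
        // In production (needs reqwest's `stream` feature and
        // `futures_util::StreamExt` for `.next()`):
        //   let response = reqwest::Client::new()
        //       .post(llm_url)
        //       .json(&request_with_stream_true)
        //       .send().await.unwrap();
        //
        //   let mut stream = response.bytes_stream();
        //   while let Some(chunk) = stream.next().await {
        //       // Caveat: a chunk can end mid-line or mid-character, so real
        //       // code should buffer until a full line arrives before decoding.
        //       let line = String::from_utf8(chunk.unwrap().to_vec()).unwrap();
        //       if let Some(token) = parse_sse_token(&line) {
        //           client_tx.send(token).await.ok();
        //       }
        //   }

        // Simulated streaming
        let tokens = vec!["Rust ", "is ", "great ", "for ", "AI!"];
        for token in tokens {
            tokio::time::sleep(Duration::from_millis(20)).await;
            // Stop early if the client went away (receiver dropped).
            if client_tx.send(token.to_string()).await.is_err() {
                break;
            }
        }
    }

    #[tokio::main]
    async fn main() {
        let (tx, mut rx) = mpsc::channel(32);
        tokio::spawn(stream_proxy("Why Rust for AI?".to_string(), tx));
        while let Some(token) = rx.recv().await {
            print!("{}", token);
            // Flush so each token shows up immediately, not at the final newline.
            std::io::stdout().flush().ok();
        }
        println!();
    }

In an Axum handler, the receiving end of this channel is turned into a stream (for example with tokio_stream::wrappers::ReceiverStream), each token is mapped to an SSE Event, and the result is returned via Sse::new(stream).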
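The commented production loop decodes each network chunk as if it were one complete line, but chunk boundaries are arbitrary. The following is a minimal std-only sketch of one way to buffer bytes until a full line arrives and then extract the `data:` payload. `SseLineBuffer` is a hypothetical helper, not part of reqwest or Axum, and it assumes the upstream sends newline-terminated `data:` lines (OpenAI-style streams end with a `[DONE]` payload). Turning a payload into a token still requires parsing the provider's JSON, which is out of scope here.

```rust
/// Hypothetical helper: accumulates raw bytes from a streaming body and
/// yields the payload of every complete SSE `data:` line.
#[derive(Default)]
struct SseLineBuffer {
    buf: Vec<u8>,
}

impl SseLineBuffer {
    /// Feed one network chunk; returns payloads for the lines it completed.
    /// A trailing partial line stays buffered until the next chunk.
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buf.extend_from_slice(chunk);
        let mut payloads = Vec::new();
        // Split on the newline byte; it never occurs inside a multi-byte
        // UTF-8 character, so complete lines can be decoded safely.
        while let Some(pos) = self.buf.iter().position(|&b| b == b'\n') {
            let raw: Vec<u8> = self.buf.drain(..=pos).collect();
            // Lossy decoding keeps the proxy alive on malformed input.
            let text = String::from_utf8_lossy(&raw);
            let line = text.trim_end_matches(|c: char| c == '\n' || c == '\r');
            if let Some(data) = line.strip_prefix("data:") {
                // The SSE format allows one optional space after the colon.
                payloads.push(data.strip_prefix(' ').unwrap_or(data).to_string());
            }
        }
        payloads
    }
}
```

In the proxy loop, each chunk from `bytes_stream()` would be passed to `push`, and every returned payload other than `[DONE]` would be parsed and forwarded to the client channel.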
---
Q2: How do you handle context window limits in a chat application?
Answer: Implement a sliding window that keeps the system prompt and most recent messages, evicting oldest messages when the budget is exceeded:

    #[derive(Clone)]
    struct Message {
        role: String,
        content: String,
    }

    struct ContextWindow {
        system: String,
        messages: Vec<Message>,
        max_tokens: usize,
    }

    impl ContextWindow {
        /// Rough heuristic: about 4 bytes per token, plus 4 tokens of overhead per message.
        fn token_count(s: &str) -> usize {
            s.len() / 4 + 4
        }

        fn total_tokens(&self) -> usize {
            Self::token_count(&self.system)
                + self.messages.iter().map(|m| Self::token_count(&m.content)).sum::<usize>()
        }

        fn add(&mut self, role: &str, content: &str) {
            self.messages.push(Message { role: role.to_string(), content: content.to_string() });
            // Evict the oldest user+assistant pair until we fit. Assuming turns
            // alternate, the history then still starts with a user turn. The
            // system prompt is never evicted, and the loop stops once two or
            // fewer messages remain, so one huge message can still exceed the budget.
            while self.total_tokens() > self.max_tokens && self.messages.len() > 2 {
                self.messages.drain(..2);
            }
        }

        /// System prompt first, then the surviving history, as (role, content) pairs.
        fn to_send(&self) -> Vec<(&str, &str)> {
            std::iter::once(("system", self.system.as_str()))
                .chain(self.messages.iter().map(|m| (m.role.as_str(), m.content.as_str())))
                .collect()
        }
    }

    fn main() {
        let mut ctx = ContextWindow {
            system: "You are a Rust expert.".to_string(),
            messages: Vec::new(),
            max_tokens: 1000,
        };
        ctx.add("user", "What is ownership?");
        ctx.add("assistant", "Ownership ensures each value has exactly one owner...");
        ctx.add("user", "What about borrowing?");
        println!("Context tokens: ~{}", ctx.total_tokens());
        println!("Messages: {}", ctx.to_send().len());
    }

---
Q3: How do you implement multi-provider fallback?
Answer: Define a provider list with priority order. On failure, try the next provider:

    #[derive(Clone)]
    struct Provider {
        name: String,
        priority: u8,
    }

    async fn call_with_fallback(
        providers: &[Provider],
        prompt: &str,
    ) -> Result<String, String> {
        let mut sorted = providers.to_vec();
        sorted.sort_by_key(|p| p.priority);
        let mut last_error = "no providers".to_string();
        for provider in &sorted {
            match simulate_call(&provider.name, prompt).await {
                Ok(response) => return Ok(response),
                Err(e) => {
                    eprintln!("Provider {} failed: {}", provider.name, e);
                    last_error = e;
                }
            }
        }
        Err(last_error)
    }

    async fn simulate_call(provider: &str, prompt: &str) -> Result<String, String> {
        if provider == "openai" {
            return Err("rate limited".to_string());
        }
        Ok(format!("{} response to: {}", provider, prompt))
    }

    #[tokio::main]
    async fn main() {
        let providers = vec![
            Provider { name: "openai".to_string(), priority: 1 },
            Provider { name: "anthropic".to_string(), priority: 2 },
            Provider { name: "groq".to_string(), priority: 3 },
        ];
        let result = call_with_fallback(&providers, "Explain Rust lifetimes").await;
        println!("{:?}", result);
    }

---
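A common follow-up to the Q3 fallback loop is whether every failure should move on to the next provider. The sketch below is one possible policy, not the only one: it replaces the `String` error with a hypothetical `ProviderError` enum and stops early on errors that another provider is unlikely to fix. The enum variants, `should_fall_back`, and `call_with_typed_fallback` are illustrative names, and a synchronous closure stands in for the async provider call to keep the example std-only.

```rust
/// Hypothetical error type for provider calls; the variants are illustrative.
#[derive(Debug, Clone, PartialEq)]
enum ProviderError {
    RateLimited,
    Timeout,
    ServerError(u16),
    InvalidRequest(String),
    NoProviders,
}

impl ProviderError {
    /// True when a different provider might succeed with the same request.
    fn should_fall_back(&self) -> bool {
        !matches!(self, ProviderError::InvalidRequest(_))
    }
}

/// Tries providers in the given order. Transient errors move on to the next
/// provider; a request-level error is returned immediately.
fn call_with_typed_fallback<F>(providers: &[&str], mut call: F) -> Result<String, ProviderError>
where
    F: FnMut(&str) -> Result<String, ProviderError>,
{
    let mut last_error = ProviderError::NoProviders;
    for &name in providers {
        match call(name) {
            Ok(response) => return Ok(response),
            Err(e) if e.should_fall_back() => last_error = e,
            Err(e) => return Err(e),
        }
    }
    Err(last_error)
}
```

With this policy a rate-limited first provider falls through to the second, while a malformed request fails after a single call instead of spending quota on every provider in the list.

---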
Q4: How do you count tokens accurately without calling the API?
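Answer: For OpenAI models, use the tiktoken-rs crate, which runs the same BPE encodings locally; pick the encoding that matches the model (for example, cl100k_base for GPT-4 or o200k_base for GPT-4o). Other providers use their own tokenizers, so tiktoken counts are only approximate for them. For a quick estimate: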

    /// Quick token estimator: a heuristic, roughly ±20% for English text.
    fn estimate_tokens(text: &str, model: &str) -> u32 {
        // Different models have different tokenization densities (assumed ratios).
        let chars_per_token = match model {
            m if m.starts_with("gpt") => 3.8,
            m if m.starts_with("claude") => 4.0,
            _ => 4.0,
        };
        let word_tokens = text.split_whitespace().count() as f32 * 1.3;
        // `len()` counts bytes, which equals characters for ASCII text.
        let char_tokens = text.len() as f32 / chars_per_token;
        // Take the average of word-based and char-based estimates
        ((word_tokens + char_tokens) / 2.0) as u32 + 3 // +3 for message overhead
    }

    fn main() {
        let system = "You are a helpful Rust programming assistant.";
        let user = "How do I implement a thread-safe singleton in Rust?";
        let total = estimate_tokens(system, "gpt-4o") + estimate_tokens(user, "gpt-4o");
        // Assumes an illustrative price of $5 per 1M input tokens; check current pricing.
        println!("~{} tokens | cost ~${:.6}", total, total as f64 * 5.0 / 1_000_000.0);
    }

For production, count with tiktoken-rs rather than this estimate, so that requests do not unexpectedly exceed the context limit.
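When only an estimate is available, one way to stay under the context limit is to pad the estimate before comparing it with the budget. This is a small sketch under stated assumptions: `fits_context` is a hypothetical helper, and the context limit, reply reservation, and safety margin are values the caller must supply for the model in use.

```rust
/// Returns true if the prompt is expected to fit, leaving room for the reply
/// and for estimator error. All limits are caller-supplied assumptions.
fn fits_context(
    estimated_prompt_tokens: u32,
    max_output_tokens: u32,
    context_limit: u32,
    margin_pct: u32,
) -> bool {
    // Inflate the estimate by the margin, rounding up, to cover under-counting.
    let padded = (estimated_prompt_tokens as u64 * (100 + margin_pct as u64) + 99) / 100;
    // The prompt and the reserved reply share the same context window.
    padded + max_output_tokens as u64 <= context_limit as u64
}
```

For example, with a 2,000-token limit and 500 tokens reserved for the reply, an estimated 1,400-token prompt passes a naive check (1,900 tokens) but is rejected with a 20% margin, because the padded total is 2,180.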
---
Q5: What are the key metrics for an LLM gateway?
Answer:
| Metric | Purpose | Alert threshold |
|---|---|---|
| llm_ttft_ms | Time to first token — UX | > 500ms |
| llm_total_latency_ms | End-to-end latency | > 10s |
| llm_tokens_used_total | Cost tracking | Warn when nearing the provider rate limit |
| llm_error_rate | Provider reliability | > 1% |
| llm_cache_hit_rate | Cache efficiency | < 30% (investigate) |
| llm_cost_usd_per_hour | Budget control | Alert on anomaly |
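
To make the table concrete, here is a minimal std-only sketch of how a gateway might aggregate per-request samples and check them against the thresholds above. `RequestSample`, `GatewaySummary`, and `summarize` are hypothetical names, and using the worst-case latency in the window (rather than a percentile) is a simplifying assumption; a real deployment would more likely export these values to a metrics system.

```rust
use std::time::Duration;

/// Hypothetical per-request sample collected by the gateway.
struct RequestSample {
    ttft: Option<Duration>, // None if the request failed before any token arrived
    total_latency: Duration,
    tokens_used: u64,
    failed: bool,
    cache_hit: bool,
}

/// Aggregates over one reporting window, plus the names of breached metrics.
#[derive(Debug)]
struct GatewaySummary {
    error_rate: f64,
    cache_hit_rate: f64,
    tokens_used_total: u64,
    alerts: Vec<&'static str>,
}

fn summarize(samples: &[RequestSample]) -> GatewaySummary {
    let n = samples.len().max(1) as f64; // avoid dividing by zero on an empty window
    let error_rate = samples.iter().filter(|s| s.failed).count() as f64 / n;
    let cache_hit_rate = samples.iter().filter(|s| s.cache_hit).count() as f64 / n;
    let tokens_used_total: u64 = samples.iter().map(|s| s.tokens_used).sum();
    // Worst case in the window, in milliseconds (0 when there is no data).
    let max_ttft_ms = samples.iter().filter_map(|s| s.ttft).map(|d| d.as_millis()).max().unwrap_or(0);
    let max_total_ms = samples.iter().map(|s| s.total_latency.as_millis()).max().unwrap_or(0);

    // Thresholds taken from the table above.
    let mut alerts = Vec::new();
    if max_ttft_ms > 500 {
        alerts.push("llm_ttft_ms");
    }
    if max_total_ms > 10_000 {
        alerts.push("llm_total_latency_ms");
    }
    if error_rate > 0.01 {
        alerts.push("llm_error_rate");
    }
    if !samples.is_empty() && cache_hit_rate < 0.30 {
        alerts.push("llm_cache_hit_rate");
    }
    GatewaySummary { error_rate, cache_hit_rate, tokens_used_total, alerts }
}
```

Calling `summarize` once per reporting window yields both the raw rates and the list of breached metric names, which can then be routed to whatever alerting channel the team uses.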