LLM Rust Interview Q&A

Q1: How do you implement a streaming LLM response proxy in Rust?

Answer: Use Axum's SSE support combined with a reqwest async stream. The key is to forward bytes as they arrive without buffering the full response:

```rust
use std::io::Write;
use std::time::Duration;
use tokio::sync::mpsc;

/// Proxy pattern: forward tokens from the LLM to the client as they arrive.
async fn stream_proxy(
    _prompt: String, // unused by the simulation; the real request body would carry it
    client_tx: mpsc::Sender<String>,
) {
    // In production (needs reqwest's `stream` feature and `futures_util::StreamExt`):
    // let response = reqwest::Client::new()
    //     .post(llm_url)
    //     .json(&request_with_stream_true)
    //     .send().await.unwrap();
    //
    // let mut stream = response.bytes_stream();
    // while let Some(chunk) = stream.next().await {
    //     // Caveat: a chunk can end mid-line or mid-character, so real code
    //     // should buffer until a newline rather than decode each chunk alone.
    //     let line = String::from_utf8(chunk.unwrap().to_vec()).unwrap();
    //     if let Some(token) = parse_sse_token(&line) {
    //         client_tx.send(token).await.ok();
    //     }
    // }

    // Simulated streaming
    let tokens = vec!["Rust ", "is ", "great ", "for ", "AI!"];
    for token in tokens {
        tokio::time::sleep(Duration::from_millis(20)).await;
        // Stop early if the receiver was dropped (client disconnected).
        if client_tx.send(token.to_string()).await.is_err() {
            break;
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(32);
    tokio::spawn(stream_proxy("Why Rust for AI?".to_string(), tx));

    while let Some(token) = rx.recv().await {
        print!("{}", token);
        // stdout is line-buffered; flush so each token shows up immediately.
        std::io::stdout().flush().ok();
    }
    println!();
}
```

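An Axum SSE handler can turn the receiving end of this channel into a stream (for example with tokio_stream::wrappers::ReceiverStream), map each token to an Event, and return Sse::new(stream).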
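The commented-out production path calls a `parse_sse_token` helper that is never defined, and it decodes each network chunk on its own even though a chunk can end in the middle of a line. The following is a minimal sketch of one way to handle both, using only the standard library. `SseLineBuffer` and `parse_sse_data` are hypothetical names, the payload after `data:` is treated as the token text (real providers usually send JSON that still has to be parsed), and the `[DONE]` sentinel is an assumed OpenAI-style end marker.

```rust
/// Hypothetical helper: accumulates raw bytes and yields complete SSE `data:` payloads.
#[derive(Default)]
struct SseLineBuffer {
    pending: Vec<u8>,
}

impl SseLineBuffer {
    /// Feed one network chunk; returns the payload of every complete `data:` line.
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);
        let mut payloads = Vec::new();
        // Consume only up to each newline; any tail stays buffered because it
        // may be a partial line or a partial multi-byte UTF-8 character.
        while let Some(pos) = self.pending.iter().position(|&b| b == b'\n') {
            let raw: Vec<u8> = self.pending.drain(..=pos).collect();
            let text = String::from_utf8_lossy(&raw);
            let line = text.trim_end_matches(|c| c == '\r' || c == '\n');
            if let Some(payload) = parse_sse_data(line) {
                payloads.push(payload.to_string());
            }
        }
        payloads
    }
}

/// Returns the payload of a `data:` line, or None for blank lines, comments,
/// other fields, and the assumed `[DONE]` sentinel.
fn parse_sse_data(line: &str) -> Option<&str> {
    let payload = line.strip_prefix("data:")?;
    // SSE allows one optional space after the colon.
    let payload = payload.strip_prefix(' ').unwrap_or(payload);
    if payload == "[DONE]" {
        None
    } else {
        Some(payload)
    }
}
```

In the proxy loop, each chunk from `bytes_stream()` would be passed to `push`, and every returned payload sent on `client_tx`. Splitting on the newline byte is safe for UTF-8 input because `0x0A` never occurs inside a multi-byte character.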

---

Q2: How do you handle context window limits in a chat application?

Answer: Implement a sliding window that keeps the system prompt and most recent messages, evicting oldest messages when the budget is exceeded:

```rust
#[derive(Clone)]
struct Message { role: String, content: String }

struct ContextWindow {
    system: String,
    messages: Vec<Message>,
    max_tokens: usize,
}

impl ContextWindow {
    /// Rough estimate: about 4 bytes per token plus a fixed per-message overhead.
    fn token_count(s: &str) -> usize { s.len() / 4 + 4 }

    fn total_tokens(&self) -> usize {
        Self::token_count(&self.system) +
        self.messages.iter().map(|m| Self::token_count(&m.content)).sum::<usize>()
    }

    fn add(&mut self, role: &str, content: &str) {
        self.messages.push(Message { role: role.to_string(), content: content.to_string() });
        // Evict the oldest messages one at a time until we fit,
        // always keeping at least the two most recent.
        while self.total_tokens() > self.max_tokens && self.messages.len() > 2 {
            self.messages.remove(0);
        }
    }

    /// System prompt first, then the surviving messages in order.
    fn to_send(&self) -> Vec<(&str, &str)> {
        std::iter::once(("system", self.system.as_str()))
            .chain(self.messages.iter().map(|m| (m.role.as_str(), m.content.as_str())))
            .collect()
    }
}

fn main() {
    let mut ctx = ContextWindow {
        system: "You are a Rust expert.".to_string(),
        messages: Vec::new(),
        max_tokens: 1000,
    };
    ctx.add("user", "What is ownership?");
    ctx.add("assistant", "Ownership ensures each value has exactly one owner...");
    ctx.add("user", "What about borrowing?");

    println!("Context tokens: ~{}", ctx.total_tokens());
    println!("Messages: {}", ctx.to_send().len());
}
```
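The `add` method above removes one message per loop iteration, which can leave an assistant reply at the front of the history without the user message that prompted it. If user and assistant messages should be evicted together, one option is to store the history as whole turns. The sketch below assumes a hypothetical `Turn` type and `evict_to_fit` function, and reuses the same rough bytes-per-token heuristic.

```rust
use std::collections::VecDeque;

/// Hypothetical turn type: one user message plus the assistant reply to it.
struct Turn {
    user: String,
    assistant: String,
}

/// Same rough heuristic as above: about 4 bytes per token plus per-message overhead.
fn token_count(s: &str) -> usize {
    s.len() / 4 + 4
}

fn turn_tokens(t: &Turn) -> usize {
    token_count(&t.user) + token_count(&t.assistant)
}

/// Drops the oldest turns until the history fits `budget`, always keeping
/// the newest turn. Returns how many turns were evicted.
fn evict_to_fit(turns: &mut VecDeque<Turn>, budget: usize) -> usize {
    let mut total: usize = turns.iter().map(turn_tokens).sum();
    let mut evicted = 0;
    while total > budget && turns.len() > 1 {
        if let Some(old) = turns.pop_front() {
            total -= turn_tokens(&old);
            evicted += 1;
        }
    }
    evicted
}
```

A `VecDeque` makes removal from the front O(1), whereas `Vec::remove(0)` shifts every remaining element. The running `total` also avoids recomputing the sum on each iteration.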

---

Q3: How do you implement multi-provider fallback?

Answer: Define a provider list with priority order. On failure, try the next provider:

```rust
#[derive(Clone)]
struct Provider { name: String, priority: u8 }

/// Tries providers in ascending `priority` order (1 is tried first) and
/// returns the first successful response, or the last error if all fail.
async fn call_with_fallback(
    providers: &[Provider],
    prompt: &str,
) -> Result<String, String> {
    let mut sorted = providers.to_vec();
    sorted.sort_by_key(|p| p.priority);

    let mut last_error = "no providers".to_string();
    for provider in &sorted {
        match simulate_call(&provider.name, prompt).await {
            Ok(response) => return Ok(response),
            Err(e) => {
                eprintln!("Provider {} failed: {}", provider.name, e);
                last_error = e;
            }
        }
    }
    Err(last_error)
}

/// Stand-in for a real HTTP call: "openai" always fails, everything else succeeds.
async fn simulate_call(provider: &str, prompt: &str) -> Result<String, String> {
    if provider == "openai" { return Err("rate limited".to_string()); }
    Ok(format!("{} response to: {}", provider, prompt))
}

#[tokio::main]
async fn main() {
    let providers = vec![
        Provider { name: "openai".to_string(), priority: 1 },
        Provider { name: "anthropic".to_string(), priority: 2 },
        Provider { name: "groq".to_string(), priority: 3 },
    ];
    let result = call_with_fallback(&providers, "Explain Rust lifetimes").await;
    println!("{:?}", result);
}
```
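The loop above falls through to the next provider on any error. In practice some failures, such as a malformed request, would fail the same way everywhere, so it can be worth stopping early on those. The sketch below is one possible refinement, kept synchronous and standard-library-only for brevity. The `CallError` enum, the `Call` alias, `call_in_order`, and the three stand-in provider functions are all hypothetical.

```rust
/// Hypothetical error type: `Retryable` covers rate limits and timeouts,
/// `Fatal` covers errors another provider would not fix (e.g. an invalid request).
#[derive(Debug, PartialEq)]
enum CallError {
    Retryable(String),
    Fatal(String),
}

/// A provider call, modelled as a plain function pointer for this sketch.
type Call = fn(&str) -> Result<String, CallError>;

/// Tries each (name, call) pair in order. Moves on after a retryable error,
/// stops at the first fatal one. On failure, returns every error seen.
fn call_in_order(providers: &[(&str, Call)], prompt: &str) -> Result<String, Vec<String>> {
    let mut errors = Vec::new();
    for &(name, call) in providers {
        match call(prompt) {
            Ok(response) => return Ok(response),
            Err(CallError::Retryable(e)) => errors.push(format!("{}: {}", name, e)),
            Err(CallError::Fatal(e)) => {
                errors.push(format!("{}: {}", name, e));
                break;
            }
        }
    }
    Err(errors)
}

// Stand-in providers for illustration only.
fn rate_limited(_prompt: &str) -> Result<String, CallError> {
    Err(CallError::Retryable("rate limited".to_string()))
}

fn bad_request(_prompt: &str) -> Result<String, CallError> {
    Err(CallError::Fatal("invalid request".to_string()))
}

fn echo(prompt: &str) -> Result<String, CallError> {
    Ok(format!("echo: {}", prompt))
}
```

Returning all collected errors, instead of only the last one, keeps the full failure chain available for logging.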

---

Q4: How do you count tokens accurately without calling the API?

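Answer: Use the tiktoken-rs crate, which implements OpenAI's BPE encodings (for example cl100k_base for GPT-4 and o200k_base for GPT-4o), so its counts match those models' tokenizers. Other providers use different tokenizers, so for them the result is only an approximation. For a quick estimate: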

```rust
/// Quick token estimator: a rough heuristic for English text
/// (expect errors of roughly ±20%), not a real tokenizer.
fn estimate_tokens(text: &str, model: &str) -> u32 {
    // Assumed average characters per token; real densities vary by model and text.
    let chars_per_token = match model {
        m if m.starts_with("gpt") => 3.8,
        m if m.starts_with("claude") => 4.0,
        _ => 4.0,
    };

    let word_tokens = text.split_whitespace().count() as f32 * 1.3;
    // `len()` counts bytes, which equals characters for ASCII text.
    let char_tokens = text.len() as f32 / chars_per_token;
    // Take the average of word-based and char-based estimates
    ((word_tokens + char_tokens) / 2.0) as u32 + 3 // +3 for message overhead
}

fn main() {
    // Assumed input price in USD per million tokens; check the provider's current pricing.
    const USD_PER_MILLION_TOKENS: f64 = 5.0;

    let system = "You are a helpful Rust programming assistant.";
    let user = "How do I implement a thread-safe singleton in Rust?";

    let total = estimate_tokens(system, "gpt-4o") + estimate_tokens(user, "gpt-4o");
    let cost = total as f64 * USD_PER_MILLION_TOKENS / 1_000_000.0;
    println!("~{} tokens | cost ~${:.6}", total, cost);
}
```

For production use with OpenAI models, count with tiktoken-rs instead of this estimate so that requests do not unexpectedly exceed the context limit.
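When only an estimate is available, one way to stay under the limit is to pad the estimate and reserve room for the reply before sending. The following is a small sketch of that check. The function name `fits_context` and all of its parameters are hypothetical, and the margin is a tuning choice rather than a measured value.

```rust
/// Returns true if the prompt is likely to fit, leaving room for the reply.
/// `margin` inflates the estimate to absorb heuristic error (0.2 means +20%).
fn fits_context(
    estimated_prompt_tokens: u32,
    context_limit: u32,
    reserved_for_reply: u32,
    margin: f32,
) -> bool {
    // Round up so the padded estimate never undercounts.
    let padded = (estimated_prompt_tokens as f32 * (1.0 + margin)).ceil() as u32;
    // saturating_add avoids overflow on very large inputs.
    padded.saturating_add(reserved_for_reply) <= context_limit
}
```

With an exact tokenizer the margin can be set to `0.0`, and the same check still guards the space reserved for the reply.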

---

Q5: What are the key metrics for an LLM gateway?

Answer:

| Metric | Purpose | Alert threshold |
|---|---|---|
| llm_ttft_ms | Time to first token (UX) | > 500 ms |
| llm_total_latency_ms | End-to-end latency | > 10 s |
| llm_tokens_used_total | Cost tracking | Rate limit warning |
| llm_error_rate | Provider reliability | > 1% |
| llm_cache_hit_rate | Cache efficiency | < 30% (investigate) |
| llm_cost_usd_per_hour | Budget control | Alert on anomaly |
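
To make the numeric thresholds in the table concrete, here is a minimal sketch that checks a batch of request records against them. The `RequestSample` struct and `alerts` function are hypothetical. A real gateway would more likely export these values through a metrics library and alert on percentiles over a time window, not on individual requests.

```rust
use std::time::Duration;

/// Hypothetical per-request record; field names are illustrative.
struct RequestSample {
    ttft: Duration,  // time to first token
    total: Duration, // end-to-end latency
    ok: bool,        // false if the provider call failed
    cache_hit: bool, // true if served from cache
}

/// Returns the names of metrics that breach the thresholds from the table.
fn alerts(samples: &[RequestSample]) -> Vec<&'static str> {
    let mut out = Vec::new();
    if samples.is_empty() {
        return out; // no data, nothing to evaluate
    }
    let n = samples.len() as f64;
    let error_rate = samples.iter().filter(|s| !s.ok).count() as f64 / n;
    let cache_hit_rate = samples.iter().filter(|s| s.cache_hit).count() as f64 / n;

    if samples.iter().any(|s| s.ttft > Duration::from_millis(500)) {
        out.push("llm_ttft_ms");
    }
    if samples.iter().any(|s| s.total > Duration::from_secs(10)) {
        out.push("llm_total_latency_ms");
    }
    if error_rate > 0.01 {
        out.push("llm_error_rate");
    }
    if cache_hit_rate < 0.30 {
        out.push("llm_cache_hit_rate");
    }
    out
}
```

The token and cost metrics are left out because the table gives no numeric threshold for them.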
