Rust By Example

LLM Rust Interview Q&A

Top interview questions and answers about building LLM applications in Rust. Covers streaming APIs, prompt design, context management, rate limiting, and system architecture for AI engineers.


Q1: How do you implement a streaming LLM response proxy in Rust?

Answer: Use Axum's SSE support combined with a reqwest async stream. The key is to forward bytes as they arrive without buffering the full response:

```rust
use std::io::Write;
use std::time::Duration;
use tokio::sync::mpsc;

/// Proxy pattern: forward tokens from the LLM to the client as they arrive.
async fn stream_proxy(_prompt: String, client_tx: mpsc::Sender<String>) {
    // In production (needs reqwest's "stream" feature and
    // `use futures_util::StreamExt;` for `.next()`):
    //
    // let response = reqwest::Client::new()
    //     .post(llm_url)
    //     .json(&request_with_stream_true)
    //     .send().await.unwrap();
    //
    // let mut stream = response.bytes_stream();
    // while let Some(chunk) = stream.next().await {
    //     // A chunk can end mid-line, so real code should buffer bytes
    //     // until a complete SSE line is available before parsing.
    //     let line = String::from_utf8(chunk.unwrap().to_vec()).unwrap();
    //     if let Some(token) = parse_sse_token(&line) {
    //         // Stop reading upstream once the client has disconnected.
    //         if client_tx.send(token).await.is_err() { break; }
    //     }
    // }

    // Simulated streaming
    let tokens = ["Rust ", "is ", "great ", "for ", "AI!"];
    for token in tokens {
        tokio::time::sleep(Duration::from_millis(20)).await;
        if client_tx.send(token.to_string()).await.is_err() {
            break; // receiver dropped, nothing left to forward to
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(32);
    tokio::spawn(stream_proxy("Why Rust for AI?".to_string(), tx));

    while let Some(token) = rx.recv().await {
        print!("{}", token);
        // Flush so each token appears immediately instead of at the newline.
        std::io::stdout().flush().ok();
    }
    println!();
}
```

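In an Axum handler, convert the receiving end of this channel into a stream (for example with tokio_stream::wrappers::ReceiverStream), map each token to an SSE Event, and return Sse::new(stream).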
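Network chunks do not necessarily line up with SSE lines, so a proxy usually has to buffer bytes until a full line has arrived. The following is a minimal sketch of that step using only the standard library. `SseLineBuffer` is a hypothetical helper, not part of reqwest or Axum, and the `[DONE]` end-of-stream sentinel is an assumption that holds for OpenAI-style APIs but may differ for other providers.

```rust
/// Accumulates raw byte chunks and yields complete SSE `data:` payloads.
/// Hypothetical helper for illustration only.
struct SseLineBuffer {
    pending: Vec<u8>,
}

impl SseLineBuffer {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Feed one network chunk. Returns the payload of every complete
    /// `data:` line now available; incomplete trailing bytes stay buffered.
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);
        let mut payloads = Vec::new();
        while let Some(pos) = self.pending.iter().position(|&b| b == b'\n') {
            // Remove the line (including its '\n') from the buffer.
            let raw: Vec<u8> = self.pending.drain(..=pos).collect();
            let text = String::from_utf8_lossy(&raw);
            let line = text.trim_end_matches(|c| c == '\r' || c == '\n');
            if let Some(data) = line.strip_prefix("data:") {
                // The SSE format allows one optional space after the colon.
                let data = data.strip_prefix(' ').unwrap_or(data);
                // Assumption: the provider signals the end with "[DONE]".
                if data != "[DONE]" {
                    payloads.push(data.to_string());
                }
            }
        }
        payloads
    }
}

fn main() {
    let mut buf = SseLineBuffer::new();
    // One event split across two network chunks.
    let first = buf.push(b"data: Hel");
    let second = buf.push(b"lo\n\ndata: [DONE]\n\n");
    println!("{:?} then {:?}", first, second);
}
```

Splitting on the newline byte is safe for UTF-8 input because a multi-byte character never contains the byte 0x0A, so every complete line is also a complete sequence of characters. The payload itself is typically JSON and still needs provider-specific parsing to extract the token.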

---

Q2: How do you handle context window limits in a chat application?

Answer: Implement a sliding window that keeps the system prompt and most recent messages, evicting oldest messages when the budget is exceeded:

```rust
#[derive(Clone)]
struct Message {
    role: String,
    content: String,
}

struct ContextWindow {
    system: String,
    messages: Vec<Message>,
    max_tokens: usize,
}

impl ContextWindow {
    /// Rough estimate: ~4 bytes per token plus 4 tokens of per-message overhead.
    fn token_count(s: &str) -> usize {
        s.len() / 4 + 4
    }

    fn total_tokens(&self) -> usize {
        Self::token_count(&self.system)
            + self.messages.iter().map(|m| Self::token_count(&m.content)).sum::<usize>()
    }

    fn add(&mut self, role: &str, content: &str) {
        self.messages.push(Message { role: role.to_string(), content: content.to_string() });
        // Evict the oldest user+assistant pair until we fit. This assumes
        // turns alternate user/assistant; the newest message is never evicted.
        while self.total_tokens() > self.max_tokens && self.messages.len() > 2 {
            self.messages.drain(..2);
        }
    }

    /// The system prompt always goes first, followed by the surviving history.
    fn to_send(&self) -> Vec<(&str, &str)> {
        std::iter::once(("system", self.system.as_str()))
            .chain(self.messages.iter().map(|m| (m.role.as_str(), m.content.as_str())))
            .collect()
    }
}

fn main() {
    let mut ctx = ContextWindow {
        system: "You are a Rust expert.".to_string(),
        messages: Vec::new(),
        max_tokens: 1000,
    };
    ctx.add("user", "What is ownership?");
    ctx.add("assistant", "Ownership ensures each value has exactly one owner...");
    ctx.add("user", "What about borrowing?");

    println!("Context tokens: ~{}", ctx.total_tokens());
    println!("Messages: {}", ctx.to_send().len());
}
```
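With a 1000-token budget the example above never actually evicts anything. The sketch below isolates the eviction step with a deliberately small budget so the sliding window is visible. It uses a `VecDeque` so that removing the oldest entry is O(1) rather than the O(n) shift of removing from the front of a `Vec`. The function name `trim_to_budget` and the budget of 16 are illustrative assumptions, and the token heuristic is the same rough one used in the answer.

```rust
use std::collections::VecDeque;

/// Rough heuristic: ~4 bytes per token plus a fixed per-message overhead.
fn token_count(s: &str) -> usize {
    s.len() / 4 + 4
}

/// Drops the oldest messages until the total fits `max_tokens`,
/// always keeping at least the newest message.
fn trim_to_budget(messages: &mut VecDeque<String>, max_tokens: usize) {
    let mut total: usize = messages.iter().map(|m| token_count(m)).sum();
    while total > max_tokens && messages.len() > 1 {
        if let Some(oldest) = messages.pop_front() {
            total -= token_count(&oldest);
        }
    }
}

fn main() {
    let mut history: VecDeque<String> = VecDeque::new();
    for text in ["What is ownership?", "Each value has one owner.", "And borrowing?"] {
        history.push_back(text.to_string());
        trim_to_budget(&mut history, 16); // tiny budget to force eviction
        println!("{} message(s) kept", history.len());
    }
}
```

Keeping a running total avoids recounting the whole history on every loop iteration, which matters once conversations grow long.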

---

Q3: How do you implement multi-provider fallback?

Answer: Define a provider list with a priority order. On failure, try the next provider; if every provider fails, return the last error:

```rust
#[derive(Clone)]
struct Provider {
    name: String,
    priority: u8, // lower value = tried first
}

async fn call_with_fallback(providers: &[Provider], prompt: &str) -> Result<String, String> {
    // Sort references by priority; the caller's slice is left untouched.
    let mut sorted: Vec<&Provider> = providers.iter().collect();
    sorted.sort_by_key(|p| p.priority);

    let mut last_error = "no providers".to_string();
    for provider in sorted {
        match simulate_call(&provider.name, prompt).await {
            Ok(response) => return Ok(response),
            Err(e) => {
                eprintln!("Provider {} failed: {}", provider.name, e);
                last_error = e;
            }
        }
    }
    Err(last_error)
}

/// Stand-in for a real API call: "openai" always fails, everything else succeeds.
async fn simulate_call(provider: &str, prompt: &str) -> Result<String, String> {
    if provider == "openai" {
        return Err("rate limited".to_string());
    }
    Ok(format!("{} response to: {}", provider, prompt))
}

#[tokio::main]
async fn main() {
    let providers = vec![
        Provider { name: "openai".to_string(), priority: 1 },
        Provider { name: "anthropic".to_string(), priority: 2 },
        Provider { name: "groq".to_string(), priority: 3 },
    ];
    let result = call_with_fallback(&providers, "Explain Rust lifetimes").await;
    println!("{:?}", result);
}
```
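Not every failure is worth a fallback: a rate limit or timeout may succeed elsewhere, while a malformed request would likely fail on every provider. One possible refinement, sketched here synchronously with only the standard library, is to classify errors and stop early on permanent ones. The `CallError` enum and the closure-based signature are assumptions made for this illustration, not part of any provider SDK.

```rust
#[derive(Debug, PartialEq)]
enum CallError {
    /// Transient (rate limit, timeout, server error): try the next provider.
    Retryable(String),
    /// Permanent (for example a malformed request): stop immediately.
    Fatal(String),
}

/// Tries providers in the given order. `call` is a placeholder for the
/// real per-provider request.
fn call_with_fallback<F>(providers: &[&str], mut call: F) -> Result<String, CallError>
where
    F: FnMut(&str) -> Result<String, CallError>,
{
    let mut last = CallError::Fatal("no providers configured".to_string());
    for &name in providers {
        match call(name) {
            Ok(text) => return Ok(text),
            Err(CallError::Fatal(msg)) => return Err(CallError::Fatal(msg)),
            Err(retryable) => last = retryable,
        }
    }
    Err(last)
}

fn main() {
    let result = call_with_fallback(&["openai", "anthropic"], |name| {
        if name == "openai" {
            Err(CallError::Retryable("rate limited".to_string()))
        } else {
            Ok(format!("{} answered", name))
        }
    });
    println!("{:?}", result);
}
```

In an interview, it is worth mentioning that which errors count as retryable is a policy decision and depends on the provider's documented status codes.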

---

Q4: How do you count tokens accurately without calling the API?

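Answer: Use the tiktoken-rs crate, a Rust port of OpenAI's tiktoken, for BPE tokenization that matches OpenAI models such as GPT-4 when the correct encoding is selected. Other providers use their own tokenizers, so counts for their models are approximate. For a quick estimate: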

```rust
/// Quick token estimator: roughly ±20% accuracy for English text.
fn estimate_tokens(text: &str, model: &str) -> u32 {
    // Different models have different tokenization densities (heuristic values).
    let chars_per_token = match model {
        m if m.starts_with("gpt") => 3.8,
        m if m.starts_with("claude") => 4.0,
        _ => 4.0,
    };

    let word_tokens = text.split_whitespace().count() as f32 * 1.3;
    // Count characters, not bytes, so non-ASCII text is not overestimated.
    let char_tokens = text.chars().count() as f32 / chars_per_token;
    // Take the average of word-based and char-based estimates
    ((word_tokens + char_tokens) / 2.0) as u32 + 3 // +3 for message overhead
}

fn main() {
    // Illustrative input price in USD per million tokens; check current pricing.
    const INPUT_USD_PER_MILLION: f64 = 5.0;

    let system = "You are a helpful Rust programming assistant.";
    let user = "How do I implement a thread-safe singleton in Rust?";

    let total = estimate_tokens(system, "gpt-4o") + estimate_tokens(user, "gpt-4o");
    println!(
        "~{} tokens | cost ~${:.6}",
        total,
        total as f64 * INPUT_USD_PER_MILLION / 1_000_000.0
    );
}
```

For production, use tiktoken-rs to avoid exceeding context limits.

---

Q5: What are the key metrics for an LLM gateway?

Answer:

| Metric | Purpose | Alert threshold |
|---|---|---|
| llm_ttft_ms | Time to first token (perceived responsiveness) | > 500 ms |
| llm_total_latency_ms | End-to-end latency | > 10 s |
| llm_tokens_used_total | Cost tracking | Rate limit warning |
| llm_error_rate | Provider reliability | > 1% |
| llm_cache_hit_rate | Cache efficiency | < 30% (investigate) |
| llm_cost_usd_per_hour | Budget control | Alert on anomaly |
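As a rough illustration of how the first two latency metrics and the error rate could be measured per request, here is a standard-library sketch. `RequestTimer` and `error_rate` are hypothetical names; a real gateway would export these values through its metrics library rather than print them.

```rust
use std::time::{Duration, Instant};

/// Per-request timing for llm_ttft_ms and llm_total_latency_ms.
struct RequestTimer {
    started: Instant,
    first_token: Option<Instant>,
}

impl RequestTimer {
    fn start() -> Self {
        Self { started: Instant::now(), first_token: None }
    }

    /// Call for every streamed token; only the first call is recorded.
    fn on_token(&mut self) {
        if self.first_token.is_none() {
            self.first_token = Some(Instant::now());
        }
    }

    /// Time to first token, or None if no token has arrived yet.
    fn ttft(&self) -> Option<Duration> {
        self.first_token.map(|t| t.duration_since(self.started))
    }

    /// Time elapsed since the request started.
    fn total(&self) -> Duration {
        self.started.elapsed()
    }
}

/// Error rate as a fraction in [0, 1]; zero when there were no requests.
fn error_rate(errors: u64, total: u64) -> f64 {
    if total == 0 { 0.0 } else { errors as f64 / total as f64 }
}

fn main() {
    let mut timer = RequestTimer::start();
    std::thread::sleep(Duration::from_millis(5)); // simulated upstream delay
    timer.on_token();
    println!("ttft = {:?}, total = {:?}", timer.ttft(), timer.total());
    println!("error rate = {:.2}%", error_rate(3, 200) * 100.0);
}
```

Recording the first-token timestamp separately from the total is what makes the TTFT alert meaningful for streaming responses, where total latency depends mostly on output length.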
