LLM Rust Decision Matrix
How to choose the right LLM integration approach for Rust projects: OpenAI API vs Anthropic vs local models, streaming vs batch, managed vs self-hosted inference.
Provider comparison
| Factor | OpenAI | Anthropic | Google | Local (Ollama) | Self-hosted (candle) |
|---|---|---|---|---|---|
| Flagship model | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro | Llama 3, Mistral | Any GGUF/ONNX model |
| Cost | $$$ | $$$ | $$ | Free (own hardware) | Infra cost only |
| Latency | 200ms–5s | 300ms–8s | 200ms–6s | 50ms–3s | 20ms–2s (depends on hw) |
| Rate limits | Strict tiered | Strict tiered | Generous | No limits | No limits |
| Data privacy | Sent to OpenAI | Sent to Anthropic | Sent to Google | Local only | Local only |
| Context window | 128k tokens | 200k tokens | 1M tokens | 8k–128k | Model-dependent |
| Streaming | SSE | SSE | SSE | NDJSON (SSE via OpenAI-compatible endpoint) | Custom |
| Rust crate | async-openai | manual reqwest | google-generativelanguage1 | ollama-rs | candle |
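The privacy and context-window rows of the table work as hard constraints: a provider that fails either one is out, whatever its cost or latency. The following is a minimal sketch of that filtering step, using only the standard library. `ProviderProfile`, `Requirements`, `table_profiles`, and `eligible` are hypothetical names introduced for illustration, not part of any crate. The context limits are copied from the table, with Ollama taken at the top of its 8k–128k range and candle left as model-dependent.

```rust
/// One column of the comparison table, reduced to its hard constraints.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct ProviderProfile {
    name: &'static str,
    /// True when prompts never leave your own machines.
    data_stays_local: bool,
    /// Upper bound taken from the table; `None` means "model-dependent".
    max_context_tokens: Option<u32>,
}

/// Hard requirements a project might impose (hypothetical type).
struct Requirements {
    must_stay_local: bool,
    min_context_tokens: u32,
}

/// The table's privacy and context-window rows as data.
fn table_profiles() -> Vec<ProviderProfile> {
    vec![
        ProviderProfile { name: "OpenAI", data_stays_local: false, max_context_tokens: Some(128_000) },
        ProviderProfile { name: "Anthropic", data_stays_local: false, max_context_tokens: Some(200_000) },
        ProviderProfile { name: "Google", data_stays_local: false, max_context_tokens: Some(1_000_000) },
        // Assumption: Ollama is taken at the top of the table's 8k-128k range.
        ProviderProfile { name: "Ollama", data_stays_local: true, max_context_tokens: Some(128_000) },
        ProviderProfile { name: "candle", data_stays_local: true, max_context_tokens: None },
    ]
}

/// Keep the providers that satisfy every hard requirement.
/// A model-dependent context window (`None`) is kept, because the table
/// alone cannot rule it out.
fn eligible(profiles: &[ProviderProfile], req: &Requirements) -> Vec<&'static str> {
    profiles
        .iter()
        .filter(|p| !req.must_stay_local || p.data_stays_local)
        .filter(|p| p.max_context_tokens.map_or(true, |max| max >= req.min_context_tokens))
        .map(|p| p.name)
        .collect()
}

fn main() {
    let profiles = table_profiles();
    let req = Requirements { must_stay_local: true, min_context_tokens: 32_000 };
    println!("{:?}", eligible(&profiles, &req));
}
```

Softer factors such as cost and latency can then be weighed among the providers that survive this filter.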
Decision flowchart
Data privacy requirement (GDPR, HIPAA)?
└── Yes → Local/self-hosted only (Ollama or candle)
Need best quality for reasoning tasks?
└── Claude 3.5 Sonnet or GPT-4o
High volume, cost-sensitive?
└── GPT-4o-mini or Groq (fastest inference)
Long context (>100k tokens)?
└── Gemini 1.5 Pro or Claude 3.5 Sonnet
Offline capability required?
└── candle (compile into binary) or Ollama
Fine-tuned domain model?
└── Self-hosted via ort/candle
Prototyping quickly?
└── OpenAI (most mature Rust tooling)
Runnable example: provider-agnostic abstraction
use serde::{Deserialize, Serialize};
use std::time::Duration;

/// A single chat message in a provider-neutral format.
#[derive(Debug, Clone, Serialize, Deserialize)]
struct Message {
    role: String,
    content: String,
}

/// Normalized result returned for every provider.
#[derive(Debug)]
struct CompletionResult {
    content: String,
    provider: String,
    model: String,
    input_tokens: u32,
    output_tokens: u32,
    latency_ms: u64,
}

/// Unified interface for all LLM providers
trait LlmProvider: Send + Sync {
    fn provider_name(&self) -> &str;
    fn default_model(&self) -> &str;
}

// The credential and URL fields below are unused in this simulation;
// a real implementation would use them when building HTTP requests.

/// OpenAI provider configuration
struct OpenAiProvider { api_key: String }

impl LlmProvider for OpenAiProvider {
    fn provider_name(&self) -> &str { "openai" }
    fn default_model(&self) -> &str { "gpt-4o-mini" }
}

/// Anthropic provider configuration
struct AnthropicProvider { api_key: String }

impl LlmProvider for AnthropicProvider {
    fn provider_name(&self) -> &str { "anthropic" }
    fn default_model(&self) -> &str { "claude-3-haiku-20240307" }
}

/// Local Ollama provider
struct OllamaProvider { base_url: String }

impl LlmProvider for OllamaProvider {
    fn provider_name(&self) -> &str { "ollama" }
    fn default_model(&self) -> &str { "llama3.2" }
}

/// Simulate completion (replace with actual HTTP calls per provider)
async fn complete(
    provider: &dyn LlmProvider,
    messages: &[Message],
    max_tokens: u32,
) -> Result<CompletionResult, String> {
    let start = std::time::Instant::now();

    // Provider-specific latency simulation, in milliseconds
    let latency = match provider.provider_name() {
        "openai" => 300,
        "anthropic" => 400,
        "ollama" => 80,
        _ => 200,
    };
    tokio::time::sleep(Duration::from_millis(latency)).await;

    // Echo at most the first 20 characters of the last message. Iterating over
    // `chars()` instead of slicing bytes avoids a panic on multi-byte UTF-8 input.
    let preview = messages
        .last()
        .map(|m| m.content.chars().take(20).collect::<String>())
        .unwrap_or_default();
    let content = format!("[{}] Response: {}", provider.provider_name(), preview);

    Ok(CompletionResult {
        content,
        provider: provider.provider_name().to_string(),
        model: provider.default_model().to_string(),
        // Rough estimate: about 4 bytes of text per token
        input_tokens: messages.iter().map(|m| m.content.len() as u32 / 4).sum(),
        output_tokens: max_tokens / 4,
        latency_ms: start.elapsed().as_millis() as u64,
    })
}

#[tokio::main]
async fn main() {
    let providers: Vec<Box<dyn LlmProvider>> = vec![
        Box::new(OpenAiProvider { api_key: "sk-test".to_string() }),
        Box::new(AnthropicProvider { api_key: "ant-test".to_string() }),
        Box::new(OllamaProvider { base_url: "http://localhost:11434".to_string() }),
    ];

    let messages = vec![Message {
        role: "user".to_string(),
        content: "Explain Rust async in one sentence.".to_string(),
    }];

    // Send the same prompt to every provider and print one summary line each
    for provider in &providers {
        match complete(provider.as_ref(), &messages, 100).await {
            Ok(result) => println!(
                "[{:10}] {}ms | {}/{} tokens | {}",
                result.provider, result.latency_ms,
                result.input_tokens, result.output_tokens,
                result.content
            ),
            Err(e) => println!("[{}] Error: {}", provider.provider_name(), e),
        }
    }
}
Cost comparison for 1M daily requests (100 input + 100 output tokens)
| Provider | Model | Daily cost estimate |
|---|---|---|
| OpenAI | gpt-4o | ~$1,500 |
| OpenAI | gpt-4o-mini | ~$60 |
| Anthropic | claude-3-haiku | ~$150 |
| Groq | llama3-8b | ~$10 |
| Self-hosted | Llama3 8B (1 GPU) | ~$5 infra |
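Estimates like those in the table come from multiplying daily token volume by per-token prices, so they are worth recomputing whenever prices or traffic change. Below is a minimal sketch of that arithmetic for API providers billed per token. `Pricing`, `Workload`, and `daily_cost_usd` are hypothetical names, and the prices in `main` are placeholders, not quotes from any provider's price list.

```rust
/// Prices in USD per one million tokens (hypothetical type).
struct Pricing {
    input_per_mtok: f64,
    output_per_mtok: f64,
}

/// Daily traffic shape, matching the assumptions in the table heading.
struct Workload {
    requests_per_day: u64,
    input_tokens_per_request: u64,
    output_tokens_per_request: u64,
}

/// Daily cost = (input tokens * input price + output tokens * output price) / 1M.
fn daily_cost_usd(w: &Workload, p: &Pricing) -> f64 {
    let input_tokens = (w.requests_per_day * w.input_tokens_per_request) as f64;
    let output_tokens = (w.requests_per_day * w.output_tokens_per_request) as f64;
    (input_tokens * p.input_per_mtok + output_tokens * p.output_per_mtok) / 1_000_000.0
}

fn main() {
    // 1M requests per day, 100 input + 100 output tokens each.
    let workload = Workload {
        requests_per_day: 1_000_000,
        input_tokens_per_request: 100,
        output_tokens_per_request: 100,
    };
    // Placeholder prices: substitute the current figures from the provider's pricing page.
    let placeholder = Pricing { input_per_mtok: 1.00, output_per_mtok: 2.00 };
    println!("daily cost: ${:.2}", daily_cost_usd(&workload, &placeholder));
}
```

With the placeholder prices this workload comes to $300.00 per day (100M input tokens at $1.00 per million plus 100M output tokens at $2.00 per million). Self-hosted costs do not follow this formula, since they depend on hardware and utilization rather than token counts.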