Rust By Example

LLM Rust Decision Matrix

How to choose the right LLM integration approach for Rust projects: OpenAI API vs Anthropic vs local models, streaming vs batch, managed vs self-hosted inference.

Provider comparison

| Factor | OpenAI | Anthropic | Google | Local (Ollama) | Self-hosted (candle) |
|---|---|---|---|---|---|
| Best model quality | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro | Llama 3, Mistral | Any GGUF/ONNX model |
| Cost | $$$ | $$$ | $$ | Free | Infra cost only |
| Latency | 200ms–5s | 300ms–8s | 200ms–6s | 50ms–3s | 20ms–2s (hardware-dependent) |
| Rate limits | Strict tiered | Strict tiered | Generous | No limits | No limits |
| Data privacy | Sent to OpenAI | Sent to Anthropic | Sent to Google | Local only | Local only |
| Context window | 128k tokens | 200k tokens | 1M tokens | 8k–128k | Model-dependent |
| Streaming | SSE | SSE | SSE | NDJSON | Custom |
| Rust crate | async-openai | manual reqwest | google-generativelanguage1 | ollama-rs | candle |

Decision flowchart

```text
Data privacy requirement (GDPR, HIPAA)?
└── Yes → Local/self-hosted only (Ollama or candle)

Need best quality for reasoning tasks?
└── Yes → Claude 3.5 Sonnet or GPT-4o

High volume, cost-sensitive?
└── Yes → GPT-4o-mini or Groq (fastest inference)

Long context (>100k tokens)?
└── Yes → Gemini 1.5 Pro or Claude 3.5 Sonnet

Offline capability required?
└── Yes → candle (compiled into your binary) or Ollama

Fine-tuned domain model?
└── Yes → Self-hosted via ort/candle

Prototyping quickly?
└── Yes → OpenAI (most mature Rust tooling)
```
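The flowchart can also be written down as a small function, which makes the priority between questions explicit. The sketch below is one possible encoding, not a definitive policy: the `Requirements` struct, its field names, and the order in which the questions are checked are all assumptions introduced for illustration, since the flowchart itself does not rank its branches.

```rust
/// Hypothetical requirement flags mirroring the questions in the flowchart.
#[derive(Debug, Default, Clone, Copy)]
struct Requirements {
    strict_privacy: bool,   // GDPR/HIPAA-style constraints
    offline: bool,          // must work without network access
    fine_tuned_model: bool, // a custom domain model is required
    long_context: bool,     // prompts above roughly 100k tokens
    best_reasoning: bool,   // quality matters more than cost
    high_volume: bool,      // cost-sensitive, many requests
}

/// Checks the questions in a fixed order (an assumption: hard constraints
/// first, preferences last) and returns the first matching recommendation.
fn recommend(req: Requirements) -> &'static str {
    if req.strict_privacy {
        "Local/self-hosted only (Ollama or candle)"
    } else if req.offline {
        "candle or Ollama"
    } else if req.fine_tuned_model {
        "Self-hosted via ort/candle"
    } else if req.long_context {
        "Gemini 1.5 Pro or Claude 3.5 Sonnet"
    } else if req.best_reasoning {
        "Claude 3.5 Sonnet or GPT-4o"
    } else if req.high_volume {
        "GPT-4o-mini or Groq"
    } else {
        // No special constraint: the "prototyping quickly" branch.
        "OpenAI"
    }
}

fn main() {
    let req = Requirements {
        long_context: true,
        high_volume: true,
        ..Default::default()
    };
    println!("{}", recommend(req));
}
```

Putting the hard constraints (privacy, offline, custom model) ahead of the preferences means a project that is both privacy-bound and quality-focused still lands on a local option.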

Runnable example — provider-agnostic abstraction

```rust
// Cargo.toml dependencies assumed by this example:
//   tokio = { version = "1", features = ["macros", "rt-multi-thread", "time"] }
//   serde = { version = "1", features = ["derive"] }
use serde::{Deserialize, Serialize};
use std::time::{Duration, Instant};

/// A single chat message in the role/content shape most providers accept.
#[allow(dead_code)] // `role` is only read once real HTTP calls are wired in
#[derive(Debug, Clone, Serialize, Deserialize)]
struct Message { role: String, content: String }

#[derive(Debug)]
struct CompletionResult {
    content: String,
    provider: String,
    model: String,
    input_tokens: u32,
    output_tokens: u32,
    latency_ms: u64,
}

/// Unified interface for all LLM providers
trait LlmProvider: Send + Sync {
    fn provider_name(&self) -> &str;
    fn default_model(&self) -> &str;
}

/// OpenAI provider configuration
#[allow(dead_code)] // the key is unused while calls are simulated
struct OpenAiProvider { api_key: String }
impl LlmProvider for OpenAiProvider {
    fn provider_name(&self) -> &str { "openai" }
    fn default_model(&self) -> &str { "gpt-4o-mini" }
}

/// Anthropic provider configuration
#[allow(dead_code)] // the key is unused while calls are simulated
struct AnthropicProvider { api_key: String }
impl LlmProvider for AnthropicProvider {
    fn provider_name(&self) -> &str { "anthropic" }
    fn default_model(&self) -> &str { "claude-3-haiku-20240307" }
}

/// Local Ollama provider
#[allow(dead_code)] // the URL is unused while calls are simulated
struct OllamaProvider { base_url: String }
impl LlmProvider for OllamaProvider {
    fn provider_name(&self) -> &str { "ollama" }
    fn default_model(&self) -> &str { "llama3.2" }
}

/// Simulate completion (replace with actual HTTP calls per provider)
async fn complete(
    provider: &dyn LlmProvider,
    messages: &[Message],
    max_tokens: u32,
) -> Result<CompletionResult, String> {
    let start = Instant::now();

    // Provider-specific latency simulation, in milliseconds
    let latency = match provider.provider_name() {
        "openai" => 300,
        "anthropic" => 400,
        "ollama" => 80,
        _ => 200,
    };
    tokio::time::sleep(Duration::from_millis(latency)).await;

    // Take the first 20 characters of the last message. Iterating over chars
    // avoids the panic that byte slicing causes inside a multi-byte character.
    let preview = messages
        .last()
        .map(|m| m.content.chars().take(20).collect::<String>())
        .unwrap_or_default();
    let content = format!("[{}] Response: {}", provider.provider_name(), preview);

    Ok(CompletionResult {
        content,
        provider: provider.provider_name().to_string(),
        model: provider.default_model().to_string(),
        // Rough heuristic: about 4 bytes of text per token
        input_tokens: messages.iter().map(|m| m.content.len() as u32 / 4).sum(),
        output_tokens: max_tokens / 4,
        latency_ms: start.elapsed().as_millis() as u64,
    })
}

#[tokio::main]
async fn main() {
    let providers: Vec<Box<dyn LlmProvider>> = vec![
        Box::new(OpenAiProvider { api_key: "sk-test".to_string() }),
        Box::new(AnthropicProvider { api_key: "ant-test".to_string() }),
        Box::new(OllamaProvider { base_url: "http://localhost:11434".to_string() }),
    ];

    let messages = vec![Message {
        role: "user".to_string(),
        content: "Explain Rust async in one sentence.".to_string(),
    }];

    // Run the same prompt through every provider and print one line each.
    for provider in &providers {
        match complete(provider.as_ref(), &messages, 100).await {
            Ok(result) => println!(
                "[{:10}] {} | {}ms | {}/{} tokens | {}",
                result.provider, result.model, result.latency_ms,
                result.input_tokens, result.output_tokens,
                result.content
            ),
            Err(e) => eprintln!("[{}] Error: {}", provider.provider_name(), e),
        }
    }
}
```
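
One thing a provider-agnostic interface makes easy is falling back from a rate-limited hosted API to a local model. The following is a minimal, synchronous sketch of that idea using only the standard library. The `Completer` trait, the two mock providers, and `complete_with_fallback` are hypothetical names introduced here for illustration; they are not part of the example above or of any crate, and a real version would be async and perform HTTP requests.

```rust
/// Hypothetical, synchronous stand-in for a provider call.
trait Completer {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Mock hosted provider that always reports a rate limit.
struct RateLimited;
impl Completer for RateLimited {
    fn name(&self) -> &str { "hosted" }
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Err("429 rate limited".to_string())
    }
}

/// Mock local provider that always succeeds by echoing the prompt.
struct Local;
impl Completer for Local {
    fn name(&self) -> &str { "local" }
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("echo: {prompt}"))
    }
}

/// Tries each provider in order. Returns the name and text of the first
/// success, or the collected error messages if every provider fails.
fn complete_with_fallback(
    providers: &[Box<dyn Completer>],
    prompt: &str,
) -> Result<(String, String), Vec<String>> {
    let mut errors = Vec::new();
    for p in providers {
        match p.complete(prompt) {
            Ok(text) => return Ok((p.name().to_string(), text)),
            Err(e) => errors.push(format!("{}: {}", p.name(), e)),
        }
    }
    Err(errors)
}

fn main() {
    let chain: Vec<Box<dyn Completer>> = vec![Box::new(RateLimited), Box::new(Local)];
    match complete_with_fallback(&chain, "hello") {
        Ok((who, text)) => println!("{who} answered: {text}"),
        Err(errors) => eprintln!("all providers failed: {errors:?}"),
    }
}
```

Returning every error rather than only the last one keeps the failure of the preferred provider visible even when a later provider also fails.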

Cost comparison for 1M daily requests (100 input + 100 output tokens per request)

| Provider | Model | Daily cost estimate |
|---|---|---|
| OpenAI | gpt-4o | ~$1,500 |
| OpenAI | gpt-4o-mini | ~$60 |
| Anthropic | claude-3-haiku | ~$50 |
| Groq | llama3-8b | ~$10 |
| Self-hosted | Llama 3 8B (1 GPU) | ~$5 infra |
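
Estimates like these follow from simple arithmetic: 1M requests at 100 input and 100 output tokens each is 100M input tokens and 100M output tokens per day, each multiplied by the provider's price per million tokens. The sketch below shows that calculation. The function name `daily_cost_usd` and the prices passed to it are placeholders chosen for illustration, not quotes from any vendor, so substitute current pricing before relying on the result.

```rust
/// Daily cost in USD, given prices expressed per million tokens.
fn daily_cost_usd(
    requests_per_day: u64,
    input_tokens_per_request: u64,
    output_tokens_per_request: u64,
    input_price_per_mtok: f64,
    output_price_per_mtok: f64,
) -> f64 {
    // Total daily volume, converted to millions of tokens.
    let input_mtok = (requests_per_day * input_tokens_per_request) as f64 / 1_000_000.0;
    let output_mtok = (requests_per_day * output_tokens_per_request) as f64 / 1_000_000.0;
    input_mtok * input_price_per_mtok + output_mtok * output_price_per_mtok
}

fn main() {
    // Placeholder prices in USD per million tokens (assumed, not vendor data).
    let cost = daily_cost_usd(1_000_000, 100, 100, 0.50, 1.50);
    println!("estimated daily cost: ${cost:.2}");
}
```

With these placeholder prices the input side costs $50 and the output side $150, for $200 per day. Output tokens are usually priced higher than input tokens, so the output length tends to dominate the bill.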
