Local Inference with Ollama

Ollama lets you run large language models on your own machine — no API key, no data leaving the network boundary. synwire-llm-ollama implements the same BaseChatModel and Embeddings traits as the OpenAI provider, so switching is a one-line change.

When to use Ollama:

  • Privacy-sensitive workloads (data must not leave the machine)
  • Air-gapped environments
  • Development and testing without API costs
  • Experimenting with open-weight models (Llama 3, Mistral, Gemma, Phi)

📖 Rust note: A trait is Rust's equivalent of an interface. BaseChatModel is a trait — because both ChatOllama and ChatOpenAI implement it, you can store either behind a Box<dyn BaseChatModel> and swap them without changing any other code.
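
As a generic illustration of that idea (the trait and type names below are made up for the sketch, not synwire APIs), two types behind one trait object share a single call site:

```rust
// Trait-object sketch: two types behind one Box<dyn Trait>.
// `Chat`, `Local`, and `Cloud` are illustrative names only.
trait Chat {
    fn invoke(&self, prompt: &str) -> String;
}

struct Local;
struct Cloud;

impl Chat for Local {
    fn invoke(&self, prompt: &str) -> String {
        format!("[local] {prompt}")
    }
}

impl Chat for Cloud {
    fn invoke(&self, prompt: &str) -> String {
        format!("[cloud] {prompt}")
    }
}

fn main() {
    // Swapping providers is a one-line change: pick a different Box.
    let model: Box<dyn Chat> = Box::new(Local);
    let reply = model.invoke("hello");
    assert_eq!(reply, "[local] hello");
    println!("{reply}");
}
```

Because the call site only sees `Box<dyn Chat>`, replacing `Local` with `Cloud` requires no other changes.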

Prerequisites

  1. Install Ollama from https://ollama.com
  2. Pull a model:
ollama pull llama3.2
  3. Confirm it is running:
ollama run llama3.2 "hello"

Ollama listens on http://localhost:11434 by default.

Add the dependency

[dependencies]
synwire-llm-ollama = "0.1"
tokio = { version = "1", features = ["full"] }

Basic invoke

use synwire_llm_ollama::ChatOllama;
use synwire_core::language_models::chat::BaseChatModel;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let model = ChatOllama::builder()
        .model("llama3.2")
        .build()?;

    let result = model.invoke("What is the Rust borrow checker?").await?;
    println!("{}", result.content);
    Ok(())
}

Streaming

📖 Rust note: async fn and .await let this code run concurrently without blocking a thread. StreamExt::next().await yields each chunk as the model generates it — you see output appear progressively rather than waiting for the full response.

use synwire_llm_ollama::ChatOllama;
use synwire_core::language_models::chat::BaseChatModel;
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let model = ChatOllama::builder()
        .model("llama3.2")
        .build()?;

    let mut stream = model.stream("Explain ownership in Rust step by step.").await?;
    while let Some(chunk) = stream.next().await {
        print!("{}", chunk?.content);
    }
    println!();
    Ok(())
}

Local RAG with OllamaEmbeddings

Use a local embedding model so that retrieval-augmented generation never sends data to an external API:

ollama pull nomic-embed-text

use synwire_llm_ollama::{ChatOllama, OllamaEmbeddings};
use synwire_core::embeddings::Embeddings;
use synwire_core::language_models::chat::BaseChatModel;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let embeddings = OllamaEmbeddings::builder()
        .model("nomic-embed-text")
        .build()?;

    // Embed your documents
    let docs = vec![
        "Rust ownership means each value has exactly one owner.".to_string(),
        "The borrow checker enforces ownership rules at compile time.".to_string(),
    ];
    let vectors = embeddings.embed_documents(docs).await?;
    println!("Embedded {} documents, dimension {}", vectors.len(), vectors[0].len());

    // Embed a query
    let query_vec = embeddings.embed_query("what is ownership?").await?;
    println!("Query vector dimension: {}", query_vec.len());

    // Use vectors with your vector store, then answer with the chat model:
    let model = ChatOllama::builder().model("llama3.2").build()?;
    let answer = model.invoke("Given context about Rust ownership, explain it simply.").await?;
    println!("{}", answer.content);

    Ok(())
}

See Getting Started: RAG for a complete retrieval-augmented generation example.
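
For small corpora, the "use vectors with your vector store" step can be approximated with a plain cosine-similarity scan (a standalone sketch; the function names are illustrative, and a real vector store replaces this in production):

```rust
// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Return the index of the document vector most similar to the query.
fn best_match(query: &[f32], docs: &[Vec<f32>]) -> usize {
    docs.iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| cosine(query, a).total_cmp(&cosine(query, b)))
        .map(|(i, _)| i)
        .expect("docs must be non-empty")
}

fn main() {
    // Toy 2-d vectors stand in for real embedding output.
    let docs = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let query = vec![0.9, 0.1];
    assert_eq!(best_match(&query, &docs), 0);
    println!("best match: doc {}", best_match(&query, &docs));
}
```

The winning document's text would then be spliced into the prompt passed to `model.invoke(...)`.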

Swapping from OpenAI to Ollama

Store the model as Box<dyn BaseChatModel> — swap by changing the constructor:

use synwire_core::language_models::chat::BaseChatModel;

fn build_model() -> anyhow::Result<Box<dyn BaseChatModel>> {
    if std::env::var("USE_LOCAL").is_ok() {
        // Local: no API key required
        Ok(Box::new(
            synwire_llm_ollama::ChatOllama::builder().model("llama3.2").build()?
        ))
    } else {
        // Cloud: reads OPENAI_API_KEY from environment
        Ok(Box::new(
            synwire_llm_openai::ChatOpenAI::builder()
                .model("gpt-4o")
                .api_key_env("OPENAI_API_KEY")
                .build()?
        ))
    }
}

All downstream code that calls model.invoke(...) or model.stream(...) is unchanged.
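
A downstream call site might look like this (a sketch; it assumes the `build_model` helper above and the `invoke` signature shown earlier):

```rust
let model = build_model()?;                  // Box<dyn BaseChatModel>
let reply = model.invoke("ping").await?;     // identical for either provider
println!("{}", reply.content);
```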

Builder options

Method               Default                  Description
.model(name)         (required)               Any model pulled via ollama pull
.base_url(url)       http://localhost:11434   Ollama server address
.temperature(f32)    model default            Sampling temperature
.top_k(u32)          model default            Top-k sampling
.top_p(f32)          model default            Top-p (nucleus) sampling
.num_predict(i32)    model default            Max tokens to generate (-1 for unlimited)
.timeout(Duration)   5 minutes                Request timeout
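
Combining several of these might look like the following (a sketch using the builder methods in the table; the values are illustrative, not recommendations):

```rust
use std::time::Duration;
use synwire_llm_ollama::ChatOllama;

let model = ChatOllama::builder()
    .model("llama3.2")
    .base_url("http://localhost:11434")
    .temperature(0.2)                      // lower = more deterministic
    .num_predict(512)                      // cap generated tokens
    .timeout(Duration::from_secs(60))      // fail fast on a stalled server
    .build()?;
```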

See also