synwire-embeddings-local: Local Embedding and Reranking Models

synwire-embeddings-local provides CPU-based text embedding and cross-encoder reranking, backed by fastembed-rs and ONNX Runtime. No API keys, no network calls at inference time, no data leaves the machine.

Models

| Component | Model | Parameters | Output | Purpose |
|---|---|---|---|---|
| LocalEmbeddings | BAAI/bge-small-en-v1.5 | 33M | 384-dim f32 vector | Bi-encoder: fast similarity search |
| LocalReranker | BAAI/bge-reranker-base | 110M | Relevance score | Cross-encoder: accurate re-scoring |

Both models are downloaded from Hugging Face Hub on first use and cached locally by fastembed. Subsequent constructions load from cache with no network access.

Bi-encoder vs cross-encoder

The two models serve complementary roles in a two-stage retrieval pipeline:

```mermaid
graph LR
    Q["Query"] --> E1["Embed query<br/>(bi-encoder)"]
    E1 --> S["Vector similarity<br/>top-k candidates"]
    S --> R["Rerank<br/>(cross-encoder)"]
    R --> F["Final results"]
```

Bi-encoder (LocalEmbeddings): Encodes query and documents independently into fixed-size vectors. Similarity is computed as cosine similarity between the vectors. This is fast (document embeddings are precomputed) but less accurate, because the model never sees the query and document together.
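The similarity step itself is just vector math. As a sketch (a generic helper, not part of the crate's API), cosine similarity over two embedding vectors looks like this:

```rust
/// Cosine similarity between two embedding vectors.
/// Returns a value in [-1.0, 1.0]; 1.0 means identical direction.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    let a = [1.0, 0.0, 0.0];
    let b = [0.0, 1.0, 0.0];
    assert!((cosine_similarity(&a, &a) - 1.0).abs() < 1e-6); // identical -> 1.0
    assert!(cosine_similarity(&a, &b).abs() < 1e-6);         // orthogonal -> 0.0
}
```

In practice the document vectors would be the precomputed 384-dim outputs of LocalEmbeddings.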

Cross-encoder (LocalReranker): Takes a (query, document) pair as input and produces a single relevance score. This is more accurate because the model attends to both texts jointly, but slower because it must run inference for every candidate. Hence it is used only on the top-k results from the bi-encoder.
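The two stages above can be sketched with dummy scorers standing in for the real models (the function and names here are illustrative, not part of the crate):

```rust
/// Hypothetical two-stage retrieval: a cheap precomputed score selects
/// the top-k candidates, then an expensive scorer re-orders only those k.
fn retrieve(
    candidates: &[(&str, f32)],        // (document, bi-encoder similarity)
    k: usize,
    cross_score: impl Fn(&str) -> f32, // stand-in for cross-encoder inference
) -> Vec<String> {
    // Stage 1: keep the k best by the cheap, precomputed score.
    let mut shortlist: Vec<(&str, f32)> = candidates.to_vec();
    shortlist.sort_by(|a, b| b.1.total_cmp(&a.1));
    shortlist.truncate(k);

    // Stage 2: run the expensive scorer only on the k survivors.
    let mut scored: Vec<(String, f32)> = shortlist
        .into_iter()
        .map(|(doc, _)| (doc.to_string(), cross_score(doc)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(doc, _)| doc).collect()
}

fn main() {
    let docs = [("short", 0.9), ("a much longer document", 0.8), ("mid one", 0.5)];
    // The "expensive" scorer here just favors longer texts.
    let ranked = retrieve(&docs, 2, |d| d.len() as f32);
    assert_eq!(ranked[0], "a much longer document"); // reranker reordered the top-k
}
```

The key property is that `cross_score` runs k times, not once per corpus document, which is why the heavier cross-encoder stays affordable.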

Thread safety and async integration

fastembed's inference is synchronous and CPU-bound. To avoid blocking the Tokio async runtime, both LocalEmbeddings and LocalReranker:

  1. Wrap the underlying model in Arc<T>, making it safely shareable across tasks.
  2. Run all inference on Tokio's blocking thread pool via tokio::task::spawn_blocking.

```rust
// Simplified view of the embed_query implementation:
let model = Arc::clone(&self.model);
let owned = text.to_owned();
tokio::task::spawn_blocking(move || model.embed(vec![owned], None)).await
```

This pattern keeps the async event loop responsive even during heavy inference workloads.
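The same Arc-sharing pattern can be shown with only the standard library, using a plain `std::thread` in place of `tokio::task::spawn_blocking` and a mock model in place of the ONNX-backed one (both substitutions are for illustration only):

```rust
use std::sync::Arc;
use std::thread;

// Mock standing in for the ONNX-backed fastembed model.
struct Model;
impl Model {
    fn embed(&self, texts: Vec<String>) -> Vec<Vec<f32>> {
        // Pretend inference: one "dimension" per text, its byte length.
        texts.iter().map(|t| vec![t.len() as f32]).collect()
    }
}

// Cloning the Arc shares the single loaded model; only the pointer is copied,
// so many workers can reference one model without reloading it.
fn embed_on_worker(model: &Arc<Model>, text: &str) -> Vec<f32> {
    let model = Arc::clone(model);
    let owned = text.to_owned();
    // thread::spawn stands in for tokio::task::spawn_blocking:
    // either way, the CPU-bound call runs off the calling thread.
    let handle = thread::spawn(move || model.embed(vec![owned]));
    handle.join().expect("inference thread panicked").remove(0)
}

fn main() {
    let model = Arc::new(Model);
    assert_eq!(embed_on_worker(&model, "abc"), vec![3.0]);
}
```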

Implementing the core traits

LocalEmbeddings implements synwire_core::embeddings::Embeddings:

| Method | Input | Output |
|---|---|---|
| embed_documents | &[String] | Vec<Vec<f32>> (batch) |
| embed_query | &str | Vec<f32> (single vector) |

LocalReranker implements synwire_core::rerankers::Reranker:

| Method | Input | Output |
|---|---|---|
| rerank | query, &[Document], top_n | Vec<Document> (re-ordered) |

Both return Result<T, SynwireError> — embedding failures are mapped to SynwireError::Embedding(EmbeddingError::Failed { message }).
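A sketch of how that mapping might look, using hypothetical definitions consistent with the names above (the variants and the `message` field match the text; the `Display` impl and any other details are assumptions, not the crate's exact code):

```rust
use std::fmt;

// Hypothetical shapes for the error types named above.
#[derive(Debug, PartialEq)]
enum EmbeddingError {
    Failed { message: String },
}

#[derive(Debug, PartialEq)]
enum SynwireError {
    Embedding(EmbeddingError),
}

impl fmt::Display for SynwireError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SynwireError::Embedding(EmbeddingError::Failed { message }) => {
                write!(f, "embedding failed: {message}")
            }
        }
    }
}

// Mapping a raw inference failure into the domain error:
fn map_inference_error(msg: &str) -> SynwireError {
    SynwireError::Embedding(EmbeddingError::Failed { message: msg.to_string() })
}

fn main() {
    let err = map_inference_error("no results returned");
    assert_eq!(err.to_string(), "embedding failed: no results returned");
}
```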

Error handling

| Error type | Cause |
|---|---|
| LocalEmbeddingsError::Init | Model download failure or ONNX load error |
| LocalRerankerError::Init | Same, for the reranker model |
| EmbeddingError::Failed | Inference panicked or returned no results |

Construction errors (::new()) are separate from runtime errors. Construction may fail due to network issues (first download) or corrupted cache files. Runtime errors indicate ONNX inference failures or task panics.

Performance characteristics

| Operation | Typical latency (CPU) | Notes |
|---|---|---|
| Model construction | 50–200 ms (cached) | First-ever: download ~30 MB |
| embed_query | 1–5 ms per query | Single text, 384-dim output |
| embed_documents | ~2 ms per document (batch) | Batching amortises overhead |
| rerank | 5–20 ms per candidate | Cross-encoder is heavier |

These are order-of-magnitude figures on a modern x86 CPU. Actual performance depends on text length, CPU architecture, and available cores.

See also