Reranking in RAG — the precise second pass

In a RAG pipeline, the first thing you do is retrieve candidate documents — with embeddings or keyword search. That’s fast, but not always precise: you tend to get back chunks that are topically in the neighborhood, where the genuinely best-matching one isn’t necessarily at the top.

Reranking is the second stage that fixes the ordering. It takes each query–document pair and looks at it more carefully — re-scoring the candidates so the truly relevant ones rise. It catches subtle relevance signals a plain embedding search misses.

Ranker = fast first pass. Reranker = precise second pass. Use both and your retrieval gets meaningfully more accurate.

Why it actually works: bi-encoder vs cross-encoder

This is the part worth understanding, because it explains both the precision and the cost.

Embedding models are bi-encoders: the query and each document are encoded separately into vectors, and you compare them with cosine similarity. The win is speed and scale — you precompute every document vector once and search millions cheaply. The catch is that the model never sees the query and the document together, so it can’t reason about fine-grained interactions between them. Topically close, sometimes literally wrong.

A reranker is a cross-encoder: it feeds the query and one candidate through the model jointly, so self-attention can weigh every word of the query against every word of the document. That’s a much truer judgment of relevance and intent. The price: you can’t precompute anything — every query–document pair is a fresh forward pass — so it’s far too expensive to run over the whole corpus.

The resolution is the funnel. Cheap bi-encoder retrieval narrows millions of documents to a small candidate set; the expensive cross-encoder only ever re-scores that handful. Fast where it needs to be fast, precise where it needs to be precise.

The usual shape of the pipeline

query
  ↓  dense (embeddings) + optional sparse (BM25), merged with RRF
top 50–200 candidates
  ↓  cross-encoder reranker
top 5–10
  ↓  LLM

Retrieve broad, rerank narrow, hand the LLM only the few chunks that survived both passes.

What’s out there

Python reference library: rerankers (Answer.AI) — a single interface over many reranker types, so you can swap models without rewriting your pipeline. Worth designing around, because this space moves fast.
Hosted APIs: Voyage AI (Anthropic’s reference provider for embeddings — it also ships dedicated reranker models), plus Cohere Rerank, Pinecone, and newer entrants like ZeroEntropy’s Zerank. One gap to note: OpenAI’s standard API has no dedicated reranker endpoint.
Self-hosted: cross-encoder models like BGE-reranker, ms-marco MiniLM, Jina, or mxbai, served via llama.cpp or a text-embeddings server. Some reranker models even show up in ollama’s search — but heads-up, ollama has no rerank endpoint, so you’d serve them yourself rather than call a /rerank route.

A word of caution on picking one: the reranker leaderboards genuinely shuffle with every model release, so wire yours in behind a one-line config swap and measure on your data rather than trusting vendor benchmarks.

Worth reading

rerankers (Answer.AI) — the library
Voyage AI reranker docs
🇫🇷 Reranking : tout savoir sur cette technique du RAG — Blent
What is a reranker and do I need one? — ZeroEntropy

If your RAG answers are vague even though retrieval “looks right,” the fix is usually not a fancier embedding model — it’s a reranker. It’s the cheapest precision upgrade in the whole pipeline: one extra pass over a handful of candidates, and the right chunk finally lands on top. 🎯