
How Do BM25 and RAG Get Information Differently?

When you type a query into a search engine, something has to decide which documents are relevant – and how to rank them. BM25 (Best Matching 25), the algorithm that powers search engines like Elasticsearch and Lucene, has been the leading answer to that question for decades.

It scores documents by looking at three things: how often your query words appear in the document, how rare those words are across the entire collection, and whether the document is unusually long. The smart part is that BM25 doesn’t reward keyword concentration – a word that appears 20 times doesn’t make a document 20 times more relevant, because term frequency saturates. But BM25 has an important blind spot: it only matches the words you typed, not what you meant. Search for “finding similar content without overlapping words” and BM25 may return nothing useful.

This is exactly the gap that Retrieval-Augmented Generation (RAG) with vector embeddings is designed to fill – by matching meaning, not just keywords. In this article, we’ll look at how each method works, where each wins, and why production systems use both together.

How BM25 works

At its core, BM25 assigns a relevance score to every document in a collection for a given query, and ranks the documents by that score. For each term in your query, BM25 asks three questions: How many times does this word appear in the document? How rare is this word across all documents? And is this document unusually long? The final score is the sum of the weighted answers to those questions, one per query term.

Term frequency saturation is where BM25 excels. Rather than counting raw occurrences, it applies saturation – the score rises quickly at first but flattens as frequency increases. A term that appears 5 times contributes more than a term that appears once, but a word that appears 50 times contributes only slightly more than one that appears 20 times. This is controlled by the parameter k₁ (usually set between 1.2 and 2.0). Set it lower and saturation kicks in quickly; set it higher and raw frequency matters more. This single design choice is what makes BM25 resistant to keyword stuffing – repeating a word a hundred times in a document barely moves its score.
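To make the saturation concrete, here is a minimal sketch of the standard Okapi term-frequency component, tf · (k₁ + 1) / (tf + k₁), with length normalization omitted for clarity – this is an illustration of the curve, not the full BM25 score:

```python
# Sketch of BM25's term-frequency saturation (k1 = 1.5 assumed here;
# length normalization omitted). The contribution can never exceed k1 + 1.
def tf_component(tf: float, k1: float = 1.5) -> float:
    return tf * (k1 + 1) / (tf + k1)

for tf in [1, 5, 20, 50]:
    print(f"tf={tf:>2}  contribution={tf_component(tf):.3f}")
```

Going from 1 to 5 occurrences roughly doubles the contribution, while going from 20 to 50 barely moves it – the curve approaches k₁ + 1 = 2.5 asymptotically, which is exactly why keyword stuffing pays no dividend.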

Length normalization and IDF

The second tuning parameter, b (typically 0.75), controls how strongly document length is penalized. A longer document naturally contains more words, so it’s more likely to include terms from your query — not because it’s more relevant, but because it’s longer. BM25 compares each document’s length to the average document length in the collection and scales the term frequency score accordingly. Setting b = 0 disables this penalty completely; b = 1 applies full normalization.

Finally, IDF (Inverse Document Frequency) ensures that rare words carry more weight than common ones. If the word “retrieval” appears in only 3 out of 10,000 documents, matching it is a strong signal of relevance. If the word “the” appears in all 10,000, matching it is almost meaningless. IDF is what makes BM25 pay attention to the words that actually distinguish one text from another. One important caveat: because BM25 operates only on term statistics, it has no knowledge of word order, context, or meaning – “bank” in a query about finance and “bank” in a text about rivers look identical to BM25. That vocabulary mismatch limitation is fundamental, not a tuning problem.
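Putting the three pieces together, the single-term score in the standard Okapi formulation looks like the sketch below (this uses the common +0.5-smoothed IDF variant; real implementations such as rank_bm25 or Lucene differ in small details):

```python
import math

def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, docs_with_term: int,
                    k1: float = 1.5, b: float = 0.75) -> float:
    # IDF: the rarer the term, the larger its weight.
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Length penalty: greater than 1 for longer-than-average docs when b > 0.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# "retrieval" in 3 of 10,000 docs vs. "the" in all 10,000 (tf = 2 in both):
rare   = bm25_term_score(2, doc_len=100, avg_doc_len=100.0,
                         n_docs=10_000, docs_with_term=3)
common = bm25_term_score(2, doc_len=100, avg_doc_len=100.0,
                         n_docs=10_000, docs_with_term=10_000)
print(f"rare term: {rare:.3f}   common term: {common:.6f}")
```

The rare term scores orders of magnitude higher; plugging in doc_len=400 instead of 100 shows the length penalty at work, and setting b=0 makes document length irrelevant.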

BM25 and vector search answer the same question – which documents are relevant to this query? – but through fundamentally different lenses. BM25 is a keyword matching algorithm: it looks for the exact words of your query within each document, weighs them by frequency and rarity, and ranks accordingly. It has no language understanding — it sees text as a bag of tokens, not meaning.

Vector search, in contrast, transforms both the query and each document into dense numeric vectors using an embedding model, then finds documents whose vectors point in the same direction as the query vector – measured by cosine similarity. This means vector search can match a query about “cardiac arrest” to a document about “heart failure” even with no overlapping words, because the embedding model has learned that these phrases live close together in semantic space.

The tradeoff cuts both ways: BM25 requires no model, no GPU, and no API calls – it’s fast, cheap, and fully interpretable. Vector search requires an embedding model at both index time and query time, adds latency and cost, and produces scores that are harder to interpret. Neither is strictly better; they fail in opposite ways, which is why hybrid search – combining the two – has become the norm.
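One common way to combine the two – not the only one – is Reciprocal Rank Fusion (RRF), which merges ranked lists using only ranks, so BM25 scores and cosine similarities never have to be put on a common scale. A minimal sketch with made-up document IDs (k = 60 is the conventional constant):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) across all ranked lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top   = ["doc_a", "doc_b", "doc_c"]   # hypothetical keyword ranking
vector_top = ["doc_c", "doc_a", "doc_d"]   # hypothetical semantic ranking
print(rrf_fuse([bm25_top, vector_top]))
```

Documents that appear in both lists (doc_a, doc_c) float to the top, which is exactly the behavior hybrid search wants: agreement between the two signals beats a high rank in either one alone.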

Comparing BM25 and Vector Search in Python

Installing Dependencies

pip install rank_bm25 openai numpy

import os
import re
from getpass import getpass

import numpy as np
from rank_bm25 import BM25Okapi
from openai import OpenAI

os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
client = OpenAI()  # reads OPENAI_API_KEY from the environment; used for embeddings later

Defining the Corpus

Before comparing BM25 with vector search, we need a shared knowledge base to search over. We define 12 short text passages covering a range of topics – Python, machine learning, BM25, transformers, embeddings, RAG, databases, and more. The topics are intentionally varied: some chunks are closely related (BM25 and TF-IDF, embeddings and cosine similarity), while others are completely unrelated (PostgreSQL, Django). This variation is what makes the comparison meaningful – a good retrieval method should surface the right chunks and ignore the noise.

This corpus serves as our stand-in for a real document store. In a production RAG pipeline, these chunks would come from segmenting and cleaning real documents – PDFs, wikis, knowledge bases. Here, we keep them short and hand-written so that retrieval behavior is easy to trace and reason about.

CHUNKS = [
    # 0
    "Python is a high-level, interpreted programming language known for its simple and readable syntax. "
    "It supports multiple programming paradigms including procedural, object-oriented, and functional programming.",
 
    # 1
    "Machine learning is a subset of artificial intelligence that enables systems to learn from data "
    "without being explicitly programmed. Common algorithms include linear regression, decision trees, and neural networks.",
 
    # 2
    "BM25 stands for Best Match 25. It is a bag-of-words retrieval function used by search engines "
    "to rank documents based on the query terms appearing in each document. "
    "BM25 uses term frequency and inverse document frequency with length normalization.",
 
    # 3
    "Transformer architecture introduced the self-attention mechanism, which allows the model to weigh "
    "the importance of different words in a sentence regardless of their position. "
    "BERT and GPT are both based on the Transformer architecture.",
 
    # 4
    "Vector embeddings represent text as dense numerical vectors in a high-dimensional space. "
    "Similar texts are placed closer together. This allows semantic search -- finding documents "
    "that mean the same thing even if they use different words.",
 
    # 5
    "TF-IDF stands for Term Frequency-Inverse Document Frequency. It reflects how important a word is "
    "to a document relative to the entire corpus. Rare words get higher scores than common ones like 'the'.",
 
    # 6
    "Retrieval-Augmented Generation (RAG) combines a retrieval system with a language model. "
    "The retriever finds relevant documents; the generator uses them as context to produce an answer. "
    "This reduces hallucinations and allows the model to cite sources.",
 
    # 7
    "Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. "
    "It includes an ORM, authentication system, and admin panel out of the box.",
 
    # 8
    "Cosine similarity measures the angle between two vectors. A score of 1 means identical direction, "
    "0 means orthogonal, and -1 means opposite. It is commonly used to compare text embeddings.",
 
    # 9
    "Gradient descent is an optimization algorithm used to minimize a loss function by iteratively "
    "moving in the direction of the steepest descent. It is the backbone of training neural networks.",
 
    # 10
    "PostgreSQL is an open-source relational database known for its robustness and support for advanced "
    "SQL features like window functions, CTEs, and JSON storage.",
 
    # 11
    "Sparse retrieval methods like BM25 rely on exact keyword matches and fail when the query uses "
    "synonyms or paraphrases not present in the document. Dense retrieval using embeddings handles "
    "this by matching semantic meaning rather than surface form.",
]
 
print(f"Corpus loaded: {len(CHUNKS)} chunks")
for i, c in enumerate(CHUNKS):
    print(f"  [{i:02d}] {c[:75]}...")

Building a BM25 Retriever

With the corpus defined, we can build a BM25 index. The process has two steps: tokenization and indexing. The tokenize function lowercases the text and splits it on non-alphanumeric characters — so “TF-IDF” becomes [“tf”, “idf”] and “bag-of-words” becomes [“bag”, “of”, “words”]. This is deliberately simple: BM25 is a bag-of-words model, so there is no stemming, no stop-word removal, and no linguistic preprocessing. All words are treated as independent tokens.

Once every passage is tokenized, BM25Okapi builds the index — computing document lengths, the average document length, and IDF scores for every unique term in the corpus. This happens once, up front. At query time, bm25_search tokenizes the incoming query the same way, calls get_scores to compute BM25 scores for all chunks in one pass, then sorts and returns the top-k results. The sanity check below runs a test query to verify the index works before we move on to the embedding retriever.

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r'\w+', text.lower())
 
# Build BM25 index over the corpus
tokenized_corpus = [tokenize(chunk) for chunk in CHUNKS]
bm25 = BM25Okapi(tokenized_corpus)
 
def bm25_search(query: str, top_k: int = 3) -> list[dict]:
    """Return top-k chunks ranked by BM25 score."""
    tokens = tokenize(query)
    scores = bm25.get_scores(tokens)
    ranked = np.argsort(scores)[::-1][:top_k]
    return [
        {"chunk_id": int(i), "score": round(float(scores[i]), 4), "text": CHUNKS[i]}
        for i in ranked
    ]
 
# Quick sanity check
results = bm25_search("how does BM25 rank documents", top_k=3)
print("BM25 test -- query: 'how does BM25 rank documents'")
for r in results:
    print(f"  [{r['chunk_id']}] score={r['score']}  {r['text'][:70]}...")

Building an Embedding Retriever

The embedding retriever works differently from BM25 at every step. Instead of counting tokens, it converts each chunk into a dense numeric vector – a list of 1,536 numbers – using OpenAI’s text-embedding-3-small model. Each number represents a dimension in semantic space, and chunks that mean similar things end up with vectors pointing in similar directions, regardless of the words they use.

The indexing step calls the embedding API once per chunk and stores the resulting vectors in memory. This is the main cost difference from BM25: building a BM25 index is pure arithmetic on your own machine, while building an embedding index requires one API call per chunk and produces vectors you need to store. For 12 chunks this is trivial; at a million chunks, it becomes a real infrastructure decision.
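A back-of-the-envelope estimate makes the storage side of that decision concrete (assuming float32 vectors at text-embedding-3-small’s 1,536 dimensions; real vector databases add index overhead on top of the raw vectors):

```python
dims = 1536                  # text-embedding-3-small output dimensions
bytes_per_vector = dims * 4  # float32 = 4 bytes per dimension

for n_chunks in (12, 1_000_000):
    mb = n_chunks * bytes_per_vector / 1e6
    print(f"{n_chunks:>9,} chunks -> {mb:>8,.1f} MB of raw vectors")
```

Twelve chunks fit in under 100 KB; a million chunks is roughly 6 GB before any index structure, which is why approximate nearest-neighbor indexes and quantization enter the picture at scale.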

At query time, embedding_search embeds the incoming query using the same model – this matters: the query and the chunks must live in the same vector space – and then computes the cosine similarity between the query vector and each stored chunk vector. Cosine similarity measures the angle between two vectors: 1 means identical direction, 0 means orthogonal (unrelated), and -1 means opposite directions. The chunks are then ranked by this score and the top-k are returned. The same sanity-check query from the BM25 section is reused here, so you can see the first direct comparison between the two methods.

EMBED_MODEL = "text-embedding-3-small"
 
def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(model=EMBED_MODEL, input=text)
    return np.array(response.data[0].embedding)
 
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
 
# Embed all chunks once (this is the "index build" step in RAG)
print("Building embedding index... (one API call per chunk)")
chunk_embeddings = [get_embedding(chunk) for chunk in CHUNKS]
print(f"Done. Each embedding has {len(chunk_embeddings[0])} dimensions.")
 
def embedding_search(query: str, top_k: int = 3) -> list[dict]:
    """Return top-k chunks ranked by cosine similarity to the query embedding."""
    query_emb = get_embedding(query)
    scores = [cosine_similarity(query_emb, emb) for emb in chunk_embeddings]
    ranked = np.argsort(scores)[::-1][:top_k]
    return [
        {"chunk_id": int(i), "score": round(float(scores[i]), 4), "text": CHUNKS[i]}
        for i in ranked
    ]
# Quick sanity check
results = embedding_search("how does BM25 rank documents", top_k=3)
print("\nEmbedding test -- query: 'how does BM25 rank documents'")
for r in results:
    print(f"  [{r['chunk_id']}] score={r['score']}  {r['text'][:70]}...")

Side by Side Comparison Function

This is the core of the comparison. The compare function runs the same query through both retrievers and prints the results in a two-column layout – BM25 on the left, embeddings on the right – so differences are immediately visible rank by rank.

def compare(query: str, top_k: int = 3):
    bm25_results  = bm25_search(query, top_k)
    embed_results = embedding_search(query, top_k)
 
    print(f"\n{'═'*70}")
    print(f'  QUERY: "{query}"')
    print(f"{'═'*70}")
 
    print(f"\n  {'BM25 (keyword)':<35}  {'Embedding RAG (semantic)'}")
    print(f"  {'─'*33}  {'─'*33}")
 
    for rank, (b, e) in enumerate(zip(bm25_results, embed_results), 1):
        b_preview = b['text'][:55].replace('\n', ' ')
        e_preview = e['text'][:55].replace('\n', ' ')
        same = "⬅ same" if b['chunk_id'] == e['chunk_id'] else ""
        print(f"  #{rank} [{b['chunk_id']:02d}] {b['score']:.4f}  {b_preview}...")
        print(f"     [{e['chunk_id']:02d}] {e['score']:.4f}  {e_preview}...  {same}")
        print()
compare("BM25 term frequency inverse document frequency")
compare("what is RAG and why does it reduce hallucinations")
compare("cosine similarity between vectors")



I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in Data Science, especially Neural Networks and its application in various fields.
