TF-IDF and Bag of Words: Why Keyword Search Still Works
How term frequency and inverse document frequency turn documents into weighted vectors, powering search engines and text classification long before deep learning arrived.
Terminology
| Term | Definition |
|---|---|
| Bag of Words (BoW) | A document representation that counts word occurrences while discarding word order entirely |
| Term Frequency (TF) | How often a term appears in a single document, often normalized by document length |
| Document Frequency (DF) | The number of documents in the corpus that contain a given term |
| Inverse Document Frequency (IDF) | A weight that penalizes terms appearing in many documents: $\text{IDF}(t) = \log\frac{N}{\text{DF}(t)}$ where $N$ is the total number of documents |
| TF-IDF | The product $\text{TF}(t,d) \times \text{IDF}(t)$, giving high weight to terms that are frequent in a document but rare across the corpus |
| Document Vector | A numerical vector representing a document in a $V$-dimensional space where each dimension corresponds to a vocabulary term |
| Cosine Similarity | A similarity metric between two vectors computed as $\frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$; for non-negative vectors like TF-IDF it ranges from 0 (orthogonal) to 1 (identical direction) |
| Stop Words | Extremely common words ("the", "is", "and") that carry little discriminative value and are often removed before analysis |
| Sparse Vector | A vector where most entries are zero, stored efficiently by recording only non-zero positions and values |
What & Why
Before word embeddings and transformers, the dominant way to represent text for machines was to count words. The Bag of Words model throws away word order and simply tallies how many times each word appears. This sounds crude, but it works surprisingly well for tasks like document retrieval, spam detection, and topic classification.
The problem with raw counts is that common words dominate. "The" appears in every English document, so a high count of "the" tells you nothing about what a document is about. TF-IDF fixes this by weighting each term by how rare it is across the corpus. A word that appears frequently in one document but rarely elsewhere gets a high TF-IDF score, making it a strong signal for that document's topic.
This is why keyword search still works. When you type a query into a search engine, the system computes TF-IDF vectors for your query and every document, then ranks documents by cosine similarity. The math is simple, the computation is fast, and the results are good enough that TF-IDF remains a component of modern search systems even alongside neural approaches.
How It Works
Bag of Words
Given a vocabulary of $V$ terms, each document becomes a vector of length $V$ where entry $i$ is the count of term $i$ in that document. For a corpus of 3 documents with vocabulary ["cat", "sat", "dog", "ran"]:
- Doc 1: "the cat sat" -> [1, 1, 0, 0]
- Doc 2: "the dog ran" -> [0, 0, 1, 1]
- Doc 3: "the cat ran" -> [1, 0, 0, 1]
(Stop words like "the" are removed before vectorization.)
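The example above can be sketched directly in Python (a minimal illustration; the `bow_vector` helper and `STOP_WORDS` set are mine, not part of any library):

```python
# Build Bag of Words vectors for the three example documents.
# The stop word "the" is dropped; vocabulary order matches the text above.
STOP_WORDS = {"the"}
vocab = ["cat", "sat", "dog", "ran"]

def bow_vector(text: str) -> list[int]:
    """Tokenize on whitespace, drop stop words, count per vocabulary term."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [tokens.count(term) for term in vocab]

print(bow_vector("the cat sat"))  # [1, 1, 0, 0]
print(bow_vector("the dog ran"))  # [0, 0, 1, 1]
print(bow_vector("the cat ran"))  # [1, 0, 0, 1]
```

Note that word order is already gone: "the cat sat" and "sat the cat" produce the same vector.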
TF-IDF Weighting
Raw counts overweight frequent terms. TF-IDF reweights each entry:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}$$

A term that appears in every document gets $\text{IDF} = \log(N/N) = 0$ and contributes nothing; a term that appears in only one document gets the maximum weight $\log N$.
Document Similarity via Cosine
To find documents similar to a query, compute the cosine similarity between their TF-IDF vectors:

$$\text{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$$
Cosine similarity ignores vector magnitude (document length) and focuses on direction (term distribution), making it robust to documents of different lengths.
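A quick worked example using the BoW vectors from above (raw counts rather than TF-IDF weights, to keep the arithmetic visible):

```python
import math

# Cosine similarity between the BoW vectors of Doc 1 ("the cat sat")
# and Doc 3 ("the cat ran") from the example above.
a = [1, 1, 0, 0]
b = [1, 0, 0, 1]

dot = sum(x * y for x, y in zip(a, b))        # 1 — the shared term "cat"
norm_a = math.sqrt(sum(x * x for x in a))     # sqrt(2)
norm_b = math.sqrt(sum(x * x for x in b))     # sqrt(2)

print(dot / (norm_a * norm_b))  # ≈ 0.5 — the docs share one of their two terms
```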
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Build BoW vectors | $O(N)$ | $O(D \cdot V)$ | $N$ = total tokens, $D$ = documents, $V$ = vocab size |
| Compute IDF for all terms | $O(V)$ | $O(V)$ | One pass over document frequency counts |
| TF-IDF transform | $O(D \cdot \bar{L})$ | $O(D \cdot \bar{L})$ | $\bar{L}$ = average non-zero entries per document (sparse) |
| Cosine similarity (two docs) | $O(\bar{L})$ | $O(1)$ | Dot product over sparse vectors |
| Query against all docs | $O(D \cdot \bar{L})$ | $O(D)$ | Inverted index reduces this to $O(k)$ for $k$ matching docs |
In practice, TF-IDF vectors are extremely sparse. A document with 200 unique words in a vocabulary of 100,000 has 99.8% zero entries. Sparse storage (inverted indices) makes both storage and retrieval efficient.
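The inverted-index idea from the table can be sketched in a few lines (a toy version; real engines like Lucene add compression, skip lists, and scoring on top of this structure):

```python
from collections import defaultdict

# Toy inverted index: map each term to the set of documents containing it.
# A query then touches only documents sharing at least one query term,
# instead of scanning all D documents.
docs = {
    0: ["cat", "sat"],
    1: ["dog", "ran"],
    2: ["cat", "ran"],
}

index = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        index[token].add(doc_id)

def candidate_docs(query_tokens):
    """Union of posting lists — the only documents worth scoring."""
    result = set()
    for t in query_tokens:
        result |= index.get(t, set())
    return result

print(sorted(candidate_docs(["cat"])))  # [0, 2]
print(sorted(candidate_docs(["ran"])))  # [1, 2]
```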
Implementation
```
ALGORITHM ComputeTFIDF(corpus)
INPUT:  corpus: list of documents (each a list of tokens)
OUTPUT: tfidfMatrix: map of (docIndex, term) -> tfidf score,
        idf: map of term -> idf value
BEGIN
    N <- LENGTH(corpus)
    df <- empty map          // document frequency per term
    tfPerDoc <- empty list   // one map per document

    // Pass 1: compute TF and DF
    FOR docIdx FROM 0 TO N - 1 DO
        tf <- empty map
        FOR EACH token IN corpus[docIdx] DO
            tf[token] <- tf[token] + 1
        END FOR
        // Normalize TF by document length; each unique term
        // counts once toward DF (tf keys are already unique)
        docLen <- LENGTH(corpus[docIdx])
        FOR EACH (term, count) IN tf DO
            tf[term] <- count / docLen
            df[term] <- df[term] + 1
        END FOR
        APPEND tf TO tfPerDoc
    END FOR

    // Pass 2: compute IDF
    idf <- empty map
    FOR EACH (term, docCount) IN df DO
        idf[term] <- LOG(N / docCount)
    END FOR

    // Pass 3: compute TF-IDF
    tfidfMatrix <- empty map
    FOR docIdx FROM 0 TO N - 1 DO
        FOR EACH (term, tfVal) IN tfPerDoc[docIdx] DO
            tfidfMatrix[(docIdx, term)] <- tfVal * idf[term]
        END FOR
    END FOR

    RETURN tfidfMatrix, idf
END
```
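A direct Python translation of the pseudocode above (a sketch; the function and variable names here are mine):

```python
import math
from collections import Counter

def compute_tfidf(corpus):
    """corpus: list of documents, each a list of tokens.
    Returns (tfidf, idf): tfidf maps (doc_index, term) -> score."""
    n = len(corpus)
    df = Counter()
    tf_per_doc = []

    # Pass 1: length-normalized TF per document, DF per term
    for tokens in corpus:
        counts = Counter(tokens)
        doc_len = len(tokens)
        tf = {term: count / doc_len for term, count in counts.items()}
        df.update(tf.keys())  # each unique term counts once per document
        tf_per_doc.append(tf)

    # Pass 2: IDF = log(N / DF)
    idf = {term: math.log(n / doc_count) for term, doc_count in df.items()}

    # Pass 3: TF-IDF = TF * IDF
    tfidf = {
        (doc_idx, term): tf_val * idf[term]
        for doc_idx, tf in enumerate(tf_per_doc)
        for term, tf_val in tf.items()
    }
    return tfidf, idf

corpus = [["cat", "sat"], ["dog", "ran"], ["cat", "ran"]]
tfidf, idf = compute_tfidf(corpus)
print(round(idf["sat"], 3))  # 1.099 — log(3/1): "sat" appears in one doc
print(round(idf["cat"], 3))  # 0.405 — log(3/2): "cat" appears in two
```

As expected, the rarer term "sat" receives a higher IDF weight than "cat", which appears in two of the three documents.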
```
ALGORITHM CosineSimilarity(vecA, vecB)
INPUT:  vecA: sparse map of term -> weight,
        vecB: sparse map of term -> weight
OUTPUT: similarity: float in [0, 1]
BEGIN
    dotProduct <- 0
    normA <- 0
    normB <- 0
    FOR EACH (term, weight) IN vecA DO
        normA <- normA + weight * weight
        IF term IN vecB THEN
            dotProduct <- dotProduct + weight * vecB[term]
        END IF
    END FOR
    FOR EACH (term, weight) IN vecB DO
        normB <- normB + weight * weight
    END FOR
    IF normA = 0 OR normB = 0 THEN
        RETURN 0
    END IF
    RETURN dotProduct / (SQRT(normA) * SQRT(normB))
END
```
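The same routine in Python over sparse dict vectors (a sketch; iterating the smaller vector for the dot product is a small optimization added here, not in the pseudocode above):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    # Iterate over the smaller vector when computing the dot product;
    # cosine is symmetric, so swapping the arguments is safe.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # guard: an all-zero vector has no direction
    return dot / (norm_a * norm_b)

# Illustrative TF-IDF-style weights (values chosen for the example)
doc1 = {"cat": 0.4, "sat": 1.1}
doc3 = {"cat": 0.4, "ran": 0.4}
print(round(cosine_similarity(doc1, doc3), 3))  # ≈ 0.242 — overlap only on "cat"
print(cosine_similarity(doc1, {}))              # 0.0 — empty vector, zero norm
```

Only terms present in both vectors contribute to the dot product, which is exactly why sparse storage makes this fast.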
Real-World Applications
- Search engines: TF-IDF (or BM25, its probabilistic cousin) remains the backbone of document retrieval, used by Elasticsearch, Solr, and Lucene to rank search results
- Email spam filtering: Naive Bayes classifiers trained on BoW features were among the first effective spam filters, and TF-IDF weighting improved their accuracy
- Document clustering: K-means on TF-IDF vectors groups similar documents together for topic discovery, news aggregation, and duplicate detection
- Keyword extraction: Terms with the highest TF-IDF scores in a document are strong candidates for keywords and tags
- Recommendation systems: Content-based recommenders compute TF-IDF similarity between item descriptions to suggest similar products or articles
- Legal and medical search: Domain-specific search systems use TF-IDF with custom stop word lists and domain vocabularies to surface relevant case law or clinical studies
Key Takeaways
- Bag of Words represents documents as word count vectors, discarding order but retaining enough signal for many classification and retrieval tasks
- TF-IDF reweights raw counts by penalizing terms that appear in many documents, surfacing the words that actually distinguish one document from another
- Cosine similarity between TF-IDF vectors measures document similarity independent of length, making it the standard metric for text retrieval
- TF-IDF vectors are extremely sparse, enabling efficient storage via inverted indices and fast retrieval even over millions of documents
- Despite being decades old, TF-IDF and its variant BM25 remain core components of modern search engines, often used alongside neural re-rankers