TF-IDF and Bag of Words: Why Keyword Search Still Works
How term frequency and inverse document frequency turn documents into weighted vectors, powering search engines and text classification long before deep learning arrived.
Terminology
| Term | Definition |
|---|---|
| Bag of Words (BoW) | A document representation that counts word occurrences while discarding word order entirely |
| Term Frequency (TF) | How often a term appears in a single document, often normalized by document length |
| Document Frequency (DF) | The number of documents in the corpus that contain a given term |
| Inverse Document Frequency (IDF) | A weight that penalizes terms appearing in many documents: $\text{IDF}(t) = \log\frac{N}{\text{DF}(t)}$ where $N$ is the total number of documents |
| TF-IDF | The product $\text{TF}(t,d) \times \text{IDF}(t)$, giving high weight to terms that are frequent in a document but rare across the corpus |
| Document Vector | A numerical vector representing a document in a $V$-dimensional space where each dimension corresponds to a vocabulary term |
| Cosine Similarity | A similarity metric between two vectors computed as $\frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$; for non-negative vectors like TF-IDF it ranges from 0 (orthogonal) to 1 (identical direction) |
| Stop Words | Extremely common words ("the", "is", "and") that carry little discriminative value and are often removed before analysis |
| Sparse Vector | A vector where most entries are zero, stored efficiently by recording only non-zero positions and values |
What & Why
Before word embeddings and transformers, the dominant way to represent text for machines was to count words. The Bag of Words model throws away word order and simply tallies how many times each word appears. This sounds crude, but it works surprisingly well for tasks like document retrieval, spam detection, and topic classification.
The problem with raw counts is that common words dominate. "The" appears in every English document, so a high count of "the" tells you nothing about what a document is about. TF-IDF fixes this by weighting each term by how rare it is across the corpus. A word that appears frequently in one document but rarely elsewhere gets a high TF-IDF score, making it a strong signal for that document's topic.
This is why keyword search still works. When you type a query into a search engine, the system computes TF-IDF vectors for your query and every document, then ranks documents by cosine similarity. The math is simple, the computation is fast, and the results are good enough that TF-IDF remains a component of modern search systems even alongside neural approaches.
How It Works
Bag of Words
Given a vocabulary of $V$ terms, each document becomes a vector of length $V$ where entry $i$ is the count of term $i$ in that document. For a corpus of 3 documents with vocabulary ["cat", "sat", "dog", "ran"]:
- Doc 1: "the cat sat" -> [1, 1, 0, 0]
- Doc 2: "the dog ran" -> [0, 0, 1, 1]
- Doc 3: "the cat ran" -> [1, 0, 0, 1]
(Stop words like "the" are removed before vectorization.)
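The example above can be sketched directly in Python (a minimal illustration; the `bow_vector` helper and `STOP_WORDS` set are mine, not part of any library):

```python
# Build Bag of Words vectors for the three example documents.
# The stop word "the" is dropped; vocabulary order matches the text above.
STOP_WORDS = {"the"}
vocab = ["cat", "sat", "dog", "ran"]

def bow_vector(text: str) -> list[int]:
    """Tokenize on whitespace, drop stop words, count per vocabulary term."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [tokens.count(term) for term in vocab]

print(bow_vector("the cat sat"))  # [1, 1, 0, 0]
print(bow_vector("the dog ran"))  # [0, 0, 1, 1]
print(bow_vector("the cat ran"))  # [1, 0, 0, 1]
```

Note that word order is already gone: "the cat sat" and "sat the cat" produce the same vector.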
TF-IDF Weighting
Raw counts overweight frequent terms. TF-IDF reweights each entry:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}$$

A term that appears in every document gets $\text{IDF} = \log(N/N) = 0$ and contributes nothing; a term that appears in only one document gets the maximum weight $\log N$.
Document Similarity via Cosine
To find documents similar to a query, compute the cosine similarity between their TF-IDF vectors:

$$\text{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$$
Cosine similarity ignores vector magnitude (document length) and focuses on direction (term distribution), making it robust to documents of different lengths.
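A quick worked example using the BoW vectors from above (raw counts rather than TF-IDF weights, to keep the arithmetic visible):

```python
import math

# Cosine similarity between the BoW vectors of Doc 1 ("the cat sat")
# and Doc 3 ("the cat ran") from the example above.
a = [1, 1, 0, 0]
b = [1, 0, 0, 1]

dot = sum(x * y for x, y in zip(a, b))        # 1 — the shared term "cat"
norm_a = math.sqrt(sum(x * x for x in a))     # sqrt(2)
norm_b = math.sqrt(sum(x * x for x in b))     # sqrt(2)

print(dot / (norm_a * norm_b))  # ≈ 0.5 — the docs share one of their two terms
```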
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Build BoW vectors | $O(N)$ | $O(D \cdot V)$ | $N$ = total tokens, $D$ = documents, $V$ = vocab size |
| Compute IDF for all terms | $O(V)$ | $O(V)$ | One pass over document frequency counts |
| TF-IDF transform | $O(D \cdot \bar{L})$ | $O(D \cdot \bar{L})$ | $\bar{L}$ = average non-zero entries per document (sparse) |
| Cosine similarity (two docs) | $O(\bar{L})$ | $O(1)$ | Dot product over sparse vectors |
| Query against all docs | $O(D \cdot \bar{L})$ | $O(D)$ | Inverted index reduces this to $O(k)$ for $k$ matching docs |
In practice, TF-IDF vectors are extremely sparse. A document with 200 unique words in a vocabulary of 100,000 has 99.8% zero entries. Sparse storage (inverted indices) makes both storage and retrieval efficient.
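The inverted-index idea from the table can be sketched in a few lines (a toy version; real engines like Lucene add compression, skip lists, and scoring on top of this structure):

```python
from collections import defaultdict

# Toy inverted index: map each term to the set of documents containing it.
# A query then touches only documents sharing at least one query term,
# instead of scanning all D documents.
docs = {
    0: ["cat", "sat"],
    1: ["dog", "ran"],
    2: ["cat", "ran"],
}

index = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        index[token].add(doc_id)

def candidate_docs(query_tokens):
    """Union of posting lists — the only documents worth scoring."""
    result = set()
    for t in query_tokens:
        result |= index.get(t, set())
    return result

print(sorted(candidate_docs(["cat"])))  # [0, 2]
print(sorted(candidate_docs(["ran"])))  # [1, 2]
```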
Implementation
```
ALGORITHM ComputeTFIDF(corpus)
INPUT:  corpus: list of documents (each a list of tokens)
OUTPUT: tfidfMatrix: map of (docIndex, term) -> tfidf score,
        idf: map of term -> idf value
BEGIN
    N <- LENGTH(corpus)
    df <- empty map          // document frequency per term
    tfPerDoc <- empty list   // one map per document

    // Pass 1: compute TF and DF
    FOR docIdx FROM 0 TO N - 1 DO
        tf <- empty map
        FOR EACH token IN corpus[docIdx] DO
            tf[token] <- tf[token] + 1
        END FOR
        // Normalize TF by document length; each unique term
        // counts once toward DF (tf keys are already unique)
        docLen <- LENGTH(corpus[docIdx])
        FOR EACH (term, count) IN tf DO
            tf[term] <- count / docLen
            df[term] <- df[term] + 1
        END FOR
        APPEND tf TO tfPerDoc
    END FOR

    // Pass 2: compute IDF
    idf <- empty map
    FOR EACH (term, docCount) IN df DO
        idf[term] <- LOG(N / docCount)
    END FOR

    // Pass 3: compute TF-IDF
    tfidfMatrix <- empty map
    FOR docIdx FROM 0 TO N - 1 DO
        FOR EACH (term, tfVal) IN tfPerDoc[docIdx] DO
            tfidfMatrix[(docIdx, term)] <- tfVal * idf[term]
        END FOR
    END FOR

    RETURN tfidfMatrix, idf
END
```
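A direct Python translation of the pseudocode above (a sketch; the function and variable names here are mine):

```python
import math
from collections import Counter

def compute_tfidf(corpus):
    """corpus: list of documents, each a list of tokens.
    Returns (tfidf, idf): tfidf maps (doc_index, term) -> score."""
    n = len(corpus)
    df = Counter()
    tf_per_doc = []

    # Pass 1: length-normalized TF per document, DF per term
    for tokens in corpus:
        counts = Counter(tokens)
        doc_len = len(tokens)
        tf = {term: count / doc_len for term, count in counts.items()}
        df.update(tf.keys())  # each unique term counts once per document
        tf_per_doc.append(tf)

    # Pass 2: IDF = log(N / DF)
    idf = {term: math.log(n / doc_count) for term, doc_count in df.items()}

    # Pass 3: TF-IDF = TF * IDF
    tfidf = {
        (doc_idx, term): tf_val * idf[term]
        for doc_idx, tf in enumerate(tf_per_doc)
        for term, tf_val in tf.items()
    }
    return tfidf, idf

corpus = [["cat", "sat"], ["dog", "ran"], ["cat", "ran"]]
tfidf, idf = compute_tfidf(corpus)
print(round(idf["sat"], 3))  # 1.099 — log(3/1): "sat" appears in one doc
print(round(idf["cat"], 3))  # 0.405 — log(3/2): "cat" appears in two
```

As expected, the rarer term "sat" receives a higher IDF weight than "cat", which appears in two of the three documents.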
```
ALGORITHM CosineSimilarity(vecA, vecB)
INPUT:  vecA: sparse map of term -> weight,
        vecB: sparse map of term -> weight
OUTPUT: similarity: float in [0, 1]
BEGIN
    dotProduct <- 0
    normA <- 0
    normB <- 0
    FOR EACH (term, weight) IN vecA DO
        normA <- normA + weight * weight
        IF term IN vecB THEN
            dotProduct <- dotProduct + weight * vecB[term]
        END IF
    END FOR
    FOR EACH (term, weight) IN vecB DO
        normB <- normB + weight * weight
    END FOR
    IF normA = 0 OR normB = 0 THEN
        RETURN 0
    END IF
    RETURN dotProduct / (SQRT(normA) * SQRT(normB))
END
```
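The same routine in Python over sparse dict vectors (a sketch; iterating the smaller vector for the dot product is a small optimization added here, not in the pseudocode above):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    # Iterate over the smaller vector when computing the dot product;
    # cosine is symmetric, so swapping the arguments is safe.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # guard: an all-zero vector has no direction
    return dot / (norm_a * norm_b)

# Illustrative TF-IDF-style weights (values chosen for the example)
doc1 = {"cat": 0.4, "sat": 1.1}
doc3 = {"cat": 0.4, "ran": 0.4}
print(round(cosine_similarity(doc1, doc3), 3))  # ≈ 0.242 — overlap only on "cat"
print(cosine_similarity(doc1, {}))              # 0.0 — empty vector, zero norm
```

Only terms present in both vectors contribute to the dot product, which is exactly why sparse storage makes this fast.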
Real-World Applications
- Search engines: TF-IDF (or BM25, its probabilistic cousin) remains the backbone of document retrieval, used by Elasticsearch, Solr, and Lucene to rank search results
- Email spam filtering: Naive Bayes classifiers trained on BoW features were among the first effective spam filters, and TF-IDF weighting improved their accuracy
- Document clustering: K-means on TF-IDF vectors groups similar documents together for topic discovery, news aggregation, and duplicate detection
- Keyword extraction: Terms with the highest TF-IDF scores in a document are strong candidates for keywords and tags
- Recommendation systems: Content-based recommenders compute TF-IDF similarity between item descriptions to suggest similar products or articles
- Legal and medical search: Domain-specific search systems use TF-IDF with custom stop word lists and domain vocabularies to surface relevant case law or clinical studies
Key Takeaways
- Bag of Words represents documents as word count vectors, discarding order but retaining enough signal for many classification and retrieval tasks
- TF-IDF reweights raw counts by penalizing terms that appear in many documents, surfacing the words that actually distinguish one document from another
- Cosine similarity between TF-IDF vectors measures document similarity independent of length, making it the standard metric for text retrieval
- TF-IDF vectors are extremely sparse, enabling efficient storage via inverted indices and fast retrieval even over millions of documents
- Despite being decades old, TF-IDF and its variant BM25 remain core components of modern search engines, often used alongside neural re-rankers