Semantic Similarity and Sentence Embeddings: From Cosine Distance to RAG Pipelines
How BERT embeddings and sentence transformers measure meaning similarity between texts, enabling semantic search, duplicate detection, and retrieval-augmented generation.
Terminology
| Term | Definition |
|---|---|
| Semantic Similarity | A measure of how close two pieces of text are in meaning, independent of surface-level word overlap |
| Sentence Embedding | A fixed-length dense vector representing the meaning of an entire sentence, typically 384 to 1024 dimensions |
| Cosine Similarity | $\frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$: measures the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction) |
| BERT | Bidirectional Encoder Representations from Transformers: a pre-trained language model that produces contextualized word embeddings |
| Sentence Transformer | A BERT-based model fine-tuned with a siamese or triplet network to produce sentence embeddings optimized for similarity comparison |
| Siamese Network | A neural architecture with two identical sub-networks that process two inputs independently, then compare their outputs |
| RAG (Retrieval-Augmented Generation) | A pattern where a language model retrieves relevant documents via semantic search before generating an answer, grounding responses in external knowledge |
| Vector Database | A database optimized for storing and querying high-dimensional vectors using approximate nearest neighbor (ANN) search |
| ANN (Approximate Nearest Neighbor) | Algorithms (HNSW, IVF, LSH) that find near-optimal nearest neighbors in sub-linear time by trading exactness for speed |
What & Why
Keyword search fails when the query and the answer use different words. Searching for "how to fix a flat tire" will not match a document titled "changing a punctured wheel" even though they mean the same thing. Semantic similarity solves this by comparing meaning rather than surface tokens.
The key enabler is sentence embeddings: mapping entire sentences (or paragraphs) to dense vectors where semantically similar texts land close together. Early approaches averaged word embeddings, but this loses word order and context. BERT produces contextualized embeddings, but comparing two sentences with BERT requires feeding the pair through the model jointly, so ranking $n$ candidates against one another costs $O(n^2)$ forward passes. Sentence transformers solve this by producing independent embeddings that can be compared with a simple cosine similarity, enabling real-time semantic search over millions of documents.
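As a concrete illustration, the cosine comparison takes only a few lines of Python; the 4-dimensional vectors below are toy stand-ins for real sentence embeddings:

```python
import math

def cosine_similarity(a, b):
    """a.b / (||a|| ||b||): ranges from -1 (opposite) to 1 (identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models use 384 to 1024 dimensions)
flat_tire = [0.9, 0.1, 0.3, 0.0]
punctured_wheel = [0.8, 0.2, 0.4, 0.1]
chocolate_cake = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(flat_tire, punctured_wheel))  # high: related meaning
print(cosine_similarity(flat_tire, chocolate_cake))   # low: unrelated
```

With real embeddings the geometry is the same, just in a few hundred dimensions instead of four.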
This technology powers the RAG pattern that makes modern LLMs useful: instead of relying solely on training data, the model retrieves relevant documents from a vector database and uses them as context for generation. Semantic similarity is the retrieval engine behind this.
How It Works
From Word Embeddings to Sentence Embeddings
Averaging Word2Vec or GloVe vectors for all words in a sentence gives a crude sentence embedding. It works for short, simple sentences but fails when word order matters: "dog bites man" and "man bites dog" produce identical averages.
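The order-invariance failure is easy to verify with made-up word vectors (the 3-dimensional vectors below are illustrative, not from a trained model):

```python
def average_embedding(sentence, word_vectors):
    """Bag-of-words pooling: average the vectors of all words in the sentence."""
    vecs = [word_vectors[w] for w in sentence.split()]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Made-up 3-dimensional word vectors (illustrative only)
word_vectors = {
    "dog":   [1.0, 0.0, 0.25],
    "bites": [0.0, 1.0, 0.5],
    "man":   [0.25, 0.0, 1.0],
}

a = average_embedding("dog bites man", word_vectors)
b = average_embedding("man bites dog", word_vectors)
print(a == b)  # True: averaging discards word order entirely
```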
BERT produces token-level embeddings that are contextualized (the vector for "bank" differs in "river bank" vs "bank account"), but BERT was not designed to produce a single sentence vector. The common hack of using the [CLS] token embedding performs poorly for similarity tasks.
Sentence Transformers (SBERT)
Sentence transformers (Reimers and Gurevych, 2019) fine-tune BERT with a siamese architecture:
- Pass sentence A through BERT, pool the output to get embedding $\vec{a}$
- Pass sentence B through the same BERT, pool to get embedding $\vec{b}$
- Train with a contrastive or triplet loss that pushes similar pairs together and dissimilar pairs apart
After training, each sentence can be embedded independently. Comparing all pairs among 10,000 sentences then requires 10,000 forward passes (one per sentence) plus roughly 50 million cheap cosine similarity computations, instead of 50 million full pairwise BERT inferences.
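A small sketch of why this matters: once each sentence has been embedded (one forward pass each) and L2-normalized, every pairwise comparison is just a dot product. The 3-dimensional vectors here stand in for real encoder outputs:

```python
import math
import itertools

def normalize(v):
    """L2-normalize so that a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Pretend each vector came from one independent encoder forward pass
embeddings = [normalize(v) for v in [
    [0.9, 0.1, 0.3],   # sentence 0
    [0.8, 0.2, 0.4],   # sentence 1 (close to 0)
    [0.0, 0.9, 0.1],   # sentence 2 (different)
]]

# All-pairs comparison needs only cheap dot products, no further model calls
for i, j in itertools.combinations(range(len(embeddings)), 2):
    sim = sum(x * y for x, y in zip(embeddings[i], embeddings[j]))
    print(i, j, round(sim, 3))
```

The expensive encoder runs $O(N)$ times; the $O(N^2)$ part is reduced to dot products costing $O(d)$ each.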
The RAG Pipeline
RAG combines semantic retrieval with generative models:
- Offline: embed all documents and store vectors in a vector database
- Online: embed the user query, find the $k$ nearest document vectors
- Concatenate the retrieved documents with the query as context for the LLM
- The LLM generates an answer grounded in the retrieved evidence
This lets the model access up-to-date information without retraining.
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Sentence embedding (one sentence) | $O(L \cdot d^2)$ | $O(d)$ | $L$ = token count, $d$ = model dimension (transformer forward pass) |
| Cosine similarity (two vectors) | $O(d)$ | $O(1)$ | $d$ = embedding dimension (384 to 1024) |
| Brute-force nearest neighbor | $O(N \cdot d)$ | $O(N \cdot d)$ | $N$ = number of stored vectors |
| ANN search (HNSW) | $O(\log N \cdot d)$ | $O(N \cdot d)$ | Sub-linear query time with graph-based index |
| Cross-encoder (BERT pairwise) | $O(N \cdot L \cdot d^2)$ | $O(d)$ | Must run a full BERT forward pass for every query-document pair, infeasible at scale |
For a vector database with $N = 10{,}000{,}000$ documents at $d = 384$ dimensions, brute-force search requires $\sim 3.8 \times 10^9$ floating-point operations per query. HNSW reduces this to roughly $10{,}000$ distance computations per query, about a $1{,}000\times$ reduction.
Implementation
ALGORITHM BuildSemanticIndex(documents, encoder)
INPUT: documents: list of N text strings, encoder: sentence transformer model
OUTPUT: index: vector database with N embeddings, docMap: map of vectorId -> document
BEGIN
embeddings <- CREATE matrix [N x d]
docMap <- empty map
FOR i FROM 0 TO N - 1 DO
tokens <- TOKENIZE(documents[i])
hiddenStates <- encoder.FORWARD(tokens) // [seqLen x d]
// Mean pooling over token positions
sentenceVec <- MEAN(hiddenStates, axis=0) // [d]
sentenceVec <- sentenceVec / NORM(sentenceVec) // L2 normalize
embeddings[i] <- sentenceVec
docMap[i] <- documents[i]
END FOR
index <- BUILD_HNSW_INDEX(embeddings)
RETURN index, docMap
END
ALGORITHM SemanticSearch(query, index, docMap, encoder, k)
INPUT: query: string, index: HNSW index, docMap: map, encoder: model, k: integer
OUTPUT: results: list of (document, score) pairs, sorted by similarity
BEGIN
tokens <- TOKENIZE(query)
hiddenStates <- encoder.FORWARD(tokens)
queryVec <- MEAN(hiddenStates, axis=0)
queryVec <- queryVec / NORM(queryVec)
// ANN search returns k nearest neighbor IDs and distances
(ids, distances) <- index.SEARCH(queryVec, k)
results <- empty list
FOR i FROM 0 TO k - 1 DO
score <- 1 - distances[i] // convert distance to similarity
APPEND (docMap[ids[i]], score) TO results
END FOR
RETURN results
END
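The two algorithms above can be condensed into runnable Python. The bag-of-words encoder and exact brute-force search below are deliberate stand-ins: a production system would substitute a trained sentence transformer and an ANN library such as hnswlib or FAISS, and the toy encoder cannot capture synonymy, which is precisely what the trained model adds:

```python
import math

def toy_encode(text, vocab):
    """Stand-in encoder: L2-normalized bag-of-words counts.
    A real pipeline would run a sentence transformer forward pass here."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_semantic_index(documents, vocab):
    """BuildSemanticIndex: embed every document. A plain list stands in
    for the HNSW index, so search below is exact brute force, not ANN."""
    index = [toy_encode(doc, vocab) for doc in documents]
    doc_map = dict(enumerate(documents))
    return index, doc_map

def semantic_search(query, index, doc_map, vocab, k):
    """SemanticSearch: embed the query, rank all documents by cosine similarity."""
    q = toy_encode(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(index)]
    scored.sort(reverse=True)
    return [(doc_map[i], score) for score, i in scored[:k]]

docs = [
    "the cat sat on the mat",
    "stock prices fell sharply",
    "a cat chased a mouse",
]
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.lower().split()}))}
index, doc_map = build_semantic_index(docs, vocab)
print(semantic_search("cat sat on a mat", index, doc_map, vocab, k=2))
```

Swapping `toy_encode` for a real encoder and the list for an ANN index changes nothing else in the control flow, which is the point of the bi-encoder design.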
ALGORITHM RAGPipeline(query, index, docMap, encoder, llm, k)
INPUT: query: string, index: HNSW index, docMap: map, encoder: model,
llm: language model, k: number of documents to retrieve
OUTPUT: answer: generated string grounded in retrieved context
BEGIN
retrievedDocs <- SemanticSearch(query, index, docMap, encoder, k)
context <- ""
FOR EACH (doc, score) IN retrievedDocs DO
context <- context + "\n---\n" + doc
END FOR
prompt <- "Answer the question using the provided context.\n\n"
+ "Context:" + context + "\n\n"
+ "Question: " + query + "\n\nAnswer:"
answer <- llm.GENERATE(prompt)
RETURN answer
END
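Once retrieval is done, RAGPipeline reduces to prompt assembly plus one generation call. A minimal Python sketch with stubbed retrieval and generation (the `fake_retrieve` and `fake_generate` names are hypothetical placeholders for the semantic search above and a real LLM API):

```python
def rag_pipeline(query, retrieve, generate, k=3):
    """Retrieve top-k documents, pack them into a prompt, generate an answer."""
    retrieved = retrieve(query, k)  # list of (document, score) pairs
    context = "".join("\n---\n" + doc for doc, _score in retrieved)
    prompt = (
        "Answer the question using the provided context.\n\n"
        "Context:" + context + "\n\n"
        "Question: " + query + "\n\nAnswer:"
    )
    return generate(prompt)

# Hypothetical stubs standing in for semantic search and an LLM call
def fake_retrieve(query, k):
    return [("Loosen the lug nuts before jacking up the car.", 0.91)][:k]

def fake_generate(prompt):
    # A real LLM would answer from the retrieved context in the prompt
    return "Loosen the lug nuts before jacking the car up."

print(rag_pipeline("How do I change a flat tire?", fake_retrieve, fake_generate))
```

Passing the retriever and generator as functions keeps the pipeline testable: either component can be swapped without touching the prompt assembly.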
Real-World Applications
- Semantic search engines: Products like Google Search, Bing, and internal enterprise search use sentence embeddings to match queries to documents by meaning, not just keywords
- RAG-powered chatbots: Customer support bots retrieve relevant knowledge base articles via semantic search before generating answers, reducing hallucination
- Duplicate detection: Stack Overflow, Quora, and support ticket systems use semantic similarity to flag duplicate questions, even when phrased differently
- Plagiarism detection: Academic tools compare sentence embeddings to detect paraphrased content that keyword-based systems miss
- Recommendation systems: Content platforms embed articles and user queries into the same space, recommending semantically related content
- Legal document review: Law firms use semantic search to find relevant precedents across millions of case documents, replacing manual keyword searches
Key Takeaways
- Semantic similarity measures meaning closeness between texts, solving the vocabulary mismatch problem that defeats keyword search
- Sentence transformers produce independent embeddings for each text, enabling comparison via cheap cosine similarity instead of expensive pairwise model inference
- Vector databases with ANN indices (HNSW, IVF) make semantic search over millions of documents feasible in milliseconds
- RAG pipelines combine semantic retrieval with generative models, grounding LLM outputs in retrieved evidence and reducing hallucination
- The two-stage retrieve-then-rerank pattern (fast bi-encoder retrieval followed by accurate cross-encoder reranking) balances speed and quality in production systems