Semantic Similarity and Sentence Embeddings: From Cosine Distance to RAG Pipelines
How BERT embeddings and sentence transformers measure meaning similarity between texts, enabling semantic search, duplicate detection, and retrieval-augmented generation.
Terminology
| Term | Definition |
|---|---|
| Semantic Similarity | A measure of how close two pieces of text are in meaning, independent of surface-level word overlap |
| Sentence Embedding | A fixed-length dense vector representing the meaning of an entire sentence, typically 384 to 1024 dimensions |
| Cosine Similarity | $\frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$: measures the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction) |
| BERT | Bidirectional Encoder Representations from Transformers: a pre-trained language model that produces contextualized word embeddings |
| Sentence Transformer | A BERT-based model fine-tuned with a siamese or triplet network to produce sentence embeddings optimized for similarity comparison |
| Siamese Network | A neural architecture with two identical sub-networks that process two inputs independently, then compare their outputs |
| RAG (Retrieval-Augmented Generation) | A pattern where a language model retrieves relevant documents via semantic search before generating an answer, grounding responses in external knowledge |
| Vector Database | A database optimized for storing and querying high-dimensional vectors using approximate nearest neighbor (ANN) search |
| ANN (Approximate Nearest Neighbor) | Algorithms (HNSW, IVF, LSH) that find near-optimal nearest neighbors in sub-linear time by trading exactness for speed |
What & Why
Keyword search fails when the query and the answer use different words. Searching for "how to fix a flat tire" will not match a document titled "changing a punctured wheel" even though they mean the same thing. Semantic similarity solves this by comparing meaning rather than surface tokens.
The key enabler is sentence embeddings: mapping entire sentences (or paragraphs) to dense vectors where semantically similar texts land close together. Early approaches averaged word embeddings, but this loses word order and context. BERT produces contextualized embeddings, but comparing two sentences with BERT requires feeding the pair through the model jointly, so ranking $n$ candidates against one another costs $O(n^2)$ forward passes. Sentence transformers solve this by producing independent embeddings that can be compared with a simple cosine similarity, enabling real-time semantic search over millions of documents.
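As a concrete illustration, the cosine comparison takes only a few lines of Python; the 4-dimensional vectors below are toy stand-ins for real sentence embeddings:

```python
import math

def cosine_similarity(a, b):
    """a.b / (||a|| ||b||): ranges from -1 (opposite) to 1 (identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models use 384 to 1024 dimensions)
flat_tire = [0.9, 0.1, 0.3, 0.0]
punctured_wheel = [0.8, 0.2, 0.4, 0.1]
chocolate_cake = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(flat_tire, punctured_wheel))  # high: related meaning
print(cosine_similarity(flat_tire, chocolate_cake))   # low: unrelated
```

With real embeddings the geometry is the same, just in a few hundred dimensions instead of four.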
This technology powers the RAG pattern that makes modern LLMs useful: instead of relying solely on training data, the model retrieves relevant documents from a vector database and uses them as context for generation. Semantic similarity is the retrieval engine behind this.
How It Works
From Word Embeddings to Sentence Embeddings
Averaging Word2Vec or GloVe vectors for all words in a sentence gives a crude sentence embedding. It works for short, simple sentences but fails when word order matters: "dog bites man" and "man bites dog" produce identical averages.
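The order-invariance failure is easy to verify with made-up word vectors (the 3-dimensional vectors below are illustrative, not from a trained model):

```python
def average_embedding(sentence, word_vectors):
    """Bag-of-words pooling: average the vectors of all words in the sentence."""
    vecs = [word_vectors[w] for w in sentence.split()]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Made-up 3-dimensional word vectors (illustrative only)
word_vectors = {
    "dog":   [1.0, 0.0, 0.25],
    "bites": [0.0, 1.0, 0.5],
    "man":   [0.25, 0.0, 1.0],
}

a = average_embedding("dog bites man", word_vectors)
b = average_embedding("man bites dog", word_vectors)
print(a == b)  # True: averaging discards word order entirely
```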
BERT produces token-level embeddings that are contextualized (the vector for "bank" differs in "river bank" vs "bank account"), but BERT was not designed to produce a single sentence vector. The common hack of using the [CLS] token embedding performs poorly for similarity tasks.
Sentence Transformers (SBERT)
Sentence transformers (Reimers and Gurevych, 2019) fine-tune BERT with a siamese architecture:
- Pass sentence A through BERT, pool the output to get embedding $\vec{a}$
- Pass sentence B through the same BERT, pool to get embedding $\vec{b}$
- Train with a contrastive or triplet loss that pushes similar pairs together and dissimilar pairs apart
After training, each sentence can be embedded independently. Comparing all pairs among 10,000 sentences then requires 10,000 forward passes (one per sentence) plus roughly 50 million cheap cosine similarity computations, instead of 50 million full pairwise BERT inferences.
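A small sketch of why this matters: once each sentence has been embedded (one forward pass each) and L2-normalized, every pairwise comparison is just a dot product. The 3-dimensional vectors here stand in for real encoder outputs:

```python
import math
import itertools

def normalize(v):
    """L2-normalize so that a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Pretend each vector came from one independent encoder forward pass
embeddings = [normalize(v) for v in [
    [0.9, 0.1, 0.3],   # sentence 0
    [0.8, 0.2, 0.4],   # sentence 1 (close to 0)
    [0.0, 0.9, 0.1],   # sentence 2 (different)
]]

# All-pairs comparison needs only cheap dot products, no further model calls
for i, j in itertools.combinations(range(len(embeddings)), 2):
    sim = sum(x * y for x, y in zip(embeddings[i], embeddings[j]))
    print(i, j, round(sim, 3))
```

The expensive encoder runs $O(N)$ times; the $O(N^2)$ part is reduced to dot products costing $O(d)$ each.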
The RAG Pipeline
RAG combines semantic retrieval with generative models:
- Offline: embed all documents and store vectors in a vector database
- Online: embed the user query, find the $k$ nearest document vectors
- Concatenate the retrieved documents with the query as context for the LLM
- The LLM generates an answer grounded in the retrieved evidence
This lets the model access up-to-date information without retraining.
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Sentence embedding (one sentence) | $O(L \cdot d^2)$ | $O(d)$ | $L$ = token count, $d$ = model dimension (transformer forward pass) |
| Cosine similarity (two vectors) | $O(d)$ | $O(1)$ | $d$ = embedding dimension (384 to 1024) |
| Brute-force nearest neighbor | $O(N \cdot d)$ | $O(N \cdot d)$ | $N$ = number of stored vectors |
| ANN search (HNSW) | $O(\log N \cdot d)$ | $O(N \cdot d)$ | Sub-linear query time with graph-based index |
| Cross-encoder (BERT pairwise) | $O(N \cdot L \cdot d^2)$ | $O(d)$ | Must run a full BERT forward pass for every query-document pair, infeasible at scale |
For a vector database with $N = 10{,}000{,}000$ documents at $d = 384$ dimensions, brute-force search requires $\sim 3.8 \times 10^9$ floating-point operations per query. HNSW reduces this to roughly $10{,}000$ distance computations per query, about a $1{,}000\times$ reduction.
Implementation
ALGORITHM BuildSemanticIndex(documents, encoder)
INPUT: documents: list of N text strings, encoder: sentence transformer model
OUTPUT: index: vector database with N embeddings, docMap: map of vectorId -> document
BEGIN
embeddings <- CREATE matrix [N x d]
docMap <- empty map
FOR i FROM 0 TO N - 1 DO
tokens <- TOKENIZE(documents[i])
hiddenStates <- encoder.FORWARD(tokens) // [seqLen x d]
// Mean pooling over token positions
sentenceVec <- MEAN(hiddenStates, axis=0) // [d]
sentenceVec <- sentenceVec / NORM(sentenceVec) // L2 normalize
embeddings[i] <- sentenceVec
docMap[i] <- documents[i]
END FOR
index <- BUILD_HNSW_INDEX(embeddings)
RETURN index, docMap
END
ALGORITHM SemanticSearch(query, index, docMap, encoder, k)
INPUT: query: string, index: HNSW index, docMap: map, encoder: model, k: integer
OUTPUT: results: list of (document, score) pairs, sorted by similarity
BEGIN
tokens <- TOKENIZE(query)
hiddenStates <- encoder.FORWARD(tokens)
queryVec <- MEAN(hiddenStates, axis=0)
queryVec <- queryVec / NORM(queryVec)
// ANN search returns k nearest neighbor IDs and distances
(ids, distances) <- index.SEARCH(queryVec, k)
results <- empty list
FOR i FROM 0 TO k - 1 DO
score <- 1 - distances[i] // convert distance to similarity
APPEND (docMap[ids[i]], score) TO results
END FOR
RETURN results
END
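The two algorithms above can be condensed into runnable Python. The bag-of-words encoder and exact brute-force search below are deliberate stand-ins: a production system would substitute a trained sentence transformer and an ANN library such as hnswlib or FAISS, and the toy encoder cannot capture synonymy, which is precisely what the trained model adds:

```python
import math

def toy_encode(text, vocab):
    """Stand-in encoder: L2-normalized bag-of-words counts.
    A real pipeline would run a sentence transformer forward pass here."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_semantic_index(documents, vocab):
    """BuildSemanticIndex: embed every document. A plain list stands in
    for the HNSW index, so search below is exact brute force, not ANN."""
    index = [toy_encode(doc, vocab) for doc in documents]
    doc_map = dict(enumerate(documents))
    return index, doc_map

def semantic_search(query, index, doc_map, vocab, k):
    """SemanticSearch: embed the query, rank all documents by cosine similarity."""
    q = toy_encode(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(index)]
    scored.sort(reverse=True)
    return [(doc_map[i], score) for score, i in scored[:k]]

docs = [
    "the cat sat on the mat",
    "stock prices fell sharply",
    "a cat chased a mouse",
]
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.lower().split()}))}
index, doc_map = build_semantic_index(docs, vocab)
print(semantic_search("cat sat on a mat", index, doc_map, vocab, k=2))
```

Swapping `toy_encode` for a real encoder and the list for an ANN index changes nothing else in the control flow, which is the point of the bi-encoder design.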
ALGORITHM RAGPipeline(query, index, docMap, encoder, llm, k)
INPUT: query: string, index: HNSW index, docMap: map, encoder: model,
llm: language model, k: number of documents to retrieve
OUTPUT: answer: generated string grounded in retrieved context
BEGIN
retrievedDocs <- SemanticSearch(query, index, docMap, encoder, k)
context <- ""
FOR EACH (doc, score) IN retrievedDocs DO
context <- context + "\n---\n" + doc
END FOR
prompt <- "Answer the question using the provided context.\n\n"
+ "Context:" + context + "\n\n"
+ "Question: " + query + "\n\nAnswer:"
answer <- llm.GENERATE(prompt)
RETURN answer
END
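Once retrieval is done, RAGPipeline reduces to prompt assembly plus one generation call. A minimal Python sketch with stubbed retrieval and generation (the `fake_retrieve` and `fake_generate` names are hypothetical placeholders for the semantic search above and a real LLM API):

```python
def rag_pipeline(query, retrieve, generate, k=3):
    """Retrieve top-k documents, pack them into a prompt, generate an answer."""
    retrieved = retrieve(query, k)  # list of (document, score) pairs
    context = "".join("\n---\n" + doc for doc, _score in retrieved)
    prompt = (
        "Answer the question using the provided context.\n\n"
        "Context:" + context + "\n\n"
        "Question: " + query + "\n\nAnswer:"
    )
    return generate(prompt)

# Hypothetical stubs standing in for semantic search and an LLM call
def fake_retrieve(query, k):
    return [("Loosen the lug nuts before jacking up the car.", 0.91)][:k]

def fake_generate(prompt):
    # A real LLM would answer from the retrieved context in the prompt
    return "Loosen the lug nuts before jacking the car up."

print(rag_pipeline("How do I change a flat tire?", fake_retrieve, fake_generate))
```

Passing the retriever and generator as functions keeps the pipeline testable: either component can be swapped without touching the prompt assembly.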
Real-World Applications
- Semantic search engines: Products like Google Search, Bing, and internal enterprise search use sentence embeddings to match queries to documents by meaning, not just keywords
- RAG-powered chatbots: Customer support bots retrieve relevant knowledge base articles via semantic search before generating answers, reducing hallucination
- Duplicate detection: Stack Overflow, Quora, and support ticket systems use semantic similarity to flag duplicate questions, even when phrased differently
- Plagiarism detection: Academic tools compare sentence embeddings to detect paraphrased content that keyword-based systems miss
- Recommendation systems: Content platforms embed articles and user queries into the same space, recommending semantically related content
- Legal document review: Law firms use semantic search to find relevant precedents across millions of case documents, replacing manual keyword searches
Key Takeaways
- Semantic similarity measures meaning closeness between texts, solving the vocabulary mismatch problem that defeats keyword search
- Sentence transformers produce independent embeddings for each text, enabling comparison via cheap cosine similarity instead of expensive pairwise model inference
- Vector databases with ANN indices (HNSW, IVF) make semantic search over millions of documents feasible in milliseconds
- RAG pipelines combine semantic retrieval with generative models, grounding LLM outputs in retrieved evidence and reducing hallucination
- The two-stage retrieve-then-rerank pattern (fast bi-encoder retrieval followed by accurate cross-encoder reranking) balances speed and quality in production systems