You’ve turned text into numbers. Every sentence is now a point in 768-dimensional space. Cool.
But how do you know which points are near each other? When someone searches “how to cook biryani,” how does the system know that a document about “Hyderabadi dum biryani recipe” is closer than one about “history of Indian railways”?
You need a distance metric: a mathematical ruler for meaning. And the one you pick changes everything about your search quality.
The Three Rulers
There are three main ways to measure similarity between vectors. Each sees the world differently.
1. Cosine Similarity: The Angle Measurer
Cosine similarity measures the angle between two vectors. It doesn’t care about length, only direction.
```
              B (0.8, 0.6)
             /
            /   θ = small angle → high similarity
           /
          /
Origin --------→ A (0.4, 0.3)
```
Formula:
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
In plain English: take the dot product, divide by both magnitudes. Result is between -1 and 1:
| Score | Meaning |
|---|---|
| 1.0 | Identical direction (same meaning) |
| 0.0 | Perpendicular (unrelated) |
| -1.0 | Opposite direction (opposite meaning) |
Why it works for embeddings: Embedding models produce vectors where direction encodes meaning and magnitude can vary based on text length. Cosine similarity ignores magnitude, focusing purely on what the text means.
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
a = np.array([0.4, 0.3])
b = np.array([0.8, 0.6])  # Same direction, different magnitude
print(cosine_similarity(a, b))  # 1.0 -> identical meaning!
```
2. Dot Product: The Magnitude-Aware Scorer
The dot product (also called inner product) multiplies corresponding elements and sums them:
dot(A, B) = A₁×B₁ + A₂×B₂ + ... + Aₙ×Bₙ
Unlike cosine similarity, dot product cares about both direction AND magnitude. A longer vector gets a higher score.
```python
import numpy as np

a = np.array([0.4, 0.3])
b = np.array([0.8, 0.6])
c = np.array([4.0, 3.0])  # Same direction as a, 10x longer

print(np.dot(a, b))  # 0.5
print(np.dot(a, c))  # 2.5 -> higher because c is longer!
```
When magnitude matters: If you embed products and the vector magnitude correlates with popularity or importance, dot product naturally boosts popular items. Some recommendation systems exploit this.
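As a sketch of that effect (the popularity weighting here is hypothetical, not from any particular system): scaling an item's embedding changes its dot-product score but leaves its cosine score untouched.

```python
import numpy as np

query = np.array([1.0, 0.0])
niche   = np.array([0.9, 0.1])        # relevant item, unweighted
popular = np.array([0.9, 0.1]) * 3.0  # same meaning, boosted 3x by a popularity weight

cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine ranking: a tie, since the direction is identical
print(np.isclose(cos(query, niche), cos(query, popular)))  # True -> cosine can't tell them apart

# Dot-product ranking: the boosted item wins
print(np.dot(query, popular) > np.dot(query, niche))       # True
```

This is the whole trade-off in two prints: dot product lets magnitude carry extra signal, cosine deliberately throws it away.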
Pro tip: If your vectors are normalized (length = 1), cosine similarity and dot product give identical results. Most embedding models output normalized vectors, so the choice often doesn’t matter.
3. Euclidean Distance: The Straight-Line Measurer
Euclidean distance is the straight-line distance between two points, the one you learned in school:
distance = √((A₁-B₁)² + (A₂-B₂)² + ... + (Aₙ-Bₙ)²)
```
B •
  | \
  |  \   distance = 5
  |   \
A •----+
   legs 3 and 4 (a 3-4-5 triangle)
```
Important: Lower distance = more similar. This is the opposite of cosine similarity where higher = more similar.
```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean_distance(a, b))  # 5.0
```
Head-to-Head Comparison
| Metric | Range | Higher = | Cares About Magnitude? | Speed | Most Used In |
|---|---|---|---|---|---|
| Cosine Similarity | [-1, 1] | More similar | No | Fast | Semantic search, RAG |
| Dot Product | (-∞, +∞) | More similar | Yes | Fastest | Recommendations, ranking |
| Euclidean Distance | [0, +∞) | Less similar | Yes | Medium | Clustering, anomaly detection |
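One identity ties the table together: for unit-length vectors, ||A − B||² = 2 − 2·cos(A, B), so Euclidean distance and cosine similarity produce the same ranking, just reversed. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(768); a /= np.linalg.norm(a)  # random unit vector
b = rng.standard_normal(768); b /= np.linalg.norm(b)

cos = np.dot(a, b)                  # cosine == dot product for unit vectors
euc_sq = np.linalg.norm(a - b) ** 2

# Identity for unit vectors: ||a - b||^2 = 2 - 2*cos(a, b)
print(np.isclose(euc_sq, 2 - 2 * cos))  # True
```

This is why sorting by ascending Euclidean distance and descending cosine similarity returns the same order once vectors are normalized.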
Visual Intuition: When Each Metric Disagrees
Here’s the scenario that makes the difference concrete:
```
  ^ dim 2
  |               • B (far, but same direction as A)
  |              /
  |             /
  |   C •      /
  |        A •
  +--------------------> dim 1
```
- A = “I like cricket”
- B = “I like cricket I like cricket I like cricket” (same meaning, longer text)
- C = “I enjoy tennis” (different topic, but nearby in space)
| Metric | A vs B | A vs C | Winner for A’s nearest neighbor |
|---|---|---|---|
| Cosine | 1.0 (perfect) | 0.85 | B |
| Euclidean | 4.2 (far) | 1.1 (close) | C |
| Dot Product | 3.8 (high) | 0.9 (low) | B |
Cosine and dot product say B is more similar (same meaning). Euclidean says C is more similar (physically closer in space). For text search, cosine is almost always right.
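The disagreement is easy to reproduce with toy vectors (illustrative 2D stand-ins for A, B, and C, not real embeddings):

```python
import numpy as np

A = np.array([1.0, 1.0])   # "I like cricket"
B = A * 4.0                # same direction, much larger magnitude (repeated text)
C = np.array([0.2, 1.6])   # different direction, but physically near A

cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(A, B), cos(A, C))                          # cosine: B wins (1.0 vs ~0.79)
print(np.linalg.norm(A - B), np.linalg.norm(A - C))  # euclidean: C wins (smaller distance)
print(np.dot(A, B), np.dot(A, C))                    # dot product: B wins (8.0 vs 1.8)
```

Swap in real embeddings of the three sentences and the same pattern shows up, just with less extreme numbers.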
The Normalized Vector Trick
Here’s a secret most tutorials skip: if you normalize your vectors to unit length, all three metrics become equivalent for ranking.
```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

a = np.array([0.4, 0.3])
b = np.array([0.8, 0.6])
a_norm = normalize(a)
b_norm = normalize(b)

# These now give the SAME ranking:
cosine = np.dot(a_norm, b_norm)              # cosine similarity
dot = np.dot(a_norm, b_norm)                 # dot product (same!)
euclidean = np.linalg.norm(a_norm - b_norm)  # euclidean (inverse, but same ranking)
```
Most modern embedding models (OpenAI, Cohere, nomic-embed) return normalized vectors by default. So in practice, cosine similarity = dot product, and you can use whichever your vector database optimizes for. Qdrant, for example, is fastest with dot product on normalized vectors.
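Before leaning on that equivalence, it's worth verifying that your embeddings really are unit-length. A small model-agnostic sketch (the vectors and the `is_normalized` helper are made up for illustration):

```python
import numpy as np

def is_normalized(vectors, tol=1e-3):
    """Check whether each row of an (n, d) embedding matrix has unit length."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

# Toy stand-ins for real model output
unit = np.array([[0.6, 0.8], [1.0, 0.0]])  # rows have length 1
raw  = np.array([[3.0, 4.0], [1.0, 0.0]])  # first row has length 5

print(is_normalized(unit))  # True
print(is_normalized(raw))   # False
```

Run this once on a batch of your model's output; if it returns False, normalize before treating dot product as cosine.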
The Curse of Dimensionality
One thing that breaks your intuition: in high dimensions, everything looks equally far apart.
In 2D, if you throw 100 random points in a box, some will cluster. In 768 dimensions? The distances between random points converge to nearly the same value.
```python
# Demonstration: pairwise distances between random points
import numpy as np

dims = [2, 10, 100, 768]
for d in dims:
    points = np.random.randn(100, d)  # 100 random points per dimensionality
    distances = []
    for i in range(100):
        for j in range(i + 1, 100):
            distances.append(np.linalg.norm(points[i] - points[j]))
    print(f"Dims={d:4d}  Mean dist: {np.mean(distances):.2f}  "
          f"Std: {np.std(distances):.2f}  Ratio: {np.std(distances)/np.mean(distances):.3f}")
```
```
Dims=   2  Mean dist:  1.38  Std: 0.52  Ratio: 0.377
Dims=  10  Mean dist:  4.32  Std: 0.65  Ratio: 0.150
Dims= 100  Mean dist: 14.01  Std: 0.71  Ratio: 0.051
Dims= 768  Mean dist: 39.12  Std: 0.72  Ratio: 0.018
```
At 768 dimensions, the standard deviation is just 1.8% of the mean: everything is roughly the same distance apart. This is why cosine similarity works better than Euclidean in high dimensions; it ignores the magnitude problem and focuses on angle, which still varies meaningfully.
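Why the ratio shrinks: assuming standard-normal coordinates (as in the demo above), the mean pairwise distance grows like √(2d) while the spread stays roughly constant, so their ratio falls toward zero. A quick check:

```python
import numpy as np

rng = np.random.default_rng(42)
ratios = []
for d in [2, 100, 768]:
    x = rng.standard_normal((200, d))
    y = rng.standard_normal((200, d))
    dist = np.linalg.norm(x - y, axis=1)  # 200 pairwise distances
    # Mean tracks sqrt(2d) as d grows; the std barely moves.
    ratios.append(dist.std() / dist.mean())
    print(f"d={d:4d}  mean={dist.mean():.2f}  sqrt(2d)={np.sqrt(2*d):.2f}  std={dist.std():.2f}")
```

The √(2d) figure comes from the fact that the difference of two standard normal vectors has variance 2 per coordinate, summed over d coordinates.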
Which Metric Should You Pick?
Decision tree:
```
Are your vectors normalized?
├── Yes → Use dot product (fastest, equivalent to cosine)
└── No
    ├── Semantic search / RAG → Cosine similarity
    ├── Recommendations with popularity → Dot product
    └── Clustering / anomaly detection → Euclidean distance
```
The 90% answer: Use cosine similarity. If your vectors are normalized (most are), use dot product for speed. You’ll rarely need Euclidean distance for text applications.
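Putting the default into practice, here's a minimal brute-force top-k cosine search. The corpus vectors and doc labels are illustrative, and a real system would hand this job to a vector database:

```python
import numpy as np

def top_k_cosine(query, corpus, k=2):
    """Brute-force top-k search: normalize, then rank by dot product."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity against every doc at once
    idx = np.argsort(scores)[::-1][:k]  # highest first
    return idx, scores[idx]

# Toy 3D "embeddings" (illustrative, not from a real model)
corpus = np.array([
    [0.9, 0.1, 0.0],   # doc 0: biryani recipes
    [0.8, 0.2, 0.1],   # doc 1: dum biryani
    [0.0, 0.1, 0.9],   # doc 2: Indian railways
])
query = np.array([0.85, 0.15, 0.05])   # "how to cook biryani"

idx, scores = top_k_cosine(query, corpus)
print(idx)  # the two biryani docs rank above the railways doc
```

Normalizing once up front means the inner loop is a single matrix-vector product, which is exactly the optimization vector databases make when you store unit vectors.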
Try It Yourself
```python
import numpy as np

# Three "embeddings" (simplified to 5D)
king   = np.array([ 0.8,  0.3,  -0.2,  0.9,  0.1 ])
queen  = np.array([ 0.75, 0.35, -0.15, 0.85, 0.15])
banana = np.array([-0.1,  0.9,   0.7, -0.3,  0.2 ])

def all_metrics(a, b):
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot = np.dot(a, b)
    euc = np.linalg.norm(a - b)
    return f"Cosine: {cos:.3f}  Dot: {dot:.3f}  Euclidean: {euc:.3f}"

print("King vs Queen: ", all_metrics(king, queen))
# Cosine: 0.997  Dot: 1.515  Euclidean: 0.112
print("King vs Banana:", all_metrics(king, banana))
# Cosine: -0.132  Dot: -0.200  Euclidean: 1.852
```
King and queen: cosine 0.997 (nearly identical meaning). King and banana: cosine -0.132 (unrelated). The math captures meaning.
Practical Takeaways
- Cosine similarity is the default for text: it measures direction (meaning) and ignores magnitude (text length)
- Normalized vectors make the metrics interchangeable: most embedding models normalize by default
- Dot product is fastest: prefer it when vectors are normalized
- Euclidean distance struggles in high dimensions: the curse of dimensionality makes distances converge
- Your vector database choice matters: each DB optimizes for specific metrics
- When in doubt, use cosine: it’s the most robust for semantic similarity