Vector Similarity: How Machines Measure “Closeness” of Meaning

On: April 7, 2026

You’ve turned text into numbers. Every sentence is now a point in 768-dimensional space. Cool.

But how do you know which points are near each other? When someone searches “how to cook biryani,” how does the system know that a document about “Hyderabadi dum biryani recipe” is closer than one about “history of Indian railways”?

You need a distance metric, a mathematical ruler for meaning. And the one you pick changes everything about your search quality.



🧠

The Three Rulers

There are three main ways to measure similarity between vectors. Each sees the world differently.

1. Cosine Similarity: The Angle Measurer

Cosine similarity measures the angle between two vectors. It doesn’t care about length, only direction.

                  B (0.8, 0.6)
                /
              /  θ = small angle → high similarity
            /
          /
Origin --------→ A (0.4, 0.3)

Formula:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

In plain English: take the dot product, divide by both magnitudes. Result is between -1 and 1:

Score   Meaning
 1.0    Identical direction (same meaning)
 0.0    Perpendicular (unrelated)
-1.0    Opposite direction (opposite meaning)

Why it works for embeddings: Embedding models produce vectors where direction encodes meaning and magnitude can vary based on text length. Cosine similarity ignores magnitude, focusing purely on what the text means.

python

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
a = np.array([0.4, 0.3])
b = np.array([0.8, 0.6])  # Same direction, different magnitude
print(cosine_similarity(a, b))  # ~1.0, identical meaning!

2. Dot Product: The Magnitude-Aware Scorer

The dot product (also called inner product) multiplies corresponding elements and sums them:

dot(A, B) = A₁×B₁ + A₂×B₂ + ... + Aₙ×Bₙ

Unlike cosine similarity, dot product cares about both direction AND magnitude. A longer vector gets a higher score.

python

a = np.array([0.4, 0.3])
b = np.array([0.8, 0.6])
c = np.array([4.0, 3.0])  # Same direction as a, 10x longer

print(np.dot(a, b))  # 0.5
print(np.dot(a, c))  # 2.5, higher because c is longer!

When magnitude matters: If you embed products and the vector magnitude correlates with popularity or importance, dot product naturally boosts popular items. Some recommendation systems exploit this.
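As a minimal sketch, with made-up 2-D "product" vectors (not real embeddings), here is how a vector scaled up at index time wins under dot product:

```python
import numpy as np

# Hypothetical product vectors: same direction, but the popular item's
# vector has been scaled up 3x at index time (illustrative values only)
query   = np.array([3.0, 4.0])
niche   = np.array([3.0, 4.0])
popular = np.array([9.0, 12.0])

print(np.dot(query, niche))    # 25.0
print(np.dot(query, popular))  # 75.0 -- same direction, 3x the score
```

Cosine similarity would score both items identically (1.0), since they point the same way; only dot product rewards the longer vector.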

Pro tip: If your vectors are normalized (length = 1), cosine similarity and dot product give identical results. Most embedding models output normalized vectors โ€” so the choice often doesn’t matter.

3. Euclidean Distance: The Straight-Line Measurer

Euclidean distance is the straight-line distance between two points, the formula you learned in school:

distance = √((A₁-B₁)² + (A₂-B₂)² + ... + (Aₙ-Bₙ)²)

     B •
     |  \
     |   \ distance = 5
     |    \
     A •---+
       3    4  (3-4-5 triangle)

Important: Lower distance = more similar. This is the opposite of cosine similarity where higher = more similar.

python

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
print(euclidean_distance(a, b))  # 5.0


⚡

Head-to-Head Comparison
Metric               Range       Higher =       Cares About Magnitude?   Speed     Most Used In
Cosine Similarity    [-1, 1]     More similar   No                       Fast      Semantic search, RAG
Dot Product          (-∞, +∞)    More similar   Yes                      Fastest   Recommendations, ranking
Euclidean Distance   [0, +∞)     Less similar   Yes                      Medium    Clustering, anomaly detection
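The "Higher =" column matters in code: two metrics sort descending, one ascending. A hedged sketch of a ranking helper (`rank` is a hypothetical function, not a library API) that handles the sign conventions:

```python
import numpy as np

def rank(query, docs, metric):
    """Return row indices of docs, most similar to query first."""
    if metric == "cosine":
        q = query / np.linalg.norm(query)
        d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        return np.argsort(-(d @ q))         # higher score = more similar
    if metric == "dot":
        return np.argsort(-(docs @ query))  # higher score = more similar
    if metric == "euclidean":
        dists = np.linalg.norm(docs - query, axis=1)
        return np.argsort(dists)            # LOWER distance = more similar

query = np.array([1.0, 0.0])
docs = np.array([[2.0, 0.0],   # same direction as query, but longer
                 [0.9, 0.1],   # physically close to query
                 [0.0, 1.0]])  # perpendicular to query
for m in ("cosine", "dot", "euclidean"):
    print(m, rank(query, docs, m))
```

Note how cosine and dot rank the same-direction document first, while Euclidean favors the physically closest one.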


🔧

Visual Intuition: When Each Metric Disagrees

Here’s the scenario that makes the difference concrete:

           ^ dim 2
           |
     C •   |        • B (far but same direction as A)
           |      /
           |    /
           |  /
           A •
           +--------------------> dim 1

  • A = “I like cricket”
  • B = “I like cricket I like cricket I like cricket” (same meaning, longer text)
  • C = “I enjoy tennis” (different topic, but nearby in space)

Metric        A vs B          A vs C         A’s nearest neighbor
Cosine        1.0 (perfect)   0.85           B
Euclidean     4.2 (far)       1.1 (close)    C
Dot Product   3.8 (high)      0.9 (low)      B

Cosine and dot product say B is more similar (same meaning). Euclidean says C is more similar (physically closer in space). For text search, cosine is almost always right.
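The numbers in the table are illustrative, but the disagreement is easy to reproduce with toy 2-D stand-in vectors (these are not real embeddings, just the same geometry):

```python
import numpy as np

A = np.array([1.0, 1.0])   # "I like cricket"
B = np.array([4.0, 4.0])   # same direction as A, much longer
C = np.array([0.5, 1.5])   # different direction, physically near A

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Cosine picks B (same direction); Euclidean picks C (closer in space)
print("cosine    A-B:", cos(A, B),              "A-C:", cos(A, C))
print("euclidean A-B:", np.linalg.norm(A - B),  "A-C:", np.linalg.norm(A - C))
```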



📊

The Normalized Vector Trick

Here’s a secret most tutorials skip: if you normalize your vectors to unit length, all three metrics become equivalent for ranking.

python

def normalize(v):
    return v / np.linalg.norm(v)

a = np.array([0.4, 0.3])
b = np.array([0.8, 0.6])
a_norm = normalize(a)
b_norm = normalize(b)

# These now give the SAME ranking:
cosine = np.dot(a_norm, b_norm)          # cosine similarity
dot = np.dot(a_norm, b_norm)              # dot product (same!)
euclidean = np.linalg.norm(a_norm - b_norm)  # euclidean (inverse, but same ranking)

Most modern embedding models (OpenAI, Cohere, nomic-embed) return normalized vectors by default. So in practice, cosine similarity = dot product, and you can use whichever your vector database optimizes for. Qdrant, for example, is fastest with dot product on normalized vectors.
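Why the rankings line up on unit vectors: for normalized a and b, ||a − b||² = 2 − 2·(a·b), so Euclidean distance is a monotone (reversed) function of cosine. A quick check with random vectors, just to verify the algebra:

```python
import numpy as np

# Two random 768-d vectors, normalized to unit length
rng = np.random.default_rng(0)
a = rng.standard_normal(768); a /= np.linalg.norm(a)
b = rng.standard_normal(768); b /= np.linalg.norm(b)

cos = np.dot(a, b)            # cosine similarity (== dot product here)
dist = np.linalg.norm(a - b)

# smaller distance <=> larger cosine, so all three metrics agree on ranking
print(np.isclose(dist**2, 2 - 2 * cos))  # True
```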



🎯

The Curse of Dimensionality

One thing that breaks your intuition: in high dimensions, everything looks equally far apart.

In 2D, if you throw 100 random points in a box, some will cluster. In 768 dimensions? The distances between random points converge to nearly the same value.

python

# Demonstration
import numpy as np

dims = [2, 10, 100, 768]
for d in dims:
    points = np.random.randn(100, d)  # 100 random points in d dimensions
    distances = []
    for i in range(100):
        for j in range(i + 1, 100):
            distances.append(np.linalg.norm(points[i] - points[j]))
    print(f"Dims={d:4d}  Mean dist: {np.mean(distances):.2f}  "
          f"Std: {np.std(distances):.2f}  Ratio: {np.std(distances)/np.mean(distances):.3f}")

Sample output (exact values vary from run to run):

Dims=   2  Mean dist: 1.38  Std: 0.52  Ratio: 0.377
Dims=  10  Mean dist: 4.32  Std: 0.65  Ratio: 0.150
Dims= 100  Mean dist: 14.01  Std: 0.71  Ratio: 0.051
Dims= 768  Mean dist: 39.12  Std: 0.72  Ratio: 0.018

At 768 dimensions, the standard deviation is just 1.8% of the mean: everything is roughly the same distance apart. This is why cosine similarity works better than Euclidean in high dimensions; it ignores the magnitude problem and focuses on angle, which still varies meaningfully.



💡

Which Metric Should You Pick?

Decision tree:

Are your vectors normalized?
├── Yes → Use dot product (fastest, equivalent to cosine)
└── No
    ├── Semantic search / RAG → Cosine similarity
    ├── Recommendations with popularity → Dot product
    └── Clustering / anomaly detection → Euclidean distance

The 90% answer: Use cosine similarity. If your vectors are normalized (most are), use dot product for speed. You’ll rarely need Euclidean distance for text applications.
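The first branch of the tree is easy to automate. As a sketch, `pick_metric` is a hypothetical helper (not a library function) that checks whether vectors are already unit length:

```python
import numpy as np

def pick_metric(vectors):
    """Hypothetical helper mirroring the decision tree's first branch."""
    norms = np.linalg.norm(vectors, axis=1)
    if np.allclose(norms, 1.0, atol=1e-3):
        return "dot"     # already unit length: dot product is fastest
    return "cosine"      # safe default for text embeddings

print(pick_metric(np.array([[0.6, 0.8], [1.0, 0.0]])))  # dot
print(pick_metric(np.array([[3.0, 4.0]])))              # cosine
```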



🔬

Try It Yourself

python

import numpy as np

# Three "embeddings" (simplified to 5D)
king   = np.array([ 0.8,  0.3, -0.2,  0.9,  0.1])
queen  = np.array([ 0.75, 0.35, -0.15, 0.85, 0.15])
banana = np.array([-0.1,  0.9,  0.7, -0.3,  0.2])

def all_metrics(a, b):
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot = np.dot(a, b)
    euc = np.linalg.norm(a - b)
    return f"Cosine: {cos:.3f}  Dot: {dot:.3f}  Euclidean: {euc:.3f}"

print("King vs Queen: ", all_metrics(king, queen))
# Cosine: 0.997  Dot: 1.515  Euclidean: 0.112

print("King vs Banana:", all_metrics(king, banana))
# Cosine: -0.132  Dot: -0.200  Euclidean: 1.852

King and queen: cosine 0.997 (nearly identical meaning). King and banana: cosine -0.132 (unrelated). The math captures meaning.



🛡️

Practical Takeaways
  1. Cosine similarity is the default for text: it measures direction (meaning) and ignores magnitude (text length)
  2. Normalized vectors make metrics interchangeable: most embedding models normalize by default
  3. Dot product is fastest: prefer it when vectors are normalized
  4. Euclidean distance struggles in high dimensions: the curse of dimensionality makes distances converge
  5. Your vector database choice matters: each DB optimizes for specific metrics
  6. When in doubt, use cosine: it’s the most robust for semantic similarity

Pratik Shinde

Pratik Shinde is a DevOps and Cloud professional based in Pune, Maharashtra, India, with hands-on experience in building and managing scalable systems. He has a strong working background in DevOps, Kubernetes, and cloud platforms, along with practical exposure to artificial intelligence and machine learning concepts. Pratik actively explores emerging areas such as Generative AI and AI Agents, focusing on real-world applications rather than theory. Through his blog, pratikshinde.online, he shares clear, practical insights to help beginners and professionals understand AI, ML, and modern cloud technologies. He also shares knowledge and learning resources on platforms like LinkedIn and other social channels, aiming to simplify complex topics and make them accessible to a wider audience.
