What is the cosine distance between Sean and Shawn?
Written on 2024-04-30
Hi, I'm Chris Dawson, a dad and writer from Portland, OR, now living in Florida. My book is Building Tools with GitHub from O'Reilly. I'm an inventor, have started several companies, and have worked for several non-startups like Apple and eBay. I am sometimes available for hire as a consultant or part-time contributor.

This post is about embeddings, a foundational concept to understand for machine learning and AI. After reading this post, you’ll know all about embeddings, as well as cosine distance calculations. That’s just some bonus learning right there. Don’t worry, your brain can handle it. Once you’ve read it, you’ll know how to reason about embeddings, be able to do the calculations by hand if you want to, and realize those two things are not that hard to understand. Embeddings are everywhere these days. You’re encountering them constantly, and they’re very useful in AI, especially with RAG (retrieval augmented generation). I’m betting this will be a big topic at the AI.engineer World’s Fair in June (use discount code LATENTSPACE to save $100).

If you get nothing else from this post, scroll all the way down and click on the buttons at the bottom of the post, and then click “start.” Doing that will show you a fun animation of the cosine distance function. However! This is a post worth reading all the way through (trust me, people) because embeddings are important, powerful, and easy to grasp.

Oh, and everything in this post happens inside your browser! There are no remote connections other than to download the embedding model.

A week ago I sent an email that was tangentially related to embeddings. The recipient is someone I highly respect, and someone that I would say is about 2 or 3 degrees away on the Kevin Bacon degree scale from me. I’ve had two conversations with him over video. I would not say we are friends and I would not ask to crash at his place if I visited San Francisco. I think of myself as a Val Kilmer to him (if we are only permitting strict connections through film and not off-broadway).

I made a fatal mistake, however: I misspelled his name in the email invitation. He goes by Shawn, but I wrote Sean. I’m 93% sure this is why he didn’t write back, he is probably so sick of being misnamed into a group with people like Sean Penn, Sean Connery or Sean Young. I’m worried I became a Craig T Nelson to him.

There is a good reason why I made this mistake, and it is because my brain stores that information in a meaty embedding form. A few years ago I recall hearing the term embedding and feeling very stupid. But now I know what an embedding is. An embedding is just a vector, which, said even more plainly, is just a list of numbers. That list of numbers is generated by a model that takes some textual input and converts it into this list of numbers, and those numbers represent some semantic meaning. That might not sound interesting, but if you have two lists of numbers, then you can do a fun calculation that tells you how close those lists of numbers are. And that means you can tell how semantically close two pieces of text are to each other. And, in my brain, Sean and Shawn are stored as slimy unordered meaty chunked up embeddings, and my bio-GPU (Gloppy Poor-Understandinger) calculates the distance between those globs as a “very close” result.

For example, enter a phrase into this text box, and then click the button to generate an embedding for the phrase.

The code to do this in JavaScript is really simple!

// npm add @xenova/transformers
import { pipeline, env } from '@xenova/transformers';

env.allowLocalModels = false; // load remote models, not local ones

// Load a small sentence-embedding model that runs in the browser
const extractor = await pipeline('feature-extraction',
                                 'Xenova/all-MiniLM-L6-v2');

// `text` is the phrase from the text box above
const { data } = await extractor(text,
                                 { pooling: 'mean',
                                   normalize: true });
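
If you want to poke at the result, the `data` you get back is nothing exotic: just a flat typed array of 384 numbers. A minimal sketch, assuming the same extractor call as above:

// The embedding is a flat Float32Array of 384 numbers
console.log(data.length);                   // 384
console.log(Array.from(data.slice(0, 5)));  // peek at the first five values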

This embedding has 384 dimensions, a standard size for embeddings used in many AI applications. What does that mean for semantic encoding? Well (and someone smarter than me can correct me if I’m wrong), dimensions categorize the semantics into distinct facets. The more dimensions you have, the more facets. For example, if you had only four dimensions and you were talking about clothing, you could say one facet is the color of the clothing, one is the intricacy, one is the formality, and one is the fabric. Then you could categorize, say, a tuxedo with one set of numbers, a swimsuit with a different set of numbers, and a wedding gown with an entirely different set of numbers. Then you’d be able to see which of those clothing items are more similar to the others by looking at the dimensions you’ve assigned, as sketched below.
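
Here’s a toy sketch of that idea. The numbers are completely made up, just to show how a handful of facets could be written down as a vector:

// Hypothetical 4-dimensional "clothing embeddings" (made-up values)
// [color darkness, intricacy, formality, fabric weight]
const tuxedo      = [0.9, 0.7, 0.95, 0.6];
const weddingGown = [0.1, 0.9, 0.95, 0.5];
const swimsuit    = [0.4, 0.1, 0.05, 0.1];
// Eyeballing the formality facet alone, the tuxedo and the gown
// sit close together and far away from the swimsuit.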

What does it mean that this embedding has 384 dimensions? We humans normally have a hard time thinking past four dimensions (length, width, height, and time). Let’s constrain ourselves to only two dimensions. We could imagine a model that only encodes into two dimensions, and we can easily plot that on the Cartesian plane we learned about in middle school (see below). By clicking the buttons to change coordinates, you can see how the cosine distance changes in relation to how close the vectors are on the graph. Pretty neat.

[Interactive demo: a 2-D plot with a “Cosine Distance” readout (e.g. 0.73) and buttons for example vector pairs such as [-0.22, 0.81] and [0.23, 0.44], [0.34, -0.12] and [-0.12, 0.67], [-1, 0] and [-0.9, 0.2], and [-1, -1] and [1, 1].]

You can see how the closer the vectors are, the higher the similarity score. When the vectors point in diametrically opposite directions, the score is -1. When they point in almost exactly the same direction, the score is almost 1. That means the embeddings, the semantic meanings, are very closely related.

That raises the question: how do you calculate that value? To do that, you use a distance (or similarity) function. There are many of these: Hamming distance, the Tanimoto coefficient, and so on. The most common one in the AI/ML world, however, is cosine similarity (often loosely called cosine distance). And even without a library, it’s really easy to calculate in code, or even by hand.

In math notation, it looks like this:

$$\mathrm{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Honestly, I hate that. I think it is much easier to read code than that. So, here is the code:

function cosineSimilarity(A, B) {
    // A and B must be vectors with the same number of dimensions
    let dotproduct = 0;
    let mA = 0;
    let mB = 0;

    for (let i = 0; i < A.length; i++) {
        dotproduct += A[i] * B[i]; // accumulate the dot product
        mA += A[i] * A[i];         // accumulate the squared magnitude of A
        mB += B[i] * B[i];         // accumulate the squared magnitude of B
    }

    mA = Math.sqrt(mA);
    mB = Math.sqrt(mB);
    const similarity = dotproduct / (mA * mB);
    return similarity;
}
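
As a quick sanity check, here it is run on two of the 2-D pairs from the buttons above:

// Diametrically opposite vectors: dot = (-1)(1) + (-1)(1) = -2,
// both magnitudes are sqrt(2), so -2 / (sqrt(2) * sqrt(2)) = -1
console.log(cosineSimilarity([-1, -1], [1, 1]));     // -1

// Nearly the same direction, so the score is close to 1
console.log(cosineSimilarity([-1, 0], [-0.9, 0.2])); // ≈ 0.976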

By the way, Python’s numpy makes this really simple:

import numpy as np
from numpy.linalg import norm

A = np.array(embedding1)
B = np.array(embedding2)
cosine_sim = np.dot(A, B) / (norm(A) * norm(B))

Python and numpy are simple and powerful, and the JavaScript code is not that complex either. To put it into plain English, the algorithm goes like this: if you have two vectors with the exact same number of dimensions, iterate over them. For each pair of items from A and B, multiply them together and add that to the dot product. Square the nth item of A and add it to a running sum, and do the same for the nth item of B. Then take the square root of each of those sums; those are the magnitudes of A and B. Finally, the similarity score is the dot product divided by the product of the two magnitudes. I think the code is easier to read than my explanation or the math notation, but perhaps this helped that special someone out there who needed it spelled out.

Now, the result of the cosine similarity function is a number between -1 and 1. The higher it is, the “closer” semantically the phrases are. For example, a value of 0.6 or above usually indicates semantically similar phrases (in my non-scientific observations).
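
A quick terminology note: what many libraries report as cosine distance is conventionally just 1 minus this similarity, so 0 means pointing the same way and 2 means pointing in opposite directions. If you wanted that form, a minimal sketch reusing the function above would be:

// Cosine distance, defined as 1 minus cosine similarity
const cosineDistance = (A, B) => 1 - cosineSimilarity(A, B);

console.log(cosineDistance([-1, -1], [1, 1])); // 2 (opposite directions)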

The best way for me to prove I understand a concept like “cosine similarity” is to make something with it, so what follows is an animated version of the algorithm. For this animation, imagine that you took a 384-dimensional embedding and converted it into a physical stack of coins, where each coin has the embedding value printed on the front. Then, drop the stack of coins on a table, with coins closer to a negative value landing to the left and coins closer to a positive value landing to the right, all the while maintaining their stack order. The animation simulates pulling the top coin off each stack and running that coin’s value through the steps of the cosine similarity algorithm, roughly as sketched below. All of this is written in Svelte (in a svekyll blog). You can see the source by clicking on the view source link below to play and tweak it to your heart’s content.
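
Here is a rough sketch of what each animation step does, one dimension (one coin) at a time. The variable and function names are mine, not the ones in the Svelte source:

// Running totals, updated one "coin" (dimension) at a time
let dot = 0, sumSqA = 0, sumSqB = 0;

// Called once per animation step with the top coin from each stack
function takeCoin(a, b) {
  dot    += a * b;
  sumSqA += a * a;
  sumSqB += b * b;
  // The similarity so far, using only the coins processed to date
  return dot / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
}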

Please generate two embeddings by clicking the "Generate Embedding" buttons.

Remember, 0.6 or above generally indicates semantic similarity (and these phrases are semantically similar, as we can see without math). If you refresh the page and retry with your own phrases that are not semantically proximate, you’ll find values under 0.5.

Wrapping up, embeddings and cosine distance functions are fun and really valuable to understand. If you have comments about this post, please email me at chris at extrastatic.com.

Thanks to Shawn and Sean for reviewing and editing this post.
View source to this post