#embedding


I’m excited to share my newest blog post, "Don't use cosine similarity carelessly"

p.migdal.pl/blog/2025/01/dont-

We often rely on cosine similarity to compare embeddings—it's like “duct tape” for vector comparisons. But just like duct tape, it can quietly mask deeper problems. Sometimes, embeddings pick up a “wrong kind” of similarity, matching questions to questions instead of questions to answers or getting thrown off by formatting quirks and typos rather than the text's real meaning.

In my post, I discuss what can go wrong with off-the-shelf cosine similarity and share practical alternatives. If you’ve ever wondered why your retrieval system returns oddly matched items or how to refine your embeddings for more meaningful results, this is for you!
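A minimal sketch of the failure mode described above, assuming a generic embedding model; the toy vectors below are made-up stand-ins for model output, not anything from the blog post:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity is the dot product of L2-normalized vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Hypothetical embeddings standing in for what a model might return.
question       = np.array([0.9, 0.1, 0.3])    # "How do I reset my password?"
other_question = np.array([0.85, 0.15, 0.35])  # "How can I change my password?"
actual_answer  = np.array([0.2, 0.8, 0.4])     # "Go to Settings -> Security -> Reset."

# Off-the-shelf cosine similarity can rank a paraphrased question
# above the answer we actually want to retrieve.
print(cosine_similarity(question, other_question))  # high: question-to-question
print(cosine_similarity(question, actual_answer))   # lower: question-to-answer
```

When that happens, raw cosine over generic embeddings is measuring the wrong kind of similarity for the task, which is exactly the case the post argues you should watch for.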
I want to thank Max Salamonowicz and Grzegorz Kossakowski for their feedback after my flash talk at the Warsaw AI Breakfast, Rafał Małanij for inviting me to speak at the Python Summit, and everyone who asked curious questions at the conference and on LinkedIn.

p.migdal.pl · Don't use cosine similarity carelessly · Cosine similarity - the duct tape of AI. Convenient but often misused. Let's find out how to use it better.

The current revelation that LLMs can’t reason is generating a lot of shade and fraud, but it’s not entirely true.

An LLM could reason if you gave it a corpus of sentences (in whichever languages) that explicitly and unambiguously described a whole big bag of causal relationships and outcomes, things that happen because other things happen, and general structures of that kind, described clearly and formally, without any possibility of confusion.

The embeddings resulting from such a corpus could well work as a reference source of logic, cause, common sense, and reason about a lot of things. The next step would be to make those embeddings generalisable, so that this common sense about the way life is can be applied widely (again using vector comparison). So yes, it is possible to apply reason to an LLM; the main problem is that there probably wasn’t an emphasis on that kind of descriptive, even prescriptive, literature in the source training material in the first place. There will be some, maybe a lot, but I don’t think it was emphasised.

By introducing it at the RAG level, and then letting those embeddings migrate back into future models, I believe it could be possible to emulate a lot of common sense about the world and the way things are, purely through description of it. After all, the embeddings produced from such a (very massive) block of description are vectors, just numbers, and numbers are what LLMs really operate on: not words, not tokens, just vectors.

Consequently, my dreams of learning about the real world and building common sense through sensors and actuators could probably be supplanted by a rigorous, hefty project of simply describing the world instead of experiencing it. The thing to watch would be the description itself: it would have to be as detailed, accurate, and wide-ranging as the experiential model would be, and this is probably where the difficulty lies. People describing common sense about the world tend to abbreviate, generalise prematurely, leave things out, misunderstand, and above all, assume a lot.
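A rough sketch of the RAG-level idea above: embed a corpus of explicit cause-and-effect statements, retrieve the nearest ones for a query by vector comparison, and prepend them to the prompt. The `embed` stand-in, the tiny corpus, and the query are all hypothetical placeholders, not a real pipeline:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for any sentence-embedding model; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# A tiny, hypothetical corpus of explicit cause-and-effect statements.
common_sense = [
    "If an object is dropped, it falls because of gravity.",
    "If water is heated to 100 degrees Celsius at sea level, it boils.",
    "If it rains and you stand outside without cover, you get wet.",
]
corpus_vectors = np.stack([embed(s) for s in common_sense])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank statements by cosine similarity (dot product of unit vectors)."""
    scores = corpus_vectors @ embed(query)
    return [common_sense[i] for i in np.argsort(scores)[::-1][:k]]

query = "Why is the pavement wet this morning?"
context = "\n".join(retrieve(query))
prompt = f"Background facts:\n{context}\n\nQuestion: {query}"
# `prompt` would then go to the LLM; the retrieved facts are the descriptive
# corpus the post argues should be emphasised.
print(prompt)
```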
#AI #LLM #reasoning #CommonSense #vector #embedding

An interesting bioRxiv preprint was shared on the 🐦 site (x.com/strnr/status/18441056669). The paper describes a model that represents cells from large-scale scRNA-seq atlases using LLMs. Apart from the novelty value, one of the main draws should be the ability to map any dataset onto the existing universal cell embedding with no additional data labelling, model training, or fine-tuning. biorxiv.org/content/10.1101/20
github.com/snap-stanford/UCE
#scRNAseq #embedding #biology #llm
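This is not the UCE API itself, just a sketch of what mapping a dataset onto a shared cell embedding enables: once new cells and an annotated atlas live in the same vector space, labels can be transferred by nearest-neighbour search without any training. All shapes, labels, and values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: rows are cells, columns are embedding dimensions.
atlas_embeddings = rng.normal(size=(1000, 256))   # annotated reference atlas
atlas_labels = rng.choice(["T cell", "B cell", "NK cell"], size=1000)
new_embeddings = rng.normal(size=(50, 256))       # freshly mapped dataset

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between every new cell and every atlas cell.
sims = l2_normalize(new_embeddings) @ l2_normalize(atlas_embeddings).T

# 1-nearest-neighbour label transfer: no labelling, training, or fine-tuning.
nearest = sims.argmax(axis=1)
predicted_labels = atlas_labels[nearest]
print(predicted_labels[:10])
```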

X (formerly Twitter) · Stephen Turner (@strnr) on X · Universal Cell Embeddings: A Foundation Model for Cell Biology https://t.co/zyuhfdVIeB 🧬🖥️ https://t.co/wXjE9YqGxJ

I've created a basic app for searching an aerial photo using text queries. That's right, you can search for "roundabout" or "school playground" on an image of a city and get pretty good results!

Have a play with it here: server1.rtwilson.com/aerial - it's set up with an aerial image of Southampton, UK

Under the hood this uses the SkyCLIP model and the Pinecone vector database.
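A rough sketch of the likely pipeline, not the author's actual code: split the aerial image into tiles, embed the tiles and the text query with a CLIP-style model (SkyCLIP in the demo), and rank tiles by cosine similarity. A vector database such as Pinecone would replace the in-memory search here; the `embed_*` functions are hypothetical placeholders for the model's encoders:

```python
import numpy as np

def embed_text(query: str) -> np.ndarray:
    """Placeholder for the text encoder of a CLIP-style model."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_tiles(n_tiles: int) -> np.ndarray:
    """Placeholder for image-encoder outputs, one unit vector per map tile."""
    rng = np.random.default_rng(42)
    v = rng.normal(size=(n_tiles, 512))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

tile_vectors = embed_tiles(n_tiles=1024)  # precomputed once for the whole image

def search(query: str, k: int = 5) -> list[int]:
    """Return indices of the k tiles most similar to the text query."""
    scores = tile_vectors @ embed_text(query)
    return np.argsort(scores)[::-1][:k].tolist()

print(search("roundabout"))
print(search("school playground"))
```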

server1.rtwilson.com · Aerial Image Embedding Search Demo