To draw an analogy, can we compare the model to a hashing algorithm and the embedding to the hash of the input data? If so, what is the equivalent of SHA256?
How can we make embeddings future-proof and exchangeable between independent parties?
This goes even further, as a model sophisticated enough to capture a probability distribution will produce embeddings that encode this distribution (to some extent) so that any two models of that kind produce "equivalent" embeddings that can be transformed into each other. This is an area of active research (in fact, I've just been to a seminar talk about that).
So the answer to the "How can we ..." question above would be: by capturing the distribution, i.e. by making the embedding large enough and the training task difficult enough.
Examples of embeddings that are re-used are variants of word2vec, CLIP and CLAP.
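The claim that two sufficiently good models produce embeddings that can be transformed into each other is often tested with a simple linear alignment. Here is a minimal sketch of that idea, assuming you have embeddings of the same items from two models; the second model is simulated as a rotated, slightly noisy copy of the first, and all names are illustrative placeholders, not any particular paper's method:

    # Sketch: align embedding space A to embedding space B with an orthogonal
    # map (orthogonal Procrustes). Everything here is a toy simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 64

    # Pretend these are embeddings of the same 1000 items from two models.
    # Model B is just a rotated, slightly noisy copy of model A, i.e. the
    # "equivalent up to a transformation" situation described above.
    true_rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
    model_a_vecs = rng.normal(size=(1000, dim))
    model_b_vecs = model_a_vecs @ true_rotation + 0.01 * rng.normal(size=(1000, dim))

    # Procrustes solution: W = U V^T, where U S V^T is the SVD of A^T B.
    u, _, vt = np.linalg.svd(model_a_vecs.T @ model_b_vecs)
    w = u @ vt

    mapped = model_a_vecs @ w
    err = np.linalg.norm(mapped - model_b_vecs) / np.linalg.norm(model_b_vecs)
    print(f"relative alignment error: {err:.4f}")  # small when the spaces really are related

In practice the alignment is only as good as the "equivalence" between the two models; if one model captures structure the other doesn't, no linear map will recover it.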
As others have already mentioned: the hash analogy would be correct if you think about non-cryptographic hashes, but I doubt that this clarifies anything.
No, and there is no equivalent; they are built for different goals.
Cryptographic hashes are designed to give a unique representation that changes completely even if a single bit of the input changes. Their goal is to detect whether data has changed and to protect it from change, intentional or not.
Vector representations are designed to make similarities easy to find, so many pieces of data that differ at the bit level will get equal, or very close, vector representations.
A good vector representation also allows a computationally efficient measure of distance between different pieces of data.
The biggest problem with compatibility between vector representations is that there are several different algorithms, each with very large parameter sets, and so far I have not seen any attempt to standardize those parameter sets: they are very large and expensive to create, and there are copyright issues.
Also, I don't know for certain, but some of the algorithms may be patented.
As an example, consider a legal text in English and a good translation of it into French (or another language): they are completely different at the binary level, but roughly equal in a suitable vector representation.
Unfortunately, converting from one vector space to another is impossible in the general case.
The vector spaces do not overlap 100%, so some things that can be expressed in one space cannot be expressed in the other.
A second problem is that conversions between high-dimensional vector spaces are computationally expensive and not exact.
As an illustration of how hard such conversions are, there is the old anecdote about an early machine translator that rendered the phrase "the spirit is strong but the flesh is weak" into Russian and back, and returned "the vodka is good, but the meat is rotten".
You couldn't do that with a hash, as far as I understand it, as hashing doesn't attempt to put similar things together -- quite the opposite.
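To make the contrast concrete, here is a toy sketch using only the standard library. The character-trigram counter is just a stand-in for a real embedding model, not how real embeddings work: change one character of a sentence and the SHA-256 digests share nothing, while the crude "vectors" stay almost identical.

    # Toy contrast: cryptographic hash vs. a (very crude) vector representation.
    import hashlib
    import math
    from collections import Counter

    def sha256_hex(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def trigram_vector(text):
        # Stand-in "embedding": counts of character trigrams.
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    s1 = "The contract is governed by the laws of France."
    s2 = "The contract is governed by the laws of France!"  # one character changed

    print(sha256_hex(s1))  # the two digests share nothing...
    print(sha256_hex(s2))
    print(cosine(trigram_vector(s1), trigram_vector(s2)))  # ...but the vectors are ~0.97 similar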
Do a stereoscopic embedding. One eye for meaning, the other for distance. Put GIS coordinates as g-code/GRBL [0] in a docstore database as a 3D-printable bias relief [1].
[0] g-code/GRBL : https://www.libhunt.com/compare-Universal-G-Code-Sender-vs-G...
Strictly in the virtual realm, this can be shortened to s-expressions/m-expressions as part of an L-system equation. Or just stick with the traditional math equation(s).
[1] : https://www.yeggi.com/q/bias/
You'll almost certainly want to update the model over time, as the input distribution changes, in order to maintain good accuracy. So you need to keep the original source data and recalculate the embeddings as needed.
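A minimal sketch of what "keep the source data and recalculate" can look like; embed(), the record layout, and the in-memory store are hypothetical placeholders (a real system would call an actual embedding model and use a vector database):

    # Keep the raw text plus the model version next to each vector,
    # and re-embed anything produced by an older model version.
    from dataclasses import dataclass, field

    MODEL_VERSION = "toy-embedder-v2"

    def embed(text, version):
        # Placeholder: real code would call an embedding model here.
        return [float(len(text)), float(text.count(" ")), float(version.endswith("v2"))]

    @dataclass
    class Record:
        text: str                      # original source data is always kept
        model_version: str = ""
        vector: list = field(default_factory=list)

    def refresh(records):
        """Recompute vectors for records embedded with an older model version."""
        for rec in records:
            if rec.model_version != MODEL_VERSION:
                rec.vector = embed(rec.text, MODEL_VERSION)
                rec.model_version = MODEL_VERSION

    store = [Record("embeddings are not hashes"), Record("keep the raw text around")]
    refresh(store)                     # safe to re-run after every model upgrade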
You can recover the original word from the embedding, but not from the hash.
A hash function will return very distant outputs for very similar inputs. An embedding will return similar ones.