To draw an analogy, can we compare the model to a hashing algorithm and the embedding to the hash of the input data? If so, what is the equivalent of SHA256?
How can we make embeddings future-proof and exchangeable between independent parties?
This goes even further, as a model sophisticated enough to capture a probability distribution will produce embeddings that encode this distribution (to some extent) so that any two models of that kind produce "equivalent" embeddings that can be transformed into each other. This is an area of active research (in fact, I've just been to a seminar talk about that).
So the answer to the "How can we ..." question above would be: by capturing the distribution, i.e. by making the embedding large enough and the training task difficult enough.
Examples of embeddings that are re-used are variants of word2vec, CLIP and CLAP.
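The claim that two sufficiently good models produce embeddings that can be transformed into each other is often tested with a simple linear alignment. Here is a minimal sketch of that idea, assuming you have embeddings of the same items from two models; the second model is simulated as a rotated, slightly noisy copy of the first, and all names are illustrative placeholders, not any particular paper's method:

    # Sketch: align embedding space A to embedding space B with an orthogonal
    # map (orthogonal Procrustes). Everything here is a toy simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 64

    # Pretend these are embeddings of the same 1000 items from two models.
    # Model B is just a rotated, slightly noisy copy of model A, i.e. the
    # "equivalent up to a transformation" situation described above.
    true_rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
    model_a_vecs = rng.normal(size=(1000, dim))
    model_b_vecs = model_a_vecs @ true_rotation + 0.01 * rng.normal(size=(1000, dim))

    # Procrustes solution: W = U V^T, where U S V^T is the SVD of A^T B.
    u, _, vt = np.linalg.svd(model_a_vecs.T @ model_b_vecs)
    w = u @ vt

    mapped = model_a_vecs @ w
    err = np.linalg.norm(mapped - model_b_vecs) / np.linalg.norm(model_b_vecs)
    print(f"relative alignment error: {err:.4f}")  # small when the spaces really are related

In practice the alignment is only as good as the "equivalence" between the two models; if one model captures structure the other doesn't, no linear map will recover it.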
As others have already mentioned: the hash analogy would be correct if you think about non-cryptographic hashes, but I doubt that this clarifies anything.
No, and there is no equivalent; they are built for different goals.
Cryptographic hashes are designed to give a unique representation that changes completely even if a single bit of the input changes. Their goal is to detect whether data has changed and to protect it from change, intentional or not.
Vector representations are designed to make similarities easy to find, so many pieces of data that differ at the bit level will get equal, or very close, vector representations.
A good vector representation also allows a computationally efficient measure of distance between different pieces of data.
The biggest problem with compatibility between vector representations is that there are several different algorithms, each with very large parameter sets, and so far I have not seen any attempt to standardize those parameter sets: they are very large and expensive to create, and there are copyright issues.
Also, I don't know for certain, but some of the algorithms may be patented.
As an example, consider a legal text in English and a good translation of it into French (or another language): they are completely different at the binary level, but roughly equal in a suitable vector representation.
Unfortunately, converting from one vector space to another is impossible in the general case.
The vector spaces do not overlap 100%, so some things that can be expressed in one space cannot be expressed in the other.
A second problem is that conversions between high-dimensional vector spaces are computationally expensive and not exact.
As an illustration of how hard such conversions are, there is the old anecdote about an early machine translator that rendered the phrase "the spirit is strong but the flesh is weak" into Russian and back, and returned "the vodka is good, but the meat is rotten".
You couldn't do that with a hash, as far as I understand it, as hashing doesn't attempt to put similar things together -- quite the opposite.
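To make the contrast concrete, here is a toy sketch using only the standard library. The character-trigram counter is just a stand-in for a real embedding model, not how real embeddings work: change one character of a sentence and the SHA-256 digests share nothing, while the crude "vectors" stay almost identical.

    # Toy contrast: cryptographic hash vs. a (very crude) vector representation.
    import hashlib
    import math
    from collections import Counter

    def sha256_hex(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def trigram_vector(text):
        # Stand-in "embedding": counts of character trigrams.
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    s1 = "The contract is governed by the laws of France."
    s2 = "The contract is governed by the laws of France!"  # one character changed

    print(sha256_hex(s1))  # the two digests share nothing...
    print(sha256_hex(s2))
    print(cosine(trigram_vector(s1), trigram_vector(s2)))  # ...but the vectors are ~0.97 similar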
Do a stereoscopic embedding. One eye for meaning, the other for distance. Put GIS coordinates as g-code/GRBL [0] in a docstore database as a 3D-printable bias relief [1].
[0] g-code/GRBL : https://www.libhunt.com/compare-Universal-G-Code-Sender-vs-G...
Strictly in the virtual realm, this can be shortened to s-expressions/m-expressions as part of an L-system equation. Or just stick with the traditional math equation(s).
[1] : https://www.yeggi.com/q/bias/
You'll almost certainly want to update the model over time, as the input distribution changes, in order to maintain good accuracy. So you need to keep the original source data and recalculate the embeddings as needed.
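A minimal sketch of what "keep the source data and recalculate" can look like; embed(), the record layout, and the in-memory store are hypothetical placeholders (a real system would call an actual embedding model and use a vector database):

    # Keep the raw text plus the model version next to each vector,
    # and re-embed anything produced by an older model version.
    from dataclasses import dataclass, field

    MODEL_VERSION = "toy-embedder-v2"

    def embed(text, version):
        # Placeholder: real code would call an embedding model here.
        return [float(len(text)), float(text.count(" ")), float(version.endswith("v2"))]

    @dataclass
    class Record:
        text: str                      # original source data is always kept
        model_version: str = ""
        vector: list = field(default_factory=list)

    def refresh(records):
        """Recompute vectors for records embedded with an older model version."""
        for rec in records:
            if rec.model_version != MODEL_VERSION:
                rec.vector = embed(rec.text, MODEL_VERSION)
                rec.model_version = MODEL_VERSION

    store = [Record("embeddings are not hashes"), Record("keep the raw text around")]
    refresh(store)                     # safe to re-run after every model upgrade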
You can recover the original word from the embedding, but not from the hash.
A hash function will return very distant outputs for very similar inputs. An embedding will return similar ones.