Couldn't you somehow hook this up with hash-collision-finding algorithms, and produce a (possibly only modest) improvement in hash function inversion?
I think this is interesting because a good hash function has an invertible mixing function, but is non-invertible overall. So being able to hack the "trapdoor" or "one way" nature of hash functions using AI would seem like a kind of cool escape hatch on what is supposedly "mathematically possible".
Which maybe means it's wrong to think you can "invert a hash function" with AI somehow, but I think it's interesting to think about.
Especially in cases where you have "content-aware hashing" with an assumption of secrecy owing to its one-way nature, maybe AI could be used to "reconstruct an image" from a compressed representation, in the same way that it is being used to reconstruct imagery from "dimensional embeddings" gleaned from brain measurements, and analogous to how it is being considered for image compression.
I don't know. I'm just riffing out loud, but...anyone wanna continue the riff?
Let's think about hash functions. Not inverting a hash, just applying one.
- A hash function is a bit-perfect map between input and output; "close" is not enough.
- A good hash function is designed to produce a completely different answer if even one input bit is changed. It's not a smooth function, in terms of the input space and the natural topology of the output space. Let me call that bit-sensitivity.
- I guess you could frame a hash as a categorization, but the number of categories would have to be the size of the output space, which is insanely big for any hash worth your time. You'd have to recognize 2^|bits in output|. Good luck!
So ML seems to be a very bad match for implementing the (forward) hash itself.
What about the inverse?
- The bit-perfection requirement is extremely onerous. Either the output of the inverse has the correct hash or it does not.
- At least the space of inputs which match a given hash IS a distribution (there are many inputs which yield the same hash, so there are many valid outputs given an input for the inverse hash). Unfortunately, because the purpose of the hash is to be bit-sensitive, the distribution of hash inputs (inverse outputs) is horribly multimodal.
- You can't generate the output of the inverse in stages, because the bit-sensitivity makes it nigh impossible to say you're meaningfully getting closer to the correct answer.
Also seems to be a bad match for ML.
Finally, some broader implications. Good hashes are hoped to be one-way functions. https://en.wikipedia.org/wiki/One-way_function Inverting one succeeds with very little probability, by definition. If your approach worked it would show that there is SOME algorithm for inverting these functions and thus they are not one-way. That's OK, there's actually no proof that ANY function is one-way. But our current hash functions have survived years of intense scrutiny by human-intelligence experts. Learning to invert one would be a major, exciting development, upending an entire field. Not impossible, but a reason for skepticism.
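(For reference, the standard definition, roughly: f is one-way if it is computable in polynomial time, but every probabilistic polynomial-time adversary A inverts it only with negligible probability, i.e. Pr[ f(A(f(x))) = f(x) ] <= negl(n) for x drawn uniformly from {0,1}^n.)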
Here's one bit of thinking: currently, everyone is talking about LLMs, which are really deep NNs. The function that these models try to approximate is, at least partly, a somewhat continuous function. Maybe that's where the power of these models comes from. They're able to interpolate between concepts, which seems like a smart or even creative way of combining ideas.
Hashing, in contrast, is a discrete function. If one bit of the input is flipped, half of the hash's bits are expected to change. I don't see a way an AI will learn this.
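A quick sketch of that avalanche behavior (Python with the standard hashlib; sha256 chosen just as an example):

    import hashlib

    def bit_diff(a: bytes, b: bytes) -> int:
        # Count the number of differing bits between two equal-length byte strings.
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

    msg = bytearray(b"hello hash inversion")
    h1 = hashlib.sha256(bytes(msg)).digest()

    msg[0] ^= 0x01                      # flip a single input bit
    h2 = hashlib.sha256(bytes(msg)).digest()

    print(bit_diff(h1, h2), "of", len(h1) * 8, "output bits changed")
    # Typically prints something near 128 of 256: flipping one input bit
    # flips roughly half of the output bits.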
AI fundamentally can only do a subset of what educated humans can do given time; it just does what it does cheaper than the humans. All things it enables are a matter of resource and time cost savings, not of other capabilities.
Turns out that today that's totally possible, not because the grainy footage had more pixels, but because AI can make a very accurate guess at what the human looks like at better resolution, from a training set of tons of grainy human faces paired with full-resolution human faces. Do we have a training set mapping one-way hash results back to their originals, and a way for AI to do the same thing it's done with human faces?
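Generating such a "training set" is trivial, for what it's worth; a sketch in Python with sha256 as a stand-in:

    import hashlib
    import os

    # Build a toy "training set" of (hash, original) pairs. Producing it is easy;
    # the problem is that, unlike faces, nearby hashes do not correspond to
    # similar originals, so there is no smooth structure for a model to learn.
    def make_dataset(n_samples: int = 1000, size: int = 32):
        data = []
        for _ in range(n_samples):
            original = os.urandom(size)
            digest = hashlib.sha256(original).digest()
            data.append((digest, original))
        return data

    pairs = make_dataset()
    print(len(pairs), "pairs; first digest:", pairs[0][0].hex()[:16], "...")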
00 -> 0
01 -> 1
10 -> 1
11 -> 0
XOR takes 2 values and returns 1. Undoing XOR would take 1 value and return 2 possibilities. Because XOR destroys information, undo_xor(n) would have to create it, like this: undo_xor(0) -> [00, 11]
undo_xor(1) -> [01, 10]
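As a literal sketch in Python, undoing a single XOR just enumerates both preimages:

    def undo_xor(n: int):
        # Return every pair of bits (a, b) with a ^ b == n.
        return [(a, b) for a in (0, 1) for b in (0, 1) if a ^ b == n]

    print(undo_xor(0))  # [(0, 0), (1, 1)]
    print(undo_xor(1))  # [(0, 1), (1, 0)]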
Let's say we wanted to implement undo_sha512(n), and let's say we knew our original string was 1MB. How many 1MB-sized strings would we likely see for any particular sha512 value? I think it is probably more intuitive with smaller numbers:
Let's say we have a 4-bit string and a 1-bit hash.
0000 -> 0
0001 -> 1
0010 -> 1
0011 -> 0
0100 -> 1
0101 -> 0
0110 -> 0
0111 -> 1
1000 -> 1
1001 -> 0
1010 -> 0
1011 -> 1
1100 -> 0
1101 -> 1
1110 -> 1
1111 -> 0
16 possible strings, 2 possible hash values. Each hash value maps to 8 of the 4-bit strings, so undo_lame_1bit_hash(n) looks like this:
undo_lame_1bit_hash(0) -> [0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111]
undo_lame_1bit_hash(1) -> [0001, 0010, 0100, 0111, 1000, 1011, 1101, 1110]
We can say 2^(log2(possible_file_values) - log2(possible_hash_values)) is the idealized number of strings per hash value: 2^(log2(16) - log2(2)) = 2^(4-1) = 8.
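A quick brute-force check of that in Python, using the parity of the bits as the "lame" 1-bit hash:

    from collections import defaultdict
    from itertools import product

    def lame_1bit_hash(bits):
        # Parity of the bits: XOR them all together.
        h = 0
        for b in bits:
            h ^= b
        return h

    preimages = defaultdict(list)
    for bits in product((0, 1), repeat=4):          # all 16 four-bit strings
        preimages[lame_1bit_hash(bits)].append(bits)

    for h, inputs in sorted(preimages.items()):
        print(h, "->", len(inputs), "preimages")    # 8 each, matching 2^(4-1)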
So if we had a 512 bit sized hash, and were hashing 2048 bit strings, we would expect 2^(2048-512) or 2^1536 strings per hash.
If we took a 512-bit hash of a 1MB file (8,388,608 bits), that would be 2^(8388608-512) possible files per hash. Even if you said "only return files that start with 137 80 78 71 13 10 26 10" (the PNG header), that is still an extraordinary number of files to "guess" at.
Even more intuitively than this, what is special about a 512 bit hash or a 256 bit hash or a 128 bit hash? Would a hash be less valid at 64 bits? At 32 bits?
How would you think about turning a 32 bit hash back into a file? It is very clear that extreme amounts of information have been irrevocably lost at 32 bits.
What I think is less clear and not intuitive is "reversing" a hash when the text being hashed is smaller than the hash output size (like a password).
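That case is attacked by searching the small input space rather than by "inverting" the hash at all; a minimal sketch in Python, with the target password made up for illustration:

    import hashlib
    from itertools import product

    # Hypothetical leaked digest of a 4-character lowercase password.
    target = hashlib.sha512(b"hunt").hexdigest()

    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for candidate in product(alphabet, repeat=4):   # only 26^4 = 456,976 guesses
        guess = "".join(candidate).encode()
        if hashlib.sha512(guess).hexdigest() == target:
            print("recovered:", guess.decode())
            break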
I had this idea once but I never tested it:
- Generate multiple TOTP codes
- Make a large database with the timestamp and their corresponding TOTP codes
- Train a model on that
- Use it to generate MFA keys???
Of course, this couldn't possibly be an exploit unless you steal a lot of TOTP codes from a user along with the timestamps they were generated at. But fun to think about nonetheless.
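For what it's worth, generating that (timestamp, code) dataset is easy; a rough sketch of RFC 6238-style TOTP in Python (the secret key here is made up, and is of course exactly the thing the model would have to infer):

    import hashlib
    import hmac
    import struct

    SECRET = b"hypothetical-shared-secret"   # the unknown the attacker wants

    def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> int:
        # HOTP over the time-step counter (RFC 4226 / RFC 6238 style).
        counter = struct.pack(">Q", unix_time // step)
        digest = hmac.new(secret, counter, hashlib.sha1).digest()
        offset = digest[-1] & 0x0F
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return code % (10 ** digits)

    # "Training set": timestamps paired with the codes they produce.
    dataset = [(t, totp(SECRET, t)) for t in range(0, 30 * 1000, 30)]
    print(dataset[:3])

Since each code is an HMAC output, the dataset looks like noise unless the model somehow recovers the key.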
The reason is that a hash is a many-to-one map (often built on arithmetic over a Galois field, i.e. a finite field), so a given output has no unique solution.
Turning a file into a hash string is simple, and only has one correct answer in that direction.
Turning that string back into the same file is impossible, because there are multiple correct answers for any given string, and the context has been discarded. There is no determinable way to rebuild the context; it's a guess.
In one direction you have a unique answer; in the other you have a many-to-one answer. Determinism, which is the property computers need to do work, requires that there be a 1:1 solution; otherwise you run into issues with the halting problem and other non-deterministic behaviors that don't lend themselves to anything useful.
https://en.wikipedia.org/wiki/Theory_of_computation
Fundamentally, if determinism is not present, a Turing machine can't solve those classes of problems. It's a fundamental of the theory of computation. Yet so many people subscribe to magical thinking about computers because they do not understand this.
If you could preserve the context for a single file, then you could get the same file back, but special handling would be needed to resolve the conflicts with determinism (which isn't possible for Turing machines; halting problem), and for a general solution you would have to save the context of every single possible file (an infinite solution requiring an infinite amount of memory).
It can attempt an approximation, but for functions like hashes the behavior is so chaotic that it will never get close to a general approximation.
It must get infinitely close, and as it works the memory required grows up to the size of the unique domain of the output function, at which point you run into the halting problem and determinism breaks. From that point it grows without bound in a non-deterministic way.
Given the constraints imposed on the design of current-day computers, more likely it will simply map to the first answer it finds within the space, despite there being other possible answers, because after that it has no way to know when it should halt, nor will it ever.
There are limits to the types of problems computers can solve.
Edit: clarified language
Hash inversion is a hard problem. At first it seems very dissimilar to what AI is good for.
It's a mapping between two spaces. But whereas AI also builds an associative table between two spaces, in the smoothest manner possible, a good hash algorithm tries to make this table as scrambled as possible.
Bad hash algorithms, like a differentiable perceptual hash, can be inverted using optimization techniques like in AI (i.e. gradient descent), where we use the differentiability property to get a hint about the direction in which to modify the input so that the output hash gets closer to the target hash.
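A toy sketch of that idea (Python/NumPy, with a made-up "differentiable hash": the sign of a random linear projection, relaxed to tanh so gradients exist):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_bits = 64, 16
    A = rng.standard_normal((n_bits, n_in))          # fixed "hash" projection

    def soft_hash(x):
        return np.tanh(A @ x)                        # relaxed bits in (-1, 1)

    target = np.sign(A @ rng.standard_normal(n_in))  # hash bits of some secret input

    x = rng.standard_normal(n_in)                    # start from a random guess
    lr = 0.1
    for _ in range(2000):
        y = soft_hash(x)
        # d/dx of mean((y - target)^2), using d tanh(u)/du = 1 - tanh(u)^2
        grad = (2.0 / n_bits) * A.T @ ((y - target) * (1.0 - y ** 2))
        x -= lr * grad

    print("matched bits:", int(np.sum(np.sign(A @ x) == target)), "of", n_bits)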
But when the hash function is not differentiable, all seems lost, as the technique can't be applied. So you use the trick familiar to Haskell's monad enthusiasts: lifting!
In order to use the specifics of your hash function, you go back to its definition. You write it as a boolean formula, as specified by the program: you now have a SAT problem, to which you can apply your SAT solver, which will use heuristics to partition and explore the solution space more efficiently.
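A toy version of that step, using Z3's bit-vectors as a stand-in for a hand-rolled SAT encoding (the "hash" here is a made-up two-round mixer, not a real one; assumes the z3-solver package):

    from z3 import BitVec, BitVecVal, RotateLeft, Solver, sat

    def toy_hash(x):
        # Made-up 32-bit mixer: two rounds of rotate / xor / multiply.
        for k in (0x9E3779B1, 0x85EBCA77):
            x = RotateLeft(x, 7) ^ BitVecVal(k, 32)
            x = x * BitVecVal(0x27220A95, 32)
        return x

    x = BitVec("x", 32)
    target = BitVecVal(0xDEADBEEF, 32)

    s = Solver()
    s.add(toy_hash(x) == target)
    if s.check() == sat:
        print("preimage found:", hex(s.model()[x].as_long()))

Real cryptographic hashes can be encoded the same way, but the resulting formulas are enormously harder for the solver.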
Well, you could stop here, but why stop when you can apply the same trick once more: lift again.
This time we go from a boolean formula to a continuous space. We replace each bit (0 or 1) by a real number x ∈ [0,1], which is the probability of the bit being 1. And we also have to lift the binary operations to this continuous space. This is equivalent to measuring the voltage in an electrical circuit with transistors and gates that go smoothly from 0 to 1: we have softened the hard boundaries of a digital circuit, so we can now attack the problem from the inside, looking for a continuous vector x ∈ R^n on the boundary of the [0,1]^n hypercube.
So we are now attacking the hash-inversion problem using some interior-point method, but we have to progressively enforce the hard constraints again so that the softened gates behave like hard digital gates at the end of the optimization process.
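A sketch of what those softened gates might look like (Python; these particular relaxations and the penalty schedule are just one plausible choice):

    def soft_not(a):        return 1.0 - a
    def soft_and(a, b):     return a * b
    def soft_or(a, b):      return a + b - a * b
    def soft_xor(a, b):     return a + b - 2.0 * a * b

    # At the corners of [0,1]^n these agree with the hard boolean gates:
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            assert soft_and(a, b) == float(int(a) & int(b))
            assert soft_or(a, b)  == float(int(a) | int(b))
            assert soft_xor(a, b) == float(int(a) ^ int(b))

    def bit_penalty(x, weight):
        # Zero only at x = 0 or x = 1; increasing `weight` over the course of
        # the optimization pushes the relaxed bits back to the corners so the
        # soft gates end up behaving like hard digital gates.
        return weight * x * (1.0 - x)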
I'm guessing you're guessing what we'll be doing next: lifting.
So up to now our gates are softened, and we are using a cooling adiabatic process to get to the solution. But we have not yet talked about the avalanche characteristic of the hash function: cryptographic hash functions are designed so that one bit of difference in the input can result in all of the output bits being different. Viewed through the lens of the transformations we have just made, this makes our problem very badly behaved and chaotic, and numerical precision becomes an issue.
The other aspect of hashes that we haven't yet talked about is how they map from a big space to a small space. This is not so different from neural networks that map pictures to text, after which you can run Stable Diffusion to get a picture back from your text. Although this one-way-ness seems frightening, you just have to hallucinate the information you don't have.
In the context of hash inversion, this hallucination means replacing your irreversible soft gates (Or, And, Maj, ...) with reversible soft gates and adding unknown variables as extra inputs, so that your hash computation becomes reversible given those unknown variables, which are then degrees of freedom your optimizer can use.
Now that the gates are reversible, this is not very different from a quantum adiabatic process. The gates themselves are not hard-defined but are parametrized such that, at the boundary, they behave like the gates they represent, while each variation of a gate allows some freedom to escape from local minima (aka quantum tunneling). We can now use algorithms like Toshiba's "Simulated Bifurcation Algorithm", which model this chaotic process with differential equations.
But you can lift once more: instead of learning to solve for a single hash function, you can learn to invert all hash functions at the same time, by learning an embedding of hash functions into R^n such that hash functions that are computationally similar, and can thus be inverted with similar heuristics, end up close together.
And that's how you plant backdoors into hash function.
But you can lift once more : ...