I like to think that I understand LLMs pretty well, which is why I was so underwhelmed by most of the mainstream "AI" news. But this threw me for a loop. As a predictor, how can it model base64? It surely can't just be "pretending" like it does with everything else. The precision feels the most wrong to me: it handles long random strings perfectly. Why does it then fail at simple arithmetic?
It's not perfect, though. I tested it on a few sentences of text and it made a few mistakes. Due to the way GPT tokenizes the input text, it can't really generalize the pattern, as the mapping of text to tokens is somewhat random. It effectively has to learn how to map every unique combination of 3 characters to 4 base64 digits, of which there are up to 2^24 = 16,777,216 distinct mappings. On top of that, the number of characters in each token varies, which can also lead to mistakes.
You can use this tool to see how GPT3 maps text to tokens and token IDs: https://platform.openai.com/tokenizer
As an example, the alphabet "abcdefghijklmnopqrstuvwxyz" maps to [39305, 4299, 456, 2926, 41582, 10295, 404, 80, 81, 301, 14795, 86, 5431, 89]. This is what I mean by "fairly random": the token boundaries don't line up with the 3-character groups that base64 operates on.
[1] https://arxiv.org/abs/1912.10077
[2] https://www.reddit.com/r/naturalism/comments/1236vzf/on_larg...
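If you want to poke at this programmatically rather than through the web tool, here's a minimal sketch. I'm assuming OpenAI's open-source tiktoken library with the GPT-3-era "r50k_base" vocabulary (the web tokenizer exposes the same encoding), so treat it as an illustration of the mismatch rather than anything authoritative:

    import base64
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")  # GPT-3-era vocabulary

    text = "abcdefghijklmnopqrstuvwxyz"
    token_ids = enc.encode(text)
    print(token_ids)                             # uneven-length tokens (compare with the IDs above)
    print([enc.decode([t]) for t in token_ids])  # the character span each token covers

    # base64, by contrast, always works on aligned 3-byte groups:
    print([text[i:i + 3] for i in range(0, len(text), 3)])
    print(base64.b64encode(text.encode()).decode())  # YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXo=

The token boundaries and the 3-byte groups drift past each other, so there's no single token-level rule the model can reuse.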
How hard would it be for an LLM to convert a string to binary? That's just a lookup table. How hard would it be to then remove all the spaces and re-insert spaces every 6 bits? And convert those 6-bit groups back to letters? That's a lookup table again.
[1]: https://pthree.org/2011/04/06/convert-text-to-base-64-by-han...
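For the curious, here's roughly what that by-hand procedure spells out to in code. This is just my sketch of the method described in [1], two lookup tables plus some regrouping, not a claim about what the model does internally:

    B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def base64_by_hand(text: str) -> str:
        # Lookup table #1: each byte -> its 8-bit binary string.
        bits = "".join(format(b, "08b") for b in text.encode("utf-8"))
        # "Remove the spaces and add spaces every 6 bits": regroup into
        # 6-bit chunks, zero-padding the last one.
        chunks = [bits[i:i + 6].ljust(6, "0") for i in range(0, len(bits), 6)]
        # Lookup table #2: each 6-bit value -> a base64 character.
        out = "".join(B64_ALPHABET[int(c, 2)] for c in chunks)
        # Pad with '=' to a multiple of 4 characters.
        return out + "=" * (-len(out) % 4)

    print(base64_by_hand("Man"))    # TWFu
    print(base64_by_hand("hello"))  # aGVsbG8=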
Also, changing a character in the input only has local effects: it changes at most 2 characters in the encoded output stream.
In arithmetic, on the other hand, a single character change can have effects at arbitrarily long range.
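A quick way to see the contrast (my own toy illustration, nothing more):

    import base64

    # Base64: flip one input character and diff the two encodings.
    a = base64.b64encode(b"the quick brown fox jumps over the lazy dog").decode()
    b = base64.b64encode(b"the quick brown fox jumps over the lazy dot").decode()
    print([i for i, (x, y) in enumerate(zip(a, b)) if x != y])
    # -> two adjacent positions differ, right where the input changed

    # Arithmetic: flip one digit of an operand and every output digit can change.
    print(9999 + 1)  # 10000
    print(9999 + 0)  # 9999 -- the carry rewrote the entire result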
GPT has learned an association between Base64 tokens and plain text. It's likely a pretty strong correlation, but, like everything it learns, it probably has some unpredictable edge cases.
But it's a harder task to learn, because arithmetic doesn't encode information about its solution in the preceding context. "9383 + 3545" or "are any of the following numbers prime: 96885, 66576, 4766?" doesn't actually tell you anything that would inform the answer. You go to school and you learn the required set of steps for solving these problems.
On the other hand, for "John is smiling so he is _____", the preceding context screams "happy" as a very likely choice. The preceding context actually helps find the solution rather than being the equivalent of dead weight.
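To put it concretely: the prime question above can only be answered by actually running a procedure, something like this throwaway sketch (my example, and obviously not how the model works internally):

    def is_prime(n: int) -> bool:
        # Trial division: the kind of step-by-step procedure you learn in school.
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    for n in (96885, 66576, 4766):
        print(n, is_prime(n))
    # 96885 False (divisible by 5), 66576 False (even), 4766 False (even)

Nothing in the wording of the question hints at those answers; you have to execute the steps.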
And you simply didn't understand LLMs as well as you thought you did.
Language models trained on code reason better, even on benchmarks that have nothing to do with code. https://arxiv.org/abs/2210.07128
Encoding/Decoding Base64 is neat but not particularly mindblowing unless you have some serious misconceptions on what language models are capable of.
Imagine you're a supercomputer and someone feeds you billions and billions and billions of pages of text written by humans.
Then they ask you to compress it really really small.
You can't compress it that small without figuring out a lot of the underlying laws, frameworks, and rules that apply to humans, that apply to the world, and so on.
Compression == coming up with powerful frameworks that condense knowledge.
It's kind of like how with a really powerful set of rules or frameworks in math or physics, you can derive many other things.
As a side note, I suspect GPT-4 has inside its neurons a bunch of powerful frameworks about the world that humans haven't yet discovered.
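A toy way to feel the compression point (my own example, with zlib standing in for the model): text with discoverable structure compresses enormously better than text without it.

    import os
    import zlib

    # Structured "text": one simple rule (counting) applied over and over.
    structured = " ".join(str(i) for i in range(10_000)).encode()
    # Unstructured bytes of the same length: no rule to discover.
    noise = os.urandom(len(structured))

    print(len(structured), len(zlib.compress(structured, 9)))  # compresses heavily
    print(len(noise), len(zlib.compress(noise, 9)))            # barely compresses at all

zlib only exploits repeated substrings; a compressor that also discovered the counting rule could do far better still. Scale that idea up to billions of pages and you get the "compression == frameworks" argument.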
Presumably, if you ask it to execute that same code for an input example you provide, it will.
Et voilà.