yet, it's mimicking emergent thought quite beautifully. it's shockingly unintuitive how a simple process, scaled enormously, can lead to this much practical intelligence (practical in the sense that it's useful, not that it works the way we think). I'm aware there are multiple layers, filters, processes, etc.; I'm just talking about the foundation, which is next-token prediction.
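(for anyone curious what "predict the next token" looks like mechanically, here's a toy Python sketch with a bigram count model; the corpus and code are made up purely for illustration and have nothing to do with how a real transformer works:)

    # Toy illustration of "predict the next token": a bigram count model.
    # Real LLMs use transformers over subword tokens; this only shows the
    # interface: given a context, produce a distribution over the next token.
    from collections import Counter, defaultdict
    import random

    corpus = "the cat sat on the mat the cat ate the fish".split()

    # Count which token follows which.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def predict_next(token):
        """Sample the next token proportionally to how often it followed `token`."""
        counts = follows[token]
        if not counts:                      # token never seen with a successor
            return random.choice(corpus)
        return random.choices(list(counts), weights=counts.values())[0]

    # Generate a few tokens starting from "the".
    out = ["the"]
    for _ in range(5):
        out.append(predict_next(out[-1]))
    print(" ".join(out))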
when I first heard that it's not predicting words but parts of words, I immediately saw a red flag. yes, there are compound words like strawberry (straw + berry) where you can capture meaning at a higher resolution, but most words aren't compounds, and in general we're trying to simulate meaning instead of 'understanding' it. 'understanding' simply means knowing that a man is to a woman what a king is to a queen, without needing to learn about words and letters (those should just be an interface).
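(the man/woman :: king/queen relation is exactly the classic word-embedding analogy; here's a rough sketch, assuming gensim and a pretrained GloVe model, both my own choices rather than anything from this thread:)

    # Sketch of the king/queen analogy with pretrained word vectors.
    # Assumes gensim is installed; the model name is just a convenient
    # small GloVe download, not something prescribed anywhere above.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")   # ~66 MB download

    # vector("king") - vector("man") + vector("woman") ~= vector("queen")
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # "queen" typically shows up at or near the top of the list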
I feel we're yet to discover the "machine code" for ASI. it's like having no compiler and interpreting source code directly. imagine the speed-ups if we could spare the processor from having to understand our stupid, inefficient language.
I'd really like to see a completely new approach that works in the Meaning Space and transcends the imperfect Language Space. It will require lots of data pre-processing, but it's a fun journey -- basically a human-to-machine and machine-to-human parser. I'm sure I'm not the first one to think about it.
so what have we got so far?
https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMR...
A biological neuron doesn't do much on its own; it's a simple process. Yet when you put 100 billion of them together in the right configuration, each connected to about a thousand others, you get a human brain.
I have no idea what I'm talking about, but what you describe is exactly what LLMs do.
Words are tokens that represent concepts. We've found a way to express the relationships between many tokens in a giant web. The tokens are defined by their relationships to each other. Changing the tokens we use probably won't make much more difference than changing the language the LLM is built from.
We could improve the method we use to store and process those relationships, but it will still be fundamentally the same idea: Large webs of inter-related tokens representing concepts.
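(a toy sketch of that "web of relationships" idea: build a word co-occurrence matrix over a made-up three-sentence corpus and factor it, so each word is represented purely by the company it keeps; the corpus and the 2-dimensional rank are arbitrary choices of mine:)

    # Each word is represented only by which words appear next to it.
    import numpy as np

    sentences = [
        "the king rules the kingdom",
        "the queen rules the kingdom",
        "the cat sat on the mat",
    ]
    vocab = sorted({w for s in sentences for w in s.split()})
    idx = {w: i for i, w in enumerate(vocab)}

    # Count co-occurrences within a +/-1 word window.
    cooc = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    cooc[idx[w], idx[words[j]]] += 1

    # A low-rank factorization (SVD) turns each row into a small dense vector.
    U, S, _ = np.linalg.svd(cooc)
    emb = U[:, :2] * S[:2]

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    # "king" and "queen" end up close together purely because they occur
    # in the same contexts; "cat" does not.
    print(cosine(emb[idx["king"]], emb[idx["queen"]]))
    print(cosine(emb[idx["king"]], emb[idx["cat"]]))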
Turns out this is beautifully represented by embeddings alone!
Citation needed
What's really cool about tokenization is that it breaks words down based on how often their parts show up. This helps a lot with handling different forms of words, like adding "-ing" to a verb, making words plural, or changing tenses. It's like seeing language as a bunch of building blocks.
https://platform.openai.com/tokenizer
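Here's a toy sketch of the byte-pair-encoding idea behind this: start from single characters and repeatedly merge the most frequent adjacent pair. The tiny word list is made up for illustration; real tokenizers learn their merges from enormous corpora.

    # Toy BPE-style merging: frequent adjacent pairs get glued into one token.
    from collections import Counter

    words = ["strawberry", "blueberry", "blackberry", "straw", "berry"]
    tokens = [list(w) for w in words]           # start from single characters

    for _ in range(8):                          # a handful of merge steps
        pairs = Counter()
        for t in tokens:
            for a, b in zip(t, t[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]       # most frequent adjacent pair
        merged = []
        for t in tokens:
            out, i = [], 0
            while i < len(t):
                if i + 1 < len(t) and (t[i], t[i + 1]) == best:
                    out.append(t[i] + t[i + 1])
                    i += 2
                else:
                    out.append(t[i])
                    i += 1
            merged.append(out)
        tokens = merged

    print(tokens)   # frequent chunks like "berry" emerge as single tokens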
Tokens aren't just individual parts of compound words; they're sliced up in a way that's statistically convenient. The tokenizer has each individual character as a token, so it could be purely character-based if desired; it's just cheaper to compute when common sequences like "berry" are represented by a single token. Try typing "strawberry" into the tokenizer and you'll see it split into "str", "aw", and "berry".
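If you'd rather check the split in code, here's a quick sketch using the tiktoken library (my choice of library and encoding; different encodings split words differently):

    # Inspect how a BPE vocabulary slices "strawberry".
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")        # GPT-3-era encoding
    ids = enc.encode("strawberry")
    print([enc.decode([i]) for i in ids])
    # expect something like ['str', 'aw', 'berry'], the split described above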
Also, next-token prediction is not stupid. A "sufficiently advanced" next-token predictor would have to be at least as intelligent as a human if it could predict any human's next token in any scenario. Obviously we're not there yet, but right now there's no reason to think next-token prediction will hit any sort of limitation, especially with new models showing better performance purely from being trained much longer on the same datasets.