What level of reliability would LLMs need to achieve before the products that non-technical people imagine AI powering would actually work? Would 95% correct be good enough? 99%? 99.99%?
Follow-up question: at what levels would different products become viable? Would 95% be good enough for customer service, whereas a legal advisor would need 99.99%?
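Back-of-the-envelope, and purely illustrative: if a product chains steps together and each step has to be right, per-step accuracy compounds fast, which is a big part of why the "good enough" threshold depends on the product shape.

    # Toy arithmetic: probability an n-step workflow succeeds
    # if every step must be correct and errors are independent.
    for per_step in (0.95, 0.99, 0.9999):
        for n in (1, 10, 100):
            print(f"{per_step:.2%} per step, {n:3d} steps -> "
                  f"{per_step ** n:.2%} end-to-end")

At 95% per step, a 10-step chain finishes clean only about 60% of the time; at 99.99% it's about 99.9%.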
Are my coworkers right X% of the time? Are we able to work together to move things forward?
AI tools are tools that help us do our jobs. If they make it easier, faster, or cheaper, and that translates into savings or profit, then it's not hype, it's value creation.
Well, I don't know nuffin either.
Especially about %.
I'm still trying to be 80% successful just by showing up ;) But at least I'm here. Don't worry, I'll go away soon.
Break it down and don't try to do too many things at once.
Accuracy, reliability, and hallucination are related but they are 3 different things.
Accuracy problems could reveal a need to narrow existing error bands. Reliability is part of robustness, which can be through the roof at the same time that accuracy is still shit. Weaknesses here will seem like a lack of intelligence, but focused improvements can be made so the system performs as intelligently as expected. Without achieving perfection on either axis, a computer's usefulness can be well estimated by those who know the level of intelligence needed for the particular mission-criticality of the task at hand, and who can verify that the AI's performance (like any other acceptable software) will not drop below that requirement. At least not without an error message ;)
Or it will likely fail to meet specifications and be rejected.
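As a sketch of what "not without an error message" could look like (class name and thresholds are made up for illustration): a guard that tracks pass/fail on a rolling set of probe tasks and fails loudly the moment measured accuracy drops below spec, instead of silently degrading.

    from collections import deque

    class AccuracyGuard:
        """Toy spec gate: track pass/fail on recent probe tasks and
        refuse to keep serving once accuracy falls below the spec."""

        def __init__(self, spec: float, window: int = 100):
            self.spec = spec
            self.results = deque(maxlen=window)

        def record(self, passed: bool) -> None:
            self.results.append(passed)
            accuracy = sum(self.results) / len(self.results)
            if len(self.results) == self.results.maxlen and accuracy < self.spec:
                # Fail loudly -- an error message, not a silent drop below spec.
                raise RuntimeError(
                    f"accuracy {accuracy:.2%} below spec {self.spec:.2%}")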
But hallucination is not related to intelligence levels that can be matched to the criticality of the application.
Hallucination reveals stupidity.
A real show-stopper[0]. It has nothing to do with intelligence or the lack thereof. It spoils the whole mood even if it's a microdose and nobody would ever know.
For anything even remotely considered mission-critical, the hardware/software combination must be 100% hallucination-free in an ongoing, verifiable way. At the other end of the spectrum, orthogonal testing must also show that hallucination is 0% possible. As we have seen, this is so important that nothing less than belt & suspenders will do. You do want to get back to the Moon, don't you? And return home safely?
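Nobody has verifiable 0% today as far as I know, but the belt & suspenders shape is at least statable. One toy version of an orthogonal test: flag any output sentence that can't be matched back to the source material it was supposedly grounded in. Crude word-overlap stands in here for a real verification step, which would have to be far stricter.

    # Toy orthogonal check: reject output sentences with no visible
    # support in the source text (naive overlap, illustration only).
    def unsupported_sentences(output: str, source: str, min_overlap: float = 0.5):
        source_words = set(source.lower().split())
        flagged = []
        for sentence in output.split("."):
            words = set(sentence.lower().split())
            if not words:
                continue
            overlap = len(words & source_words) / len(words)
            if overlap < min_overlap:
                flagged.append(sentence.strip())
        return flagged  # anything here fails the belt-&-suspenders test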
Regardless of how smart your machine gets, you don't want to earn even "token" credit for being more stupid or shifty than you have to.
So many times that's the only thing people want a computer for: to harden repetitive tasks in ways that avoid the stupid mistakes humans were capable of making and machines were capable of completely avoiding. This is one reason computers were so welcomed in so many offices to begin with, when the software industry was not nearly as advanced as it is now.
100% repeatable operation has less priority, but since ordinary software has generally been accepted as having it from the beginning, a whole lot of the time nothing less will do here either.
No. Thing. Less. Than. 100%. If for any reason this is felt to be impossible, just keep doubling the effort until it feels like 200% to everyone, for as long as it takes (before deployment, of course), while you continue to figure out exactly why it seemed impossible. Then successfully correct that before thinking it's going to be as trustworthy as a well-programmed computer has always been.
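For the repeatability half, at least, the gate is easy to state. The generate() call below is a hypothetical stand-in for whatever model call you actually have; greedy decoding with a fixed seed is assumed to be available. Same input, same output, every time, or the run is rejected.

    # Toy repeatability gate: N identical runs or no deployment.
    # generate() is a stand-in, not a real library API.
    def assert_repeatable(generate, prompt: str, runs: int = 5) -> str:
        outputs = {generate(prompt, temperature=0.0, seed=42) for _ in range(runs)}
        if len(outputs) != 1:
            raise RuntimeError(f"{len(outputs)} distinct outputs for one prompt")
        return outputs.pop()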
Once repeatability and hallucination-free operation are securely at 100%, everything else can float as high as you can take it, always being useful along the way, without the possibility of a deceptive or unrealistic undercurrent rearing its ugly head ever[1]. You should be able to chalk up a hell of a lot more true progress per dollar too, building on a firmer foundation.
It's a computer. For mission-critical uses, one of its strengths is the ability to do some things over and over in a much more repeatable and predictable way than a human can achieve. This didn't just speed its adoption; it made it immediately acceptable for so many kinds of mission-critical work. Unexpected behavior is expected to be a red flag even for students, but when it goes as far as hallucination it could be a skull & crossbones, and by then it's too late.
Doesn't it make sense to build on the inherent strengths that the machine has over the human, rather than merely paper over the weaknesses incompletely no matter how many banknotes you apply?
If you want maximum uptake, your competition is the hallucination level of regular software. There must be no possibility that this potentially inherent defect exists in anything more ambitious, or you can end up looking nothing but stupid to more people than not.
[0] For most people.
[1] Once absolutely hallucination-free in a verifiable way, there are probably user classes that would rave endlessly at "intelligence" levels below 80% of the performance that people are pooh-poohing now.