In particular, I have the following questions:
1. What was the product you were working on?
2. Were there any new software engineering challenges that came from working with GPT-4 (e.g. output quality, testing, monitoring)?
1. I've observed multiple products across customers.
1.1 Correcting or filling in missing information in structured data. For example, a system that suggests corrections to products in a company catalogue (each product category has a different schema). Unstructured data is pulled from various websites, and optionally from categories recognized in images; it is then compared against the catalogue data, and the most probable fixes are reported.
Most of the work is done by a few polite prompts to GPT-3.5/4 (~5 English sentences in total); a sketch follows after 1.3.
1.2 Better search over company data. E.g. a chatbot for internal documentation that can also call internal services in order to answer a question. The same ~5 English sentences do the bulk of the work.
1.3 (non-commercial) Endangered-language preservation. Building a smart agent, accessible via chat or hardware (like Alexa/HomePod), that speaks the native language, understands it, and helps preserve the culture. This is a complex one.
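To give a feel for how little glue code 1.1 needs, here is a minimal sketch of the prompt side (the schema, product, and scraped text are made up; openai Python client, 0.x-style API):

    import json
    import openai  # pip install openai

    # Hypothetical catalogue entry plus text scraped from the web.
    schema = {"name": "string", "voltage_v": "number", "color": "string"}
    product = {"name": "Desk Lamp X200", "voltage_v": 2300, "color": None}
    scraped = "The X200 desk lamp runs on 230 V and ships in matte black."

    prompt = (
        "Please check a product catalogue entry against text scraped from the web. "
        "Given the schema, the current entry and the scraped text, reply with a "
        "JSON object containing only the most probable corrections.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Entry: {json.dumps(product)}\n"
        f"Scraped text: {scraped}"
    )

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(resp["choices"][0]["message"]["content"])
    # e.g. {"voltage_v": 230, "color": "matte black"}

That prompt is roughly the "few polite English sentences"; everything else is plumbing.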
2. The tech stack itself is rather simple: GPT, LangChain/LlamaIndex, a vector database with embeddings for memory, plugins for external services, and potentially agents to drive workflows (sketch at the end of this answer).
Output quality, testing, monitoring, scalability, etc. also don't differ much from operating normal "old-school" ML models. If anything, it feels simpler.
The tricky part is that the entire notion of LLM-driven microservices is new. The quality of the resulting product largely depends on knowing prompting tricks and following the latest news in the area.
Plus the biggest challenge that customers want solved: "How can I run it on my hardware?"
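To make answer 2 concrete, here is a sketch of that stack with a plain numpy array standing in for the vector database (any real vector store slots in the same way; the documents are invented):

    import numpy as np
    import openai

    docs = ["VPN access is requested via the IT portal ...",
            "Expenses are reimbursed within 30 days ..."]

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    index = embed(docs)  # the "vector database": one embedding row per document

    def answer(question):
        q = embed([question])[0]
        # cosine similarity between the question and every stored document
        scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
        context = docs[int(scores.argmax())]
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
            ],
        )
        return resp["choices"][0]["message"]["content"]

Plugins and agents add more moving parts, but the retrieval core really is this small.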
My biggest challenge using the API so far has been that the output is not reliable. From time to time it randomly outputs notes and comments even though I asked it to reply only in a code block. Also, if I rerun exactly the same prompt, it can output something completely different (different content is fine, but I teach ChatGPT to follow a structure, and that only works in ~90% of cases). I'm using 3.5, not 4.
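A workaround that helps (sketch; the prompt suffix and retry count are arbitrary): reject any reply that isn't a bare code block and simply ask again. Setting temperature to 0 also cuts down the run-to-run variation:

    import re
    import openai

    def ask_for_code(prompt, tries=3):
        for _ in range(tries):
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content":
                           prompt + "\nReply with a single code block and nothing else."}],
                temperature=0,
            )
            text = resp["choices"][0]["message"]["content"]
            # stray notes or comments outside the block make this match fail
            m = re.fullmatch(r"\s*```(?:\w+)?\n(.*?)```\s*", text, re.DOTALL)
            if m:
                return m.group(1)
        raise ValueError("model kept adding commentary around the code block")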
The API also goes down regularly. This is annoying, especially in longer conversations. I had a hard time resuming a conversation, so I usually restart the whole process.
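What would have saved me that restart: persist the message list after every successful turn, so a dropped call resumes where it left off (minimal sketch; the file name is made up):

    import json
    import openai

    HISTORY = "conversation.json"

    def load_history():
        try:
            with open(HISTORY) as f:
                return json.load(f)
        except FileNotFoundError:
            return []

    def turn(user_text):
        messages = load_history()
        messages.append({"role": "user", "content": user_text})
        resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        reply = resp["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
        with open(HISTORY, "w") as f:
            json.dump(messages, f)  # saved only after a successful call
        return reply

If the API call throws, nothing is written, so re-running turn() retries the same exchange instead of losing the conversation.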
However, the overall capabilities are mind-blowing. The system surprises me very often.
The challenge comes in pricing and getting a good result. Generally, the longer the prompt, the better the results, but you have to adjust accordingly.
Also, using only GPT-4 generally doesn't make sense; mixing and matching two different models does (e.g. data extraction can be done with GPT-3.5, but writing a good email should be done with GPT-4).
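A sketch of what that mix and match looks like in code (the task split and inputs are illustrative):

    import openai

    def complete(model, prompt):
        resp = openai.ChatCompletion.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp["choices"][0]["message"]["content"]

    sig = "Best regards, Jane Doe, Acme Corp"  # toy input

    # mechanical extraction: the cheap, fast model is good enough
    fields = complete("gpt-3.5-turbo",
                      f"Extract the name and company from this signature as JSON: {sig}")

    # user-visible writing: pay for GPT-4 only here
    email = complete("gpt-4",
                     f"Write a short, friendly follow-up email to this contact: {fields}")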
For those unaware, DT produces a highlight-reel video after every software sales meeting.
Not sure if the LLM version will be on by default. The algorithmic version of DT is super strong, and just generating the scripts with GPT is MUCH worse.
For us the correct usage is to sprinkle in GPT, e.g. to also add a section to the output video that summarizes the user's goals.
So far I've seen a ton of cool demos, but not many real-life business use cases.
It mainly helps with 2 things:
- allowing engineers to develop their products much faster (especially for requirements engineering, for now)
- allowing us to demo to and onboard users with data from their specific use case (prepopulating their trial account)
Hardware engineering does not at first seem like an obvious fit for LLMs, but I think it is exactly such vertical solutions that will surprise us all the most.
Here are some more details on how hardware design gets concretely aided by LLMs: https://assistedeverything.substack.com/p/todays-ai-sucks-at...
https://www.intercom.com/ai-bot
Looks pretty cool from what I’ve seen
2. One software engineering challenge is that ChatGPT often outputs code in markdown blocks; I've had to emphasize in prompts that it should explicitly mark the language. That inspired me to make the code that appears in these blocks evaluable in place using a Jupyter kernel, and I spent a week making that work (so, e.g., if you type a question into the ChatGPT box on the landing page at https://cocalc.com and code appears in the output, you can often just evaluate it right there). There seem to be endless surprises and challenges, though. For example, a few minutes ago I realized that the giant tracebacks you sometimes get when using Python in Jupyter notebooks are so big (even doing simple things with matplotlib) that they end up causing too much truncation: https://github.com/sagemathinc/cocalc/issues/6634
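For the extraction step itself, here is a regex sketch of pulling language-tagged blocks out of a reply (not CoCalc's actual code; the tag then decides which kernel gets the code):

    import re

    # matches ```lang\n ... ``` fences; the language tag is optional
    FENCE = re.compile(r"```(\w+)?\n(.*?)```", re.DOTALL)

    def code_blocks(reply):
        """Yield (language, code) pairs found in a markdown reply."""
        for lang, code in FENCE.findall(reply):
            yield (lang or "unknown", code)

    reply = "Sure:\n```python\nprint(2 + 2)\n```\nHope that helps!"
    for lang, code in code_blocks(reply):
        print(lang, "->", code.strip())  # python -> print(2 + 2)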
3. I'm mostly using GPT-3.5-turbo rather than GPT-4, even though I have a GPT-4 API key. Aside from cost, GPT-4 takes about 4x as long, which often just feels too long for my use case. The average time for a complete response from GPT-3.5 for my application is about 8 seconds, versus over 30s for GPT-4.
We are ctm.app
Even though it's much slower, GPT-4 is way more consistent than 3.5. The OpenAI APIs have had a lot of flakiness in the past couple of weeks; we retry requests up to 10 times to work around this.
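The retry loop is nothing fancy, roughly this shape (sketch; backoff numbers illustrative):

    import time
    import openai

    def call_with_retries(messages, attempts=10):
        for i in range(attempts):
            try:
                return openai.ChatCompletion.create(model="gpt-4", messages=messages)
            except openai.error.OpenAIError:
                time.sleep(min(2 ** i, 30))  # exponential backoff, capped at 30s
        raise RuntimeError("OpenAI API still failing after retries")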