Advice on Transitioning to Speech AI/Voice Interface Engineering

Question

Hi HN Community,

I'm an experienced audio programmer working with sound, speech, and music, and I'm looking to transition into speech recognition and voice interface engineering. I see this as the next frontier for user experiences, especially with the advent of real-time voice AI systems like GPT-4o voice and other low-latency, intelligent machine listening systems coming to market.

My background:
- I have 10+ years experience as a software engineer, most recently as Director of Engineering at a startup
- I have extensive experience in audio programming with a focus on sound classification, some experiments with speech, and lots more experience with music
- I'm self-taught in audio signal processing (completed a Coursera course)
- I've built sound classifiers using fast.ai in the past (e.g. environmental sounds, bird sounds)
  - https://github.com/aquietlife/whisp
  - https://github.com/aquietlife/flightsoffancy
- I have no formal CS or Engineering degree, but strong practical experience

Current situation:
- I'm engaged in a year-long, self-directed fellowship to dive deep into this field through self-study and mentorship
- I'm interested in exploring the origins of bias in speech recognition technology, particularly around marginalized voices (e.g. AAVE in the U.S.), disfluency, and low-resource languages
- I'm seeking advice on how to best use this year to prepare for meaningful work/research opportunities
- I'm interested in specializing in mitigating risks and biases in speech recognition systems

Goals:
- I want to focus on building an end-to-end speech recognition system (ideally with PyTorch)
- Ultimately I want to transition into speech recognition and voice interface systems work
- I'd like to avoid going back to school full-time
- I want to develop skills and knowledge to make an impact in this field

Questions:
1. What are the most crucial skills/knowledge areas to focus on?
2. Are there specific courses, books, or resources you'd recommend?
3. How can I build a portfolio that demonstrates competence in this field?
4. Are there particular companies or research labs that are good targets for someone with my background?
5. What are some common pitfalls to avoid in this transition?
6. What are some key challenges and approaches in mitigating bias in speech recognition systems? Are there specific resources or research groups focused on this aspect?

Any advice, personal experiences, or insights would be greatly appreciated. Thanks so much for anything you can share to help me along this journey :)

(edits: formatting)

sargstuff · Accepted Answer

The underpinnings of speech bias and speech production are something the field of linguistics handles; it provides the background for formal, structured methods and approaches. (Contrast this with stenography, where the concept is to textually record a reproduction of sound.)