I noticed that his comments add so little to the conversation that trimming his voice out of the podcast would improve its quality.
I figured there might be some automated way of doing it with ML. I have some experience with CNNs on images, but I've never worked with audio before. Any recommendations?
Take labeled audio samples and convert them to a frequency spectrum. For each sample, average (or take the max, min, whatever) the values over the time window. Group the values into frequency bins (e.g. 100 Hz, 120 Hz, 140 Hz), and filter out everything outside the human speaking range.
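A minimal sketch of that feature extraction with NumPy. The frame length, bin width, and the 300–3400 Hz speech band are all assumptions here, not anything prescribed above; tune them to your data.

```python
import numpy as np

def spectrum_features(samples, sample_rate=16000, bin_width=20.0,
                      f_lo=300.0, f_hi=3400.0):
    """Average frequency-magnitude features for one audio clip.

    `samples` is a 1-D float array of raw audio. Frame the clip,
    take an FFT per frame, average the magnitudes over time, then
    group frequencies into fixed-width bins and keep only the band
    where speech energy lives (f_lo..f_hi is an assumed range).
    """
    frame_len = 1024
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    mags = np.abs(np.fft.rfft(frames, axis=1))   # per-frame spectra
    avg = mags.mean(axis=0)                      # average over time
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    keep = (freqs >= f_lo) & (freqs <= f_hi)     # speech band only
    freqs, avg = freqs[keep], avg[keep]
    # Sum the magnitudes into fixed-width frequency bins.
    n_bins = int(np.ceil((f_hi - f_lo) / bin_width))
    idx = ((freqs - f_lo) // bin_width).astype(int)
    idx = np.minimum(idx, n_bins - 1)
    feats = np.zeros(n_bins)
    np.add.at(feats, idx, avg)
    return feats
```

Each clip then becomes one fixed-length vector, so clips of different durations are directly comparable.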
What you then have is a training set whose features are the amplitudes of each frequency bin, with a target of 1 (Lex is speaking) or 0 (somebody else is speaking).
Use your ML or deep learning algorithm of choice and see whether you get useful results out of it.
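As a sketch of that last step, here is a scikit-learn classifier on synthetic stand-in features (the bin count, separation pattern, and dataset are all made up for the demo; in practice X would be your spectrum vectors and y your speaker labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 155                 # clips x frequency bins (assumed sizes)
X = rng.random((n, d))
y = rng.integers(0, 2, n)       # 1 = target speaker, 0 = anyone else
X[y == 1, :20] += 2.0           # fake low-band energy for class 1
X[y == 0, -20:] += 2.0          # fake high-band energy for class 0

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))          # training accuracy on the toy data
```

Logistic regression is just a cheap baseline to validate the pipeline; once that works, swapping in a small feed-forward net or a CNN over the raw spectrogram is straightforward.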