One way to fix this would be to split the page's text into multiple parts and then separately convert them to speech, but that would ruin the flow of speech.
I am curious as to what causes this problem, and whether there is any way to fix it.
You can also search for postags (and their token IDs) that are placed specifically to produce a "pause" in the audio, as they often fix the problem of weird transitions when you split the sentences.
This repo, https://github.com/TensorSpeech/TensorFlowTTS, was very good a few years back.
I wonder if a suitable workaround, until a root-cause fix is discovered, might be to cut silences longer than a certain duration from your output, while processing several inputs in parallel so that a series of pauses doesn't risk stalling the overall flow.
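
For what it's worth, here is a rough sketch of the silence-cutting part, assuming the output is available as a sequence of float samples in [-1.0, 1.0]. The function name `shorten_silences`, the 0.5 s limit, and the 0.01 amplitude threshold are all hypothetical choices for illustration, not anything taken from a particular TTS library. It only caps each silent run at a maximum length and does not cover the parallel-processing part.

```python
from typing import List, Sequence


def shorten_silences(
    samples: Sequence[float],
    sample_rate: int,
    max_silence_s: float = 0.5,   # assumed limit, tune to taste
    threshold: float = 0.01,      # assumed "silence" amplitude, tune to taste
) -> List[float]:
    """Cap every run of near-silent samples at max_silence_s seconds."""
    max_run = int(max_silence_s * sample_rate)  # longest silent run to keep
    out: List[float] = []
    run = 0  # length of the current silent run
    for s in samples:
        if abs(s) < threshold:
            run += 1
            # Keep the sample only while the run is within the limit.
            if run <= max_run:
                out.append(s)
        else:
            run = 0
            out.append(s)
    return out


if __name__ == "__main__":
    # Toy signal: a loud sample, ten silent samples, another loud sample.
    audio = [0.5] + [0.0] * 10 + [0.5]
    trimmed = shorten_silences(audio, sample_rate=4, max_silence_s=0.5)
    print(len(audio), "->", len(trimmed))
```

A per-sample threshold like this is crude. Real audio would probably need an energy measure over short frames instead, but the idea of capping each silent run is the same.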