It's not clear exactly how you would implement it, though. Maybe by recursively dividing the problem into rectangles, directed by the model? E.g. start with the full image and train the model to locate the first element of the HTML, outputting an attention mask for that element along with the corresponding HTML tag and maybe its style. Then recursively run the model twice, once with the attention mask as input and once with the inverted attention mask, and have it extract the next element from each.
Not sure if that would work, but it seems like it might.
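To make the recursion concrete, here is a rough sketch of just the control flow, with the model replaced by a toy stand-in. Everything named here is hypothetical: `extract`, `make_toy_locator`, the `emitted` set, and the pixel-set masks are made up for illustration. It also assumes two things the idea above doesn't settle: that the run on the element's own mask yields its children and the run on the inverted mask (clipped to the current region) yields its siblings, and that the model is told which elements were already emitted so it doesn't return the same one again.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # an "attention mask" as a set of pixel coordinates


@dataclass(frozen=True)
class Element:
    tag: str       # predicted HTML tag (unique per element in this toy)
    pixels: Mask   # predicted attention mask for the element


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical model interface: given a region and the tags emitted so far,
# return the first not-yet-emitted element inside the region, or None.
Locator = Callable[[Mask, Set[str]], Optional[Element]]


def rect(x0: int, y0: int, x1: int, y1: int) -> Mask:
    """Rectangular mask covering x0 <= x < x1 and y0 <= y < y1."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


def extract(locate: Locator, mask: Mask, emitted: Set[str]) -> List[Node]:
    """Recursively split `mask` into the located element and everything else."""
    if not mask:
        return []
    el = locate(mask, emitted)
    if el is None:
        return []
    emitted.add(el.tag)
    # Run 1: the element's own mask (assumed here to yield its children).
    children = extract(locate, el.pixels, emitted)
    # Run 2: the inverted mask, clipped to the current region (siblings).
    siblings = extract(locate, mask - el.pixels, emitted)
    return [Node(el.tag, children)] + siblings


def make_toy_locator(elements: List[Element]) -> Locator:
    """Stand-in for the trained model: looks up ground-truth boxes in order."""
    def locate(mask: Mask, emitted: Set[str]) -> Optional[Element]:
        for el in elements:
            if el.tag not in emitted and el.pixels <= mask:
                return el
        return None
    return locate


def to_html(nodes: List[Node]) -> str:
    """Serialize the recovered tree as nested tags."""
    return "".join(f"<{n.tag}>{to_html(n.children)}</{n.tag}>" for n in nodes)


# Toy 4x2 "page": body fills it, div is the left half with a span in it,
# p is the right half. Listed in document order.
page = [
    Element("body", rect(0, 0, 4, 2)),
    Element("div", rect(0, 0, 2, 2)),
    Element("span", rect(0, 0, 1, 1)),
    Element("p", rect(2, 0, 4, 2)),
]

tree = extract(make_toy_locator(page), rect(0, 0, 4, 2), set())
print(to_html(tree))
```

The toy only shows that the two-run recursion terminates and can recover nesting when the masks are perfect; whether a trained model could produce masks clean enough for this is exactly the open question.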
But I think the hierarchy piece is a bit trickier.
I'm really curious to see what the comments come up with.
https://ai.googleblog.com/2021/01/improving-mobile-app-acces...