Vision Models to Parse UI

Question

Hello. I was curious whether there are any existing vision models out there for either object classification of individual UI elements, or object detection of UI elements within their broader context.

smaddox · Accepted Answer

Not that I'm aware of, but that's a very interesting idea. If you had one that maps from UI to HTML with inline styling, you could automate turning image mock-ups into HTML with inline styling.
It's not clear exactly how you would implement it, though. Maybe by recursively dividing the problem into rectangles, directed by the model? E.g. start with the full image, train the model to locate the first element of the HTML, and have it output an attention mask for that element along with the corresponding HTML tag and maybe style. Then recursively run the model twice, once with the attention mask as input and once with the inverted attention mask as input, and have it extract the next element of each.
Not sure if that would work, but it seems like it might.
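To make the recursion concrete, here is a minimal sketch of the control flow only. Everything in it is an assumption for illustration: the `Locator` interface, the `Node` type, and the hand-written `toy_locate` stand in for a trained model that does not exist here, and masks are plain sets of pixel coordinates. The toy also assumes a child never fills its parent's region completely.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]
# Hypothetical model interface: given the image and a mask of the region to
# look at, return (tag, mask of the first element in that region) or None.
Locator = Callable[[object, Mask], Optional[Tuple[str, Mask]]]


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def parse(image, mask: Mask, locate: Locator) -> List[Node]:
    """Return the sibling elements found inside `mask`, in order."""
    found = locate(image, mask)
    if found is None:
        return []
    tag, element_mask = found
    # Run 1: the element's own mask yields its children.
    node = Node(tag, parse(image, element_mask, locate))
    # Run 2: the inverted mask (the rest of the region) yields later siblings.
    return [node] + parse(image, mask - element_mask, locate)


def to_html(nodes: List[Node]) -> str:
    return "".join(f"<{n.tag}>{to_html(n.children)}</{n.tag}>" for n in nodes)


def rect(x0: int, y0: int, x1: int, y1: int) -> Mask:
    """Mask covering columns x0..x1-1 and rows y0..y1-1."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


# Toy stand-in for the model: a fixed layout listed in document order.
TOY_LAYOUT = [
    ("header", rect(0, 0, 8, 2)),
    ("h1", rect(1, 0, 5, 1)),
    ("main", rect(0, 3, 8, 8)),
    ("p", rect(1, 4, 7, 6)),
    ("button", rect(1, 7, 3, 8)),
]


def toy_locate(image, mask: Mask) -> Optional[Tuple[str, Mask]]:
    # First element that lies strictly inside the region being examined.
    for tag, element_mask in TOY_LAYOUT:
        if element_mask < mask:
            return tag, element_mask
    return None


html = to_html(parse(None, rect(0, 0, 8, 8), toy_locate))
print(html)
```

The two recursive calls amount to a first-child / next-sibling traversal, so the hierarchy falls out of the masks. Whether a real model could be trained to produce those masks reliably is the open question.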

carom · Answer

LayoutLM [1] is the closest that I have seen to what you are asking. It is applied to documents but essentially takes positional and visual information into account for text extraction, for example extracting a total from the line that reads TOTAL. I think this would be the best place to start.
[1] https://arxiv.org/abs/2204.08387

throwaway2016a · Answer

I too would like to see this! I'm pretty sure it can be accomplished by using screenshots as training data for an object recognition algorithm. As with a lot of machine learning, gathering and tagging the data would be tricky and overfitting to specific design systems could be a big problem.
But the hierarchy piece, I think, is a bit trickier.
I'm really curious to see what the comments come up with.

felixr · Answer

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
https://arxiv.org/abs/2210.03347
https://github.com/google-research/pix2struct

lunixbochs · Answer

Android Voice Access blogged about how they use a model to detect and classify buttons:
https://ai.googleblog.com/2021/01/improving-mobile-app-acces...

jetnew · Answer

I'm currently working on this! I'm using traditional computer vision methods (e.g. Canny edge detection), which already work quite well for most websites and applications, but I'm working towards curating a dataset for deep learning. I'm keen to chat!
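For anyone wondering what the edge-based approach looks like, here is a rough sketch under stated assumptions. It is not the poster's code and it is not real Canny: a plain gradient threshold stands in for it (Canny adds smoothing, non-maximum suppression, and hysteresis), and in practice you would more likely call a library routine such as OpenCV's `cv2.Canny`. The function names and the synthetic "screenshot" are made up for illustration. The idea is to mark edge pixels, group connected ones, and take each group's bounding box as a candidate UI element.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive
Image = List[List[int]]          # grayscale rows, values 0..255


def edge_map(img: Image, threshold: int = 50) -> List[List[bool]]:
    """Mark pixels whose brightness differs sharply from the right/lower neighbour."""
    h, w = len(img), len(img[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][x]
            gy = img[min(y + 1, h - 1)][x] - img[y][x]
            edges[y][x] = abs(gx) + abs(gy) >= threshold
    return edges


def bounding_boxes(edges: List[List[bool]]) -> List[Box]:
    """Group 8-connected edge pixels and return one bounding box per group."""
    h, w = len(edges), len(edges[0])
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for y in range(h):
        for x in range(w):
            if not edges[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, top = min(left, cx), min(top, cy)
                right, bottom = max(right, cx), max(bottom, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and edges[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def make_screenshot(width: int, height: int, rects: List[Box]) -> Image:
    """Synthetic test image: white background with dark filled rectangles."""
    img = [[255] * width for _ in range(height)]
    for left, top, right, bottom in rects:
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                img[y][x] = 0
    return img


screenshot = make_screenshot(12, 8, [(2, 2, 5, 4), (8, 5, 10, 6)])
boxes = bounding_boxes(edge_map(screenshot))
print(boxes)
```

Because the gradient looks at the right and lower neighbours, each box comes out one pixel larger on the top and left than the rectangle that was drawn. This finds boxes but says nothing about what kind of element each one is, which is presumably where the dataset and deep learning come in.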

sharemywin · Answer

I'm interested too, if anyone knows of anything.
