HACKER Q&A
📣 golol

Where Are the Text+Video Models?


Hello,

As far as I can tell, none of the big "AI" players (OpenAI/Google/Meta) are releasing text+video multimodal models. What I mean by this is a transformer that operates on language tokens as well as image/video tokens. My intuition says this should be quite feasible right now: even training such a model at low resolution / with a small encoded image dimension should give an LLM a better understanding of the behavior of our physical world (the lack of this is something LeCun criticizes again and again, so clearly he would try to find a solution).
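
To make that concrete, here is a rough PyTorch sketch of the kind of thing I mean: flatten low-resolution frames into patch tokens, project them into the same embedding space as the text tokens, and run everything through one transformer. All names, sizes, and the patch scheme here are made up for illustration; a real model would also need positional embeddings, a causal mask, and a proper video tokenizer.

    # Toy sketch: text tokens and video patch tokens in one transformer sequence.
    # Names and dimensions are invented; this is not any lab's actual architecture.
    import torch
    import torch.nn as nn

    class ToyTextVideoTransformer(nn.Module):
        def __init__(self, vocab_size=32000, d_model=256, patch_dim=3*16*16, n_layers=4):
            super().__init__()
            self.text_embed = nn.Embedding(vocab_size, d_model)
            # Project flattened 16x16 RGB patches into the same space as text embeddings.
            self.patch_embed = nn.Linear(patch_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, text_ids, video_patches):
            # text_ids: (B, T_text), video_patches: (B, T_video, patch_dim)
            txt = self.text_embed(text_ids)
            vid = self.patch_embed(video_patches)
            seq = torch.cat([vid, txt], dim=1)   # one joint sequence of video + text tokens
            h = self.backbone(seq)
            # Predict text tokens from the text positions only.
            return self.lm_head(h[:, vid.size(1):])

    # "Low resolution" in practice: 8 frames of 64x64 RGB -> 8 * (64/16)^2 = 128 patch tokens.
    model = ToyTextVideoTransformer()
    patches = torch.randn(2, 128, 3*16*16)
    tokens = torch.randint(0, 32000, (2, 32))
    logits = model(tokens, patches)              # (2, 32, 32000)
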

So what are the obvious explanations?

1. It doesn't work.

Consider https://www.youtube.com/watch?v=XBRQJLy3M2E and https://youtu.be/6x-Xb_uT7ts?si=dVKt72sYRBp5m9E6

The way I interpret this is that it is possible, at reasonable compute cost (these models are supposed to run in cars), to train models that have a decent understanding of the behavior of the physical world: orders of magnitude better than text-only LLMs. Video+text data is also available via subtitles.

This leads me to think one of these is the case:

2. They exist, but I have never heard of them. I know about Gen-2, but that is not good video generation in my opinion. The videos are more like animatronics: there is rarely any real movement, and the dynamics of our physical world are barely present, unlike in the above examples.

3. I am missing something and it doesn't work or isn't useful.

4. These models are being tested internally for research purposes, but they are not product-worthy and so are not released.

What do you think?


  👤 brucethemoose2 Accepted Answer ✓
There are plenty of multimodal models with image input, like LLaVA, or things like BLIP-Diffusion, which does image+text to image.

Pure text-to-video is very hard, and is being worked on.