Let’s assume you’re a casual user. There is now a range of high-end tools that allow professional video makers to insert video generation models into their workflows. But most people will use this technology in an app or via a website.
You know the drill: “Hey, Gemini, make me a video of a unicorn eating spaghetti. Now make its horn take off like a rocket.” What you get back will be hit or miss, and you’ll typically need to ask the model to take another pass or 10 before you get more or less what you wanted.
So what’s going on under the hood? Why is it hit or miss—and why does it take so much energy?
The latest wave of video generation models is built on what are known as latent diffusion transformers. Yes, that’s quite a mouthful. Let’s unpack each part in turn, starting with diffusion.
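To make the diffusion part concrete, here is a deliberately toy sketch of the core idea: generation starts from pure noise and repeatedly strips away a little estimated noise until a clean signal emerges. Everything in it is a stand-in. The hypothetical `predicted_noise` oracle plays the role of the trained neural network that real models use, and real systems also operate on compressed latent representations rather than raw signals, which we'll get to shortly.

```python
import numpy as np

# Toy sketch of the diffusion idea only; no real model works this way.
# Generation runs noising in reverse: start from pure noise and remove
# a little estimated noise at every step until a clean signal emerges.

rng = np.random.default_rng(0)

# Stand-in for a "clean" image or video frame: a simple 1-D signal.
target = np.sin(np.linspace(0, 2 * np.pi, 32))

def predicted_noise(x):
    """Oracle noise estimate: the gap between the sample and the target.
    A real diffusion model *learns* this estimate from training data
    instead of peeking at the answer."""
    return x - target

NUM_STEPS = 50
x = rng.standard_normal(32)  # begin with pure noise

for step in range(NUM_STEPS):
    remaining = NUM_STEPS - step
    # Spread the estimated noise over the remaining steps, so each pass
    # removes a small slice and the final pass lands on the target.
    x = x - predicted_noise(x) / remaining

print(float(np.abs(x - target).max()))  # ~0.0: the noise has become signal
```

The random starting noise also hints at why results vary from prompt to prompt: different starting points land on different samples, which is part of why generations are hit or miss.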