Google Muse: Is this the future of image generation?

Jan 15, 2023

Could this be the next big discovery in generative AI?

Google AI has released a research paper detailing Muse, a new text-to-image generation model that uses masked generative transformers to produce high-quality images faster than rival models like DALL-E 2 and Imagen.

The groundbreaking technology behind Muse: Masked Generative Transformers

A generative transformer is a type of deep learning model that can generate new data, such as text, images, or audio, based on a given input or description. These models are good at handling long-range dependencies and producing realistic, coherent outputs. For image generation, however, they have a limitation: autoregressive transformers produce an image one token at a time, so sampling a full image takes many sequential steps.

The masked generative transformer addresses this limitation with a technique called "masking". During training, some of the image tokens are randomly hidden, and the model learns to predict them from the tokens that remain visible. Because the model learns to fill in many missing tokens at once, it can predict multiple tokens in parallel at generation time rather than one after another, which the paper credits for much of Muse's efficiency over autoregressive models.
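To make the masking idea concrete, here is a minimal sketch in Python of how a random subset of image tokens might be hidden to build a training example. It is an illustration only, not Muse's actual code: the `mask_tokens` helper, the `MASK_ID` value, the vocabulary size of 8192, and the 50% mask ratio are all assumptions chosen for the example.

```python
import random

MASK_ID = -1       # hypothetical id standing in for a special [MASK] token
VOCAB_SIZE = 8192  # assumed number of distinct image tokens, for illustration


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of image tokens.

    Returns the masked sequence and the sorted positions that were hidden.
    A model would be trained to predict the original token at each of
    those positions.
    """
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


# A toy 4x4 "image" flattened into 16 discrete token ids.
rng = random.Random(0)
image_tokens = [rng.randrange(VOCAB_SIZE) for _ in range(16)]

masked, positions = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
targets = [image_tokens[p] for p in positions]  # what the model must recover
```

In this sketch the training signal comes only from the hidden positions: the model sees `masked` and is scored on how well it recovers `targets`.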

In addition, the model is conditioned on text embeddings from a large language model that has already been trained. This helps it understand the text input and generate images that match the description.

This new development has the potential to revolutionize the field of image generation. Google AI reports that Muse, its model built on this approach, generates high-quality images faster than rival models like DALL-E 2 and Imagen.

Muse is trained to predict randomly masked image tokens given the text embedding from a large language model, and it uses a 900 million parameter model to create visuals.

Google reports that on a TPUv4 chip, Muse can generate an image in as little as 0.5 seconds, compared with 9.1 seconds for Imagen. The researchers trained Muse models of varying sizes, and the paper states that conditioning on a pre-trained large language model is crucial for generating photorealistic, high-quality images.
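To put those timings in perspective, the ratio can be worked out directly. The short Python check below uses only the two figures quoted above; the variable names are illustrative.

```python
# Per-image generation times reported for a TPUv4 chip, in seconds.
muse_seconds = 0.5
imagen_seconds = 9.1

# How many times faster Muse is, according to the reported figures.
speedup = imagen_seconds / muse_seconds
print(f"Muse is roughly {speedup:.1f}x faster per image")
```

On these numbers, Muse comes out at roughly 18 times faster than Imagen per image.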

Google AI's research paper on Muse is a clear indication of how quickly the field of artificial intelligence is advancing. Muse's ability to generate high-quality images faster than rival models, while drawing on a language model's understanding of text, shows the potential of AI not only to understand and process human language, but also to create new forms of media.

As Muse develops further, we can look forward to witnessing what AI can do in creating new forms of media.