'A blue jay standing on a large basket of rainbow macarons.' Credit: Google
About a month after OpenAI announced DALL-E 2, its latest AI system to create images from text, Google has continued the AI 'space race' with its own text-to-image diffusion model, Imagen. Google's results are extremely, perhaps even scarily, impressive. Using a standard measure, FID, Google's Imagen outpaces OpenAI's DALL-E 2 with a zero-shot score of 7.27 on the COCO dataset (lower FID is better). Despite not being trained on COCO, Imagen still performed well here. Imagen also bests DALL-E 2 and other competing text-to-image methods among human raters. You can read about the full testing results in Google's research paper.
'The Toronto skyline with Google brain logo written in fireworks.'
Imagen works by taking a natural language text input, like, 'A Golden Retriever dog wearing a blue checkered beret and red dotted turtleneck,' and then using a frozen T5-XXL encoder to turn that input text into embeddings. A 'conditional diffusion model' then maps the text embedding into a small 64x64 image. Imagen then uses text-conditional super-resolution diffusion models to upsample the 64x64 image, first to 256x256 and then to 1024x1024. Compared to NVIDIA's GauGAN2 method from last fall, Imagen is significantly improved in terms of flexibility and results.
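To make the pipeline described in this article easier to picture, here is a minimal Python sketch that only tracks how data moves through the stages: text goes into an encoder, a base model produces a 64x64 image, and two super-resolution stages take it to 256x256 and then 1024x1024. It does not run any diffusion model. The stage names, the `Stage` class, and the toy `encode_text` function are assumptions made up for illustration and are not Google's code or API; only the resolutions and the ordering come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One diffusion stage in the cascade (shape bookkeeping only)."""
    name: str
    in_res: Optional[int]  # None: the base model starts from noise, not an image
    out_res: int


# Resolutions follow the article; the stage names are hypothetical.
CASCADE: List[Stage] = [
    Stage("base_text_to_image", None, 64),
    Stage("super_res_1", 64, 256),
    Stage("super_res_2", 256, 1024),
]


def encode_text(prompt: str, dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in for the frozen T5-XXL encoder.

    Returns one toy `dim`-sized vector per whitespace-separated word.
    The real encoder uses learned weights and a subword tokenizer.
    """
    return [[(len(tok) + i) / 10.0 for i in range(dim)] for tok in prompt.split()]


def trace_cascade(prompt: str) -> List[str]:
    """Walk the cascade and record the image size after each stage."""
    embeddings = encode_text(prompt)
    log = [f"text_encoder: {len(embeddings)} token embeddings"]
    res: Optional[int] = None
    for stage in CASCADE:
        # Each stage must consume exactly what the previous stage produced.
        assert stage.in_res == res, f"{stage.name} expects {stage.in_res}, got {res}"
        res = stage.out_res
        log.append(f"{stage.name}: {res}x{res}, conditioned on text")
    return log


if __name__ == "__main__":
    for line in trace_cascade("A blue jay standing on a large basket of rainbow macarons."):
        print(line)
```

Every stage in the sketch is labeled as conditioned on the text, mirroring the article's point that both the base model and the super-resolution models are text-conditional.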