This AI Generates Photos Using Only Text Captions as a Guide

Sep 30, 2020

DL Cade

Researchers at the Allen Institute for Artificial Intelligence (AI2) have created a machine learning algorithm that can produce images using only text captions as its guide. The results are somewhat terrifying… but if you can look past the nightmare fuel, this creation represents an important step forward in the study of AI and imaging.

Unlike some of the genuinely mind-blowing machine learning algorithms we’ve shared in the past—see here, here, and here—this creation is more of a proof-of-concept experiment. The idea was to take a well-established computer vision model that can caption photos based on what it “sees” in the image, and reverse it: producing an AI that can generate images from captions, instead of the other way around.

This is a fascinating area of study and, as MIT Technology Review points out, it shows in real terms how limited these computer vision algorithms really are. While even a small child can do both of these things readily—describe an image in words, or conjure a mental picture of an image based on those words—when the Allen Institute researchers tried to generate a photo from a text caption using a model called LXMERT, it generated nonsense in return.

So they set out to modify LXMERT and created X-LXMERT. And while the results that X-LXMERT generates given a text caption aren’t exactly “coherent,” they’re not “nonsense” either—the general idea is usually there. Here are some example images created by the researchers using various captions:

And here are a few examples we generated by plugging various captions into a live demo they created using their model:

A giraffe standing on dirt ground near a tree.

A zebra walking on a road with two cars approaching.

A full view of a home office with toys on the desk

A woman attempting to ski on a flat hill

The above are all based on captions provided by the researchers, and most of them seem to at least contain the major concepts in each description. However, when we tried to create totally new captions based on more esoteric concepts like “photographer,” “photography studio,” or even the word “camera,” the results fell apart:

While the results from and limitations of X-LXMERT probably don’t inspire either awe or the fear of the impending AI revolution, the groundbreaking masking technique that the researchers developed is an important first step in teaching an AI to “fill in the blanks” that any text description inherently leaves out.

This will eventually lead to better image recognition and computer vision, which can only help improve tasks that actually matter to the readers of this site. In other words: the better a computer is at understanding what you mean when you describe an image or image editing task, the more complex the tasks it will be able to perform on that image.

To learn more about this creation or see some more creepy AI-generated images, read the full research paper here or check out an interactive live demo of the model at this link.

(via DPReview)