You've probably heard that a picture is worth a thousand words. But can a large language model (LLM) form an image if it has never seen pictures before?
As it turns out, language models trained solely on text have a solid understanding of the visual world. They can write image-rendering code that generates complex scenes with intriguing objects and compositions, and even when that knowledge isn't applied well at first, LLMs can refine their images. Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) observed this when they asked language models to self-correct their code for various images, with the systems improving their simple clip-art drawings with each query.
The visual knowledge of these language models is gained from the way concepts like shapes and colors are described across the internet, whether in language or code. When users give an instruction like “Draw a parrot in the jungle,” they prompt the LLM to draw on what it has previously read in descriptions. To assess how much visual knowledge LLMs have, the CSAIL team developed an “eye test” for LLMs: Using their “Visual Aptitude Dataset,” they tested the models' abilities to draw, recognize, and self-correct visual concepts. The researchers collected each final draft of these illustrations and trained a computer vision system to recognize the content of real photographs.
“We are essentially training a vision system without directly using visual data,” says Tamar Rott Shaham, co-lead of the study and a postdoc in Electrical Engineering and Computer Science (EECS) at MIT's CSAIL. “Our team prompted language models to write image-rendering code that generated data for us, and then trained a vision system to evaluate natural images. We were inspired by the question of how visual concepts are represented through other media, such as text. To express their visual knowledge, LLMs can use code as a common ground between text and images.”
To create this dataset, the researchers first queried the models to generate code for various shapes, objects, and scenes. They then compiled this code to render simple digital illustrations, such as a row of bicycles. This shows that LLMs understand spatial relationships well enough to draw the two-wheelers in a horizontal row. As another example, the model generated a car-shaped cake by combining two random concepts. The language model also produced a glowing lightbulb, indicating its ability to create visual effects.
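To make that text-to-code-to-image pipeline concrete, here is a minimal sketch of what such a query-and-render step could look like. It is illustrative only: `query_llm` is a hypothetical stand-in for whatever chat API a real system would call, and here it returns canned matplotlib code so the example runs offline.

```python
# Minimal sketch of the text -> rendering code -> image pipeline (illustrative only).
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call: returns Python rendering code for the prompt."""
    # Canned response standing in for model output, so the sketch runs offline.
    return (
        "import matplotlib.pyplot as plt\n"
        "fig, ax = plt.subplots()\n"
        "for i in range(4):  # a horizontal row of simple two-wheelers\n"
        "    ax.add_patch(plt.Circle((i * 3, 0), 1, fill=False))\n"
        "    ax.add_patch(plt.Circle((i * 3 + 1.5, 0), 1, fill=False))\n"
        "ax.set_xlim(-2, 14); ax.set_ylim(-3, 3); ax.set_aspect('equal')\n"
        "fig.savefig('scene.png')\n"
    )


code = query_llm("Write matplotlib code that draws a row of bicycles.")
exec(code)  # run the model-written rendering code to produce the illustration
```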
“Our work shows that an LLM (without multimodal pre-training) knows a lot more than it seems to when you ask it to create an image,” says Pratyusha Sharma, co-lead, EECS PhD student, and CSAIL member. “Let's say you ask it to draw a chair. The model knows other things about that piece of furniture that it may not have rendered immediately, so users can ask the model to improve the visual representation it creates with each iteration. Surprisingly, the model can iteratively enrich the drawing by improving the rendering code to a large extent.”
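The iterative improvement Sharma describes can be pictured as a simple loop in which the model's own code is fed back to it with a request to do better. The sketch below shows only the general shape of such a loop, not the paper's implementation; `query_llm` is again a hypothetical stand-in that simply echoes the code back so the snippet runs on its own.

```python
# Sketch of an iterative self-refinement loop (illustrative only).
# A real system would call a chat-completion API, and the feedback could
# also include the rendered image or a runtime error trace.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns (possibly improved) rendering code."""
    return prompt.split("CODE:\n", 1)[-1]  # placeholder: echo the code back


def refine_drawing(code: str, concept: str, rounds: int = 3) -> str:
    """Repeatedly ask the model to improve its own drawing code."""
    for _ in range(rounds):
        prompt = (
            f"Improve this code so it draws a better '{concept}'. "
            f"Return only Python.\nCODE:\n{code}"
        )
        code = query_llm(prompt)
    return code


initial = "import matplotlib.pyplot as plt\n# ...rough first attempt at a chair...\n"
final_code = refine_drawing(initial, "chair")
```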
The researchers collected these illustrations and used them to train a computer vision system that can recognize objects in real photos (even though it has never seen one). Using this synthetic, text-generated data as its only reference point, the system outperforms vision systems trained on other procedurally generated image datasets.
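As a rough illustration of that training step, the sketch below fits an off-the-shelf classifier on nothing but rendered illustrations. The directory layout (`synthetic_renders/<concept>/*.png`), the ResNet-18 backbone, and the hyperparameters are assumptions made for the sake of the example, not the researchers' exact setup.

```python
# Sketch: train a classifier purely on LLM-rendered illustrations (illustrative only).
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: synthetic_renders/<concept>/*.png, one folder per concept.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("synthetic_renders", transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet18(num_classes=len(train_set.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
# The trained model can then be evaluated on real photographs it has never seen.
```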
The CSAIL team believes it could also be useful to combine the hidden visual knowledge of LLMs with the artistic capabilities of other AI tools, such as diffusion models. Systems like Midjourney sometimes lack the know-how to consistently refine the finer details of an image, which makes it difficult for them to handle requests such as reducing the number of cars pictured or placing one object behind another. If an LLM were to outline the desired change for the diffusion model in advance, the resulting edit could be more satisfactory.
The irony, as Rott Shaham and Sharma admit, is that LLMs sometimes fail to recognize the same concepts that they can draw. This became evident when the models misidentified human re-creations of images in the dataset. Such different representations of the visual world likely triggered the language models' misinterpretations.
While the models had difficulty perceiving these abstract representations, they demonstrated the creativity to draw the same concepts differently each time. When the researchers asked the LLMs to draw concepts such as strawberries and arcades multiple times, they produced images from different angles with varying shapes and colors. This suggests that the models may have actual mental images of visual concepts (rather than reciting examples they had seen before).
The CSAIL team believes this procedure could provide a baseline for evaluating how well a generative AI model can train a computer vision system. In addition, the researchers want to expand the tasks they use to challenge language models. As for their current study, the MIT group notes that they did not have access to the training sets of the LLMs they used, making it difficult to further investigate the origin of their visual knowledge. In the future, they intend to explore training an even better vision model by having the LLM work with it directly.
Sharma and Rott Shaham are joined on the paper by former CSAIL member Stephanie Fu ’22, MNG ’23, and EECS graduate students Manel Baradad, Adrián Rodríguez-Muñoz ’22, and Shivam Duggal, all of whom are CSAIL members; as well as MIT Associate Professor Phillip Isola and Professor Antonio Torralba. Their work was supported in part by a grant from the MIT-IBM Watson AI Lab, a LaCaixa Fellowship, the Zuckerman STEM Leadership Program, and the Viterbi Fellowship. They are presenting their paper this week at the IEEE/CVF Computer Vision and Pattern Recognition Conference.