Whether you're describing the sound of your car's faulty engine or meowing like your neighbor's cat, imitating sounds with your voice can be a helpful way to convey an idea when words aren't enough.
Vocal imitation is the aural equivalent of quickly sketching a picture to convey something you've seen, except that instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. This might seem difficult, but it's something we all do intuitively: to experience it for yourself, try using your voice to mimic the sound of an ambulance siren, a crow, or the ringing of a bell.
Inspired by the cognitive science of how we communicate, researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can produce human-like vocal imitations without training and without ever having “heard” a human vocal impression before.
To achieve this, the researchers designed their system to generate and interpret sounds in much the same way we do. They began by building a model of the human vocal tract that simulates how vibrations from the larynx are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into account the context-specific ways people choose to communicate sounds.
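To make that two-part recipe concrete, here is a minimal, illustrative sketch: a crude source-filter vocal tract (a glottal pulse train shaped by formant resonances) whose control parameters are adjusted by an off-the-shelf optimizer until its output matches a target sound. The parameter choices, loss, and optimizer are assumptions made for illustration, not the CSAIL team's implementation.

```python
# Illustrative sketch only: a toy vocal tract whose controls are fitted to a target sound.
import numpy as np
from scipy.signal import lfilter
from scipy.optimize import minimize

SR = 16000  # sample rate, Hz

def synthesize(params, duration=0.5):
    """Glottal pulse train (larynx) shaped by two formant resonators (throat/mouth)."""
    f0, f1, f2 = params                       # pitch and two formant frequencies
    n = int(SR * duration)
    t = np.arange(n) / SR
    source = (np.sin(2 * np.pi * f0 * t) > 0.95).astype(float)  # crude pulse train
    out = source
    for formant in (f1, f2):                  # cascade of two-pole resonant filters
        r = 0.97
        w = 2 * np.pi * formant / SR
        b, a = [1.0], [1.0, -2 * r * np.cos(w), r ** 2]
        out = lfilter(b, a, out)
    return out / (np.abs(out).max() + 1e-9)

def spectral_loss(params, target):
    """Distance between log-magnitude spectra of the imitation and the target."""
    imit = synthesize(params, duration=len(target) / SR)
    spec = lambda x: np.log(np.abs(np.fft.rfft(x)) + 1e-6)
    return np.mean((spec(imit) - spec(target)) ** 2)

# Example: fit the vocal tract controls to imitate a synthetic target sound.
target = synthesize([440.0, 800.0, 1800.0])
fit = minimize(spectral_loss, x0=[200.0, 500.0, 1500.0], args=(target,),
               method="Nelder-Mead")
print("fitted controls (f0, f1, f2):", fit.x)
```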
The model can take many sounds from the world and produce a human-like imitation of them, including sounds such as rustling leaves, the hiss of a snake, and the siren of an approaching ambulance. The model can also be run in reverse to guess real-world sounds from human vocal imitations, similar to how some computer vision systems can retrieve high-quality images from sketches. For example, the model can correctly distinguish the sound of a human imitating a cat's “meow” from its “hiss.”
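As a rough analogy for that reverse direction, the toy sketch below guesses which candidate real-world sound a recorded imitation refers to by nearest-neighbor matching of log spectra. The actual system performs this inference through its vocal tract model, so the feature choice and matching rule here are purely illustrative assumptions.

```python
# Toy "reverse" inference: which real-world sound does this imitation point to?
import numpy as np

def log_spectrum(audio, n_fft=8192):
    """Log-magnitude spectrum on a fixed frequency grid."""
    return np.log(np.abs(np.fft.rfft(audio, n=n_fft)) + 1e-6)

def guess_source(imitation, candidates):
    """Return the name of the candidate sound whose spectrum best matches the imitation.

    `candidates` maps names (e.g. "meow", "hiss") to audio arrays.
    """
    imit_spec = log_spectrum(imitation)
    scores = {name: np.mean((log_spectrum(sound) - imit_spec) ** 2)
              for name, sound in candidates.items()}
    return min(scores, key=scores.get)

# e.g. guess_source(recorded_imitation, {"meow": meow_clip, "hiss": hiss_clip})
```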
Down the line, this model could potentially lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.
The co-lead authors — MIT CSAIL graduate students Kartik Chandra SM '23 and Karima Ma and undergraduate researcher Matthew Caren — note that computer graphics researchers have long recognized that realism is rarely the ultimate goal of visual expression. For example, an abstract painting or a child's chalk drawing can be just as expressive as a photograph.
“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and even a deeper understanding of human cognition,” notes Chandra. “Just as a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phono-realistic way people express the sounds they hear. This teaches us about the process of auditory abstraction.”
“The goal of this project was to understand and computationally model vocal imitation, which we consider to be a kind of auditory equivalent of sketching in the visual domain,” says Caren.
The Art of Imitation, in three parts
The team developed three increasingly nuanced versions of the model to compare against human vocal imitations. First, they created a baseline model that simply aimed to produce imitations as similar to real-world sounds as possible, but this model didn't match human behavior very well.
The researchers then designed a second, “communicative” model. According to Caren, this model takes into account what makes a sound distinctive to a listener. For instance, you'd likely imitate the sound of a motorboat by mimicking the rumble of its engine, since that's its most distinctive auditory feature, even if it isn't the loudest aspect of the sound (compared with, say, the splashing of water). This second model produced imitations that were better than the baseline model's, but the team wanted to improve it even further.
To take their method one step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different depending on the amount of effort you put into them. It costs time and energy to produce sounds that are perfectly precise,” says Chandra. The researchers' full model accounts for this by trying to avoid utterances that are very rapid, loud, or high- or low-pitched, which people are less likely to use in conversation. The result: more human-like imitations that closely match many of the decisions humans make when imitating the same sounds.
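The three-stage progression can be pictured as a single objective with three terms: a baseline reconstruction error, a communicative term that up-weights the target's most distinctive frequencies, and an effort penalty on extreme pitch and loudness. The sketch below is an illustrative composite of that idea; the weights and penalty forms are assumptions, not the paper's actual losses.

```python
# Illustrative composite objective combining the three model stages described above.
import numpy as np

def imitation_objective(imitation_spec, target_spec, distinctiveness, pitch_hz, loudness,
                        w_comm=1.0, w_effort=0.1):
    # 1) Base model: plain spectral reconstruction error.
    base = np.mean((imitation_spec - target_spec) ** 2)

    # 2) Communicative model: emphasize the frequencies that make the target
    #    recognizable (e.g. a motorboat's engine rumble), via per-bin weights.
    comm = np.mean(distinctiveness * (imitation_spec - target_spec) ** 2)

    # 3) Effort term: discourage utterances people are unlikely to produce in
    #    conversation (very loud, or far outside a comfortable pitch range).
    effort = loudness ** 2 + (np.log2(pitch_hz) - np.log2(200.0)) ** 2

    return base + w_comm * comm + w_effort * effort
```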
After building this model, the team conducted a behavioral experiment to see whether human judges perceived the AI- or human-generated vocal imitations as better. Notably, participants in the experiment favored the AI model 25 percent of the time overall, and as much as 75 percent for an imitation of a motorboat and 50 percent for an imitation of a gunshot.
Towards more expressive sound technology
Caren, who is passionate about technology for music and art, imagines that this model could help artists better communicate sounds to computational systems, and help filmmakers and other content creators generate AI sounds that are more nuanced and tailored to a specific context. It could also let a musician quickly search a sound database by imitating a noise that's hard to describe in a text prompt, for instance.
Meanwhile, Caren, Chandra, and Ma are studying the implications of their model in other areas, including the development of language, how young children learn to speak, and even the imitative behavior of birds such as parrots and songbirds.
The team still has plenty of work to do on the current version of its model: it struggles with some consonants, like “z,” which led to inaccurate impressions of some sounds, like the buzzing of bees. It also can't yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Stanford University linguistics professor Robert Hawkins says that language is full of onomatopoeia, words that mimic but don't fully reproduce the things they describe, such as “meow,” which only loosely resembles the sound cats make. “The processes that take us from the sound of a real cat to a word like ‘meow’ reveal a lot about the intricate interplay of physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who was not involved in the CSAIL research. “This model represents an exciting step toward formalizing and testing theories of those processes, showing that both physical constraints from the human vocal tract and social pressures from communication are needed to explain the prevalence of vocal imitation.”
Caren, Chandra, and Ma co-wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, associate professor in MIT's Department of Electrical Engineering and Computer Science, and Joshua Tenenbaum, professor of brain and cognitive sciences at MIT and a member of the Center for Brains, Minds and Machines. Their work was supported, in part, by the Hertz Foundation and the National Science Foundation, and was presented at SIGGRAPH Asia in early December.