NYU researchers are developing a groundbreaking AI speech synthesis system

A team of researchers at New York University has made advances in neural speech decoding, bringing us closer to a future in which individuals who have lost the ability to speak can regain their voice.

The published study introduces a novel deep learning framework that accurately translates brain signals into intelligible speech.

Such systems could let people with brain injuries caused by strokes, degenerative diseases or physical trauma speak again, by decoding their thoughts or intended speech from neural signals.

The NYU team's system features a deep learning model that maps electrocorticography (ECoG) signals to a set of interpretable speech features, such as pitch, loudness and the spectral content of speech sounds.

ECoG data captures the essential elements of speech production and allows the system to generate a compact representation of the intended speech.

The second stage features a neural speech synthesizer that converts the extracted speech features into a spectrogram, which can then be converted into a speech waveform.

This waveform can then be played back as natural-sounding synthetic speech.

How the study works

The study involves training an AI model that can drive a speech synthesizer, so that people with speech loss can speak using only the electrical impulses from their brains.

Here is how it works in detail:

1. Collect brain data

The first step is to gather the raw data required to train the speech decoding model. Researchers worked with 48 participants who were undergoing neurosurgery for epilepsy.

During the study, these participants were asked to read hundreds of sentences aloud while their brain activity was recorded using ECoG grids.

These grids are placed directly on the surface of the brain and capture electrical signals from the brain regions involved in speech production.

2. Mapping brain signals to speech

Using the speech data, the researchers developed a sophisticated AI model that maps the recorded brain signals to specific speech characteristics, such as pitch, loudness, and the distinct frequencies that make up different speech sounds.
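To make the idea of mapping signals to a speech feature concrete, here is a minimal sketch. It is not the study's model, which is a deep network: a plain linear decoder trained by gradient descent stands in for it, and the three "electrode" channels, the synthetic data, and the names `fit_linear_decoder`, `decode`, `TRUE_W` and `TRUE_B` are all assumptions made for illustration.

```python
import random


def fit_linear_decoder(signals, targets, lr=0.05, epochs=500):
    """Fit weights w and bias b so that dot(w, x) + b approximates each target.

    Toy stand-in for a neural decoder: batch gradient descent on the
    mean squared error between predicted and actual feature values.
    """
    n_channels = len(signals[0])
    n = len(signals)
    w = [0.0] * n_channels
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_channels
        grad_b = 0.0
        for x, y in zip(signals, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def decode(w, b, x):
    """Predict one feature value (here: loudness) from one signal vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Synthetic data (assumption): loudness is a fixed linear mix of 3 channels.
rng = random.Random(0)
TRUE_W, TRUE_B = [0.8, -0.3, 0.5], 0.1
signals = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
loudness = [sum(wi * xi for wi, xi in zip(TRUE_W, x)) + TRUE_B for x in signals]

w, b = fit_linear_decoder(signals, loudness)
print([round(wi, 3) for wi in w], round(b, 3))
```

On this noise-free toy data the fitted weights recover the mixing weights used to generate it. Real ECoG recordings are noisy and the relationship is nonlinear, which is why the researchers used a deep learning model instead.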

3. Synthesizing speech from features

The third step involves converting the speech features extracted from brain signals back into audible speech.

The researchers used a special speech synthesizer that takes the extracted features and generates a spectrogram, a visual representation of the speech sounds.
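As a rough illustration of how features can become a spectrogram, the sketch below builds one frame of a magnitude spectrum from just two features, pitch and loudness. This is a toy harmonic model assumed for illustration, not the synthesizer from the study; the function names, the 100 Hz bin width and the `rolloff` factor are hypothetical.

```python
def synthesize_frame(pitch_hz, loudness, n_bins=40, bin_hz=100.0, rolloff=0.5):
    """Return one magnitude spectrum with energy at the harmonics of pitch_hz.

    Each bin covers bin_hz of frequency. Harmonic h of the pitch gets
    amplitude loudness * rolloff ** (h - 1), so higher harmonics are weaker.
    """
    spectrum = [0.0] * n_bins
    if pitch_hz <= 0:
        return spectrum  # treated as a silent frame in this toy model
    harmonic = 1
    while True:
        k = round(harmonic * pitch_hz / bin_hz)
        if k >= n_bins:
            break
        spectrum[k] += loudness * rolloff ** (harmonic - 1)
        harmonic += 1
    return spectrum


def synthesize_spectrogram(features):
    """Turn a sequence of (pitch_hz, loudness) pairs into a list of frames."""
    return [synthesize_frame(pitch, loud) for pitch, loud in features]


# Four frames of hypothetical decoded features: (pitch in Hz, loudness).
features = [(200.0, 1.0), (200.0, 0.8), (300.0, 0.5), (0.0, 0.0)]
spec = synthesize_spectrogram(features)

for frame in spec:
    # Print the first ten bins of each frame as a crude text "image".
    print(" ".join(f"{v:.2f}" for v in frame[:10]))
```

Each row is one time step and each column one frequency band, which is all a spectrogram is. A realistic synthesizer would also model unvoiced sounds and the spectral shaping that distinguishes one speech sound from another.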

4. Evaluating the results

The researchers compared the speech produced by their model with the participants' original speech.

They used objective metrics to measure the similarity between the two and found that the generated speech closely matched the content and rhythm of the original.
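The article does not name the metrics used. One common way to score similarity between an original and a decoded spectrogram is the Pearson correlation coefficient, which is assumed here purely as an example; the small spectrograms below are made-up numbers, not study data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spectrogram_similarity(a, b):
    """Flatten two spectrograms (lists of frames) and correlate them."""
    flat_a = [v for frame in a for v in frame]
    flat_b = [v for frame in b for v in frame]
    return pearson(flat_a, flat_b)


# Made-up 3-frame, 3-bin spectrograms for illustration only.
original = [[0.0, 1.0, 0.5], [0.2, 0.9, 0.4], [0.1, 0.3, 0.8]]
decoded_good = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.5], [0.1, 0.4, 0.7]]
decoded_bad = [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5], [0.7, 0.6, 0.1]]

print(round(spectrogram_similarity(original, decoded_good), 3))
print(round(spectrogram_similarity(original, decoded_bad), 3))
```

A value near 1 means the decoded spectrogram rises and falls where the original does; a value near 0 or below means it does not.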

5. Testing new words

To make sure that the model can handle new words that it has not seen before, certain words were intentionally withheld during the model's training phase, and the model's performance on these unseen words was then tested.

The model's ability to accurately decode even new words demonstrates its potential to generalize and to handle varied speech patterns.
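A word-level holdout of this kind can be sketched as follows. The sample layout (a word paired with a recording number), the example words and the helper name `split_by_word` are hypothetical; the point is only that every recording of a held-out word goes to the test set, so the model never sees that word during training.

```python
def split_by_word(samples, held_out_words):
    """Split (word, recording_id) samples so held-out words appear only in test."""
    held = set(held_out_words)
    train = [s for s in samples if s[0] not in held]
    test = [s for s in samples if s[0] in held]
    return train, test


# Hypothetical recordings: each word was spoken twice.
samples = [
    ("water", 1), ("water", 2),
    ("hello", 1), ("hello", 2),
    ("garden", 1), ("garden", 2),
]

train, test = split_by_word(samples, {"garden"})

train_words = {word for word, _ in train}
test_words = {word for word, _ in test}
print(sorted(train_words), sorted(test_words))
```

Splitting by word rather than by recording matters: a random split over recordings would leave other repetitions of the same word in the training set, and the test would no longer measure performance on unseen words.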

NYU's speech synthesis system. Source: Nature (open access)

The top section of the diagram above describes a process for converting brain signals into speech. First, a decoder converts these signals into speech parameters over time. A synthesizer then creates sound images (spectrograms) from these parameters. Another tool converts these images back into sound waves.

The lower section describes a system that helps train the brain signal decoder by imitating speech. It takes a sound image, converts it into speech parameters and uses those to create a new sound image. This part of the system learns from actual speech sounds in order to improve.

After training, only the top process is required to convert brain signals into speech.

A key advantage of the NYU system is its ability to achieve high-quality speech decoding without the need for ultra-high-density electrode arrays, which are impractical for long-term use.

Essentially, it offers a more lightweight, portable solution.

Another achievement is the successful decoding of speech from both the left and right hemispheres of the brain, which is essential for patients with damage to one side of the brain.

Using AI to convert thoughts into speech

The NYU study builds on previous research on neural speech decoding and brain-computer interfaces (BCIs).

In 2023, a team at the University of California, San Francisco enabled a paralyzed stroke survivor to construct sentences at a rate of 78 words per minute using a BCI that synthesized both vocalizations and facial expressions from brain signals.

Other recent studies have examined using AI to interpret various aspects of human thought based on brain activity. Researchers have demonstrated the ability to generate images, text and even music from fMRI and EEG data.

For example, one study at the University of Helsinki used EEG signals to guide a generative adversarial network (GAN) in creating facial images that matched participants' thoughts.

Meta AI also developed a technology to partially decode what someone was hearing using non-invasively collected brain waves.

Opportunities and challenges

The NYU method uses more widely available and clinically useful electrodes than previous methods, making it more accessible.

While this is exciting, there are major obstacles to overcome if we are to see widespread adoption.

For one thing, collecting high-quality brain data is a complex and time-consuming undertaking. Individual differences in brain activity also make generalization difficult, meaning a model trained on one group of participants may not work well for another.

Still, the NYU study represents a step toward wider adoption by demonstrating high-precision speech decoding using more lightweight electrode arrays.

Looking forward, the NYU team wants to refine its models for real-time speech decoding, bringing us closer to the ultimate goal of enabling natural, fluid conversations for individuals with speech disabilities.

They also want to adapt the system to implantable wireless devices that could be used in everyday life.

