MIT researchers present generative AI for databases

July 8, 2024


A new tool makes it easier for database users to perform complex statistical analyses of tabular data without needing to know what is going on behind the scenes.

GenSQL, a generative AI system for databases, could help users make predictions, detect anomalies, guess missing values, fix errors, or generate synthetic data with just a couple of keystrokes.

For example, if the system were used to analyze the medical data of a patient with a history of hypertension, it could flag a blood pressure reading that is low for that particular patient but would otherwise fall within the normal range.

GenSQL automatically integrates a tabular dataset with a generative probabilistic AI model, which can account for uncertainty and adjust its decision-making based on new data.

Additionally, GenSQL can be used to generate and analyze synthetic data that mimic the real data in a database. This could be particularly useful in situations where sensitive data, such as patient records, cannot be shared, or when real data are sparse.

This new tool is built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by hundreds of thousands of developers worldwide.

“Historically, SQL taught the business world what a computer could do. They didn’t have to write custom programs, they just had to ask questions of a database in a high-level language. We think that, as we move from just querying data to asking questions of models and data, we are going to need an analogous language that teaches people the coherent questions they can ask a computer that has a probabilistic model of the data,” says Vikash Mansinghka, senior author of a paper introducing GenSQL and a principal research scientist and director of the Probabilistic Computing Project in the Department of Brain and Cognitive Sciences at MIT.

When the researchers compared GenSQL to common AI-based approaches to data analysis, they found that it was not only faster but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are explainable, so users can read and edit them.

“If you look at the data and try to find meaningful patterns using only some simple statistical rules, you might miss important interactions. You really want to capture in a model the correlations and dependencies of the variables, which can be quite complicated. With GenSQL, we want to enable a large set of users to query their data and their model without having to know all the details,” adds lead author Mathieu Huot, a scientist in the Department of Brain and Cognitive Sciences and member of the Probabilistic Computing Project.

Contributors to the paper include Matin Ghavami and Alexander Lew, MIT graduate students; Cameron Freer, a research scientist; Ulrich Schaechtel and Zane Shelby of Digital Garage; Martin Rinard, an MIT professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.

Combining models and databases

SQL, which stands for Structured Query Language, is a programming language for storing and manipulating information in a database. In SQL, users can query data using keywords, for instance by summing, filtering, or grouping database records.
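As a rough illustration of the kind of keyword-based query described above, the sketch below filters, groups, and sums records with ordinary SQL, run through Python's built-in sqlite3 module. The `developers` table, its columns, and its values are invented for this example and do not come from the GenSQL paper.

```python
import sqlite3

# Hypothetical toy table; names and values are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE developers (city TEXT, language TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO developers VALUES (?, ?, ?)",
    [
        ("Seattle", "Rust", 150000.0),
        ("Seattle", "Python", 130000.0),
        ("Boston", "Rust", 140000.0),
        ("Boston", "Python", 120000.0),
        ("Boston", "Python", 110000.0),
    ],
)


def salary_by_city(connection, min_salary):
    """Filter rows by salary, group them by city, and sum salaries per group."""
    query = """
        SELECT city, COUNT(*) AS n, SUM(salary) AS total
        FROM developers
        WHERE salary >= ?
        GROUP BY city
        ORDER BY city
    """
    return connection.execute(query, (min_salary,)).fetchall()


rows = salary_by_city(conn, 120000.0)
for city, n, total in rows:
    print(city, n, total)
```

A query like this only summarizes the records that exist. It cannot say anything about a record that is missing or about how plausible a new record would be, which is the gap the article goes on to describe.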

However, querying a model can provide deeper insights, because models can capture what data imply for an individual. For example, a developer who wonders whether she is underpaid is likely more interested in what salary data mean for her personally than in trends from database records.

The researchers noticed that SQL did not provide an effective way to incorporate probabilistic AI models. At the same time, approaches that use probabilistic models for reasoning did not support complex database queries.

To fill this gap, they developed GenSQL, which lets users query both a dataset and a probabilistic model using a straightforward yet powerful formal programming language.

A GenSQL user uploads their data and a probabilistic model, which the system automatically integrates. They can then run queries on the data that also draw on the probabilistic model running behind the scenes. This not only enables more complex queries, but can also provide more accurate answers.

For example, a query in GenSQL might be: “How likely is it that a developer from Seattle knows the Rust programming language?” Looking only at the correlation between columns in a database can miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
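To make the Seattle and Rust question concrete, here is a minimal sketch of a conditional probability estimate computed in plain Python. It is not GenSQL syntax and not the model from the paper: it assumes a very simple Beta-Bernoulli model as a stand-in, and the function name `prob_knows_language`, the record layout, and the data are all hypothetical.

```python
def prob_knows_language(records, city, language, prior_yes=1.0, prior_no=1.0):
    """Posterior mean of P(knows language | city) under a Beta prior.

    Assumption for this sketch: each developer in a city independently knows
    the language with some unknown probability, which gets a
    Beta(prior_yes, prior_no) prior.
    """
    in_city = [r for r in records if r["city"] == city]
    yes = sum(1 for r in in_city if language in r["languages"])
    no = len(in_city) - yes
    return (yes + prior_yes) / (yes + no + prior_yes + prior_no)


# Invented records for illustration only.
records = [
    {"city": "Seattle", "languages": {"Rust", "Python"}},
    {"city": "Seattle", "languages": {"Python"}},
    {"city": "Seattle", "languages": {"Rust"}},
    {"city": "Boston", "languages": {"Java"}},
]

p_seattle = prob_knows_language(records, "Seattle", "Rust")
p_unseen = prob_knows_language(records, "Austin", "Rust")
print(p_seattle, p_unseen)
```

Unlike a plain count, the estimate stays defined for a city with no records, where it falls back to the prior. A richer generative model, of the kind the article describes, would also account for dependencies among many columns at once, which this one-variable sketch does not attempt.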

The probabilistic models GenSQL uses are also auditable, so users can see which data the model uses for decision-making. In addition, these models provide calibrated measures of uncertainty along with each answer.

For example, with this calibrated uncertainty, if the model is queried for the predicted outcomes of different cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL would tell the user that it is uncertain, and how uncertain it is, rather than overconfidently advocating for the wrong treatment.
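The idea of reporting an estimate together with its uncertainty can be sketched with a small calculation. The example below is an assumption-laden illustration, not GenSQL's mechanism: it models a treatment's success rate with a Beta posterior and compares a well-represented group with an underrepresented one. The function name `beta_posterior_summary` and the counts are hypothetical.

```python
import math


def beta_posterior_summary(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return (mean, standard deviation) of a Beta posterior over a success rate.

    Assumption for this sketch: outcomes are independent Bernoulli trials with
    a Beta(prior_a, prior_b) prior on the success rate.
    """
    a = successes + prior_a
    b = failures + prior_b
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Invented counts: many observed outcomes for one group, very few for another.
well_represented = beta_posterior_summary(600, 400)
under_represented = beta_posterior_summary(3, 2)

print("well represented:  mean=%.3f sd=%.3f" % well_represented)
print("under represented: mean=%.3f sd=%.3f" % under_represented)
```

Both groups have a similar estimated success rate, but the standard deviation for the group with only five observations is far larger. Surfacing that spread next to the estimate is what lets a user see that an answer rests on little evidence.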

Faster and more accurate results

To evaluate GenSQL, the researchers compared their system to common baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in a few milliseconds while providing more accurate results.

They also applied GenSQL in two case studies: in one, the system identified mislabeled clinical trial data, and in the other, it generated accurate synthetic data that captured complex relationships in genomics.

Next, the researchers plan to apply GenSQL more broadly to conduct large-scale modeling of human populations. With GenSQL, they can generate synthetic data to draw inferences about things like health and salary, while controlling what information is used in the analysis.

They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long run, the researchers want to enable users to make natural language queries in GenSQL. Their goal is to eventually develop a ChatGPT-like AI expert that one could talk to about any database and that grounds its answers in GenSQL queries.

This research is funded partly by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.
