To train more powerful large language models, researchers use large datasets that mix disparate data from hundreds of web sources.
But as these datasets are combined and recombined into multiple collections, essential details about their origins and the constraints on their use are often lost or obscured.
Not only does this raise legal and ethical concerns, but it can also hurt a model's performance. For example, if a dataset is miscategorized, someone training a machine learning model for a particular task may inadvertently use data that was not designed for that task.
In addition, data from unknown sources may contain biases that cause a model to make unfair predictions once it is deployed.
To improve data transparency, a team of interdisciplinary researchers from MIT and other institutions conducted a systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of those datasets were missing license information and about 50 percent contained erroneous information.
Building on these findings, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset's creators, sources, licenses, and permitted uses.
“These kinds of tools can help regulators and practitioners make informed decisions about the use of AI, and advance the responsible development of AI,” says Alex “Sandy” Pentland, an MIT professor, head of the Human Dynamics Group in the MIT Media Lab, and co-author of a new open-access paper about the project.
The Data Provenance Explorer could help AI users build more effective models by allowing them to select training datasets that fit their model's intended purpose. In the long run, this could improve the accuracy of AI models in real-world settings, such as evaluating loan applications or responding to customer inquiries.
“One of the best ways to understand the capabilities and limitations of an AI model is to understand what data it was trained on. When there is misattribution and confusion about where the data came from, you have a serious transparency problem,” says Robert Mahari, a doctoral student in the MIT Human Dynamics Group, a law student at Harvard Law School, and co-lead author of the paper.
Mahari and Pentland are joined on the study by co-author Shayne Longpre, a doctoral student in the Media Lab; Sara Hooker, who leads the research lab Cohere for AI; and other researchers at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift. The research is published today in .
Focus on fine-tuning
Researchers often use a technique called fine-tuning to improve the capabilities of a large language model that will be deployed for a specific task, such as question answering. For fine-tuning, they build carefully curated datasets designed to boost the model's performance on that one task.
The MIT researchers focused on these fine-tuning datasets, which are often developed by researchers, academic organizations, or companies and licensed for specific uses.
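To make the idea concrete, here is a minimal, hypothetical sketch of fine-tuning a small language model on a curated question-answering dataset with the Hugging Face transformers and datasets libraries. It is not drawn from the paper; the dataset name "my-org/curated-qa", its "question"/"answer" columns, and the choice of GPT-2 as the base model are placeholders.

```python
# Hypothetical fine-tuning sketch: adapt a small base model to one task
# using a carefully curated, task-specific dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"                        # small base model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder curated dataset with "question" and "answer" columns.
dataset = load_dataset("my-org/curated-qa", split="train")

def tokenize(example):
    text = f"Question: {example['question']}\nAnswer: {example['answer']}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Whether a model fine-tuned this way can be released or used commercially depends on the license attached to that curated dataset, which is exactly the information the audit found to be missing or wrong so often.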
When crowdsourcing platforms aggregate such datasets into larger collections that practitioners can use for fine-tuning, some of the original license information is often lost.
“These licenses ought to matter, and they ought to be enforceable,” says Mahari.
For example, if the licensing terms of a dataset are incorrect or incomplete, someone could invest a great deal of money and time developing a model that they might later have to take down because some of the training data contains private information.
“People may end up training models without even understanding the capabilities, concerns, or risks of those models, which ultimately arise from the data,” adds Longpre.
At the outset of this study, the researchers formally defined data provenance as the combination of a dataset's origin, its creation and licensing, and its characteristics. From there, they developed a structured auditing procedure to trace the data provenance of more than 1,800 text dataset collections from popular online repositories.
After determining that more than 70 percent of those datasets carried “unspecified” licenses that left out a great deal of information, the researchers worked backward to fill in the gaps. Through these efforts, they reduced the share of datasets with “unspecified” licenses to around 30 percent.
Their work also revealed that the correct licenses were often more restrictive than those assigned by the repositories.
In addition, they found that nearly all dataset creators were concentrated in the Global North, which could limit a model's capabilities if it is trained for deployment in a different region. For example, a Turkish-language dataset created predominantly by people in the US and China might not contain culturally significant features, Mahari explains.
“We almost give ourselves the illusion that the datasets are more diverse than they really are,” he says.
Interestingly, the researchers also observed a dramatic increase in restrictions placed on datasets created in 2023 and 2024. This may be driven by concerns among academics that their datasets could be used for unintended commercial purposes.
A user-friendly tool
To enable others to obtain this information without a manual audit, the researchers developed the Data Provenance Explorer. In addition to sorting and filtering datasets according to specific criteria, the tool lets users download a data provenance card that provides a concise, structured overview of a dataset's characteristics.
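As a rough illustration of the kind of filtering and summarization such a tool supports, the sketch below works through a toy catalog of dataset metadata. The records, field names, and license strings are invented for this example and are not entries from the Data Provenance Explorer itself.

```python
# Toy sketch: filter dataset records by license terms and emit a simple
# "provenance card" style summary for each dataset that passes the filter.
from dataclasses import dataclass, asdict
import json

@dataclass
class DatasetRecord:
    name: str
    creator: str
    source: str
    license: str
    languages: list
    permitted_uses: list

# Invented example records, standing in for repository metadata.
catalog = [
    DatasetRecord("example-qa", "Example University", "web crawl",
                  "CC-BY-4.0", ["en"], ["research", "commercial"]),
    DatasetRecord("example-dialogue", "Example Lab", "model-generated",
                  "unspecified", ["en", "tr"], []),
]

# Keep only datasets whose license is known and clearly permits commercial use.
usable = [r for r in catalog
          if r.license != "unspecified" and "commercial" in r.permitted_uses]

# Print a concise, structured summary for each remaining dataset.
for record in usable:
    print(json.dumps(asdict(record), indent=2))
```

The point of the example is the workflow, not the code: a practitioner starts from explicit provenance fields, filters on the constraints that matter for their use case, and keeps a structured summary of what they trained on.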
“We hope this is not only a step toward understanding the landscape, but also helps people make more informed decisions about the data they train on in the future,” says Mahari.
In the future, the researchers plan to expand their analysis to examine data provenance for multimodal data, including video and speech. They also plan to study how the terms of service of websites that serve as data sources are reflected in datasets.
As part of their research, they are also engaging with regulators to discuss their findings and the particular copyright implications of fine-tuning data.
“We need data provenance and transparency from the start, when people create and release these datasets, to make it easier for others to derive these insights,” says Longpre.
“Many proposed policies assume that we can accurately assign and identify the licenses associated with data. This work first shows that this is not the case, and then significantly improves the provenance information that is available,” says Stella Biderman, executive director of EleutherAI, who was not involved in this work. “Section 3 also includes relevant legal discussion. This is very valuable for machine learning practitioners outside of companies large enough to have their own legal teams. Many people who want to build AI systems for the public good are currently quietly struggling to figure out how to deal with data licensing, because the internet is not designed to make data provenance easy to work out.”