
Meta and Google researchers' new data curation method could transform self-supervised learning

As AI researchers and companies race to train bigger and better machine learning models, curating suitable datasets is becoming increasingly difficult.

To solve this problem, researchers from Meta AI, Google, INRIA, and Université Paris Saclay have introduced a new technique for automatically curating high-quality datasets for self-supervised learning (SSL).

Their method uses embedding models and clustering algorithms to curate large, diverse, and balanced datasets without the need for manual annotation.

Balanced datasets in self-supervised learning

Self-supervised learning has become a cornerstone of modern AI, powering large language models, visual encoders, and even domain-specific applications such as medical imaging.

Unlike supervised learning, which requires every training example to be annotated, SSL trains models on unlabeled data, allowing both models and datasets to be scaled up on raw data.

However, data quality is critical to the performance of SSL models. Datasets randomly collected from the internet are not evenly distributed.

This means that a few dominant concepts take up a large part of the dataset while others appear only rarely. This skewed distribution can bias the model toward the common concepts and prevent it from generalizing to unseen examples.

“Datasets for self-supervised learning should be large, diverse, and balanced,” the researchers write. “Data curation for SSL therefore involves building datasets with all these properties. We propose to build such datasets by selecting balanced subsets of large online data repositories.”

Currently, a lot of manual effort goes into curating balanced datasets for SSL. Although manual curation is not as time-consuming as labeling every training example, it is still a bottleneck that hinders training models at scale.

Automatic dataset curation

To address this challenge, the researchers propose an automatic curation technique that creates balanced training datasets from raw data.

Their approach uses embedding models and clustering-based algorithms to rebalance the data so that less frequent and rare concepts are properly represented alongside predominant ones.

First, a feature extraction model computes the embeddings of all data points. Embeddings are numerical representations of the semantic and conceptual features of various data types such as images, audio, and text.
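
To make this step concrete, here is a minimal sketch of computing image embeddings. It assumes a pretrained ResNet-50 from torchvision as the feature extractor; the paper's experiments rely on self-supervised encoders, so this model choice is purely illustrative.

```python
# Minimal sketch of the embedding step: map raw images to feature vectors.
# The backbone choice (torchvision ResNet-50) is illustrative, not the
# authors' exact setup.
import torch
from torchvision import models, transforms
from PIL import Image

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # emit 2048-d features instead of class logits
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(image_paths):
    """Return an (N, 2048) array of L2-normalized image embeddings."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in image_paths])
    features = backbone(batch)
    return torch.nn.functional.normalize(features, dim=1).numpy()
```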

Next, the researchers use k-means, a popular clustering algorithm that starts from randomly placed centroids, assigns each data point to its nearest centroid, and recomputes a new mean for each group, or cluster, iterating until related examples end up grouped together.

However, classic k-means clustering tends to create more clusters for concepts that are overrepresented in the dataset.
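
This effect is easy to reproduce with scikit-learn. In the toy example below (ours, not the paper's), one concept has fifty times more points than another, and most of the ten centroids end up representing the dominant concept.

```python
# Toy demonstration: plain k-means allocates most clusters to the
# dominant concept when the data is imbalanced.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
common = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(5000, 2))  # dominant concept
rare = rng.normal(loc=(10.0, 10.0), scale=1.0, size=(100, 2))   # rare concept
X = np.vstack([common, rare])

centroids = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_
near_rare = int((np.linalg.norm(centroids - np.array([10.0, 10.0]), axis=1) < 5).sum())
print(f"{near_rare} of 10 centroids represent the rare concept")  # typically 1
```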

To solve this problem and create balanced clusters, the researchers apply a multi-level hierarchical k-means approach that builds a tree of data clusters from the bottom up.

In this approach, each new clustering stage applies k-means to the clusters obtained in the stage immediately before it. The algorithm uses a sampling technique to ensure that concepts are well represented at every level of the tree.

Hierarchical k-means data curation (Source: arXiv)

This is clever because it applies k-means both horizontally, across the most recent clusters of points, and vertically, back through earlier levels (indicated upward in the figure above). This avoids losing less well-represented examples as the algorithm moves upward toward fewer but more meaningful top-level clusters (the line plots at the top of the figure above).
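
In code, the bottom-up construction might look like the stripped-down sketch below. The per-level cluster counts are made-up hyperparameters, and the sampling step the paper uses to keep concepts balanced within each level is omitted for brevity.

```python
# Stripped-down sketch of hierarchical k-means: cluster the embeddings,
# then repeatedly cluster the previous level's centroids, building a tree
# of clusters from the bottom up. Level sizes here are arbitrary examples.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(embeddings: np.ndarray, level_sizes=(1000, 100, 10)):
    """Return one (centroids, labels) pair per level, from bottom to top."""
    tree = []
    points = embeddings
    for k in level_sizes:
        km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(points)
        tree.append((km.cluster_centers_, km.labels_))
        points = km.cluster_centers_  # the next level clusters these centroids
    return tree
```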

The researchers describe the technique as a “generic, downstream-task-agnostic curation algorithm” that “provides the ability to derive interesting features from completely uncurated data sources, regardless of the specifics of the applications at hand.”

In other words, any raw dataset can be turned into a diverse and balanced training dataset through hierarchical clustering.
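
Given such a tree, one simple way to draw a balanced subset, sketched below under the same simplifying assumptions as the sketch above, is to propagate each point's assignment up to the top level and then sample an equal number of examples from every top-level cluster.

```python
# Sketch of balanced selection: follow each point's cluster assignments up
# the tree, then sample the same number of examples per top-level cluster
# so rare concepts are as visible as common ones.
import numpy as np

def balanced_subset(tree, per_cluster=100, seed=0):
    """Return indices of a subset balanced across top-level clusters."""
    rng = np.random.default_rng(seed)
    top = tree[0][1]                    # point -> bottom-level cluster
    for _, labels in tree[1:]:
        top = labels[top]               # lift assignments one level up
    picked = []
    for cluster in np.unique(top):
        members = np.flatnonzero(top == cluster)
        n = min(per_cluster, len(members))
        picked.extend(rng.choice(members, size=n, replace=False))
    return np.array(picked)
```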

Evaluation of automatically curated datasets

The researchers conducted extensive experiments with computer vision models trained on hierarchically clustered datasets of images without manual labels or descriptions.

They found that training image features on their curated dataset led to better performance on image classification benchmarks, especially on out-of-distribution examples (images that differ significantly from the training data). The models also performed significantly better on retrieval benchmarks.

Notably, the performance of models trained on their automatically curated dataset was nearly on par with that of models trained on manually curated datasets, which take significant human effort to create.

The researchers also applied their algorithm to text data to train large language models and to satellite imagery to train a model that predicts tree cover height. In both cases, training on the curated datasets led to significant improvements across all benchmarks.

Interestingly, their experiments show that models trained on well-balanced datasets can compete with state-of-the-art models even though they were trained on fewer examples.

The automatic dataset curation technique presented in this work can have important implications for applied machine learning projects, especially in industries where labeled and curated data are hard to come by.

The technique has the potential to significantly reduce the cost of annotating and manually curating datasets for self-supervised learning. A well-trained SSL model can be fine-tuned for downstream supervised learning tasks with very few labeled examples. This method could pave the way for more scalable and efficient model training.
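
As a rough illustration of that workflow, the sketch below trains a linear probe on frozen features using only ten labeled examples per class; the synthetic embeddings are a stand-in for a real SSL encoder's output.

```python
# Sketch: adapt a frozen SSL encoder to a supervised task with few labels
# by training only a linear classifier ("linear probe") on its embeddings.
# Synthetic features stand in for real encoder output here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, dim = 5, 128
class_centers = rng.normal(size=(num_classes, dim))

def fake_ssl_features(n_per_class):
    """Stand-in for frozen-encoder embeddings: class center plus noise."""
    X = np.vstack([c + 0.5 * rng.normal(size=(n_per_class, dim))
                   for c in class_centers])
    y = np.repeat(np.arange(num_classes), n_per_class)
    return X, y

X_train, y_train = fake_ssl_features(10)    # only 10 labeled examples per class
X_test, y_test = fake_ssl_features(200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"linear-probe accuracy: {probe.score(X_test, y_test):.2f}")
```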

Another important use case may be for large companies like Meta and Google, which sit on huge amounts of raw data that have not been prepared for model training. “We believe that [automatic dataset curation] will become increasingly important in future training pipelines,” the researchers write.
