Zyphra introduces Zyda, a 1.3-trillion-token language modeling dataset that it claims outperforms Pile, C4 and arxiv

Zyphra Technologies has announced the release of Zyda, a large dataset designed to train language models. It consists of 1.3 trillion tokens and is a filtered and deduplicated mashup of existing high-quality open datasets, specifically RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arxiv. The company claims its ablation studies show that Zyda performs better than the datasets it is based on. An early version of the dataset was used to train Zyphra's Zamba model, and the dataset will be available for download on Hugging Face.

Photo credit: Zyphra

“(We) came up with Zyda when (we) were attempting to create a pre-training dataset for (our) Zamba model suite,” Zyphra CEO Krithik Puthalath tells VentureBeat in an email. “The problem it solves is that it provides an extremely high-quality dataset in the trillion-token range for training language models, which otherwise anyone who wanted to train a language model would have to recreate themselves using something like Zyda.”

Apparently, the company wanted to build a better proverbial mousetrap. Zyphra combined several existing open datasets and then spent time cleaning the tokens to ensure they were unique. Specifically, it performed syntactic filtering to eliminate low-quality documents before carrying out an “aggressive” deduplication effort “inside and across” the datasets. “The cross-deduplication is very important because we found that many datasets contained a lot of documents that were also present in other datasets,” the company explains in a blog post. This probably shouldn't be surprising, since many of them likely draw on shared sources such as Common Crawl.
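To make the idea concrete, here is a minimal Python sketch of filtering followed by deduplication within and across datasets. It is not Zyphra's pipeline: the article does not describe the company's exact methods, so this example substitutes exact-match hashing of normalized text for the deduplication step, and the filter thresholds (`min_words`, `max_symbol_ratio`) and function names are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_syntactic_filter(
    text: str, min_words: int = 5, max_symbol_ratio: float = 0.3
) -> bool:
    """Hypothetical quality heuristic: enough words, not dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio


def filter_and_dedup(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Filter each dataset, then drop documents already seen in any dataset.

    A single shared `seen` set is what makes the deduplication apply both
    inside one dataset and across all of them.
    """
    seen: set[str] = set()
    kept: dict[str, list[str]] = {}
    for name, docs in datasets.items():
        kept[name] = []
        for doc in docs:
            if not passes_syntactic_filter(doc):
                continue
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # duplicate of a document kept earlier
            seen.add(digest)
            kept[name].append(doc)
    return kept
```

Because earlier datasets claim a document first, the order in which sources are processed decides which copy survives. Exact hashing only catches documents that match after normalization; catching near-duplicates would need a fuzzy technique, which this sketch does not attempt.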

Photo credit: Zyphra

Of the seven open language modeling datasets used, RefinedWeb (43.6 percent) is the largest component of Zyda. SlimPajama (18.7 percent) and StarCoder (17.8 percent) are second and third, respectively. The rest each account for single-digit percentages.

“Overall, we discarded about 40 percent of our original dataset and reduced the token count from about 2 (trillion) tokens to 1.3 (trillion).”

Because it's open source, developers can tap into this language modeling dataset to build smarter AI. That could mean improved word prediction in sentence composition, text generation, language translation, and more. If it works as well as Zyphra says, developers will only need to use one dataset, reducing production time and saving costs.

And in case you're curious how this latest dataset got the name Zyda, Puthalath reveals that it's a portmanteau of “Zyphra Dataset.”

You can download Zyda from Zyphra's Hugging Face page.
