Cohere for AI launches open source LLM for 101 languages

Today, Cohere for AI, the nonprofit research lab established by Cohere in 2022, unveiled Aya, an open-source large language model (LLM) supporting 101 languages — more than twice the number of languages covered by existing open-source models.

The researchers also released the Aya dataset, a corresponding collection of human annotations. This is key because one obstacle to training models for less common languages is that there is less source material to train on. But according to Cohere for AI, the lab’s engineers also found ways to improve model performance with less training data.

The Aya project, which was launched in January 2023, was a “huge endeavor” with over 3,000 collaborators around the world, including teams and participants from 119 countries, said Sara Hooker, VP of research at Cohere and leader of Cohere for AI.

The project produced over 513 million instruction fine-tuned annotations (data labels that help classify information). “I don’t think we knew at the time quite how enormous it would be as a project,” Hooker told VentureBeat in an interview. She called this kind of data the highly valuable “gold dust” that is applied at the end of LLM training (as opposed to pre-training data scraped from the internet).

Ivan Zhang, co-founder and CTO of Cohere, posted on X that “we’re releasing human demonstrations across 100+ languages to further scale intelligence and ensure that it serves more of humanity than just the english literate world,” calling it “yet another impossible scientific and operational feat achieved by” Hooker and the Cohere for AI team.

Potential of LLMs for languages and cultures largely ignored

According to a Cohere blog post, the new model and dataset are meant to help “researchers unlock the powerful potential of LLMs for dozens of languages and cultures largely ignored by most advanced models on the market today.”

Cohere for AI said that it benchmarked the Aya model’s performance against available open-source, massively multilingual models. It surpasses the best open-source models, such as mT0 and Bloomz, in performance on benchmark tests “by a wide margin,” and expands coverage to more than 50 previously unserved languages, including Somali and Uzbek.

Hooker pointed out that any model covering more than six languages is typically considered “extreme” in terms of multilingual performance, and that once there are about 25 languages, “that’s ‘massively multilingual’: there are only a few models that actually tackle that many languages and report performance on them.”

A data ‘cliff’ outside of English

That means there is a data “cliff” of sorts for fine-tuning data outside of English, Hooker explained, so Aya’s data is “incredibly rare.”

“What I expect will happen is that people will select languages that they want to share from this dataset, and they will be able to iterate and create models which serve subsets of languages, and that’s a huge need,” she said. “But the biggest divide I see right now technically is personalization. These models have been used all over the world and so people want it to work for them. And they want to personalize, and part of that just requires data in different languages.”
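As a rough illustration of the workflow Hooker describes, selecting a subset of languages from a multilingual instruction dataset might look like the following minimal sketch. The records are made-up placeholders, and the field names (`language`, `inputs`, `targets`) and helper functions are assumptions chosen for illustration, not a confirmed description of the Aya dataset’s schema or of Cohere for AI’s tooling.

```python
from collections import defaultdict

# Placeholder records standing in for instruction fine-tuning examples.
# The field names here are assumptions, not the confirmed Aya schema.
SAMPLE_RECORDS = [
    {"language": "Somali", "inputs": "placeholder prompt 1", "targets": "placeholder answer 1"},
    {"language": "English", "inputs": "placeholder prompt 2", "targets": "placeholder answer 2"},
    {"language": "Uzbek", "inputs": "placeholder prompt 3", "targets": "placeholder answer 3"},
    {"language": "Somali", "inputs": "placeholder prompt 4", "targets": "placeholder answer 4"},
    {"language": "Serbian", "inputs": "placeholder prompt 5", "targets": "placeholder answer 5"},
]


def select_languages(records, wanted):
    """Keep only the records whose language is in the wanted list."""
    wanted_set = set(wanted)
    return [record for record in records if record["language"] in wanted_set]


def count_by_language(records):
    """Count how many records exist per language."""
    counts = defaultdict(int)
    for record in records:
        counts[record["language"]] += 1
    return dict(counts)


# Build a subset for a hypothetical model serving only Somali and Uzbek.
subset = select_languages(SAMPLE_RECORDS, ["Somali", "Uzbek"])
print(count_by_language(subset))
```

Counting examples per language before training is one simple way to see how sharply the available data drops off for a given language.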

Aleksa Gordic, previously a researcher at Google DeepMind, is currently building a full-stack generative AI platform for language-specific LLMs and developed YugoGPT, an LLM that he says outperformed Mistral and Llama 2 for Serbian, Bosnian, Croatian, and Montenegrin.

“I definitely think that Aya and all similar multilingual data efforts are crucial,” he told VentureBeat. “LLMs feed on data and if you want to support non-English languages you need high quality and ideally abundant data sources for that target language of interest so you can build high quality LLMs.”

The effort is “definitely not enough,” he added, but “is a step in the right direction.” A global research community is needed to work on this, he explained, “and we also need support from governments around the world to understand the importance of building large and high quality data sources. That way you preserve your language, your culture in the brand new AI world.”

Cohere for AI’s Aya model and datasets are already available on Hugging Face.
