Europe’s largest seeded startup Mistral AI releases first model, outperforming Llama 2 13B


Mistral AI, the six-month-old Paris-based startup that made headlines with its unique Word Art logo and a record-setting $118 million seed round — reportedly the largest seed in the history of Europe — today released its first large language AI model, Mistral 7B.

The 7.3-billion-parameter model outperforms larger offerings, including Meta’s Llama 2 13B (one of the smaller of Meta’s newer models), and is said to be the most powerful language model for its size to date.

It handles English-language tasks while also delivering natural coding capabilities, giving enterprises another option for a range of use cases.

Mistral said it is open-sourcing the new model under the Apache 2.0 license, allowing anyone to fine-tune and use it anywhere (from local machines to the cloud) without restriction, including for enterprise use cases.



Meet Mistral 7B

Founded earlier this year by alumni of Google’s DeepMind and Meta, Mistral AI is on a mission to “make AI useful” for enterprises by using only publicly available data and data contributed by customers.

Now, with the release of Mistral 7B, the company is starting this journey, providing teams with a small model capable of low-latency text summarisation, classification, text completion and code completion.

While the model has just been announced, Mistral AI claims it already bests its open-source competition. In benchmarks covering a range of tasks, the model outperformed Llama 2 7B and 13B by a clear margin.

For instance, in the Massive Multitask Language Understanding (MMLU) test, which covers 57 subjects across mathematics, US history, computer science, law and more, the new model delivered an accuracy of 60.1%, while Llama 2 7B and 13B delivered a little over 44% and 55%, respectively.

Similarly, in tests covering commonsense reasoning and reading comprehension, Mistral 7B outperformed both Llama models, with accuracies of 69% and 64%, respectively. The only area where Llama 2 13B matched Mistral 7B was the world knowledge test, which Mistral says may be due to Mistral 7B’s limited parameter count restricting the amount of knowledge it can compress.

“For all metrics, all models were re-evaluated with our evaluation pipeline for accurate comparison. Mistral 7B significantly outperforms Llama 2 13B on all metrics, and is on par with Llama 34B (on many benchmarks),” the company wrote in a blog post.

Mistral 7B vs Llama

As for coding tasks, while Mistral calls the new model “vastly superior,” benchmark results show it still does not outperform the fine-tuned CodeLlama 7B. The Meta model delivered accuracies of 31.1% and 52.5% in the 0-shot HumanEval and 3-shot MBPP (hand-verified subset) tests, while Mistral 7B sat close behind with 30.5% and 47.5%, respectively.

High-performing small model could benefit businesses

While this is just the start, Mistral’s demonstration of a small model delivering high performance across a range of tasks could mean major benefits for businesses.

For example, on MMLU, Mistral 7B delivers the performance of a Llama 2 model more than 3x its size (23 billion parameters). This would directly save memory and reduce costs without affecting final outputs.
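The memory saving can be made concrete with rough arithmetic. The sketch below assumes weights are stored at 16-bit precision (2 bytes per parameter) and counts weights only, ignoring activations and caches. The function name and the precision are illustrative assumptions, not figures from Mistral.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 parameters) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


# Parameter counts as reported in the article; 2 bytes per parameter is assumed.
mistral_7b_gb = weight_memory_gb(7.3)
llama_equiv_gb = weight_memory_gb(23.0)

print(f"Mistral 7B weights: ~{mistral_7b_gb:.1f} GB")
print(f"23B-parameter equivalent: ~{llama_equiv_gb:.1f} GB")
print(f"Ratio: ~{llama_equiv_gb / mistral_7b_gb:.2f}x")
```

Under these assumptions the ratio depends only on the parameter counts, so it holds at other precisions as long as both models use the same one.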

The company says it achieves faster inference using grouped-query attention (GQA) and handles longer sequences at a lower cost using sliding window attention (SWA).
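The article does not spell out why GQA speeds up inference. One commonly cited effect is that groups of query heads share a smaller set of key/value heads, which shrinks the key/value cache that must be read at each decoding step. The sketch below illustrates that arithmetic; the layer count, head counts, head dimension and 16-bit precision are hypothetical values chosen for illustration, not a configuration taken from the article.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one cached tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, for illustration only.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096
N_KV_HEADS_GQA = 8  # assumed: each KV head is shared by 4 query heads

# Standard multi-head attention keeps one KV head per query head.
mha_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_QUERY_HEADS, HEAD_DIM)
gqa_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS_GQA, HEAD_DIM)

print(f"Multi-head cache:    {mha_bytes / 2**20:.0f} MiB")
print(f"Grouped-query cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

The cache shrinks by the ratio of query heads to key/value heads, whatever the other dimensions are.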

“Mistral 7B uses a sliding window attention (SWA) mechanism, in which each layer attends to the previous 4,096 hidden states. The main improvement, and reason for which this was initially investigated, is a linear compute cost of O(sliding_window.seq_len). In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for a sequence length of 16k with a window of 4k,” the company wrote.
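To make the linear-cost claim concrete, here is a minimal sketch of a causal sliding-window mask in plain Python. It is not Mistral's implementation: the function names are hypothetical, the sizes are toy values, and it assumes the window includes the current position.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [
        # Causal (j <= i) and within the last `window` positions, counting i itself.
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Toy sizes keeping the 4:1 ratio of the quoted 16k sequence and 4k window.
SEQ_LEN, WINDOW = 16, 4

windowed = attended_pairs(sliding_window_mask(SEQ_LEN, WINDOW))
# A window as long as the sequence reduces to ordinary causal attention.
full_causal = attended_pairs(sliding_window_mask(SEQ_LEN, SEQ_LEN))

print(f"Sliding window: {windowed} pairs (bound: {WINDOW * SEQ_LEN})")
print(f"Full causal:    {full_causal} pairs")
```

The windowed count never exceeds `WINDOW * SEQ_LEN`, so it grows linearly with sequence length, while the full causal count grows quadratically.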

The company plans to build on this work by releasing a bigger model capable of better reasoning and working in multiple languages, expected to debut sometime in 2024.

For now, Mistral 7B can be deployed anywhere (from local machines to AWS, GCP or Azure clouds) using the company’s reference implementation, the vLLM inference server and SkyPilot.
