Meet LLEMMA, the math-focused open-source AI that outperforms rivals

In a new paper, researchers from several universities and EleutherAI, a nonprofit research lab renowned for its open-source models, introduce LLEMMA, an open-source large language model (LLM) specifically designed to solve mathematical problems.

LLEMMA outperforms other leading math-focused language models of comparable size, including Google’s Minerva, and offers a robust platform for further research.

Although LLEMMA is not a flawless math solver, it represents a significant stride towards the development of specialized large language models and can propel AI research in new directions.

State-of-the-art math models

LLEMMA is built on Code Llama, an adaptation of Meta’s open-source Llama 2 model fine-tuned on code-specific datasets. The researchers developed two versions of the model, one with 7 billion parameters and another with 34 billion. Both were further pretrained on Proof-Pile-2, a dataset created by the researchers that blends scientific papers, web data featuring mathematics, and mathematical code.

“LLEMMA is pretrained on a diverse distribution of mathematics-related data, and is not tuned for a particular task. Therefore, we expect that LLEMMA can adapt to many other tasks via task-specific finetuning and few-shot prompting,” the researchers write.
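To make "few-shot prompting" concrete, here is a minimal sketch of how a handful of worked examples can be placed ahead of a new question so that a base model continues the pattern. The `Problem:`/`Solution:` layout and the function name `build_few_shot_prompt` are illustrative assumptions, not the format used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Concatenate solved examples, then the new question with an empty solution slot.

    The "Problem:/Solution:" layout is an assumed format for illustration only.
    """
    parts = [f"Problem: {q}\nSolution: {a}" for q, a in examples]
    # The prompt ends with an open "Solution:" so the model writes the answer next.
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)


examples = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5."),
    ("What is 10 / 4?", "10 / 4 = 2.5. The answer is 2.5."),
]
prompt = build_few_shot_prompt(examples, "What is 7 * 8?")
```

The resulting string would be sent to the model as-is; no weights are updated, which is what distinguishes few-shot prompting from task-specific finetuning.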

In their experiments, the researchers found that LLEMMA outperformed all known open models on mathematical benchmarks. “We conclude that continued pretraining on Proof-Pile-2 is effective for improving a pretrained model’s ability to perform mathematical problem solving,” they write.

Moreover, LLEMMA exhibits the ability to use tools and prove formal theorems without additional finetuning. It can leverage computational tools, such as the Python interpreter and formal theorem provers, to solve mathematical problems. The use of tools can further strengthen the model’s problem-solving capabilities by providing an external source of knowledge to verify and correct its answers.
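One way to picture tool use with a Python interpreter: the model writes a short program instead of a final number, and the program's output becomes the answer. The sketch below is an assumption-laden illustration, not the researchers' pipeline. `fake_model` is a hypothetical stand-in that returns a fixed program where a real system would call the LLM, and the generated code runs in a separate interpreter process with a timeout.

```python
import subprocess
import sys
from typing import Optional


def run_generated_program(program: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run model-written Python in a separate interpreter and return its stdout.

    Returns None if the program crashes or exceeds the timeout.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def fake_model(problem: str) -> str:
    """Hypothetical stand-in for the LLM: always returns the same program."""
    return "from fractions import Fraction\nprint(Fraction(1, 3) + Fraction(1, 6))"


def solve_with_tool(problem: str) -> Optional[str]:
    # The interpreter, not the model, does the exact arithmetic.
    program = fake_model(problem)
    return run_generated_program(program)


answer = solve_with_tool("What is 1/3 + 1/6?")
```

A separate process with a timeout only guards against crashes and infinite loops. It is not a security sandbox, so untrusted generated code would need stronger isolation in practice.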

Providing tools for further research

While several large language models have been fine-tuned for mathematics, Google’s Minerva, based on its PaLM model, stands out. However, it’s not open source.

LLEMMA, on the other hand, surpasses Minerva on an “equi-parameter basis.” This means that LLEMMA-7B outperforms Minerva-8B, and LLEMMA-34B is nearly on par with Minerva-62B.

The researchers have released all their assets: the 7-billion- and 34-billion-parameter models, the Proof-Pile-2 dataset, and the code to replicate their experiments. Proof-Pile-2 includes AlgebraicStack, a new dataset with 11 billion tokens of code specifically related to mathematics.

According to the researchers, LLEMMA is the first open-source model that matches the performance of state-of-the-art closed-source models on mathematical benchmarks. Because the models, data, and code are openly available, other researchers can build on the work and extend it.

“We hope that LLEMMA and Proof-Pile-2 will be a useful base for future work on understanding language model generalization and dataset composition, investigating the limits of domain-specific language models, using language models as tools for mathematicians, and improving the mathematical capabilities of language models,” the researchers write.

The broader impact of math-focused LLMs

LLEMMA is part of a broader effort to develop LLMs that specialize in a specific field, rather than a general model capable of performing multiple tasks. The model demonstrates that with better and larger domain-specific datasets, smaller models can still yield significant results. For instance, LLEMMA-7B outperforms Code Llama-34B on almost all math reasoning datasets.

The researchers note that “a domain-specific language model may offer superior capabilities for a given computational cost, or lower computational cost for a given level of capability.” This is in line with other research that shows small models can continue to improve when trained on a very large dataset composed of high-quality examples.

The suitability of LLMs for solving math problems has been a topic of extensive debate. Measuring the reasoning capabilities of LLMs is very difficult. Often, models score high on math benchmarks due to “data contamination,” where the test examples were included in the training data, essentially meaning the model has memorized the answers. There are also studies showing that an LLM might provide different answers to the same question when it is formulated in slightly different ways. And some scientists argue that LLMs are fundamentally unsuitable for math because of their stochastic nature.

The LLEMMA developers took meticulous steps to verify whether the benchmark examples were included in the training data. While they found similar examples in the training and test data, they concluded that “a nontrivial match between a test example and a training document did not imply that the model generated a memorized correct answer.”
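A common way to look for this kind of overlap is to check whether any long run of consecutive words (an n-gram) from a test example also appears in a training document. The following is a minimal sketch of that idea under simplifying assumptions: whitespace tokenization, lowercasing, and an illustrative default of `n=30` that is not taken from this article. The function names are hypothetical.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_overlap(test_example: str, training_doc: str, n: int = 30) -> bool:
    """True if the test example and the training document share at least one n-gram."""
    return not ngrams(test_example, n).isdisjoint(ngrams(training_doc, n))


# Small n so the toy strings below are long enough to produce n-grams.
hit = has_overlap(
    "what is the sum of 2 and 3",
    "Q: What is the sum of 2 and 3 ? A: 5",
    n=3,
)
miss = has_overlap(
    "compute the derivative of x squared",
    "the integral of x cubed",
    n=3,
)
```

A hit from a check like this only flags a candidate. As the researchers' conclusion indicates, a matched test example does not by itself show that the model reproduced a memorized answer, so flagged cases still need to be inspected.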

Progress in developing LLMs that can reliably solve math problems can enhance the reasoning and planning capabilities of language models. LLEMMA’s results, together with the release of the models and code, can also benefit other fields by showing how LLMs can be specialized for different domains.

The researchers suggest that “solving mathematical problems requires pattern matching against a large body of specialized prior knowledge, thus serving as an ideal setting for domain adaptation.” Even if LLMs do not become the ultimate tools for math problem-solving, they can form the basis for other types of models and AI research.

The researchers also believe that “language models capable of strong mathematical reasoning are upstream of a number of research topics, such as reward modeling, reinforcement learning for reasoning, and algorithmic reasoning.” It will be interesting to see what kind of new research LLEMMA could inspire.