
Research project releases multilingual open source LLM

November 26, 2024


The OpenGPT-X research project has made its “Teuken-7B” large language model available for download on Hugging Face. The large language model (LLM) has been trained from scratch in all 24 official languages of the European Union (EU) and contains seven billion parameters.

Researchers and companies can leverage this commercially usable open source model for their own AI applications. Funded by the German Federal Ministry of Economic Affairs and Climate Action (BMWK), the OpenGPT-X consortium — led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS — has developed the LLM to be open source with a distinctly European perspective.

The model has already been optimized for chat through “instruction tuning”, which is used to adapt LLMs to correctly understand instructions from users. This is important when using the models in practice, such as in a chat application.
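Instruction tuning trains a model on pairs of a user instruction and a desired response. A minimal sketch of what such a training record might look like — the field names and role markers here are illustrative, not the OpenGPT-X project's actual data schema:

```python
# Hypothetical instruction-tuning record; the schema is illustrative only,
# not the format actually used by the OpenGPT-X project.
instruction_example = {
    "messages": [
        {"role": "user", "content": "Summarize the following text in one sentence: ..."},
        {"role": "assistant", "content": "The text describes ..."},
    ]
}

def to_training_text(record):
    """Flatten a chat record into a single training string with role markers."""
    return "\n".join(f"<{m['role']}>: {m['content']}" for m in record["messages"])

print(to_training_text(instruction_example))
```

During fine-tuning, many such flattened examples teach the model to continue the `<assistant>:` turn appropriately whenever it sees a `<user>:` instruction.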

“The ‘Teuken-7B’ model is freely available, providing a public, research-based alternative for use in academia and industry,” says Prof. Stefan Wrobel, Director of Fraunhofer IAIS. “Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt and develop the model for their own work and applications. In this way, we want to contribute, both within the scientific community and together with companies from different industries, to the growing demand for transparent and customizable generative AI.”

Teuken-7B is currently one of the few large language models developed multilingually from the ground up. It contains approximately 50 percent non-English pre-training data and has been trained in all 24 official European languages. The LLM has proven to be stable and reliable in its performance across multiple languages. This provides added value, particularly for international companies and organizations with multilingual communication requirements, products and services. The open source model allows companies and organizations to run their own customized models in real-world applications, while sensitive corporate data can remain within the company.

The OpenGPT-X team also addressed a number of research questions, such as how to train and operate multilingual AI language models in a more energy- and cost-efficient way. To this end, the project developed a multilingual “tokenizer”. The task of a tokenizer is to break down words into individual word components — the fewer tokens needed, the faster and more energy-efficiently a language model can generate an answer. The developed tokenizer leads to a reduction in training costs compared to other multilingual tokenizers such as those of Llama 3 or Mistral. This is particularly valuable for European languages with longer word structures such as German, Finnish or Hungarian.
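The efficiency argument is often quantified as tokenizer “fertility”: the average number of tokens produced per word. A minimal sketch with two toy tokenizers — simplistic stand-ins, not the real Teuken, Llama 3, or Mistral tokenizers — shows how long compound words inflate the token count:

```python
# Toy illustration of tokenizer "fertility" (tokens per word): the fewer
# tokens a tokenizer emits for the same text, the cheaper generation is.
# Both tokenizers below are naive stand-ins for real subword tokenizers.

def whitespace_tokenize(text):
    """One token per whitespace-separated word (fertility is always 1.0)."""
    return text.split()

def chunk_tokenize(text, chunk=4):
    """Naive subword tokenizer: split each word into fixed-size chunks."""
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]

def fertility(tokens, text):
    """Average number of tokens per whitespace-separated word."""
    return len(tokens) / len(text.split())

# A long German compound word produces many subword chunks, so a tokenizer
# tuned for such languages can cut the token count substantially.
word = "Grundstücksverkehrsgenehmigungszuständigkeit"
print(fertility(chunk_tokenize(word), word))       # many tokens per word
print(fertility(whitespace_tokenize(word), word))  # exactly 1.0
```

A lower fertility on the same corpus translates directly into fewer forward passes during generation, which is the cost and energy saving the project measured.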

The bar chart shows the performance of Teuken-7B-instruct-research-v0.4 on the multilingual benchmarks ARC, HellaSwag and TruthfulQA in comparison to open source models of similar size. The bars indicate task performance averaged over 21 languages, along with the average model performance across the three benchmarks. On the selected benchmarks, Teuken-7B-instruct-research-v0.4 is ahead of all other models on average. On the individual benchmarks ARC and HellaSwag, Teuken ranks second behind Salamandra-7b-instruct, and on TruthfulQA second behind Mistral-7B-instruct-v0.3. Copyright Fraunhofer IAIS.

The diagram shows the additional computing power required to process a non-English text with a language model's tokenizer (in percent compared to Llama 3). The Teuken models require the least additional computing power and thus incur the lowest costs for multilingual tasks. Copyright Fraunhofer IAIS.

The OpenGPT-X project was funded by the BMWK program “Innovative and practical applications and data spaces in the Gaia-X digital ecosystem”. The Teuken-7B LLM is accessible via the Gaia-X infrastructure. Actors in the Gaia-X ecosystem can thus develop innovative language applications and transfer them into concrete application scenarios in their respective domains. Unlike existing cloud solutions, Gaia-X is a federated ecosystem that allows service providers and data owners to connect. Data remains securely with its owners and is only shared under defined conditions.

“A special feature of the Teuken-7B LLM is that it enables the secure use of sensitive corporate data, as the Gaia-X standards guarantee data storage and processing in accordance with the strictest European data protection and security regulations. This new model and innovations like this strengthen the digital sovereignty, competitiveness and resilience of Germany and of Europe. This is why the Federal Ministry for Economic Affairs and Climate Action is funding the project with approximately 14 million euros in total,” says Dr. Franziska Brantner, Parliamentary State Secretary at BMWK.

Professor Bernhard Grill, Director of Fraunhofer IIS, explains the potential for safety-critical applications: “With this independently developed language model, the project partners demonstrate their ability to generate their own large models. Access to a large language model enables applications that offer much greater control over this technology without the need for opaque third-party components — for example, in safety-critical fields such as automotive, robotics, medicine and finance. By training on data relevant to a specific application and using application-specific architectures, companies can create customized AI that does not require ‘black box’ components.”

Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing large amounts of data, leveraging powerful European HPC infrastructure and performing efficient model training. Teuken-7B was trained on the JUWELS supercomputer at Forschungszentrum Jülich. In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the consortium’s partners include TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert, Westdeutscher Rundfunk (WDR) and the German AI Association (KI Bundesverband). The technology developed in OpenGPT-X will also provide the partners with a basis for training their own models in the future.

“OpenGPT-X is an example of how the resources of a publicly funded project and the collaborative efforts of a broad consortium can deliver valuable foundational technology — from underlying infrastructure to model training to productive applications. In the interest of technology and data sovereignty, it is important to build on this foundation: Our hope is that OpenGPT-X will lay the groundwork for many subsequent activities,” emphasizes Daniel Abbou, Managing Director of the German AI Association and President of the European AI Forum.

The Teuken-7B LLM is freely available in two versions — one for research-only purposes and an “Apache 2.0” licensed version that can be used by companies for both research and commercial purposes and integrated into their own AI applications. The performance of the two models is roughly comparable, but some of the datasets used for instruction tuning preclude commercial use and were therefore not used in the Apache 2.0 version.

The research project, which was launched at the beginning of 2022, is now nearing completion. It will run until 31 March 2025 so that further optimizations and evaluations of the models can take place.

Download options and model cards can be found at: https://huggingface.co/openGPT-X

The OpenGPT-X Discord Server is available at: https://discord.gg/RvdHpGMvB3

Background information and benchmarks are available at https://opengpt-x.de/en/models/teuken-7b
