Evaluating LLMs

Mozilla has gone full steam ahead with “AI”. They are woke idiots, and that’s an understatement. Still, I am linking this here because it makes sense. Ticking the check-boxes doesn’t by itself bring any relevance to the “community”; they have too much money in the bank from sweetheart deals, have all but abandoned Firefox, and are now pursuing the next shiny object in the market.

In November 2023, our team at Mozilla.ai took part in a community effort focused on the evaluation of large language models fine-tuned on specialized data: the “NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day”, led by PyTorch and Microsoft Research. We co-sponsored the challenge with AWS, Lambda Labs, and Microsoft. It brought together a number of teams competing in the craft of fine-tuning models. All the winners and results are here, and the associated GitHub repositories are here.

Source: LLM evaluation with the NeurIPS Efficiency Challenge by @MozillaAI

  • Evaluating large language models (LLMs) is an extremely complex task due to the variety of models, evaluation metrics and frameworks. There is no consensus on the best approach.
  • The NeurIPS Efficiency Challenge aimed to shed light on how people practically evaluate LLMs and make the process open. It focused on fine-tuning models within 24 hours on a single GPU.
  • Many teams used parameter-efficient fine-tuning techniques like LoRA to reduce the number of parameters needing updates during fine-tuning. Data curation was also important. (A minimal LoRA sketch follows this list.)
  • Offline evaluation uses held-out test sets to assess metrics like accuracy, while online evaluation examines real-world model usage. Both are needed but don’t always align. (A toy offline-accuracy sketch follows this list.)
  • HELM is a popular evaluation framework that runs multiple tasks, but different implementations of metrics exist, adding complexity.
  • Reproducing ML artifacts is challenging due to the range of Docker experience and non-standardized practices. Storage is also critical for long training jobs.
  • Infrastructure like Kubernetes and Docker needs improvements to better support stateful ML applications with checkpointing and storage. (A checkpoint save/resume sketch follows this list.)
  • Models are improving rapidly, requiring constant re-evaluation. Standardized tools and workflows are needed to consolidate evaluations.
  • The competition highlighted challenges in modeling, evaluation frameworks, and infrastructure for running large-scale open evaluations.
  • Establishing robust and transparent foundations is important for building trustworthy AI, which Mozilla.ai aims to support through their research.
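
As a rough illustration of the LoRA approach mentioned above, here is a minimal parameter-efficient fine-tuning sketch using the Hugging Face `transformers` and `peft` libraries. The base model name, target modules, and hyperparameters are illustrative assumptions, not the settings used by any competition team.

```python
# Minimal LoRA sketch: only small low-rank adapter matrices are trained,
# so a tiny fraction of the model's parameters is updated during fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "mistralai/Mistral-7B-v0.1"  # hypothetical base model choice
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update (assumed)
    lora_alpha=16,                        # scaling factor for the update (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```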
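
And a toy sketch of offline evaluation: score a model’s answers against a held-out test set and report exact-match accuracy. The `generate_answer` callable and the test data are placeholders; real harnesses such as HELM handle prompting, answer normalization, and many more metrics.

```python
# Toy offline evaluation: exact-match accuracy over a held-out test set.
from typing import Callable

def offline_accuracy(test_set: list[dict], generate_answer: Callable[[str], str]) -> float:
    """Fraction of held-out examples the model answers exactly right."""
    correct = 0
    for example in test_set:
        prediction = generate_answer(example["question"]).strip().lower()
        if prediction == example["answer"].strip().lower():
            correct += 1
    return correct / len(test_set)

# Hypothetical usage with a stubbed "model":
test_set = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "paris"},
]
print(offline_accuracy(test_set, generate_answer=lambda q: "4"))  # prints 0.5
```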
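
Finally, a sketch of the periodic checkpointing that long training jobs depend on: state is written to persistent storage so a job can resume after a crash or pod eviction. The checkpoint path (e.g. a mounted persistent volume) and the save interval are assumptions, not part of the competition setup.

```python
# Sketch of checkpoint save/resume for a long fine-tuning job.
import os
import torch

CKPT_PATH = "/mnt/persistent/checkpoint.pt"  # hypothetical persistent-volume mount
SAVE_EVERY = 500                             # steps between checkpoints (assumed)

def save_checkpoint(model, optimizer, step):
    """Write model and optimizer state plus the current step to persistent storage."""
    torch.save(
        {"step": step,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if one exists; otherwise start at step 0."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]

# Inside the training loop (sketch):
# start_step = load_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     ... training step ...
#     if step % SAVE_EVERY == 0:
#         save_checkpoint(model, optimizer, step)
```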
