Mozilla has gone full steam ahead with “AI”. Calling them woke idiots would be an understatement, but I am still linking this here because it makes sense. Ticking the check-boxes doesn’t bring any relevance to the “community”; they have too much money in the bank from sweetheart deals, have effectively abandoned Firefox, and are now pursuing the next shiny object in the market.
In November 2023, our team at Mozilla.ai took part in one of the community efforts focused on the evaluation of large language models fine-tuned on specialized data: the “NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day”, led by PyTorch and Microsoft Research. We co-sponsored the challenge with AWS, Lambda Labs, and Microsoft. It brought together a number of teams competing in the craft of fine-tuning models. All the winners and results are here and the associated GitHub repositories are here.
Source: LLM evaluation with the NeurIPS Efficiency Challenge by @MozillaAI
- Evaluating large language models (LLMs) is an extremely complex task due to the variety of models, evaluation metrics, and frameworks; there is no consensus on the best approach.
- The NeurIPS Efficiency Challenge aimed to shed light on how people practically evaluate LLMs and make the process open. It focused on fine-tuning models within 24 hours on a single GPU.
- Many teams used parameter-efficient fine-tuning techniques like LoRA to reduce the number of parameters needing updates during fine-tuning. Data curation was also important.
- Offline evaluation uses held-out test sets to assess metrics like accuracy, while online evaluation examines real-world model usage. Both are needed but don’t always align.
- HELM is a popular evaluation framework that runs multiple tasks, but different implementations of metrics exist, adding complexity.
- Reproducing ML artifacts is challenging due to the range of Docker experience and non-standardized practices. Storage is also critical for long training jobs.
- Infrastructure like Kubernetes and Docker needs improvements to better support stateful ML applications with checkpointing and storage.
- Models are improving rapidly, requiring constant re-evaluation. Standardized tools and workflows are needed to consolidate evaluations.
- The competition highlighted challenges in modeling, evaluation frameworks, and infrastructure for running large-scale open evaluations.
- Establishing robust and transparent foundations is important for building trustworthy AI, which Mozilla.ai aims to support through their research.
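The LoRA point above is easy to make concrete with a parameter count. This is a minimal sketch of the idea, not the challenge teams’ actual code: instead of updating a full d_out × d_in weight matrix, LoRA trains two small low-rank factors A (d_out × r) and B (r × d_in), so the effective weight becomes W + A·B. The dimensions and rank below are illustrative assumptions.

```python
# LoRA parameter-count sketch: compare trainable parameters for a full
# fine-tune of one weight matrix vs. a rank-r low-rank adapter on it.

def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating the whole d_out x d_in matrix."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter: A (d_out x r) + B (r x d_in)."""
    return d_out * r + r * d_in

d_out = d_in = 4096   # a typical projection size in a large transformer (assumption)
r = 8                 # a commonly used small LoRA rank (assumption)

full = full_finetune_params(d_out, d_in)   # 16,777,216
lora = lora_params(d_out, d_in, r)         # 65,536
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # ratio: 256x
```

With these numbers the adapter trains 256× fewer parameters for this one matrix, which is why the technique fits the 1 GPU / 1 day constraint.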
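The offline-evaluation point above boils down to scoring model outputs against a held-out test set. Here is a toy illustration using exact-match accuracy; the predictions are hypothetical stand-ins, and real harnesses such as HELM run many metrics like this across many scenarios.

```python
# Offline evaluation sketch: exact-match accuracy on a held-out test set.

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    assert len(predictions) == len(references), "one prediction per reference"
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical model outputs vs. gold answers from a held-out set.
preds = ["Paris", "4", "blue"]
refs  = ["Paris", "5", "blue"]
print(exact_match_accuracy(preds, refs))  # prints 0.6666666666666666
```

Online evaluation, by contrast, watches how the model behaves on real user traffic, which is why the two don’t always agree.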
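The checkpointing point above can be sketched as a save/resume loop: a long training job periodically persists its step and state to durable storage so a preempted pod can pick up where it left off. This is a minimal illustration with a JSON file standing in for real checkpoint storage; the path, state shape, and “training update” are all assumptions for the demo.

```python
# Checkpoint/resume sketch for a stateful training job.
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "ckpt_demo.json")  # illustrative path
if os.path.exists(CKPT):
    os.remove(CKPT)  # start the demo from a clean slate

def save_checkpoint(step: int, state: dict) -> None:
    """Write step + state, using an atomic rename to avoid torn checkpoints."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)

def load_checkpoint() -> tuple[int, dict]:
    """Return the last saved (step, state), or a fresh start if none exists."""
    if not os.path.exists(CKPT):
        return 0, {"loss": None}
    with open(CKPT) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

step, state = load_checkpoint()          # resumes from the last saved step
for step in range(step, 5):
    state = {"loss": 1.0 / (step + 1)}   # stand-in for one training update
    save_checkpoint(step + 1, state)
print(load_checkpoint())  # prints (5, {'loss': 0.2})
```

The atomic-rename detail is the part Kubernetes-hosted jobs tend to get wrong: a container killed mid-write must never leave a half-written checkpoint as the only copy.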