But Meta’s model is available only upon request, and its license restricts use to research purposes. Hugging Face goes a step further: the meetings detailing the group’s work over the past year are recorded and posted online, and anyone can download the model for free and use it for research or to build commercial applications.
A major focus for BigScience was building ethical considerations into the model from its inception, rather than treating them as an afterthought. LLMs are trained on vast amounts of data scraped from the internet, which is problematic: these datasets contain a great deal of personal information and often reflect dangerous biases. To address this, the group developed data governance structures specifically for LLMs that should make it clearer what data is being used and to whom it belongs, and it collected datasets from around the world that were not readily available online.
The group is also launching a new Responsible AI License, which is something like a terms-of-service agreement. It is designed to deter the use of BLOOM in high-risk sectors such as law enforcement or healthcare, and to deter using it to harm, deceive, exploit, or impersonate people. The license is an experiment in LLM self-regulation before the laws catch up, says Danish Contractor, an AI researcher who volunteered on the project and co-created the license. In the end, though, nothing stops anyone from abusing BLOOM.
From the beginning, the project had its own ethical guidelines that served as guiding principles for the model’s development, says Giada Pistilli, Hugging Face’s ethicist, who drafted BLOOM’s ethical charter. These included recruiting volunteers from diverse backgrounds and locations, ensuring that outsiders could easily reproduce the project’s findings, and publishing its results in the open.
This philosophy translates into one major difference between BLOOM and most other LLMs available today: the sheer number of human languages the model can understand. It handles 46 of them, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (such as Hindi), and 20 African languages. Just over 30% of its training data was in English. The model also understands 13 programming languages.
This is highly unusual in the world of large language models, where English dominates. That dominance is yet another consequence of the fact that LLMs are built by scraping data from the internet: English is the language most used online.
BLOOM was able to improve on this because the team rallied volunteers from around the world to build suitable datasets in other languages, even ones not well represented online. For example, Hugging Face organized workshops with African AI researchers to find datasets, such as records from local governments or universities, that could be used to train the model on African languages, says Chris Emezue, a Hugging Face intern and a researcher at Masakhane, an organization working on natural-language processing for African languages.