SwissBERT: Multilingual Language Model for Switzerland with 12 Billion Tokens

Jimmy W.

July 18, 2023

SwissBERT: A Multilingual Language Model for Switzerland

BERT (Bidirectional Encoder Representations from Transformers) has become one of the most widely used models for natural language processing tasks. It uses the Transformer's attention mechanism to learn contextual relations between words or subwords in a text, and it is trained with self-supervised learning, meaning it learns from unlabeled text.

Before BERT, language models typically read text in one direction during training, which works well for generating sentences by predicting the next word. BERT instead introduced bidirectional training: it predicts hidden words from the context on both sides, which gives it a fuller picture of how each word is used than one-directional models had.
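To make the contrast concrete, here is a minimal sketch of what each training style lets a model see when it predicts a single word. The helper names `left_to_right_context` and `masked_context` and the example sentence are illustrative assumptions, not BERT's actual code, and real models work on subword IDs rather than whole words.

```python
# Toy illustration: the context visible when predicting the token at `position`.
# All names here are hypothetical and chosen only for this example.
MASK = "[MASK]"

def left_to_right_context(tokens, position):
    """A one-directional model only sees the tokens before the target."""
    return tokens[:position]

def masked_context(tokens, position):
    """A BERT-style model sees both sides, with the target hidden behind a mask."""
    return tokens[:position] + [MASK] + tokens[position + 1:]

tokens = "the train to Zurich leaves at noon".split()
target = 3  # index of "Zurich"

print(left_to_right_context(tokens, target))
# ['the', 'train', 'to']
print(masked_context(tokens, target))
# ['the', 'train', 'to', '[MASK]', 'leaves', 'at', 'noon']
```

In the second case the words "leaves at noon" are also available as evidence, which is the extra context that bidirectional training provides.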

BERT was initially released for English, and BERT-inspired models such as CamemBERT (French) and GilBERTo (Italian) followed for other languages. Recently, a team of researchers from the University of Zurich created SwissBERT, a multilingual language model specifically designed for Switzerland. SwissBERT was trained on over 21 million Swiss news articles in Swiss Standard German, French, Italian, and Romansh Grischun, totaling 12 billion tokens.

SwissBERT addresses the challenges faced by Swiss researchers who need to perform multilingual tasks. Switzerland has four official languages: German, French, Italian, and Romansh. Combining separate monolingual models for these languages is difficult for multilingual tasks, and there is no separate neural language model for Romansh. SwissBERT addresses both problems by training on articles in all four languages together and building multilingual representations from the entities and events that the news in different languages has in common.

SwissBERT is an adaptation of the Cross-lingual Modular (X-MOD) transformer, which was pre-trained on 81 languages. The researchers trained custom language adapters to fit X-MOD to their corpus and also created a Switzerland-specific subword vocabulary, resulting in a model with 153 million parameters.
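The idea behind language adapters can be sketched in a few lines: most weights are shared across languages, and a small language-specific module is selected for each input. The sketch below is a simplified illustration under stated assumptions, not the X-MOD or SwissBERT implementation. The class names, the tiny sizes, the random weights, and the language labels (`de_CH`, `fr_CH`, `it_CH`, `rm_CH`) are all hypothetical, and a real transformer layer also contains attention and layer normalization.

```python
# Toy sketch of a modular layer: shared weights plus one adapter per language.
# Sizes, weights, and language labels are illustrative assumptions only.
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class Adapter:
    """Bottleneck adapter: project down, apply ReLU, project up, add residual."""
    def __init__(self, hidden, bottleneck, rng):
        self.down = random_matrix(bottleneck, hidden, rng)
        self.up = random_matrix(hidden, bottleneck, rng)

    def __call__(self, hidden_state):
        update = matvec(self.up, relu(matvec(self.down, hidden_state)))
        return [h + u for h, u in zip(hidden_state, update)]

class ModularLayer:
    """One shared transformation followed by a language-specific adapter."""
    def __init__(self, hidden, bottleneck, languages, seed=0):
        rng = random.Random(seed)
        self.shared = random_matrix(hidden, hidden, rng)  # used for every language
        self.adapters = {lang: Adapter(hidden, bottleneck, rng) for lang in languages}

    def __call__(self, hidden_state, lang):
        shared_out = matvec(self.shared, hidden_state)
        return self.adapters[lang](shared_out)  # route by language label

languages = ["de_CH", "fr_CH", "it_CH", "rm_CH"]
layer = ModularLayer(hidden=4, bottleneck=2, languages=languages)
x = [1.0, 0.5, -0.5, 2.0]
out_de = layer(x, "de_CH")
out_rm = layer(x, "rm_CH")
print(out_de)
print(out_rm)
```

Because only the adapters differ between languages, supporting a new language in this toy setup means adding one more small `Adapter` to the dictionary while the shared weights stay in place.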

The researchers evaluated SwissBERT on named entity recognition in contemporary news (SwissNER) and on stance detection in user-generated comments on Swiss politics. SwissBERT outperformed common baselines and improved over XLM-R on stance detection. On Romansh, SwissBERT performed significantly better than models not trained on the language, both in zero-shot cross-lingual transfer and in German-Romansh alignment.

SwissBERT has been released with examples for fine-tuning on downstream tasks. The model is promising for future research and non-commercial use, and with further adaptation its multilingualism could benefit a range of applications.

More details are available in the research paper, blog post, and model release. Credit for this research goes to the team of researchers from the University of Zurich.
