Large Language Models (LLMs) have advanced remarkably in the last few years. Two primary drivers of this progress are the exponential growth of data on the internet and ongoing improvements in pre-training methods. Prominent models such as GPT, Gemini, and Llama have raised the bar in a number of areas, including logical reasoning, coding, and creative writing.
The quality and volume of the datasets on which these models are trained significantly affect their effectiveness. Because so much of the content available online is in English, English has become the dominant language for training LLMs. This reliance on English data has made it hard to achieve comparable performance in other languages. The curse of multilingualism refers to the tendency of models trained mostly on English data to underperform in non-English languages as a result of insufficient exposure during pre-training.
To address this, researchers from Sea AI Lab, Singapore, and SUTD, Singapore, recently presented the Sailor project, a family of open language models built specifically for Southeast Asian (SEA) languages. The models range from 0.5B to 7B parameters and are designed to accommodate the region's linguistic diversity. They are built on Qwen1.5, a flexible base model well suited to multilingual applications.
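For readers who want to try the models, the checkpoints can be loaded through the Hugging Face transformers library. The sketch below assumes the checkpoints are published under the project's `sail` organization (for example, `sail/Sailor-7B`); check the project page for the exact repository names of the 0.5B–7B variants.

```python
# Minimal sketch: load a Sailor checkpoint for inference with Hugging Face
# transformers. The repo id "sail/Sailor-7B" is an assumption based on the
# project's Hugging Face organization; substitute the checkpoint you need.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor-7B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Indonesian prompt: "The capital of Indonesia is"
inputs = tokenizer("Ibu kota Indonesia adalah", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```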
Starting from Qwen1.5, the Sailor models are continually pre-trained on a large corpus of 200B to 400B tokens. The corpus is dominated by languages important to the Southeast Asian region: English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. On top of this data, the training procedure applies several strategies designed to improve model performance.
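Conceptually, continual pre-training here means resuming the causal language modeling objective from a Qwen1.5 checkpoint on the new multilingual corpus. Below is a minimal sketch with Hugging Face transformers; the corpus file and hyperparameters are illustrative placeholders, not the paper's actual settings.

```python
# Sketch of continual pre-training: resume from a Qwen1.5 checkpoint and keep
# training with the causal-LM objective on a mixed SEA-language corpus.
# "sea_corpus.jsonl" and the hyperparameters below are illustrative only.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen1.5-0.5B"  # larger variants follow the same pattern
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

raw = load_dataset("json", data_files="sea_corpus.jsonl", split="train")
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sailor-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=64,
                           learning_rate=1e-5, bf16=True, max_steps=10_000),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```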
One such method is BPE (Byte Pair Encoding) dropout, used to increase the models' robustness. By randomly skipping some merge operations during tokenization, BPE dropout exposes the model to multiple segmentations of the same text, which helps mitigate overfitting and improves the model's capacity to generalize across varied language patterns and contexts.
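The snippet below illustrates the idea with the Hugging Face `tokenizers` library, whose BPE model exposes a `dropout` parameter; the toy corpus and dropout rate are illustrative.

```python
# BPE dropout illustration: each learned merge is skipped with probability
# `dropout` at encode time, so the same string can be split into different
# subword sequences across calls.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))  # illustrative rate
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000)
tokenizer.train_from_iterator(
    ["selamat pagi dunia", "selamat malam semua"], trainer
)

# Repeated encodings of the same text can yield different segmentations.
for _ in range(3):
    print(tokenizer.encode("selamat pagi").tokens)
```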
The training pipeline also incorporates rigorous deduplication and data-cleaning steps. These are essential for guaranteeing the quality of the training set, which in turn lifts the Sailor models' overall performance: by eliminating redundant data and noise, the models become more precise and reliable in their predictions.
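As a toy illustration of the idea (the actual pipeline is more sophisticated, e.g., near-duplicate detection at scale), here is a minimal sketch of exact deduplication combined with simple noise filters; the thresholds are illustrative.

```python
# Toy cleaning pass: drop short or mostly non-alphabetic documents, then
# remove exact duplicates via a content hash.
import hashlib

def clean_and_dedup(docs):
    seen = set()
    for text in docs:
        text = text.strip()
        # Heuristic filters: very short docs and mostly non-alphabetic noise.
        if len(text) < 50:
            continue
        alpha_ratio = sum(c.isalpha() for c in text) / len(text)
        if alpha_ratio < 0.5:
            continue
        # Exact deduplication via SHA-256 of the document body.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```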
The team has also shared that the training data mixture was optimized using tiny proxy models. This allows hyperparameters such as the data mixture ratio to be tuned cheaply before the full-scale run, making the training process more efficient and, in turn, improving model performance.
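The sketch below shows the shape of this approach: score a handful of candidate mixture ratios with a cheap proxy and keep the best one. The `train_proxy_and_eval` function is a hypothetical stand-in for actually training a tiny model, stubbed here so the selection logic runs end to end.

```python
# Data-mixture tuning with small proxy models: evaluate each candidate
# mixture with a cheap proxy and pick the one with the lowest validation loss.

CANDIDATE_MIXTURES = [
    # Fractions of training tokens drawn from each language bucket.
    {"en": 0.4, "zh": 0.2, "vi": 0.1, "th": 0.1, "id": 0.1, "ms": 0.05, "lo": 0.05},
    {"en": 0.3, "zh": 0.1, "vi": 0.15, "th": 0.15, "id": 0.15, "ms": 0.1, "lo": 0.05},
    {"en": 0.2, "zh": 0.1, "vi": 0.2, "th": 0.2, "id": 0.2, "ms": 0.05, "lo": 0.05},
]

def train_proxy_and_eval(mixture):
    """Hypothetical: train a tiny proxy model on `mixture`, return val loss.
    Stubbed with a made-up score favoring SEA-language data, for illustration."""
    sea_share = sum(v for k, v in mixture.items() if k not in ("en", "zh"))
    return 3.0 - sea_share  # placeholder, not a real loss

best = min(CANDIDATE_MIXTURES, key=train_proxy_and_eval)
print("selected mixture:", best)
```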
Experiments on a range of tasks, including examination-style benchmarks, question answering, reading comprehension, and commonsense reasoning, show that Sailor models are robust and effective across diverse benchmarks. These findings highlight the potential of Sailor models to address language challenges across a broad spectrum of the SEA region.
In conclusion, the research presents a thorough methodology for building LLMs that perform well across the SEA region's variety of languages, addressing challenges such as multilingualism and data quality while applying well-chosen techniques to improve model robustness and performance.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.