Recent advancements in multilingual natural language processing focus on enhancing efficiency and adaptability across diverse languages and domains. New models like Onomas-CNN X demonstrate that convolutional networks can outperform traditional transformers on specific tasks, achieving high accuracy while drastically reducing processing time and energy consumption. Meanwhile, the MrBERT architecture showcases how targeted adaptations can optimize performance in specialized fields such as the biomedical and legal sectors, addressing the need for localized linguistic capabilities. Additionally, the introduction of BIRDTurk highlights the challenges of applying text-to-SQL systems to morphologically rich languages, revealing performance gaps that stem from structural differences and limited training data. Research on cross-lingual classification of social media data emphasizes the importance of filtering techniques for managing noise in multilingual datasets, while studies on euphemism transfer underscore the complexities of cultural context in language processing. Collectively, these efforts aim to bridge the gap between research and practical applications, addressing commercial needs in global communication and data analysis.
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves sta...
We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable op...
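The abstract above names two building blocks, parallel convolution branches and depthwise-separable operations, without spelling them out. The following NumPy sketch shows what those pieces mean in general: a depthwise step (one filter per channel) followed by a pointwise 1x1 mixing step, run in parallel branches with different kernel widths and max-pooled into one feature vector. All function names, shapes, and kernel widths here are illustrative assumptions; the actual Onomas-CNN X architecture is not described in the excerpt.

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_kernels, point_weights):
    """Depthwise-separable 1D convolution.

    x:             (channels, length) input, e.g. character embeddings of a name
    depth_kernels: (channels, k) one filter per channel (depthwise step)
    point_weights: (out_channels, channels) 1x1 channel mixing (pointwise step)
    Returns: (out_channels, length - k + 1)
    """
    c, length = x.shape
    k = depth_kernels.shape[1]
    out_len = length - k + 1
    # Depthwise step: convolve each channel with its own kernel only.
    depth_out = np.empty((c, out_len))
    for ch in range(c):
        for t in range(out_len):
            depth_out[ch, t] = np.dot(x[ch, t:t + k], depth_kernels[ch])
    # Pointwise step: a 1x1 convolution that mixes channels at each position.
    return point_weights @ depth_out

def parallel_branches(x, branch_params):
    """Run several depthwise-separable branches (different kernel widths)
    in parallel, global-max-pool each, and concatenate the results."""
    feats = []
    for depth_kernels, point_weights in branch_params:
        out = depthwise_separable_conv1d(x, depth_kernels, point_weights)
        feats.append(out.max(axis=1))  # global max pool over positions
    return np.concatenate(feats)
```

Compared with a standard convolution, the factored depthwise + pointwise form needs roughly `c*k + out_c*c` weights per branch instead of `out_c*c*k`, which is the usual efficiency argument for this design.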
Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource langua...
The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we in...
Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the fi...
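To make the morphological-richness claim concrete: agglutinative suffixes mean a question word rarely matches a column name exactly, which breaks the naive schema linking many text-to-SQL pipelines start from. The toy function below contrasts exact matching with a crude prefix fallback on Turkish-style inflected forms. This is purely illustrative, not real Turkish morphology and not the BIRDTurk evaluation pipeline, which the excerpt does not describe.

```python
def link_schema_terms(question_tokens, columns):
    """Naive schema linking: map question tokens to column names.

    Exact match fails on inflected forms (e.g. "şehirde", "in the city",
    vs. column "şehir", "city"), so a crude prefix match is used as a
    fallback. A real system would need proper lemmatization.
    """
    links = {}
    for tok in question_tokens:
        for col in columns:
            if tok == col or (len(col) >= 4 and tok.startswith(col)):
                links[tok] = col
    return links

# Inflected question tokens vs. bare column names:
columns = ["şehir", "nüfus"]                  # city, population
tokens = ["şehirde", "nüfusu", "kaç"]         # in-the-city, its-population, how-many
exact_only = {t for t in tokens if t in columns}   # empty: no exact hits
linked = link_schema_terms(tokens, columns)
```

The gap between `exact_only` (empty) and `linked` is a miniature version of the structural mismatch the abstract attributes to morphologically rich languages.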
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investiga...
Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by ex...
In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the language mixture ratios. Multilin...
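The abstract defines language mixture ratios as the proportion of each language in the pretraining data. Operationally, one common way such ratios enter training is as sampling weights when batches are drawn, as in the minimal sketch below. The function name and corpus format are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

def sample_batch(corpora, ratios, batch_size, rng):
    """Draw one pretraining batch under given language mixture ratios.

    corpora: dict lang -> list of training examples
    ratios:  dict lang -> mixture ratio (weights; need not sum to 1)
    Each example's language is chosen with probability proportional to
    its ratio, then an example is drawn uniformly from that corpus.
    """
    langs = list(corpora)
    weights = [ratios[lang] for lang in langs]
    chosen = rng.choices(langs, weights=weights, k=batch_size)
    return [(lang, rng.choice(corpora[lang])) for lang in chosen]
```

Under this view, choosing mixture ratios is exactly choosing the sampling distribution over languages, which is why the per-language test losses the abstract discusses depend so directly on them.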
Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, w...