Lightweight Language Identification Model (200 Languages)

Project Description: Efficient Multilingual Language Identification

Accurate language identification is a fundamental prerequisite for many Natural Language Processing (NLP) tasks, especially in multilingual environments. However, many existing models are computationally heavy or support only a limited number of languages. This project introduces a highly efficient, BERT-based transformer model designed to identify 200 languages while maintaining a footprint small enough for CPU-bound applications.

Objective

The primary goal was to develop and train a compact yet accurate language identification model capable of distinguishing between 200 different languages. Key objectives included optimizing the model for real-time performance on standard CPU hardware and ensuring ease of deployment through quantization and standard export formats like ONNX.

Key Features

  • Compact Architecture: Utilizes a BERT-based model with only 24.5 million parameters, significantly smaller than many standard BERT variants.
  • Broad Language Coverage: Trained to identify 200 distinct languages, covering a vast linguistic landscape.
  • Large-Scale Training Data: Trained on a substantial dataset of 121 million sentences, ensuring robustness and exposure to diverse linguistic patterns.
  • CPU Optimization: Designed to run efficiently on CPUs, making it suitable for environments without dedicated GPUs.
  • Real-Time Performance: Optimized for low-latency inference required in real-time applications.
  • Deployment Ready: Supports model quantization to further reduce size and computational cost, and can be easily exported to the ONNX format for cross-platform compatibility and optimized inference.
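
The deployment pipeline is not documented in detail here, so the following is only a minimal sketch of the two steps this feature refers to: dynamic int8 quantization with PyTorch and an ONNX export via torch.onnx.export. The file name, opset version, and dummy input are illustrative assumptions, and the two steps are shown independently (the ONNX export uses the unquantized model).

# Sketch only: model identifier from this document; file name, opset, and dummy input are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "alexneakameni/language_detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Dynamic quantization: store nn.Linear weights as int8 for a smaller footprint and faster CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# ONNX export of the base model: return plain tuples instead of ModelOutput objects to keep tracing simple.
model.config.return_dict = False
dummy = tokenizer("Hello world!", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "language_detection.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)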

Model Architecture

  • Base Architecture: BertForSequenceClassification
  • Hidden Size: 384
  • Number of Layers: 4
  • Attention Heads: 6
  • Max Sequence Length: 512
  • Dropout: 0.1
  • Vocabulary Size: 50,257
  • Total Parameters: 24.5 million
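
For illustration, a model with these dimensions can be instantiated directly through the Transformers configuration API. This is a sketch rather than the project's exact construction code; the feed-forward (intermediate) size is an assumption, since it is not listed above, so the resulting parameter count may differ slightly from the reported 24.5 million.

# Sketch: building a compact BERT classifier with the dimensions listed above.
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,          # 384 / 6 = 64-dimensional attention heads
    intermediate_size=1536,         # assumption: 4x hidden size (not stated in the spec above)
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,                 # one class per supported language
)
model = BertForSequenceClassification(config)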

Methodology / Implementation

  • Dataset: Trained on the hac541309/open-lid-dataset containing 121 million sentences across 200 languages, split into 90% training and 10% testing sets.
  • Tokenizer: Custom BertTokenizerFast with the special tokens [UNK], [CLS], [SEP], [PAD], and [MASK].
  • Training:
    • Hyperparameters: Learning rate of 2e-5, batch sizes of 256 (training) and 512 (testing), trained for 1 epoch with a cosine learning-rate schedule
    • Framework: Utilized Hugging Face Trainer API with Weights & Biases for experiment tracking and logging
  • Data Augmentation: Implemented custom text augmentation strategies to improve model robustness (see the sketch after this list):
    • Removing digits with random probability
    • Shuffling words to introduce variation
    • Selectively removing words
    • Adding random digits to simulate noise
    • Modifying punctuation to handle different text formats
  • Optimization: Post-training quantization techniques (e.g., dynamic or static quantization) were applied to reduce model size and accelerate inference speed.
  • Export: The optimized model was converted to the ONNX format, allowing it to be run using various ONNX-compatible runtimes across different platforms and programming languages.
  • Interface: A Gradio interface was developed for easy demonstration, testing, and interaction with the model.
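
The exact augmentation code is not reproduced in this document. The sketch below only illustrates the five listed strategies with simple, hypothetical helpers; the probabilities and implementation details are assumptions.

# Hypothetical sketch of the five augmentation strategies listed above; probabilities are illustrative.
import random
import string

def augment(text: str, p: float = 0.1) -> str:
    words = text.split()

    # Selectively remove words (keep the original if everything would be dropped).
    words = [w for w in words if random.random() > p] or words

    # Shuffle words to introduce variation.
    if random.random() < p:
        random.shuffle(words)

    text = " ".join(words)

    # Remove digits with random probability.
    if random.random() < p:
        text = "".join(ch for ch in text if not ch.isdigit())

    # Add random digits to simulate noise.
    if random.random() < p:
        text += " " + "".join(random.choices(string.digits, k=3))

    # Modify punctuation (here: strip it) to handle different text formats.
    if random.random() < p:
        text = text.translate(str.maketrans("", "", string.punctuation))

    return text

print(augment("Hello, world! 42 is the answer."))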

Evaluation Results

  • Overall Performance Metrics:
    • Accuracy: 0.9733
    • Precision: 0.9735
    • Recall: 0.9733
    • F1 Score: 0.9733
  • Script-Level Performance (selected examples from evaluation on ~12 million texts):
    • Latin (125 languages): 97.32% precision, 97.22% recall
    • Cyrillic (12 languages): 99.13% precision, 98.97% recall
    • Arabic (21 languages): 90.82% precision, 91.34% recall
    • Devanagari (10 languages): 95.04% precision, 94.28% recall
    • Bengali (3 languages): 99.50% precision, 99.29% recall
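
The averaging scheme behind the aggregate precision, recall, and F1 values is not stated; the near-identical numbers are consistent with weighted averaging over classes. The sketch below shows how such aggregate metrics are commonly computed with scikit-learn, which is an assumed library choice rather than part of the project's listed stack, using toy labels.

# Sketch: aggregate accuracy/precision/recall/F1 with weighted averaging (toy data, assumed library).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["eng_Latn", "fra_Latn", "deu_Latn", "eng_Latn"]   # toy labels, not project data
y_pred = ["eng_Latn", "fra_Latn", "eng_Latn", "eng_Latn"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")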

Tools and Technologies

  • Core Concepts: Natural Language Processing (NLP), Language Identification, Text Classification, Model Optimization, Transformer Architecture
  • Models: BERT-based Transformer
  • Frameworks/Libraries: PyTorch, Hugging Face Transformers, ONNX / ONNX Runtime, Weights & Biases
  • Optimization: Quantization libraries (e.g., PyTorch’s quantization tools)
  • Interface: Gradio
  • Programming Language: Python

Usage Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and classification model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

# Wrap them in a text-classification pipeline for convenient inference
language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)  # e.g. a list containing the top predicted language label and its score
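
For CPU deployment, the ONNX export described in the Methodology section can be served with ONNX Runtime. The sketch below assumes the file name and the input/output names used in the earlier export sketch; it is not taken from the project code.

# Hypothetical ONNX Runtime inference, assuming an export named "language_detection.onnx"
# with inputs "input_ids"/"attention_mask" and an output named "logits".
import numpy as np
import onnxruntime as ort
from transformers import AutoConfig, AutoTokenizer

model_id = "alexneakameni/language_detection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

session = ort.InferenceSession("language_detection.onnx", providers=["CPUExecutionProvider"])

inputs = tokenizer("Hello world!", return_tensors="np")
(logits,) = session.run(
    ["logits"],
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)
predicted_id = int(np.argmax(logits, axis=-1)[0])
print(config.id2label.get(predicted_id, predicted_id))  # map the class index to a language label if available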

Applications

  • Content Filtering/Moderation: Identifying the language of user-generated content for applying appropriate policies.
  • Customer Support Routing: Automatically directing incoming user queries to support agents proficient in the detected language.
  • Multilingual Data Preprocessing: Tagging text data with its language before further NLP analysis (e.g., translation, sentiment analysis).
  • Search Engine Enhancement: Improving search results by considering the language of the query and documents.
  • Web Crawling: Identifying the language of web pages during large-scale crawls.

This project delivers a practical and efficient solution for large-scale language identification, balancing broad language coverage with the computational constraints often present in real-world deployment scenarios. The model is publicly available on Hugging Face Hub under the identifier “alexneakameni/language_detection”.