Lightweight Language Identification Model (200 Languages)
Project Description: Efficient Multilingual Language Identification
Accurate language identification is a fundamental prerequisite for many Natural Language Processing (NLP) tasks, especially in multilingual environments. However, many existing models are computationally heavy or support a limited number of languages. This project introduces a highly efficient, BERT-based transformer model specifically designed for identifying a wide range of languages (200) while maintaining a small footprint suitable for CPU-bound applications.
Objective
The primary goal was to develop and train a compact yet accurate language identification model capable of distinguishing between 200 different languages. Key objectives included optimizing the model for real-time performance on standard CPU hardware and ensuring ease of deployment through quantization and standard export formats like ONNX.
Key Features
- Compact Architecture: Utilizes a BERT-based model with only 24.5 million parameters, significantly smaller than many standard BERT variants.
- Broad Language Coverage: Trained to identify 200 distinct languages, covering a vast linguistic landscape.
- Large-Scale Training Data: Trained on a substantial dataset of 121 million sentences, ensuring robustness and exposure to diverse linguistic patterns.
- CPU Optimization: Designed to run efficiently on CPUs, making it suitable for environments without dedicated GPUs.
- Real-Time Performance: Optimized for low-latency inference required in real-time applications.
- Deployment Ready: Supports model quantization to further reduce size and computational cost, and can be easily exported to the ONNX format for cross-platform compatibility and optimized inference.
Model Architecture
- Base Architecture: BertForSequenceClassification
- Hidden Size: 384
- Number of Layers: 4
- Attention Heads: 6
- Max Sequence Length: 512
- Dropout: 0.1
- Vocabulary Size: 50,257
- Total Parameters: 24.5 million
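As a concrete illustration, this configuration can be reconstructed with Hugging Face's BertConfig. The following is a minimal sketch, not the project's actual training code; the intermediate (feed-forward) size is not stated above and is assumed here to be the conventional 4x hidden size, so the exact parameter count may differ slightly from the reported figure.

from transformers import BertConfig, BertForSequenceClassification

# Rebuild the reported architecture from the numbers above.
config = BertConfig(
    vocab_size=50257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    intermediate_size=1536,          # assumption: 4 * hidden_size (not documented)
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,                  # one label per supported language
)
model = BertForSequenceClassification(config)
# Compare with the reported 24.5M; the exact count depends on unstated
# details such as the intermediate size.
print(f"{model.num_parameters() / 1e6:.1f}M parameters")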
Methodology / Implementation
- Dataset: Trained on the hac541309/open-lid-dataset containing 121 million sentences across 200 languages, split into 90% training and 10% testing sets.
- Tokenizer: Custom BertTokenizerFast with the special tokens [UNK], [CLS], [SEP], [PAD], and [MASK].
- Training (a condensed sketch follows this list):
  - Hyperparameters: learning rate of 2e-5, batch sizes of 256 (training) and 512 (evaluation), trained for 1 epoch with a cosine learning-rate scheduler
  - Framework: Hugging Face Trainer API, with Weights & Biases for experiment tracking and logging
- Data Augmentation: Implemented custom text augmentation strategies to improve model robustness (an illustrative sketch follows this list):
  - Removing digits with random probability
  - Shuffling words to introduce variation
  - Selectively removing words
  - Adding random digits to simulate noise
  - Modifying punctuation to handle different text formats
- Optimization: Post-training quantization techniques (e.g., dynamic or static quantization) were applied to reduce model size and accelerate inference.
- Export: The optimized model was converted to the ONNX format, allowing it to run on ONNX-compatible runtimes across different platforms and programming languages (a quantization and export sketch follows this list).
- Interface: A Gradio interface was developed for easy demonstration, testing, and interaction with the model (a minimal sketch follows this list).
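The dataset and training setup described above can be wired together with the Hugging Face stack roughly as follows. This is a condensed sketch, not the original training script: it reuses the model from the architecture sketch above, the "text" column name and the single "train" split are assumptions, and the reported batch sizes are treated as per-device values.

from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments

# `model` as constructed in the architecture sketch above.
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")

# Reproduce the 90/10 train/test split described above.
splits = load_dataset("hac541309/open-lid-dataset", split="train").train_test_split(test_size=0.1)

def tokenize(batch):
    # "text" is the assumed name of the sentence column.
    return tokenizer(batch["text"], truncation=True, max_length=512)

splits = splits.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="lid-bert",
    learning_rate=2e-5,
    per_device_train_batch_size=256,   # reported batch sizes, assumed per-device
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",                 # Weights & Biases tracking, as described
)
Trainer(model=model, args=args,
        train_dataset=splits["train"], eval_dataset=splits["test"]).train()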
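The augmentation strategies could be implemented along these lines. This is an illustrative re-implementation only; the probabilities and exact transformations of the original pipeline are not documented.

import random
import re
import string

def augment(text: str, p: float = 0.1) -> str:
    """Apply the noise strategies listed above, each with probability p (assumed)."""
    words = text.split()
    if random.random() < p:                       # remove digits
        words = [re.sub(r"\d+", "", w) for w in words]
    if random.random() < p:                       # shuffle word order
        random.shuffle(words)
    words = [w for w in words if random.random() > p * 0.5]  # selectively drop words
    if random.random() < p:                       # add a random digit as noise
        words.insert(random.randrange(len(words) + 1), str(random.randint(0, 9)))
    if random.random() < p:                       # one simple form of punctuation modification
        words = [w.strip(string.punctuation) for w in words]
    return " ".join(words)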
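For the optimization and export steps, a typical PyTorch workflow looks like the sketch below, reusing `model` and `tokenizer` from the earlier sketches. Dynamic quantization is shown as one of the options mentioned above; the ONNX export settings (opset version, axis names) are illustrative assumptions rather than the project's recorded configuration.

import torch

# Post-training dynamic quantization: int8 weights for the linear layers,
# float activations, reducing size and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Export the float model to ONNX with a dummy batch; dynamic axes keep the
# batch size and sequence length flexible at inference time.
model.eval()
dummy = tokenizer("Hello world!", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "language_detection.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"},
                  "logits": {0: "batch"}},
    opset_version=14,
)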
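A minimal Gradio demo in the spirit of the interface described above might look as follows; the hosted interface may differ in layout and options.

import gradio as gr
from transformers import pipeline

# Load the published model directly from the Hub.
lid = pipeline("text-classification", model="alexneakameni/language_detection")

def detect(text):
    # Return the top predicted languages as a label -> confidence mapping.
    return {p["label"]: p["score"] for p in lid(text, top_k=5)}

gr.Interface(fn=detect, inputs=gr.Textbox(lines=3), outputs=gr.Label()).launch()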
Evaluation Results
- Overall Performance Metrics:
- Accuracy: 0.9733
- Precision: 0.9735
- Recall: 0.9733
- F1 Score: 0.9733
- Script-Level Performance (selected examples from the evaluation on ~12 million test texts):

| Script | Languages | Precision | Recall |
|---|---|---|---|
| Latin | 125 | 97.32% | 97.22% |
| Cyrillic | 12 | 99.13% | 98.97% |
| Arabic | 21 | 90.82% | 91.34% |
| Devanagari | 10 | 95.04% | 94.28% |
| Bengali | 3 | 99.50% | 99.29% |
Tools and Technologies
- Core Concepts: Natural Language Processing (NLP), Language Identification, Text Classification, Model Optimization, Transformer Architecture
- Models: BERT-based Transformer
- Frameworks/Libraries: PyTorch, Hugging Face Transformers, ONNX / ONNX Runtime, Weights & Biases
- Optimization: Quantization libraries (e.g., PyTorch’s quantization tools)
- Interface: Gradio
- Programming Language: Python
Usage Example
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the published model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

# Wrap them in a text-classification pipeline for one-line inference
language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)
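The pipeline returns a list of label/score dictionaries. Given the training dataset, the labels are expected to be FLORES-200-style language codes (e.g., eng_Latn for English in Latin script), though the exact label format is defined by the model card.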
Applications
- Content Filtering/Moderation: Identifying the language of user-generated content for applying appropriate policies.
- Customer Support Routing: Automatically directing incoming user queries to support agents proficient in the detected language.
- Multilingual Data Preprocessing: Tagging text data with its language before further NLP analysis (e.g., translation, sentiment analysis).
- Search Engine Enhancement: Improving search results by considering the language of the query and documents.
- Web Crawling: Identifying the language of web pages during large-scale crawls.
This project delivers a practical and efficient solution for large-scale language identification, balancing broad language coverage with the computational constraints often present in real-world deployment scenarios. The model is publicly available on Hugging Face Hub under the identifier “alexneakameni/language_detection”.