Language documentation researchers face a fundamental challenge: when working with endangered or low-resource languages, every annotation decision matters. Data is scarce, speaker communities are small, and manual annotation is expensive. The question isn't just "how much data do we need?" but "what data should we include to make the most of limited resources?"
Our new paper at EMNLP 2025 tackles this question head-on. Led by Dr. Zoey Liu, our team (including Dr. Masoud Jasbi, Dr. Kenji Sagae, Dr. Emily Prud'hommeaux, and myself) investigated three approaches to building part-of-speech tagging datasets across 60 languages from 12 language families.
Resource availability exists on a continuum. English enjoys massive datasets, crowd-sourced annotations, and commercial incentive for NLP development. At the other end, Indigenous and endangered languages face severe constraints: extremely small speaker populations, limited data sources (often restricted to specific domains like religious texts or grammar books), and challenges in obtaining manual annotations.
These limitations complicate dataset creation. Where researchers working on English can afford to be less deliberate about data selection, researchers working on endangered languages must allocate resources with surgical precision. The stakes are high: inefficient data collection can waste precious community time and financial resources.
Many might consider POS tagging "a solved problem" in mainstream NLP. Yet it remains crucial for language documentation and descriptive linguistics. Theoretical linguists rely on lexical categories to characterize typological profiles. Cognitive scientists use POS tag distributions to study language learning and code-switching. Documentary linguists and Indigenous community members need POS tags for creating pedagogical materials.
The Universal Dependencies (UD) project makes POS tagging an ideal test case. UD provides consistent annotations across many languages, enabling controlled cross-linguistic comparison while addressing real-world needs in language documentation.
We compared three strategies for selecting training data: prompting a large language model with a small set of annotated examples, Active Learning, and random sampling.
For communities where sharing language data via API is ethically acceptable, large language models offer a promising path. We tested GPT-4.1-mini using only 1,000 randomly sampled tokens as prompt examples—roughly what an annotator could label in a two-hour session.
The off-the-shelf GPT-4.1-mini achieved F1 scores above 0.83 in 58 of 60 languages, with most exceeding 0.90. French reached 0.97, English 0.93, and Hindi 0.90. The cost per language? About $4 in API calls.
This suggests that for languages where data sharing is acceptable, LLMs can deliver strong first-pass tagging with minimal cost—far less than annotating the thousands of tokens that would be needed to train a traditional model to comparable performance.
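To make the setup concrete, here is a minimal sketch of this kind of few-shot prompting with the OpenAI Python client. The prompt wording, example sentences, and helper names are illustrative assumptions, not the exact prompt used in the paper.

```python
# Minimal sketch of few-shot POS tagging with an LLM API.
# The prompt format and examples are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A handful of annotated sentences drawn from the small example pool.
FEW_SHOT_EXAMPLES = """\
Sentence: The dog barked loudly .
Tags: DET NOUN VERB ADV PUNCT
Sentence: She reads books .
Tags: PRON VERB NOUN PUNCT"""

def tag_sentence(sentence: str) -> str:
    """Ask the model to emit one UPOS tag per token, separated by spaces."""
    prompt = (
        "You are a part-of-speech tagger. Use the Universal Dependencies "
        "UPOS tag set. Return one tag per token, separated by spaces.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n\nSentence: {sentence}\nTags:"
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(tag_sentence("The river runs north ."))
```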
Many Indigenous communities cannot ethically share data externally, making LLM APIs incompatible with their values. To evaluate these scenarios, we explored Active Learning (AL), where a model iteratively selects the most informative examples for annotation.
We used uncertainty sampling with Conditional Random Fields (CRFs), selecting the sentences where the model was least confident. After each round of simulated annotation, we retrained the model and selected new examples.
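A stripped-down version of one such round might look like the sketch below, using the sklearn-crfsuite library. The feature set, batch size, and function names are illustrative assumptions; the paper's exact configuration may differ.

```python
# Sketch of one uncertainty-sampling round for POS tagging with a CRF.
# Feature set and hyperparameters are illustrative, not the paper's exact setup.
import sklearn_crfsuite

def word_features(sent, i):
    """A deliberately small feature set: the word itself plus a few shape cues."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }

def sent_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

def least_confidence_scores(crf, sentences):
    """Score each unlabeled sentence by average token uncertainty (1 - max marginal)."""
    X = [sent_features(s) for s in sentences]
    scores = []
    for marginals in crf.predict_marginals(X):
        token_uncertainty = [1.0 - max(m.values()) for m in marginals]
        scores.append(sum(token_uncertainty) / len(token_uncertainty))
    return scores

def active_learning_round(labeled, unlabeled, batch_size=20):
    """Train on the labeled pool, then pick the sentences the model is least sure about."""
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit([sent_features(s) for s, _ in labeled], [tags for _, tags in labeled])
    scores = least_confidence_scores(crf, unlabeled)
    ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
    return crf, ranked[:batch_size]  # indices of sentences to send for annotation
```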
The results showed clear efficiency gains. Active Learning reached reasonable performance much faster than random sampling. For Irish (classified as "Threatened" by Ethnologue), F1 scores progressed from 0.71 with 1,000 tokens to 0.85 with 4,500 tokens to 0.90 with 12,000 tokens. The learning curve showed rapid initial growth, then plateaued after around 4,500-5,500 tokens—a "sweet spot" where additional annotations yield diminishing returns.
Unlike previous Active Learning research that relies on "eyeballing" learning curves, we applied Bayesian growth curve modeling (using a four-parameter Weibull function) to quantify learning rates statistically. This allowed us to rigorously compare how fast different methods improve.
The growth rate analysis confirmed that Active Learning learns approximately twice as fast as random sampling across the languages tested. While both methods eventually reach similar maximum F1 scores (upper asymptotes), Active Learning gets there with significantly fewer annotations, which is crucial for resource-constrained scenarios.
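For intuition, the sketch below fits one common four-parameter Weibull parameterization to a learning curve with ordinary least squares. The paper fits the curve in a Bayesian framework; this simpler fit, with made-up intermediate data points, only illustrates the functional form and the quantities of interest (upper asymptote and growth rate).

```python
# Sketch: fitting a four-parameter Weibull growth curve to a learning curve.
# The parameterization is one common choice, not necessarily the paper's exact form,
# and the intermediate data points are invented for illustration.
import numpy as np
from scipy.optimize import curve_fit

def weibull_growth(n, lower, upper, scale, shape):
    """F1 as a function of annotated tokens n: rises from `lower` toward `upper`."""
    return upper - (upper - lower) * np.exp(-((n / scale) ** shape))

# Illustrative learning-curve points (tokens annotated, F1 achieved).
tokens = np.array([1000, 2000, 3000, 4500, 6000, 9000, 12000], dtype=float)
f1 = np.array([0.71, 0.78, 0.82, 0.85, 0.87, 0.89, 0.90])

params, _ = curve_fit(
    weibull_growth, tokens, f1,
    p0=[0.5, 0.95, 3000.0, 1.0],              # rough starting values
    bounds=([0, 0, 1, 0.1], [1, 1, 1e5, 10]),  # keep parameters in sensible ranges
)
lower, upper, scale, shape = params
print(f"estimated upper asymptote: {upper:.3f}, growth scale: {scale:.0f} tokens")
```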
We analyzed the factors affecting individual POS tag performance, and a few patterns emerged:
- **Lexical diversity helps:** Tags with more varied word distributions (higher word entropy) yielded better F1 scores. Seeing many different nouns helps the model generalize to new nouns.
- **Syntactic complexity hurts:** Tags appearing in more diverse syntactic contexts (higher syntax entropy) performed worse. Too much structural variation makes patterns harder to learn.
- **Frequency isn't everything:** Tag probability in the training set showed no significant effect on performance. Diversity of usage patterns mattered more than raw frequency.
We also measured KL divergence between training and test sets. As training size increased, the distributions converged, and closer alignment predicted better F1 scores—confirming that representative sampling matters.
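Both quantities are straightforward to compute from tagged corpora. The sketch below shows one way to do it; the function names and smoothing choice are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch: per-tag word entropy and train/test KL divergence over tag distributions.
# Names and smoothing are illustrative, not taken from the paper's codebase.
from collections import Counter
import numpy as np
from scipy.stats import entropy  # entropy(p) = Shannon entropy; entropy(p, q) = KL(p || q)

def word_entropy_per_tag(tagged_sents):
    """Entropy of the word distribution under each POS tag (higher = more lexical diversity)."""
    words_by_tag = {}
    for sent in tagged_sents:
        for word, tag in sent:
            words_by_tag.setdefault(tag, Counter())[word.lower()] += 1
    return {
        tag: entropy(np.array(list(counts.values()), dtype=float))
        for tag, counts in words_by_tag.items()
    }

def tag_kl_divergence(train_sents, test_sents, smoothing=1e-6):
    """KL divergence between train and test tag distributions, with light smoothing."""
    train_counts = Counter(tag for sent in train_sents for _, tag in sent)
    test_counts = Counter(tag for sent in test_sents for _, tag in sent)
    tags = sorted(set(train_counts) | set(test_counts))
    p = np.array([train_counts[t] + smoothing for t in tags], dtype=float)
    q = np.array([test_counts[t] + smoothing for t in tags], dtype=float)
    return entropy(p / p.sum(), q / q.sum())
```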
Based on our findings, here is guidance for building POS tagging datasets under different scenarios, with a recommended approach and the expected effort for each.
| Scenario | Recommended Approach | Expected Effort |
|---|---|---|
| Data can be shared via API | GPT-4.1-mini with 1,000 random tokens | ~$4 in API calls + ~2 hours of annotation |
| Data must remain local | Active Learning with 4,500-5,500 tokens | Moderate annotation effort |
| Extremely restricted access | Random sampling | Higher annotation effort |
The decision hinges on ethics and community values. If data sharing is acceptable, LLMs offer remarkable efficiency. If not, Active Learning provides a principled way to minimize annotation burden while respecting data sovereignty.
Our framework extends beyond POS tagging. The methodology—comparing sampling strategies using growth curve modeling—applies to any NLP task where annotation is expensive: morphological segmentation, automatic speech recognition, named entity recognition, or machine translation.
The statistical approach we introduce moves Active Learning evaluation from subjective visual comparison to rigorous quantitative analysis. Researchers can now determine not just whether Active Learning helps, but precisely how much faster it learns and when performance plateaus.
This research serves a larger mission: enabling Indigenous and endangered language communities to build language technology on their own terms. Large language models can bootstrap initial models. Active Learning can sustain development with minimal community burden. Both approaches can empower communities to balance accuracy, cost, and data sovereignty.
Language technology should serve communities, not extract from them. By providing clear guidance on efficient dataset construction, we hope to lower barriers for community-driven language documentation and revitalization efforts.
The full paper is available on the ACL Anthology. Our code and experimental materials are available on GitHub. Please reach out to Dr. Zoey Liu for questions about the work.
This work represents collaboration across three institutions—University of Florida, UC Davis, and Boston College—and reflects our shared commitment to making NLP accessible for all languages, not just those with abundant resources.
We presented this work at EMNLP 2025 in Suzhou, China, where the conversations and feedback from the community were invaluable. If you're working on language documentation or low-resource NLP, we'd love to hear from you. What challenges are you facing in dataset creation? What approaches have worked for your communities?
For more information about our research on computational linguistics and language documentation, visit ufdatastudio.com.