Throughout history, writing systems have enabled critical communication and community record-keeping. Most modern communication relies on text-based writing systems. However, a few societies recorded governmental documents semasiographically; that is, using pictures or iconography to express ideas and concepts. Familiar modern uses of semasiographic iconography include road signs and sequences of emojis. Uniquely, the Mixtec society used a highly structured notation to record historical events in documents called codices. Rather than relying on a single icon, the Mixtec combined several characters to express a detailed narrative. We are interested in studying how traditional natural language processing (NLP) tasks can be expressed over Mixtec codices. We see a significant gap between current tools and semasiographic NLP. We have identified four initial tasks to help researchers study Mixtec codices: (1) part-of-speech tagging, (2) named entity recognition, (3) regular expression execution, and (4) entity resolution.

Example of step-by-step labeling of a Mixtec codex page to create a narrative.

In previous work, we trained a state-of-the-art model to identify objects in codices. Our early results show that we can perform gender identification with an F1 score close to 90% across the full Codex Zouche-Nuttall, Codex Selden, and Codex Vindobonensis. These results, together with previous scholarship on Mixtec, suggest that it is possible to train models to tag expressions of dates, people, and locations in the codices. After identifying objects, we study approaches for recovering the reading order of scenes within each register and the boustrophedon transitions between registers. We can then derive the implicit grammars that exist across the codices and create regular expressions to search for and concisely express patterns. This effort will enable researchers to investigate semasiographic languages more efficiently.
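
To make the idea of register ordering and pattern search concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the project's implementation: the `Detection` class, the tag names, the one-letter `TAG_CODES`, and the helper functions are all hypothetical, and the sketch assumes each detected object already carries a register index and a horizontal position. Real codices can use vertical registers and other reading directions, which this sketch does not handle.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical schema, for illustration only)."""
    label: str     # assumed tag: "person", "date", or "place"
    x: float       # horizontal center of the bounding box
    register: int  # index of the register (band); 0 is read first


# Hypothetical one-letter codes so a reading sequence becomes a plain string.
TAG_CODES = {"person": "P", "date": "D", "place": "L"}


def boustrophedon_order(detections, first_direction="ltr"):
    """Order detections register by register, flipping direction each time."""
    ordered = []
    registers = sorted({d.register for d in detections})
    for i, reg in enumerate(registers):
        row = [d for d in detections if d.register == reg]
        # Even-numbered registers follow first_direction; odd ones reverse it.
        left_to_right = (i % 2 == 0) == (first_direction == "ltr")
        row.sort(key=lambda d: d.x, reverse=not left_to_right)
        ordered.extend(row)
    return ordered


def tag_string(ordered):
    """Encode an ordered list of detections as a string of tag codes."""
    return "".join(TAG_CODES[d.label] for d in ordered)


def find_pattern(ordered, pattern):
    """Return the runs of detections whose tag codes match a regex."""
    codes = tag_string(ordered)
    return [ordered[m.start():m.end()] for m in re.finditer(pattern, codes)]


if __name__ == "__main__":
    page = [
        Detection("date", 10, 0),
        Detection("person", 30, 0),
        Detection("place", 50, 0),
        Detection("person", 45, 1),
        Detection("date", 20, 1),
    ]
    ordered = boustrophedon_order(page)
    print(tag_string(ordered))  # DPLPD
    # Search for a date immediately followed by a person.
    for match in find_pattern(ordered, r"DP"):
        print([d.label for d in match])
```

Because every tag maps to exactly one character, a regex match position in the encoded string is also a position in the ordered list of detections, which is why `find_pattern` can slice the list directly. A richer tag set would need a different encoding.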

As part of this grant, we also explore how codices map to narratives. We will work with native speakers and record their narration of the codices, then map the oral narration to the objects in the codices. We can learn from this mapping and use generative artificial intelligence to produce new codices from new narratives. We will partner with museums to communicate and disseminate these products.

This work opens up investigative opportunities for understanding image-first writing. The Mixtec used writing styles similar to those of their contemporaries, including the Aztec, Otomi, and Totonac. We are excited that this effort can be extended to explore patterns across graphic novels and children's literature. Stay tuned for more updates on this project.

Links and Resources

Publications

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant. Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task. The Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP). San Diego, California, USA. 2026. ★ Overall Winner of the shared task.
Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang. Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing. The Ninth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT) at EACL. Rabat, Morocco. 2026.
Girish Salunke, Christopher Driggers-Ellis, Christan Grant. Classifying Name-Date and Year Figures in Mixtec Codices. Computational Humanities Research Conference (CHR). Luxembourg, Luxembourg. 2025.
Alexander R. Webber, Zachary Sayers, Amy Wu, Elizabeth Thorner, Justin Witter, Gabriel Ayoubi, Christan Grant. Analyzing Finetuned Vision Models for Mixtec Codex Interpretation. The 4th Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP). Mexico City, Mexico. 2024.
