Semasiographic NLP with Mixtec Codices

Throughout history, writing systems have enabled critical communication and community record-keeping. Most modern communication relies on text-based writing systems. However, a few societies recorded governmental documents semasiographically, that is, using pictures or iconography to express ideas and concepts. Familiar modern uses of semasiographic iconography include road signs and sequences of emojis. Uniquely, the Mixtec society used a highly structured notation to record historical events in documents called codices. Rather than relying on a single icon, the Mixtec combined several characters to express a detailed narrative. We are interested in studying how traditional natural language processing tasks can be expressed in Mixtec codices. We see a significant gap between current tools and semasiographic NLP. We have identified four initial tasks to help researchers study Mixtec codices: (1) part-of-speech tagging, (2) named entity recognition, (3) regular expression execution, and (4) entity resolution.

Example step-by-step labeling of a Mixtec codex page to create a narrative.

In previous work, we trained a state-of-the-art model to identify objects in codices. Our early results show that we can perform gender identification with an F1 score close to 90% across the full Codex Zouche-Nuttall, Codex Selden, and Codex Vindobonensis. These results, together with previous scholarship on Mixtec, suggest that it is possible to train models to tag expressions of dates, people, and locations in the codices. After identifying objects, we study approaches to recovering the order of scenes within registers and the boustrophedon transitions between registers. We can then derive the implicit grammars that exist across the codices and create regular expressions to search for and concisely express patterns. This effort will enable researchers to investigate semasiographic languages more efficiently.
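To make the idea of boustrophedon ordering and regular expressions over tagged objects concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the project's actual pipeline: the Glyph class, the DATE/PERSON/PLACE tags, the one-letter codes, and the function names are all hypothetical, and it assumes horizontal registers whose reading direction simply alternates, which real codex pages do not always follow.

```python
import re
from dataclasses import dataclass


@dataclass
class Glyph:
    """A detected object on a codex page (hypothetical structure)."""
    label: str     # assumed tag set: "DATE", "PERSON", "PLACE"
    x: float       # horizontal centre of the detected object
    register: int  # index of the register (band) containing the object


# Hypothetical one-letter codes so a page can be searched as a string.
TAG_CODES = {"DATE": "D", "PERSON": "P", "PLACE": "L"}


def boustrophedon_order(glyphs, first_direction="rtl"):
    """Order glyphs register by register, flipping direction each time."""
    ordered = []
    reverse = first_direction == "rtl"  # rtl means largest x is read first
    for reg in sorted({g.register for g in glyphs}):
        row = [g for g in glyphs if g.register == reg]
        ordered.extend(sorted(row, key=lambda g: g.x, reverse=reverse))
        reverse = not reverse  # the next register reads the other way
    return ordered


def encode(glyphs):
    """Turn an ordered list of glyphs into a string of tag codes."""
    return "".join(TAG_CODES[g.label] for g in glyphs)


def find_pattern(glyphs, pattern, first_direction="rtl"):
    """Return (start, end) index spans in reading order that match pattern."""
    sequence = encode(boustrophedon_order(glyphs, first_direction))
    return [(m.start(), m.end()) for m in re.finditer(pattern, sequence)]


page = [
    Glyph("PERSON", x=10.0, register=0),
    Glyph("DATE", x=30.0, register=0),
    Glyph("PLACE", x=5.0, register=1),
    Glyph("PERSON", x=25.0, register=1),
]

# A date followed by one or more people, in reading order.
spans = find_pattern(page, r"DP+")
```

Encoding each tagged object as a single character lets the standard re module search a page directly; a richer tag set would need a tokenized matcher instead of one-letter codes.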

As part of this grant, we also explore how codices map to narratives. We will work with native speakers and record their narration of the codices, then map the oral narration to the objects in the codices. From this mapping, we can use generative artificial intelligence to generate new codices based on new narratives. We will partner with museums to communicate and disseminate these products.

This work opens up opportunities for investigating image-first writing. The Mixtec used writing styles similar to those of their contemporaries, including the Aztec, Otomi, Totonac, and others. We are excited that this effort can be extended to explore patterns in graphic novels and children's literature. Stay tuned for more updates on this project.
