As part of EMNLP24, the Hyatt Regency Hotel in Miami also played host to the Ninth Conference on Machine Translation (WMT24). The WMT workshop takes place annually at ACL-hosted conferences the world over, and this year it fell on the final two days of EMNLP24, Nov. 15-16. As the name suggests, the workshop focuses on machine translation as its primary research topic. Each year, the workshop hosts shared tasks covering various parts of the machine translation pipeline, and this year was no exception.
For the uninitiated, machine translation is the use of a computer program (in recent years, most commonly an LLM) to translate content from one natural language into another. Machine learning (ML) approaches such as LLMs require training data in the languages undergoing translation, whether to train a model from scratch or to fine-tune an existing one. And, like any quantity being optimized, translation quality requires metrics to measure it; some of the most common are BLEU, chrF++, and the various COMET scores.
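As a rough illustration, the lexical metrics BLEU and chrF++ can be computed with the sacrebleu Python package; the snippet below is a minimal sketch with invented example sentences, not output from any system discussed at WMT24. COMET scores, by contrast, come from learned evaluation models and are discussed at length later in this post.

```python
# pip install sacrebleu
# Minimal sketch: corpus-level BLEU and chrF++ for a single hypothesis/reference pair.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["The dog is barking at the mail carrier."]
references = [["The dog barks at the mailman."]]  # one inner list per reference stream

bleu = BLEU()
chrf_pp = CHRF(word_order=2)  # word_order=2 gives chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf_pp.corpus_score(hypotheses, references))
```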
In the machine translation research community, there is a strong focus on the competitiveness of machine translation approaches, best exemplified by the general shared task on machine translation, held annually as one of the primary fixtures of the workshop. Machine translation research teams submit models to perform a translation task set up by the workshop organizers, and the performance of each approach is compared against baseline models (usually winners from previous competitions), existing commercial systems, and the other competitors using various machine translation quality metrics and/or human evaluation scores. This year, the submission Unbabel-Tower70B ranked first in 8 language pairs (LPs) under human evaluation and first in all 11 LPs according to an aggregate ranking of automatic evaluation metrics (Rei et al., 2024). A summary of this submission appears later in this post.
In addition to the shared task on machine translation, shared tasks on datasets and metrics were held this year. Submissions to the dataset task were used to extend the FLORES+ and MT Seed datasets, while the metrics task is valuable in its own right: better measurement of translation quality will make future machine translation research more efficient.
Each year, researchers also submit many papers to the workshop outside the shared tasks, revealing new findings on machine translation that fall outside the scope of the specific shared tasks proposed by the event organizers. Throughout the two days of the workshop, presentations and posters disseminated the findings of various research teams on new techniques, new datasets, and new approaches to evaluating machine translation. Two papers that caught this researcher's eye were the introduction of a new benchmark called Vistra for the translation of visually-situated text on signage (Salesky, Koehn, and Post, 2024) and a literature review detailing how the COMET evaluation metric has been misused by the machine translation research community, complete with recommendations for better future use and a new software package called sacreCOMET (Zouhar et al., 2024). Summaries of both works are in the following sections.
WMT is an important annual event where some of the most significant work in machine translation research is presented, and WMT24, held this November in Miami, was no exception. Results from research advancing every stage of the machine translation pipeline were proudly on display, and exciting new developments took center stage. Just a handful of them are summarized in the remaining sections of this post.
Pitfalls and Outlooks in Using COMET (Zouhar et al., 2024)
In this review and treatment of problems with the COMET family of translation quality metrics as they are currently used in the machine translation literature, Zouhar et al. identify a set of nine problems separated into three broad categories; provide recommendations on using COMET for the machine translation research community; and introduce the sacreCOMET package, which the authors believe will foster more effective use of COMET by machine translation researchers.
Formally, COMET is a framework for training evaluation models that produce machine translation quality metrics. These neural metrics are learned by evaluation models from human judgments of existing machine-translated language data. The COMET framework was first proposed at EMNLP20, where metrics produced by evaluation models trained with it achieved state-of-the-art performance on the WMT19 metrics shared task (Rei et al., 2020). In the machine translation literature, it is common to refer to a metric trained using COMET as a “COMET score” regardless of the specific model used, and to refer to the “COMET metric” without specifying which evaluation model was used. Both are sources of potential confusion identified and treated by Zouhar et al. (2024). COMET scores are typically normalized to the range 0 to 1, inclusive.
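As a concrete (and hedged) example of what "a COMET score" means in practice, the snippet below sketches how a score is typically computed with the unbabel-comet package, assuming the widely used wmt22-comet-da checkpoint; the sentences are invented, and the exact API may differ across package versions.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download and load one specific COMET evaluation model (chosen here only for illustration).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellt den Briefträger an.",      # source sentence
    "mt":  "The dog is barking at the mail carrier.",  # machine translation (hypothesis)
    "ref": "The dog barks at the mailman.",            # human reference translation
}]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 to run on a GPU
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level score
```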
The problems identified in Zouhar et al. (2024) fall into the categories of Technicality, Training and Test Data, and Tool Usage and Score Interpretation.
Technicality issues with COMET concern the use of obsolete Python and COMET packages and nonstandard computational precision. As might be expected, the recommendation from Zouhar et al. is to use the most up-to-date versions of Python and of the COMET package, unbabel-comet. Somewhat surprisingly, when investigating inconsistencies between COMET calculations performed at full (32-bit floating point) and half (16-bit floating point) precision, the authors found that full- and half-precision COMET scores are effectively identical to 3 significant digits on both CPU and GPU. The recommendation is therefore to use half precision on GPU, where it provides about a 30% speedup with large batch sizes.
Issues with training and test data stem mostly from the fact that COMET is a learned metric, and it is therefore prone to many of the same pitfalls as any machine learning approach. First, there is no guardrail against an empty hypothesis, the case in which an empty string is evaluated by the model; it is possible for COMET evaluation models to rate an empty string more highly than legitimate translations from machine translation models. Zouhar et al. (2024) simply recommend that the score for such cases be fixed to 0 and that lexical metrics like BLEU and chrF be used to screen for other malformed hypotheses. Zouhar et al. also identify potential error whenever the hypothesis is in a different language than the reference string, demonstrating that COMET scores translations into the wrong target language much more highly than random but fluent data. This issue arises often when multilingual LLMs are asked to perform a machine translation task because, even when the task's target language is specified, LLMs can hallucinate and translate into an unexpected language (Zhang, Haddow, and Birch, 2023). The recommendation for language mismatches is less straightforward: the language of the hypothesis needs to be detected by an external language identification model, and the COMET score should be set to 0 for hypotheses in unexpected languages. Finally, the training data for a COMET evaluation model may be biased in some way, leading to a bias in the resultant metric. Because different evaluation data are needed to build COMET metrics for each source-target language pair (LP), COMET scores cannot be compared across language pairs (Zouhar et al., 2024). Additionally, Zouhar et al. (2024) report that domain bias, bias in the context and subject matter from which training data were sampled, can affect the efficacy of COMET metrics, and they observe that adversarial translation models can game the metric by disguising translations for a domain less represented in the metric's training data as translations in a more-represented domain.
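A minimal sketch of the two score-level guardrails recommended above (zeroing empty hypotheses and zeroing hypotheses in an unexpected language) is given below. The detect_language callable is a hypothetical stand-in for an external language identification model, which the paper does not prescribe.

```python
from typing import Callable, List

def guarded_comet_scores(
    hypotheses: List[str],
    raw_scores: List[float],
    target_lang: str,
    detect_language: Callable[[str], str],  # hypothetical external language-ID model
) -> List[float]:
    """Apply the guardrails recommended by Zouhar et al. (2024) to raw COMET scores."""
    guarded = []
    for hyp, score in zip(hypotheses, raw_scores):
        if not hyp.strip():
            guarded.append(0.0)  # empty hypothesis: fix score to 0
        elif detect_language(hyp) != target_lang:
            guarded.append(0.0)  # wrong target language: fix score to 0
        else:
            guarded.append(score)
    return guarded
```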
Examining issues with tool usage and score interpretation, Zouhar et al. (2024) identify two major problems: the COMET evaluation framework presently supports only a single reference translation when scoring a hypothesis, and the COMET metric is often reported without specifying which COMET model was used. Support for multiple references is desirable to handle cases in which a source sentence has many acceptable translations in the target language (Zouhar et al., 2024), but Zouhar et al. call this recommendation "aspirational" and an open direction for future work. The recommendation for reporting is to always specify which evaluation model was used and to cite the paper that introduced that model rather than the original COMET proposal, Rei et al. (2020).
Following their review of issues facing COMET, Zouhar et al. introduce the sacreCOMET software package as an answer to the software versioning, computational precision, and model reporting issues. The two primary features of sacreCOMET are the 'cite' command and the interactive/flag-based COMET signature generator. The cite command produces a BibTeX citation for whatever model is given in the model flag. Running the sacrecomet executable without flags launches an interactive questionnaire that produces a signature recording the COMET version, model, and precision used; with the model and prec flags specified, the signature is produced automatically.
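The sketch below illustrates the kind of information such a signature records (Python version, COMET package version, evaluation model, and precision); the exact signature format here is an assumption, and the real sacrecomet tool should be used to generate signatures for publication.

```python
import sys
from importlib.metadata import version

def comet_signature(model: str = "Unbabel/wmt22-comet-da", precision: str = "fp32") -> str:
    """Illustrative only: compose a COMET-style signature from environment information."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    comet_version = version("unbabel-comet")  # requires the package to be installed
    return f"Python{python_version}|Comet{comet_version}|{model}|{precision}"

print(comet_signature())
```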
Benchmarking Visually-Situated Translation of Text in Natural Images (Salesky, Koehn, and Post, 2024)
In this work, Salesky, Koehn, and Post introduce Vistra, a benchmark of images showing English (En) text in natural contexts, paired with visually-situated translations into four target languages: Mandarin Chinese (Zh), German (De), Spanish (Es), and Russian (Ru). In addition to introducing Vistra, Salesky, Koehn, and Post use the dataset to benchmark optical character recognition (OCR) and machine translation models.
According to the presentation Salesky gave at WMT24, Vistra contains 772 images of English text in natural contexts; much of the dataset consists of street signs in Baltimore and was compiled by the research team itself. The dataset was released under the Creative Commons CC BY-SA license. The imaged subjects vary in the amount and style of text as well as the layout of text on the subject, and the images themselves vary in sign-framing and dimensions. Images in which signage text was difficult for humans to read were excluded from the final dataset.
In a benchmark of OCR models, Salesky, Koehn, and Post tasked Paddle-OCR (Du et al., 2020), Tesseract OCR (Smith, 2007), Google Cloud Vision OCR (Popat et al., 2017; Ingle et al., 2019), and GPT-4o with extracting the text in Vistra images and, except for GPT-4o, producing bounding boxes around it. From an analysis of their performance, Salesky, Koehn, and Post (2024) produce a taxonomy of eight common OCR error types.
In a study of how OCR errors affect a downstream machine translation task, Salesky, Koehn, and Post task mBART (Liu et al., 2020), Google Translate, and GPT-4o with translating the visually-situated Vistra text from transcriptions produced by the previously benchmarked OCR models. GPT-4o, being a multimodal LLM, was also prompted to translate the text in Vistra images directly, without the intermediate OCR step, as a point of comparison to the cascaded machine translation task.
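For readers unfamiliar with the cascaded setup, the sketch below shows the general OCR-then-translate pattern using Tesseract via pytesseract and an mBART-50 checkpoint from Hugging Face; these specific tools and the checkpoint name are assumptions for illustration, not the exact configurations benchmarked by Salesky, Koehn, and Post.

```python
# pip install pytesseract pillow transformers sentencepiece  (the Tesseract binary must also be installed)
import pytesseract
from PIL import Image
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Step 1: OCR. Extract English text from an image of a sign.
sign_text = pytesseract.image_to_string(Image.open("sign.jpg")).strip()

# Step 2: MT. Translate the OCR transcription into German with mBART-50 (assumed checkpoint).
checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(checkpoint)
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

tokenizer.src_lang = "en_XX"
inputs = tokenizer(sign_text, return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```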
Salesky, Koehn, and Post (2024) note strong performance from the multimodal LLM approach on the COMET metric (wmt22-comet-da) but weaker lexical translation quality as measured by chrF and BLEU. In their analysis of these data, Salesky, Koehn, and Post note the following: when presented only with OCR output in cascaded machine translation, GPT-4o translated all 14 instances of the word "Exit" in Vistra into German as "Ausgang" rather than "Ausfahrt," even though it is the visual context that determines which translation is correct; when asked to translate the Vistra images directly, GPT-4o chose "Ausgang" only 5 times. Because of how COMET metrics are learned, they are built not to punish good-enough translations rather than to require exactly correct ones. Salesky, Koehn, and Post therefore caution that the COMET metric may be less well aligned with the context-dependent, visually-situated translation task than lexical metrics, which are less forgiving.
In light of the results of their experiments with cascading machine translation and multimodal LLMs, Salesky, Koehn and Post suggest that a new metric for the investigated translation task is in order.
Tower v2: Unbabel-IST 2024 Submission for the General MT Shared Task (Rei et al., 2024)
This work introduces TOWER-V2, which underpins the submission by Rei et al. to the WMT24 shared task on general machine translation. As its name suggests, TOWER-V2 iterates on the recent TOWER model published in February by Alves et al. (2024). The primary advancements are improvements to the models' training data; extension of model support to 5 new languages, including the under-resourced Icelandic (Is) and Ukrainian (Uk); and scaling of the translation model presented in TOWER from 7B and 13B parameters to as many as 70B.
TOWER and TOWER-v2 are LLMs that have been finetuned for machine translation. In this work, Rei et al. report that TOWER-v2 7B is based on MISTRAL-7B (Jiang et al., 2023) and that TOWER-v2 70B is based on Llama-3-70B (AI@Meta, 2024), while the original TOWER was based on Llama-2 (Alves et al., 2024). In both cases, TOWER works by extending the backbone model's training on multilingual corpora, followed by further training on corpora specialized for machine translation tasks, and finally by finetuning the translation model to follow prompts for machine translation tasks (Rei et al., 2024; Alves et al., 2024). TOWER-v2 improves this pipeline by expanding the data corpora, introducing new languages to the training data, and using different, larger LLMs as the backbone (Rei et al., 2024).
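As a rough sketch of the final stage of that pipeline, the snippet below prompts an instruction-finetuned translation LLM through the transformers library; the checkpoint name and prompt wording are assumptions chosen for illustration, not the exact setup of Rei et al. (2024).

```python
import torch
from transformers import pipeline

# Assumed publicly available TOWER-family checkpoint, used here only for illustration.
pipe = pipeline(
    "text-generation",
    model="Unbabel/TowerInstruct-7B-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": "Translate the following text from English into German.\n"
               "English: The exit is on the left.\nGerman:",
}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=64, do_sample=False)
print(outputs[0]["generated_text"])
```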
The WMT24 general shared task on machine translation focused on translation from English into other languages (En->Xx) and on non-English to non-English translation (Xx->Yy). In keeping with that, Rei et al. (2024) present an experiment comparing the new TOWER-v2 models in various configurations against other LLM baselines, both with and without quality-aware decoding (QAD) strategies. TOWER-v2-70B with QAD outperformed the standard TOWER-v2-7B and 70B models and the baseline LLMs, including Claude-Sonnet-3.5 and GPT-4o, on both the En->Xx and Xx->Yy translation tasks when tested on the WMT24 shared task test data. In an aggregate ranking of MetricX (Juraska et al., 2023), xCOMET (Guerreiro et al., 2023), and COMETKiwi (Rei et al., 2023) scores for each of the languages supported in the two tasks, TOWER-v2-70B outranked or was comparable to the baseline LLMs, and TOWER-v2-70B with QAD outranked both TOWER-v2-70B without QAD and the baseline LLMs on all three metrics in both tasks.
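Quality-aware decoding can take several forms; one common variant is to generate multiple candidate translations and rerank them with a reference-free quality-estimation model. The sketch below illustrates only that reranking step with a COMETKiwi checkpoint, as an assumed example of the general pattern rather than a description of the exact QAD strategies used in the submission; candidate generation is left abstract.

```python
from typing import List
from comet import download_model, load_from_checkpoint

# Reference-free QE model (assumed checkpoint; gated on Hugging Face and requires accepting its license).
qe_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def rerank(source: str, candidates: List[str]) -> str:
    """Return the candidate translation with the highest QE score for this source."""
    data = [{"src": source, "mt": candidate} for candidate in candidates]
    scores = qe_model.predict(data, batch_size=8, gpus=0).scores
    return max(zip(candidates, scores), key=lambda pair: pair[1])[0]

print(rerank(
    "Der Ausgang befindet sich links.",
    ["The exit is on the left.", "The output is located left."],
))
```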
In the WMT24 general machine translation shared task, as reported in the shared task findings by Kocmi et al. (2024), TOWER-v2-70B was the best-performing participant, winning the competition in 8 of 11 language pairs (LPs) as judged by human evaluation of the translations. It also placed first for every LP according to rankings made with the automatic metrics COMETKiwi and MetricX, but Kocmi et al. caution that this is evidence of overfitting in TOWER-v2-70B, since the human evaluations did not agree with its dominance in the automatic rankings. While declaring TOWER-v2-70B the best performance by a participant on the strength of its 8 human-evaluation wins, Kocmi et al. also report that the commercial LLM Claude-3.5 won 9 LPs.
References
AI@Meta. 2024. Llama3 Model Card.
Alves et al. 2024. Tower: An Open Multilingual Large Language Model for Translation-Related Tasks.
Du et al. 2020. PP-OCR: A Practical Ultra Lightweight OCR System.
Guerreiro et al. 2023. xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection.
Ingle et al. 2019. A Scalable Handwritten Text Recognition System.
Jiang et al. 2023. Mistral 7B.
Juraska et al. 2023. MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task.
Kocmi et al. 2024. Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet.
Liu et al. 2020. Multilingual Denoising Pre-training for Neural Machine Translation.
Popat et al. 2017. Sequence-to-Label Script Identification for Multilingual OCR.
Rei et al. 2020. COMET: A Neural Framework for MT Evaluation.
Rei et al. 2023. Scaling up CometKiwi: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task.
Rei et al. 2024. Tower v2: Unbabel-IST 2024 Submission for the General MT Shared Task.
Salesky, Koehn, and Post. 2024. Benchmarking Visually-Situated Translation of Text in Natural Images.
Smith. 2007. An Overview of the Tesseract OCR Engine.
Zhang, Haddow, and Birch. 2023. Prompting Large Language Model for Machine Translation: A Case Study.
Zouhar et al. 2024. Pitfalls and Outlooks in Using COMET.