How can you consistently and efficiently evaluate machine translation engines? Subjective observations of machine translation (MT) “quality” are simple and easy, but reviewing 35-40 words tells you nothing about your long-term experience with an engine across 10,000 words.
The end of this article has two appendices. One is a glossary of our score terms. The other shows twelve (12) example source-target segments plus the output translations from Isabella’s engine and Google.
A truly objective, accurate and automated evaluation of MT linguistic quality is beyond today’s state of the art. In fact, this deficit is what leads to the poor quality of MT output in the first place. This doesn’t mean MT is useless; translators use MT every day.
What good are MT evaluations if they can’t accurately report a translation’s quality?
Slate Desktop™’s evaluation scores do not tell you about the quality of an engine’s translations. Instead, Slate Desktop™ focuses on describing engine criteria that can be measured objectively. Here, I generically refer to these criteria as an engine’s “linguistic performance.” The scores indicate how an engine might reduce or increase a translator’s workload compared to another engine. With objective evaluation scores, you can better predict how an engine might affect your work efficiency in the long term.
So, let’s look at the best practices of MT evaluation. Then, I’ll review Isabella’s engine scores with a focus on how they relate to her client’s work. Finally, I’ll compare Google’s output from the same evaluation segments with Isabella’s engine results.
Evaluation Best Practices
Current MT evaluation best practices require an evaluation set with 2,000-3,000 source-target segment pairs. The source segments represent the variety of work that the translator is likely to encounter. The target segments represent the desired reference translations.
The evaluation process uses the MT engine you’re evaluating to create “test” segments from the evaluation set’s source segments. It then measures each “test” segment against its respective “reference” and assigns a “closeness” score. These are like fuzzy match scores, but they compare test segments to reference targets rather than source segments to TM entries. The process accumulates the individual scores, much like an average, to describe how the engine performed with that evaluation set. A performance description for one engine has some value, but it’s much more valuable to compare different engines’ descriptions for the same evaluation set, because that tells us which engine performs better.
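The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `closeness` function uses Python’s `difflib.SequenceMatcher` as a stand-in similarity measure, the two “engines” are hypothetical lookup tables, and none of this is Slate Desktop’s actual scoring code.

```python
from difflib import SequenceMatcher

def closeness(test: str, reference: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means test and reference are identical."""
    return SequenceMatcher(None, test.split(), reference.split()).ratio()

def evaluate(translate, evaluation_set):
    """Translate each source segment and average its closeness to the reference."""
    scores = [closeness(translate(source), reference)
              for source, reference in evaluation_set]
    return sum(scores) / len(scores)

# Toy evaluation set: (source, reference) pairs. Real sets hold 2,000-3,000 pairs.
evaluation_set = [
    ("good morning", "buongiorno"),
    ("thank you very much", "grazie mille"),
]

# Hypothetical stand-in engines: plain dictionaries used as translate functions.
engine_a = {"good morning": "buongiorno", "thank you very much": "grazie mille"}.get
engine_b = {"good morning": "buon mattino", "thank you very much": "grazie mille"}.get

score_a = evaluate(engine_a, evaluation_set)
score_b = evaluate(engine_b, evaluation_set)
print(f"engine A: {score_a:.2f}, engine B: {score_b:.2f}")
```

Either score alone says little. Comparing the two on the same evaluation set is what shows that engine A performs better here.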
Measuring Isabella’s Engine
Isabella reported she started with three .tmx files and 250,768 segment pairs from the same client since 2003. Her Engine Summary (image below) shows Slate built Isabella’s engine from 119,053 segments after it removed 131,715 segment pairs (53%) for technical reasons. You can learn more about translation memory preparation on our support site.
Slate randomly removed and set aside 2,353 segment pairs as the evaluation set, a sample that represents Isabella’s 14 years of work, leaving 116,700 pairs to create the engine’s statistical models. During the evaluation process, the source segments are like a new project from the engine’s viewpoint. That is, the engine is not recalling segments that were used to build it. This evaluation strategy gives a 95% confidence that the engine will perform similarly when Isabella gets a new project from this client.
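The counts above fit together arithmetically, and the random hold-out step is easy to sketch. The `hold_out` function below is a hypothetical illustration of a random split, not Slate’s implementation, and the segment pairs are placeholders.

```python
import random

def hold_out(pairs, eval_size, seed=0):
    """Randomly set aside eval_size pairs; return (training_pairs, evaluation_pairs)."""
    rng = random.Random(seed)
    eval_indices = set(rng.sample(range(len(pairs)), eval_size))
    training = [p for i, p in enumerate(pairs) if i not in eval_indices]
    evaluation = [p for i, p in enumerate(pairs) if i in eval_indices]
    return training, evaluation

# Counts reported in the article.
original_pairs = 250_768
removed_pairs = 131_715
usable_pairs = original_pairs - removed_pairs  # 119,053

# Placeholder segment pairs standing in for the cleaned translation memory.
pairs = [(f"source {i}", f"target {i}") for i in range(usable_pairs)]
training, evaluation = hold_out(pairs, eval_size=2_353)

print(len(training), len(evaluation))
```

Because the evaluation pairs never reach the training side, the engine sees their source segments as new material.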
Isabella’s Engine vs Google
Before I could compare the performance of Isabella’s engine to Google, Isabella graciously granted me permission to translate her evaluation set’s 2,353 source segments using Google Translate. Here are Google’s evaluation scores side-by-side with Isabella’s.
|Average segment length (words per segment)||16.5|
|Evaluation Scores||Google Translate||Isabella’s engine|
|Evaluation BLEU score (all)||33.07||69.33|
|Evaluation BLEU score (1.0 filtered)||32.47||61.82|
|Edit Distance per line (non-zero)||42||32|
|Exact matches count||102||700|
|Edit Distance entire project||93,605||52,856|
|Average segment length (exact matches)||4.7||11.4|
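The table’s figures can be cross-checked against each other. In each row the first value is Google Translate’s and the second is Isabella’s engine, as the discussion of the BLEU scores confirms. One plausible reading, which is an assumption and not a documented Slate formula, is that “Edit Distance per line (non-zero)” is the project’s total edit distance divided by the number of segments that were not exact matches. A minimal sketch of that check:

```python
SEGMENTS = 2_353  # size of the evaluation set

# (exact matches, edit distance for the entire project), copied from the table
engines = {
    "Google Translate": (102, 93_605),
    "Isabella's engine": (700, 52_856),
}

def per_line_edit_distance(exact_matches, total_edit_distance, segments=SEGMENTS):
    """Average edit distance over the segments that needed at least one edit."""
    return total_edit_distance / (segments - exact_matches)

results = {name: per_line_edit_distance(*values) for name, values in engines.items()}
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

Rounded to whole numbers, the results agree with the 42 and 32 shown in the table, which supports this reading.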
This Engine Summary table includes a variety of scores, but these are the three that I rely on the most: the average segment length, the quality quotient, and the evaluation BLEU score (1.0 filtered).
The average segment length of source segments in the evaluation set tells us whether Isabella’s translation memories are heavily weighted with terms, such as from a termbase. Isabella’s 16.5 average above is normal, so the translation memories likely include a good balance of short and long segments. If the average were very small (for example 4 or 5 words), the engine would work poorly with long sentences.
The quality quotient (QQ) score means it’s likely that Isabella will simply review up to 30% of segments as exact matches when she works with her engine on her client’s future projects: her engine produced 700 exact matches among the 2,353 evaluation segments. Exact matches with this engine are about 7 times more likely than if she did the same work with Google, which produced 102.
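The percentages follow directly from the exact-match counts in the table. The sketch below assumes the QQ is simply the share of evaluation segments that came back as exact matches; the article implies this but does not spell out the formula.

```python
SEGMENTS = 2_353  # segments in the evaluation set

# Exact-match counts copied from the table.
exact_matches = {"Google Translate": 102, "Isabella's engine": 700}

def exact_match_share(count, segments=SEGMENTS):
    """Fraction of evaluation segments that needed no changes at all."""
    return count / segments

shares = {name: exact_match_share(count) for name, count in exact_matches.items()}
ratio = shares["Isabella's engine"] / shares["Google Translate"]

for name, share in shares.items():
    print(f"{name}: {share:.1%} exact, {1 - share:.1%} need changes")
print(f"ratio: {ratio:.1f}x")
```

The complement of each share is the portion of segments that still need editing, roughly 70% for Isabella’s engine and 96% for Google.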
The evaluation BLEU score (1.0 filtered) represents the amount of typing and/or dictation work Isabella will need to do when her engine fails to suggest an exact match. Her engine’s score of 61.8 indicates that its segments are likely to require less work than segments from Google, which scored 32.5. It’s important to note that Google’s BLEU score for this evaluation set is comparable to Google’s scores in other published evaluations.
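For readers unfamiliar with BLEU, here is a simplified sketch of how such a score can be computed: clipped n-gram precision up to 4-grams, combined by a geometric mean and a brevity penalty. It assumes whitespace tokenization, one reference per segment and no smoothing, so it will not reproduce Slate’s numbers. Treating “1.0 filtered” as “exact matches excluded” is also an assumption, and the example segments are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(tests, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale, one reference per segment."""
    matched, total = [0] * max_n, [0] * max_n
    test_len = ref_len = 0
    for test, reference in zip(tests, references):
        t, r = test.split(), reference.split()
        test_len += len(t)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            t_counts, r_counts = ngrams(t, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as the reference has it.
            matched[n - 1] += sum((t_counts & r_counts).values())
            total[n - 1] += sum(t_counts.values())
    if test_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if test_len >= ref_len else math.exp(1 - ref_len / test_len)
    return 100 * brevity_penalty * math.exp(log_precision)

def filtered_bleu(tests, references):
    """BLEU over only the segments where the test is not an exact match."""
    kept = [(t, r) for t, r in zip(tests, references) if t != r]
    return corpus_bleu([t for t, _ in kept], [r for _, r in kept])

# Invented example segments: the first is an exact match, the second is not.
references = ["il gatto è sul tappeto rosso",
              "premere il pulsante verde per avviare la macchina"]
tests = ["il gatto è sul tappeto rosso",
         "premere il tasto verde per avviare la macchina"]

bleu_all = corpus_bleu(tests, references)
bleu_filtered = filtered_bleu(tests, references)
print(f"all: {bleu_all:.2f}, filtered: {bleu_filtered:.2f}")
```

Removing exact matches lowers the score, which is consistent with the filtered rows in the table being lower than the unfiltered ones for both engines.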
Putting It All Together
Isabella described her translation memories as client-specific with mostly her translations, those of a trusted colleague and some from unknown colleagues. She said, “All in all, a great mess” because they contain some terminological discrepancies, long convoluted segments, and other one-word long segments. She created her engine on her 4-year-old laptop computer in less than a day without any specialized training.
Isabella’s evaluation set is a representative subset of her TMs that Slate set aside when it built the engine. The evaluation set’s scores show that her engine significantly outperforms Google Translate in every measured category. Furthermore, because of how Slate created the evaluation set, and because her translation memories are primarily her own work for this specific client, she has a 95% likelihood of experiencing similar performance with future work from that client.
When Isabella works on projects with Slate, her engine is likely to give her 7 of every 10 segments that require changes (the converse of the QQ). Like many users, she might find these suggestions overwhelming because she’s accustomed to the CAT tool hiding the suggestions from poor fuzzy matches. Still, 70% represents much less work than the 96% she would likely receive from Google. With a little practice, it’s easy and fast to trash segments that require radical changes and start from scratch.
There’s no way to predict how her engine will perform with work from other clients or other subject matter. The nature of statistical machine translation technology tells us that performance will degrade as a project’s linguistic content diverges from the content of her engine’s corpus. Her engine’s performance could drop significantly for projects with disparate linguistic content. Fortunately, Isabella controls her engine and Slate gives her some tools to clean up the “great mess,” for example by using forced terminology files to resolve the terminological discrepancies.
This was her first engine and she can experiment to her heart’s content. She can create as many engines as she likes. She can mix various translation memories and compare their performance, much like I compared her engine to Google in this article. Furthermore, she can experiment without any additional cost. If she has translation memories for five clients, she can create one engine for each of them or one that combines all. I look forward to hearing about her experiments.
When using Google Translate, Isabella needs to wait for Google to update and improve its engine. For example, her Google results reflect Google’s recent update of its en-it engine to NMT, so these scores already include those improvements. To Google’s credit, it handles variations across different subjects better than Isabella’s engine likely will. As Isabella pointed out, Google “has been constantly improving since inception.” So, across many different subjects, Google will continue to deliver 4% to 5% exact matches.
Fortunately, Isabella doesn’t face an either-or decision. Isabella’s first Slate Desktop™ engine performs well with her client’s projects, but we don’t know how it will perform with other projects. It costs her nothing to try it or improve it. Finally, she can also use Google whenever she feels it might be beneficial.
Slate’s Engine summary and table above present standard and custom machine translation scores. These scores are designed to help you predict the performance you can expect when using the engine.
These examples are the three longest segments from each of four categories. The human references for the BLEU scores are Isabella’s translations from her TM.
Isabella Massardo invited me to write a guest post to explain Slate Desktop™ evaluation scores, and to compare them to Google’s results. This is a re-post of that article on her blog: http://massardo.com/blog/mt-evaluation/. Thank you, Isabella.