Common terms for new users of statistical machine translation (SMT) and the Moses Toolkit.
Aligned data are the elements of a parallel corpus consisting of two or more languages. Each element in one language matches the corresponding element in the other language(s). The elements, sometimes called segments, can be block-aligned, paragraph-aligned, sentence-aligned, phrase-aligned or token-aligned.
There are two alignment processes. In corpus preparation, the alignment process creates aligned data. During training, the alignment process uses a program such as MGIZA++ to create word alignment files.
BLEU stands for “Bilingual Evaluation Understudy”. A BLEU score indicates how closely the token sequences in one set of data, such as machine translation output, correlate with (match) the token sequences in another set of data, such as a reference human translation. See: evaluation process
Corpus preparation is the general process of extracting, transforming and categorizing various documents from their original purpose, and aligning the resulting data into a parallel corpus for training a translation model.
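The matching idea behind a BLEU score can be illustrated with a minimal sketch. It computes only a clipped n-gram precision for one sentence pair and one value of n; a full BLEU score combines several n-gram orders and applies a brevity penalty, both omitted here. The function names and the example sentences are assumptions made for illustration, not part of any Moses tool.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference ("clipping").
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Hypothetical, already tokenized example sentences.
machine_output = "the cat sat on the mat"
human_reference = "the cat is on the mat"
unigram_precision = clipped_precision(machine_output, human_reference, 1)
bigram_precision = clipped_precision(machine_output, human_reference, 2)
```

Here five of the six output tokens and three of the five output bigrams also occur in the reference, so the longer the matching sequences required, the lower the precision.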
development set (dev set)
See “tuning set”
evaluation set (eval set)
See “test set”
The evaluation process uses a translation model, built from components created in the training process and configured in the tuning process, to translate several thousand source language sentences in the eval set. This process then compares the resulting machine translations to reference translations, also in the eval set. The final BLEU score evaluation report shows how well the machine translations match the reference translations.
A translation model that uses a hierarchical training corpus.
hierarchical training data
A training corpus with each phrase annotated with the hierarchical structure of the language, such as parts of speech, word function, etc.
A “language model” or “lm” is a statistical description of one language that includes the frequencies of token-based n-gram occurrences in a corpus. The “lm” is trained from a large monolingual corpus and saved as a file. The language model file is a required component of every translation model. Moses uses the language model to select the most “probable” target language sentence from a large set of “possible” translations it generated using the phrase table and reordering table.
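As a rough illustration of how n-gram frequencies can rank “possible” sentences, the sketch below trains bigram counts on a tiny hypothetical corpus and picks the more “probable” of two candidates. The add-one smoothing, the `<s>` and `</s>` boundary markers and all function names are assumptions chosen for brevity; language model toolkits use more refined estimation and their own file formats.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from space-separated sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_probability(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log probability of a sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    score = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return score

# Hypothetical monolingual corpus and candidate translations.
corpus = ["the house is big", "the big house is old", "a big house"]
unigrams, bigrams = train_bigram_counts(corpus)
candidates = ["the big house", "the house big"]
best = max(candidates, key=lambda s: log_probability(s, unigrams, bigrams))
```

With this corpus, `best` is "the big house", because the bigram "big house" was observed while "house big" was not.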
language model types
Language model files contain statistical data generated by one of various programs. The Moses Decoder can use language model file types including KenLM, SRILM, RandLM and IRSTLM. The SRILM, RandLM and IRSTLM toolkits include tools that train new language model files. KenLM, however, only reads ARPA standard language model files, which can be created by SRILM or IRSTLM.
moses configuration file
The Moses configuration file is a text file created during the tuning process. The file contains the paths to the phrase table(s), reordering table and language model(s), along with other codes and numeric values that control how the Moses Decoder works.
An n-gram is a sub-sequence of n items (n = 1, 2, 3, etc.) in a larger sequence. In an lm, n-grams are sequences of tokens. In phrase tables and reordering tables, n-grams are sequences of pairs of source and target language tokens.
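A short sketch of extracting n-grams from a token sequence follows. The function name `ngrams` and the sample sentence are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return every contiguous sub-sequence of n items, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the house is big".split()
unigrams = ngrams(tokens, 1)   # four 1-grams
bigrams = ngrams(tokens, 2)    # three 2-grams
trigrams = ngrams(tokens, 3)   # two 3-grams
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.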
See “parallel data”
A linguistic corpus of two or more languages where each element in one language corresponds to an element with the same meaning in the other language(s). The original, authored language is identified as the source language. Non-source languages are referred to as “target” languages. For Moses SMT, parallel data takes the form of one source language text file and one target language text file, where each line of one file is the translation of the same-numbered line of the other.
A “phrase table” is a statistical description of a parallel corpus of source–target language sentence pairs. The frequencies with which n-grams in a source language text co-occur with n-grams in a parallel target language text represent the probability that those source-target paired n-grams will occur again in other texts similar to the parallel corpus. In practical terms, the phrase table is a file created during the training process and saved in the translation model folder. It functions as a sophisticated dictionary between the source and target languages. Phrase tables and reordering tables are translation model components.
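The line-by-line correspondence of the two files can be sketched as below. In-memory text streams stand in for the two files so the example runs on its own; the function name `read_parallel` and the sample sentences are assumptions for illustration.

```python
import io

def read_parallel(source_file, target_file):
    """Pair line i of the source file with line i of the target file."""
    source_lines = [line.rstrip("\n") for line in source_file]
    target_lines = [line.rstrip("\n") for line in target_file]
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target files must have the same number of lines")
    return list(zip(source_lines, target_lines))

# In-memory stand-ins for one source and one target language text file.
source = io.StringIO("the house is big\nthe cat is small\n")
target = io.StringIO("la maison est grande\nle chat est petit\n")
pairs = read_parallel(source, target)
```

The length check matters in practice: a single missing or extra line shifts every later sentence pair out of alignment.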
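The link between co-occurrence frequency and probability can be shown with a minimal sketch. It assumes the source-target phrase pairs have already been extracted (that step is not shown) and estimates a single relative frequency per pair; an actual phrase table file typically records more than one score per entry. All names and data here are hypothetical.

```python
from collections import Counter, defaultdict

def build_phrase_table(phrase_pairs):
    """Estimate P(target | source) as a relative frequency of co-occurrence."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / source_counts[source]
    return dict(table)

# Hypothetical phrase pairs, as if already extracted from aligned sentences.
observed = [
    ("big house", "grande maison"),
    ("big house", "grande maison"),
    ("big house", "maison grande"),
    ("house", "maison"),
]
table = build_phrase_table(observed)
```

In this toy data, "big house" was paired with "grande maison" in two of its three occurrences, so that entry receives a probability of 2/3.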
A “pipeline” is a toolchain of processes connected by standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one.
A recaser model is a special translation model that translates lowercased data to “natural” cased text (upper and lower casing).
A “reordering table” contains the statistical frequencies that describe the changes in word order between source and target languages, such as “big house” versus “house big”. In practical terms, a “reordering table” is a file created during the training process and saved in the model folder. The reordering table is a translation model component.
The source language is the language of the text that is to be translated. Typically, this is the authored language of the text. The source language is the same as the TMX specification “srclang” attribute of the <tu> tag.
The target language is the language the source language text should be translated to.
Tokenization is the process of separating words from punctuation and symbols into tokens.
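A deliberately simple sketch of tokenization using a regular expression is shown below. It is far cruder than a production tokenizer: for example, it would split an abbreviation such as "U.S." into four tokens. The function name and sample text are assumptions for illustration.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation/symbol tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Hello, world! It costs $5.")
# Joining with spaces gives one space-separated token per word or symbol.
tokenized_line = " ".join(tokens)
```

After this step every word, punctuation mark and symbol in the line is separated from its neighbours by a space.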
Tokens are the basic unit in a machine translation process. A token is a sequence of characters, such as a word, punctuation mark or symbol, separated from other tokens by a space. See: BLEU score
A “toolchain” is a series of linked or “chained” programming tools where the output of an “upstream” tool becomes the input for a “downstream” tool.
See: training corpus
Training is a process in the machine learning branch of the artificial intelligence field. In the training process, a system “learns” the relationships between parallel data. In SMT, the source language texts are stimuli that generate the target language text as a response. In practical terms, training starts with the bitext files and creates the phrase table and reordering table that are components of a translation model.
A translation memory (TM) is parallel data that was collected for the purpose of aiding future translations.
A “translation model” consists of one or more phrase tables, zero or more reordering tables, one or more language models and one Moses configuration file that were created during the training and tuning processes.
Tuning is a process that finds the optimized configuration file settings for a translation model when used for a specific purpose. The tuning process translates thousands of source language phrases in the tuning set with a translation model, compares the model’s output to a set of reference human translations, and adjusts the settings with the intention of improving the translation quality. The tuning process repeats these steps through numerous iterations until it reaches an optimized translation quality.
A word aligner is a program that creates word alignment files during the word alignment process. Moses currently supports these word aligners: GIZA++, MGIZA++, and BerkeleyAligner.
A word is the smallest unit of meaning in a language that will stand on its own. In SMT, a word is a token created in the tokenization process that is not a punctuation mark or symbol.
SMT Glossary v 1.0
(Excerpts from the “DoMY Glossary” in Do Moses Yourself Community Edition)
Copyright © 2011 Precision Translation Tools Co., Ltd.
SMT Glossary by Precision Translation Tools Co., Ltd. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.