Training Data – Cleaning and Tokenization

Moses Core - Training Data Cleaning and Tokenization

Published on Jul 7, 2014

Once data has been converted into the right format, it must be tokenized and cleaned before it can be used to train an SMT system. This presentation explains tokenization, as well as word segmentation for East Asian languages, and outlines the cleaning options for SMT training data that many MT vendors use. It gives guidance on which cleaning steps to apply, and how to apply them, to obtain the best-quality MT system. It also describes how to add linguistic information to the SMT system, which is beneficial for some languages.
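
As a rough illustration of the kind of cleaning described here, the following minimal sketch filters a parallel corpus by sentence length and length ratio after tokenization. It is not taken from the presentation: the function names `tokenize` and `clean_pairs`, the whitespace tokenizer, and the thresholds (`min_len`, `max_len`, `max_ratio`) are all assumptions chosen for the example, and a real pipeline would use a language-aware tokenizer or word segmenter instead.

```python
def tokenize(line):
    # Hypothetical stand-in: naive whitespace tokenization.
    # Real pipelines use a language-aware tokenizer or word segmenter.
    return line.split()


def clean_pairs(pairs, min_len=1, max_len=80, max_ratio=3.0):
    """Keep only sentence pairs that pass simple length checks.

    The thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for src, tgt in pairs:
        src_tokens = tokenize(src)
        tgt_tokens = tokenize(tgt)
        n_src, n_tgt = len(src_tokens), len(tgt_tokens)
        # Drop pairs where either side is empty or too short.
        if n_src < min_len or n_tgt < min_len:
            continue
        # Drop pairs where either side is too long.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much; these are often misaligned.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Re-join tokens with single spaces to normalize whitespace.
        kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


pairs = [
    ("Hello world .", "Bonjour le monde ."),
    ("", "Phrase sans source ."),
    ("Yes", "Oui , bien sûr , absolument , sans aucun doute ."),
]
cleaned = clean_pairs(pairs)
```

In this sample, only the first pair survives: the second has an empty source side, and the third exceeds the assumed length ratio. Filtering both sides together keeps the source and target files aligned line by line.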