Training Data – Cleaning and Tokenization

Once data has been converted into the right format, it needs to be tokenized and cleaned before it can be used to train an SMT system. This presentation explains tokenization, including word segmentation for East Asian languages, and outlines the cleaning options for SMT training data that many MT vendors use. It provides guidance on which cleaning steps to apply, and how to apply them, to obtain the best-quality MT system. It also describes how adding linguistic information to the SMT system can benefit some languages.
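As a rough illustration of what a cleaning pass can look like, the sketch below filters a parallel corpus by sentence length and source/target length ratio and normalizes whitespace. It is not taken from the presentation: the function name clean_parallel_corpus and all threshold values are assumptions chosen for the example, and the whitespace split stands in for a real tokenizer.

```python
def clean_parallel_corpus(pairs, min_tokens=1, max_tokens=80, max_ratio=3.0):
    """Keep sentence pairs that pass simple length checks.

    The thresholds are illustrative assumptions, not values from the
    presentation; tune them for your own data.
    """
    kept = []
    for source, target in pairs:
        # Whitespace tokenization only. Languages written without spaces
        # (for example Chinese or Japanese) need a word segmenter first.
        src_tokens = source.split()
        tgt_tokens = target.split()
        src_len, tgt_len = len(src_tokens), len(tgt_tokens)

        # Drop pairs where either side is empty.
        if src_len == 0 or tgt_len == 0:
            continue
        # Drop pairs where either side is too short or too long.
        if not (min_tokens <= src_len <= max_tokens):
            continue
        if not (min_tokens <= tgt_len <= max_tokens):
            continue
        # Drop pairs whose lengths differ too much; these are often misaligned.
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue

        # Re-join tokens so that runs of whitespace collapse to single spaces.
        kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


if __name__ == "__main__":
    sample = [
        ("Hello   world .", "Bonjour le monde ."),
        ("", "vide"),
        ("a", "b c d e f"),
    ]
    for src, tgt in clean_parallel_corpus(sample):
        print(src, "|||", tgt)
```

In this sample only the first pair survives: the second has an empty source side, and the third exceeds the assumed length ratio of 3.0.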

The “best practices” the presenter describes are those current when the video was recorded in 2014. Slate Desktop™ uses updated best practices based on lessons learned in the intervening years.

Published on Jul 7, 2014