Training Data – Cleaning and Tokenization

Once data has been converted into the right format, it must be tokenized and cleaned before it can be used to train an SMT system. This presentation explains tokenization, as well as word segmentation for East Asian languages, and outlines the cleaning options for SMT training data that many MT vendors use. It gives guidance on which cleaning steps to apply, and how to apply them, to obtain the best-quality MT system. It also describes how adding linguistic information to the SMT system can benefit some languages.
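As a rough illustration of what cleaning a parallel corpus can involve, the sketch below applies three length-based filters that are common in SMT pipelines: dropping empty segments, dropping overly long segments, and dropping pairs whose lengths are badly mismatched. It is not taken from the presentation. The function name `clean_parallel` and the thresholds (`max_len=80`, `max_ratio=9.0`) are assumptions chosen for the example, and the input is assumed to be already tokenized (or, for East Asian languages, already word-segmented) so that splitting on whitespace yields tokens.

```python
def clean_parallel(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Yield (source, target) pairs that pass simple length-based filters.

    All thresholds are illustrative defaults, not recommended values.
    """
    for src, tgt in pairs:
        # Assumes text is already tokenized; split() also collapses
        # repeated spaces and strips leading/trailing whitespace.
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        n_src, n_tgt = len(src_tokens), len(tgt_tokens)

        # Drop pairs where either side is empty or too short.
        if n_src < min_len or n_tgt < min_len:
            continue
        # Drop pairs where either side is too long.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much, a frequent sign
        # of misalignment between source and target.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue

        yield " ".join(src_tokens), " ".join(tgt_tokens)


if __name__ == "__main__":
    sample = [
        ("Hello  world .", "Bonjour le monde ."),  # kept, spaces normalized
        ("", "Vide"),                              # dropped: empty source
        ("a", "b " * 20),                          # dropped: length ratio
    ]
    for src, tgt in clean_parallel(sample):
        print(src, "|||", tgt)
```

Filtering on token counts rather than characters keeps the check comparable across languages once both sides are segmented; the right thresholds depend on the language pair and the corpus.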

The presenter describes the “best practices” of 2014, when the video was recorded. Slate Desktop™ uses newer best practices based on lessons learned since then.

Published on Jul 7, 2014