Working With Huge Corpora

The first Slate Desktop™ support ticket included this comment. 

build a test engine based on one large TM (as an easy start)…

I remembered a comment from a brilliant computational linguist, Kenneth Heafield, who created a critically important component inside Moses. I had written him to ask about configuration. I explained that we expect our customers (translators)

will have a corpus averaging 20 million words…

Ken replied…

… 20 million is tiny and cute

Clearly there was a disconnect. In most translators’ world, 20 million words is respectable or “very large,” possibly even “huge.”

Computational linguists, data scientists and MT experts live in a different world. They count segments by the billions and words by the trillions! So from Ken’s perspective, 20 million words is “tiny and cute.” These researchers aren’t building machine translation systems for one or two people. They research systems for millions of people to share, so they report their work in terms of the huge corpora and massive computer servers of their world.

Slate Desktop™ uses the same technology to serve a different world that researchers barely know. To avoid confusion, we use less subjective, more objective descriptions. For Slate Desktop™, huge data means any parallel training corpus of more than 1.5 million segment pairs: small by researchers’ standards but huge for translators.
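If you keep your translation memories as TMX files, you can check whether your corpus crosses that threshold yourself. Here is a minimal sketch in Python; the helper names and the 1.5 million constant are illustrative, not part of Slate Desktop™, and it assumes a standard TMX file where each `<tu>` element holds one segment pair.

```python
import xml.etree.ElementTree as ET

# Illustrative threshold from the definition above: more than
# 1.5 million segment pairs counts as "huge" for translators.
HUGE_THRESHOLD = 1_500_000

def count_segment_pairs(tmx_path):
    """Stream through a TMX file and count <tu> (translation unit)
    elements without loading the whole document into memory."""
    count = 0
    for _, elem in ET.iterparse(tmx_path, events=("end",)):
        # TMX stores one aligned segment pair per <tu> element.
        if elem.tag == "tu":
            count += 1
        elem.clear()  # free memory as we go
    return count

def is_huge(tmx_path):
    """True if the corpus exceeds the 1.5 million pair threshold."""
    return count_segment_pairs(tmx_path) > HUGE_THRESHOLD
```

Streaming with `iterparse` matters here: a corpus of a few million segment pairs can easily be a multi-gigabyte file, and parsing it into a full DOM tree would exhaust memory on a typical desktop.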

Customers: log in for illustrated instructions.
