Working With Huge Corpora

The first Slate Desktop support ticket included this comment:

build a test engine based on one large TM (as an easy start)…

That reminded me of a comment from a brilliant computational linguist. Kenneth Heafield created a critically important component inside Moses. I had written to him to ask about configuration, explaining that we expect our customers (translators)

will have a corpus averaging 20 million words…

Ken replied…

… 20 million is tiny and cute

Clearly there was a disconnect, because in the world of most translators, 20 million words is respectable: “very large,” possibly even “huge.”

Computational linguists, data scientists and MT experts live in a different world. They count segments by the billions and words by the trillions! So from Ken’s perspective, 20 million words is “tiny and cute.”

