Slate Corpus™

Convert translation memories to machine translation training corpora. Organize and curate corpora. Prepare to build machine translation engines. Slate Corpus™ is available as a stand-alone application and as part of the Slate Desktop™ and Slate Desktop Pro™ suites.

Slate Corpus™ is the corpus preparation application in the Slate Desktop™ suite, packaged as a stand-alone application.

  • Convert translation memory segments into MT training corpora.
  • Organize corpus segments in an inventory by client, subject and project type.
  • Clean and prepare MT training corpora.
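
As a rough illustration of the first step above, the sketch below pulls source/target segment pairs out of a TMX document using only the Python standard library. It is not Slate Corpus™ code; the function name `tmx_to_pairs` and the sample data are hypothetical, and real translation memories may need extra handling for inline markup.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 marks the language with xml:lang; older files may use a plain "lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() concatenates all text inside <seg>,
                # including text nested in inline elements.
                by_lang[lang] = "".join(seg.itertext()).strip()
        src = by_lang.get(source_lang.lower())
        tgt = by_lang.get(target_lang.lower())
        if src and tgt:  # skip units missing either language
            pairs.append((src, tgt))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "fr")
```

Each pair can then be written out as one line of a parallel corpus, which is the general shape of data an MT engine trains on.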

Languages – Supports all current languages and any new languages added with maintenance updates. Your translation memories and Slate Desktop™ create engines that translate between any two of the 48 supported languages, in either direction. That’s 48 × 47 = 2,256 language pair combinations.
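
The 2,256 figure follows from counting ordered (source, target) pairs among 48 languages. The short check below uses placeholder language codes, which are an assumption made only for illustration.

```python
from itertools import permutations


def directed_pair_count(n_languages):
    """Number of ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


# Placeholder codes; the real language list is not reproduced here.
languages = [f"lang{i}" for i in range(48)]
enumerated = len(list(permutations(languages, 2)))

print(directed_pair_count(48))  # 2256
```

Enumerating the pairs explicitly gives the same count as the closed-form product.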

Organize Translation Memories – Tools that organize an inventory of translation memories by client, subject matter and project type. Additional tools prepare translation memories as training corpora for building translation engines.

Sample Translation Memories – Sample translation memories and other files to help you practice and learn.

Scripting Automation – Tools that automate repetitive and complicated Slate Desktop™ tasks so you can process large projects efficiently. They run from a command line terminal or integrate into your third-party applications.

Privacy & Confidentiality – Applications run on your PC without an Internet connection. Unlike online subscription services, they don’t log your activities. You’re fully in control of confidential work.

Hardware System Requirements

  • Intel Core i5 (i7 recommended) or AMD Athlon 64 CPU (4-core x86-64, 2.4 GHz; more cores and higher clock speeds are better)
  • 8 GB of RAM (16 GB is better; 4 GB is possible only for toy projects)
  • 2 GB of free hard drive space for base application
  • 20 GB minimum free space during install. 100 GB (and more) on a high-performance drive is strongly recommended.

Operating System Requirements

Windows 7 64-bit with SP1
Windows 8 or 8.1 64-bit
Windows 10 64-bit

Ubuntu 16.04 or newer, 64-bit
CentOS/RHEL, kernel 3.2+, 64-bit
Other Linux on request

macOS: to be determined, currently unsupported.

Training Corpus – Your own translation memories to convert to corpora, or publicly available corpora.

Personalized engines

  • 70,000 to 150,000 sentence segments
  • One full-time translator’s work for 3 to 4 years

Customized engines

  • 200,000 to 500,000 sentence segments
  • Supports a team of translators


Using translation memories with only specialized segments yields better custom machine translation. Specializations such as financial & regulatory reports, clinical trials & pharmaceuticals, technical manuals, legal contracts, etc. yield more consistent and accurate translations.

There’s no upper limit on the number of segments, but too many segments may degrade the engine for specific, specialized use.

Supported File Types

  • Text files with UTF-8 character encoding, Linux or Windows new line separators
  • Tab-delimited files are specialized text files (as above) with exactly one tab per line. Text left of the tab is the source language; text right of the tab is the target language.
  • TMX – Translation Memory eXchange, up to version 1.4b
  • XLIFF – XML Localization Interchange File Format version 1.2 (.xlf, .xliff, .sdlxliff, .mxliff, .mqxliff)
  • Gettext .po and .mo files

Other file types (.docx, .xlsx, etc.) are supported through your computer-assisted translation (CAT) tool.
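
To make the tab-delimited layout concrete, here is a minimal reader for that format: one tab per line, source text on the left, target text on the right, with Linux or Windows line endings. The function name `parse_tab_delimited` is hypothetical and the strictness (rejecting lines without exactly one tab) is an assumption, not documented Slate Corpus™ behavior.

```python
def parse_tab_delimited(text):
    """Split tab-delimited corpus text into (source, target) pairs.

    Hypothetical helper: raises ValueError when a non-blank line
    does not contain exactly one tab.
    """
    pairs = []
    # Normalize Windows (\r\n) line endings to Linux (\n) before splitting.
    lines = text.replace("\r\n", "\n").split("\n")
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, including the one after a final newline
        if line.count("\t") != 1:
            raise ValueError(f"line {line_no}: expected exactly one tab")
        source, target = line.split("\t")
        pairs.append((source.strip(), target.strip()))
    return pairs


sample = "Hello\tBonjour\r\nGood night\tBonne nuit\n"
parsed = parse_tab_delimited(sample)
```

When reading from disk, opening the file with `encoding="utf-8"` matches the encoding requirement listed above.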

License Agreement – A one-time payment, royalty-free end-user license agreement (EULA) to use the software on your machine in perpetuity without subscriptions or usage fees.

Platforms – Install and activate on any supported operating system. Today’s support includes MS Windows and Linux. macOS is planned.

Activation – Install and activate on one machine. Build engines and work on the same computer.

Maintenance Updates – Maintenance updates are published occasionally with new languages, enhanced features and bug fixes.

Technical Support – Access to priority technical support during the period between major version updates via our online support portal.

Open Source – Slate Corpus™ distributes open source components under their respective licenses.

Language Tokenizers – Language tokenizer utilities that insert spaces between words, punctuation and symbols.
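
For a sense of what such a tokenizer does, the sketch below separates words from punctuation and symbols with a single regular expression. It is a deliberately naive, language-agnostic illustration, not one of the bundled tokenizers; real tokenizers apply language-specific rules for cases such as contractions, abbreviations and decimal numbers.

```python
import re

# \w+ matches a run of letters, digits or underscores;
# [^\w\s] matches a single punctuation mark or symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Return the sentence with single spaces between words, punctuation and symbols."""
    return " ".join(TOKEN_PATTERN.findall(sentence))


print(tokenize("Hello, world! It costs $5."))  # Hello , world ! It costs $ 5 .
```

Note that this simple pattern also splits "don't" into three tokens, which is one reason production tokenizers are tailored per language.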