Slate Toolkit™

Build your own machine translation programs. These open source phrase-based SMT utilities are the machine translation kernel in Slate Desktop™ and Slate Desktop Pro™. For a complete and graphical end-user experience, see those user-friendly applications.

Slate Toolkit™ is a collection of phrase-based SMT utilities. Many of these utilities are redistributed from the Moses Toolkit. Moses utilities and features that are not used in our commercial products are missing or untested. Other utilities are not part of Moses at all. All utilities are distributed under their respective open source licenses without warranty. See the README tab for more details.

Slate Toolkit™ is not a user-friendly graphical application. If that is what you need, check out Slate Desktop™ and Slate Desktop Pro™.

Languages – Support for languages in any combination. Your bilingual corpus, possibly converted from your translation memories, creates translation models between all possible language pairs.

Sample Translation Memories – Sample translation memories and other files to help you practice and learn.

Software Development Tools – A skilled software engineer can use Slate Toolkit™ to develop machine translation programs with features like these:

  • Organize Translation Memories
  • Build Customized Engines
  • Evaluate Engines
  • Deploy Engines
  • Pre-Translate Files
  • Plugins Connect to CAT Tools
  • Forced Terminology
  • Terminology On-The-Fly
  • Weighted Updates
  • Backup & Restore

Hardware System Requirements

  • Intel Core i5 (i7 recommended) or AMD Athlon 64 CPU (4-core x86-64, 2.4 GHz; more and faster cores are better)
  • 8 GB of RAM (16 GB is better; 4 GB is possible as a toy)
  • 2 GB of free hard drive space for base application
  • 20 GB minimum free space during install. 100 GB (and more) on a high-performance drive is strongly recommended.

Operating System Requirements

Windows
Windows 7 64-bit with SP1
Windows 8 or 8.1 64-bit
Windows 10 64-bit

Linux
Ubuntu 16.04 or newer, 64-bit
CentOS/RHEL, kernel 3.2+, 64-bit
Other Linux on request

MacOS
To be determined, currently unsupported.

Training Corpus – Use translation memories that you convert to a corpus, or a publicly available corpus.

Personalized engines

  • 70,000 to 150,000 sentence segments
  • One full-time translator’s work for 3 to 4 years

Customized engines

  • 200,000 to 500,000 sentence segments
  • Support a team of translators
  • No upper limit on the number of segments
  • Too many segments risk degrading the engine
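
As a rough illustration of the ranges above, the following sketch reports where a segment count falls. The function name and the wording of the labels are hypothetical, and the document does not say how to treat counts between 150,000 and 200,000, so the sketch simply reports them as falling between the two ranges.

```python
# Segment ranges copied from the lists above.
PERSONALIZED = (70_000, 150_000)
CUSTOMIZED = (200_000, 500_000)


def describe_corpus_size(segments: int) -> str:
    """Report which documented range a segment count falls into (hypothetical helper)."""
    if segments < PERSONALIZED[0]:
        return "below the personalized range"
    if segments <= PERSONALIZED[1]:
        return "personalized range"
    if segments < CUSTOMIZED[0]:
        # The source leaves this gap undefined; this label is an assumption.
        return "between the personalized and customized ranges"
    if segments <= CUSTOMIZED[1]:
        return "customized range"
    return "above the customized range"
```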

Specialization

Specializations such as financial & regulatory reports, clinical trials & pharmaceuticals, technical manuals, legal contracts, etc. create consistent and accurate translations. These translation memories yield better custom machine translation.

Supported File Types

  • Text files with UTF-8 character encoding, Linux or Windows new line separators
  • Tab-delimited files are specialized text files (as above) with exactly one tab per line. Text left of the tab is the source language. Text right of the tab is the target language.
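
The tab-delimited layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the toolkit: the function name is hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
from typing import Iterable, List, Tuple


def split_tab_delimited(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split 'source<TAB>target' lines into two parallel lists (hypothetical helper)."""
    sources: List[str] = []
    targets: List[str] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # accept Linux or Windows line endings
        if not line:
            continue  # assumption: blank lines are ignored
        if line.count("\t") != 1:
            raise ValueError(f"line {number}: expected exactly one tab")
        source, target = line.split("\t")
        sources.append(source)
        targets.append(target)
    return sources, targets
```

To read a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object to the function. The `newline=""` argument keeps the original line endings so that the function strips them itself.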

Licensing Agreement – We distribute Slate Toolkit™ with many open source software components, from many different projects, written by many authors and contributors, under their respective open source licenses. In turn, these components can have sub-components from different sources that could be licensed differently.

You may redistribute this package and its utilities freely under their respective open source licenses. The price is a packaging and distribution fee.

Platforms – Install and activate on any supported operating system. Today’s support includes MS Windows and Linux. MacOS is planned.

Open Source

Slate Toolkit™ includes these components distributed under their respective open source licenses.

Python Scripting Runtime – Python 64-bit version 3.8.0 or newer is a free open source scripting runtime environment. Required dependency libraries include: pip, pywin32, six, numpy, nltk, lxml, regex, polib, jieba, PyArabic, tinysegmenter3, hazm, sacremoses, wxPython

Perl Scripting Runtime – Perl 64-bit version 5.30 or newer is a free open source scripting runtime environment.

GNU Utilities – The GNU Utilities are essential open source utilities that the Slate Toolkit™ needs to create models. The MS Windows package installs and updates them but we do not maintain them.

MGIZA++ – MGIZA++ is an essential open source utility that Slate Toolkit™ uses to create phrase-based statistical machine translation models.

Moses Tools – These tools are a collection of open source utilities, derived from the original Moses open source project, that create phrase-based SMT models.

Moses Decoder – The decoder is the binary utility that converts source text to target text using a phrase-based statistical machine translation model created by the Moses Tools.

Slate Demo Scripts – Shell scripts (Windows .cmd files and Bash .sh files) and a sample training corpus to demonstrate native Moses and MGIZA++ utilities.

To the best of our knowledge, Slate Toolkit™ is the only full-stack collection of phrase-based SMT utilities for native 64-bit Windows and other OSes. The Slate Toolkit™ is not the Moses Toolkit. Some utilities here are not part of Moses. Many utilities are from the Moses Toolkit. Moses utilities and features that are not used in our commercial products are either missing or untested, and are therefore unsupported. These utilities are distributed under their respective open source licenses without warranty.

Supported SMT Utilities

Our support for the open source utilities maintains cross-platform functionality of the phrase-based “mode” of statistical machine translation. Factored phrase-based, hierarchical and other SMT modes might work, but are not tested or supported.

From MGIZA++:

  • mgiza(.exe)
  • mkcls(.exe)
  • snt2cooc(.exe)
  • merge_alignment.py

From Moses Toolkit:

  • build_binary(.exe)
  • consolidate(.exe)
  • evaluator(.exe)
  • extract(.exe)
  • extractor(.exe)
  • lexical-reordering-score(.exe)
  • lmplz(.exe)
  • mert(.exe)
  • moses(.exe)
  • processLexicalTable(.exe)
  • processPhraseTable(.exe)
  • query(.exe)
  • score(.exe)
  • symal(.exe)
  • extract-parallel.perl
  • filter-model-given-input.pl
  • filter-rule-table.py
  • flexibility_score.py
  • giza2bal.pl
  • LexicalTranslationModel.pm
  • mert-moses.pl
  • moses_sim_pe.py
  • reduce_combine.pl
  • score-parallel.perl
  • train-model.perl

Unsupported Moses Utilities

Slate Toolkit™ does not include all of the utilities in the original Moses and MGIZA++ projects. We do not support utilities that are not listed above. For example, these utilities are neither included nor supported:

  • BerkeleyAligner
  • IRSTLM
  • RandLM
  • SRILM

Some utilities that we package with Slate Toolkit™ may or may not work. We have not tested them because our commercial Slate products do not need them, and therefore we do not support them. These utilities include, but are not limited to, the following:

  • clean-corpus-n.perl
  • detokenizer.perl
  • lowercase.perl
  • snt2cooc.pl
  • tokenizer.perl

The above lists may change. Contact us if you have questions.

Supported SMT Features

Support for phrase dictionary, lexical reordering, language modeling and other advanced features is maintained through the supported utilities.

  • PhraseDictionaryMemory
  • PhraseDictionaryBinary
  • LexicalReordering (memory)
  • LexicalReordering (binary)
  • KenLM (all modes)
  • max-kenlm-order=12
  • with-xmlrpc-c (support for -xml-input)
  • cmph support

Unsupported Moses Features

We do not support features that are not listed above. Those include, but are not limited to, these features:

  • BerkeleyAligner
  • PhraseDictionaryOnDisk
  • PhraseDictionaryCompact
  • LexicalReordering (compact)
  • IRSTLM
  • RandLM
  • SRILM
  • hierarchical models
  • suffix arrays
  • bilingual language models

If you need cross-platform (e.g. on Windows) support for a particular utility or feature from one of the original open source packages that is not listed here, please let us know. We may be able to add it to Slate Toolkit™.

Getting Started

Lengthy command lines from the various open source utilities can be error-prone. Therefore, this toolkit includes Windows .cmd and Bash .sh shell scripts, plus a very small sample training corpus. These do not constitute a production-ready environment. Rather, they demonstrate the open source command lines for essential steps that prepare corpora, train & tune models and translate text.

The demo-all script is the best place to start. Just run it in place.

Outputs from upstream scripts become the inputs to downstream scripts. Therefore, the script names are numbered in the order to follow when you run them individually. The order is also referenced in the demo-all script.
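
The numbering convention can be sketched as follows. The script names in the usage note are hypothetical, since the actual demo script names are not listed here; the function only illustrates sorting names by their leading number so that upstream scripts run before downstream ones.

```python
import re
from typing import List


def in_run_order(script_names: List[str]) -> List[str]:
    """Sort script names by their leading number; names without one are dropped."""
    numbered = []
    for name in script_names:
        match = re.match(r"(\d+)", name)
        if match:
            # Compare as integers so that 10 sorts after 2.
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]
```

For example, given the hypothetical names `10-translate.sh`, `2-train.sh`, `demo-all.sh` and `1-prepare.sh`, the function returns `1-prepare.sh`, `2-train.sh`, `10-translate.sh`, and leaves out `demo-all.sh` because it carries no number.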

File Types

Slate Toolkit™ can use these file types:

  • Text files with UTF-8 character encoding, Linux or Windows new line separators
  • Tab-delimited files are specialized text files (as above) with exactly one tab per line. Text left of the tab is the source language. Text right of the tab is the target language.

Caveats

These open source projects were written for academic use on Unix-like systems. Therefore, there are a few things you can do to protect yourself from problems.

Naming of files and folders

Unix and Windows deal with locations of files and folders in different ways:

  • Windows paths use drive letters and backslashes; Unix paths use slashes to indicate where a file is. Slate Toolkit™ generally supports each system’s native style, but if you run into glitches, please let us know.
  • In Windows, “a.txt” and “A.txt” are the same file; in Unix they are different. Avoid names that can be confused in this way, and make sure you capitalize all names consistently. The software may not always realize that the two are the same name on your system.
  • Unix software often uses whitespace to separate one filename from another. Files or folders are allowed to have whitespace in them, but it often brings out bugs in software. Avoid whitespace in file and folder path names.
  • Handling of non-ASCII characters can differ between individual computers depending on configuration, and likewise often triggers software bugs.
  • Many punctuation marks can have special meanings on different systems, such as colons, quotes and apostrophes, equals signs, dollar signs, percentage signs, asterisks, tildes, and so on. Avoid these marks. Keep it simple! When in doubt, use dashes and/or underscores.

For trouble-free use, we recommend that you use only files and folders with names consisting exclusively of ASCII letters (a-z), digits (0-9), dots, and dashes or underscores (-, _). With your help and patience we hope to improve the user experience over time.
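
The naming recommendation can be checked before training starts. The sketch below is an illustration only, and both function names are hypothetical. It accepts uppercase as well as lowercase ASCII letters, which is an assumption, and separately reports names that differ only in capitalization.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# ASCII letters, digits, dots, dashes and underscores only.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_name(name: str) -> bool:
    """Return True if a single file or folder name follows the recommendation."""
    return SAFE_NAME.fullmatch(name) is not None


def case_collisions(names: Iterable[str]) -> List[List[str]]:
    """Group distinct names that are identical when capitalization is ignored."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(set(names)):
        groups[name.lower()].append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Apply `is_safe_name` to each component of a path rather than to the whole path, because path separators and Windows drive-letter colons are not in the allowed set.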

Line endings

On Windows systems, a line of text ends in a fixed sequence of two characters: carriage return and line feed, also written as “\r\n”. Unix systems use just the line feed, or “\n”.

This may confuse some tools when dealing with files that were not written with your system’s native line endings. Windows Notepad may show all contents in a Unix text file as a single, long line; or instead of returning to the starting column for every new line, some software may just start the next line right below where the last one ended. Rare Unix tools may interpret the carriage returns as “jump back to the beginning of the line and erase all text that was previously displayed.”

Slate Toolkit™ accepts input files with either style of line endings. It generally creates output files with platform-specific line endings, but at times it creates Unix-style files on Windows systems. This may not be perfect and with your feedback we hope to improve it over time.
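
If a downstream tool is confused by mixed line endings, a file can be normalized beforehand. This is a minimal sketch with hypothetical function names, not a toolkit utility. It works on raw bytes, which is safe for UTF-8 text because the carriage return and line feed bytes never occur inside a multi-byte character.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace every Windows CR LF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def to_windows_line_endings(data: bytes) -> bytes:
    """Normalize to LF first, then expand every LF to CR LF."""
    return to_unix_line_endings(data).replace(b"\n", b"\r\n")
```

Normalizing to LF before expanding prevents an existing CR LF pair from becoming CR CR LF.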

Contact Information
Slate Rocks LLC
Web: https://www.slate.rocks/

GNU Utilities – The GNU Utilities are essential open source utilities that the Slate Toolkit™ needs to create models. The MS Windows package installs and updates them but we do not maintain them.

  • sort(.exe)
  • split(.exe)
  • libiconv2(.dll) and libintl3(.dll)
  • gzip(.exe), also copied as gunzip(.exe) and bzcat(.exe)

Perl Scripting Runtime – Perl 64-bit version 5.30 or newer is a free open source scripting runtime environment.

  • The Windows package installs Perl from Strawberry Perl and updates the system %PATH%.
  • The Linux package works “out of the box” on Ubuntu 12.04 or newer systems. Other Linux systems may work. We welcome your feedback about your experiences.

Python Scripting Runtime – Python 64-bit version 3.8.0 or newer is a free open source scripting runtime environment. Required dependency libraries include: pip, pywin32, six, numpy, nltk, lxml, regex, polib, jieba, PyArabic, tinysegmenter3, hazm, sacremoses, wxPython

  • The Windows package installs Python from python.org and updates the system %PATH%.
  • The Linux package works “out of the box” on standard Ubuntu 12.04 or newer systems. Other Linux systems may work. We welcome your feedback about your experiences.