Slate™ Toolkit Edition


Slate™ Toolkit Edition is for programmers (like an SDK) and power users who are comfortable with advanced terminal command lines. The Toolkit is a collection of free, open source software utilities needed to generate and use phrase-based statistical machine translation (SMT) models.


Operating Systems

    • Microsoft Windows, 64-bit
      • Windows 7 64-bit with Service Pack 1
      • Windows 8 or 8.1 64-bit
      • Windows 10 64-bit
    • Linux, x86_64 kernel version 3.2+
      • Ubuntu 16.04 or newer (other Debian-based on request)
      • CentOS/RHEL-based (other RPM-based on request)

Hardware

    • Processor (CPU): x86-64/x64 quad-core processor at 2.5 GHz or faster
      • Intel Core i3 (i7 4th gen or newer strongly recommended)
      • AMD Athlon 64
    • RAM (memory): 4 GB, 8 GB strongly recommended (16+ GB ideal)
    • System Hard Disk (program storage): 2 GB consumed by installation files
    • High-Performance Disk Space (workspace): 40+ GB of free space (250 GB or more strongly recommended)
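
As a rough pre-flight check, the CPU and workspace figures above can be compared against the current machine. The following is a minimal sketch, not part of the toolkit: the function names are hypothetical, `os.cpu_count()` reports logical processors rather than physical cores, and RAM is not checked because the Python standard library has no portable way to read it.

```python
import os
import shutil

# Thresholds taken from the Hardware list above.
MIN_CORES = 4
MIN_WORKSPACE_GB = 40
RECOMMENDED_WORKSPACE_GB = 250


def check_hardware(cores, free_gb):
    """Return a list of warnings for a core count and free workspace size (GB)."""
    warnings = []
    if cores is None or cores < MIN_CORES:
        warnings.append(f"fewer than {MIN_CORES} CPU cores detected ({cores})")
    if free_gb < MIN_WORKSPACE_GB:
        warnings.append(
            f"only {free_gb:.0f} GB free; at least {MIN_WORKSPACE_GB} GB needed"
        )
    elif free_gb < RECOMMENDED_WORKSPACE_GB:
        warnings.append(
            f"{free_gb:.0f} GB free; {RECOMMENDED_WORKSPACE_GB} GB or more recommended"
        )
    return warnings


def check_this_machine(workspace="."):
    """Measure the current machine and pass the values to check_hardware()."""
    # os.cpu_count() counts logical processors, which may exceed physical cores.
    free_gb = shutil.disk_usage(workspace).free / 1024**3
    return check_hardware(os.cpu_count(), free_gb)


if __name__ == "__main__":
    for line in check_this_machine() or ["hardware looks sufficient"]:
        print(line)
```

Pass the folder on the high-performance drive you intend to use as the workspace to `check_this_machine()`, since free space is measured per drive.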

Translation Memories

    • Personalized engine: 70,000 to 150,000 sentence segments.
      • One full-time translator’s production for 3 to 4 years is sufficient to create a personalized engine.
    • Customized engine: 200,000 to 500,000 sentence segments.
      • An engine for a team of translators requires more.
    • There is no upper limit.
      • Too many segments increase the risk of degrading the engine from peak performance.
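
To see where a translation memory falls relative to these ranges, you can count its segments. This is a minimal sketch that assumes the segments have already been exported to plain text with one segment per line. The names `count_segments` and `matching_engines` are hypothetical and are not toolkit utilities.

```python
# Ranges copied from the list above; the keys are only labels for this sketch.
ENGINE_RANGES = {
    "personalized": (70_000, 150_000),
    "customized": (200_000, 500_000),
}


def count_segments(lines):
    """Count non-empty lines, assuming one sentence segment per line."""
    return sum(1 for line in lines if line.strip())


def matching_engines(segment_count):
    """Return the engine types whose suggested range contains segment_count."""
    return [
        name
        for name, (low, high) in ENGINE_RANGES.items()
        if low <= segment_count <= high
    ]
```

For example, `matching_engines(count_segments(open("corpus.en", encoding="utf-8")))` would report the matching range for a hypothetical file `corpus.en`. A count between the two ranges, or outside both, returns an empty list.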

Open Source Licenses

SlateMT uses a collection of software components written by many authors and contributors on many different projects. We distribute these components under their respective open source licenses. They, in turn, can have parts that come from different sources and/or are licensed differently. The source code and licensing information for each component are at these links:

* slatetoolkit (scripts under AGPL v3, plus files from Moses and MGIZA++)
* moses (mostly LGPL v2.1 or later)
* MGIZA++ (mostly GPL v2 or later)
* European Parliament corpus (public domain)
* GNU sort.exe, split.exe, gzip and bzip* (various LGPL/GPL)

Where the license is listed as “…or later,” you have permission to redistribute under that license version or, at your option, later versions of that same license.

Perl Runtime

Perl is a free, open source runtime environment with supporting core components (includes Date::Format).

Python Runtime

Python is a free, open source runtime environment with supporting core components (includes pip, pywin32, six, numpy, nltk, lxml, regex, polib, jieba, PyArabic, tinysegmenter3, wxPython).

GNU Utilities

The GNU Utilities are separate but essential open source utilities that the Moses Toolkit must have to create SMT models.

MGIZA++

MGIZA++ is an essential open source utility that the Moses Toolkit uses to create SMT models.

Moses Decoder

The decoder is the utility that converts source to target text using SMT models created by the toolkit.

Moses Toolkit

The toolkit, derived from the original Moses open source project, is a collection of open source utilities that create SMT models.

Slate Toolkit README

We created Slate™ Toolkit Edition to manage phrase-based SMT models in our other Slate™ products on various operating systems. Utilities that do not support this goal are either missing or untested and not supported.

We distribute these utilities without warranty or support. You may redistribute each under its respective open source license.

Hardware System Requirements

  • Intel Core i3 (i7 recommended) or AMD Athlon 64 CPU (4-core x86-64, 2.4 GHz or faster)
  • 4 GB of RAM (8 GB recommended)
  • 2 GB of available hard-disk space for installation
  • 250 GB (or more) of additional free space on a high-performance drive is required after installation

Windows Operating System Requirements

  • Microsoft Windows XP Professional x64 Edition with Service Pack 3
  • Microsoft Windows 7 64-bit Edition with Service Pack 1
  • Microsoft Windows 8 or 8.1 64-bit Edition
  • Microsoft Windows 10 64-bit Edition
  • Microsoft Windows Server 2003 R2 x64 Edition
  • Microsoft Windows Server 2008 x64 Edition
  • Microsoft Windows Server 2012 x64 Edition

Scripting Runtimes

Slate™ Toolkit Edition for Windows installs scripting runtimes and updates the system %PATH%.

  • Perl 64-bit version 5.26 or newer. We install Strawberry Perl
  • Python 64-bit version 3.6.5. We install the version from Python.org

Additional Utilities

Slate™ Toolkit Edition for Windows includes these 64-bit utilities.

  • sort.exe (GNU coreutils version 7.6)
  • split.exe (GNU coreutils version 5.3.0)
  • libiconv2.dll and libintl3.dll (GNU coreutils version 5.3.0)
  • gzip.exe (version 1.3.12, also copied as gunzip.exe and bzcat.exe)

These are neither maintained nor supported by us, but we will update them when needed.

Linux Operating System Requirements

  • Ubuntu Linux x86_64 kernel version 3.2+
  • GNU standard library and command-line utilities.

Scripting Runtimes

The Linux package works “out of the box” on standard Ubuntu 12.04 or newer systems.

On Red Hat-based systems you may need to install Perl and/or Perl’s Date::Format package:

yum install perl perl-TimeDate

Other Linux systems may work and we welcome your feedback about your experiences.
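
To find out whether a system already has Perl and Date::Format, you can ask Perl to load the module and check the exit status. The sketch below wraps that check in Python. The function names are hypothetical, and only standard Perl switches (`-M` to load a module, `-e` to run a one-line program) are used.

```python
import shutil
import subprocess


def date_format_check_command(perl="perl"):
    """Build a command that exits with status 0 only if Date::Format loads."""
    return [perl, "-MDate::Format", "-e", "1"]


def has_date_format(perl="perl"):
    """Return True if perl is on PATH and can load Date::Format."""
    if shutil.which(perl) is None:
        # Perl itself is missing, so the module cannot be loaded either.
        return False
    result = subprocess.run(
        date_format_check_command(perl),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

If `has_date_format()` returns `False` on a Red Hat-based system, the `yum` command above should install what is missing.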

Features

Supported Utilities

We maintain the following utilities, build the installer, and support their cross-platform functionality for phrase-based SMT. They may work with other SMT modes, such as factored phrase-based or hierarchical, but we do not test these modes. We would like to hear about your experiences.

From MGIZA++:

  • mgiza(.exe)
  • mkcls(.exe)
  • snt2cooc(.exe)
  • merge_alignment.py

From Moses:

  • build_binary(.exe)
  • consolidate(.exe)
  • evaluator(.exe)
  • extract(.exe)
  • extractor(.exe)
  • lexical-reordering-score(.exe)
  • lmplz(.exe)
  • mert(.exe)
  • moses(.exe)
  • processLexicalTable(.exe)
  • processPhraseTable(.exe)
  • query(.exe)
  • score(.exe)
  • symal(.exe)
  • extract-parallel.perl
  • filter-model-given-input.pl
  • filter-rule-table.py
  • flexibility_score.py
  • giza2bal.pl
  • LexicalTranslationModel.pm
  • mert-moses.pl
  • moses_sim_pe.py
  • reduce_combine.pl
  • score-parallel.perl
  • train-model.perl

If you need a particular utility that’s not listed here to run cross-platform (e.g. on Windows), please let us know. We may be able to add it to Slate™ Toolkit Edition.

Supported Features

We support the following features with the utilities listed above. See the “Unsupported Features” section below.

  • PhraseDictionaryMemory
  • PhraseDictionaryBinary
  • LexicalReordering (memory)
  • LexicalReordering (binary)
  • KenLM (all modes)
  • max-kenlm-order=12
  • with-xmlrpc-c (support for -xml-input)
  • cmph support

Unsupported Features

We do not support features that are not listed above. If an unsupported utility is packaged in Slate™ Toolkit Edition, it may or may not work: we have not tested it and we do not support it. These utilities include, for example:

  • clean-corpus-n.perl
  • detokenizer.perl
  • lowercase.perl
  • snt2cooc.pl
  • tokenizer.perl

Slate™ Toolkit Edition does not include all of the utilities in the original Moses and MGIZA++. We do not support utilities that are not listed above. For example, the following components and features are neither included nor supported:

  • BerkeleyAligner
  • PhraseDictionaryOnDisk
  • PhraseDictionaryCompact
  • LexicalReordering (compact)
  • IRSTLM
  • RandLM
  • SRILM
  • hierarchical models
  • suffix arrays
  • bilingual language models

This list may change as we update Slate™ Toolkit Edition. Please contact us if a feature you need is missing.

Getting started

To work with this package, you should be familiar with Moses and MGIZA++.

The command lines for the various utilities can be long and error-prone. So, we have included shell scripts (Windows .cmd files and Bash .sh files) to demonstrate the command lines that take you through the paces on a sample corpus, from training to translation. These are meant for you to read and use as examples.

To run the entire demonstration from training to translation, execute the demo-all script in place.

The outputs to some scripts become the inputs to others. Therefore, when you run the scripts individually, please follow the order as referenced in the demo-all script.
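
If you prefer to launch the demonstration from Python, the sketch below picks the script for the current platform and runs it from its own folder. The file names `demo-all.cmd` and `demo-all.sh` are assumptions based on the script types mentioned above, and `toolkit_dir` is a hypothetical placeholder for the folder where the scripts are located on your system.

```python
import platform
from pathlib import Path


def demo_command(toolkit_dir, system=None):
    """Return (argv, cwd) for running the demo-all script in place."""
    system = system or platform.system()
    folder = Path(toolkit_dir)
    if system == "Windows":
        # Assumed file name: demo-all.cmd
        return ["cmd", "/c", "demo-all.cmd"], folder
    # Assumed file name: demo-all.sh
    return ["bash", "demo-all.sh"], folder
```

A call such as `argv, cwd = demo_command(toolkit_dir)` followed by `subprocess.run(argv, cwd=cwd, check=True)` runs the script with its own folder as the working directory and raises an error if it fails.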

Caveats

This software was originally written and maintained for academic use on Unix-like systems. You will notice this in many places, and there are a few things you can do to protect yourself from problems.

Naming of files and folders

Unix and Windows deal with locations of files and folders in different ways:

  • Windows paths use drive letters and backslashes; Unix paths use slashes to indicate where a file is. Slate™ Toolkit Edition generally supports each system’s native style, but if you run into glitches, please let us know.
  • In Windows, “a.txt” and “A.txt” are the same file; in Unix they are different. Avoid names that can be confused in this way, and make sure you capitalize all names consistently. The software may not always realize that the two are the same name on your system.
  • Unix software often uses whitespace to separate one filename from another. Files or folders are allowed to have whitespace in them, but it often brings out bugs in software. Avoid whitespace in file and folder path names.
  • Handling of non-ASCII characters can differ between individual computers depending on configuration, and likewise often triggers software bugs.
  • Many punctuation marks can have special meanings on different systems, such as colons, quotes and apostrophes, equals signs, dollar signs, percentage signs, asterisks, tildes, and so on. Avoid these marks. Keep it simple! When in doubt, use dashes and/or underscores.

For trouble-free use, we recommend that you use only files and folders with names consisting exclusively of ASCII letters (a-z), digits (0-9), dots, and dashes or underscores (-, _). With your help and patience we hope to improve the user experience over time.
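
The naming recommendation above can be checked before you start a long training run. This is a minimal sketch with a hypothetical function name. It assumes that uppercase ASCII letters are acceptable as long as you capitalize consistently, and it ignores the drive letter and root of a path because they are not file or folder names.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# ASCII letters, digits, dots, dashes and underscores only.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def unsafe_parts(path, windows=False):
    """Return the components of path that contain characters outside the safe set."""
    pure = PureWindowsPath(path) if windows else PurePosixPath(path)
    parts = pure.parts
    # Skip the anchor (drive and/or root such as "C:\\" or "/"); it is not a name.
    if pure.anchor:
        parts = parts[1:]
    return [part for part in parts if not SAFE_NAME.fullmatch(part)]
```

For example, `unsafe_parts("/home/user/my corpus/train.en")` returns `['my corpus']` because of the space, while `unsafe_parts("work/model-1/moses.ini")` returns an empty list.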

Line endings

On Windows systems, a line of text ends in a fixed sequence of two characters: carriage return and line feed, also written as “CRLF” or “\r\n”. Unix systems use just the line feed, written as “LF” or “\n”.

This may confuse some tools when dealing with files that were not written with your system’s native line endings. Windows Notepad may show all contents in a Unix text file as a single, long line; or instead of returning to the starting column for every new line, some software may just start the next line right below where the last one ended. In rare cases, Unix tools may interpret the carriage returns as “jump back to the beginning of the line and erase all text that was previously displayed.”

Slate™ Toolkit Edition accepts input files with either style of line endings. It generally creates output files with platform-specific line endings, but at times it creates Unix-style files on Windows systems. This may not be perfect and with your feedback we hope to improve it over time.
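
If another tool in your workflow is sensitive to line endings, you can inspect and convert a file yourself. The following is a minimal sketch with hypothetical function names. It works on raw bytes, and it assumes an ASCII-compatible encoding such as UTF-8, so it is not suitable for UTF-16 files.

```python
def detect_line_endings(data):
    """Count Windows (CRLF) and Unix (bare LF) line endings in a bytes object."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lf}


def to_unix(data):
    """Convert CRLF line endings to LF, leaving existing LF endings alone."""
    return data.replace(b"\r\n", b"\n")


def to_windows(data):
    """Convert every line ending to CRLF without doubling existing CRs."""
    return to_unix(data).replace(b"\n", b"\r\n")
```

Read and write the file in binary mode (`open(path, "rb")` and `open(path, "wb")`) so that Python does not translate the line endings on its own.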

Contact Information

Slate Rocks LLC
Web: https://www.slate.rocks/
Email: info@slate.rock