Slate Toolkit README


We created Slate™ Toolkit Edition to manage phrase-based SMT models in our other Slate™ products on various operating systems. Utilities that do not support this goal are either missing or untested and not supported.

We distribute these utilities without warranty or support. You may redistribute each under its respective open source license.

Hardware System Requirements

  • Intel Core i3 (i7 recommended) or AMD Athlon 64 CPU (4-core x86-64, 2.4 GHz or faster)
  • 4 GB of RAM (8 GB recommended)
  • 2 GB of available hard-disk space for installation
  • 250 GB (or more) of additional free space on a high-performance drive after installation
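
The free-space requirement above can be checked before you start a long training run. The following is a minimal sketch, not part of the toolkit: the function name has_free_space and the choice of decimal gigabytes (10^9 bytes) are assumptions made for illustration.

```python
import shutil

GB = 10 ** 9  # decimal gigabyte; an assumption, the README does not say GB vs. GiB


def has_free_space(path=".", required_gb=250):
    """Return True if the drive holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB


if __name__ == "__main__":
    # Check the drive that holds the current working folder.
    print("Enough free space for training data:", has_free_space("."))
```

Pass the folder where you plan to keep corpora and models as the first argument so that the check runs against the right drive.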

Windows Operating System Requirements

  • Microsoft Windows XP Professional x64 Edition with Service Pack 2
  • Microsoft Windows 7 64-bit Edition with Service Pack 1
  • Microsoft Windows 8 or 8.1 64-bit Edition
  • Microsoft Windows 10 64-bit Edition
  • Microsoft Windows Server 2003 R2 x64 Edition
  • Microsoft Windows Server 2008 x64 Edition
  • Microsoft Windows Server 2012 x64 Edition

Scripting Runtimes

Slate™ Toolkit Edition for Windows installs scripting runtimes and updates the system %PATH%.

  • Perl 64-bit version 5.26 or newer. We install Strawberry Perl.
  • Python 64-bit version 3.6.5. We install the version from Python.org.
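
If a script cannot find its interpreter, it can help to confirm what the system PATH actually resolves to. The sketch below is illustrative only and is not shipped with the toolkit; the helper names find_runtime and python_is_64bit are hypothetical.

```python
import shutil
import sys


def find_runtime(name):
    """Return the full path that PATH resolves `name` to, or None if it is not found."""
    return shutil.which(name)


def python_is_64bit():
    """Return True when the running Python interpreter is a 64-bit build."""
    return sys.maxsize > 2 ** 32


if __name__ == "__main__":
    for runtime in ("perl", "python"):
        print(runtime, "->", find_runtime(runtime) or "not found on PATH")
    print("64-bit Python:", python_is_64bit())
```

A result of "not found on PATH" after installation suggests that the PATH update has not taken effect in the current console session.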

Additional Utilities

Slate™ Toolkit Edition for Windows includes these 64-bit utilities.

  • sort.exe (GNU coreutils version 7.6)
  • split.exe (GNU coreutils version 5.3.0)
  • libiconv2.dll and libintl3.dll (GNU coreutils version 5.3.0)
  • gzip.exe (version 1.3.12, also copied as gunzip.exe and bzcat.exe)

We neither maintain nor support these utilities, but we will update them when needed.

Linux Operating System Requirements

  • Ubuntu Linux x86_64 kernel version 3.2+
  • GNU standard library and command-line utilities.

Scripting Runtimes

The Linux package works “out of the box” on standard Ubuntu 12.04 or newer systems.

On Red Hat-based systems you may need to install Perl and/or Perl’s Date::Format package:

yum install perl perl-TimeDate

Other Linux systems may work, and we welcome your feedback about your experiences.

Features

Supported Utilities

We maintain the following utilities, build the installer, and support their cross-platform functionality for phrase-based SMT. They may work with other SMT modes, such as factored phrase-based or hierarchical, but we do not test these modes. We would like to hear about your experiences.

From MGIZA++:

  • mgiza(.exe)
  • mkcls(.exe)
  • snt2cooc(.exe)
  • merge_alignment.py

From Moses:

  • build_binary(.exe)
  • consolidate(.exe)
  • evaluator(.exe)
  • extract(.exe)
  • extractor(.exe)
  • lexical-reordering-score(.exe)
  • lmplz(.exe)
  • mert(.exe)
  • moses(.exe)
  • processLexicalTable(.exe)
  • processPhraseTable(.exe)
  • query(.exe)
  • score(.exe)
  • symal(.exe)
  • extract-parallel.perl
  • filter-model-given-input.pl
  • filter-rule-table.py
  • flexibility_score.py
  • giza2bal.pl
  • LexicalTranslationModel.pm
  • mert-moses.pl
  • moses_sim_pe.py
  • reduce_combine.pl
  • score-parallel.perl
  • train-model.perl

If you need a particular utility that’s not listed here to run cross-platform (e.g. on Windows), please let us know. We may be able to add it to Slate™ Toolkit Edition.

Supported Features

We support the following features with the utilities listed above. See the “Unsupported Features” section below.

  • PhraseDictionaryMemory
  • PhraseDictionaryBinary
  • LexicalReordering (memory)
  • LexicalReordering (binary)
  • KenLM (all modes)
  • max-kenlm-order=12
  • with-xmlrpc-c (support for -xml-input)
  • cmph support

Unsupported Features

We do not support features that are not listed above. If an unsupported utility is packaged in Slate™ Toolkit Edition, it may or may not work: we have not tested it and we do not support it. These utilities include, for example:

  • clean-corpus-n.perl
  • detokenizer.perl
  • lowercase.perl
  • snt2cooc.pl
  • tokenizer.perl

Slate™ Toolkit Edition does not include all of the utilities in the original Moses and MGIZA++. We do not support utilities that are not listed above. For example, these utilities and features are neither included nor supported:

  • BerkeleyAligner
  • PhraseDictionaryOnDisk
  • PhraseDictionaryCompact
  • LexicalReordering (compact)
  • IRSTLM
  • RandLM
  • SRILM
  • hierarchical models
  • suffix arrays
  • bilingual language models

This list may change as we update Slate™ Toolkit Edition. Please contact us if a feature you need is missing.

Getting started

To work with this package, you should be familiar with Moses and MGIZA++.

The command lines for the various utilities can be long and error-prone, so we have included shell scripts (Windows .cmd files and Bash .sh files) that demonstrate them on a sample corpus, from training to translation. These are meant for you to read and use as examples.

To run the whole sequence from training to translation, execute the demo-all script in place.

The outputs to some scripts become the inputs to others. Therefore, when you run the scripts individually, please follow the order as referenced in the demo-all script.

Caveats

This software was originally written and maintained for academic use on Unix-like systems. You will notice this in many places, and there are a few things you can do to protect yourself from problems.

Naming of files and folders

Unix and Windows deal with locations of files and folders in different ways:

  • Windows paths use drive letters and backslashes; Unix paths use slashes to indicate where a file is. Slate™ Toolkit Edition generally supports each system’s native style, but if you run into glitches, please let us know.
  • In Windows, “a.txt” and “A.txt” are the same file; in Unix they are different. Avoid names that can be confused in this way, and make sure you capitalize all names consistently. The software may not always realize that the two are the same name on your system.
  • Unix software often uses whitespace to separate one filename from another. Files or folders are allowed to have whitespace in them, but it often brings out bugs in software. Avoid whitespace in file and folder path names.
  • Handling of non-ASCII characters can differ between individual computers depending on configuration, and likewise often triggers software bugs.
  • Many punctuation marks can have special meanings on different systems, such as colons, quotes and apostrophes, equals signs, dollar signs, percentage signs, asterisks, tildes, and so on. Avoid these marks. Keep it simple! When in doubt, use dashes and/or underscores.

For trouble-free use, we recommend that you use only files and folders with names consisting exclusively of ASCII letters (a-z), digits (0-9), dots, and dashes or underscores (-, _). With your help and patience we hope to improve the user experience over time.
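
The naming recommendation above can be checked mechanically before you point the utilities at a corpus folder. The sketch below is an illustration, not a toolkit feature: the function name unsafe_components is hypothetical, and accepting uppercase letters and a leading Windows drive letter (such as C:) are assumptions.

```python
import re

# Letters, digits, dots, dashes and underscores only (uppercase allowed: an assumption).
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")
# A leading Windows drive specifier such as "C:" is tolerated (an assumption).
DRIVE = re.compile(r"[A-Za-z]:")


def unsafe_components(path):
    """Return the path components that contain characters outside the recommended set."""
    # Split on both slash styles so the same check works for Windows and Unix paths.
    parts = [p for p in re.split(r"[\\/]+", path) if p]
    if parts and DRIVE.fullmatch(parts[0]):
        parts = parts[1:]
    return [p for p in parts if not SAFE_NAME.fullmatch(p)]


if __name__ == "__main__":
    for candidate in (r"C:\corpora\en-fr\train.en", "/home/user/my corpus/modèle.fr"):
        print(candidate, "->", unsafe_components(candidate) or "ok")
```

An empty result means every folder and file name in the path follows the recommendation. The check does not detect names that differ only in capitalization.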

Line endings

On Windows systems, a line of text ends in a fixed sequence of two characters: carriage return and line feed, also written as “CR LF”. Unix systems use just the line feed, or “LF”.

This may confuse some tools when dealing with files that were not written with your system’s native line endings. Windows Notepad may show all contents in a Unix text file as a single, long line; or instead of returning to the starting column for every new line, some software may just start the next line right below where the last one ended. Rare Unix tools may interpret the carriage returns as “jump back to the beginning of the line and erase all text that was previously displayed.”

Slate™ Toolkit Edition accepts input files with either style of line endings. It generally creates output files with platform-specific line endings, but at times it creates Unix-style files on Windows systems. This may not be perfect and with your feedback we hope to improve it over time.
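
If another tool in your workflow is sensitive to line endings, you can inspect and convert files yourself. The following sketch shows one way to do that on raw bytes. It does not describe how Slate™ Toolkit Edition handles line endings internally, and the function names are hypothetical.

```python
def detect_line_endings(data: bytes) -> str:
    """Classify the line endings in `data` as 'crlf', 'lf', 'mixed' or 'none'."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # line feeds not preceded by a carriage return
    if crlf and lf:
        return "mixed"
    if crlf:
        return "crlf"
    if lf:
        return "lf"
    return "none"


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace every Windows-style CR LF pair with a single Unix-style LF."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\r\n"
    print(detect_line_endings(sample))
    print(detect_line_endings(to_unix_line_endings(sample)))
```

Working on bytes rather than decoded text avoids any dependence on the file's character encoding and keeps Python from translating line endings on its own.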

Contact Information

Slate Rocks LLC
Web: https://www.slate.rocks/
Email: info@slate.rock