Saturday, November 4, 2017
Microsoft's Distributed Machine Learning Toolkit (DMTK) is used extensively within Microsoft to advance artificial intelligence (AI).
According to the project's website:
Distributed Machine Learning Toolkit
Distributed machine learning has become more important than ever in this big-data era. Especially in recent years, practice has demonstrated the trend that more training data and bigger models tend to yield better accuracy in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models from huge amounts of data, because the task usually requires a large amount of computational resources. In order to tackle this challenge, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient, and flexible. We will continue to add new algorithms to DMTK on a regular basis.
The current version of DMTK includes the following components (more components will be added in future versions):
• DMTK Framework: a flexible framework that supports a unified interface for data parallelization, a hybrid data structure for big-model storage, model scheduling for big-model training, and automatic pipelining for high training efficiency.
• LightLDA: an extremely fast and scalable topic-model algorithm, with an O(1) Gibbs sampler and an efficient distributed implementation.
• Distributed (Multisense) Word Embedding: a distributed version of the (multi-sense) word embedding algorithm.
• LightGBM: a very high-performance gradient boosting tree framework (supporting GBDT, GBRT, GBM, and MART), and its distributed implementation (a minimal usage sketch follows this list).
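To make the list above concrete, here is a minimal single-machine LightGBM training run using its Python package. The synthetic data and parameter values are illustrative only; distributed training uses the same training API with additional machine-list and network configuration.

```python
import numpy as np
import lightgbm as lgb

# Synthetic binary-classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",   # GBDT on binary log-loss
    "metric": "auc",
    "num_leaves": 31,
    "learning_rate": 0.1,
}

# Train 100 boosting rounds, then score a few rows.
booster = lgb.train(params, train_set, num_boost_round=100)
print(booster.predict(X[:5]))  # predicted probabilities
```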
Machine learning researchers and practitioners can also build their own distributed machine learning algorithms on top of our framework with small modifications to their existing single-machine algorithms.
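The porting pattern is typically small: replace direct updates to a local weight vector with a pull of the shared model and a push of the computed delta. The sketch below illustrates this pattern with a hypothetical ParameterServerClient class; it is not DMTK's actual client API, whose names and signatures differ.

```python
import numpy as np

class ParameterServerClient:
    """Hypothetical stand-in for a DMTK-style parameter server client."""
    def __init__(self, dim):
        self._model = np.zeros(dim)  # in reality this lives on the server

    def get(self):
        return self._model.copy()

    def add(self, delta):
        self._model += delta  # server aggregates deltas from all workers

def train_worker(client, X_shard, y_shard, lr=0.01, epochs=5):
    # A single-machine SGD loop for linear regression, modified
    # only at the two marked lines to talk to the parameter server.
    for _ in range(epochs):
        w = client.get()                              # was: use local w
        grad = X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)
        client.add(-lr * grad)                        # was: w -= lr * grad

client = ParameterServerClient(dim=5)
# Each worker would call train_worker(client, its_X_shard, its_y_shard) in parallel.
```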
We believe that in order to push the frontier of distributed machine learning, we need the collective effort from the entire community, and need the organic combination of both machine learning innovations and system innovations. This belief strongly motivates us to open source the DMTK project.
The DMTK Framework is front-and-centre, since that's where new algorithms and extensions will be built. It's a two-piece critter, consisting of a parameter server and a client SDK.
The parameter server has “separate data structures for high- and low-frequency parameters”, Microsoft says, so as to balance memory capacity and access speed. It aggregates updates from local workers and supports different model-synchronization mechanisms, including Bulk Synchronous Parallel (BSP), Asynchronous Parallel (ASP), and Stale Synchronous Parallel (SSP), “in a unified manner”.
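For orientation: BSP makes every worker wait for all others at each iteration, ASP lets workers run fully asynchronously, and SSP sits in between by bounding how far the fastest worker may run ahead of the slowest. Below is a minimal sketch of that SSP gate, assuming a hypothetical per-worker clock table; BSP corresponds to a staleness bound of 0, and ASP to an unbounded one.

```python
# Hypothetical sketch of a Stale Synchronous Parallel (SSP) gate:
# a worker may proceed only if it is within `staleness` iterations
# of the slowest worker.

def ssp_can_proceed(worker_clocks, worker_id, staleness):
    my_clock = worker_clocks[worker_id]
    slowest = min(worker_clocks.values())
    return my_clock - slowest <= staleness

clocks = {"w0": 12, "w1": 10, "w2": 14}
print(ssp_can_proceed(clocks, "w2", staleness=3))  # False: w2 is 4 ahead of w1
print(ssp_can_proceed(clocks, "w0", staleness=3))  # True: w0 is only 2 ahead
```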
The client SDK provides:
- A local model cache – designed to reduce communication workloads by synching with the parameter server only when needed;
- A pipeline between local training and model communication; and
- Round-robin scheduling of big model training, which the project's site explains “allows each worker machine to pull the sub-models as needed from the parameter server, resulting in a frugal use of limited memory capacity and network bandwidth to support very big models.” A small sketch of this scheduling idea follows.
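As a rough illustration of the round-robin idea (not DMTK's actual scheduler), the sketch below splits the model into slices and rotates each worker through them, so no worker ever needs to hold the whole model in memory at once.

```python
# Hypothetical round-robin sub-model schedule: the full model is
# split into S slices; in each round, worker i trains on slice
# (i + round) % S, pulling only that slice from the parameter server.

def submodel_schedule(num_workers, num_slices, num_rounds):
    for r in range(num_rounds):
        for w in range(num_workers):
            yield r, w, (w + r) % num_slices

for rnd, worker, slice_id in submodel_schedule(num_workers=3,
                                               num_slices=3,
                                               num_rounds=2):
    print(f"round {rnd}: worker {worker} pulls sub-model {slice_id}")
```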
For more about DMTK, see www.dmtk.io
Labels: Artificial intelligence, Big Data, Technology