Microsoft has this week made its Distributed Machine Learning Toolkit (DMTK) openly available to the developer community.
Researchers at the Microsoft Research Asia lab have released the toolkit on GitHub under an MIT (Massachusetts Institute of Technology) license, to encourage the use of multiple computers in parallel to solve complex problems. Its design builds on a parameter server-based programming framework, which allows big data machine learning tasks to be scaled easily and executed flexibly and efficiently.
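For readers unfamiliar with the parameter server idea, the following is a minimal single-process sketch in Python. It is not DMTK's API: the class `ToyParameterServer`, the functions `worker` and `train`, and the learning-rate and epoch values are all hypothetical names and settings chosen for illustration. Threads stand in for machines, and each worker repeatedly pulls the shared model, computes an update on its own shard of the data, and pushes that update back.

```python
import threading


class ToyParameterServer:
    """In-process stand-in for a parameter server (hypothetical, not DMTK's API)."""

    def __init__(self, dim):
        self._weights = [0.0] * dim
        self._lock = threading.Lock()

    def pull(self):
        # Workers fetch a snapshot of the current shared parameters.
        with self._lock:
            return list(self._weights)

    def push(self, delta):
        # Workers send additive updates, which the server applies atomically.
        with self._lock:
            for i, d in enumerate(delta):
                self._weights[i] += d


def worker(server, shard, lr=0.3, epochs=500):
    """Fit y = w0 + w1 * x by gradient descent on one shard of the data."""
    n = len(shard)
    for _ in range(epochs):
        w = server.pull()
        grad = [0.0, 0.0]
        for x, y in shard:
            err = (w[0] + w[1] * x) - y
            grad[0] += err
            grad[1] += err * x
        # Push only the update, never the full model.
        server.push([-lr * g / n for g in grad])


def train(data, num_workers=2):
    """Split the data into shards and train them concurrently against one server."""
    server = ToyParameterServer(dim=2)
    shards = [data[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull()


if __name__ == "__main__":
    # Noise-free samples of y = 1 + 2x, so the weights should approach [1, 2].
    samples = [(i / 10, 1 + 2 * i / 10) for i in range(10)]
    print(train(samples))
```

In a real deployment the server and workers would run on separate machines and the model would be partitioned across many server nodes; the sketch only shows the pull, compute and push cycle that lets workers train on different data shards against one shared model.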
The toolkit also contains two distributed machine learning algorithms, which can be used to train the world’s fastest and largest topic model, as well as the largest word-embedding model.
The Microsoft offering uses simple APIs to make it easily accessible to researchers and developers, and to help reduce the complexity of machine learning components such as data, models and training. According to the lab team, the toolkit can be used to train a topic model with one million topics and a 20 million-word vocabulary, on a web document collection with 200 billion tokens, using a cluster of only 24 machines – a workload that would previously have required thousands of machines.
Microsoft suggests that the toolkit could also support other complex tasks including computer vision, speech recognition and textual understanding. The researchers said that more tools will be added in new DMTK versions. By open-sourcing the project, they hope that machine learning researchers and developers will help to co-develop the algorithms contained in the kit and expand its potential applications.
The announcement comes just days after Google released its open source machine learning project, TensorFlow. Facebook also released open source deep learning tools for the Torch library at the beginning of this year.
Twitter, Spotify and Netflix have also been rapidly advancing their work in the field; however, open source deep learning contributions from these organisations have been uncommon.