
Machine learning's exciting, but the work is complex and difficult. It typically involves a lot of manual heavy lifting -- assembling workflows and pipelines, setting up data sources, and shunting back and forth between on-prem and cloud-deployed resources.
The more tools you have in your belt to make that job easier, the better. Thankfully, Python is a giant tool belt of a language that's widely used in big data and machine learning. Here are five Python libraries that help make that heavy lifting a little less heavy.
PyWren

A simple package with a powerful premise, PyWren lets you run Python-based scientific computing workloads as multiple instances of AWS Lambda functions. A profile of the project at The New Stack describes how PyWren uses AWS Lambda as a giant parallel processing system, tackling projects that can be sliced and diced into small tasks that don't need a lot of memory or storage to run.
One downside is that Lambda functions can't run for more than 300 seconds. But if you have a job that takes only a few minutes to complete and needs to run thousands of times across a dataset, PyWren may be a good way to parallelize that work in the cloud at a scale unavailable on end-user hardware.
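As a sketch of the pattern, a small, self-contained function can be mapped across many inputs with PyWren's executor. The local fallback below is our addition for illustration, not part of PyWren itself; `default_executor` and `map` follow PyWren's documented usage, but treat the details as an assumption:

```python
# Sketch of the PyWren pattern: map a small, self-contained function
# across many inputs. The local fallback is our addition so the sketch
# runs without AWS; it is not part of PyWren itself.
def square(x):
    return x * x

def run_parallel(func, data):
    try:
        import pywren  # needs AWS credentials and a configured PyWren setup
        pwex = pywren.default_executor()
        futures = pwex.map(func, list(data))
        return [f.result() for f in futures]
    except Exception:
        # No PyWren/AWS available here: run serially on this machine.
        return [func(x) for x in data]

results = run_parallel(square, range(5))
```

Either path returns the same results; only where the work happens changes.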
Tfdeploy

Google's TensorFlow framework is taking off big-time now that it's at a full 1.0 release. One common question about it: How can I make use of the models I train in TensorFlow without using TensorFlow itself?
Tfdeploy is a partial answer to that question. It exports a trained TensorFlow model to "a simple NumPy-based callable," meaning the model can be used in Python with the only dependencies being Tfdeploy and the NumPy math-and-stats library. Most of the operations you can perform in TensorFlow can also be performed in Tfdeploy, and you can extend the behaviors of the library by way of standard Python idioms (e.g., overloading a class).
Now the bad news: Tfdeploy doesn't support GPU acceleration, if only because NumPy doesn't do that. Tfdeploy's creator suggests using the gNumPy project as a possible replacement.
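To make the idea of a "NumPy-based callable" concrete, here is a toy model reduced to stored weights plus a pure-NumPy forward pass. The weights are invented for the example, and this shows the shape of what Tfdeploy produces rather than Tfdeploy's actual API:

```python
import numpy as np

# Toy illustration of a model as a NumPy-based callable: a single
# dense layer reduced to stored weights plus a pure-NumPy forward
# pass. The weights below are invented for the example.
W = np.array([[0.5, -0.2],
              [0.1,  0.3]])
b = np.array([0.0, 1.0])

def model(x):
    # Affine transform followed by ReLU -- no TensorFlow required.
    return np.maximum(x @ W + b, 0.0)

out = model(np.array([1.0, 2.0]))  # array([0.7, 1.4])
```

Everything the model needs at inference time is a couple of arrays and a function, which is exactly why the dependency list shrinks to NumPy.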
Luigi

Writing batch jobs is generally only one part of processing heaps of data; you also have to string all those jobs together into something resembling a workflow or a pipeline. Luigi, created by Spotify and named for the other plucky plumber made famous by Nintendo, was built to "address all the plumbing typically associated with long-running batch processes."
With Luigi, a developer can take several different unrelated data processing tasks -- "a Hive query, a Hadoop job in Java, a Spark job in Scala, dumping a table from a database" -- and create a workflow that runs them, end-to-end. Jobs and all of their dependencies are described as Python modules, not as XML config files or some other data format, so they can be integrated into other Python-centric projects.
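The requires-then-run contract those workflows follow can be sketched with a toy stand-in. Real tasks subclass `luigi.Task`, declare `output()` targets, and are run by Luigi's scheduler; everything below is a simplified imitation of that contract, not Luigi's API:

```python
# Toy imitation of Luigi's task contract: each task declares what it
# requires, and a tiny "scheduler" runs dependencies before the task.
# Real Luigi tasks subclass luigi.Task and write results to targets.
class Task:
    def requires(self):
        return []            # upstream tasks, run first
    def run(self, upstream):
        raise NotImplementedError

class Extract(Task):
    def run(self, upstream):
        return [1, 2, 3]     # pretend this dumps a table

class Transform(Task):
    def requires(self):
        return [Extract()]
    def run(self, upstream):
        return [x * 10 for x in upstream[0]]

def build(task):
    # Depth-first: resolve dependencies, then run the task itself.
    return task.run([build(t) for t in task.requires()])

result = build(Transform())  # [10, 20, 30]
```

Because the whole pipeline is ordinary Python classes, dependencies are plain method calls rather than a separate configuration language -- which is the point of Luigi's design.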
Kubelib

If you're adopting Kubernetes as an orchestration system for machine learning jobs, the last thing you want is for the mere act of using Kubernetes to create more problems than it solves. Kubelib provides a set of Pythonic interfaces to Kubernetes, originally as a way to aid Jenkins scripting. But it can be used without Jenkins as well, and it can do everything exposed through the kubectl CLI or the Kubernetes API.
PyTorch

Let's not forget this recent, high-profile addition to the Python world, an implementation of the Torch machine learning framework. PyTorch doesn't just port Torch to Python; it adds many other conveniences, such as GPU acceleration and a library that allows multiprocessing to be done with shared memory (for partitioning jobs across multiple cores). Best of all, it can provide GPU-powered replacements for some of the unaccelerated functions in NumPy.
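As a sketch of that last point, the same matrix multiply can run through NumPy or, when PyTorch is installed, through PyTorch, which pushes the work onto a GPU when one is present. The availability checks are ours; `torch.from_numpy`, `torch.mm`, and `.cuda()` are standard PyTorch calls:

```python
import numpy as np

# The same matrix multiply via NumPy and, when installed, via
# PyTorch, which can move the work onto a GPU with the same code path.
a = np.random.rand(64, 64).astype("float32")
b = np.random.rand(64, 64).astype("float32")
cpu_result = a @ b

try:
    import torch
    ta, tb = torch.from_numpy(a), torch.from_numpy(b)
    if torch.cuda.is_available():
        ta, tb = ta.cuda(), tb.cuda()  # same ops, now GPU-backed
    torch_result = torch.mm(ta, tb).cpu().numpy()
except ImportError:
    torch_result = None  # PyTorch not installed; the NumPy path still works
```

On a machine with a CUDA GPU, the only change needed to accelerate the computation is moving the tensors, which is what makes PyTorch attractive as a drop-in upgrade for NumPy-heavy code.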