Joblib is a powerful Python package for the management of computation: parallel computing, caching, and primitives for out-of-core computing. It is handy when working on so-called big data that can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be persisted to disk, for out-of-core computing, distribution of jobs, or caching.
An efficient strategy for writing code that deals with big data is to rely on numpy arrays to hold large chunks of structured data. The code then handles objects or arbitrary containers (list, dict) of numpy arrays. For data management, joblib provides transparent disk persistence that is very efficient with such objects. The internal mechanism relies on specializing pickle to handle numpy arrays better.
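As a concrete illustration, persisting an arbitrary container of numpy arrays is a single call; a minimal sketch (the path and array shapes are arbitrary):

>>> import numpy as np
>>> import joblib
>>> data = {'images': np.random.random((100, 64)), 'labels': np.arange(100)}
>>> joblib.dump(data, '/tmp/data.pkl')  # arrays inside the dict go through the specialized pickler
['/tmp/data.pkl']
>>> restored = joblib.load('/tmp/data.pkl')
>>> np.allclose(restored['images'], data['images'])
True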
Recent improvements vastly reduce the memory overhead of data persistence.
Limitations of the old implementation

Dumping/loading persisted data with compression was a memory hog, because of internal copies of data, which limited the maximum size of data usable with compressed persistence:

We can see the increased memory usage during the calls to the dump and load functions, profiled using the memory_profiler package with this gist.
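The gist itself is not reproduced here, but the same kind of measurement can be approximated with memory_profiler's memory_usage helper; a minimal sketch, with the object size chosen arbitrarily:

>>> import numpy as np
>>> import joblib
>>> from memory_profiler import memory_usage
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]
>>> # sample process memory (in MiB) every 0.1 s while dump runs
>>> mem = memory_usage((joblib.dump, (obj, '/tmp/test.pkl'), {'compress': True}), interval=0.1)
>>> overhead = max(mem) - min(mem)  # rough peak memory overhead of the call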
Another drawback was that large numpy arrays (>10MB) contained in an arbitrary Python object were dumped in separate .npy files, increasing the load on the file system:
>>> import numpy as np
>>> import joblib
# joblib version: 0.9.4
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]
# 3 files are generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl', '/tmp/test.pkl_01.npy.z', '/tmp/test.pkl_02.npy.z']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1.,  1., ...,  1.,  1.]]),
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

What’s new: compression, low memory…

Memory usage is now stable:
All numpy arrays are persisted in a single file:

>>> import numpy as np
>>> import joblib
# joblib version: 0.10.0 (dev)
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]
# only 1 file is generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1.,  1., ...,  1.,  1.]]),
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]
Persistence in a file handle (ongoing work in a pull request).
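A sketch of the intended usage, assuming the pull request is merged in its current form (the file handle must be opened in binary mode):

>>> import numpy as np
>>> import joblib
>>> obj = [np.ones((5000, 5000))]
>>> with open('/tmp/test.pkl', 'wb') as f:  # dump into an already-open file object
...     joblib.dump(obj, f)
>>> with open('/tmp/test.pkl', 'rb') as f:  # and load back from one
...     obj2 = joblib.load(f)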
More compression formats are available.
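In the development version, the compress argument also accepts a (compressor name, level) tuple; a short sketch (gzip and bz2 shown here, other formats follow the same pattern):

>>> import numpy as np
>>> import joblib
>>> obj = [np.ones((5000, 5000))]
>>> joblib.dump(obj, '/tmp/test.pkl.gz', compress=('gzip', 3))  # gzip at compression level 3
['/tmp/test.pkl.gz']
>>> joblib.dump(obj, '/tmp/test.pkl.bz2', compress=('bz2', 3))  # bz2 works the same way
['/tmp/test.pkl.bz2']
>>> joblib.load('/tmp/test.pkl.gz')  # the compressor is detected automatically at load time
[array([[ 1.,  1., ...,  1.,  1.]])]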
Backward compatibility

Existing joblib users can be reassured: the new version is still compatible with pickles generated by older versions (>= 0.8.4). You are nevertheless encouraged to rebuild your cache if you want to take advantage of this new version.
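Rebuilding simply amounts to clearing the cache so that results are recomputed and stored in the new format; a minimal sketch, with a hypothetical cache directory:

>>> from joblib import Memory
>>> mem = Memory('/tmp/joblib_cache')  # hypothetical cache location
>>> mem.clear()  # drops the old pickles; they are regenerated on the next call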
Benchmarks: speed and memory consumption

Joblib strives to have minimal dependencies (only numpy) and to be agnostic to the input data. Hence the goal is to deal with any kind of data while being as efficient as possible with numpy arrays.
To illustrate the benefits and costs of the new persistence implementation, let’s now compare a real-life use case (the LFW dataset from scikit-learn) across different libraries:
- Joblib, with 2 different versions: 0.9.4 and master (dev)
- Pickle
- Numpy
The first four lines use non-compressed persistence strategies, the last four use persistence with zlib/gzip strategies. Code to reproduce the benchmarks is available in this gist.
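The gist is not reproduced here, but the overall shape of such a benchmark is simple: time each library’s dump and load on the same object and record the resulting file size. A simplified sketch, not the exact benchmark code (object size and paths are arbitrary):

import os
import pickle
import time

import numpy as np
import joblib

obj = {'data': np.random.random((2000, 2000)), 'target': np.arange(2000)}

def bench(label, dump, load, path):
    # time the dump, then the load, then report the file size
    t0 = time.time()
    dump(obj, path)
    t_dump = time.time() - t0
    t0 = time.time()
    load(path)
    t_load = time.time() - t0
    print('%s: dump %.2fs, load %.2fs, %.1f MB'
          % (label, t_dump, t_load, os.path.getsize(path) / 1e6))

def pickle_dump(o, p):
    with open(p, 'wb') as f:
        pickle.dump(o, f)

def pickle_load(p):
    with open(p, 'rb') as f:
        return pickle.load(f)

bench('joblib', joblib.dump, joblib.load, '/tmp/bench_joblib.pkl')
bench('joblib zlib', lambda o, p: joblib.dump(o, p, compress=3),
      joblib.load, '/tmp/bench_joblib.pkl.z')
bench('pickle', pickle_dump, pickle_load, '/tmp/bench_pickle.pkl')
bench('numpy', lambda o, p: np.save(p, o, allow_pickle=True),
      lambda p: np.load(p, allow_pickle=True), '/tmp/bench_numpy.npy')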
- Speed: the results for joblib 0.9.4 and 0.10.0 (dev) are similar, whereas numpy and pickle are clearly slower than joblib in both the compressed and non-compressed cases.
- Memory consumption: without compression, the old and new joblib versions behave the same; with compression, the new joblib version is much better than the old one. Joblib clearly outperforms pickle and numpy in terms of memory consumption. This can be explained by the fact that numpy relies on pickle when the object is not a pure numpy array (a list or a dict containing arrays, for example), so in that case it inherits the memory drawbacks of pickle (a small illustration follows this list). When persisting pure numpy arrays (not tested here), numpy uses its internal save/load functions, which are efficient in terms of speed and memory consumption.
- Disk used: results are as expected: no
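The pickle fallback mentioned under memory consumption is easy to observe: saving a container rather than an array forces numpy through its pickle path (allow_pickle is spelled out, as recent numpy versions require it at load time):

>>> import numpy as np
>>> d = {'a': np.ones(3), 'b': np.arange(3)}
>>> np.save('/tmp/d.npy', d, allow_pickle=True)  # the dict is wrapped in a 0-d object array and pickled
>>> np.load('/tmp/d.npy', allow_pickle=True).item()  # unwrap the object array back into the dict
{'a': array([ 1.,  1.,  1.]), 'b': array([0, 1, 2])}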