Joblib is a powerful Python package for the management of computation: parallel computing, caching, and primitives for out-of-core computing. It is handy when working on so-called big data that can consume more than the available RAM (several GB nowadays). In such situations, objects in the working space must be persisted to disk, for out-of-core computing, distribution of jobs, or caching.
An efficient strategy for writing code that deals with big data is to rely on numpy arrays to hold large chunks of structured data. The code then handles objects or arbitrary containers (list, dict) of numpy arrays. For data management, joblib provides transparent disk persistence that is very efficient with such objects. The internal mechanism relies on specializing pickle to handle numpy arrays better.
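As a concrete illustration, persisting an arbitrary container of numpy arrays is a single call; a minimal sketch (the path and array shapes are arbitrary):

>>> import numpy as np
>>> import joblib
>>> data = {'images': np.random.random((100, 64)), 'labels': np.arange(100)}
>>> joblib.dump(data, '/tmp/data.pkl')  # arrays inside the dict go through the specialized pickler
['/tmp/data.pkl']
>>> restored = joblib.load('/tmp/data.pkl')
>>> np.allclose(restored['images'], data['images'])
True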
Recent improvements vastly reduce the memory overhead of data persistence.
Limitations of the old implementation

Dumping/loading persisted data with compression was a memory hog, because of internal copies of data, which limited the maximum size of data usable with compressed persistence:

We can see the increased memory usage during the calls to the dump and load functions, profiled using the memory_profiler package with this gist.
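The gist itself is not reproduced here, but the same kind of measurement can be approximated with memory_profiler's memory_usage helper; a minimal sketch, with the object size chosen arbitrarily:

>>> import numpy as np
>>> import joblib
>>> from memory_profiler import memory_usage
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]
>>> # sample process memory (in MiB) every 0.1 s while dump runs
>>> mem = memory_usage((joblib.dump, (obj, '/tmp/test.pkl'), {'compress': True}), interval=0.1)
>>> overhead = max(mem) - min(mem)  # rough peak memory overhead of the call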
Another drawback was that large numpy arrays (>10MB) contained in an arbitrary Python object were dumped in separate .npy files, increasing the load on the file system:
>>> import numpy as np
>>> import joblib
# joblib version: 0.9.4
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]
# 3 files are generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl', '/tmp/test.pkl_01.npy.z', '/tmp/test.pkl_02.npy.z']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1.,  1., ...,  1.,  1.]]),
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]

What’s new: compression, low memory…

Memory usage is now stable:
All numpy arrays are persisted in a single file:

>>> import numpy as np
>>> import joblib
# joblib version: 0.10.0 (dev)
>>> obj = [np.ones((5000, 5000)), np.random.random((5000, 5000))]
# only 1 file is generated:
>>> joblib.dump(obj, '/tmp/test.pkl', compress=True)
['/tmp/test.pkl']
>>> joblib.load('/tmp/test.pkl')
[array([[ 1.,  1., ...,  1.,  1.]]),
 array([[ 0.47006195,  0.5436392 , ...,  0.1218267 ,  0.48592789]])]
Persistence in a file handle (ongoing work in a pull request).
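A sketch of the intended usage, assuming the pull request is merged in its current form (the file handle must be opened in binary mode):

>>> import numpy as np
>>> import joblib
>>> obj = [np.ones((5000, 5000))]
>>> with open('/tmp/test.pkl', 'wb') as f:  # dump into an already-open file object
...     joblib.dump(obj, f)
>>> with open('/tmp/test.pkl', 'rb') as f:  # and load back from one
...     obj2 = joblib.load(f)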
More compression formats are available.
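In the development version, the compress argument also accepts a (compressor name, level) tuple; a short sketch (gzip and bz2 shown here, other formats follow the same pattern):

>>> import numpy as np
>>> import joblib
>>> obj = [np.ones((5000, 5000))]
>>> joblib.dump(obj, '/tmp/test.pkl.gz', compress=('gzip', 3))  # gzip at compression level 3
['/tmp/test.pkl.gz']
>>> joblib.dump(obj, '/tmp/test.pkl.bz2', compress=('bz2', 3))  # bz2 works the same way
['/tmp/test.pkl.bz2']
>>> joblib.load('/tmp/test.pkl.gz')  # the compressor is detected automatically at load time
[array([[ 1.,  1., ...,  1.,  1.]])]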
Backward compatibility

Existing joblib users can be reassured: the new version is still compatible with pickles generated by older versions (>= 0.8.4). You are nevertheless encouraged to rebuild your cache if you want to take advantage of this new version.
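Rebuilding simply amounts to clearing the cache so that results are recomputed and stored in the new format; a minimal sketch, with a hypothetical cache directory:

>>> from joblib import Memory
>>> mem = Memory('/tmp/joblib_cache')  # hypothetical cache location
>>> mem.clear()  # drops the old pickles; they are regenerated on the next call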
Benchmarks: speed and memory consumption

Joblib strives to have minimal dependencies (only numpy) and to be agnostic to the input data. Hence the goal is to deal with any kind of data while being as efficient as possible with numpy arrays.
To illustrate the benefits and costs of the new persistence implementation, let’s now compare a real-life use case (the LFW dataset from scikit-learn) across different libraries:
- Joblib, with 2 different versions: 0.9.4 and master (dev)
- Pickle
- Numpy
The first four lines use non-compressed persistence strategies, the last four use persistence with zlib/gzip strategies. Code to reproduce the benchmarks is available in this gist.
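The gist is not reproduced here, but the overall shape of such a benchmark is simple: time each library’s dump and load on the same object and record the resulting file size. A simplified sketch, not the exact benchmark code (object size and paths are arbitrary):

import os
import pickle
import time

import numpy as np
import joblib

obj = {'data': np.random.random((2000, 2000)), 'target': np.arange(2000)}

def bench(label, dump, load, path):
    # time the dump, then the load, then report the file size
    t0 = time.time()
    dump(obj, path)
    t_dump = time.time() - t0
    t0 = time.time()
    load(path)
    t_load = time.time() - t0
    print('%s: dump %.2fs, load %.2fs, %.1f MB'
          % (label, t_dump, t_load, os.path.getsize(path) / 1e6))

def pickle_dump(o, p):
    with open(p, 'wb') as f:
        pickle.dump(o, f)

def pickle_load(p):
    with open(p, 'rb') as f:
        return pickle.load(f)

bench('joblib', joblib.dump, joblib.load, '/tmp/bench_joblib.pkl')
bench('joblib zlib', lambda o, p: joblib.dump(o, p, compress=3),
      joblib.load, '/tmp/bench_joblib.pkl.z')
bench('pickle', pickle_dump, pickle_load, '/tmp/bench_pickle.pkl')
bench('numpy', lambda o, p: np.save(p, o, allow_pickle=True),
      lambda p: np.load(p, allow_pickle=True), '/tmp/bench_numpy.npy')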
- Speed: the results for joblib 0.9.4 and 0.10.0 (dev) are similar, whereas numpy and pickle are clearly slower than joblib in both the compressed and non-compressed cases.
- Memory consumption: without compression, the old and new joblib versions behave the same; with compression, the new joblib version is much better than the old one. Joblib clearly outperforms pickle and numpy in terms of memory consumption. This can be explained by the fact that numpy relies on pickle when the object is not a pure numpy array (a list or a dict containing arrays, for example), so in that case it inherits the memory drawbacks of pickle (a small illustration follows this list). When persisting pure numpy arrays (not tested here), numpy uses its internal save/load functions, which are efficient in terms of speed and memory consumption.
- Disk used: results are as expected: no
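The pickle fallback mentioned under memory consumption is easy to observe: saving a container rather than an array forces numpy through its pickle path (allow_pickle is spelled out, as recent numpy versions require it at load time):

>>> import numpy as np
>>> d = {'a': np.ones(3), 'b': np.arange(3)}
>>> np.save('/tmp/d.npy', d, allow_pickle=True)  # the dict is wrapped in a 0-d object array and pickled
>>> np.load('/tmp/d.npy', allow_pickle=True).item()  # unwrap the object array back into the dict
{'a': array([ 1.,  1.,  1.]), 'b': array([0, 1, 2])}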