In this post, I show how Parquet can encode very large datasets in a small file footprint, and how we can achieve data throughput significantly exceeding disk IO bandwidth by exploiting parallelism (multithreading).
Apache Parquet: Top performer on low-entropy data

As you can read in the Apache Parquet format specification, the format features multiple layers of encoding to achieve small file size, among them:
- Dictionary encoding (similar to how pandas.Categorical represents data, but they aren't equivalent concepts)
- Data page compression (Snappy, Gzip, LZO, or Brotli)
- Run-length encoding (for null indicators and dictionary indices) and integer bit-packing

To give you an idea of how this works, let's consider the dataset:
['banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'banana', 'apple', 'apple', 'apple']

Almost all Parquet implementations dictionary-encode by default, so the first encoding pass becomes:
dictionary: ['banana', 'apple']
indices: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

The dictionary indices are further run-length encoded:
dictionary: ['banana', 'apple']
indices (RLE): [(8, 0), (3, 1)]

Working backwards, you can easily reconstruct the original dense array of strings.
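To make the two passes concrete, here is a minimal pure-Python sketch (an illustration of the idea, not how parquet-cpp actually implements it) that dictionary-encodes the toy dataset and then run-length-encodes the indices:

from itertools import groupby

values = ['banana'] * 8 + ['apple'] * 3

# Dictionary encoding: map each distinct value to a small integer index.
dictionary = list(dict.fromkeys(values))           # ['banana', 'apple']
indices = [dictionary.index(v) for v in values]    # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Run-length encode the indices as (run length, index) pairs.
rle = [(len(list(run)), idx) for idx, run in groupby(indices)]
print(rle)                                         # [(8, 0), (3, 1)]

# Working backwards: expand the runs, then look each index up in the dictionary.
decoded = [dictionary[idx] for length, idx in rle for _ in range(length)]
assert decoded == values

Real Parquet files go further (bit-packing the indices and compressing whole data pages), but the savings come from the same idea: repeated values collapse into a tiny dictionary plus compact runs of indices.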
In my prior blog post, I created a dataset that compresses very well with this style of encoding. When writing with pyarrow, we can turn dictionary encoding (which is on by default) on and off to see how it impacts file size:
import pyarrow.parquet as pq

pq.write_table(dataset, out_path, use_dictionary=True, compression='snappy')

A dataset that occupies 1 gigabyte (1024 MB) in a pandas.DataFrame shrinks to an amazing 1.436 MB with Snappy compression and dictionary encoding, small enough to fit on an old-school floppy disk. Without dictionary encoding, it occupies 44.4 MB.
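As a rough sketch of how you might reproduce the comparison yourself (the table built here is a tiny placeholder, not the 1 GB dataset from the prior post, and the output paths are arbitrary), write the same table twice and compare the resulting file sizes:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table; substitute whatever low-entropy dataset you want to measure.
dataset = pa.table({'strings': ['banana'] * 8 + ['apple'] * 3})

pq.write_table(dataset, 'with_dict.parquet', use_dictionary=True, compression='snappy')
pq.write_table(dataset, 'without_dict.parquet', use_dictionary=False, compression='snappy')

for path in ('with_dict.parquet', 'without_dict.parquet'):
    print(path, os.path.getsize(path), 'bytes')

On a highly repetitive dataset the dictionary-encoded file should come out dramatically smaller, which is exactly the 1.436 MB versus 44.4 MB gap described above.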
Parallel reads in parquet-cpp via PyArrow

In parquet-cpp, the C++ implementation of Apache Parquet, which we've made available to Python in PyArrow, we recently added parallel column reads.
To try this out, install PyArrow from conda-forge:
conda install pyarrow -c conda-forge

Now, when reading a Parquet file, use the nthreads argument:
import pyarrow.parquet as pq

table = pq.read_table(file_path, nthreads=4)

For low-entropy data, decompression and decoding become CPU-bound. Because we are doing all the work in C++, we are not burdened by the concurrency issues of the GIL and can thus achieve a significant speed boost. See the results I achieved reading a 1 GB dataset into a pandas DataFrame on my quad-core laptop (Xeon E3-1505M):
[Figure: benchmark results reading the 1 GB dataset into pandas at different thread counts]
The second set of benchmarks uses the file written with the default dictionary encoding. Even though both files are small (~1.5 MB and ~45 MB), the impact of dictionary decoding on performance is substantial. With 4 threads, read performance into pandas breaks through an amazing 4 GB/s. This is much faster than the Feather format or the other alternatives I've seen.
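If you want to reproduce this kind of measurement on your own hardware, a simple timing harness is enough. The sketch below uses a placeholder path (file_path) for whatever Parquet file you want to benchmark, and note that newer pyarrow releases have replaced the nthreads argument with a use_threads flag:

import time
import pyarrow.parquet as pq

file_path = 'example.parquet'  # placeholder: point this at the file you want to benchmark

def timed_read(path, threads):
    start = time.time()
    # nthreads as described above; newer pyarrow versions spell this use_threads=True/False.
    table = pq.read_table(path, nthreads=threads)
    df = table.to_pandas()
    return df, time.time() - start

for n in (1, 4):
    df, elapsed = timed_read(file_path, n)
    mb = df.memory_usage(deep=True).sum() / 2**20
    print(f'{n} thread(s): {elapsed:.3f} s ({mb / elapsed:.0f} MB/s into pandas)')

Measuring throughput against the in-memory DataFrame size (rather than the file size on disk) is what makes multi-gigabyte-per-second numbers possible for such small files.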
Conclusions

With the 1.0 release of parquet-cpp (Apache Parquet in C++) on the horizon, it's great to see this kind of IO performance made available to the Python user base.
Since all of the underlying machinery here is implemented in C++, other languages (such as R) can build interfaces to Apache Arrow (the common columnar data structures) and parquet-cpp. The Python bindings are a lightweight wrapper on top of the underlying libarrow and libparquet C++ libraries.