I'm porting chemfp from Python2 to Python3. I read a lot of ASCII files. I'm trying to figure out if it's better to read them as binary bytes or as text strings.
No matter how I tweak Python3's open() parameters, I can't get the string read performance to within a factor of 2 of the bytes read performance. As I haven't seen much discussion of this, I figured I would document it here.
chemfp reads chemistry file formats which are specified as ASCII. They contain user-specified fields which are 8-bit clean, so sometimes people use them to encode non-ASCII data. For example, the SD tag field "price" might include the price in GBP or EUR, and include the currency symbol either as Latin-1 or UTF-8. (I haven't come across other encodings, but I've also never worked with SD files used internally in, say, a Japanese pharmaceutical company.)
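Since the encoding of such a field isn't known in advance, one workable tactic - a sketch, not chemfp's actual code - is to try UTF-8 first and fall back to Latin-1, which accepts any byte sequence:

def decode_tag_value(raw):
    # Most fields are plain ASCII, which the UTF-8 decoder handles directly.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")

print(decode_tag_value(b"12.50 \xc2\xa3"))  # UTF-8 encoded pound sign
print(decode_tag_value(b"12.50 \xa3"))      # Latin-1 encoded pound sign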
These are text files, so it makes sense to read them as text, right? The main problem is, reading in "r" mode is a lot slower than reading in "rb" mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X 10.10.5 machine to read the first 10MiB from a 3.1GiB file:
% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop

The Unicode string read() is much slower than the byte string read(), with a performance ratio of 2.75. (I'll give all numbers as ratios.)
Python2 had a similar problem. I originally used "U"niversal newline mode in chemfp to read the text files in FPS format, but found that if I switched from "rU" to "rb", and wrote my code to support both '\n' and '\r\n' conventions (sketched below), I could double my overall system read performance - the "U" option gives a 10x slowdown!
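Here's a minimal sketch of that idea - not chemfp's actual FPS reader - which reads in binary mode and strips an optional trailing carriage return from each line, instead of paying for newline translation:

def read_fps_lines(filename):
    # Binary-mode iteration yields lines ending in b"\n" (except
    # possibly the last).  rstrip handles both b"\n" and b"\r\n".
    with open(filename, "rb") as f:
        for line in f:
            yield line.rstrip(b"\r\n")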
% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop

This observation is not new. A quick Duck Duck Go search found a 2015 blog post by Nelson Minar which concluded:
* Python 2 and Python 3 read bytes at the same speed
* In Python 2, decoding Unicode is 10x slower than reading bytes
* In Python 3, decoding Unicode is 3-7x slower than reading bytes
* In Python 3, universal newline conversion is ~1.5x slower than skipping it, at least if the file has DOS newlines
* In Python 3, codecs.open() is faster than open().

The Python3 open() function takes more parameters than Python2, including 'newline', which affects how the text mode reader identifies newlines, and 'encoding':
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
...
newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

* On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
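To make those rules concrete, here's a quick check against a throwaway file with DOS newlines (the file name and contents are invented for the demonstration):

# Write a small file with DOS line endings; newline="" disables output translation.
with open("dos.txt", "w", newline="") as f:
    f.write("a\r\nb\r\n")

print(repr(open("dos.txt", "r", newline=None).read()))  # 'a\nb\n' - translated
print(repr(open("dos.txt", "r", newline="").read()))    # 'a\r\nb\r\n' - untranslated
print(repr(open("dos.txt", "r", newline="\n").read()))  # 'a\r\nb\r\n' - untranslated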
I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":

% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.81 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

The ratio of 2.3 for the newline="\n" slowdown is better than the 2.75 for universal newlines and for the newline="" case that Nelson Minar tested, but still less than half the performance of the byte reader.
I also wondered if the encoding made a difference:
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)'
100 loops, best of 3: 10.1 msec per loop

My benchmark shows that the ASCII and UTF-8 encodings are equally fast, and Latin-1 is 14% slower, even though my data set contains only ASCII. I did not expect any difference. I assume a lot of time has been spent making the UTF-8 code go fast, but I don't know why the Latin-1 reader is noticeably slower on ASCII data.

Nelson Minar also tested the codecs.open() performance, so I'll repeat it:
% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

I noticed no performance difference between codecs.open() and the builtin open() for this test case.
I'm left with a bit of a quandary. I work with ASCII text data, with only the occasional non-ASCII field. For example, chemfp has specialized code to read an id tag and encoded fingerprint field from an SD file. In rare and non-standard cases, the handful of characters in the id/title line might be non-ASCII, but the hex-encoded fingerprint is never anything other than ASCII. It makes sense to use the text reader. But if I use the text reader, it will decode everything in each record (typically 2K-8K bytes), when I only need to decode at most 100 bytes of the record.
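To make the bytes-first alternative concrete, here is roughly what such a reader could look like - a sketch, not chemfp's actual parser, and the fingerprint tag name "FP1" is invented:

def extract_id_and_fingerprint(record, fp_tag=b"> <FP1>"):
    # The id/title is the first line of the SD record; in rare,
    # non-standard cases it may hold non-ASCII bytes.
    id_bytes = record.split(b"\n", 1)[0].rstrip(b"\r")
    # The hex-encoded fingerprint is on the line after the tag line
    # and is always ASCII.  Nothing else in the record gets decoded.
    i = record.index(fp_tag)
    i = record.index(b"\n", i) + 1
    j = record.index(b"\n", i)
    fp_bytes = record[i:j].rstrip(b"\r")
    # In practice the id might need the UTF-8/Latin-1 fallback sketched earlier.
    return id_bytes.decode("utf-8"), fp_bytes.decode("ascii")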
In chemfp, I used to have a two-pass solution to find records in an SD file. The first pass found the fields of interest, and the second counted newlines for better error reporting. I found that even that level of data re-scanning caused an observable slowdown, so I shouldn't be surprised that an extra pass to check for non-ASCII characters might also be a problem. But a two-fold slowdown?
This performance overhead leads me to conclude that I need to process my performance-critical files as bytes, rather than as strings, and delay the byte-to-string decoding as much as possible.
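One way to structure that "decode late" approach - a sketch with invented names, not chemfp's API - is to hold each record as bytes and decode individual fields only on first access:

class SDRecord:
    """Keep the raw record as bytes; decode fields only on demand."""
    def __init__(self, raw):
        self.raw = raw      # the full 2K-8K byte record, undecoded
        self._id = None

    @property
    def id(self):
        # Decode just the short title line, and only when asked for.
        if self._id is None:
            self._id = self.raw.split(b"\n", 1)[0].decode("utf-8")
        return self._id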
RDKit and (non-)Unicode

I checked how the RDKit, a cheminformatics toolkit, handles this. The core is in C++, with Python extensions through Boost.Python. It treats the files as bytes, and lazily exposes the data to Python as Unicode.