I'm porting chemfp from Python2 to Python3. I read a lot of ASCII files. I'm trying to figure out if it's better to read them as binary bytes or as text strings.
No matter how I tweak Python3's open() parameters, I can't get the string read performance to within a factor of 2 of the bytes read performance. As I haven't seen much discussion of this, I figured I would document it here.
chemfp reads chemistry file formats which are specified as ASCII. They contain user-specified fields which are 8-bit clean, so sometimes people use them to encode non-ASCII data. For example, the SD tag field "price" might include the price in GBP or EUR, and include the currency symbol either as Latin-1 or UTF-8. (I haven't come across other encodings, but I've also never worked with SD files used internally in, say, a Japanese pharmaceutical company.)
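Since the encoding of such a field isn't known in advance, one workable tactic - a sketch, not chemfp's actual code - is to try UTF-8 first and fall back to Latin-1, which accepts any byte sequence:

def decode_tag_value(raw):
    # Most fields are plain ASCII, which the UTF-8 decoder handles directly.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")

print(decode_tag_value(b"12.50 \xc2\xa3"))  # UTF-8 encoded pound sign
print(decode_tag_value(b"12.50 \xa3"))      # Latin-1 encoded pound sign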
These are text files, so it makes sense to read them as text, right? The main problem is, reading in "r" mode is a lot slower than reading in "rb" mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X 10.10.5 machine to read the first 10MiB from a 3.1GiB file:
% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop

The Unicode string read() is much slower than the byte string read(), with a performance ratio of 2.75. (I'll give all numbers as ratios.)
Python2 had a similar problem. I originally used "U"niversal newline mode in chemfp to read the text files in FPS format, but found that if I switched from "rU" to "rb", and wrote my code to support both '\n' and '\r\n' conventions (sketched below), I could double my overall system read performance - the "U" option gives a 10x slowdown!
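Here's a minimal sketch of that idea - not chemfp's actual FPS reader - which reads in binary mode and strips an optional trailing carriage return from each line, instead of paying for newline translation:

def read_fps_lines(filename):
    # Binary-mode iteration yields lines ending in b"\n" (except
    # possibly the last).  rstrip handles both b"\n" and b"\r\n".
    with open(filename, "rb") as f:
        for line in f:
            yield line.rstrip(b"\r\n")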
% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop

This observation is not new. A quick Duck Duck Go search found a 2015 blog post by Nelson Minar which concluded:
* Python 2 and Python 3 read bytes at the same speed
* In Python 2, decoding Unicode is 10x slower than reading bytes
* In Python 3, decoding Unicode is 3-7x slower than reading bytes
* In Python 3, universal newline conversion is ~1.5x slower than skipping it, at least if the file has DOS newlines
* In Python 3, codecs.open() is faster than open().

The Python3 open() function takes more parameters than Python2, including 'newline', which affects how the text mode reader identifies newlines, and 'encoding':
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
...
newline controls how universal newlines works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

* On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
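To make those rules concrete, here's a quick check against a throwaway file with DOS newlines (the file name and contents are invented for the demonstration):

# Write a small file with DOS line endings; newline="" disables output translation.
with open("dos.txt", "w", newline="") as f:
    f.write("a\r\nb\r\n")

print(repr(open("dos.txt", "r", newline=None).read()))  # 'a\nb\n' - translated
print(repr(open("dos.txt", "r", newline="").read()))    # 'a\r\nb\r\n' - untranslated
print(repr(open("dos.txt", "r", newline="\n").read()))  # 'a\r\nb\r\n' - untranslated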
I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":

% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.81 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

The ratio of 2.3 for the newline="\n" slowdown is better than the 2.75 for universal newlines and for the newline="" case that Nelson Minar tested, but still less than half the performance of the byte reader.
I also wondered if the encoding made a difference:
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)'
100 loops, best of 3: 10.1 msec per loop

My benchmark shows that the ASCII and UTF-8 encodings are equally fast, and Latin-1 is 14% slower, even though my data set contains only ASCII. I did not expect any difference. I assume a lot of time has been spent making the UTF-8 code go fast, but I don't know why the Latin-1 reader is noticeably slower on ASCII data.

Nelson Minar also tested the codecs.open() performance, so I'll repeat it:
% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop

I noticed no performance difference between codecs.open() and the builtin open() for this test case.
I'm left with a bit of a quandary. I work with ASCII text data, with only the occasional non-ASCII field. For example, chemfp has specialized code to read an id tag and encoded fingerprint field from an SD file. In rare and non-standard cases, the handful of characters in the id/title line might be non-ASCII, but the hex-encoded fingerprint is never anything other than ASCII. It makes sense to use the text reader. But if I use the text reader, it will decode everything in each record (typically 2K-8K bytes), when I only need to decode at most 100 bytes of the record.
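To make the bytes-first alternative concrete, here is roughly what such a reader could look like - a sketch, not chemfp's actual parser, and the fingerprint tag name "FP1" is invented:

def extract_id_and_fingerprint(record, fp_tag=b"> <FP1>"):
    # The id/title is the first line of the SD record; in rare,
    # non-standard cases it may hold non-ASCII bytes.
    id_bytes = record.split(b"\n", 1)[0].rstrip(b"\r")
    # The hex-encoded fingerprint is on the line after the tag line
    # and is always ASCII.  Nothing else in the record gets decoded.
    i = record.index(fp_tag)
    i = record.index(b"\n", i) + 1
    j = record.index(b"\n", i)
    fp_bytes = record[i:j].rstrip(b"\r")
    # In practice the id might need the UTF-8/Latin-1 fallback sketched earlier.
    return id_bytes.decode("utf-8"), fp_bytes.decode("ascii")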
In chemfp, I used to have a two-pass solution to find records in an SD file. The first pass found the fields of interest, and the second counted newlines for better error reporting. I found that even that level of data re-scanning caused an observable slowdown, so I shouldn't be surprised that an extra pass to check for non-ASCII characters might also be a problem. But a two-fold slowdown?
This performance overhead leads me to conclude that I need to process my performance-critical files as bytes, rather than as strings, and delay the byte-to-string decoding as much as possible.
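One way to structure that "decode late" approach - a sketch with invented names, not chemfp's API - is to hold each record as bytes and decode individual fields only on first access:

class SDRecord:
    """Keep the raw record as bytes; decode fields only on demand."""
    def __init__(self, raw):
        self.raw = raw      # the full 2K-8K byte record, undecoded
        self._id = None

    @property
    def id(self):
        # Decode just the short title line, and only when asked for.
        if self._id is None:
            self._id = self.raw.split(b"\n", 1)[0].decode("utf-8")
        return self._id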
RDKit and (non-)Unicode

I checked how the RDKit, a cheminformatics toolkit, handles this. The core is in C++, with Python extensions through Boost.Python. It treats the files as bytes, and lazily exposes the data to Python as Unicode.