FASTCALL microbenchmarks

For my FASTCALL project (a CPython optimization that avoids temporary tuples and dictionaries when passing arguments), I wrote many short microbenchmarks. I grouped them into a new Git repository: pymicrobench. CPython developers require benchmark results to prove that an optimization is worth it. It's not uncommon for me to abandon a change because the speedup is not significant, because it makes CPython slower, or because the change is too complex. Over the last 12 months, I counted that I abandoned 9 optimization issues, rejected for different reasons, out of a total of 46 optimization issues.

This article gives the Python 3.7 results of these microbenchmarks compared to Python 3.5 (before FASTCALL). I ignored 3 microbenchmarks which are between 2% and 5% slower: the code was not optimized and the result is not significant (a difference of less than 10% on a microbenchmark is not significant).

In the results below, the speedup is between 1.11x faster (-10%) and 1.92x faster (-48%). It's not easy to isolate the speedup of FASTCALL alone: since Python 3.5, Python 3.7 got many other optimizations.
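The two notations used in the results ("1.11x faster" and "-10%") describe the same pair of timings. As a minimal sketch of that arithmetic, assuming the factor is the old time divided by the new time and the percentage is the relative change in time (the function names below are illustrative, not taken from pymicrobench):

```python
def compare(old, new):
    """Return (speedup factor, relative change in percent) for two timings.

    Both timings must use the same unit (for example nanoseconds).
    """
    factor = old / new                    # > 1.0 means the new version is faster
    change = (new - old) / old * 100.0    # negative means less time was spent
    return factor, change


def format_comparison(old, new):
    """Format two timings in the "1.11x faster (-10%)" style."""
    factor, change = compare(old, new)
    if factor >= 1.0:
        return f"{factor:.2f}x faster ({change:+.0f}%)"
    return f"{1.0 / factor:.2f}x slower ({change:+.0f}%)"


# The two extremes quoted in this article:
print(format_comparison(81.8, 73.4))   # filter(lambda x: x, ...)
print(format_comparison(167, 87.0))    # Python __getitem__: obj[0]
```

Because the published timings are already rounded, recomputing a row this way can differ from the published factor in the last digit.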

Using FASTCALL gives a speedup of around 20 ns; this was measured on a patch that switches to FASTCALL. It's not a lot, but many builtin functions take less than 100 ns, so 20 ns is significant in practice! Avoiding a tuple to pass positional arguments is interesting, but FASTCALL also allows further internal optimizations.
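To get a feeling for these orders of magnitude, here is a rough sketch that times a few builtin calls with the standard timeit module. It is not the tooling used for the numbers in this article, and the helper name time_call_ns is made up for this example; timeit alone is noisier than a dedicated benchmark runner, so treat the output as an estimate.

```python
import timeit


def time_call_ns(stmt, setup="pass", number=100_000, repeat=5):
    """Return the best observed time per execution of stmt, in nanoseconds."""
    timer = timeit.Timer(stmt, setup=setup)
    # Each entry is the total time for `number` executions; keep the minimum
    # to reduce the influence of system noise.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


# Statements taken from the benchmark tables of this article.
for stmt, setup in [
    ('struct.pack("i", 1)', "import struct"),
    ('getattr(1, "real")', "pass"),
]:
    print(f"{stmt}: {time_call_ns(stmt, setup):.1f} ns")
```

The measured values depend on the machine and on the Python version, so they will not match the tables exactly.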

Microbenchmark on calling builtin functions:

| Benchmark | 3.5 | 3.7 | Change |
|---|---|---|---|
| struct.pack("i", 1) | 105 ns | 77.6 ns | 1.36x faster (-26%) |
| getattr(1, "real") | 79.4 ns | 64.4 ns | 1.23x faster (-19%) |

Microbenchmark on calling methods of builtin types:

| Benchmark | 3.5 | 3.7 | Change |
|---|---|---|---|
| {1: 2}.get(7, None) | 84.9 ns | 61.6 ns | 1.38x faster (-27%) |
| collections.deque([None]).index(None) | 116 ns | 87.0 ns | 1.33x faster (-25%) |
| {1: 2}.get(1) | 79.4 ns | 59.6 ns | 1.33x faster (-25%) |
| "a".replace("x", "y") | 134 ns | 101 ns | 1.33x faster (-25%) |
| b"".decode() | 71.5 ns | 54.5 ns | 1.31x faster (-24%) |
| b"".decode("ascii") | 99.1 ns | 75.7 ns | 1.31x faster (-24%) |
| collections.deque.rotate(1) | 106 ns | 82.8 ns | 1.28x faster (-22%) |
| collections.deque.insert() | 778 ns | 608 ns | 1.28x faster (-22%) |
| b"".join((b"hello", b"world") * 100) | 4.02 us | 3.32 us | 1.21x faster (-17%) |
| [0].count(0) | 53.9 ns | 46.3 ns | 1.16x faster (-14%) |
| collections.deque.rotate() | 72.6 ns | 63.1 ns | 1.15x faster (-13%) |
| b"".join((b"hello", b"world")) | 102 ns | 89.8 ns | 1.13x faster (-12%) |

Microbenchmark on builtin functions calling Python functions (callbacks):

| Benchmark | 3.5 | 3.7 | Change |
|---|---|---|---|
| map(lambda x: x, list(range(1000))) | 76.1 us | 61.1 us | 1.25x faster (-20%) |
| sorted(list(range(1000)), key=lambda x: x) | 90.2 us | 78.2 us | 1.15x faster (-13%) |
| filter(lambda x: x, list(range(1000))) | 81.8 us | 73.4 us | 1.11x faster (-10%) |

Microbenchmark on calling slots (__getitem__, __init__, __int__) implemented in Python:

| Benchmark | 3.5 | 3.7 | Change |
|---|---|---|---|
| Python __getitem__: obj[0] | 167 ns | 87.0 ns | 1.92x faster (-48%) |
| call_pyinit_kw1 | 348 ns | 240 ns | 1.45x faster (-31%) |
| call_pyinit_kw5 | 564 ns | 401 ns | 1.41x faster (-29%) |
| call_pyinit_kw10 | 960 ns | 734 ns | 1.31x faster (-24%) |
| Python __int__: int(obj) | 241 ns | 207 ns | 1.16x faster (-14%) |
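For readers unfamiliar with slots implemented in Python, here is a hedged sketch of what such benchmarks exercise. The class names are invented for this example and the real pymicrobench definitions may differ. In particular, reading call_pyinit_kw1 as "instantiate a class whose __init__ takes one keyword argument" is an assumption based only on the benchmark name.

```python
import timeit


class GetItem:
    # obj[0] goes through the type's item-access slot, which calls this method.
    def __getitem__(self, index):
        return index


class ToInt:
    # int(obj) goes through the type's integer-conversion slot.
    def __int__(self):
        return 5


class InitKw1:
    # Assumed shape of the "pyinit_kw1" case: one keyword argument.
    def __init__(self, a=None):
        self.a = a


namespace = {"getitem": GetItem(), "toint": ToInt(), "InitKw1": InitKw1}

for stmt in ("getitem[0]", "int(toint)", "InitKw1(a=1)"):
    number = 100_000
    total = timeit.timeit(stmt, globals=namespace, number=number)
    print(f"{stmt}: {total / number * 1e9:.1f} ns per call")
```

In each case the interpreter's C code has to call back into a Python function, which is the kind of call these microbenchmarks measure.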

Microbenchmark on calling a method descriptor (static method):

| Benchmark | 3.5 | 3.7 | Change |
|---|---|---|---|
| int.to_bytes(1, 4, "little") | 177 ns | 103 ns | 1.72x faster (-42%) |
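As a small illustration of what is being called here: looking up to_bytes on the int type itself (rather than on an instance) returns a method descriptor, and the object is then passed explicitly as the first argument. The variable names below are only for this example.

```python
# Looked up on the type, to_bytes is a method descriptor.
descriptor = int.to_bytes
descriptor_type_name = type(descriptor).__name__
print(descriptor_type_name)

# Calling through the type passes the integer explicitly as first argument...
via_type = int.to_bytes(1, 4, "little")

# ...which is equivalent to the usual bound method call on an instance.
via_instance = (1).to_bytes(4, "little")

print(via_type, via_type == via_instance)
```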

Benchmarks were run on speed-python, the server used to run CPython benchmarks.
