At Uber, we make an effort to write efficient backend services to keep our compute costs low. This becomes increasingly important as our business grows; seemingly small inefficiencies are greatly magnified at Uber’s scale. We’ve found flame graphs to be an effective tool for understanding the CPU and memory characteristics of our services, and we’ve used them to great effect with our Go and JavaScript services. In order to get high-quality flame graphs for Python services, we wrote a high-performance profiler called Pyflame, implemented in C++. In this article, we explore design considerations and some unique implementation characteristics that make Pyflame a better alternative for profiling Python code.
Deterministic Profilers
Python offers built-in deterministic profilers via the profile and cProfile modules. These profilers work by using the sys.settrace() facility to install a trace function that runs at various points of interest, such as the start and end of each function and at the beginning of each logical line of code. This mechanism yields high-resolution profiling information, but it has a number of shortcomings.
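To make the mechanism concrete, here is a minimal sketch of how a deterministic profiler hooks into the interpreter with sys.settrace(). The trace function and its bookkeeping are simplified illustrations for this article, not cProfile’s actual implementation:

```python
import sys
import time

call_times = {}  # illustrative bookkeeping: function name -> cumulative seconds
_stack = []      # (function name, entry timestamp) for frames we have entered

def tracer(frame, event, arg):
    # The interpreter invokes this callback on every function call and return
    # (and, via the returned local tracer, on every line) once sys.settrace()
    # installs it -- which is exactly where the overhead of deterministic
    # profiling comes from.
    name = frame.f_code.co_name
    if event == "call":
        _stack.append((name, time.perf_counter()))
    elif event == "return" and _stack:
        entered_name, entered_at = _stack.pop()
        elapsed = time.perf_counter() - entered_at
        call_times[entered_name] = call_times.get(entered_name, 0.0) + elapsed
    return tracer

def work():
    total = 0
    for i in range(100000):
        total += i * i
    return total

sys.settrace(tracer)
work()
sys.settrace(None)  # uninstall the trace function
print(call_times)
```

Because the callback is executed for every call, return, and line event, the cost of the trace function is paid constantly while profiling is enabled.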
High Overhead
The first drawback is its extremely high overhead: we commonly see it slowing down programs by 2x. Worse, we found this overhead to cause inaccurate profiling numbers in many cases. The cProfile module has difficulty accurately reporting timing statistics for methods that run very quickly, because the profiler overhead itself is significant in those cases. As a result, many engineers don’t use the profiling information at all, because they can’t trust its accuracy.
Lack of Full Call Stack Information
The second problem with the built-in deterministic profilers is that they don’t record full call stack information. The built-in profiling modules only record information going up one stack level, which limits their usefulness. For example, when one decorator is applied to a large number of functions, the decorator frequently shows up in the callees and callers sections of the profiling output, with the true call information obscured by the flattened call stack. This clutter makes it difficult to understand true callee and caller relationships, as the example below illustrates.
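The following sketch, with hypothetical function names, shows how this plays out: once several functions share the same decorator, cProfile’s one-level caller/callee view attributes all of their calls to the decorator’s wrapper frame.

```python
import cProfile
import functools
import pstats

def logged(func):
    # Every decorated function is now called *through* this wrapper, so the
    # profiler sees "wrapper" as the immediate caller and callee of almost
    # everything, obscuring who really called whom.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@logged
def load_user(n):
    return [i for i in range(n)]

@logged
def load_trip(n):
    return [i * 2 for i in range(n)]

def handle_request():
    load_user(50000)
    load_trip(50000)

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# print_callers() only reports one level of callers, so both business
# functions are attributed to the shared wrapper frame rather than to
# handle_request().
pstats.Stats(profiler).print_callers()
```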
Lack of Services Written for Profiling
Finally, the built-in deterministic profilers require that the code be explicitly instrumented for profiling. A common problem for us is that many services weren’t written with profiling in mind. Under high load, we may encounter serious performance problems with the service and want to collect profiling information quickly. Since the code isn’t already instrumented for profiling, there’s no way to immediately start collecting profiling information. If the load is severe enough, we may need an engineer to write code to enable a deterministic profiler (typically by adding an RPC method to turn it on and another to dump profiling data). This code then needs to be reviewed, tested, and deployed. The whole cycle might take several hours, which is not fast enough for us.
Sampling Profilers
There are also a number of third-party sampling profilers for Python. These typically work by installing a POSIX interval timer that periodically interrupts the process and runs a signal handler to record stack information. Rather than deterministically collecting profiling information, sampling profilers sample the running process. This technique is effective because the sampling resolution can be dialed up or down: a high sampling rate yields detailed profiles at the cost of more overhead, while a low rate yields coarser profiles with less overhead.
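The core of this approach fits in a few lines. Here is a hedged sketch, not taken from any particular profiler, of sampling the main thread’s stack with a POSIX interval timer via Python’s signal module:

```python
import collections
import signal
import traceback

# illustrative bookkeeping: (file, line, function) -> number of samples
sample_counts = collections.Counter()

def sample_handler(signum, frame):
    # Called each time SIGPROF fires; record where the main thread is.
    leaf = traceback.extract_stack(frame)[-1]
    sample_counts[(leaf.filename, leaf.lineno, leaf.name)] += 1

# Deliver SIGPROF roughly every 10 milliseconds of CPU time.
signal.signal(signal.SIGPROF, sample_handler)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)

def busy_work():
    total = 0
    for i in range(5_000_000):
        total += i % 7
    return total

busy_work()

signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling
for location, count in sample_counts.most_common(5):
    print(count, location)
```

Note that the handler itself runs as Python code in the profiled process, which is part of the overhead problem discussed next.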
A few limitations come with sampling profilers. First, they typically come with high overhead because they’re implemented in Python, and Python itself is not fast, especially compared to C or C++. In fact, the cProfile deterministic profiler is implemented in C for this very reason. With these sampling profilers, getting acceptable performance often means setting the timer frequency to something relatively coarse-grained.
The other limitation is that the code needs to be explicitly instrumented for profiling, just as with deterministic profilers. Therefore, existing sampling profilers lead to the same problem as before: under high load, we want to profile some code, only to realize we have to rewrite it first.
Pyflame to the Rescue
With Pyflame, we wanted to keep all of the benefits of existing profilers:
- Collect the full Python stack, all the way to its root
- Emit data in a format that can be used to generate a flame graph
- Have low overhead
- Work with processes not explicitly instrumented for profiling

More importantly, we aimed to avoid all of the existing limitations. It might sound impossible to ask for all of these features without making any sacrifices, but it’s not as impossible as it sounds!
Using ptrace for Python Profiling
Most Unix systems implement a special process trace system call called ptrace(2). ptrace is not part of the POSIX specification, but Unix implementations like BSD, OS X, and Linux all provide a ptrace implementation that allows a process to read and write arbitrary virtual memory addresses, read and write CPU registers, deliver signals, and so on. If you’ve ever used a debugger like GDB, then you’ve used software implemented on top of ptrace.
It’s possible to use ptrace to implement a Python profiler. The idea is to periodically attach to the process with ptrace, use the memory-peeking routines to get the Python stack trace, and then detach from the process. Specifically, with Linux ptrace, a profiler can be written using the request types PTRACE_ATTACH, PTRACE_PEEKDATA, and PTRACE_DETACH. In theory, this is pretty straightforward. In practice, it’s complicated by the fact that recovering the stack trace using only the PTRACE_PEEKDATA request is very low-level and unintuitive.
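As a rough illustration of that attach/peek/detach cycle, here is a conceptual sketch that calls ptrace through ctypes. Pyflame itself is written in C++, and a real profiler also has to locate the interpreter’s thread and frame structures and decode them word by word, which this sketch omits entirely:

```python
import ctypes
import os

# Linux ptrace request constants (values from <sys/ptrace.h>).
PTRACE_PEEKDATA = 2
PTRACE_ATTACH = 16
PTRACE_DETACH = 17

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.ptrace.restype = ctypes.c_long
libc.ptrace.argtypes = [ctypes.c_long, ctypes.c_long,
                        ctypes.c_void_p, ctypes.c_void_p]

def peek_word(pid, addr):
    """Read one machine word from the traced process's address space."""
    ctypes.set_errno(0)
    word = libc.ptrace(PTRACE_PEEKDATA, pid, ctypes.c_void_p(addr), None)
    if word == -1 and ctypes.get_errno() != 0:
        raise OSError(ctypes.get_errno(), "PTRACE_PEEKDATA failed")
    return word

def snapshot(pid, addr):
    # Attaching stops the target process; wait until the stop is delivered.
    if libc.ptrace(PTRACE_ATTACH, pid, None, None) != 0:
        raise OSError(ctypes.get_errno(), "PTRACE_ATTACH failed")
    os.waitpid(pid, 0)
    try:
        # A real profiler would chase pointers from the interpreter's thread
        # state through the frame objects to rebuild the full Python stack.
        return peek_word(pid, addr)
    finally:
        libc.ptrace(PTRACE_DETACH, pid, None, None)  # let the target resume
```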
First, we’ll briefly cover how the PTRACE_PEEKDATA request works.