I’ve been itching to write this article since I authored Python vs Copy on Write last year. To recap, I found myself awake at 4 AM trying to performance-tune the cache on a long-running ETL written in Python.
In a flash of pizza-fueled, sleep-deprived clarity, I remembered Brandon Rhodes’ talk on Python, Linkers, and Virtual Memory: because of reference counting, Python cannot take advantage of the Copy on Write optimization provided by the OS. An hour later I had a patch ready for our staging environment.
If you read my previous article, you may have been under the impression that my fix was a silver bullet, and you’d be partially right.

The patch did have the intended effect of reducing the runtime of the areas of code that used the cache, but to my horror I found that I had created another bottleneck. We use psycopg2 to query our database, and our code was now spending hours running this line to create the new cache:
cursor.fetchall()

Thinking back to my long night of coding, I remember noticing that, for some reason, the range(very large number) function that I was using to generate the dummy cache in my experiments was taking a long time to run. So I hypothesized that allocating a large list to store the fetchall results was bottlenecking my process.
I refactored, reran the ETL in our staging environment, and waited as fast as I could. Like magic, the ETL finished with more than a 40% reduction in runtime, and I could console myself with the knowledge that my sleep schedule would return to normal.
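The exact refactor isn’t the point of this article, but a minimal sketch of the idea, streaming rows instead of materializing them all at once with fetchall, might look like the following. The connection string, query, and cache structure are hypothetical; the key ingredient is psycopg2’s server-side (named) cursor, which keeps the result set on the database server and hands it to the client in batches:

import psycopg2

conn = psycopg2.connect("dbname=example")  # hypothetical connection string

# A named cursor keeps the result set on the server, so rows are streamed
# in batches instead of being loaded into one enormous Python list.
cursor = conn.cursor(name="cache_builder")
cursor.itersize = 10000  # rows fetched per network round trip
cursor.execute("SELECT key, value FROM cache_source")  # hypothetical query

cache = {}
for key, value in cursor:
    cache[key] = value

cursor.close()
conn.close()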
But there was still the lingering question of why large list allocations take so much time. So grab your gear, prime your Python, C, and Unix cheat sheets, and get ready for a deep dive into the Python interpreter.

Tools
To avoid reading through the CPython interpreter’s source, we’ll run a number of Unix profiling tools on Ubuntu 16.04 against some simple examples. We will start with ltrace for standard library traces and strace for system call traces.
Then we will move on to Valgrind to thoroughly diagnose the problem. Valgrind is a framework that contains a number of dynamic analysis tools. Its Memcheck tool is commonly used to detect memory leaks. For our problem we will be using Massif, DHAT, and Callgrind from the Valgrind suite.
We will start with Massif for heap profiling. Massif provides a timeline of allocated memory. We will then use DHAT for some summarized allocation statistics.
The bulk of our work will focus on Callgrind to profile the call history using a call graph. KCachegrind will let us visualize the call graph and determine exactly where our cycles are being spent.
I should note that part of my purpose in writing this blog is to demonstrate the use of these tools. Were I to perform a task like this for practical purposes, I would probably reach for Memcheck and Callgrind, and likely skip the rest.
Feel free to follow along by checking out this git repository. In most cases, I’ve encapsulated examples into single bash scripts. In this article, I’ll make a note of the path of the example I’m using and cat the bash script so that you can see its contents. I’ve also produced a Docker image that provides a version of Python 2.7 compiled to be run with Valgrind. In order to install Docker, run the following:
$ sudo apt-get install docker.io
$ sudo adduser $USER docker

You may have to log out and log back in in order to be recognized as a member of the docker group.

The Benchmark
To start, let’s go back to the program that started this whole investigation:
src/python/benchmark.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
range(10000000)

Remember the original hypothesis: allocating large lists of integers caused an increase in runtime. The range function in this case is a very easy way to create a large list of integers. The size parameter passed to range is somewhat arbitrary. It has to be sufficiently large to be noticeable on the machine you’re running on. In my case, the magic number is 10,000,000.
A brief note on the differences between Python 2.7 and Python 3: range is implemented differently. We can measure the difference using the Unix time command:
src/python/$ time python2.7 benchmark.py

real    0m0.404s
user    0m0.220s
sys     0m0.180s

‘real’ is the time from start to finish. ‘user’ is the time spent in user-mode code, outside of the kernel. ‘sys’ is the time spent in the kernel via system calls. Compare this runtime to the results from Python 3:
src/python/$ time python3.5 benchmark.py

real    0m0.040s
user    0m0.032s
sys     0m0.004s

Instead of allocating a populated list, Python 3 creates a range object. I haven’t verified it, but I assume the range object operates like a generator. To eliminate this complexity, we’ll stick with Python 2.7.
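If you’d like to check that assumption yourself, a quick experiment (my own aside, not part of the benchmark repository) is to compare the footprint of the Python 3 range object with the list it would expand to:

import sys

r = range(10000000)
print(sys.getsizeof(r))        # a few dozen bytes: range stores only start, stop and step
print(r[5000000])              # elements are computed on demand, generator-style
print(sys.getsizeof(list(r)))  # tens of megabytes once the list is materialized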
The CPython interpreter is a complex piece of software. In order to establish a baseline for understanding its operation, we’ll create a naively comparable program in C:
src/c/benchmark.c

#include <stdlib.h>

#define LENGTH 10000000

int main() {
    // Allocate the memory.
    int * buffer = malloc(LENGTH * sizeof(int));

    // Seed the memory.
    for (int index = 0; index < LENGTH; index++) {
        buffer[index] = index;
    }

    // Free the memory.
    free(buffer);

    return 0;
}

Note that you will need to install some software to get this program to compile:
$ sudo apt-get install build-essential

A quick check with the time command:
src/c/$ make
rm -f benchmark
rm -f *.out.*
gcc -g -o benchmark benchmark.c
src/c/$ time ./benchmark

real    0m0.072s
user    0m0.064s
sys     0m0.004s

I understand that this is an apples-to-oranges comparison, but the differences we find should be illuminating. The simplified C program will also let us explore our tool set with something that is easy to understand. Right, it’s time to learn how to defend your runtime against a blogger armed with an apple and an orange.

ltrace
ltrace is a program that lets us take a peek at the standard library calls being made by a program. As our C and Python programs are somewhat simple, ltrace should give us a quick idea of how different these programs are. First, the C program:
src/c/ltrace

src/c/$ cat ltrace
#!/bin/sh
ltrace ./benchmark
src/c/$ ./ltrace
__libc_start_main(0x400566, 1, 0x7ffdc5a20818, 0x400590 <unfinished ...>
malloc(40000000) = 0x7fa98820b010
free(0x7fa98820b010) = <void>
+++ exited (status 0) +++

As expected, the C program allocated all of the 40 MB of memory with a single command (malloc).
Second, the Python program. Running ltrace directly on the Python script is somewhat unwieldy. We’ll have to filter the results. Let’s start with a summary of the calls:
src/python/ltrace

% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 47.46   10.684727          43    246094 free
 44.10    9.927329          40    245504 malloc
  2.60    0.585793          39     14734 memcpy
  1.55    0.348615          40      8530 memset
  1.26    0.283193          40      6962 strlen
  1.03    0.231403          39      5819 memcmp
  0.52    0.116267          43      2688 strcmp
  0.41    0.091900          40      2248 __ctype_b_loc
  0.18    0.041460          39      1040 realloc
  0.18    0.040438       13479         3 qsort
  0.11    0.023975          39       602 strchr
...

It seems that our script is spending the majority of its time on memory operations. Digging deeper into the mallocs, this script sorts and counts the calls based on the buffer size.
src/python/malloc

src/python/$ cat malloc
#!/bin/sh
ltrace -e malloc -l python python benchmark.py 2>&1 | sed "s/).*//" | sed "s/.*(//" | sort -nr | uniq -c
$ ./malloc
      1 80000000
      1 36654
      4 36032
      1 25593
      1 25474
      1 20808
      1 19945
      1 19712
...
      1 1141
      1 1105
      1 1053
      1 1034
 243929 992
      1 991
      1 990
      1 978
      1 977
...

Two interesting observations can be made. The first is that there is a very large allocation being made, double the size of the C allocation. As we are interested in runtime, we also find that Python is performing 243,929 allocations of 992-byte buffers, or about 242 MB in total.
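Neither number is mysterious once you do the arithmetic. The following back-of-the-envelope sketch is my own accounting, assuming a 64-bit build where each list slot is an 8-byte pointer to an integer object:

# The single large malloc: the list itself, one pointer per element.
list_elements = 10000000
pointer_size = 8
print(list_elements * pointer_size)   # 80000000 bytes

# The many small mallocs: buffers that presumably hold the integer
# objects the list points at.
small_allocs = 243929
small_size = 992
print(small_allocs * small_size)      # 241977568 bytes, roughly 242 MB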
As an aside, if you’d like to play around with ltrace and Python, you can set the range in the benchmark program to something a bit more manageable, like 100. The output is still fairly unwieldy:
src/python/$ ltrace python -c "range(100)"
__libc_start_main(0x49dab0, 2, 0x7ffed23f7ab8, 0x5d71b0 <unfinished ...>
getenv("PYTHONHASHSEED") = nil
getenv("PYTHONINSPECT") = nil
getenv("PYTHONUNBUFFERED") = nil
getenv("PYTHONNOUSERSITE") = nil
getenv("PYTHONWARNINGS") = nil
fileno(0x7f26b16ec8e0) = 0
...

Among other things, we can see some of the built-in function names being copied around in memory:
...
strlen("print") = 5
memcpy(0x7f1bf84e56b4, "print\0", 6) = 0x7f1bf84e56b4
strlen("range") = 5
memcpy(0x7f1bf84e56e4, "range\0", 6) = 0x7f1bf84e56e4
strlen("raw_input") = 9
memcpy(0x7f1bf84e5714, "raw_input\0", 10) = 0x7f1bf84e5714
...

We can also combine ltrace with tools like gnuplot to visualize the malloc calls over time. Note that the size of the range parameter has been reduced so that we can see other allocations performed by the interpreter:
src/python/timeline

src/python/$ cat timeline
#!/bin/sh
ltrace -tt -e malloc -l python python -c "range(10000)" 2>&1 | grep "malloc" | sed "s/python.*(//" | sed "s/).*//" | gnuplot -p -e "set xdata time; set timefmt '%H:%M:%S'; plot '<cat' using 1:2 with lines"
strace
Since we’re talking about ltrace, we may as well cover strace. strace lets us take a look at the system calls our programs are making. System calls are like functions that request services from the kernel. They typically deal with resource allocation requests.
You can find a categorized list of them at System Calls for Developers. System calls are typically not made directly in code. Instead, wrapper functions in the form of a standard library are used. strace intercepts these calls and prints them for inspection.
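As a quick illustration (my own aside, not from the repository), Python’s mmap module is a thin wrapper over the mmap system call, so a snippet like the following shows up almost verbatim in an strace log:

import mmap

# Request 40 MB of anonymous memory directly from the kernel; under
# strace this appears as an mmap(2) call, with a matching munmap(2)
# when the mapping is closed.
buffer = mmap.mmap(-1, 40000000)
buffer.close()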
As before, we will start with our C program. We will filter everything except memory-related calls:
src/c/strace

src/c/$ cat strace
#!/bin/sh
strace -e memory ./benchmark
src/c/$ ./strace
brk(NULL) = 0xaf0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5e8b790000
mmap(NULL, 125519, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f5e8b771000
mmap(NULL, 3967392, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f5e8b1a4000
mprotect(0x7f5e8b363000, 2097152, PROT_NONE) = 0
mmap(0x7f5e8b563000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f5e8b563000
mmap(0x7f5e8b569000, 14752, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f5e8b569000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5e8b770000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5e8b76f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5e8b76e000
mprotect(0x7f5e8b563000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f5e8b792000, 4096, PROT_READ) = 0
munmap(0x7f5e8b771000, 125519) = 0
mmap(NULL, 40001536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5e88b7e000
munmap(0x7f5e88b7e000, 40001536) = 0
+++ exited with 0 +++

The two calls of interest are the last two prior to exit (mmap and munmap), which correspond to the malloc and free calls respectively. We can infer this correspondence because mmap and munmap are called with a size of 40,001,536.
On to the Python program. We’ll summarize the results again with the -c flag to reduce the size of the output:
src/python/strace

src/python/$ cat strace
#!/bin/sh
strace -c -e memory python benchmark.py
src/python/$ ./strace
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000264           0     61837           brk
  0.00    0.000000           0        29           mmap
  0.00    0.000000           0        14           mprotect
  0.00    0.000000           0         2           munmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.000264                 61882           total

It seems the bulk of our memory-related system calls are brk calls and not mmap calls. What’s the difference between brk and mmap? brk adjusts the end of the process’s data segment (the program break), and it is typically used for small allocations. mmap, on the other hand, lets you allocate independent regions without being limited to a single contiguous chunk of memory. It is useful for avoiding issues like memory fragmentation when allocating large memory blocks.
As we found with ltrace, it seems that in addition to the large block of memory allocated by mmap, the Python process also allocates a large number of small memory blocks with brk.
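That split is what glibc’s malloc does by default: requests above a threshold (roughly 128 KiB, tunable with mallopt) are served with mmap, while smaller ones come from the brk-managed heap. A small sketch (my own example, using ctypes to call libc’s malloc directly) makes the pattern easy to reproduce under strace -e memory:

import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]

# The small request is served from the existing brk-managed heap (often
# without any new system call), while the large request is big enough
# that glibc hands it off to a dedicated mmap/munmap pair.
small = libc.malloc(1024)               # well below the ~128 KiB threshold
large = libc.malloc(40 * 1000 * 1000)   # well above it

libc.free(small)
libc.free(large)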
The Environment

So far, ltrace and strace have been useful for getting an overall understanding of the differences between the C and