In part 1 I went through some of the best practices for making sure your application runs as smoothly as possible. This time I'll make some practical use of that knowledge.
It’s very tempting to open with a claim along the lines of “we’ve done the testing so you don’t have to” but it would be false advertising. Question everything. Conduct your own experiments. Build yourself a sandbox so you can tweak and test until you’re happy with the results.
tl;dr: If you’re going to skip reading, test your own application. There is no “one weird trick” below that makes your code fast. Also PyPy is fast.
Build a sandbox
I start by spinning up some AWS instances to conduct my experiments. I chose AWS because it's a common deployment target. If you run stuff on your own hardware, ask your IT department for a spare machine to play with. If you deploy on Google App Engine or Google Compute Engine, by all means use the same platform to play with your code.
I create an EC2 instance to run my code and an RDS instance to host a PostgreSQL server. I chose m4.large and db.m3.medium respectively, with Amazon Linux as the operating system. My production environment would probably include an Elastic Load Balancer, but I chose not to add one, at least in my preliminary tests. I only want to measure the parts I can control and replace.
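If you'd rather script this part, the AWS CLI can create both machines; the sketch below assumes an existing key pair and security group, and the AMI ID, names and password are placeholders rather than the values I used:

$ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m4.large \
    --key-name my-key --security-group-ids sg-xxxxxxxx --count 1
$ aws rds create-db-instance --db-instance-identifier bench-db \
    --db-instance-class db.m3.medium --engine postgres --allocated-storage 20 \
    --master-username benchmark --master-user-password change-me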
Deploy the application
You need to get the application (and all of its dependencies) to install, run and bind to a port. You also need some data to work with. Your product list may be fast when tested with ten products but slow when dealing with twelve million.
If you can, use a full (possibly anonymized) copy of your production data. Even if your code is correct, your database server's query optimizer may come up with entirely different execution plans depending on the information it has about the tables involved.
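With PostgreSQL a dump-and-restore is usually enough to get a realistic data set onto the test server (anonymization itself is a separate, schema-specific step); the database names here are placeholders:

$ pg_dump --format=custom --no-owner production_db -f production.dump
$ createdb benchmark_db
$ pg_restore --no-owner --dbname=benchmark_db production.dump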
Saleor provides a manage.py populatedb command and that’s what I’m using to prepare the data. The important part is that all tests are run against the same database.
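In practice that means running the migrations once and populating the database before the first test run; exact options may differ between Saleor versions:

$ python manage.py migrate
$ python manage.py populatedb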
I'm running CPython 2.7 as that's the version many of us are stuck with.
Make sure Django itself is fast
Disable DEBUG. I cannot stress this enough. If you conduct tests with DEBUG enabled, you're not testing the same application your customers will face.
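A minimal sketch of the relevant settings; with DEBUG disabled you also need to fill in ALLOWED_HOSTS, and the address below is just a placeholder for the test instance:

# settings.py
DEBUG = False

# With DEBUG off, Django rejects requests for hosts not listed here.
ALLOWED_HOSTS = ['1.2.3.4']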
Remove all kinds of profilers and debuggers that are not meant to be used in production environments. Opbeat and New Relic are fine; Django Debug Toolbar has to go.
Enable the cached template loader so your code does not have to spend time looking for and parsing templates.
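On the Django versions current at the time of writing this has to be switched on by hand. Here's a sketch of the relevant TEMPLATES entry, assuming the stock filesystem and app_directories loaders; keep your own DIRS and context_processors, and remember to drop APP_DIRS once you list loaders explicitly:

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'DIRS': [],
        'OPTIONS': {
            'context_processors': [
                # ... your existing context processors ...
            ],
            'loaders': [
                # Wrap the regular loaders so parsed templates are kept in memory.
                ('django.template.loaders.cached.Loader', [
                    'django.template.loaders.filesystem.Loader',
                    'django.template.loaders.app_directories.Loader',
                ]),
            ],
        },
    },
]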
Make sure that your database connections are reused when possible. If your environment permits persisting database connections, configure CONN_MAX_AGE accordingly.
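A sketch of the corresponding DATABASES entry; the credentials and host are placeholders and 600 seconds is an arbitrary choice (None keeps connections open indefinitely):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'saleor',
        'USER': 'saleor',
        'PASSWORD': 'change-me',
        'HOST': 'your-instance.rds.amazonaws.com',  # the RDS endpoint
        'PORT': '5432',
        # Reuse each connection for up to 10 minutes instead of
        # reconnecting on every request.
        'CONN_MAX_AGE': 600,
    }
}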
Commence testing
My goal here is not to tweak a working application, so instead of optimizing for a particular use case I'm going to test how different platforms withstand sustained load.
I’ve selected the default home page as it’s fairly representative. It accesses the database to fetch a list of products and it makes use of a number of templates to display them.
I'm using wrk as my tool of choice. 10 concurrent connections should be enough to test anything under 100 requests per second, but I'm using 50 concurrent connections to see what effect the extra load has on latency. All tests were repeated until consecutive runs gave the same results.
Here’s the test command I used:
$ wrk http://1.2.3.4:8000/ -c 50 -t 4 -d 30s --latency --timeout 30s

Good ol' Gunicorn
For a number of years Gunicorn has been the most popular way of getting Django up and running. So popular in fact that many professionals started to treat it like a toy. Is it really slow though?

As you can see, running threaded is probably not what you want. I've tested Gunicorn with various numbers of threads and processes before settling on 10 and 4 respectively.
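For reference, the two setups boil down to command lines roughly like these; the saleor.wsgi:application module path is an assumption based on the project name, and the exact process/thread split for the threaded run is my reading of the numbers above:

# threaded: a single process with 10 worker threads (hypothetical split)
$ gunicorn saleor.wsgi:application --bind 0.0.0.0:8000 --workers 1 --threads 10
# multi-process: 4 synchronous workers
$ gunicorn saleor.wsgi:application --bind 0.0.0.0:8000 --workers 4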
As expected, doubling the number of processes roughly doubles the throughput, as the machine has two CPU cores. More processes result in slightly better performance but, as processes cannot share state, also higher memory use. Maintaining a high process count is not recommended, as it introduces a thundering herd problem: each incoming connection wakes up all of the workers even though only one of them will be able to accept it.
What about the latency?

Keep in mind that with concurrency this high, the latency is much higher than what you'd expect to see during normal traffic. We're not comparing absolute values here but rather the relative ability to handle congestion. It seems the multi-process setup is the clear winner here.
uWSGI: what the pros use
If your application sees a lot of traffic then you're certainly no stranger to uWSGI. Its strictly-professional status is reinforced by its impenetrable documentation.
Chances are you've also opted to run it with the gevent backend. Because gevent is fast, right?
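For the record, the configurations being compared look roughly like the invocations below; the module path is again an assumption, the worker counts are placeholders, and the gevent variant requires uWSGI built with gevent support:

# threaded workers
$ uwsgi --http :8000 --module saleor.wsgi --master --processes 2 --threads 10 --enable-threads
# gevent workers: each process runs an event loop serving up to 100 greenlets
$ uwsgi --http :8000 --module saleor.wsgi --master --processes 2 --gevent 100
# plain multi-process workers
$ uwsgi --http :8000 --module saleor.wsgi --master --processes 4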

Again, threads offer poor performance because their parallel execution is limited by the GIL. Unfortunately, greenlets don't seem to do much better. Why? Django views are a mixed bag.
Database access is clearly I/O-bound. This is where the cooperative multitasking model of greenlets gives us an advantage over the preemptive multitasking of threads. Both are bound by the GIL, but with an extra CPU core to spare the greenlet pool is more efficient when it comes to switching tasks. With both CPU cores saturated at 100%, the threads double their processing power while for greenlets it barely makes a difference.
This is because rendering templates is where Django becomes CPU-bound. Event-loop-based worker pools are not particularly good at dealing with CPU-bound tasks, which is why greenlets can only help you so much. In a future article I may be able to demonstrate this better by comparing the performance of Django templates to that of Jinja2.
The latency comparison looks similar:

As you can see, with 4 processes both the throughput and the latency match those of Gunicorn. Whil