Creating an implementation for a dynamic language using just-in-time (JIT) compilation techniques involves a lot of compromises, mainly between the complexity of the implementation, speed, warm-up time and memory usage.
Speed especially is a difficult trade-off, because it's very easy to end up spending more time optimizing a piece of code and emitting the assembly than we will ever be able to save compared to just executing it in a less optimized way.
This causes most JIT language implementations to use an approach of multiple tiers for running the code, with a different amount of optimization applied depending on how often the specific piece of code gets executed. This reduces the chance that more time will get spent transforming the code into a more efficient representation than it would take to execute it in a less efficient one.
baseline just in time compiler

We noticed that our interpreter interprets code quite slowly, while the LLVM tier takes a lot of time to JIT (even with the object cache, which made it much faster), so it was obvious that we either had to speed the interpreter up or introduce a new tier in between. There are well-known problems with our interpreter: mainly, it's slow because it does not represent the code in a contiguous block of memory (bytecode) but instead involves a lot of pointer chasing, because we reuse our AST nodes. Fixing this would be comparably easy, but we thought it would only improve the performance a little bit and would not give us the performance we want.
About a year ago we introduced a new execution tier instead: the baseline JIT (bjit). It is used for Python code which is executed a medium number of times, and therefore lives between the interpreter and the LLVM JIT tier. In practice this means most code which executes more than 25 times will currently end up in the bjit, and if it gets executed more than about 2500 times we will recompile it using the LLVM tier.
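Conceptually, the tier selection is just a per-code-block execution counter. Here is a minimal C++ sketch of the idea; the names are hypothetical and only the thresholds (~25 and ~2500) come from the numbers above:

    // Illustrative counter-based tiering; not Pyston's actual API.
    struct CodeBlock {
        long times_executed = 0;
        void* llvm_code = nullptr;   // machine code from the LLVM tier, if any
    };

    // Hypothetical helpers standing in for the real tiers:
    void* compileWithLLVMTier(CodeBlock* code);
    void executeMachineCode(void* code);
    void runInBaselineJIT(CodeBlock* code);
    void interpret(CodeBlock* code);

    void runCode(CodeBlock* code) {
        ++code->times_executed;
        if (code->times_executed > 2500 && !code->llvm_code)
            code->llvm_code = compileWithLLVMTier(code);  // hot: pay for the best code
        if (code->llvm_code) {
            executeMachineCode(code->llvm_code);
        } else if (code->times_executed > 25) {
            runInBaselineJIT(code);   // warm: cheap to generate, reasonable code
        } else {
            interpret(code);          // cold: no compilation cost at all
        }
    }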
The main goal of the bjit is to generate reasonable machine code very fast, making heavy use of inline caches to get good performance (more on this further down). It involved a number of design decisions (some may change in the future), but this is what we currently ended up with:
- reuse our inline cache mechanism
  - it transforms the bjit from only being able to remove the interpretation overhead (which is quite low for Python; it depends on the workload, but is probably not more than 20%) into a JIT which is actually able to improve the performance by a much larger factor
- generate machine code one basic block at a time
  - only generating code for blocks which actually get executed reduces the time to generate code and the memory usage, at the expense of not being able to do optimizations across blocks (at the moment)
- highly coupled to the interpreter, using the same frame format
  - makes it very easy and fast to switch between the interpreter and the bjit at every basic block start
  - we can fall back to the interpreter for blocks which contain operations we are unable to JIT, or for blocks which are unreasonable to JIT because they may be very large and generating code for them would cost too much memory
  - makes it easy to tier up to the bjit when we interpret a function which contains a loop with a large number of iterations
- does not use type analysis, and all code it generates makes no assumptions about types
  - this makes it always safe to execute code in the bjit
  - type-specific code is only inside the ICs and always contains a call to a generic implementation in case the assumptions don't hold
  - all types are boxed / real Python objects
- collects type information which we will use in the LLVM tier to generate more optimized code later on, if the function turns out to be hot
  - if an assumption in the LLVM tier turns out to be wrong, we will deoptimize to the interpreter/bjit

Inline Cache

The inline cache mechanism is used in the LLVM tier and in the baseline JIT, and is currently responsible for most of the performance improvements over the CPython interpreter (which does not use this technique). It removes most of the dynamic dictionary lookups and additional branching which a "normal" Python interpreter often has to do. For every operation where we can use ICs, we will provide a block of memory and fill it with a lot of nops and a call to the generic implementation of the operation. Therefore, the first time we execute the code we will call into the generic implementation, but it will trace the execution of the operation using the arguments supplied. It then fills the block of memory with a more optimized, type-specific version of the operation which we can use the next time we hit this IC slot, if the assumptions the trace made still hold.
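The mechanism can be modeled in a few lines of C++: a slot that starts out pointing at the generic implementation, which traces the call and installs a specialized, guarded version. This is only a toy model; the real ICs rewrite a block of machine code (the nops) in place rather than a function pointer:

    #include <cstdio>

    enum Type { INT, STR };
    struct Box { Type type; long ival; };   // toy boxed object

    typedef Box (*BinopSlot)(Box, Box);
    static Box genericAdd(Box a, Box b);    // forward declaration
    static BinopSlot slot = genericAdd;     // the IC starts out fully generic

    // Specialized version produced by the trace, guards included:
    static Box tracedIntAdd(Box a, Box b) {
        if (a.type != INT || b.type != INT)
            return genericAdd(a, b);        // guard failed: fall back
        return Box{INT, a.ival + b.ival};   // fast path
    }

    // Always-safe generic implementation; as a side effect it "patches"
    // the slot, which models rewriting the machine code in place:
    static Box genericAdd(Box a, Box b) {
        if (a.type == INT && b.type == INT) {
            slot = tracedIntAdd;            // fill the IC for the next call
            return Box{INT, a.ival + b.ival};
        }
        // ... all other type combinations handled generically ...
        return Box{STR, 0};
    }

    int main() {
        Box r1 = slot(Box{INT, 1}, Box{INT, 1});   // calls genericAdd, which traces
        Box r2 = slot(Box{INT, 2}, Box{INT, 3});   // now hits tracedIntAdd directly
        printf("%ld %ld\n", r1.ival, r2.ival);     // prints "2 5"
    }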
Here is a simple diagram of how an IC with two slots could look:

A simple example will make it easier to understand what we are doing.
For the Python function:

    def f(a, b): return a + b

the CFG will look like this:

    Block 0 'entry'; Predecessors: Successors:
    #0 = a
    #1 = b
    #2 = #0 + #1
    return #2

We will now look at the IC for #2 = #0 + #1.
For example, if we call f(1, 1) for the first time, the C++ function binop() will trace the execution and fill the memory block with code that does an addition between two Python int objects (it uses a C++ helper function called intAddInt()):
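In C++-flavored pseudocode, the two slots now behave roughly like this (a schematic of what the emitted machine code does; Box, int_cls, binop() and intAddInt() follow the naming used in this post, but the signatures and guards are simplified):

    // Schematic C++ model of the IC after tracing f(1, 1).
    struct BoxedClass;
    struct Box { BoxedClass* cls; };      // every object starts with its class
    extern BoxedClass* int_cls;

    Box* binop(Box* lhs, Box* rhs);       // generic, always-safe implementation
    Box* intAddInt(Box* lhs, Box* rhs);   // helper for int + int

    Box* slot2(Box* lhs, Box* rhs) {
        // second slot: still empty, so it just calls the generic version
        return binop(lhs, rhs);
    }

    Box* slot1(Box* lhs, Box* rhs) {
        // guards: the trace is only valid for an unmodified int + int
        if (lhs->cls != int_cls || rhs->cls != int_cls)
            return slot2(lhs, rhs);       // assumptions don't hold: next slot
        return intAddInt(lhs, rhs);       // fast path filled in by the trace
    }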

Notice the guard comparisons inside the first IC slot: they make sure that we will only use the more optimized implementation of the operation if it's safe to do so (in this case, if the arguments have the same types as in the trace and the types did not get modified since the trace got created), and otherwise jump to the next IC slot, or, if there is no further optimized version, call the generic implementation, which is always safe to execute.
Most code is not very dynamic, which means filling in one or two slots with optimized versions of an operation is enough to catch all encountered cases.
For example, if we later on call f("hello ", "world"), we will add a new slot to the IC:
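Conceptually the slot chain now looks like this (continuing the sketch from above; str_cls follows Pyston's naming, while strAdd() is just an illustrative stand-in for whatever helper the trace actually calls):

    // Continuing the previous sketch; two new declarations:
    extern BoxedClass* str_cls;
    Box* strAdd(Box* lhs, Box* rhs);      // illustrative helper for str + str

    // Slot 2, previously just a call to binop(), now holds the new trace:
    Box* slot2(Box* lhs, Box* rhs) {
        // guards filled in by tracing f("hello ", "world")
        if (lhs->cls != str_cls || rhs->cls != str_cls)
            return binop(lhs, rhs);       // no more slots: generic fallback
        return strAdd(lhs, rhs);
    }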

We use ICs for nearly all operations, not only for binary ones like the one the example showed. We also use them for things like global scope variable lookups, retrieving and setting attributes, and much more (we also support more than two slots). Not all traces call helper functions like we have seen in the example; some are inlined in the slot.
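As a sketch of a non-binary example, an attribute-load IC can guard on the object's class and then read the attribute from a fixed location instead of doing a dictionary lookup. The offset-based layout below is a strong simplification of how attributes are actually stored:

    // Schematic model of an attribute-load IC for "x.foo":
    struct BoxedClass;
    struct Box { BoxedClass* cls; Box** attrs; };  // simplified attribute storage
    extern BoxedClass* traced_cls;                 // class seen when tracing x.foo
    Box* getattrGeneric(Box* obj);                 // always-safe dictionary lookup

    Box* getattrSlot1(Box* obj) {
        // guard: the object still has the class (and therefore the
        // attribute layout) that the trace saw
        if (obj->cls != traced_cls)
            return getattrGeneric(obj);
        return obj->attrs[2];                      // "foo" lives at a fixed index
    }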
Pyston will overwrite already generated slots if they turn out to be invalid or unused because the assumptions of the trace don't hold anymore. Some code (luckily this is uncommon) is highly dynamic; in these cases we will try to fill in the slot with a less aggressive version if possible, one which makes fewer assumptions. If that is not possible, we will just always call the generic version (like CPython always does).
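One way to picture the invalidation part of the guards ("the types did not get modified since the trace got created") is a version counter per type: modifying the type bumps the counter, so older traces fail their guards and eventually get overwritten. This is only a model; the actual invalidation mechanism differs in the details:

    // Version-counter model of trace invalidation:
    struct BoxedClass { long version = 0; };

    struct Guard {
        BoxedClass* cls;
        long version_at_trace_time;
        bool stillValid() const {
            // a modified type (e.g. a patched method) has a newer version,
            // so the old trace must not be used anymore
            return cls->version == version_at_trace_time;
        }
    };

    void modifyType(BoxedClass* cls) {
        ++cls->version;   // invalidates every trace that guarded on this type
    }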
The code we emit inside the ICs has similar trade-offs to the bjit code: mainly, it needs to get emitted very fast. We prefer generating smaller code over faster code because of the fixed size of the inline cache: a smaller version allows us to embed more slots if necessary and trashes the instruction cache less.
lots of ideas for improvements

Both the inline cache mechanism and the bjit have a lot of room for improvement. Some of the ideas we have are:
directl