In case you missed it, Marius recently wrote a post on the Pyston blog about our baseline JIT tier. Our baseline JIT sits between our interpreter tier and our LLVM JIT tier, providing better speed than the interpreter tier but lower startup overhead than the LLVM tier.
There's been some discussion over on Hacker News, and it turned to a commonly asked question: if LuaJIT can have a fast interpreter, why can't we use its ideas and make Python fast? This is related to a number of other questions, such as "why can't Python be as fast as JavaScript or Lua", or "why don't you just run Python on a pre-existing VM such as the JVM or the CLR". Since these questions are pretty common I thought I'd try to write a blog post about them.
The fundamental issue is:
Python spends almost all of its time in the C runtime. This means that it doesn't really matter how quickly you execute the "Python" part of Python. Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. Another analogy I give is that executing Python is more similar to rendering HTML than it is to executing JS -- it's more a description of what the runtime should do than an explicit step-by-step account of how to do it.
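To make the "complex opcodes" point concrete, here's a small illustration using the standard dis module (a minimal sketch; the exact opcode names vary by CPython version):

```python
import dis

def add(a, b):
    return a + b

# Disassemble the function: the entire body compiles down to roughly
#   LOAD_FAST a; LOAD_FAST b; BINARY_ADD; RETURN_VALUE
# (on CPython 3.11+ the addition shows up as BINARY_OP instead).
dis.dis(add)
```

Dispatching to that single BINARY_ADD opcode is cheap; executing it is not. Inside the C runtime it has to inspect the types of both operands, dispatch to the right `__add__`/`__radd__` implementation (giving a subclass's reflected method priority), allocate a new object for the result, and update reference counts along the way. A faster dispatch loop does nothing to speed up any of that.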
Pyston's performance improvements come from speeding up the C code, not the Python code. When people say "why doesn't Pyston use [insert favorite JIT technique here]", my question is whether that technique would help speed up C code. I think this is the most fundamental misconception about Python performance: we spend our energy trying to JIT C code, not Python code. This is also why I am not very interested in running Python on pre-existing VMs, since that will only exacerbate the problem in order to fix something that isn't really broken.

I think another thing to consider is that a lot of people have invested a lot of time into reducing Python interpretation overhead. If it really were as simple as "just porting LuaJIT to Python", we would have done that by now.
I gave a talk on this recently, and you can find the slides here and an LWN writeup here (no video, unfortunately). In the talk I gave some evidence for my argument that interpretation overhead is quite small, and some motivating examples of C-runtime slowness (such as a slow for loop that doesn't involve any Python bytecodes).
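The talk's exact example isn't reproduced here, but a loop in the same spirit is easy to construct (a minimal sketch; timings will vary by machine and CPython version):

```python
# Neither sum() nor itertools.repeat() is written in Python, so this loop
# executes essentially no Python bytecodes per iteration -- a faster
# interpreter couldn't help it. It is nonetheless far slower than the
# equivalent loop in C, because every iteration still goes through the
# C runtime: an iterator-protocol call, unboxing the integer, and
# reference-count updates.
import itertools
import time

N = 50_000_000

start = time.time()
total = sum(itertools.repeat(1, N))
print(total, "in", time.time() - start, "seconds")
```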
One of the questions from the audience was "are there actually any people who think that Python performance is about interpreter overhead?". Apparently they don't read HN :)