
A look back: How Lily has evolved over 5 years of development

This July, it will have been five years since I pushed the initial commit for Lily. Since then, I've been building and refining the language. In a couple of weeks, I'll be pushing out a new release, and I'll talk about that.

Throughout Lily's entire development, I have been the sole developer of it. I started Lily with a couple of vague ideas, and little more. At the time I had not written any code that would be going into production (and I still haven't). My biggest project was a telnet client! I don't have a formal education in CS either (I taught myself everything, much of it along the way).

Five years is a long time in a software project. This post explains why certain decisions were made and, to some degree, their tradeoffs, with only a light touch on how they were implemented. It is less about "how" and more about "why".

Failed ideas, as well as successes, are covered, in roughly chronological order. I'm going from start to finish using the log (my memory is fuzzy in some areas). My hope is that this will clarify some of the design decisions that I've made. However, since this is a recap of five years of work, bear in mind that there's going to be quite a bit to talk about.

Year 0: The beginning

I started off with two requirements:

The language should be statically-typed.

Static typing was a must. PHP can be transformed into C++ using Facebook's HipHop, and complete static typing would allow better optimization than a dynamic language could manage. The safety was, at the time, seen as an accidental benefit.

But I don't want a compiled language. I want people to have a script and just run it, just like with PHP. Turn-around time wasn't a big deal back then. It was more about avoiding any kind of compiler, and eventually getting a mod_lily that could run on Apache.

There's also a different expectation when building an interpreter versus building a compiler. Build a compiler, or a compiler frontend, and you inevitably deal with this block of code not compiling on some hardware, or that one not optimizing correctly. An interpreter, I assumed, would be a lot easier: I build a vm, and it's portable. I'd build my own stuff too, so that I would have maximum control.

The language should be able to have 'code between tags', in the same vein as PHP.

That, I feel, is a big selling point for PHP. Maybe you don't want a framework to build something: you know a little HTML, or you only -need- a little bit of code here and there. You can do that. Maybe you do want a framework. Okay, you can have pages that are a single tag, and that tag can call your framework function.

Design

The interpreter needed some division, some idea of barriers between different parts, to make sure it didn't end up with the parser slowly taking over everything.

Lexer: Do the scanning, and the template handling code.

Parser: Be the brains, control everything.

Symtab: Hold builtin classes.

Ast: Handle expressions.

Emitter: Expression walking, and block entry/exit.

Vm: Do things.

Raiser: Handle errors.

Many important decisions were made, some good, some less so.

The syntax tree pool was created.

The pool was responsible for managing the current expression, the root (the one with the lowest precedence), and handling the actual merging. Rather than creating new trees for each new expression, every full expression would finish by clearing the pool. In doing so, Lily only kept tree information for the expression currently being run.

Having a pool to manage expressions also meant a huge reduction in the amount of memory a long-lived program would take.
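In rough C, the idea looks something like this (the names are invented for illustration, not Lily's actual structures):

#include <stdlib.h>

/* A sketch of the idea, not Lily's real code. */
typedef struct ast_ {
    int tree_type;
    struct ast_ *left;
    struct ast_ *right;
    /* ...value, result location, and so on... */
} ast;

typedef struct {
    ast **trees;  /* every tree ever made, kept around for reuse */
    int pos;      /* next unused tree for the current expression */
    int size;
    ast *root;    /* the tree with the lowest precedence */
    ast *active;  /* the tree currently being merged against */
} ast_pool;

/* Hand out a tree, growing the pool only when it runs dry. */
static ast *pool_get(ast_pool *p)
{
    if (p->pos == p->size) {
        int new_size = (p->size == 0) ? 4 : p->size * 2;
        p->trees = realloc(p->trees, new_size * sizeof(ast *));
        for (int i = p->size; i < new_size; i++)
            p->trees[i] = malloc(sizeof(ast));
        p->size = new_size;
    }
    return p->trees[p->pos++];
}

/* Called after a full expression is emitted: the trees stay allocated,
   but the next expression starts from the beginning of the pool. */
static void pool_reset(ast_pool *p)
{
    p->pos = 0;
    p->root = NULL;
    p->active = NULL;
}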

Who determines the correctness of an expression?

Initially, the parser verified the form, the syntax tree verified the argument count of a function, and the emitter verified arguments when calling a function.

This was later refined, so that emitter would verify arguments and types.

The idea became more clear:

Parser checks expression form, the pool manages merging, and emitter does verification.

The basic types:

They are: int, number, str, and list. The names were chosen for reasons that I do not recall. I think that, at the time, I disliked camel-case names for classes because it felt too Java.

Who has type information?

Initially, there were no generics. It made sense to have the parser completely unaware of type information. Parser's job was to build up expressions and go through blocks: verify form, but not content. Often, even in the beginning with parenthetical expressions, it was difficult to figure out the type of some expression without walking the tree.

Once the tree is walked, of course, then you've kind of done emitter's job. So emitter has type information, because it uses that to make sure that you don't, say, add a string to an integer.

The vm, initially, had type information as well. Early on, it was easy to do that because there were no generics. It made debugging easier to have type information in the vm.

So parser has no type information, emitter does, and so does the vm.

What do opcodes look like?

When it comes to opcodes, there are two different approaches you can take:

You can have small opcodes, and a stack, like this:

push r1

push r2

add # Which pops 2, and pushes back 1

Or you can have something like this:

add r1, r2, r3 # Add r1 and r2, store into r3.

wherein r1, r2, and r3 are variables that are local to the current frame of execution.

Both strategies have their advantages. I ended up doing something like this:

add &r1, &r2, &r3

Clever me said that this strategy was a great idea, because instead of creating different sets of locals for each frame, I'd have a pool of them. I'd be pulling symbols directly from symtab, and also intermediates from emitter and writing their addresses directly into the opcode stream.

This meant that opcodes were initially 64 bits wide, and that function calls required saving and restoring registers to avoid corruption. But calls of any sort were not going to happen in year 0, so it wasn't a problem yet.

It was not a good decision. A lot of pain was felt trying to do saving and restoring.
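To make the tradeoff concrete, here's roughly what the two add variants look like in C (layout and names invented for illustration). Baking addresses in means the operands are shared across calls, which is exactly why the save/restore dance was needed; frame-local indexes avoid it.

#include <stdint.h>

/* The variant the interpreter started with: pointer-wide code words, with
   the addresses of the operands written straight into the opcode stream. */
static void add_with_baked_addresses(uintptr_t *code)
{
    int64_t *lhs = (int64_t *)code[1];
    int64_t *rhs = (int64_t *)code[2];
    int64_t *result = (int64_t *)code[3];

    *result = *lhs + *rhs;
    /* Since lhs/rhs/result point into one shared pool, a call that reuses
       those slots has to save them first and restore them afterward. */
}

/* The frame-local alternative: operands are indexes into the registers
   belonging to the current call frame, so nothing needs saving. */
static void add_with_frame_registers(uint16_t *code, int64_t *frame_regs)
{
    frame_regs[code[3]] = frame_regs[code[1]] + frame_regs[code[2]];
}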

What about debugging instructions? I created an internal function to give a dump of what the internal structure of opcodes looked like. It allowed me to determine whether something was the result of the vm going wrong, or of the emitter going wrong. To aid with line numbering, I decided to put line numbers at the front of every opcode that I determined needed one. I didn't want to create a line table like Lua did, because it was easier to say that the line number is at [1] relative to any opcode. No need to do any special lookups.
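A toy version of that dump, with invented opcode names, looks something like this; the fixed line number slot means the dumper (and the error machinery) never needs a separate table:

#include <stdio.h>
#include <stdint.h>

enum { o_integer_add = 1 /* , ... */ };

/* Walk the code stream and print each opcode. The line number always sits
   at [1] relative to the opcode, so reading it back is trivial. */
static void dump_code(uint16_t *code, int len)
{
    int i = 0;
    while (i < len) {
        switch (code[i]) {
        case o_integer_add:
            /* layout: [opcode, line, lhs, rhs, result] */
            printf("[%d] (line %d) integer_add %d, %d -> %d\n",
                   i, code[i + 1], code[i + 2], code[i + 3], code[i + 4]);
            i += 5;
            break;
        /* ...one case per opcode... */
        default:
            printf("[%d] unknown opcode %d\n", i, code[i]);
            return;
        }
    }
}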

How should the structure of the interpreter be laid out?

Initially, data was global. However, about two months into development, a decision was made to isolate everything about the interpreter into different structures, and pass those structures down. A parse state, a lexer state, etc.

The idea came from reading about Lua doing it, and thinking about the benefits. Putting interpreter state into structures the way I did meant that multiple interpreters could exist alongside each other at the same time, so long as I didn't write any global data.
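The shape of it is roughly this (field details elided, and the names are only meant to be suggestive of the modules listed earlier):

/* Forward declarations only; the point is the shape, not the contents. */
typedef struct lily_lex_state_  lily_lex_state;   /* source, current line, label buffer */
typedef struct lily_emit_state_ lily_emit_state;  /* blocks, storages, patches */
typedef struct lily_symtab_     lily_symtab;      /* classes, vars, literals */
typedef struct lily_vm_state_   lily_vm_state;    /* registers, call frames */
typedef struct lily_raiser_     lily_raiser;      /* error raising/handling */

/* Everything hangs off of one parse state that gets passed down, so two
   interpreters can live in the same process without stepping on each other. */
typedef struct lily_parse_state_ {
    lily_lex_state *lex;
    lily_emit_state *emit;
    lily_symtab *symtab;
    lily_vm_state *vm;
    lily_raiser *raiser;
} lily_parse_state;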

What about allocation failures? The strategy, in the beginning, was to try to free any local information, and later shut down the interpreter if any allocation failed.
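At a small scale, that strategy looks something like this sketch (invented names): free whatever the failing function had built so far, and report the failure upward so the interpreter can be torn down.

#include <stdlib.h>

typedef struct {
    int *data;
    int count;
} int_list;

/* Allocate a list; on partial failure, free what did succeed and return
   NULL so the caller can begin shutting the interpreter down. */
static int_list *new_int_list(int count)
{
    int_list *result = malloc(sizeof(int_list));
    if (result == NULL)
        return NULL;

    result->data = malloc(count * sizeof(int));
    if (result->data == NULL) {
        free(result);
        return NULL;
    }

    result->count = count;
    return result;
}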
