Deriving a Virtual Machine

After reading Atsushi Igarashi’s paper Deriving Compilers and Virtual Machines for a Multi-Level Language, I gave myself the following exercise: given an interpreter of a small language, derive a virtual machine and a corresponding compiler. Such techniques probably date back to Ager et al., where they built connections between not one but five pairs of denotational semantics and existing abstract machines (Krivine’s, CEK, CLS, SECD, and Categorical Abstract Machine). They continued to explore a number of variations in subsequent papers. Igarashi’s paper also deals with a more difficult topic — to derive an abstract machine for a language with quotes/unquotes. As a starting point to appreciating their work, I tried this small example one could do in one afternoon.

Let us start with this simple language with numerical values, a binary operator for addition, lambda abstraction and variables represented by de Bruijn notation, and application:

data Exp = Val Int
         | Var Int
         | Plus Exp Exp
         | Lam Exp 
         | App Exp Exp deriving Show

The expression (λf a b. f a + f b) ((λc d. c + d) 1) 2 3, for example, would be represented by:

  (App (App (App 
      (Lam (Lam (Lam (Plus (App (Var 2) (Var 1)) 
                           (App (Var 2) (Var 0))))))
      (App (Lam (Lam (Plus (Var 0) (Var 1))))
           (Val 1)))
    (Val 2)) (Val 3))

An expression evaluates to either a number or a closure:

data Val = Num Int
         | Clos Env Exp deriving Show

And the evaluator can be defined, in an almost standard manner, in the continuation passing style:

type Cont = Val -> Val

type Env = [Val]

eval :: Exp -> Env -> Cont -> Val
eval (Val n) _ k = k (Num n)
eval (Var n) env k = k (env !! n)
eval (Plus e1 e2) env k = 
   eval e1 env (\(Num i) -> 
        eval e2 env (\(Num j) -> 
              k (Num (i + j))))
eval (Lam e) env k =
   k (Clos env e)
eval (App e1 e2) env k =
   eval e1 env (\(Clos env' body) ->
      eval e2 env (\v -> eval body (v : env') k))

Like in many derivations, we have cheated a little from the beginning. The use of de Bruijn indices makes env act like a stack, while CPS abstracts the control flow of the virtual machine to be derived.

Defunctionalising the Continuation

The derivation relies on Reynolds’ defunctionalisation: to enumerate the instances where a higher order function is used, and represent them by first-order structures. As the first step, we aim to replace the continuation Cont, a synonym of Val -> Val, by some datatype:

data Cont = Cont 0 | Cont1 ... | Cont2 ... | ...

and to define a function appK :: Cont -> Val -> Val pattern-matching Cont and performing the actions that were to be done in eval. Where there used to be a continuation application in eval, we replace it with an application of appK:

eval :: Exp -> Env -> Cont -> Val
eval (Val n) _ k = appK k (Num n)
eval (Var n) env k = appK k (env !! n)

The case for Plus is more complicated. In the original definition:

eval (Plus e1 e2) env k = 
   eval e1 env (\(Num i) -> 
        eval e2 env (\(Num j) -> 
              k (Num (i + j))))

the red subexpression should be replaced by a first-order data structure. We call it Cont1, leaving the free variables e2, env, and k as its parameters:

eval (Plus e1 e2) env k = 
   eval e1 env (Cont1 e2 env k)

the function appK, upon matching Cont1, carries out what was supposed to be computed in eval:

appK (Cont1 e2 env k) (Num i) =
    eval e2 env (\(Num j) -> k (Num (i + j)))

However, the red subexpression should be replaced by a first order construct too. Introducing a new constructor Cont2 and abstracting over the free variables i and k, we get:

appK (Cont1 e2 env k) (Num i) =
    eval e2 env (Cont2 i k)
appK (Cont2 i k) (Num j) = appK k (Num (i + j))

The next few cases are dealt with in the same way. Eventually we get the code below. The modified parts are marked in red:

data Cont = Cont0
          | Cont1 Exp Env Cont
          | Cont2 Int Cont
          | Cont3 Exp Env Cont     
          | Cont4 Exp Env Cont  deriving Show

eval :: Exp -> Env -> Cont -> Val
eval (Val n) _ k = appK k (Num n)
eval (Var n) env k = appK k (env !! n)
eval (Plus e1 e2) env k = eval e1 env (Cont1 e2 env k)
eval (Lam e) env k = appK k (Clos env e)
eval (App e1 e2) env k =
   eval e1 env (Cont3 e2 env k)

appK :: Cont -> Val -> Val
appK Cont0 v = v
appK (Cont1 e2 env k) (Num i) =
    eval e2 env (Cont2 i k)
appK (Cont2 i k) (Num j) = appK k (Num (i + j))
appK (Cont3 e2 env k) (Clos env' body) =
    eval e2 env (Cont4 body env' k)
appK (Cont4 body env' k) v = 
    eval body (v : env') k

Separating the Evaluator from the Virtual Machine

The functions eval and appK are mutually recursive. Later, however, appK is to evolve into a part of the virtual machine and eval the compiler. We do not want to mix the two stages. The next task is therefore to separate them.

In the definition of appK, the function eval is called in three places. In all these cases, eval is applied to an expression captured in Cont. Rather than performing the application in appK, we can also make it happen earlier in eval and store the partially applied result in Cont:

eval :: Exp -> Env -> Cont -> Val
eval (Val n) _ k = appK k (Num n)
eval (Var n) env k = appK k (env !! n)
eval (Plus e1 e2) env k = eval e1 env (Cont1 (eval e2) env k)
eval (Lam e) env k = appK k (Clos env (eval e))
eval (App e1 e2) env k =
   eval e1 env (Cont3 (eval e2) env k)

appK :: Cont -> Val -> Val
appK Cont0 v = v
appK (Cont1 c2 env k) (Num i) = 
    c2 env (Cont2 i k)
appK (Cont2 i k) (Num j) = appK k (Num (i + j))
appK (Cont3 c2 env k) (Clos env' cb) =
    c2 env (Cont4 cb env' k)
appK (Cont4 cb env' k) v = 
    cb (v : env') k 

The datatypes Cont and Val no longer store expressions, but partially applied instances of eval. Define:

type Compt = Env -> Cont -> Val

The datatypes are updated to:

data Val = Num Int
         | Clos Env Compt

data Cont = Cont0
          | Cont1 Compt Env Cont
          | Cont2 Int Cont
          | Cont3 Compt Env Cont     
          | Cont4 Compt Env Cont

Notice that while Env is a data stack, Cont, having a list-like structure, acts as a control stack. Constructors Cont1, Cont3, and Cont4 store a pointer (Env) into the data stack. What about Compt? As we will soon see, by applying another defunctionalisation, Compt becomes a list of instructions. The pointer in Cont with type Compt can therefore be seen as an instruction pointer.

Defunctionalising the Computation

By defunctionalising, we replace Compt by a first-order data structure, and create a function appC :: Compt -> Env -> Cont -> Val. This time, however, we try to choose more telling names for the introduced instructions. The first two clauses of eval, for example, is split into:

data Compt = Lit Int | Access Int | ...

eval :: Exp -> Compt
eval (Val m) = Lit m
eval (Var n) = Access n

appC :: Compt -> Env -> Cont -> Val
appC (Lit m) env k = appK k (Num m)
appC (Access n) env k = appK k (env !! n)

Examining the more interesting case eval (Plus e1 e2) env k = eval e1 env (Cont1 (eval e2) env k), we realise that we only have to abstract the two calls to eval as parameters:

data Compt = Lit Int | Access Int | Push Compt Compt ...

eval (Plus e1 e2) = Push1 (eval e1) (eval e2)

appC (Push1 is1 is2) env k = appC is1 env (Cont1 is2 env k)

The transformed program is given below. We also rename the constructors of Cont to more telling names, roughly following the naming by Igarashi :

data Cont = Halt
          | EvOpnd Compt Env Cont
          | EvSum Int Cont
          | EvArg Compt Env Cont     
          | EvBody Compt Env Cont  deriving Show

data Compt = Lit Int 
          | Access Int
          | Close Compt
          | Push1 Compt Compt
          | Push2 Compt Compt

eval :: Exp -> Compt
eval (Val m) = Lit m
eval (Var n) = Access n
eval (Plus e1 e2) = Push1 (eval e1) (eval e2)
eval (Lam e) = Close (eval e)
eval (App e1 e2) = Push2 (eval e1) (eval e2)

appC :: Compt -> Env -> Cont -> Val
appC (Lit m) env k = appK k (Num m)
appC (Access n) env k = appK k (env !! n)
appC (Close is) env k = appK k (Clos env is)
appC (Push1 is1 is2) env k = appC is1 env (EvOpnd is2 env k)
appC (Push2 is1 is2) env k = appC is1 env (EvArg is2 env k)

appK :: Cont -> Val -> Val
appK Halt v = v
appK (EvOpnd c2 env k) (Num i) =
    appC c2 env (EvSum i k)
appK (EvSum i k) (Num j) = appK k (Num (i + j))
appK (EvArg c2 env k) (Clos env' cb) =
    appC c2 env (EvBody cb env' k)
appK (EvBody cb env' k) v = 
    appC cb (v : env') k

Now, functions appC and appK together constitute a virtual machine, while eval compiles a program to its machine code Compt. The sample expression (λf a b. f a + f b) ((λc d. c + d) 1) 2 3, for example, is compiled to:


Push2 
  (Push2 
    (Push2 
      (Close 
        (Close
          (Close 
            (Push1 
              (Push2 
                (Access 2)
                (Access 1))
              (Push2 
                (Access 2)
                (Access 0))))))
      (Push2 
        (Close 
          (Close 
            (Push1 
              (Access 0)
              (Access 1))))
        (Lit 1)))
    (Lit 2))
  (Lit 3)

Once you understand how it works, the derived result is not so surprising. Igarashi sort of mentioned that it is not too difficult to add recursion. See his paper for more interesting results, such as extension to multi-level languages — programs that generate programs.

5 thoughts on “Deriving a Virtual Machine

  1. Christophe Poucet

    Interesting article :) I have found it very informational to have a better understanding how these different techniques work. It’s usually easy to roll an interpreter but going from there to a compiled system is not always evident and your article illustrates the different steps in a crystal-clear fashion.

    Thanks!

    P.S: I do think you have a small typo the first time you write the following:
    appK (Cont1 e2 env k) (Num i) =
    eval e2 env (Cont i k)

    should be

    appK (Cont1 e2 env k) (Num i) =
    eval e2 env (Cont2 i k)

    Reply
  2. MigMit

    Sorry, but… I don’t see the point. Let’s look at our data types:

    data Exp = Val Int | Var Int | Plus Exp Exp | Lam Exp | App Exp Exp
    data Compt = Lit Int | Access Int | Push1 Compt Compt | Close Compt | Push2 Compt Compt

    Isn’t that exactly the same thing (up to renaming of constructors)? Moreover, if we identify this two types, then eval seems to be an identity – OK, it forces almost everything, but that’s all id does. I don’t see, why you call id a “compiler”.

    Reply
  3. Dave

    After a quick skim of the Igarashi paper, I think the difference is they produce linear Compt terms — and it’s this linearization of the initial tree-structured term that prevents eval from trivializing to the identity.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>