Github Classroom - Direct Github
In this assignment you’ll implement a small language called Boa, which implementes a Bitwise Offset Arrangement of different values. It also uses C function calls to implement some user-facing operations, like reading an input and reporting errors.
As usual, there are a few pieces that go into defining a language for us to compile.
A description of the concrete syntax – the text the programmer writes
A description of the abstract syntax – how to express what the programmer wrote in a data structure our compiler uses.
The semantics—or description of the behavior—of the abstract syntax, so our compiler knows what the code it generates should do.
The concrete syntax of Boa is:
<expr> :=
| (let (<bindings>) <expr>)
| (if <expr> <expr> <expr>)
| <identifier>
| <number>
| true
| false
| (add1 <expr>)
| (sub1 <expr>)
| (isNum <expr>)
| (isBool <expr>)
| (+ <expr> <expr>)
| (- <expr> <expr>)
| (* <expr> <expr>)
| (< <expr> <expr>)
| (> <expr> <expr>)
| (== <expr> <expr>)
<bindings> :=
| (<identifier> <expr>)
| (<identifier> <expr>) <bindings>
The abstract syntax of Boa is an OCaml datatype, and corresponds nearly one-to-one with the concrete syntax.
type prim1 =
| Add1
| Sub1
| IsNum
| IsBool
type prim2 =
| Plus
| Minus
| Times
| Less
| Greater
| Equal
type expr =
| ELet of (string * expr) list * expr
| EIf of expr * expr * expr
| EId of string
| ENumber of int
| EBool of bool
| EPrim1 of prim1 * expr
| EPrim2 of prim2 * expr * expr
There are four main changes that ripple through the implementation:
The representation of values requires a definition. We’ll use the following representations for the Boa runtime:
true
will be represented as the constant 0x0000000000000002
false
will be represented as the constant 0x0000000000000000
1
in the rightmost bit, with the actual two’s complement
value shifted to the left by one (as in class).
So, for example, 2
is represented as 0x0000000000000005
, -5
is represented as 0xFFFFFFFFFFFFFFF7
. The representable values are -(262) to 262-1, which correspond to the 63-bit signed 2s-complement integers.If integers exceed the range of representable values, two types of errors must be handled:
Unable to parse non-representable integer literal error
For example, the following Boa program
9999999999999999999999999999
should throw Non-representable number 9999999999999999999999999999
before compilation.
Runtime overflow error
For example, the following Boa program
(+ 4611686018427387903 10)
should throw overflow
at runtime.
You will implement a pre-defined variable input
, which should evaluate to a
user-provided inputted value, which can be an integer number, true, or false.
You need to parse the input value in main.c
and check for two parsing errors:
input must be a boolean or a number
input is not a representable number
Keep in mind that the exact error strings should be printed to standard error.
You can use a call like fprintf(stderr, "msg", ...)
.
If an argument isn’t provided to the executable, the default value of input
should be false
.
After the input value is parsed, it will be passed to our_code_starts_here
as a function
argument. In x86-64, this means it’s passed in the register RDI
. To make
input
variable accessible to Boa, we treat it as a variable, which means it
needs to take up some space on the stack. We’ve shown one way for you to do
this in compile.ml
, by moving it into the first available location on the
stack, adding input
to the initial environment, and using a higher initial
stack index.
Note that the input value is provided as a command-line argument to the generated executable.
For example, ./output/somefile.run 4
makes it so input
evaluates to 4.
For example, ./output/somefile.run 46abc
prints error input must be a boolean or a number
in stderr.
For example, ./output/somefile.run 9999999999999999
prints error input is not a representable number
in stderr.
Production compilers typically separate checking from code generation rather
than interleaving them (as we did in Anaconda). To support this, Boa’s
implementation comes with a function, check
, whose purpose is to check your
program before compilation. check
should to return a list of strings
containing an exact series of error messages. We will be testing against this
function so be sure to match the expected output.
The errors you will need to check here are:
Unbound Identifier ie. EId
not in the scope of an ELet
Error string = “Variable identifier {id name} unbound”
Multiple bindings in the same let binding list: ELet([("x", ENumber(2)); ("y", ENumber(5)); ("x", ENumber(5))], EId(x))
Error string = “Multiple bindings for variable identifier {id name}”
Note that once an error is found, it should be added to the return string list and check should still continue to find any other errors.
We will be also asking you to handling errors during the runtime of the program. The errors you will need to check here are:
-
, +
, *
, <
, and >
should raise an error (by printing it out) with
the substring "expected a number"
if the operands don’t evaluate to numbers
(you can print more than this if you like, but it must be a substring)add1
and sub1
should raise an error with the substring "expected a
number"
if the argument doesn’t evaluate to a numberif
should raise an error with the substring "expected a boolean"
if the
conditional does not evaluate to a boolean.+
, -
, and *
should raise an error with the substring "overflow"
if
the result overflows, and falls outside the range representable in 63 bits.
The jo
instruction (not to be confused with the Joe Instructor) which
jumps if the last instruction overflowed, is helpful here.These error messages should be printed on standard error, so use a call like:
fprintf(stderr, "Error: expected a number")
in the case of
(if 54 true false)
# prints (on standard error) something like:
Error: expected a boolean in if, got 54
I recommend raising an error by adding some fixed code to the end of your
generated code that calls into error functions you implement in main.c
. For
example, you might insert code like:
internal_error_non_number:
mov rdi, rax
push 0
call error_non_number
Which will store the value in rax
in rdi
(the register for the first
argument), move rsp
appropriately (see the Oct 10 lecture for some discussion of stack alignment), and perform a jump into
error_non_number
function, which you will write in main.c
as a function
of one argument. If you look closely you may notice that we aren’t updating
RSP for our local vars, but we still call another function. This overwrites
our program’s stack space but isn’t observable by our programs, since we exit
after printing errors. Future PAs will fix this hack.
ILabel of string
— Create a location in the code that can be jumped to
with jmp
, jne
, and other jump commands
Example: this_is_a_label:
ICmp of arg * arg
— Compares the two arguments for equality. Set the
condition code in the machine to track if the arguments were equal, or if
the left was greater than or less than the right. This information is used
by jne
and other conditional jump commands.
Example: cmp [rsp-4], 0
Example: cmp rax, [rsp-8]
IJne of string
— If the condition code says that the last comparison
(cmp
) was given equal arguments, do nothing. If it says that the last
comparison was not equal, immediately start executing instructions from
the given string label (by changing the program counter).
Example: jne this_is_a_label
IJe of string
— Like IJne
but with the jump/no jump cases reversed
IJmp of string
— Unconditionally start executing instructions from the
given label (by changing the program counter)
Example: jmp always_go_here
Ijl of string
- Jump if the last comparison said the first value was less
than the second. Otherwise do nothing.
Ijg of string
- Jump if the last comparison said the first value was greater
than the second. Otherwise do nothing.
IJo of string
- Jump if the last operation set the overflow condition code,
otherwise do nothing.
IJno of string
- As above, but only jump when the condition code is not set.
cmp
and Jumps for IfWhen compiling an if expression, we need to execute exactly one of the branches (and not accidentally evaluate both!). A typical structure for doing this is to have two labels: one for the else case and one for the end of the if expression. So the compiled shape may look like:
cmp rax, 0 ; check if rax is equal to 0
je else_branch
; commands for then branch go here
jmp end_of_if
else_branch:
; commands for else branch go here
end_of_if:
Note that if we did not put jmp end_of_if
after the commands for the then
branch, control would continue and evaluate the else branch as well. So we
need to make sure we skip over the else branch by unconditionally jumping to
end_of_if
.
When creating labels, we can’t simply use the same identifier
names and label names over and over. The assembler will get confused if we
have nested if
expressions, because it won’t know which end_of_if
to jmp
to, for example. So we need some way of generating new names that we know
won’t conflict.
You’ve been provided with a function gen_temp
(meaning “generate
temporary”) that takes a string and appends the value of a counter to it,
which increases on each call. You can use gen_temp
to create fresh names
for labels and variables, and be guaranteed the names won’t overlap as long as
you use base strings don’t have numbers at the end.
For example, when compiling an if
expression, you might call gen_temp
twice, once for the else_branch
label, and once for the end_of_if
label.
This would produce output more like:
cmp rax, 0 ; check if rax is equal to 0
je else_branch1
; commands for then branch go here
jmp end_of_if2
else_branch1:
; commands for else branch go here
end_of_if2:
And if there were a nested if expression, it might have labels like
else_branch3
and end_of_if4
.
Sized
You may run into errors that report that the size of an operation is ambiguous. This could happen if you write, for example:
cmp [rbp-8], 0
Because the assembler doesn’t know if the program should move an eight-byte
zero, a one-byte zero, or something in between into memory starting at
[rbp-8]
. To solve this, you can supply a size:
cmp [rbp-8], DWORD 0
This tells the assembler to use the “double word” size for 0, which
corresponds to 64 bits (there are other sizes; for now we’ll typically
work in 64-bit-sized chunks). To get a sized argument, you can use the
Sized
constructor from arg
.
Const64
Just like Boa, OCaml uses 63-bit integers with a tag bit for its default
int
type. This means that OCaml can’t represent 64-bit constants like
our representation of -1
or 2^62-1
when they are encoded. OCaml does
support 64-bit integers, though, with its int64
type
(just like C has different-sized numbers). Const64
takes one of these
so that you can represent the big constants we need in an instruction
In Boa, the main place you’ll likely need this is in ENumber
, and other
than that you won’t need to construct such large numbers in OCaml.
An example that compares Int64
to regular int
is:
# Int64.mul (Int64.of_int 4611686018427387903) (Int64.of_int 2);;
- : int64 = 9223372036854775806L
# 4611686018427387903 * 2;;
- : int = -2
An example of constructing a Const64
is:
Const64(Int64.mul (Int64.of_int 4611686018427387903) (Int64.of_int 2)))
HexConst64
Sometimes it’s nice to read things in hex notation as opposed to decimal
constants. This is like Const64
but prints its value in hex notation
instead of decimal.
IPush
, IPop
These two instructions manage values on the stack. push
decrements rsp
by 8 and sets the given value at updated address in rsp
. pop
retrieves the value at rsp
into the given register and increments rsp
by 8.
ICall
A call does two things:
- Pushes the next _code_ location onto the stack (just like a `push`),
which becomes the return pointer
- Performs an unconditional `jmp` to the provided label
IShr
, ISar
, IShl
: Bit shifting operations
IAnd
, IOr
, IXor
: Bit masking operations
near
: Perform a near jump rather than the default short jump. The default short
jump is too close to be useful during compilation. We have already
filled out jump instructions so you don’t need to worry about
“short jump out of range” error.
When Boa returns, it saves the return value in register RAX. In C, return value goes into RAX
by convention. To print the return value, you need to fill out print
function. Since values
are tagged in Boa, you need to do bitwise manipulations before using printf.
For example, if the return value is 0x0000000000000002
or 0x0000000000000000
, you should print true\n
or false\n
.
If the return value is a 0xFFFFFFF7
, you should print -5\n
.
Otherwise print the value in using this printf format Unknown value: %#010x\n
.
well_formed_e
is supposed to return a list of static errors from your source program found during compilation. This makes it so you can collect them all in a list without just using failwith
at the first error.
check
is meant to failwith the errors gathered in well_formed_e
so compilation terminates and you can see what errors your source program had.
You can assume that an id is a valid string of form [a-zA-z][a-zA-Z0-9]*
. You will, however, have to check that the string does not match any of the language’s reserved words.
numbers are the binary representation shifted left by 1 bit with the lower bit set as 1
Value | Representation (Binary) |
---|---|
3 |
[0b0000000000000000000000000000000000000000000000000000000000000111] |
5 |
[0b0000000000000000000000000000000000000000000000000000000000001011] |
12 |
[0b0000000000000000000000000000000000000000000000000000000000011001] |
42 |
[0b0000000000000000000000000000000000000000000000000000000001010101] |
Or in HEX,
Value | Representation (HEX) |
---|---|
3 |
[0x0000000000000007] |
5 |
[0x000000000000000b] |
12 |
[0x0000000000000019] |
42 |
[0x0000000000000055] |
These are the same as they were for Anaconda. Your tests should
focus on t
tests. In addition, to test command-line argument,
you can call t_i
. To test command-line argument parsing errors,
use terr_i
. To test parser, use t_parse
.
An old friend is helpful here, too: gdb
(or lldb
on Mac OSX).
You can run gdb output/some_test.run
in order to get a little more feedback
on tests that fail with -10
as their exit code (which usually indicates a
segfault). This can sometimes tip you off quite well as to how memory is off,
since sometimes you’ll see code trying to jump to a constant that’s in your
code, or other information about the specific instruction that fails.
You can use set disassembly-flavor intel
to have gdb display assembly in the
intel syntax we use, and otherwise debug assembly as usual in gdb.