Tuesday, 27 August 2013

TNDJG:0004: I converted my processor design to Bluespec and it went more than twice as fast!


The Cake ML Project, https://cakeml.org/, is developing a fully-verified ML system that includes the hardware, run-time system and compiler. An important hardware component is the processor that executes the bytecode.
 
Last year I wrote an implementation of the  processor in standard Verilog and used it for a couple of demos.  The processor is a simple stack machine with the top three items from the stack held in on-chip registers.  The other registers are the stack pointer, the program counter and the free space pointer.


Overview block diagram of the system:


My Verilog implementation was 600 lines long (the module HWML_CPU_CORE in cpucore0.zip). 

Last week I manually converted the processor core from Verilog to Bluespec with the help of a few emacs regexps (the file Hwmlcore.bsv in bsvcore0.zip). It came out shorter at just over 400 lines, but most surprisingly, its instructions-per-clock metric had more than doubled! And interestingly, the design had returned to about 700 lines of RTL at the output of the Bluespec compiler (toy-bluespec-compiler).

Both implementations were written fairly quickly with no special attention to performance.  Being a direct implementation of a simple stack machine, the main performance bottleneck should be the stack memory read/write port.  For instructions that pop two items from the stack, such as JNZ, two stack RAM reads are needed.

The test application program I was using was a simple, massively recursive implementation of Fibonacci implemented by Magnus Myreen.  The difference in performance is apparent from the execution times of the first five instructions.

The CPU is released from reset after four clock cycles, each of 10 ns. The Bluespec implementation execute its fifth instruction after a further 90 ns whereas the older implementation requires 180 ns to get to the same point:

Instruction timings Bluespec version:
55: sp=0000 tos = [[ xxxxxxxx xxxxxxxx xxxxxxxx ]]  pc='h0000  ins='h4a Push 0a
75: sp=0001 tos = [[ 0000000a xxxxxxxx xxxxxxxx ]]  pc='h0001  ins='h44 Push 04
95: sp=0002 tos = [[ 00000004 0000000a xxxxxxxx ]]  pc='h0002  ins='h12 Call 4
115: sp=0002 tos = [[ 00000003 0000000a xxxxxxxx ]]   pc='h0004  ins='h0c Swap
135: sp=0002 tos = [[ 0000000a 00000003 xxxxxxxx ]]  pc='h0005  ins='h40 Push 00

Instruction timings original version:
65 sp=00000000 tos= [[ xxxxxxxx xxxxxxxx xxxxxxxx  ]] pc=10000000  i=4a 
105 sp=00000001 tos= [[ 0000000a xxxxxxxx xxxxxxxx  ]] pc=10000001  i=44 
145 sp=00000002 tos= [[ 00000004 0000000a xxxxxxxx  ]] pc=10000002  i=12 CALL
185 sp=00000002 tos= [[ 10000003 0000000a xxxxxxxx  ]] pc=00000004  i=0c SWAP
225 sp=00000002 tos= [[ 0000000a 10000003 xxxxxxxx  ]] pc=00000005  i=40 

Overall, using Bluespec, the 2831 instructions consumed 6804 clocks, an IPC of 0.42 compared with the IPC 0.16 achieved in the original design.

Analysis

A careful look at the fetch/execute control logic and memory interface design in each of the two variants reveals that the original design had an extra two clock-cycles' worth of pipeline delay.  A little more care over the manual design could perhaps have avoided this.  I was rather hoping the Bluespec compiler would have done something fancy, such as merging the actions of several rules into a single clock cycle.  But nonetheless, it is clear that the Bluespec version is shorter, easier to read and modify and gave better performance.


David Greaves - August 2013. 

Wednesday, 12 June 2013

TNDJG:0003 Hello World x86_64 assembly language program for linux


 What is the shortest Hello World program on linux?

$ cat syscall.s
# Minimal program on linux x86_64
# (C) 2004 DJ Greaves.
# Compile and run with: as -o a.o syscall.s;ld -o a.out a.o; ./a.out

#int main()
#{
#  char *s = "Hello Worldly\n";
#  write(1, s, strlen(s));
#  exit(0);
#}


# Calling convention uses %rdi, %rsi, %rdx, %rcx, %r8 and %r9 in order.
# Low syscall function numbers are 0=read,1=write,2=open,3=close,4=stat


      
        .global _start
_start:
        mov    $14,%rdx
        mov    $message,%rsi
        mov    $1,%rdi # stdout file description number is 1
        mov    $1,%rax # Kernel function number for write is 1
        syscall

        movq   $0,%rdi # Zero return code
        mov    $60,%rax # Kernel function number for _exit is 60
        syscall

        ret

message:
        .string "Hello Worldly\n"


$
as -o a.o syscall.s
ld -o a.out a.o
./a.out
Hello Worldly
$
strace ./a.out
execve("./a.out", ["./a.out"], [/* 60 vars */]) = 0
write(1, "Hello Worldly\n", 14Hello Worldly
)         = 14
_exit(0)                                = ?
$

Thursday, 2 February 2012

Dining Philosophers Deadlock Example Using Bluespec

The dining philosophers were invented in 1965 by Edsger Dijkstra and the problem they pose has become the classical example of a system that is liable to deadlock.



Five philosophers are evenly spaced around a round table.   Each has a bowl of soup but there are only five spoons and each needs (for some strange reason) two spoons to eat.  So they cannot all eat at once, but must take it in turns.



In Bluespec Verilog, a spoon is easily coded as follows. The pickup method is blocking, so any philosopher attempting to pickup a fork that is already held by his neighbour will wait until his neighbour puts it down.



interface Spoon_if;
  method Action pickup();
  method Action putdown();
endinterface

module spoon (Spoon_if) ;

  Reg#(Bool) inuse <- mkReg(?);

  method Action pickup() if (!inuse);
   action
     inuse <= True;
   endaction
  endmethod

  method Action putdown();
   action
     inuse <= False;
   endaction
  endmethod

endmodule 

The complete system can be described as follows:


(*synthesize *)
module philoBENCH (Empty) ;

       Spoon_if spoon0 <- spoon;
       Spoon_if spoon1 <- spoon;
       Spoon_if spoon2 <- spoon;
       Spoon_if spoon3 <- spoon;
       Spoon_if spoon4 <- spoon;

       Diner_if din0 <- mkDiner (3, 7, spoon0, spoon1);
       Diner_if din1 <- mkDiner (4, 4, spoon1, spoon2);
       Diner_if din2 <- mkDiner (2, 9, spoon2, spoon3);
       Diner_if din3 <- mkDiner (3, 6, spoon3, spoon4);
       Diner_if din4 <- mkDiner (3, 6, spoon4, spoon0);

endmodule: philoBENCH


And the behaviour of an individual philosopher can be coded as follows:

module mkDiner #(UInt#(15) on, UInt#(15) seed) (Spoon_if left, Spoon_if right, Diner_if i);

  Reg#(Bool) eating <- mkReg(?);
  Reg#(UInt#(15)) timer <- mkReg(0);

  Random_if gen <- mkRandom_gen(seed);

  rule foo (timer != 0);
     timer <= timer - 1;
  endrule


  Stmt seq_behaviour =
  (seq while (True) seq
     action UInt#(15) x <- gen.gen(); timer<=x & 31; endaction await(timer== 0);

     left.pickup();
     noAction;

     action UInt#(15) x <- gen.gen(); timer<=x & 31; endaction await(timer== 0);

     right.pickup();

     eating <= True;
     timer <= on; await(timer==0);
     eating <= False;

     noAction;

     left.putdown();
     noAction;
     right.putdown();
     noAction;
  endseq endseq);

  FSM fsm <- mkFSM (seq_behaviour);

  rule kickoff  ;
     fsm.start;
  endrule


endmodule
 

What happens ?
 
 
Not suprisingly, after a little while the system reaches deadlock, as shown in this waveform trace:

The obvious source of deadlock is that each philosopher first picks up his left fork and then tries for his right.  This is clearly a bad policy: in concurrency terms, each spoon is a lock and the resulting cyclic ring of dependencies is known as a `knot' in the lock graph.


The Solution ?

There are many solutions the problem:  one that appears to work is for one philosopher to first pick up his right fork.  We change the system configuration as follows:

       Diner_if din0 <- mkDiner (3, 7, spoon0, spoon1);
       Diner_if din1 <- mkDiner (4, 4, spoon1, spoon2);
       Diner_if din2 <- mkDiner (2, 9, spoon3, spoon2); // <-- Spoon swap!
       Diner_if din3 <- mkDiner (3, 6, spoon3, spoon4);
       Diner_if din4 <- mkDiner (3, 6, spoon4, spoon0);
 
Now when we run the system, it does not seem to lock up however long we simulate for.  Here's some example output:



A good and permanent fix ?

But can we be sure this system is now deadlock free?

Thursday, 10 November 2011

TNDJG:0001 TechNotes-DJG Established

Hello World!


So, it's time I stopped abusing my home page at www.cl.cam.ac.uk/users/djg11 with all sorts of random technical articles and tried out a proper bloggsite. 

I have three or four recent notes to upload straight away, ranging from washing machine repair to structural hazards in BlueSpec Verilog...

DJG
/*