TechNotes-DJG: August 2013

The Cake ML Project, https://cakeml.org/, is developing a fully-verified ML system that includes the hardware, run-time system and compiler. An important hardware component is the processor that executes the bytecode.

Last year I wrote an implementation of the processor in standard Verilog and used it for a couple of demos. The processor is a simple stack machine with the top three items from the stack held in on-chip registers. The other registers are the stack pointer, the program counter and the free space pointer.

Overview block diagram of the system:

My Verilog implementation was 600 lines long (the module HWML_CPU_CORE in cpucore0.zip).

Last week I manually converted the processor core from Verilog to Bluespec with the help of a few emacs regexps (the file Hwmlcore.bsv in bsvcore0.zip). It came out shorter at just over 400 lines, but most surprisingly, its instructions-per-clock metric had more than doubled! And interestingly, the design had returned to about 700 lines of RTL at the output of the Bluespec compiler (toy-bluespec-compiler).

Both implementations were written fairly quickly with no special attention to performance. Being a direct implementation of a simple stack machine, the main performance bottleneck should be the stack memory read/write port. For instructions that pop two items from the stack, such as JNZ, two stack RAM reads are needed.

The test application program I was using was a simple, massively recursive implementation of Fibonacci implemented by Magnus Myreen. The difference in performance is apparent from the execution times of the first five instructions.

The CPU is released from reset after four clock cycles, each of 10 ns. The Bluespec implementation execute its fifth instruction after a further 90 ns whereas the older implementation requires 180 ns to get to the same point:

Instruction timings Bluespec version:

55: sp=0000 tos = [[ xxxxxxxx xxxxxxxx xxxxxxxx ]]  pc='h0000  ins='h4a Push 0a
75: sp=0001 tos = [[ 0000000a xxxxxxxx xxxxxxxx ]]  pc='h0001  ins='h44 Push 04
95: sp=0002 tos = [[ 00000004 0000000a xxxxxxxx ]]  pc='h0002  ins='h12 Call 4
115: sp=0002 tos = [[ 00000003 0000000a xxxxxxxx ]]   pc='h0004  ins='h0c Swap
135: sp=0002 tos = [[ 0000000a 00000003 xxxxxxxx ]]  pc='h0005  ins='h40 Push 00

Instruction timings original version:

65 sp=00000000 tos= [[ xxxxxxxx xxxxxxxx xxxxxxxx  ]] pc=10000000  i=4a 
105 sp=00000001 tos= [[ 0000000a xxxxxxxx xxxxxxxx  ]] pc=10000001  i=44 
145 sp=00000002 tos= [[ 00000004 0000000a xxxxxxxx  ]] pc=10000002  i=12 CALL
185 sp=00000002 tos= [[ 10000003 0000000a xxxxxxxx  ]] pc=00000004  i=0c SWAP
225 sp=00000002 tos= [[ 0000000a 10000003 xxxxxxxx  ]] pc=00000005  i=40

Overall, using Bluespec, the 2831 instructions consumed 6804 clocks, an IPC of 0.42 compared with the IPC 0.16 achieved in the original design.

Analysis

A careful look at the fetch/execute control logic and memory interface design in each of the two variants reveals that the original design had an extra two clock-cycles' worth of pipeline delay. A little more care over the manual design could perhaps have avoided this. I was rather hoping the Bluespec compiler would have done something fancy, such as merging the actions of several rules into a single clock cycle. But nonetheless, it is clear that the Bluespec version is shorter, easier to read and modify and gave better performance.

David Greaves - August 2013.

TechNotes-DJG

Tuesday, 27 August 2013

TNDJG:0004: I converted my processor design to Bluespec and it went more than twice as fast!

Analysis