Magic of computation
A CPU is the part of a computer that coordinates all the other parts of it. It reads the program from memory, determines what each instruction of the program means, and reads, writes or otherwise manipulates memory, disk, screen, and other devices according to each simple instruction. CPUs are where the magic of computation happens. And computation is magic; the only magic that exists in the real world: the special kind of magic that is indistinguishable from sufficiently advanced technology.
It is easy to take for granted that a small device bought from a shop can display text or images from remote locations. But general purpose computers are marvellous machines that were unknown to humanity for thousands of years. The realisation that physical machines can manipulate intangible data has a history from the Antikythera mechanism to Babbage’s Analytical Engine. All these machines were special purpose (though the Analytical Engine was Turing complete). A general computer can be programmed to mimic any one of these, and an infinite variety more.
Modern CPUs are massively complicated and contain millions of individual design parts. The driving force behind this complexity is not the discovery of new kinds of computation; it is the demand for speed. Faster computers do permit new applications. However, even a slow, simple, primitive computer such as the one described here can execute the same algorithms as any more sophisticated computer.
Motivation, inception and development
Computer hardware, especially CPUs, is mysterious to most of us. Most people, even most programmers, are happy for it to remain so.
I was introduced to logic gates at school. There was no real computer science curriculum in those days, so what kind of exposure you got to computer science was a combination of your own study and your fortune in having a proactive teacher who knew something beyond how to use a word processor. My teacher was Martyn Leda who knew lots of things besides word processing. One of the classes was “Computer Programming” which introduced us to a plethora of languages and included some low level things like assembly programming, and some circuit design. I think it was after learning about turning gates into half adders into full adders and so on that I began to whimsically “design” my first CPU. The combinational logic was obvious, and I knew something about flip flops, but everything about how the CPU actually got work done was obscured by liberally applied handwaving.
It was also the case that my inspiration was the Intel 8086, a CISC design. Those chips were designed for an era and purpose in which hardware was expensive and programming effort was relatively cheap. Consequently, they were far from simple designs and featured many complications to make the most of hardware. (Later chips are more complicated still, internally: they support the original 8086 instruction set and its successors, but they do so by emulating those chips on top of a completely different lower level computer design.)
Years later at university I took a class on hardware, based on the textbook Computer Organisation by Patterson and Hennessy. This book covers a large part of computing hardware design, including a description of a complete MIPS processor from its logic gates to its assembly language. That book gave me one of those rare learning experiences in which a curtain is drawn back, and the hitherto mysterious system behind it is revealed. The system is still as marvellous as before. Now there is also wonder at how it was created, and moreover, that it can be comprehended by people like me.
That course was completely theoretical. It was not until I began learning Verilog that the few missing parts of my mental CPU design began to drop into place and I had a realistic chance of seeing it work. Kris (who got me started with Verilog) suggested (perhaps jokingly) that I should make a “4 bit CPU”.
The actual design was different, and inspired not only by the 8086, and by MIPS, but also by Charles Thacker’s Tiny Computer 3. A short PDF describes this computer and even contains 2 pages of Verilog code. My CPU turns out to be quite different and far less concise, efficient and capable.
I’ve put the Verilog source on Github. The README file there includes a description of the architecture. At the end of this post I’ll list some of the outstanding limitations and outline some possible next steps.
Over a few days I created the basic CPU, testing it in Icarus. It took as much work again to get it running on an FPGA! A number of changes had to be made:
- Unsynthesisable constructs: these were mostly
$displayand some of ways values were initialised. The worst was my use of
$readmemhto load the program memory. This should be synthesisable but Xilinx had so much trouble I ended up copying and pasting the hex values from my program into the Verilog source.
- An outer layer for FPGA use. The CPU is useless and arguably meaningless if it can’t communicate with the real world. In simulation, it can print to the console. On an FPGA, it cannot do that, so it needs to communicate using the device’s IO capabilities, such as the seven segment display.
- Timing issues: in many places I’d relied on the simulator’s use of an event queue to perform simulation updates in a useful order. In the real world, updates are applied simultaneously, so on the FPGA nothing was happening in the order I expected. (Verilog does allow timing directives to make the simulation reflect real world conditions, but I have not mastered these.)
- More timing issues: in many places I’ve derived new clock signals using combinational logic. This doesn’t work so well in the real world. Arguably it’s because there will be so much more skew with the derived signals. More practically in my case, it’s because the synthesis tools cannot analyse the derived clocks and hence have no idea what the timing constraints on them are. I could run my CPU on the FPGA by stepping down the clock rate by a factor of a million — there was some Verilog code that would count the ticks of the FPGA’s source clock at 50MHz, and update a derived CPU clock whenever the count reached a certain number. As I reduced the slowdown and the CPU ran faster, eventually it would start computing garbage. The solution was to replace derived clocks with the use of a single source clock, and enable signals. That signal clock is described in the constraints file, and the synthesiser can recognise all the circuits that depend on it and ensure the paths through them are shorter than 50MHz.
Machine in action
Often a new computer or virtual machine is demonstrated by running a port of Quake or other well known program on it. Sadly, such feats were well beyond my CPU’s capabilities at the time I first got it running in hardware (a few weeks after beginning to learn Verilog).
This video shows the CPU in operation, computing the Fibonacci sequence. The display is in hex. This does make the output somewhat strange, but it a lot easier for a binary computer to generate. (Circuitry to turn a binary number into a sequence of decimal digits for display is not trivial!)
Here is the source for that Fibonacci program. This is not assembly language — there is no assembler for this machine yet. This is hand typed machine code. True, I did not have to enter it one bit at a time by toggling a switch. Compared to programmers of yesteryear I was blessed with a text editor to type it in, a hexadecimal based code format, and a very simple instruction set.
0000 // FIBONACCI EXAMPLE E300 // Set Limit = 32768 F380 // " E701 // Set Increment = 1 F700 // " E600 // Set Counter = 0 F600 // " E100 // Main loop: Set A = 0 F100 // " E201 // Set B = 1 F200 // " B001 // Fibonacci loop: Output 0 to port 1 B201 // Output B to port 1 3432 // Check if Limit < B D405 // If so, exit loop 0512 // Set C = A + B 0102 // Set A = B 0205 // Set B = C C0F9 // Jump to Fibonacci loop 0667 // Increment Counter B600 // Output Counter to port 0 C0F2 // Jump to Main loop
Each 4-digit number is an instruction. The first digital is the instruction, for instance
add. The next three digits are the register numbers to operate on, or constant values to use. Line 16 means “add the values from registers 1 and 2 and store the result in register 5″. The most difficult instruction to write was the jump
COF2 on line 22.
C is jump; but the offset is made from the last two digits and addresses are modulo 256. Jumping by
F2 = 242 is the same as jumping by -14. The effect is to move the instruction pointer back to line 8.
Although the instruction set contains
store instructions, the CPU has no actual data memory! Many interesting things can be computed using a handful of registers, but I admit that I’m not sure I can port Quake to it without some support for arrays.
Additionally, the CPU cannot manipulate program memory, save for executing it. It can’t be read, it can’t be written, and it can’t even by addressed properly. The instruction pointer is hidden and all jumps are relative. This is a consequence of the design: the instruction format provides only 4 bits for the opcode, yielding a maximum of 16 instructions. Even when additional instructions are easily implemented, it is hard to find distinct opcodes for them.
I believe that arrays and function pointers, at least, will need to be supported for the CPU to be capable of supporting C programs. Writeable memory is also a requirement for proper general purpose use: instead of reading a hardcoded program memory, the CPU should boot up, read the program from somewhere, and start running it.
The instruction format is constricted but there is a nascent “ports” system that provides expandability. There is not standard assignment but on the FPGA I’ve wired some ports to the seven segment display. Others could be wired to other devices. Similarly, the memory address space could be partitioned into program memory, data memory, and video memory.
As shown above, the instruction set is relatively easy to program by hand. But an assembler would make it easier. A compiler would be even better!
It should be noted it is far easier to muse on these possible improvements than it is to go forth and implement them. The CPU project is primarily for learning and amusement. When motivation depends on amusement it is easy to slack off when all the really interesting bits have been done. But some other FPGA-related projects have suggested a use for a small microprocessor core. Even with HDLs like Verilog, there are still many algorithms that are easier to implement as programs than as circuits.