Designing a Processor
Most machines, from the simplest hammers to the most sophisticated appliances,
- have a finite set of possible operations (e.g., a car can accelerate, turn, and decelerate)
- use control input to determine the operation to use (e.g., the steering wheel and pedals provide input to a car)
- interact with data to determine the results of the operation (e.g., the pitch of the roadway)
- result in modification of some part of their environment (e.g., the position and velocity of the car)
A general-purpose computing machine is a special machine where the control, data, and modified environment are all the same. This allows one to have the computer modify its own instructions, expanding the capability of the finite set of core operations to perform arbitrarily complex work. General-purpose computing machines have become so pervasive that they are now commonly called simply “computers,” a term formerly reserved for a human occupation.
A brief history of computers
The first known general-purpose computer to be designed was Charles Babbage’s Analytical Engine, designed and partially constructed in the 1830s. Although Babbage designed it, it was Ada King-Noel, Countess of Lovelace, who first realized it was general-purpose and published the first algorithm for it, thus making Babbage the first computer engineer and Lovelace the first computer scientist. Babbage’s machine was never completed and no subsequent work built on Babbage and Lovelace’s pioneering efforts.
In 1933 Kurt Gödel and Jacques Herbrand defined a class of general recursive functions; in 1936 Alonzo Church expanded on this to define the lambda calculus; and later in 1936 Alan Turing defined an abstract notion of a universal computing machine, later known as the Turing Machine. In his 1939 Ph.D. dissertation Turing (with his Ph.D. advisor Alonzo Church) proposed that these three are all equivalent, and that no more computationally-powerful machine can exist. Known as the Church-Turing thesis, the postulate that no more powerful machine can exist cannot be proven but is generally acknowledged as true. Of the three models, Turing’s had the advantage Babbage’s had had 100 years earlier: it described a plausibly constructible machine, though one not realistic enough to actually be constructed until decades later, when people started building them as historical novelties.
Although many people were involved in the development of functional universal computing machines1, the clearest progenitor to how most computers are designed is John von Neumann, as described in the 1945 publication First Draft of a Report on the EDVAC. This model is the subject of our next section.
Von Neumann architecture
The von Neumann architecture consists of the following parts:
- The memory unit, which acts like an array or list of numbers. Given an address (an index into the array) it gives back the number stored there; given an address and a number, it stores the number at that address.
- The control unit, which has several parts:
    - a program counter (sometimes also known as an instruction pointer), a register that holds the address where the next instruction is to be found in memory;
    - an instruction register that holds the instruction currently being executed;
    - logic to adjust the program counter based on its current contents and the contents of the instruction register; and
    - logic to control what the ALU does and how to send data to and from memory based on the contents of the instruction register.
- The arithmetic logic unit (usually abbreviated as ALU), consisting of
    - a small number of program registers for storing operands and results; and
    - a set of logical gates implementing basic arithmetic operations using those registers as inputs and outputs.
- Ability to communicate with external devices, such as disks, keyboards, screens, etc.
Exercise: Write the basic von Neumann architecture simulator in the programming language of your choice. You should include:
- A list or array of 256 `0`s as a representation of `memory`, in such a way you can easily change this to have different specific values inside it.
- A list or array of 4 `0`s as a representation of the program `registers`.
- Two variables, `pc` and `ir`, to store the contents of the program counter and instruction register, respectively.
- A `newPC` function that returns the value the `pc` should have next based on the current `pc` and `ir`; you can just return `-1` for now.
- A `work` function that uses the contents of the `ir` to decide what to do at each step; you can have it do nothing at all for now.
- A `cycle` function that does the following:
    - simulate a rising clock edge by
        - loading register inputs into registers
        - loading `newPC()` into `pc`
        - loading `memory[pc]` into `ir`
    - `work()`
- A `run` function that repeatedly:
    - prints the value of `pc`
    - calls `cycle`
    - prints the value of `ir` and `registers`
    - stops if `pc` is negative or larger than `memory` can access
If you run the `run` function you should see something like

```
pc = 0
ir = 0; r0=0, r1=0, r2=0, r3=0
```
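If you are unsure where to begin, here is a minimal sketch of one possible starting point in Python (any language works; the names follow the bullets above, and `newPC` and `work` are deliberately placeholders):

```python
# A minimal sketch of the simulator described in the exercise above.
memory = [0] * 256          # main memory: 256 values, all initially 0
registers = [0] * 4         # the 4 program registers r0..r3
pc = 0                      # program counter
ir = 0                      # instruction register

def newPC():
    """Return the next value of pc based on the current pc and ir."""
    return -1               # placeholder for now, as the exercise suggests

def work():
    """Use the contents of ir to decide what to do; nothing yet."""
    pass

def cycle():
    global pc, ir
    # simulate a rising clock edge by loading register inputs into registers
    pc = newPC()
    ir = memory[pc]
    work()

def run():
    while True:
        print(f"pc = {pc}")
        cycle()
        print(f"ir = {ir}; " + ", ".join(f"r{i}={v}" for i, v in enumerate(registers)))
        if pc < 0 or pc >= len(memory):
            break

run()   # prints "pc = 0" then "ir = 0; r0=0, r1=0, r2=0, r3=0" and stops
```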
We’ll come back to this basic simulator and flesh it out in subsequent sections.
Instruction-set architecture
The core idea of a general-purpose computer is that the contents of memory describe the actions to take. But memory just stores bytes, and there is no intrinsic right set of things to do on these bytes. Various computational theorists have proven that even very limited sets of operations can be “complete” in the sense that they can be used to emulate everything else, but the general model is to have a range of actions that let programmers easily make the computer do what they wish.
The set of instructions a computer provides, together with how those instructions are encoded, defines the computer’s Instruction Set Architecture or ISA.
The Basics
The most common three categories of actions are
- moves (corresponding to assignment operators in most programming languages),
- maths (corresponding to mathematical operators in most programming languages), and
- jumps (corresponding to control constructs in most programming languages).
Moves
One of the pillars of the imperative programming paradigm2 is the assignment operator `=`. All it does is take a value which exists in one place and put it into another place, but as you have already seen that operation allows a great deal of flexibility.
When building a machine, making circuitry that is as general as programming’s `=` is not trivially possible. Recall that our `work()` function must decide what to do based only on information it can see, generally meaning the number stored in `ir`. `register[0] = register[1]` is going to have to have a different number for it than is `register[2] = register[7]`; and both will be different from `register[3] = memory[131]` or `register[3] = memory[register[0]]`.
That said, once we enumerate all of the moves we think are important, we can make them all happen in `work()` with a long but simple `if`-statement, or equivalently with a large mux in hardware.
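For instance, mirroring the moves named above (with entirely made-up instruction numbers, and assuming a `registers` list with at least 8 entries), `work()` could start out as:

```python
def work():
    # Each enumerated move gets its own arbitrarily chosen instruction number.
    if ir == 1:                                   # register[0] = register[1]
        registers[0] = registers[1]
    elif ir == 2:                                 # register[2] = register[7]
        registers[2] = registers[7]
    elif ir == 3:                                 # register[3] = memory[131]
        registers[3] = memory[131]
    elif ir == 4:                                 # register[3] = memory[register[0]]
        registers[3] = memory[registers[0]]
    # ... one branch per move we decided was important
```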
Maths
After `=`, the next building block of most imperative programming languages is binary operators. These are big networks of gates: `+` we’ve already looked at, and `-` is very similar; `*` is basically one `+` for each bit in the input numbers; and `/` and `%`, while quite a bit more complicated still, can still be handled with a lot of gates.
There are other operators we’ll need too, like `>=` and its friends, and boolean operators as well, but in the end they’re all just a bunch of gates.
To make the machine use them, though, we have an additional complexity. If there were a lot of ways to combine a source and destination for moves, there are even more ways to combine two sources and a destination for maths. Even using just 8 registers, there are 8³ = 512 possible combinations of `register[?] = register[?] + register[?]`, and more when we get to operands in memory or literals.
That said, the `work()` for maths is not dramatically different from the `work()` for moves; there is a lot more circuitry in `if (ir == 12345): register[2] = register[3] * register[4]` than in `if (ir == 12345): register[2] = register[3]`, but the logic needed to make it work is still basically a big mux.
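Continuing the earlier sketch (again with made-up instruction numbers, and assuming a `registers` list with enough entries), a maths branch slots into the same if-chain:

```python
def work():
    # The same big if-chain (the software analogue of a mux); the instruction
    # numbers here are made up purely for illustration.
    if ir == 12345:                     # register[2] = register[3] * register[4]
        registers[2] = registers[3] * registers[4]
    elif ir == 12346:                   # register[2] = register[3]
        registers[2] = registers[3]
    # ... one branch per move or maths operation we choose to support
```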
Jumps
While moves and maths correlate directly with components of common programming languages, jumps are somewhat less direct.
A jump is a program instruction that specifies the new `pc` as something other than the next instruction. These can enable loops:
| loop in a programming language | the same loop using a jump |
|---|---|
| `repeat forever:` | 1. do stuff |
| `    do stuff` | 2. more stuff too |
| `    more stuff too` | 3. run instruction 1 next |
and conditionals
| conditional in a programming language | the same conditional using a jump |
|---|---|
| `if x < y:` | 1. if not(x < y) skip to 3 |
| `    do stuff` | 2. do stuff |
| `    more stuff too` | 3. more stuff too |
and also, by combining those, all the other kinds of control constructs: `while` loops are loops with a conditional redirection, function calls and returns are (mostly just) jumps to and from the function code, etc.
There is no `work()` in most jumps; instead, they add a mux to the `newPC()` function.
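In the simulator, that mux shows up as a branch inside `newPC()`. For example (with a made-up instruction number and jump target):

```python
def newPC():
    # Made-up encoding for illustration: instruction number 99 means
    # "jump to address 40 if register 0 is non-zero".
    if ir == 99 and registers[0] != 0:
        return 40          # the jump target (hypothetical)
    return pc + 1          # default: the instruction right after this one
```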
Encoding instructions
There is no one “right” way to encode instructions into binary, but it is common to do this by having particular sets of bits with particular meanings. For example, we can specify one of 4 registers using 2 bits; one of 8 operations using 3 bits; etc. Thus, a simple way to create an instruction encoding is to figure out all the information any action will need, figure out how many bits we need to encode that, and then reserve a set of bits in the instruction for that information.
Example: Suppose we want the following operations:
- Move a literal value into a register
- Jump to a specific address if a given register is non-zero
- Move into a register a value found at a memory address stored in another register
- Move a value from a register into a memory address stored in another register
- `a = a op b` where `op` is one of `+`, `-`, `*`, and `<`; and `a` and `b` are registers
That’s eight operations, so we need 3 bits to encode which one is which. We also need 4 bits to encode up to 2 registers and 8 more to encode a literal value, for a total of 15 bits. 15 bits fits into 2 bytes, so we can make our encoding be the following:
- Look up the high-order 3 bits of the first byte on the following table:

    | bits | meaning |
    |---|---|
    | `000` | move a literal into a register |
    | `001` | jump to address if register non-zero |
    | `010` | move from memory to register |
    | `011` | move from register to memory |
    | `100` | add one register to another |
    | `101` | subtract one register from another |
    | `110` | multiply one register by another |
    | `111` | set one register to 1 if it is less than another, 0 otherwise |

- The low-order 4 bits of the first byte give two registers; if the action changes a register, that is given by the higher-order two bits; if a register is not changed, it is given in the lower-order two bits.
- The second byte is the literal value (for `000`) or address (for `001`).
This is of course not the only encoding we could make, but it does contain all the needed information.
Using this encoding, to encode

```
x = 3
while x < 50
    x *= x
```

we’d first translate that into steps our example ISA has access to
1. r0 = 3
2. r1 = 50
3. r0 = r0 × r0
4. r2 = r0; since we don’t have this operation, we’ll do it as
    a. r2 = 0
    b. r2 = r2 + r0
5. r2 = 1 if r2 < r1 else 0
6. if (r2 ≠ 0) jump to step 3
and then encode those instructions into binary (using `_` for unspecified bits)

```
000 00__ 00000011
000 01__ 00110010
110 0000
000 1000 00000000
100 1000
111 1001
001 10__ ????????
```
Finally we have to decide where in memory we are putting this so we know what value to put into the `????????` jump address. Let’s put it at address 0, so the index of the multiply instruction is 4:

```
001 10__ 00000100
```
Now we assume all `_` are 0, prepend a 1 bit to each instruction, and turn that into hex so we can put it into a simulator and get the following contents of memory:

```
0x80,0x03, 0x84,0x32, 0xE0, 0x88,0x00, 0xC8, 0xF9, 0x98,0x04
```
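If you want to check bytes like these by hand, a small Python sketch can pull the fields back apart with shifts and masks; this assumes the layout used just above (a leading 1 bit, then the three opcode bits, then the four register bits, then the optional second byte):

```python
def decode(byte0, byte1=0):
    """Split one instruction of the example encoding into its fields."""
    opcode = (byte0 >> 4) & 0b111   # the three bits after the leading 1
    reg_hi = (byte0 >> 2) & 0b11    # higher-order register field
    reg_lo = byte0 & 0b11           # lower-order register field
    return opcode, reg_hi, reg_lo, byte1   # byte1 is the literal or address, if any

print(decode(0x80, 0x03))   # (0, 0, 0, 3): opcode 000, move the literal 3 into r0
print(decode(0xE0))         # (6, 0, 0, 0): opcode 110, multiply r0 by r0
```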
Exercise: Create an encoding for the following code that computes `r1 = r2 / r3` (rounded up) by implementing the following:

```
r1 = 0
while r1 * r3 < r2
    r1 += 1
```
We can change the loop to a jump by doing

1. r1 = -1
2. r1 += 1
3. if r1 * r3 < r2, go to step 2 again
and reduce that to the few operations we have available to us

1. r2 = a numerator
2. r3 = a denominator
3. r1 = −1
4. r0 = 1
5. r1 = r1 + r0
6. r0 = r1
7. r0 = r0 × r3
8. r0 = 1 if r0 < r2 else 0
9. if (r0 ≠ 0) jump to step 4
Encode the above sequence of instructions into bytes. Remember to use the two’s complement encoding of −1.
Exercise: Modify your simulator from the last exercise to implement the example instruction set encoding listed above. In particular,
- Refer to the section on Masks to remember how to get specific bits out of a number in code.
- In `cycle()` we’ll still have `ir` become `memory[pc]` (the first byte only) for simplicity.
- `newPC` will generally return `pc + 2` since instructions are 2 bytes long; but if the top three bits of `ir` are `001` and the register indicated by the low-order two bits of `ir` contains a non-zero value, it will return the second byte of the instruction (i.e., `memory[pc+1]`) instead.
- `work()` will need to consider the high-order three bits of `ir` and act as per the table in the example above.
Verify that your simulator runs the two example programs above. The code from the example should put 81 (0x51) into `r0`; the code from the exercise should put `r2 / r3` into `r1`.
Fixed- or variable-length
Not all instructions need the same amount of information to be represented. Should instructions that need less information take up fewer bytes than instructions that need a lot of information?
For such a simple question, there does not appear to be a simple answer. Fixed-length instructions make some of what the computer does simpler, theoretically providing cost and power benefits; variable-length instructions make more efficient use of memory, also theoretically providing cost and power benefits. Intel and AMD are known for variable-length instruction sets, while IBM and Motorola have more fixed-length instruction sets.
Exercise: Modify your simulator so it increases the `pc` by 2 only if the instruction needs 2 bytes (i.e., `000` and `001`); otherwise only increase the `pc` by 1.
This should let us shorten our example program

```
0x00,0x03, 0x04,0x32, 0xC0,0x00, 0x68,0x00, 0xE9,0x00, 0x22,0x04
```

to just

```
0x00,0x03, 0x04,0x32, 0xC0, 0x68, 0xE9, 0x22,0x04
```
removing 3 of the original 12 bytes (a 25% space saving). How much space does this save in the division program?
Addressing modes
How should we represent the operands of operations? Many different representations are useful, leading to a diversity of addressing modes:
- Register: The most fundamental and universal addressing mode is to operate on values stored in registers.
- Immediate: Immediate addressing refers to putting the value directly in the instruction. This roughly correlates to using literals in source code. Immediate addressing is never used for destinations of operations: theoretically it would be possible to store a result into the program memory itself, but this is not done by any mainstream ISA.
- Memory: Because access to memory is generally through a low-capacity interface, most ISAs allow at most one operand to be in memory, and some limit all memory addressing to move operations only. There are multiple sub-types of memory addressing. The three basic building blocks are:
    - Immediate address: The address is available as a literal value in the code. This is useful for encoding global variables and function addresses.
    - Register address: The address is available as the contents of a program register. This is useful for passing references to mutable values in variables.
    - Relative address: For various reasons (space efficiency, relocatable code, etc.) it is sometimes convenient to compute addresses relative to some reference address, such as the current program counter or a dedicated base address register.
These three blocks can be usefully combined in various ways. For example, one common addressing mode in Intel’s x86 architecture is “Relative Immediate1 + Register1 + Register2 × 2^Immediate2”, where Immediate2 is limited to numbers no larger than 4. While that may seem strangely specific and idiosyncratic, it is useful for several common code patterns, such as `x[i].y`.
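To see why, here is a rough sketch (with made-up addresses and sizes) of the arithmetic behind `x[i].y` when `x` is an array of 8-byte records whose field `y` sits 4 bytes into each record:

```python
# Made-up values purely to illustrate the address arithmetic.
x_base = 0x1000        # Register1: the address where the array x starts
i = 5                  # Register2: the index into the array
record_size = 8        # each element of x is 8 bytes, i.e., a scale of 2**3
y_offset = 4           # Immediate1: field y sits 4 bytes into each record

# Register1 + Register2 * 2**Immediate2 + Immediate1
address_of_x_i_y = x_base + i * record_size + y_offset
print(hex(address_of_x_i_y))   # 0x102c
```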
Picking the right set of addressing modes in an ISA is challenging. Fixed-length encodings in particular often face problems here, as something as simple as a plain immediate value requires instructions to contain more bits than any legal constant does. To avoid the code bloat of making every instruction large enough to fit a full-length immediate value, mainstream ISAs that use fixed-length encoding generally can represent only a small subset of integer literals as immediate values and require all others to be constructed by instruction sequences, sometimes adding customized instructions to make that construction relatively efficient. Variable-length ISAs do not suffer from that limitation, but generally have quite complicated rules for determining the meaning of each instruction anyway, so there is little gain in clarity.
Extra State
In addition to program registers and memory, processors generally store additional information. We’ve already seen a few of these: the PC and IR are registers but not program registers because they cannot be directly addressed by programs. Modern processors have hundreds of other similar pieces of hidden supporting state, a few of which are important enough to deserve additional conversation.
Condition codes
Many processors have a set of registers set automatically by some operations, called condition codes. They can be used to determine how certain other operations behave.
In the x86 series of ISAs, for example, every mathematical operation not only computes its resulting value but also sets several condition code bits that describe the result of the computation. These are used to determine whether conditional jumps (and other conditional operations) are executed. Thus instead of writing

```
r0 = 1 if r0 < r3 else 0
if (r0 ≠ 0) jump to step 4
```

in x86 you’d write the equivalent of

```
r0 -= r3
if (last result < 0) jump to step 4
```
The ARM family of ISAs also uses similar condition codes, but unlike x86, where only a few instructions are conditional, in ARM 7 almost every instruction is conditional.
Other ISAs, such as PowerPC and MIPS, do not make as extensive use of condition codes, instead using explicit condition operands in conditional operations. Even these still have some condition codes, however, to track things like integer overflow and other unusual conditions.
The stack
There was a period in which enough code was written directly in machine language that hardware developers chose to start adding higher-level abstractions directly into their ISAs. While all of these still exist in older ISAs like x86, most are no longer much used. One, however, is very pervasive: the stack operations.
The concept of a stack, as implemented by the x86 ISA, is easier to explain in code than words:
```
stack_memory = big contiguous chunk of memory
stack_pointer = length_of(stack_memory)

how to push(value):
    1. stack_pointer -= 1
    2. stack_memory[stack_pointer] = value

how to pop():
    1. result = stack_memory[stack_pointer]
    2. stack_pointer += 1
    3. return result
```
Example: Consider the following code:

```
push(3)
push(4)
push(5)
push(6)
pop()
x = pop()
pop()
```

At the end of the code `x` contains the value `5` and the stack still contains one value (the `3`).
Exercise: Consider the following code:

```
push(3)
push(4)
x = pop()
push(5)
y = pop()
z = pop()
```

What are the values of `x`, `y`, and `z` at the end of the code?3
In x86-64, the `stack_pointer` in the above pseudocode is a program register, `rsp`. The ISA instruction `pop rax` does the equivalent of `rax = pop()` in the above pseudocode: that is, it reads a value from memory, stores that value in `rax`, and also increases the value in `rsp`. `push` and `pop`, as well as related instructions like `call` and `ret`, are extremely prevalent in code in the wild today.
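As a rough sketch in the style of the pseudocode above (not real x86 syntax), `push rax` and `pop rax` behave roughly like this, keeping in mind that x86-64 stack slots are 8 bytes wide:

```python
# Rough sketch of x86-64 push/pop in terms of the earlier stack pseudocode;
# registers and memory here are stand-in dictionaries, not real hardware state.
registers = {"rsp": 1024, "rax": 42}
memory = {}

def push_rax():                 # what "push rax" does
    registers["rsp"] -= 8       # stack slots are 8 bytes in x86-64
    memory[registers["rsp"]] = registers["rax"]

def pop_rax():                  # what "pop rax" does
    registers["rax"] = memory[registers["rsp"]]
    registers["rsp"] += 8

push_rax()
print(registers["rsp"], memory)   # 1016 {1016: 42}
```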
Complicated operands
For various reasons, some historical and some performance-based, some instructions assume particular registers are part of their operation. This is particularly common in x86 and x86-64.
Example: Intel’s ISA manual, volume 2, defines the 64-bit version of the `div` instruction as dividing `rdx:rax` by the source operand and storing the quotient in `rax`, the remainder in `rdx`. In other words, this is a 128-bit integer divided by a 64-bit integer with both result and remainder returned, using three source registers and two destination registers; but only one of the source registers is able to be specified.
This results in assembly code for `answer = numerator / denominator` looking something like

- move numerator into `rax`
- zero out `rdx`
- `div` denominator
- move `rax` into answer
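As a sketch of the semantics only (not of the real encoding or syntax), the effect of that division can be modelled in Python like this:

```python
# Model of what the 64-bit div instruction computes: rdx:rax (a 128-bit value,
# with rdx holding the high 64 bits) divided by the source operand, with the
# quotient left in rax and the remainder in rdx.
def div(rdx, rax, source):
    dividend = (rdx << 64) | rax
    quotient, remainder = divmod(dividend, source)
    return quotient, remainder          # the new rax and the new rdx

rax, rdx = div(rdx=0, rax=100, source=7)
print(rax, rdx)                          # 14 2, since 100 = 14*7 + 2
```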
It is also fairly common to have the computer processor logically or physically divided into distinct subprocessors, each with a specialized set of instructions and registers. x86, for example, uses separate registers for floating-point maths and integer maths, and has more registers still for various vector operations and so on.
RISC vs CISC
In the 1990s it was common for computer architecture textbooks to describe two kinds of ISAs: the RISC (reduced instruction set) and the CISC (complex instruction set). These do not describe any major production ISAs, but are still common terms in computer architecture discussions. A summary follows:
| RISC | CISC | Today we see |
|---|---|---|
| Fixed-length encoding | Variable-length encoding | Both |
| Few, simple instructions | Many, complicated instructions | Mostly CISC |
| Few addressing modes | Many addressing modes | More CISC |
| Uniform operand notation | Idiosyncratic operand notation | Both |
| Many program registers | Few program registers | More RISC |
| Clean new design | Backwards-compatible design | Both |
| All program registers are interchangeable | Some operations only work with specific registers | More CISC |
| 1 instruction = 1 cycle | Some instructions take several cycles | CISC |
| Immediate values cannot represent all numbers | Immediate values can represent all numbers | Both |
| All maths operands are registers | One maths operand may be in memory | Both |
| Some instructions may not follow others | No limitations on instruction ordering | More CISC |
| Conditions are in registers | There are special “condition codes” | More CISC |
| Function arguments usually passed in registers | Function arguments usually passed on the stack | More RISC |
There are also architectures with significant design decisions beyond those in this table, such as the compiler-controlled instruction scheduling of the Itanium architecture, but they are beyond the scope of this course.
Pipelining and beyond
Executing most instructions involves several steps, some of which depend on others. For example, the `push` instruction involves reading the stack pointer register, decrementing it, and storing a value into memory at the new address; the store cannot happen until the decremented address has been computed.
While each instruction has its own dependency graph of steps, they can all be fit into the following sequence:
- Fetch an instruction from memory
- Decode what the instruction intends to do
- Retrieve values from the register file
- Execute any math or logic operations the instruction requires
- Access memory
- Update values in the register file
Naming the steps.
Every computer architecture text I have seen names the steps in the above sequence, but the exact set of steps and names, as well as the set of work included in each step, varies. The names “Fetch”, “Decode”, and “Execute” are almost always included, but which one includes accessing program registers varies. Some texts also name steps “Memory”, “Writeback”, and/or “PC Update”.
Official ISA documentation also tends to use these names in some form, though often with additional steps as well.
This decomposition of work into steps is not merely academic; processor throughput of instructions can be dramatically increased by pipelining these steps. “Pipeline” is the name used in hardware design for the manufacturing notion of an assembly line: when the part of the chip that fetches instructions finishes fetching the first instruction it passes it off to the next part of the chip and proceeds to fetch the second instruction. Thus, at any given time the processor might have many instructions in different stages of completion.
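As a back-of-the-envelope illustration (an idealized pipeline with no stalls, which real processors never quite achieve), compare the cycle counts with and without pipelining:

```python
# Idealized cycle counts: unpipelined, every instruction occupies all S stages
# before the next one starts; pipelined, the first instruction takes S cycles
# and each later one finishes just one cycle after its predecessor.
def cycles(instructions, stages, pipelined):
    if pipelined:
        return stages + (instructions - 1)
    return stages * instructions

print(cycles(1000, 6, pipelined=False))   # 6000
print(cycles(1000, 6, pipelined=True))    # 1005
```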
There are many additional issues that arise when pipelining. A few of the more common techniques used to resolve these issues include:
- Stall: Instruction dependencies may need to cause one stage to stand idle, waiting for an earlier instruction in a later stage to finish.
- Forward: One instruction may grab information from a half-completed instruction further down the pipeline.
- Reorder: A later instruction may be executed before an earlier one to give otherwise-stalled stages something to do.
1. There is indirect evidence that many of them were anonymous women: women in part because most people with the profession “computer” were women and because there was a shortage of male labor during the Second World War, when computers were being pioneered; anonymous because the culture of the day did not generally acknowledge female input or provide scope for their leadership. ↩
2. It is likely that if you have learned to program but have not learned about programming paradigms, you learned a variant of the imperative paradigm. There are other approaches to programming that do not depend on variable assignment, the two most famous being the functional paradigm (where recursion is the key to expressivity) and the logical paradigm (where power is obtained from clever application of Horn clause reduction rules in formal logic). Computer hardware that uses these paradigms directly is unknown to the author of this text. ↩
3. Answer: x = 4, y = 5, z = 3 ↩