Part 1: Computer Architecture Fundamentals
(instruction set architectures, caching, pipelines, hyperthreading)
We all think of the CPU as the “brains” of a computer, but what does that actually mean? What is going on inside with the billions of transistors to make your computer work? In this new four-part mini series we’ll be focusing on computer hardware design, covering the ins and outs of what makes a computer work.
The series will cover computer architecture, processor circuit design, VLSI (very-large-scale integration), chip fabrication, and future trends in computing. If you’ve always been interested in the details of how processors work on the inside, stick around because this is what you want to know to get started.
We'll start at a very high level, looking at what a processor does and how its building blocks come together in a functioning design. This includes processor cores, the memory hierarchy, branch prediction, and more. First, we need a basic definition of what a CPU does. The simplest explanation is that a CPU follows a set of instructions to perform some operation on a set of inputs. For example, this could be reading a value from memory, then adding it to another value, and finally storing the result back to memory in a different location. It could also be something more complex like dividing two numbers if the result of the previous calculation was greater than zero.
When you want to run a program like an operating system or a game, the program itself is a series of instructions for the CPU to execute. These instructions are loaded from memory and on a simple processor, they are executed one by one until the program is finished. While software developers write their programs in high-level languages like C++ or Python, for example, the processor can’t understand that. It only understands 1s and 0s so we need a way to represent code in this format.
Programs are compiled into a set of low-level instructions called assembly language as part of an Instruction Set Architecture (ISA). This is the set of instructions that the CPU is built to understand and execute. Some of the most common ISAs are x86, MIPS, ARM, RISC-V, and PowerPC. Just like the syntax for writing a function in C++ is different from a function that does the same thing in Python, each ISA has a different syntax.
These ISAs can be broken up into two main categories: fixed-length and variable-length. The RISC-V ISA uses fixed-length instructions which means a certain predefined number of bits in each instruction determine what type of instruction it is. This is different from x86 which uses variable length instructions. In x86, instructions can be encoded in different ways and with different numbers of bits for different parts. Because of this complexity, the instruction decoder in x86 CPUs is typically the most complex part of the whole design.
Fixed-length instructions allow for easier decoding due to their regular structure, but limit the number of total instructions that an ISA can support. While the common versions of the RISC-V architecture have about 100 instructions and are open-source, x86 is proprietary and nobody really knows how many instructions there are. People generally believe there are a few thousand x86 instructions but the exact number isn’t public. Despite differences among the ISAs, they all carry essentially the same core functionality.
Example of some of the RISC-V instructions. The opcode on the right is 7-bits and determines the type of instruction. Each instruction also contains bits for which registers to use and which functions to perform. This is how assembly instructions are broken down into binary for a CPU to understand.
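To make that concrete, here is a minimal Python sketch that pulls the fields out of one 32-bit RISC-V R-type instruction word. The bit positions come from the RISC-V base specification; the example word used here encodes add x5, x6, x7.

```python
# Minimal sketch: decode one RISC-V R-type instruction word into its fields.
# Field positions follow the RISC-V base ISA layout.

def decode_rtype(word: int) -> dict:
    return {
        "opcode": word & 0x7F,          # bits 6:0   - instruction type
        "rd":     (word >> 7)  & 0x1F,  # bits 11:7  - destination register
        "funct3": (word >> 12) & 0x7,   # bits 14:12 - operation selector
        "rs1":    (word >> 15) & 0x1F,  # bits 19:15 - first source register
        "rs2":    (word >> 20) & 0x1F,  # bits 24:20 - second source register
        "funct7": (word >> 25) & 0x7F,  # bits 31:25 - extra operation bits
    }

# 0x007302B3 encodes "add x5, x6, x7"
print(decode_rtype(0x007302B3))
# {'opcode': 51, 'rd': 5, 'funct3': 0, 'rs1': 6, 'rs2': 7, 'funct7': 0}
```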
Now we are ready to turn our computer on and start running stuff. Execution of an instruction actually has several basic parts that are broken down through the many stages of a processor.
The first step is to fetch the instruction from memory into the CPU to begin execution. In the second step, the instruction is decoded so the CPU can figure out what type of instruction it is. There are many types including arithmetic instructions, branch instructions, and memory instructions. Once the CPU knows what type of instruction it is executing, the operands for the instruction are collected from memory or internal registers in the CPU. If you want to add number A to number B, you can’t do the addition until you actually know the values of A and B. Most modern processors are 64-bit which means that the size of each data value is 64 bits.
64-bit refers to the width of a CPU register, data path, and/or memory address. For everyday users, that means how much information a computer can handle at a time, and it is best understood against its smaller architectural cousin, 32-bit. A 64-bit architecture can handle twice as many bits of information at a time (64 bits versus 32).
After the CPU has the operands for the instruction, it moves to the execute stage where the operation is done on the input. This could be adding the numbers, performing a logical manipulation on the numbers, or just passing the numbers through without modifying them. After the result is calculated, memory may need to be accessed to store the result or the CPU could just keep the value in one of its internal registers. After the result is stored, the CPU will update the state of various elements and move on to the next instruction.
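Putting the fetch, decode, execute, and write-back steps together, here is a toy Python model of that loop. The instruction format, registers, and memory addresses are invented purely for illustration and do not correspond to any real ISA.

```python
# Toy model of the fetch / decode / execute / write-back loop described above.

registers = [0] * 8
memory = {0x40: 11, 0x44: 31}

program = [
    ("LOAD",  1, 0x40, None),  # r1 <- mem[0x40]
    ("LOAD",  2, 0x44, None),  # r2 <- mem[0x44]
    ("ADD",   3, 1, 2),        # r3 <- r1 + r2
    ("STORE", 3, 0x48, None),  # mem[0x48] <- r3
]

pc = 0
while pc < len(program):
    op, a, b, c = program[pc]                       # fetch + decode
    if op == "LOAD":
        registers[a] = memory[b]                    # collect operand from memory
    elif op == "ADD":
        registers[a] = registers[b] + registers[c]  # execute
    elif op == "STORE":
        memory[b] = registers[a]                    # write result back to memory
    pc += 1                                         # move on to the next instruction

print(memory[0x48])  # 42
```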
This description of the stages is, of course, a huge simplification and most modern processors will break these few stages up into 20 or more smaller stages to improve efficiency. That means that although the processor will start and finish several instructions each cycle, it may take 20 or more cycles for any one instruction to complete from start to finish. This model is typically called a pipeline since, like a physical pipeline, it takes a while for liquid to travel all the way through, but once it's full, you get a constant output.
Example of 4-stage pipeline. The colored boxes represent instructions independent of each other.
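A quick back-of-the-envelope calculation shows why pipelining is worth the trouble: with S stages and N independent instructions, a pipelined core needs roughly S + N - 1 cycles instead of S × N, because after the initial fill it finishes one instruction per cycle. The numbers below are purely illustrative.

```python
# Rough cycle counts for a pipelined vs. unpipelined core (ideal case,
# independent instructions, no stalls).

def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    return n_instructions * n_stages

def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    return n_stages + n_instructions - 1   # fill the pipe once, then one per cycle

n, s = 1000, 4
print(unpipelined_cycles(n, s), "vs", pipelined_cycles(n, s))  # 4000 vs 1003
```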
The whole cycle that an instruction goes through is a very tightly choreographed process, but not all instructions may finish at the same time. For example, addition is very fast while division or loading from memory may take hundreds of cycles. Rather than stalling the entire processor while one slow instruction finishes, most modern processors execute out-of-order. That means they will determine which instruction would be the most beneficial to execute at a given time and buffer other instructions that aren't ready. If the current instruction isn't ready yet, the processor may jump forward in the code to see if anything else is ready.
In addition to out-of-order execution, typical modern processors employ what is called a superscalar architecture. This means that the processor is executing multiple instructions at once in each stage of the pipeline. It may also be waiting on hundreds more to begin their execution. In order to be able to execute many instructions at once, processors will have several copies of each pipeline stage inside. If a processor sees that two instructions are ready to be executed and there is no dependency between them, rather than wait for them to finish separately, it will execute them both at the same time. One common implementation of this is called Simultaneous Multithreading (SMT), also known as Hyper-Threading. Intel and AMD processors currently support two-way SMT while IBM has developed chips that support up to eight-way SMT.
To accomplish this carefully choreographed execution, a processor has many extra elements in addition to the basic core. There are hundreds of individual modules in a processor that each serve a specific purpose, but we’ll just go over the basics. The two biggest and most beneficial are the caches and branch predictor. Additional structures that we won’t cover include things like reorder buffers, register alias tables, and reservation stations.
The purpose of caches can often be confusing since they store data just like RAM or an SSD. What sets caches apart though is their access latency and speed. Even though RAM is extremely fast, it is orders of magnitude too slow for a CPU. It may take hundreds of cycles for RAM to respond with data and the processor would be stuck with nothing to do. If the data isn’t in the RAM, it can take tens of thousands of cycles for data on an SSD to be accessed. Without caches, our processors would grind to a halt.
Processors typically have three levels of cache that form what is known as a memory hierarchy. The L1 cache is the smallest and fastest, the L2 is in the middle, and L3 is the largest and slowest of the caches. Above the caches in the hierarchy are small registers that store a single data value during computation. These registers are the fastest storage devices in your system by orders of magnitude. When a compiler transforms a high-level program into assembly language, it will determine the best way to utilize these registers.
When the CPU requests data from memory, it will first check to see if that data is already stored in the L1 cache. If it is, the data can be quickly accessed in just a few cycles. If it is not present, the CPU will check the L2 and subsequently search the L3 cache. The caches are implemented in a way that they are generally transparent to the core. The core will just ask for some data at a specified memory address and whatever level in the hierarchy that has it will respond. As we move to subsequent stages in the memory hierarchy, the size and latency typically increase by orders of magnitude. At the end, if the CPU can’t find the data it is looking for in any of the caches, only then will it go to the main memory (RAM).
On a typical processor, each core will have two L1 caches: one for data and one for instructions. The L1 caches are typically around 100 kilobytes total and size may vary depending on the chip and generation. There is also typically an L2 cache for each core although it may be shared between two cores in some architectures. The L2 caches are usually a few hundred kilobytes. Finally, there is a single L3 cache that is shared between all the cores and is on the order of tens of megabytes.
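The sketch below walks through that lookup order in Python. The capacities and cycle counts are rough illustrative figures in line with the ranges above, not measurements of any particular CPU.

```python
# Illustrative model of a memory-hierarchy lookup: check each level in turn,
# paying that level's latency, until the data is found.

hierarchy = [
    ("L1",  64 * 1024,        4),    # a few cycles
    ("L2",  512 * 1024,       12),
    ("L3",  32 * 1024 * 1024, 40),
    ("RAM", None,             300),  # main memory, hundreds of cycles
]

def access(address, cached_in):
    """Return the cycles paid for this access; cached_in lists levels holding the data."""
    total = 0
    for name, _size, latency in hierarchy:
        total += latency
        if name == "RAM" or name in cached_in:
            print(f"address {address:#x} served by {name} after {total} cycles")
            return total

access(0x1000, cached_in={"L1"})   # fast: hits in L1
access(0x2000, cached_in={"L3"})   # slower: misses L1/L2, hits L3
access(0x3000, cached_in=set())    # slowest: goes all the way to RAM
```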
When a processor is executing code, the instructions and data values that it uses most often will get cached. This significantly speeds up execution since the processor does not have to constantly go to main memory for the data it needs. We will talk more about how these memory systems are actually implemented in the second and third installment of this series.
Besides caches, one of the other key building blocks of a modern processor is an accurate branch predictor. Branch instructions are similar to “if” statements for a processor. One set of instructions will execute if the condition is true and another will execute if the condition is false. For example, you may want to compare two numbers and if they are equal, execute one function, and if they are different, execute another function. These branch instructions are extremely common and can make up roughly 20% of all instructions in a program.
On the surface, these branch instructions may not seem like an issue, but they can actually be very challenging for a processor to get right. Since at any one time, the CPU may be in the process of executing ten or twenty instructions at once, it is very important to know which instructions to execute. It may take 5 cycles to determine if the current instruction is a branch and another 10 cycles to determine if the condition is true. In that time, the processor may have started executing dozens of additional instructions without even knowing if those were the correct instructions to execute.
To get around this issue, all modern high-performance processors employ a technique called speculation. What this means is that the processor will keep track of branch instructions and guess as to whether the branch will be taken or not. If the prediction is correct, the processor has already started executing subsequent instructions so this provides a performance gain. If the prediction is incorrect, the processor stops execution, removes all incorrect instructions that it has started executing, and starts over from the correct point.
These branch predictors are some of the earliest forms of machine learning since the predictor learns the behavior of the branches as it goes. If it predicts incorrectly too many times, it will begin to learn the correct behavior. Decades of research into branch prediction techniques have resulted in accuracies greater than 90% in modern processors.
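One of the simplest schemes in this family is the classic two-bit saturating counter, which only flips its prediction after being wrong twice in a row. The Python sketch below is a textbook version of that idea; real predictors are far more elaborate.

```python
# A minimal two-bit saturating-counter branch predictor.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move toward "taken" or "not taken", saturating at the ends.
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

# A loop branch that is taken 9 times and then falls through:
outcomes = [True] * 9 + [False]
p = TwoBitPredictor()
correct = 0
for taken in outcomes:
    if p.predict() == taken:
        correct += 1
    p.update(taken)

print(f"{correct}/{len(outcomes)} predictions correct")  # 7/10 for this simple example
```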
While speculation offers immense performance gains since the processor can execute instructions that are ready instead of waiting in line on busy ones, it also exposes security vulnerabilities. The famous Spectre attack exploits bugs in branch prediction and speculation. The attacker would use specially crafted code to get the processor to speculatively execute code that would leak memory values. Some aspects of speculation have had to be redesigned to ensure data could not be leaked and this resulted in a slight drop in performance.
The architecture used in modern processors has come a long way in the past few decades. Innovations and clever design have resulted in more performance and a better utilization of the underlying hardware. CPU makers are very secretive about the technologies in their processors though, so it’s impossible to know exactly what goes on inside. With that being said, the fundamentals of how computers work are standardized across all processors. Intel may add their secret sauce to boost cache hit rates or AMD may add an advanced branch predictor, but they both accomplish the same task.
Part 2: CPU Design Process
(schematics, transistors, logic gates, clocking)
Now that we know how processors work at a high level, it’s time to dig inside to understand how the internal components are designed. This article is the second part in our series on processor design. If you haven’t read part one yet, you’ll want to go over that first or else some of the concepts in here won’t make sense.
As you probably know, processors and most other digital technology are made of transistors. The simplest way to think of a transistor is as a controllable switch with three pins. When the gate is on, electricity is allowed to flow through the transistor. When the gate is off, current can't flow. Just like the light switch on your wall, but much smaller, much faster, and able to be controlled electrically.
There are two main types of transistors used in modern processors: pMOS and nMOS. An nMOS transistor allows current to flow when the gate is charged or set high, and a pMOS transistor allows current to flow through when the gate is discharged or set low. By combining these types of transistors in a complementary way, we can create CMOS logic gates. We won’t get into the nitty gritty details of how transistors physically work in this article, but we’ll touch on it in Part 3 of the series.
A logic gate is a simple device that takes inputs, performs some operation, and outputs a result. For example, an AND gate will turn its output on if and only if all the inputs to the gate are on. An inverter or NOT gate will turn its output on if the input is off. We can combine these two to create a NAND or not-and gate, which turns its output off if and only if all the inputs are on. There are other gates with different logic functionality like OR, NOR, XOR, and XNOR.
Below we can see how two basic gates are designed from transistors: an inverter and a NAND gate. In the inverter, there is a pMOS transistor on top connected to the power line and an nMOS transistor on the bottom connected to ground. The pMOS transistors are drawn with a small circle connected to their gate. Since we said that pMOS devices conduct when the input is off and nMOS devices conduct when the input is on, it is easy to see that the signal at Out will always be the opposite of the signal at In. Looking at the NAND gate, we see that it requires four transistors and that the output will be on as long as at least one of the inputs is off. Connecting transistors to form simple networks like this is the same process used to design more advanced logic gates and other circuitry inside processors.
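Setting the transistor level aside for a moment, we can model gate behavior as simple boolean functions to see how the primitives compose. This is only a logic-level sketch in Python; real gates switch voltages, not integers.

```python
# Logic-level model of a few basic gates and their composition.

def NOT(a): return 1 - a
def AND(a, b): return a & b
def NAND(a, b): return NOT(AND(a, b))   # output is off only when both inputs are on
def OR(a, b): return a | b
def XOR(a, b): return a ^ b

# Print the truth table for a couple of gates.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "-> AND:", AND(a, b), " NAND:", NAND(a, b), " XOR:", XOR(a, b))
```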
With building blocks as simple as logic gates, it can be difficult to see how they are transformed into a functioning computer. This design process involves combining several gates to create a small device that may perform a simple function. You can then connect many of these devices to form something that performs a slightly more advanced function. The process of combining individual components to create a working design is exactly what is used today to create modern chips. The only difference is that a modern chip has billions of transistors.
For a quick example, we'll look at a basic 1-bit full adder. It takes three inputs – A, B, and Carry-In – and produces two outputs – Sum and Carry-Out. The basic design uses five logic gates, and they can be linked together to create an adder of any size. Modern designs improve on this by optimizing some of the logic and carry signals, but the fundamentals are still the same.
The Sum output is on if either A or B is on but not both, or if there is a carry-in signal and A and B are either both on or both off. The Carry-Out is a bit more complex: it is active when A and B are both on, or when there is a carry-in and at least one of A or B is on. To connect multiple 1-bit adders to form a wider adder, we just need to connect the carry-out of the previous bit to the carry-in of the current bit. The more complex the circuits get, the messier the logic gets, but this is the simplest way to add two numbers. Modern processors use more complex adders, but those designs are too complicated for an overview like this. In addition to adders, processors also contain units for division, multiplication, and floating-point versions of all of these operations.
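Here is that 1-bit full adder written out as a small Python sketch and chained into a ripple-carry adder, using the same equations as the description above.

```python
# 1-bit full adder: Sum = A xor B xor Cin, Cout = (A and B) or (Cin and (A or B)).

def full_adder(a: int, b: int, cin: int):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a | b))
    return s, cout

def ripple_add(a: int, b: int, width: int = 8) -> int:
    carry, result = 0, 0
    for i in range(width):                        # carry-out of bit i feeds bit i+1
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        s, carry = full_adder(bit_a, bit_b, carry)
        result |= s << i
    return result

print(ripple_add(25, 17))  # 42
```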
Combining a series of gates like this to perform some function on inputs is known as Combinational Logic. This type of logic isn’t the only thing found in computers though. It wouldn’t be very useful if we couldn’t store data or keep track of the state of anything. To do this, we need sequential logic which has the ability to store data.
Sequential logic is built by carefully connecting inverters and other logic gates such that their outputs feed back to the inputs of the gates. These feedback loops are used to store one bit of data and are known as Static RAM or SRAM. It is called static RAM, as opposed to the dynamic RAM in DRAM, because the data being stored is always directly connected to positive voltage or ground.
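The classic textbook example of such a feedback loop is the SR latch built from two cross-coupled NOR gates. It is not the exact SRAM cell shown next, but the small Python model below shows how feedback between two gates lets a circuit hold a bit even after its inputs go idle.

```python
# Cross-coupled NOR gates forming an SR latch (illustrative model only).

def NOR(a: int, b: int) -> int:
    return 1 if (a == 0 and b == 0) else 0

def sr_latch(s: int, r: int, q: int, q_bar: int):
    for _ in range(4):                           # iterate until the feedback settles
        q, q_bar = NOR(r, q_bar), NOR(s, q)
    return q, q_bar

q, q_bar = sr_latch(s=1, r=0, q=0, q_bar=1)      # "set" -> the latch stores a 1
print(q, q_bar)                                  # 1 0
q, q_bar = sr_latch(s=0, r=0, q=q, q_bar=q_bar)  # inputs idle -> the bit is held
print(q, q_bar)                                  # still 1 0
```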
A standard way to implement a single bit of SRAM is with the six transistors shown below. The top signal, marked WL for Word Line, is the address, and when it is enabled, the data stored in this 1-bit cell is sent to the Bit Line, marked BL. The BLB output, known as Bit Line Bar, is just the inverted value of the Bit Line. You should be able to recognize the two types of transistors, and that M3 and M1 form one inverter while M4 and M2 form the other.
SRAM is what is used to build the super fast caches and registers inside processors. It is very stable, but requires six to eight transistors to store each bit of data. This makes it extremely expensive to produce in terms of cost, complexity, and chip area compared to DRAM. Dynamic RAM, on the other hand, stores data in a tiny capacitor rather than using logic gates. It is called dynamic because the voltage of the capacitor can change dynamically since it is not connected to power or ground. There is just a single transistor that is used to access the data stored in the capacitor.
Because DRAM only requires a single transistor per bit and the design is very scalable, it can be packed densely and cheaply. One drawback to DRAM is that the charge in the capacitor is so small that it needs to be refreshed constantly. This is why when you turn off your computer, the capacitors all drain and the data in your RAM is lost.
Companies like Intel, AMD, and Nvidia certainly don’t release schematics for how their processors work, so it’s impossible to show full diagrams like this for a modern processor. However, this simple adder should give you a good idea of how even the most complex parts of a processor can be broken down into logic gates, storage elements, and then transistors.
Now that we know how some of the components of a processor are constructed, we need to figure out how to connect everything up and synchronize it. All of the key components in a processor are connected to a clock signal. This signal alternates between high and low at a predefined rate known as the frequency. The logic inside a processor typically switches values and performs calculations when the clock goes from low to high. By synchronizing everything together, we can ensure that data always arrives at the correct time so that there aren't any glitches in the processor.
You may have heard that you can increase the clock speed of a processor, known as overclocking, to increase its performance. This performance gain comes from switching the transistors and logic inside a processor faster than it was designed for. Since there are more cycles per second, more work can get done and the processor will have higher performance. This is true up to a certain point though. Modern processors typically run between 3.0GHz and 4.5GHz and that hasn't seemed to change in the past decade. Just like a metal chain is only as strong as its weakest link, a processor can only run as fast as its slowest part. By the end of each clock cycle, every single component in a processor needs to have finished its operation. If any parts aren't done yet, the clock is too fast and the processor won't work. Designers call this slowest part the Critical Path and it is what sets the maximum frequency a processor can run at. Above a certain frequency, the transistors simply cannot switch fast enough and will start glitching or producing incorrect outputs.
By increasing the supply voltage to a processor, we can speed up the switching of the transistors, but that only works up to a certain point as well. If we apply too much voltage, we risk burning up the processor. When we increase the frequency or voltage of a processor, it will always generate more heat and consume more power. This is because processor power is directly proportional to frequency and proportional to the square of voltage. To determine the power consumption of a processor, we usually think of each transistor as a small capacitor that must be charged or discharged whenever it changes value.
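As a rough illustration of that relationship, dynamic switching power scales with capacitance times frequency times the square of the supply voltage. The quick Python calculation below uses made-up scaling factors purely to show why a modest overclock plus a voltage bump costs disproportionately more power.

```python
# Relative dynamic power: P ~ C * V^2 * f (times an activity factor).
# Only the scaling matters here; the inputs are illustrative, not measured.

def relative_power(v_scale: float, f_scale: float) -> float:
    return (v_scale ** 2) * f_scale

# A 10% higher clock with a 5% higher supply voltage costs roughly:
print(f"{relative_power(1.05, 1.10):.2f}x the original power")  # ~1.21x
```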
Power delivery is such an important part of a processor that in some cases, half of the physical pins on a chip may be just for power or ground. Some chips may pull more than 150 amps at full load and all that current has to be managed extremely carefully. To put this amount of power into perspective, a CPU generates more heat per unit area than a nuclear reactor.
The clock in modern processors accounts for roughly 30-40% of its total power since it is so complex and must drive so many different devices. To conserve energy, most lower-power designs will turn off portions of the chip when they are not in use. This can be done by turning off the clock, known as Clock Gating, or turning off the power, known as Power Gating.
Clocks present another challenge to designing a processor since as their frequencies keep increasing, the laws of physics start getting in the way. Even though the speed of light is extremely fast, it is not fast enough for high-performance processors. If you were to connect the clock to one end of the chip, by the time the signal reached the other end, it would be out of sync by a considerable amount. To keep all portions of the chip in time, the clock is distributed using what is called an H-Tree. This is a structure that ensures all endpoints are the exact same distance from the center.
It may seem extremely tedious and complex to design every single transistor, clock signal, and power connection in a chip, and that would certainly be true. Even though companies like Intel, Qualcomm, and AMD have thousands of engineers, it would not be possible for them to manually design every aspect of a chip. To put together chips on such a scale, they use a variety of advanced tools to generate the designs and schematics for them. These tools will typically take a high level description of what the component should do, and determine an optimal hardware configuration to meet those requirements. There has been a recent trend towards a technique called High Level Synthesis which allows developers to specify the functionality they want in code, and then have computers figure out how to optimally achieve it in hardware.
Just like you can define computer programs through code, designers can also define hardware through code. Languages such as Verilog and VHDL allow hardware designers to express the functionality of whatever circuit they are making. Simulation and verification is done on these designs and if everything passes, they can be synthesized down into the specific transistors that will make up the circuit. While verification may not seem as flashy as designing a new cache or core, it is considerably more important. For every design engineer that a company employs, there may be five or more verification engineers.
Verification of a new design often takes much more time and money than building the actual chip itself. Companies spend so much time and money on verification because once a chip goes into production, there’s no way to fix it. With software you can just issue a patch, but hardware doesn’t work that way. For example, Intel had a bug in the floating point division unit of some Pentium chips and it ended up costing them the equivalent of $2 billion today.
It can be difficult to wrap your mind around how one chip can have several billion transistors and what they all do. When you break down the chip into its individual internal components, it gets a bit easier. Transistors make logic gates, logic gates are combined into functional units that perform a specific task, and these functional units are connected together to form the computer architecture we talked about in Part 1.
Much of the design work is automated, but this should give you a new appreciation for just how complex that new CPU you bought is.
This second installment in our series covered the CPU design process. We talked about transistors, logic gates, power and clock delivery, design synthesis, and verification. In Part 3 we'll see what is required to physically build a chip. Each company likes to brag about how advanced its fabrication process is (Intel 10nm, Apple and AMD 7nm, etc.), but what do those numbers actually mean? Stay tuned.
Part 3: Laying Out and Physically Building the Chip
(VLSI and silicon fabrication)
This is the third installment in our CPU design series. In the first part, we covered computer architecture and how a processor works from a high level. The second part took a look at how some of the individual components of a chip were designed and implemented. Part three goes one step further to see how architectural and schematic designs are turned into physical chips.
How do you transform a pile of sand into an advanced processor? Let’s find out.
As we discussed before, processors and all other digital logic are made out of transistors. A transistor is an electronically controlled switch that we can turn on or off by applying or removing voltage from the gate. We discussed how there are two main types of transistors: nMOS devices which allow current when the gate is on, and pMOS devices which allow current when the gate is off. The base structure of a processor that the transistors are built into is silicon. Silicon is known as a semiconductor because it doesn’t fully conduct or insulate; it’s somewhere in the middle.
To turn a wafer of silicon into a useful circuit by adding transistors, fabrication engineers use a process called doping. The doping process involves adding carefully selected impurities to the base silicon substrate to change its conductivity. The goal here is to change the way electrons behave so that we can control them. Just like there are two types of transistors, there are two main corresponding types of doping.
The fabrication process of a wafer before the chips are packaged. Photo Credit: Evan Lissoos
If we add a precisely controlled amount of electron donor elements like arsenic, antimony, or phosphorus, we can create an n-type region. Since the silicon area where these elements were applied now has an excess of electrons, it will become negatively charged. This is where the name n-type and the "n" in nMOS comes from. By adding electron acceptor elements like boron, indium, or gallium to the silicon, we can create a p-type region which is positively charged. This is where the "p" in p-type and pMOS comes from. The specific processes used to add these impurities to the silicon are known as Ion Implantation and Diffusion, and they are a bit beyond the scope of this article.
Now that we can control the electrical conductivity of certain parts of our silicon, we can combine the properties of multiple regions to create transistors. The transistors used in integrated circuits, known as MOSFETs (Metal Oxide Semiconductor Field Effect Transistors), have four connections. The current we are controlling flows through the Source and Drain. In an n-channel device it typically flows in the drain and out the source, while in a p-channel device it typically flows in the source and out the drain. The Gate is the switch used to turn the transistor on and off. Finally, the Body of the device isn't relevant to processors, so we won't discuss it here.
The physical structure of an inverter in silicon. Each colored region has different conductivity properties. Note how the different silicon components correspond to the schematic on the right
The technical details of how transistors work and how the different regions interact are enough to fill a graduate-level college course, so we'll just touch on the basics. A good analogy for how they work is a drawbridge over a river. The cars, the electrons in our transistor, would like to flow from one side of the river to the other, the source and drain of our transistor. Using an nMOS device as an example, when the gate is not charged, the drawbridge is up and the electrons can't flow across the channel. When we lower the drawbridge, we form a road over the river and the cars can move freely. The same thing happens in a transistor. Charging the gate forms a channel between the source and drain, allowing current to flow.
To be able to precisely control where the different p and n regions of the silicon are, manufacturers like Intel and TSMC use a process called photolithography. This is an extremely complex multi-step process and companies spend billions of dollars perfecting it to be able to build smaller, faster, and more energy efficient transistors. Imagine a super-precise printer that can be used to draw the patterns for each region onto the silicon.
The process of building transistors into a chip starts with a pure silicon wafer. It is then heated in a furnace to grow a thin layer of silicon dioxide on the top of the wafer. A light-sensitive photoresist polymer is then applied over the silicon dioxide. By shining light at certain frequencies onto the photoresist, we can strip the photoresist in the areas we want to dope. This is the lithography step and is similar to how printers work to apply ink to certain areas of a page, just at a much smaller scale.
The wafer is etched with hydrofluoric acid to dissolve the silicon dioxide where the photoresist was removed. The photoresist is then removed, leaving behind just the oxide layer beneath. The doping ions can then be applied to the wafer and will only implant themselves where there are gaps in the oxide.
This process of masking, imaging, and doping is repeated dozens of times to slowly build up each feature level in a semiconductor. Once the base silicon level is done, metal connections will be fabricated on top to connect the different transistors together. We’ll cover more about these connections and metal layers in a bit.
Of course, chip makers don't just do this process of making transistors one at a time. When a new chip is designed, they will generate masks for each step in the fabrication process. These masks will contain the locations of each element of the billions of transistors on a chip. Multiple chips are grouped together and fabricated at once on a single wafer.
Once a wafer is fabricated, the individual dies are sliced up and packaged. Depending on the size of a chip, each wafer may fit hundreds or more chips. Typically, the more powerful the chip being produced, the larger the die will be and the fewer chips the manufacturer will be able to get from each wafer.
It’s easy to think that we should just make massive chips that are super powerful and with hundreds of cores, but that isn’t possible. Currently, the single biggest factor preventing us from making bigger and bigger chips are defects in the manufacturing process. Modern chips have billions of transistors and if a single part of one is broken, the whole chip may need to be thrown away. As we increase the size of processors, the chance that a chip will be faulty increases.
The actual yields that companies get from their fabrication processes are closely held secrets, but anywhere from 70% to 90% is a good estimate. It is common for companies to over-engineer their chips with extra functionality since they know some parts won’t work. For example, Intel may design an 8-core chip but only sell it as a 6-core chip since they estimate that one or two cores may be broken. Chips with an unusually low number of defects are usually set aside to be sold at a higher price in a process known as binning.
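A common first-order way to see why larger dies yield worse is the classic Poisson yield model, where the fraction of defect-free dies is e^(-defect density × die area). The defect density used below is an assumed, illustrative value, not any foundry's real figure.

```python
# Poisson yield model: probability a die has zero defects shrinks
# exponentially with die area.
import math

def poisson_yield(die_area_mm2: float, defects_per_mm2: float = 0.001) -> float:
    return math.exp(-defects_per_mm2 * die_area_mm2)

for area in (100, 200, 400, 800):
    print(f"{area} mm^2 die -> {poisson_yield(area):.0%} of dies defect-free")
# Larger dies are dramatically more likely to contain at least one defect.
```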
One of the biggest marketing terms associated with chip fabrication is the feature size. For example, Intel is working towards a 10nm process, AMD is using a 7nm process for some GPUs, and TSMC has started work on a 5nm process. What do all of these numbers mean though? Traditionally, the feature size represents the minimum width between the drain and source of a transistor. As technology has advanced, we’ve been able to shrink our transistors to be able to fit more and more on a single chip. As transistors get smaller, they also become faster and faster.
When looking at these numbers, it's important to note that some companies may base their process size on different dimensions than the standard width. This means that different-sized processes from separate companies may actually result in the same sized transistor. On the other hand, not all transistors in a given process are the same size either. Designers may choose to make some transistors bigger than others based on certain trade-offs. For a given design process, a smaller transistor will be faster since it takes less time to charge and discharge the gate. However, smaller transistors can only drive a very small number of outputs. If a certain piece of logic is going to drive something that requires a lot of power, such as an output pin, it will need to be made much bigger. These output transistors may be orders of magnitude larger than the internal logic transistors.
A die shot of a recent AMD Zen processor. Several billion transistors make up this design.
Designing and building the transistors is only half of the chip though. We need to build wires to connect everything according to the schematic. These connections are made using metal layers above the transistors. Imagine a multi-level highway interchange with on-ramps, off-ramps, and different roads crossing each other. That's exactly what is going on inside a chip, albeit on a much smaller scale. Different processes will have different numbers of metal interconnect layers above the transistors. As transistors get smaller, more metal layers are needed to be able to route all the signals. TSMC's upcoming 5nm process has a reported 15 metal layers. Imagine a 15-level vertical highway interchange and that will give you an understanding for just how complex the routing is inside a chip.
The microscope image below shows the lattice formed by seven metal layers. Each layer is flat and as they go higher, the layers get bigger to help reduce resistance. In between each layer are small metal cylinders known as vias that are used to jump up to a higher layer. Each layer typically alternates in direction from the one below it to help reduce unwanted capacitances. The odd metal layers may be used to make horizontal connections while the even layers may be used to make vertical connections.
As you can imagine, all these signals and metal layers get incredibly difficult to manage very quickly. To help solve this issue, computer programs are used to automatically place and route the transistors. Depending on how advanced the design is, programs can even translate functions in high-level C code down to the physical locations of every wire and transistor. Typically, chip makers will let computers generate most of the design automatically and then they will go through and optimize certain critical sections by hand.
When companies want to build a new chip, they will start their design with standard cells that the fabrication company provides. For example, Intel or TSMC will provide designers with basic parts like logic gates or memory cells. The designers can then combine these standard cells into whatever chip they want to build. They will then send the foundry, the place where the raw silicon is turned into functioning chips, the layouts of the chip’s transistors and metal layers. These layouts are turned into masks which are used in the fabrication process we covered above. Next we’ll see what this design process might look like for an extremely basic chip.
First we see the layout for an inverter which is a standard cell. The slashed green rectangle at the top is the pMOS transistor and the transparent green rectangle at the bottom is the nMOS transistor. The vertical red wire is the polysilicon gate, the blue areas are metal 1, and the purple areas are metal 2. The input A comes in on the left and the output Y goes out on the right. The power and ground connections are made at the top and bottom on metal 2.
Combining several gates, here we have a basic 1-bit arithmetic unit. This design can add, subtract, and perform logical operations on two 1-bit inputs. The slashed blue wires that go vertically are on the metal 3 layer. The slightly larger squares on the ends of the wires are vias that connect two layers.
Finally, putting together many cells and about 2,000 transistors, we have a basic 4-bit processor with 8-bytes of RAM on four metal layers. Looking at how complex this is, one can only imagine the difficulty of designing a 64-bit CPU with megabytes of cache, multiple cores, and 20+ pipeline stages. Given that today’s high-performance CPUs can have upwards of 5-10 billion transistors and a dozen metal layers, it is not an exaggeration to say that they are literally millions of times more complex than this.
This should give you an appreciation for why your new CPU was an expensive piece of tech or why it takes AMD and Intel so long in between product releases. It typically takes anywhere from 3 to 5 years for a new chip to go from the drawing board to the market. That means that today’s fastest chips are made with several year old technology and that we won’t see chips with today’s state of the art fabrication technology for many years.
With that, we are all done with our deep dive into how processors are built.
In this third part of the series we explored the physics of how transistors work, how their individual components are built in silicon, and how they are connected to create useful circuits and chips.
In the fourth and final installment we'll return from the physical domain and look at current trends in the industry. What are researchers working on now to make the next generation of computers even faster?
Part 4: Current Trends and Future Hot Topics in Computer Architecture (Sea of Accelerators, 3D integration, FPGAs, Near Memory Computing)
Despite continuous improvements and incremental upgrades taking place with each new generation, processors haven’t had any industry shifting advancements for a long time. The switch from vacuum tubes to transistors was huge. The switch from individual components to integrated circuits was huge. After that though, there haven’t been other similar paradigm shifts on that scale.
Yes, transistors have gotten smaller, chips have gotten faster, and performance has increased hundredfold, but we’re starting to see diminishing returns…
This is the fourth and last installment in our CPU design series, giving you an overview on computer processor design and manufacturing. Starting from the top down, we looked at how computer code is compiled into assembly language and how that is converted into binary instructions that the CPU can interpret. We looked at how processors are architected and process instructions. Then we looked at the various structures that make up a CPU.
Going a bit deeper, we saw how those structures are built and how the billions of transistors work together inside a processor. We looked at how processors are physically made from raw silicon. We learned the basics of semiconductors and what the inside of a chip really looks like. If you missed any of this, be sure to go back and read the earlier parts before continuing.
Moving on to part four. Since companies don’t share their research or details of their current technology, it is hard to get a sense of exactly what is in the CPU in your computer. What we can do however, is look at current research and where the industry is headed.
One famous representation of the processor industry is Moore’s Law. This describes how the number of transistors in a chip doubles roughly every 18 months. This was true for a very long time, but is starting to slow down. Transistors are getting so small that we are nearing the limit of what physics will allow. Without a groundbreaking new technology, we’ll need to explore different avenues to achieve future performance boosts.
Moore’s Law over 120 Years
What makes this graph all the more interesting is that the last 7 (most recent) data points are all Nvidia GPUs, not general purpose CPUs.
One direct result of this breakdown is that companies have started to increase core count rather than frequency to improve performance. This is the reason we are seeing octa-core processors becoming mainstream rather than 10GHz dual core chips. There simply isn’t a whole lot of room left for growth beyond just adding more cores.
On a completely different note, Quantum Computing is an area that does promise lots of room for growth in the future. I'm no expert on this, and since the technology is still being developed, there aren't many real "experts" anyway. To dispel any myths, quantum computing isn't something that will give you 1,000fps in a lifelike game render or anything like that. For now, the main advantage of quantum computers is that they allow for more advanced algorithms that were previously not feasible.
One of IBM’s Quantum Computer prototypes
In a traditional computer, a transistor is either on or off which represents a 0 or a 1. In a quantum computer, superposition is possible which means the bit can be both 0 and 1 at the same time. With this new capability, computer scientists can develop new methods of computation and will be able to solve problems we don’t currently have the compute capabilities for. It’s not so much that quantum computers are faster, it’s that they are a new model of computation that will let us solve different types of problems.
The technology for this is still a decade or two away from the mainstream, so what are some trends we are starting to see in real processors right now? There are dozens of active research areas but I’ll touch on a few that are the most impactful in my opinion.
A growing trend that we’ve been impacted by is heterogeneous computing. This is the method of including multiple different computing elements in a single system. Most of us benefit from this in the form of a dedicated GPU in our systems. A CPU is very customizable and can perform a wide variety of computations at a reasonable speed. A GPU, on the other hand, is designed specifically to perform graphics calculations like matrix multiplication. It is really good at that and is orders of magnitude faster than a CPU at those types of instructions. By offloading certain graphics calculations from the CPU to the GPU, we can accelerate the workload. It’s easy for any programmer to optimize software by tweaking an algorithm, but optimizing hardware is much more difficult.
But GPUs aren’t the only area where accelerators are becoming common. Most smartphones have dozens of hardware accelerators designed to speed up very specific tasks. This computing style is known as a Sea of Accelerators and examples include cryptography processors, image processors, machine learning accelerators, video encoders/decoders, biometric processors, and more.
As workloads get more and more specialized, hardware designers are including more and more accelerators into their chips. Cloud providers like AWS have started providing FPGA cards for developers to accelerate their workloads in the cloud. While traditional computing elements like CPUs and GPUs have a fixed internal architecture, an FPGA is flexible. It’s almost like programmable hardware that can be configured to whatever your computing needs are.
If you want to do image recognition, you can implement those algorithms in hardware. If you want to simulate how a new hardware design will perform, you can test it on the FPGA before actually building it. An FPGA offers more performance and power efficiency than GPUs, but still less than an ASIC (application specific integrated circuit). Other companies like Google and Nvidia are developing dedicated machine learning ASICs to speed up image recognition and analysis.
Die shots showing the makeup of several common mobile processors
Looking at the die shots of some fairly recent processors, we can see that most of the area of the CPU is not actually the core itself. A growing amount is being taken up by accelerators of all different types. This has helped speed up very specialized workloads, in addition to providing huge power savings.
Historically, if you wanted to add video processing to a system, you’d just add a chip to do it. That is hugely inefficient though. Every time a signal has to go out of a chip on a physical wire to another chip, there is a large amount of energy required per bit. On its own, a tiny fraction of a Joule may not seem like a lot, but it can be 3-4 orders of magnitude more efficient to communicate within the same chip versus going off chip. We have seen the growth of ultra low power chips thanks to the integration of these accelerators into the CPUs themselves.
Accelerators aren’t perfect though. As we add more of them to our designs, chips become less flexible and start to sacrifice overall performance for peak performance in certain workloads. At some point, the whole chip just becomes a collection of accelerators and then it isn’t a useful CPU anymore. The tradeoff between specialized performance and general performance is always being fine tuned. This disconnect between generalized hardware and specific workloads is known as the specialization gap.
While some think we may be at the peak of a GPU / Machine Learning bubble, we can likely expect more of our computations to be offloaded to specialized accelerators. As the cloud and AI continue to grow, GPUs appear to be our best solution so far to achieve the massive amounts of compute needed.
Another area where designers are looking for more performance is memory. Traditionally, reading and writing values has been one of the biggest bottlenecks for processors. While fast, large caches can help, reading from RAM or your SSD can take tens of thousands of clock cycles. Because of this, engineers often view memory access as more expensive than the actual computation itself. If your processor wants to add two numbers, it first needs to calculate the memory addresses where the numbers are stored, find out what level of the memory hierarchy has the data, read the data into registers, perform the computation, calculate the address of the destination, and write back the value to wherever it is needed. For simple instructions that may only take a cycle or two to complete, this is extremely inefficient.
A novel idea that has seen a lot of research is a technique called Near Memory Computing. Rather than fetching small bits of data from memory to bring to the fast processor for compute, researchers are flipping this idea around. They are experimenting with building small processors directly into the memory controllers on your RAM or SSD. By doing the computation closer to the memory, there is the potential for huge energy and time savings since data doesn’t need to be transferred around as much. The compute units have direct access to the data they need since they are right there in the memory. This idea is still in its infancy, but the results look promising.
One of the hurdles to overcome with near memory computing is manufacturing process limitations. As covered in Part 3 the silicon fabrication process is very complex with dozens of steps involved. These processes are typically specialized to produce either fast logic elements or dense storage elements. If you tried to create a memory chip using a compute-optimized fabrication process, you would have extremely poor density in the chip. If you tried to build a processor using a storage fabrication process, you would have very poor performance and timing.
An example of 3D integration showing the vertical connections between transistor layers.
One potential solution to this issue is known as 3D Integration. Traditional processors have one very wide layer of transistors, but this has its limitations. As the name implies, 3D Integration is the process of stacking several layers of transistors on top of each other to improve density and reduce latency. Vertical columns built on different manufacturing processes can then be used to connect between the layers. This idea was proposed a long time ago but the industry lost interest because of major difficulties in its implementation. Recently, we have seen 3D NAND storage technology and a resurgence of this as a field of study.
In addition to physical and architectural changes, one trend that will affect the entire semiconductor industry is a greater focus on security. Until recently, security in our processors was somewhat of an afterthought. This is similar to how the internet, email, and many other systems we rely on were designed with almost no regard for security. Any security that is present was usually bolted on after the fact to make us feel safer. With processors, this has come back to bite companies, particularly Intel.
The Spectre and Meltdown bugs are perhaps the most famous example of designers adding features that greatly speed up a processor, despite not fully understanding the security risks involved. Current processor design is placing a much greater emphasis on security as a key part of the design. With increased security there often comes a performance hit, but given the harm these major security bugs can have, it’s safe to say that we are better off focusing just as much on security as we do on performance.
In previous parts of this series, we touched on techniques such as High Level Synthesis that allow designers to first specify their designs in a high level programming language, and then have advanced algorithms determine the optimal hardware configuration to carry out that function. As design cycles get vastly more expensive each generation, engineers are looking for ways to help speed up their development. Expect this trend of software-aided hardware design to continue to grow in its capabilities down the road.
While it’s impossible to predict the future, the innovative ideas and research fields we have talked about here should serve as a roadmap for what we can expect in future processor designs. What we can say for sure, is that we are nearing the end of regular manufacturing process improvements. To continue to increase performance each generation, designers will need to come up with even more complex solutions.
We hope this four-part series has piqued your interest in the fields of processor design, manufacturing, verification, and more. There’s a never-ending amount of material to cover and each one of these articles could fill an upper level university course if we tried to cover it all. Hopefully you’ve learned something new and have a better understanding of how complex computers are at every level. If you have suggestions for topics you’d like us to take a deep dive into, we’re always open to suggestions.