The Processor Future is Multicore

Senior Editor

It’s not hard to predict that next-generation processors will have smaller details, pack more transistors onto each square millimeter of silicon, and do more clever things, such as powering down underused sections even between keystrokes. They will also have many cores, or CPUs. The latest chips, for example, have up to four cores, and by 2010 that figure could rise to eight or more in engineering desktop computers.

Even now, Intel has given hardware and software developers a peek at a prototype 80-core processor with one teraflop of computational ability. The company, however, has set no launch date. Nvidia Corp. sells a Quadro Plex graphic processor sporting 128 cores. And Cisco Systems Inc., sensing a coming surge in network traffic, is toying with a 188-core router. But to see what tomorrow’s processors hold for engineers, take a closer look at today’s multicore designs.

Start with microarchitecture
The term refers to the design of a chip: how real estate is apportioned among different functions and how sections interact. For instance, Intel’s quad-core has 12 Mbytes of onboard cache. Expect caches to get larger with future rollouts because they improve performance and efficiency by increasing the probability that each execution core can quickly access needed data.

Intel’s recent dual- and quad-core processors, code-named Penryn, pair 6- and 12-Mbyte caches, respectively, with a 24-way set-associative organization for greater efficiency. The code name Penryn refers to a line of processors for desktops, laptop computers, and high-end servers. The line includes dual-core chips for general computing tasks, a dual-core mobile version, quad-core chips for engineering desktops, and a quad-core version for servers.
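To make "24-way set associative" concrete, here is a minimal sketch of how such a cache could be divided into sets and how an address picks one. It assumes a 64-byte line (the cache-line size the article cites later for Penryn) and treats the 6-Mbyte cache as one array; the function names are illustrative, not Intel's.

```python
LINE_BYTES = 64   # assumed line size (the article cites a 64-byte cache line)
WAYS = 24         # associativity quoted in the article
MBYTE = 1024 * 1024

def cache_geometry(capacity_bytes, ways=WAYS, line_bytes=LINE_BYTES):
    """Return (number_of_sets, index_bits) for a set-associative cache."""
    sets = capacity_bytes // (ways * line_bytes)
    index_bits = sets.bit_length() - 1   # exact when sets is a power of two
    return sets, index_bits

def set_index(address, sets, line_bytes=LINE_BYTES):
    """Map an address to the one set whose 24 ways may hold its line."""
    return (address // line_bytes) % sets

sets_6mb, bits_6mb = cache_geometry(6 * MBYTE)
print(sets_6mb, bits_6mb)                  # 4096 sets, 12 index bits
print(set_index(0x12345678, sets_6mb))     # set chosen for a sample address
```

Under these assumptions, a line can sit in any of 24 places within its set, so up to 24 addresses that map to the same set can be cached at once before one has to be evicted.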

The great leap forward for Penryn, according to Intel, is its 45-nm, high-k technology. The 45 nm refers to the approximate width of the transistors. “Some are smaller and some larger, but the industry-accepted method of measuring transistors puts their width at 45 nm,” says Intel spokesman George Alfs. The previous dimension limit was about 65 nm, and the next processor family, due about 2010, will be built with smaller details, which will let all values mentioned here jump a bit.

High-k refers to the gate dielectric in Intel’s process technology, which has a higher dielectric constant (k). The high-k property lets the company build transistors that are smaller yet have improved leakage characteristics and lower active power. That translates into less heat and lower power consumption. Intel credits the low gate leakage to its use of hafnium-based dielectrics and metal gates, which together reduce transistor leakage by over 90%.

Another plus: “The 45-nm high-k technology delivers more than a 20% improvement in transistor switching speed,” says Alfs. That translates to 3.5 GHz on the high end. To further speed communications, the processor uses a 1,600-MHz front-side bus. As expected, compute times are shortened all around. The 65-nm, quad-core Xeon processors, for instance, provide 2.5 times the performance of previous Xeons. On a desktop, Core 2 Duo processor-based systems provide up to 40% more performance than the previous design and use less energy. Laptops gain up to twice the performance in multitasking, as well as greater energy efficiency for longer working periods. “What’s more, Intel has moved to 100% lead-free materials in the 45-nm designs and is making the move to halogen-free products to meet environmental goals,” says Alfs.

The two developments (45-nm details and hafnium) send benefits cascading through the processor. “For example, they significantly increase transistor density to about double that of the 65-nm technology. This is expected to keep Moore’s Law valid for many years,” adds Alfs. For instance, dual-core processors contain about 410 million transistors and the quad-core design some 820 million. The recent technology also allows up to 50% larger L2 data cache (second-level memory on the chip), about 6 Mbytes on dual-core processors and 12 Mbytes on quad-core chips.
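The "about double" density figure is consistent with simple geometry. If every feature shrinks in both dimensions by the ratio of the old node to the new one, the area per transistor falls with the square of that ratio. The sketch below assumes this ideal scaling, which real layouts only approximate; the function name is illustrative.

```python
def ideal_density_gain(old_nm, new_nm):
    """Density gain if every feature shrinks by old_nm/new_nm in both dimensions."""
    return (old_nm / new_nm) ** 2

gain = ideal_density_gain(65, 45)
print(f"{gain:.2f}x")   # 2.09x, in line with "about double"
```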

Other chip benefits include:

Wide dynamic execution. The term refers to a core that can execute four instructions per clock cycle. Previous designs operated on only three instructions at once.
Intelligent power. Processors can reduce their power consumption when idle through different C-state levels. These adjust the idle power for battery-life requirements according to the processor’s workload. It is efficient enough that the processor can enter even the deepest sleep modes between keystrokes, says Alfs. The low-power periods add up, translating to longer battery life for portables.

“This ‘sleep’ state is the lowest power state a processor can reach and significantly helps extend battery life. It improves Penryn’s performance substantially over the previous generation of mobile-platform chips,” adds Alfs. Upon entering deep power-down, the processor flushes cache, saves the processor state internally, and shuts off power to cores and L2 cache. The chipset continues to let memory traffic flow, but doesn’t wake the processor. When the core is needed, voltage ramps up, the clocks turn on, the processor resets and restores the microarchitecture state, and resumes executing instructions. But the deeper a C-state, the higher the energy cost of the transition to and from it.

Here’s more processor smarts. “Too many transitions to deep C-states can yield a net energy loss. To prevent this, Penryns include intelligence to determine when idle-period savings justify the energy cost of shutting down and restarting a processor. If they don’t, the power-down request is demoted to a shallower power management state,” he explains.
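The trade-off behind that demotion can be sketched as a break-even test: a deep state pays off only when the power saved over the idle period exceeds the energy spent entering and leaving it. The sketch below is an illustration of that reasoning, not Intel's actual logic; the state names, power figures, and transition energies are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CState:
    name: str
    idle_power_w: float          # power drawn while resident in the state
    transition_energy_j: float   # energy spent entering and leaving the state

def idle_energy(state, idle_s):
    """Total energy for one idle period spent in the given state."""
    return state.transition_energy_j + state.idle_power_w * idle_s

def choose_state(requested, shallower, predicted_idle_s):
    """Demote the request unless the deeper state uses less total energy."""
    if idle_energy(requested, predicted_idle_s) < idle_energy(shallower, predicted_idle_s):
        return requested
    return shallower

# Hypothetical figures chosen only to show the crossover.
SHALLOW = CState("shallow", idle_power_w=1.0, transition_energy_j=0.0001)
DEEP = CState("deep", idle_power_w=0.1, transition_energy_j=0.0050)

print(choose_state(DEEP, SHALLOW, 0.001).name)   # shallow: 1 ms is too short to pay back
print(choose_state(DEEP, SHALLOW, 0.050).name)   # deep: 50 ms idle justifies the cost
```

With these made-up numbers the break-even idle time is a little over 5 ms; shorter idle periods are better served by the shallower state.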

Another example of power management shows up in single-thread applications. “The processor is capable of boosting the performance of one core due to an increased thermal envelope after the operating system has put a second core to sleep,” says Alfs. This increases the speed at which single-threaded applications can be processed, thus improving their performance.

Faster support for operating system primitives. “This refers to the acceleration of certain functions commonly used by the operating system, such as interrupt masking control, time-stamp access, and locked semaphore access,” says Alfs.
Advanced digital media. Alfs says the cores deliver record-setting performance on industry benchmarks for desktop, mobile, and mainstream server platforms. Penryn processors also include 47 new SSE4 (Streaming SIMD [single instruction, multiple data] Extensions 4) instructions that further improve media processing and speed computing tasks. “We found they give a good boost in video encoding. The processor can take an array of data that was once processed serially and do it all at once. So instead of eight cycles for some processes, it can be done in one. The video application DivX, for example, gets a 60% boost in performance,” he says.

In addition, the SSE4 streaming-load instruction improves the bandwidth for reading data from a graphics frame buffer. By fetching a full cache line (64 bytes at a time as opposed to 8 bytes) and keeping it in a temporary buffer, the streaming-load instruction allows up to an eightfold theoretical improvement in read bandwidth.
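The eightfold figure follows directly from the fetch sizes. As a rough illustration, the sketch below counts how many fetches it would take to read one frame at 8 bytes versus 64 bytes per fetch. The frame size (1,920 × 1,080 pixels at 4 bytes each) is a hypothetical example, and the count ignores every other factor that affects real bandwidth.

```python
def loads_needed(total_bytes, bytes_per_load):
    """Number of fetches needed to read total_bytes (ceiling division)."""
    return -(-total_bytes // bytes_per_load)

FRAME_BYTES = 1920 * 1080 * 4            # hypothetical 32-bit-per-pixel frame

narrow = loads_needed(FRAME_BYTES, 8)    # 8 bytes per fetch
wide = loads_needed(FRAME_BYTES, 64)     # one full 64-byte cache line per fetch
print(narrow, wide, narrow // wide)      # 1036800 129600 8
```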

A new dividing algorithm will be of interest to engineers and designers because technical software involves a lot of division. “The Radix-16 divider roughly doubles the divider speed over previous generations for scientific computations, 3D transformations, and other mathematically intensive functions. The faster technique speeds division in floating-point and integer operations,” says Alfs. An older Radix-4 algorithm computes two bits of quotient per iteration. The Radix-16 algorithm computes four bits every iteration and cuts latency by 50%.
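The link between radix and latency can be seen in a plain digit-by-digit division that retires a fixed number of quotient bits per iteration. The sketch below is a simple software model, not the hardware algorithm Intel uses (hardware dividers select each quotient digit with dedicated logic rather than the `//` operator); the function name and the 64-bit width are assumptions for illustration.

```python
def divide_by_radix(dividend, divisor, bits_per_step, width=64):
    """Digit-recurrence division that produces bits_per_step quotient bits per iteration.

    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0 or dividend < 0 or dividend >= (1 << width):
        raise ValueError("expects 0 <= dividend < 2**width and divisor > 0")
    radix = 1 << bits_per_step
    steps = -(-width // bits_per_step)           # ceiling division
    quotient = 0
    remainder = 0
    for step in range(steps):
        # Bring down the next radix-sized digit of the dividend, most significant first.
        shift = (steps - 1 - step) * bits_per_step
        digit = (dividend >> shift) & (radix - 1)
        remainder = remainder * radix + digit
        # Stand-in for the hardware's quotient-digit selection.
        q_digit = remainder // divisor
        remainder -= q_digit * divisor
        quotient = quotient * radix + q_digit
    return quotient, remainder, steps

print(divide_by_radix(1_000_003, 97, bits_per_step=2))   # radix 4: 32 iterations
print(divide_by_radix(1_000_003, 97, bits_per_step=4))   # radix 16: 16 iterations
```

Both calls return the same quotient and remainder; the radix-16 version simply needs half as many iterations for the same operand width, which is where the 50% latency cut comes from.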

In the future
“Intel is planning processors into the next decade, in what it calls a ‘tick-tock’ cadence of silicon and microarchitecture. Each ‘tick’ represents new silicon technology with an enhanced microarchitecture. The corresponding ‘tock’ represents a brand new microarchitecture. The cycle repeats about every two years,” says Alfs. The Penryn family, with its Intel 45-nm high-k silicon technology, is the latest “tick.”

The following “tock” will be new microarchitecture code-named Nehalem. Intel says Nehalem’s scalability improves on the Penryn design through:

• Dynamically managed cores, threads, cache, interfaces, and power that use four instructions per clock cycle.
• Simultaneous multithreading that further boosts performance and energy efficiency.
• Scalable performance from one to 16 (or more) threads and from one to eight (or more) cores.

After Nehalem, processors will be based on the company’s upcoming 32-nm silicon technology.

Intel Corp. President and CEO Paul Otellini recently showed the industry’s first working test chips built using 32-nm technology, with transistors so small that more than 4 million, he says, fit on the period at the end of this sentence. Intel’s 32-nm technology is on track to begin production next year. Intel’s 32-nm test chips house over 1.9 billion transistors.

One step closer to the optical computer

Researchers at Intel have announced a silicon-based light detector that is better than those made of more expensive materials. It can detect flashes of light at a rate of 40 Gbits/sec. Most fiber-optic devices operate at 10 Gbits/sec. The new detector is said to be more efficient and produce a cleaner signal than other detectors operating at the same speed. Because silicon detectors could be manufactured by standard techniques, researchers could build detectors that are about 1% of the cost of those in today’s networks, which are made of materials such as indium-gallium arsenide.

Intel also says it has demonstrated a silicon-based laser and a silicon modulator (a device that encodes data onto light) operating at 40 Gbits/sec. The goal, says Mario Paniccia, director of Intel’s silicon-photonics lab, is to combine all three devices (laser, modulator, and detector) on one silicon chip. The chip would be relatively inexpensive, because it could be manufactured with processes familiar to the microchip industry. If used in existing fiber-optic networks, photonic chips could reduce the cost of Internet bandwidth. Built into computers, they could move and transmit data at speeds greater than the 3.5 GHz of current processors.

Software simplifies programming for multicore controllers

If software developers are to get the most out of multicore processors, they’ll have to retune their products by writing them in threads, strings of software that can execute almost independently of other threads. Most cores now simultaneously handle at least two threads each, so a four-core processor can tackle eight. Developers at National Instruments, Austin, Tex. (ni.com), say their parallel-dataflow language LabView 8.5 can help by mapping applications to multicore and FPGA architectures for data streaming, control, analysis, and signal processing. Building on the automatic multithreading of earlier versions, LabView 8.5 is said to scale user applications based on the total number of available cores. It comes with thread-safe drivers and libraries to boost throughput in RF, high-speed digital I/O, and mixed-signal test applications.
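The idea of scaling an application to the number of available cores can be sketched in a general-purpose language. The example below is not LabView and does not show how National Instruments implements it; it is a minimal Python illustration that splits a job into one chunk per logical processor and hands each chunk to a worker thread. All function names are hypothetical. Note that in standard CPython, threads do not execute Python code truly in parallel, so this shows the structure of the approach rather than a speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def hardware_threads(cores, threads_per_core=2):
    """Threads a chip can run at once, e.g. four cores x two threads each."""
    return cores * threads_per_core

def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def sum_of_squares(chunk):
    """The independent unit of work each thread performs."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(values, workers=None):
    """Scale the number of worker threads to the logical processors reported by the OS."""
    workers = workers or os.cpu_count() or 1
    chunks = split_evenly(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))

print(hardware_threads(4))                          # 8
print(parallel_sum_of_squares(list(range(1000))))   # 332833500
```

Because each chunk is processed without touching the others, the same code runs unchanged on a dual-core laptop or an eight-core workstation; only the number of chunks changes.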

The software also delivers symmetric multiprocessing (two or more cores sharing memory) in a real-time environment. It lets designers of embedded and industrial systems load-balance tasks across multiple cores. LabView 8.5 is said to let users assign portions of code to specific processor cores to fine-tune real-time systems or isolate time-critical sections of code on a dedicated core. To debug and efficiently code real-time multicore development, users have features that show timing relationships between sections of code and

Moore’s Law and your brain

This often-repeated “law” was actually just an observation by Intel cofounder Gordon Moore when he noted that the number of transistors on a processor was doubling about every 1.5 to 2 years. While many predicted the end of the line for the law, manufacturers keep finding clever ways to keep it going. More importantly, processor performance has also doubled every year.

Here’s one implication of the trend from futurist Ray Kurzweil. He estimates the human brain’s compute speed at 10^14 to 10^16 transactions/sec. If computer performance doubles every year, 10^14 transactions/sec will be available in a supercomputer by 2010, and in a $1,000 computer by 2020. But Kurzweil isn’t sure if there will be software to mimic the brain even by then.
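The arithmetic behind such projections is easy to check. Taking the sidebar's assumption that performance doubles every year, the sketch below works out how long it takes to span the gap between the low and high brain estimates, and what a 10-year lag between supercomputers and $1,000 machines implies. The function name is illustrative, and the figures are only as good as the doubling assumption.

```python
import math

def years_to_grow(factor, doubling_period_years=1.0):
    """Years needed for performance to grow by `factor` at a fixed doubling period."""
    return math.log2(factor) * doubling_period_years

# Low estimate (10^14) to high estimate (10^16) is a factor of 100.
print(round(years_to_grow(10**16 / 10**14), 1))   # 6.6 years

# A 10-year lag at yearly doubling implies roughly a thousandfold performance gap.
print(2 ** (2020 - 2010))                          # 1024
```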

Penryn processors on an Intel 45-nm hafnium-based high-k metal-gate wafer are each a bit smaller than a dime. The processors use 410 million transistors for each dual-core chip, and 820 million for quad-core designs. The original Pentium processor (circa 1993) had only 3.1 million transistors.

Intel researchers have developed a silicon-based light detector that reads optically transmitted data at 40 Gbits/sec. Light passes through a silicon waveguide (bottom) to a strip of germanium between two aluminum pads (white squares, center). Voltage applied to the pads switches the detector on and off. Current passing through a third aluminum pad (top white square) indicates how much light has struck the detector.

The bar chart shows how CFD code Fluent 6.3.26, from Ansys Inc., performed more calculations in the same period thanks to processor improvements. The left bar represents a baseline run on a computer with a dual-core Intel Xeon 5160. The processor has a 4-Mbyte cache and a 1,333-MHz front-side bus (FSB). The middle bar comes from a computer with a quad-core Xeon, 8-Mbyte cache, and the same FSB. The right bar is from a quad-core processor with 12-Mbyte cache and a 1,600-MHz FSB. All processors ran at 3 GHz.

The parallel-processing Quadro FX 5600 (above) and Quadro Plex VCS Model IV graphics cards from Nvidia Corp., Santa Clara, Calif. (nvidia.com), sport 128 1.35-GHz processors. The cards have 750 million transistors that the company says let them display larger graphics faster than ever. The cards are aimed at medical imaging, oil and gas exploration, and high-end film effects.

The mobile market through 2011. The global market for mobile computing hit $55 billion in 2005 and $63 billion in 2006, according to information firm BCC Research Group, Wellesley, Mass. At an average annual growth rate of 7%, the market should reach almost $90 billion by 2011. BCC also says smart phones (those with Internet access, e-mail, and TV) will have the highest growth rate, 15.7% over the next three years, and will reach $17.8 billion. But the largest market share belongs to notebook computers. In 2006, laptops held 84% of the global mobile computing market. By 2012, this will reach more than 96%, worth $69 billion.