Paul Dvorak
Senior Editor
It’s not hard to predict that next-generation
processors will have smaller details, more
transistors on each square millimeter of
silicon, and do more clever things, such as
powering down under-used sections even
between keystrokes. They will also have
many cores or CPUs. The latest chips, for
example, have up to four cores and by 2010
that figure could rise to eight and more in
engineering desktop computers.
Even now, Intel has given hardware and
software developers a peek at a prototype
80-core processor with one teraflop of computational
ability. The company, however,
has set no launch date. Nvidia Corp. sells
a Quadro Plex graphic processor sporting
128 cores. And Cisco Systems Inc., sensing
a coming surge in network traffic, is toying
with a 188-core router. But to see what tomorrow’s
processors hold for engineers, take
a closer look at today’s multicore designs.
Start with microarchitecture
The term refers to the design of a chip. It determines
how real estate is apportioned among different
functions and how the sections interact.
For instance, Intel’s quad-core has 12 Mbytes
of onboard cache. Expect caches to get larger
with future rollouts because they improve
performance and efficiency by increasing
the probability that each execution core can
quickly access needed data.
Intel’s recent dual and quad-core processors,
called Penryn, provide greater
efficiency with 6 and 12 Mbytes of cache, respectively,
organized as 24-way set-associative
storage. The codename Penryn refers
to a line of processors for desktops, laptop
computers, and high-end servers. The line
includes dual-core chips for general computing
tasks, a dual-core mobile version,
quad-core chips for engineering desktops,
and a quad-core version for servers.
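To make the 24-way figure concrete, the short sketch below works out how many sets such a cache would contain. It assumes a 64-byte cache line (the line size this article quotes for streaming loads), binary megabytes, and a single cache of the stated size. The function name is illustrative and not an Intel term.

```python
def cache_sets(capacity_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Return the number of sets in a set-associative cache.

    Each set holds `ways` lines, so sets = capacity / (ways * line size).
    The 64-byte default line size is an assumption, not a quoted L2 spec.
    """
    set_bytes = ways * line_bytes
    if capacity_bytes % set_bytes:
        raise ValueError("capacity is not a whole number of sets")
    return capacity_bytes // set_bytes


MBYTE = 1024 * 1024
dual_core_sets = cache_sets(6 * MBYTE, ways=24)   # 6-Mbyte cache
quad_core_sets = cache_sets(12 * MBYTE, ways=24)  # 12 Mbytes treated as one cache
print(dual_core_sets, quad_core_sets)
```

Under these assumptions, a given line of memory maps to exactly one set but may occupy any of the 24 slots within it, which is what lowers the odds of useful data being evicted.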
The great leap forward for Penryn, according to Intel, is the 45-nm, high-k technology.
The 45 nm refers to the approximate
width of the transistors. “Some are
smaller and some larger, but the industry-accepted
method of measuring transistors
puts their width at 45 nm,” says Intel spokesman
George Alfs. The previous dimension
limit was about 65 nm, and the next processor
family, due about 2010, will be built with
smaller details, which will let all values mentioned
here jump a bit.
And high-k refers to Intel’s process technology,
which uses a gate insulator with a higher dielectric constant
(k). The high-k property lets the company
build transistors that are smaller yet have
improved leakage characteristics and lower
active power. That translates into less heat
and lower power consumption. Intel credits
the low gate leakage to its use of hafnium
and metal gates, a combination that reduces transistor leakage
by over 90%.
Another plus: “The 45-nm high-k technology delivers more than a 20%
improvement in transistor switching
speed,” says Alfs. That translates
to 3.5 GHz on the high end.
To further speed communications,
the processor uses a 1,600-MHz
front-side bus. As expected,
compute times are shortened all
around. The 65-nm, quad-core
Xeon processors, for instance, provide
2.5 times the performance
of previous Xeons. On a desktop,
Core 2 Duo processor-based systems
provide up to 40% more performance
than previous designs and
use less energy. Laptops gain up
to twice the performance in multitasking,
as well as greater energy efficiency for longer working periods.
“What’s more, Intel has moved
to 100% lead-free materials in the
45-nm designs and is making the
move to halogen-free products to
meet environmental goals,” says
Alfs.
The two developments (45-nm
details and hafnium) send benefits
cascading through the processor.
“For example, they significantly increase
transistor density to about
double that of the 65-nm technology.
This is expected to keep
Moore’s Law valid for many years,”
adds Alfs. For instance, about
410 million transistors make up
dual-core processors and some
820 million make up the quad-core
design. The recent technology also
allows up to 50% larger L2 data
cache (secondary memory that sits on the chip but outside the cores), about 6 Mbytes
on dual-core processors and
12 Mbytes on quad-core chips.
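The "about double" density figure is close to what simple geometry predicts. The sketch below assumes ideal scaling, in which every feature shrinks in proportion in both dimensions. Real processes only approximate this, and the function name is illustrative.

```python
def ideal_density_gain(old_nm: float, new_nm: float) -> float:
    """Ideal transistor-density gain when features shrink in both dimensions.

    Area per transistor scales with the square of the feature size,
    so density scales with (old / new) squared.
    """
    return (old_nm / new_nm) ** 2


gain = ideal_density_gain(65, 45)
print(f"{gain:.2f}x")  # roughly 2.09x, in line with "about double"
```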
Other chip benefits include:
- Wide dynamic execution. The
term refers to a core that can execute
four instructions per clock
cycle. Previous designs operated
on only three instructions at once.
- Intelligent power. Processors can
reduce their power consumption
when idle through different C-state
levels. These adjust idle power to suit battery-life requirements according
to the processor’s workload.
“It is efficient enough that
the processor can enter even the
deepest sleep modes between keystrokes,”
says Alfs. The low-power
periods add up, translating to longer
battery life for portables.
“This ‘sleep’ state is the lowest
power state a processor can
reach and significantly helps extend
battery life. It improves Penryn’s performance substantially
over the previous generation of
mobile-platform chips,” adds Alfs.
Upon entering deep power-down,
the processor flushes cache, saves
the processor state internally, and
shuts off power to cores and L2
cache. The chipset continues to let
memory traffic flow, but doesn’t
wake the processor. When the core
is needed, voltage ramps up, the
clocks turn on, the processor resets and restores the microarchitecture
state, and resumes executing
instructions. But the deeper a
C-state, the higher the energy cost
of the transition to and from it.
Here are more processor smarts.
“Too many transitions to deep
C-states can yield a net energy loss.
To prevent this, Penryns include
intelligence to determine when
idle-period savings justify the energy
cost of shutting down and restarting
a processor. If they don’t, the power-down request is demoted
to a shallower power management
state,” he explains.
Another example of power management
shows up in single-thread
applications. “The processor is capable
of boosting the performance
of one core due to an increased
thermal envelope after the operating
system has put a second core
to sleep,” says Alfs. This lets
single-threaded
applications run faster.
- Faster support for operating system
primitives. “This refers to the
acceleration of certain functions
commonly used by the operating
system, such as interrupt masking
control, time-stamp access, and
locked semaphore access,” says
Alfs.
- Advanced digital media. Alfs says
the cores deliver record-setting
performance on industry benchmarks
for desktop, mobile, and
mainstream server platforms.
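The demotion logic described under "Intelligent power" above can be sketched as a simple energy comparison. This is a minimal illustration, not Intel's algorithm: the state names, power figures, and transition energies are hypothetical, and the sketch assumes the idle period is known in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    idle_power_w: float         # power drawn while resting in this state
    transition_energy_j: float  # energy spent entering and leaving the state


# Ordered shallow to deep. All numbers are made up for illustration.
STATES = [
    IdleState("shallow", idle_power_w=1.0, transition_energy_j=0.0001),
    IdleState("middle", idle_power_w=0.4, transition_energy_j=0.001),
    IdleState("deep", idle_power_w=0.05, transition_energy_j=0.01),
]


def idle_energy_j(state: IdleState, idle_s: float) -> float:
    """Total energy for one idle period spent in `state`."""
    return state.transition_energy_j + state.idle_power_w * idle_s


def resolve_request(states, requested: int, predicted_idle_s: float) -> IdleState:
    """Grant the requested state or demote to a shallower one.

    Only states no deeper than the request are considered; the cheapest
    one for the predicted idle period wins.
    """
    candidates = states[: requested + 1]
    return min(candidates, key=lambda s: idle_energy_j(s, predicted_idle_s))


for idle in (0.001, 0.010, 0.100):
    chosen = resolve_request(STATES, requested=2, predicted_idle_s=idle)
    print(f"{idle * 1000:.0f} ms idle -> {chosen.name}")
```

With these made-up numbers, a request for the deepest state is honored only for the longest idle period. Shorter periods are demoted because the transition energy would outweigh the savings.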
Penryn processors also include
47 new SSE4 (Streaming SIMD
Extensions 4) instructions that
further speed media and
computing tasks. SIMD stands for
single instruction, multiple data. “We found they
give a good boost in video encoding. The processor can take an array
of data that was once processed
serially and do it all at once. So
instead of eight cycles for some
processes, it can be done in one.
The video application DivX, for example,
gets a 60% boost in performance,”
he says.
In addition, the SSE4 streaming-
load instruction improves the
bandwidth for reading data from a
graphics frame buffer. By fetching
a full cache line (64 bytes at a time
as opposed to 8 bytes) and keeping
it in a temporary buffer, the
streaming-load instruction allows
up to an eightfold theoretical improvement
in read bandwidth.
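The eightfold figure follows directly from the two transfer sizes. The sketch below counts the loads needed to read a frame buffer at each size. The buffer dimensions are a hypothetical example, and the count ignores latency and bus effects, which is why the gain is only theoretical.

```python
def loads_needed(buffer_bytes: int, bytes_per_load: int) -> int:
    """Number of load operations needed to read a buffer (rounded up)."""
    return -(-buffer_bytes // bytes_per_load)


# Hypothetical 1,024 x 768 frame buffer with 4 bytes per pixel.
frame_bytes = 1024 * 768 * 4

narrow_loads = loads_needed(frame_bytes, 8)   # 8 bytes per load
wide_loads = loads_needed(frame_bytes, 64)    # full 64-byte cache line per load
print(narrow_loads, wide_loads, narrow_loads / wide_loads)
```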
A new dividing algorithm will
be of interest to engineers and designers
because technical software involves a lot of division. “The Radix-
16 divider roughly doubles the
divider speed over previous generations
for scientific computations,
3D transformations, and other
mathematically intensive functions.
The faster technique speeds
division in floating-point and integer
operations,” says Alfs. The older
Radix-4 algorithm computes two
bits of quotient per iteration. The
Radix-16 algorithm computes four
bits every iteration and cuts latency
by 50%.
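The link between radix and iteration count can be seen in a small software model of digit-by-digit division. This is a sketch of the general idea only, not Intel's divider circuit: real hardware picks each quotient digit with dedicated selection logic, whereas the code below leans on Python's built-in integer division for that step. The function name and the 64-bit width are assumptions.

```python
def digit_divide(dividend: int, divisor: int, radix: int, width: int = 64):
    """Unsigned long division that retires log2(radix) quotient bits per step.

    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0 or not 0 <= dividend < (1 << width):
        raise ValueError("operands out of range")
    bits = radix.bit_length() - 1  # 2 for radix 4, 4 for radix 16
    if radix < 2 or (1 << bits) != radix or width % bits:
        raise ValueError("radix must be a power of two that divides the width")

    quotient = remainder = 0
    iterations = width // bits
    for i in range(iterations - 1, -1, -1):
        # Bring down the next group of dividend bits.
        group = (dividend >> (i * bits)) & (radix - 1)
        remainder = (remainder << bits) | group
        # Software stand-in for the hardware quotient-digit selection.
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << bits) | digit
    return quotient, remainder, iterations


q4, r4, steps4 = digit_divide(1_000_003, 97, radix=4)
q16, r16, steps16 = digit_divide(1_000_003, 97, radix=16)
print(steps4, steps16)
```

Both calls return the same quotient and remainder, but the radix-16 version needs half as many iterations for the same operand width, which matches the 50% latency cut quoted above.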
In the future
“Intel is planning processors
into the next decade, in what it
calls a ‘tick-tock’ cadence of silicon
and microarchitecture. Each
‘tick’ represents new silicon technology
with an enhanced microarchitecture.
The corresponding
‘tock’ represents a brand new microarchitecture.
The cycle repeats
about every two years,” says Alfs.
The Penryn family, with its Intel
45-nm high-k silicon technology,
is the latest “tick”.
The following “tock” will be new
microarchitecture code-named
Nehalem. Intel says Nehalem’s
scalability improves on the Penryn
design through:
- Dynamically managed cores,
threads, cache, interfaces, and
power, built around cores that execute four instructions
per clock cycle.
- Simultaneous multithreading
that further boosts performance and
energy efficiency.
- Scalable performance from one
to 16 (or more) threads and from
one to eight (or more) cores.
After Nehalem, processors will
be based on the company’s upcoming
32-nm silicon technology.
Intel Corp. President and CEO
Paul Otellini recently showed the
industry’s first working test chips
built using 32-nm technology, with
transistors so small that more than
4 million, he says, fit on the period
at the end of this sentence. Intel’s
32-nm technology is on track to
begin production next year. Intel’s
32-nm test chips house over
1.9 billion transistors.
One step closer to the optical computer
Researchers at Intel have announced a silicon-based light detector that
is better than those made of more expensive materials. It can detect
flashes of light at a rate of 40 Gbits/sec. Most fiber-optic devices operate
at 10 Gbits/sec. The new detector is said to be more efficient and
produce a cleaner signal than other detectors operating at the same
speed. Because silicon detectors could be manufactured by standard
techniques, researchers could build detectors that cost about 1% as much
as those in today’s networks, which are made of materials such as
indium-gallium arsenide.
Intel also says it has demonstrated a silicon-based laser and a silicon
modulator (a device that encodes data onto light) operating at 40
Gbits/sec. The goal, says director of Intel’s silicon-photonics lab Mario
Paniccia, is to combine all three devices (laser, modulator, and detector)
on one silicon chip. The chip would be relatively inexpensive, because
it could be manufactured with processes familiar to the microchip
industry. If used in existing fiber-optic networks, photonic chips
could reduce the cost of Internet bandwidth. Built into computers,
they could move and transmit data at speeds greater than the 3.5 GHz
of current processors.
Software simplifies programming for
multicore controllers
If software developers are to get the most out of multicore processors,
they’ll have to retune their products by writing them as threads, strings of
software that can execute almost independently of other threads. Most
cores now simultaneously handle at least two threads each, so a four-core
processor can tackle eight. Developers at National Instruments,
Austin, Tex. (ni.com), say their parallel-dataflow language LabView
8.5 can help by mapping applications to multicore and FPGA architectures
for data streaming, control, analysis, and signal processing.
Building on the automatic multithreading of earlier versions, LabView
8.5 is said to scale user applications based on the total number
of available cores. It comes with thread-safe drivers and libraries to
boost throughput in RF, high-speed digital I/O, and mixed-signal test
applications.
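The idea of scaling work to the number of available cores can be sketched in a text language as well. The example below is not LabView code (LabView is graphical) and the function names are hypothetical. It sizes a worker pool from the core count the operating system reports and hands each worker an independent chunk of data. It uses threads for simplicity, and in standard Python, CPU-bound work may need a process pool instead to run truly in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split `data` into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Independent unit of work: needs nothing from the other chunks."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=None):
    """Scale the worker count to the cores reported by the operating system."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, split(data, workers)))


samples = list(range(1000))
total = parallel_sum_of_squares(samples)
print(total)
```

Because each chunk is processed without reference to the others, the same code runs unchanged on two, four, or eight cores. Only the number of chunks changes.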
The software also delivers symmetric multiprocessing (two or more
cores sharing memory) in a real-time environment. It lets designers
of embedded and industrial systems load-balance tasks across multiple
cores. LabView 8.5 is said to let users assign portions of code to
specific processor cores to fine-tune real-time systems or isolate time-critical
sections of code on a dedicated core. To debug real-time multicore
applications and code them efficiently, users have features that show
timing relationships between sections of code and