Hot Chips 24: Don’t you wish your ASICs were hot like mine?

 

The 2012 edition of the Hot Chips conference will take place in Silicon Valley on August 27-29. That’s a long way off but the time for submitting a presentation proposal is now. If you are working on a really hot design that will be available by August of this year or shortly thereafter, then you should be telling the world about it, starting with the hundreds and hundreds of technophiles, press, and analysts that attend the Hot Chips conference every year. We always pack a few surprises into each conference. Perhaps one of those surprises should be yours.

 

If you can’t wait to tell the world about the next new thing, then we can help. Deadline for presentation proposals is April 9, which will be here almost tomorrow. Here are the topic categories that the Hot Chips Program Committee is particularly interested in:

 

General Purpose Processor Chips

  • High-Performance and Low-Power
  • Multi-Core and Highly-Reliable Systems

Mobile and Embedded Devices

  • Graphics/Multimedia/Game
  • SoC, Security, and DSP chips

Communications and Networking

  • Wireless LAN/WAN/PAN
  • Network and I/O Processors

Other Chips

  • FPGAs and FPGA-Based Systems
  • Memory Technologies and Chipsets

Software for multi-Core and Heterogeneous Systems

  • Programming models, Runtime Systems
  • Compilers and Operating Systems
  • Performance and Power Debug and Evaluations

Other Technologies

  • Power and Thermal Management
  • Packaging and Testing
  • Display Technologies
  • On-Chip Optics & Sensors
  • Novel Computing Technologies

 

Presentations at Hot Chips are 30-minute talks. Presentation slides will be published in the Hot Chips Proceedings. Presenters need not submit written papers, but can submit a paper in addition to the presentation if desired. A select group will be invited to submit a paper for inclusion in a special issue of IEEE Micro.

 

A limited number of Student Posters describing applied research performed at a university will be accepted for presentation at the conference. Last year was the first year that the Hot Chips Conference experimented with Student Posters and the experiment was declared a huge success by attendees, students, and the Program Committee. So we’re doing it again this year.

 

Student poster submissions consist of 4 slides and a one-page summary and must be submitted by July 1, 2012. One outstanding poster will receive a Best Poster Award.

 

Submissions must specify “Presentation” or “Student Poster” and should consist of a title, extended abstract (maximum of two pages), and the presenter’s contact info (name, affiliation, job title, address, phone(s), fax, and email). Please indicate whether you have submitted, intend to submit, or have already presented or published a similar or overlapping submission to another conference or journal.

 

Also indicate if you would like the submission to be held confidential or if we can publicize it so everyone knows you have secured a presentation slot at Hot Chips. (Good PR!) Submissions marked confidential remain confidential until the first day of the conference. After you present in front of hundreds of people including press and analysts, it’s no longer confidential.

 

The Program Committee selects submissions on the basis of performance of the device (Just how hot is this chip?), the degree of design or device innovation, use of advanced technology, potential market significance, and anticipated interest to the Hot Chips audience. Research and software contributions are evaluated with similar criteria. The Committee also wants to know the product’s status: design, development, tape out, first silicon, production, shipping, etc.

 

Submit extended abstracts as either a plain text file or a PDF (with images and figures if available). Again, there’s a two-page limit for submissions, and the text must be set in 10-point type or larger. (Yeah, we’re on to that trick.)

 

Submit proposals at:  https://www.softconf.com/c/hotchips24/

 

Proposal submissions will get a thumbs up or thumbs down decision from the Hot Chips 2012 Program Committee by May 1, 2012. Send questions relating to the program to the program chairs at program2012@hotchips.org and questions relating to conference operation or organization to the general chair, Larry Lewis, at: info2012@hotchips.org.

 

Hot Chips 2012 is sponsored by the Technical Committee on Microprocessors and Microcomputers of the IEEE Computer Society and the Solid State Circuits Society.

 

Hot Chips 2011 presentations and videos are now open to the public at: www.hotchips.org/archives/hot-chips-23

 

Check the HOT CHIPS 24 web page for updates: www.hotchips.org

 

Early registration for HC 23 is extended through August 9, 2011 www.hotchips.org

Early registration for HC 23 is extended through August 9, 2011, 5:00 PM PDT. Online registration is available through August 12, 2011, and this is the last day for refunds. Onsite registration is available August 17, 18, and 19, 2011. Register at www.hotchips.org

 

Hot Chips 23 (August 17-19) to feature six student posters this year

The Hot Chips 23 Program Committee has voted to accept six student posters for this year’s conference. This will be the first year for student posters. The six student poster topics are:

 

55. A Few Ways Can Take You a Long Way

 

59. VENICE: A Compact Vector Processor for FPGA Applications

 

61. The Utility of Fast Active Messages on Many-Core Chips

 

62. Efficient Fetch Mechanism by Employing Instruction Register

 

63. Tessellation Operating System: Building a real-time, responsive, high-throughput client OS for many-core architectures

 

64. Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

 

You still have time to register at www.hotchips.org

 

Hot Chips 23 Abstracts

Here are some of the edited abstracts for the presentations to be made this August at Hot Chips 23:

 

S2.1: Universal Programmatic Access to High Quality Random Numbers

 

Entropy is valuable in a variety of system applications, cryptography being the first that comes to mind. Historically, computing platforms have lacked a high-quality, high-performance hardware “entropy source”. Intel is in the process of resolving this problem by deploying a solution consisting of two parts:

 

  • A RdRand instruction making entropy directly available to ALL SW on Intel platforms in any privilege level/mode and
  • A HW Digital Random Number Generator (DRNG) producing NIST SP 800-90 compliant and FIPS 140-2 certifiable entropy that supplies random numbers to the RdRand instruction.

 

 

S2.2: Tilera, TILE-Gx ManyCore Processor: Hardware Acceleration Interfaces and Mechanisms

 

Tile Processors are a class of general-purpose, power-efficient manycore processors from Tilera that address embedded and cloud computing applications. Tile processors employ switched, on-chip mesh interconnects and coherent caches. Tile processor SoCs use their mesh networks to connect the processors to on-chip Ethernet and PCIe ports, memory controllers, and hardware accelerators. Cache coherence across the cores, the memory, I/O, and accelerators allows for standard shared-memory programming.

 

This talk will provide an overview of the key architectural features of Tilera’s next generation 64-bit TILE-Gx processors, and focus on the architecture of the accelerator engines and how they interface to the rest of the manycore chip.

 

As background, Tilera launched its first generation 64-core TILE64 processor in 2007 and its second generation of processors, the 64-core TILEPro64 and 36-core TILEPro36, in 2008. TILE-Gx processors belong to Tilera’s third generation family with processors ranging from 16-cores to 100-cores on one chip.

 

The TILE-Gx processors are implemented in 40nm technology and clock at speeds up to 1.5GHz with a nominal clock frequency of 1.2GHz. At 1.5GHz, the TILE-Gx100 can deliver 450 billion 64-bit operations per second. It supports subword arithmetic and can achieve 1.2 Tera-Ops/sec for 16-bit operations. As in previous Tile processors, TILE-Gx combines DSP and networking functions in a general-purpose processor.
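As a quick sanity check on those throughput figures, the sketch below divides the quoted chip-level rates by the core count and peak clock. The per-core, per-cycle numbers it produces are an inference from the abstract's totals, not figures the abstract itself states, and the function name is purely illustrative.

```python
# Back-of-envelope check of the TILE-Gx100 throughput figures quoted above.
CORES = 100        # TILE-Gx100 core count (from the abstract)
CLOCK_HZ = 1.5e9   # peak clock quoted in the abstract

def ops_per_core_per_cycle(total_ops_per_sec, cores=CORES, clock_hz=CLOCK_HZ):
    """Operations each core must complete per clock to reach the chip total."""
    return total_ops_per_sec / (cores * clock_hz)

ops64 = ops_per_core_per_cycle(450e9)    # 64-bit operations
ops16 = ops_per_core_per_cycle(1.2e12)   # 16-bit subword operations

print(f"64-bit ops per core per cycle: {ops64:.1f}")
print(f"16-bit ops per core per cycle: {ops16:.1f}")
```

Under these assumptions the totals work out to three 64-bit operations and eight 16-bit operations per core per clock.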

 

For DSP functions, the device delivers up to 600 GMACs/second. The Gx100 includes 32MB of distributed, on-chip coherent cache and per-tile TLBs for instruction and data. Each core contains a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB L2 cache.

 

Each TILE-Gx processor makes its 256KB L2 cache available to all other processor cores on the device, forming a coherent and distributed chip-level L3 cache. Thus, if a CPU within a core has a cache miss in its L2 cache, it can then query the 25.6MB L3 cache formed by the union of all the other L2 caches over the on-chip mesh interconnect. The large distributed L3 cache serves to minimize accesses to off-chip DRAM while offering low power on-chip data access. Multiple “islands” of configurable distributed caches support virtualization, allowing multiple coherence domains and independent operating systems to run on subsets of Tile processors.
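The cache totals quoted for the 100-core part can be reproduced from the per-core sizes. This is only a rough check: it assumes decimal units (1MB = 1000KB), which is what makes the rounded figures come out, and it counts all 100 L2 caches even though a given core would strictly see the other 99 as its L3.

```python
# Reproduce the quoted TILE-Gx100 cache totals from the per-core sizes.
CORES = 100
L1I_KB, L1D_KB, L2_KB = 32, 32, 256   # per-core caches from the abstract

def total_mb(per_core_kb, cores=CORES):
    """Aggregate capacity in MB, assuming 1MB = 1000KB (decimal units)."""
    return per_core_kb * cores / 1000

l3_mb = total_mb(L2_KB)                             # distributed L3
all_cache_mb = total_mb(L1I_KB + L1D_KB + L2_KB)    # every on-chip cache

print(f"distributed L3: {l3_mb} MB")
print(f"all on-chip cache: {all_cache_mb} MB")
```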

 

The cores run SMP Linux and other standard operating systems and support virtual memory through instruction and data TLBs. Because they are cache coherent and use full-featured, general-purpose cores, TILE-Gx manycores run off-the-shelf applications written in ANSI C/C++, and support pthreads, OpenMP, TBB, MPI and other standard parallel programming approaches.

 

TILE-Gx processors include on-chip DDR3 memory controllers as well as high-speed I/O interfaces.  For example, the TILE-Gx100 implements four DDR3 memory controllers that provide an aggregate of ½ Tbps of peak DDR3 memory bandwidth. The chips expose 40 bits of physical addresses so that up to 1 TByte external DRAM can be provisioned on a single processor.  More than 190 Gbps of I/O bandwidth is delivered through three PCIe ports and a high speed Ethernet subsystem offering multiple XAUI, SGMII and Interlaken interfaces.
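Two of those figures follow from simple arithmetic, sketched here. The even split of bandwidth across the four controllers is an assumption made for illustration; the abstract only gives the aggregate.

```python
# Address-space and memory-bandwidth arithmetic for the TILE-Gx100 figures.
PHYS_ADDR_BITS = 40       # physical address width from the abstract
DDR_CONTROLLERS = 4       # DDR3 controllers on the TILE-Gx100
AGGREGATE_GBPS = 500      # "1/2 Tbps" of peak DDR3 bandwidth

addressable_bytes = 2 ** PHYS_ADDR_BITS
addressable_tib = addressable_bytes / 2 ** 40   # terabytes in binary units

# Assumes the aggregate is divided evenly across controllers.
per_controller_gbps = AGGREGATE_GBPS / DDR_CONTROLLERS
per_controller_gbytes_per_sec = per_controller_gbps / 8

print(f"addressable DRAM: {addressable_tib} TiB")
print(f"per controller: {per_controller_gbps} Gbps "
      f"({per_controller_gbytes_per_sec} GB/s)")
```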

 

 

 

S2.3: Cavium Nitrox III: Building a 40Gbps Next Generation Virtualized Security Processor

 

With the market rapidly moving to the encryption of all web traffic, new devices that can handle the huge volumes of encrypted data are in high demand. Many online email accounts are already fully encrypted and a beta of encrypted search has already been deployed. Previously, encrypting online credit card transactions was one of the main drivers for encrypting traffic. But now, with hacks that can grab social networking sessions from the unwitting Wi-Fi user at the local coffee shop, the move to encrypt all Internet traffic is afoot. Today, suppliers of Internet appliances report that 30% of web traffic is encrypted, and that share is expected to grow to well over 80% in the next few years.

 

In addition to the tremendous increase in volume, the strength of encryption required is also increasing. As of Jan 2011, the National Institute of Standards and Technology recommends that traffic be encrypted with 2048-bit rather than 1024-bit RSA keys. This 2X increase in key size requires almost an order of magnitude in additional compute power.
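The “almost an order of magnitude” figure is consistent with a textbook cost model, sketched below. The model is an assumption, not something the abstract states: with schoolbook multiplication, a modular exponentiation on an n-bit modulus with an n-bit exponent costs on the order of n³ bit operations, so doubling the key size costs roughly 2³ = 8 times as much. The timing helper uses random odd moduli rather than real RSA keys, and real implementations (CRT, faster multiplication) will not match the model exactly.

```python
import random
import time

def relative_modexp_cost(new_bits, old_bits, exponent=3):
    """Cost ratio under an assumed O(n**exponent) model for an n-bit modexp."""
    return (new_bits / old_bits) ** exponent

def time_modexp(bits, trials=20, seed=0):
    """Average seconds for one full-size modular exponentiation.

    Uses a random odd modulus of the requested size, not a genuine RSA key.
    """
    rng = random.Random(seed)
    top = 1 << (bits - 1)                     # force the full bit length
    modulus = rng.getrandbits(bits) | top | 1
    base = rng.getrandbits(bits) % modulus
    exponent = rng.getrandbits(bits) | top
    start = time.perf_counter()
    for _ in range(trials):
        pow(base, exponent, modulus)
    return (time.perf_counter() - start) / trials

ratio = relative_modexp_cost(2048, 1024)
print(f"modelled cost ratio: {ratio:.1f}x")
print(f"measured ratio on this machine: "
      f"{time_modexp(2048) / time_modexp(1024):.1f}x")
```

The measured ratio depends on the machine and the big-integer library, so treat it as an illustration rather than a benchmark.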

 

All of these things drive the need for a new virtualized security processor.

 

This talk will describe how Cavium designed the NITROX III: a 64 core, 40Gbps, 200K+ RSA Ops/sec device. NITROX III provides a 16-fold increase in throughput compared with current state-of-the-art processors and achieves greater than 200K RSA operations per second with 1024-bit keys. It will cover the reuse of existing cores and some of the methods employed to reduce multiple passes of silicon, as well as how traditional thinking had to change to support a hardware virtualized environment.

 

 

 

S6.1: Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms

 

With clock speed scaling stalled due to power density restrictions and centralized computing services highlighting growing energy costs, how should algorithms and software change to reflect this changing architectural landscape? Most undergraduates in computer science and applied mathematics are trained to design and evaluate algorithms based on counting the number of arithmetic or logical operations but the dominant cost in computer systems is actually memory operations. Today an off-chip DRAM access costs at least an order of magnitude more than an arithmetic operation in both energy and time. This gap will grow even larger in future systems.

 

Communication-Avoiding (CA) algorithms minimize data movement between memory hierarchy levels and between processors in a parallel setting. While there is extensive literature describing techniques such as tiling that minimize cache misses through loop reordering and other semantics-preserving transformations, our approach is to rethink algorithms from a higher level, sometimes using completely different algorithms with different numerical properties to reduce high-latency communication. In many cases our new algorithms move much less data, making them much faster than conventional algorithms by large factors. We have established new lower bounds and have shown that our new algorithms move as little data as possible. These lower bounds provide performance scalability limits for a large class of algorithms.

 

 

 

S6.2: Electrons, Photons, Phonons, Waves, Bits, and Industrial Design: Microsoft Kinect

 

When the project began, Kinect depended on many technologies that had yet to be invented. At a high level of abstraction, the Kinect hardware is a set of sensors that provide input to a processor. However, developing the individual sensors and integrating them into one package that delivers state-of-the-art depth, video, and audio sensing created challenges beyond those of a typical consumer electronic device.

 

The presentation will describe the system design and the validation, manufacturing, and test challenges. A key focus will be on the components and subsystems that a few years ago were beyond state-of-the-art. Kinect required producing them economically in quantities of millions. An in-depth look at what is required to make it all work may prove surprising.

Evolving application requirements and limited developer and customer experience with technologies like Kinect added to the challenge. The design has proven to be exceptional for its primary use of game play, and developers and customers have exploited Kinect for applications beyond the imaginations of its creators.

 

 

 

S8.2: One Billion Packet per Second Frame Processing Pipeline

 

In recent years, the demands on Ethernet switch silicon have increased dramatically on two distinct fronts. In terms of performance, applications demand ever-higher port densities, integrated 40G Ethernet, efficient multicast, and low-latency switching. Meanwhile, ongoing convergence and virtualization trends demand support for a broad set of new networking protocols. This combination of unprecedented performance and functional complexity poses significant challenges for the frame processing circuits in today’s Ethernet switches.

 

This paper describes Fulcrum Microsystems’ solution to this challenge, the new FlexPipe™ frame processing pipeline. Using Fulcrum’s asynchronous design technology and a novel TCAM-based micro-architecture, FlexPipe offers a microcode-configurable pipeline that can coherently process over 1 billion frames per second in 65nm process technology. It supports a wide range of networking protocols, such as IP, MPLS, TRILL, GRE, MAC-in-MAC, NAT, VEPA, VNTag as well as sophisticated load-balancing techniques. With microcode changes alone, the pipeline can adapt to support any number of other emerging protocols and proprietary features, all at full frame rate with extremely low forwarding latency.
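To put one billion frames per second in context, the sketch below works out the worst-case frame rate of a single 10G Ethernet port carrying minimum-size frames, using the standard 64-byte minimum frame plus 8 bytes of preamble and a 12-byte inter-frame gap. The resulting port count is only an illustration of scale; the abstract does not state how many ports the device has.

```python
# Worst-case (minimum-size frame) Ethernet frame rate at line rate.
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # minimum inter-frame gap
MIN_FRAME_BYTES = 64        # minimum Ethernet frame size

def max_frames_per_second(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Frames per second a link can carry at line rate for a given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits

fps_10g = max_frames_per_second(10e9)   # one 10G Ethernet port
ports_10g = 1e9 / fps_10g               # 10G ports' worth of 1 billion fps

print(f"10GbE, 64-byte frames: {fps_10g / 1e6:.2f} million frames/s")
print(f"1 billion frames/s is about {ports_10g:.1f} such ports")
```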

 

 

 

S8.3: Sereno: A 2nd-Generation Virtualized Network Interface Controller

 

The Sereno ASIC is Cisco’s second generation Unified Computing System Virtual Interface Card (VIC) controller. Sereno offers server IO virtualization, networking and management features to deliver highly flexible connectivity to operating systems, hypervisor and virtual machines.

 

Sereno is capable of presenting a large number of virtual devices to the host; they appear as physical PCIe devices and are enumerated by a standard BIOS. Each virtual device may be configured as either an Ethernet interface (eNIC) or a Fibre Channel interface (fNIC). Each device is associated with unique policies, including class of service, min/max bandwidth allowed, VLAN assignment, and forwarding rules. Sereno does not require host PCIe SR-IOV support for IO virtualization, but the ASIC does support the creation of virtual SR-IOV devices.

 

The Sereno ASIC is fabricated using the Texas Instruments 65 nm process. The silicon includes 17 million gates and 37 Mbits of SRAM packaged on a 784-ball organic FC BGA.

 

 

 

S9.1: Movidius: 1TOPS/W Software Programmable Media Processor

 

The rationale and architecture behind a new software programmable multimedia coprocessor for mobile devices is outlined. The proposed architecture differs significantly from the current model of a processor such as an ARM with hardware accelerators. The disadvantage with this model is that the cost and time required to design product derivatives are very high. A more attractive model is to build a software-programmable architecture, allowing functions which are traditionally implemented in fixed-function hardware to be implemented competitively both in terms of cost and power in software.

 

The focus of the proposed architecture is on power-efficient operation and area efficiency, allowing product derivatives to be implemented entirely in software. To guarantee sustained high performance and minimize power, the Movidius proprietary SHAVE processor contains wide and deep register-files coupled with a Variable-Length Long Instruction-Word (VLLIW) controlling multiple functional units including extensive SIMD capability for high parallelism and throughput at a functional unit and processor level.

 

On a System-on-Chip (SoC) level, eight SHAVE processors are combined with a software controlled memory subsystem and caches that can be configured to handle a large range of workloads with exceptionally high, sustainable on-chip bandwidth to support data and instruction supply to the 8 processors.

 

A bank of software-controlled DMA engines moves data between the processors, memory, and a large range of peripherals including cameras and LCD panels, mass-storage and communications devices. Additional programmable hardware acceleration is provided for hard-to-parallelize functions such as H.264 CABAC and SVC, as well as 3D graphics. New architectural features such as hardware support for random-accessible sparse data-structures are implemented for the first time, improving memory utilization and bandwidth efficiency. The resulting sustained single-precision IEEE 754 rate is 50GFLOPS/W.

 

A tree of software-controlled multiplexers allows the large number of onboard peripherals to be routed to the device pins maximizing flexibility to support a wide range of use-cases in a low-pin-count package.  The device supports 8-, 16-, 32-, and some 64-bit integer operations as well as fp16 (OpenEXR) and fp32 arithmetic and can execute an aggregated 1 TOPS/W maximum 8-bit equivalent operations while fitting in a low-cost plastic BGA package with integrated 16MB or 64MB Mobile DDR2 SDRAM.

 

Because power efficiency is paramount, the device contains a total of 11 power islands. Eight islands are dedicated to the integrated SHAVE processors (one island per processor), allowing very fine-grained power control in software.

 

 

 

S9.3: Intel® Quick Sync Video Technology in the Second Generation Intel® Core™ Processor Family

 

The Second Generation Intel® Core™ processor, code name Sandy Bridge, is a winner of the Best of CES 2011 award. Its market success is attributable in large part to its “Built-in Visuals”, including Intel® HD Graphics, Intel® Quick Sync Video, Intel® Clear Video HD Technology, Intel® InTru™ 3D Technology, Intel® Advanced Vector Extensions, and optionally Intel® Wireless Display. In particular, Intel Quick Sync Video technology is praised by many tech reviewers as being revolutionary and finally pushing fast video transcoding to the entire market.

 

PC users now create and consume more video content than ever; however, working with video content and converting it for different uses can be time-consuming. Intel Quick Sync Video is breakthrough hardware acceleration that lets you complete in minutes what used to take hours. It accelerates decoding and encoding for a significantly faster conversion time with excellent power efficiency, while also enabling the processor to complete other tasks, improving overall PC performance.

 

Intel® Quick Sync Video graphics technology in the second generation Intel Core processors is built upon novel graphics and media architecture innovations. It delivers breakthrough performance surpassing the current highest-end and most expensive add-on graphics cards at much lower power, while maintaining superb visual quality rivaling software solutions. The performance, flexibility, and power efficiency are achieved by combining a media-friendly parallel programming model with several levels of hardware accelerators.

 

Unlike the implicit parallel programming model employed in typical GPGPU languages like OpenCL, the Sandy Bridge processor graphics architecture supports a task-oriented parallel programming model that extracts maximum performance and is more natural for media programmers, who think in terms of matrices and vectors rather than pixels. The task-oriented explicit parallel programming model provides a rich toolset that includes scalar- and vector-based control flow, comprehensive matrix/vector data representations, thread-to-thread data communication, ordered and unordered thread synchronization, concurrent multiple-kernel programs, and encapsulation of closely-coupled hardware accelerators. Hardware-managed thread scheduling, resource management, and thread dependency control enable large-scale, fine-grained, multi-threaded parallel computation with minimal overhead.

 

For example, a video codec is implemented as a multi-threaded program that follows a macroblock “wavefront” execution order, with one macroblock mapped to one hardware thread. This way, the macroblock dependency, which is critical for achieving high video coding efficiency, is honored.
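A wavefront order of this kind can be sketched in a few lines. The abstract does not spell out the dependency pattern, so this example assumes the H.264-style case in which a macroblock depends on its left, top-left, top, and top-right neighbors; under that assumption, macroblock (x, y) belongs to wave x + 2y, and all macroblocks in the same wave can run in parallel. The function names are illustrative and are not part of any Intel API.

```python
from collections import defaultdict

def wavefront_schedule(width_mbs, height_mbs):
    """Group macroblocks into waves; blocks within a wave are independent."""
    waves = defaultdict(list)
    for y in range(height_mbs):
        for x in range(width_mbs):
            waves[x + 2 * y].append((x, y))   # assumed wave index
    return [waves[i] for i in sorted(waves)]

def dependencies(x, y, width_mbs):
    """Neighbors that must finish first (assumed H.264-style pattern)."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < width_mbs and cy >= 0]

schedule = wavefront_schedule(4, 3)   # a tiny 4x3-macroblock frame
for index, wave in enumerate(schedule):
    print(f"wave {index}: {wave}")
```

Each wave would be dispatched as a batch of hardware threads, one per macroblock, once the previous wave has completed.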

 

The closely-coupled hardware accelerators, a targeted set of matrix-oriented accelerators for media, are integrated in the Sandy Bridge processor graphics core, delivering high compute throughput while avoiding the cost and power pitfalls of the GFLOPS race. Hardware accelerators such as the motion estimation engine do the heavy lifting of repetitive computation while leaving the intelligence and flexibility to the media kernel programs. Coarse-level hardware accelerators are also provided to handle non-parallel and bit-oriented computations such as entropy coding.

 

Finally, reuse of video playback functions such as video decoding and video post-processing allows the complete transcoding operation to run in the processor graphics without incurring expensive data movement between the processor graphics and the host processor.

 

Intel® Quick Sync Video is a successful showcase technology of such an innovative graphics and media architecture in the Second Generation Intel® Core™ processors.

 

 

 

S12.2: Power management architecture of the Second Generation Intel® Core™ micro architecture codenamed Sandy Bridge

 

Sandy Bridge is the high-end Second Generation Intel® Core™ micro architecture. It is an integrated die combining a high-performance general-purpose CPU and a high-end media and graphics processor with system-on-a-chip functionality. This high level of integration introduces new challenges in power management and control. Sandy Bridge delivers a world-class energy-efficient design together with a breakthrough power management architecture. The power management architecture is designed to maximize user experience under the system’s physical constraints while maintaining low energy consumption. The user experience includes throughput performance, sustained performance, and visual experience together with energy efficiency and improved ergonomics. In this paper we will cover the Sandy Bridge power management architecture and main algorithms. This includes:

 

  • Turbo algorithms for responsiveness (instantaneous performance) and steady state throughput.
  • Power delivery controls
  • Power budget sharing between the CPU, Graphics, Interconnect, DDR etc.
  • Thermal management of the processing building blocks and the external DDR
  • Platform control capabilities that enable platform embedded controller or data center central manager to control the package power.
  • Energy efficiency and average power

 

 

S12.3: AMD’s “Llano” A Series APU

 

The A Series APU (Accelerated Processing Unit), codenamed “Llano,” includes four x86 processor cores, L2 cache, a discrete-level DirectX®11 GPU core, UVD 3.0 (Unified Video Decoder) media acceleration, integrated Display Port, HDMI or DVI display, PCIe Gen1 or Gen2 IO, all implemented in a single-die 32nm SOI process. Low-power design was a key design metric and power gating is deployed pervasively in the design. Turbo Core technology is deployed to maximize processor performance based on the thermal design capabilities of the system.

 

The A Series APU uses a 32nm update of the “Stars” x86 core, which is the latest evolution of the 45nm AMD x86 core, with significant performance improvements in ILP (instruction level parallelism) and MLP (memory level parallelism), the addition of a hardware integer divider, the addition of a new type of hardware data cache pre-fetcher, and an improved floating point scheduler. These new 32nm x86 cores also support power gating and digital power monitoring.

 

The on-die DirectX®11 GPU core provides discrete level graphics and parallel compute performance with low-latency, high bandwidth direct access to a shared system memory. This enables up to 500 gigaflops of high performance computing. The GPU core can be dynamically power-gated. The UVD 3 block supports hardware codec acceleration allowing for low power dual-stream Blu-ray disc playback. The UVD 3 engine can be statically power gated under driver control.

 

Hot Chips 23 program now online. Check it out and register today www.hotchips.org

It is my great pleasure to invite you to this year’s Hot Chips 23 Symposium on High-Performance Chips, to be held on August 17-19 at Stanford University’s Memorial Auditorium in beautiful Palo Alto, California. I have waited to invite you until we published the preliminary presentation schedule, which you can now find at www.hotchips.org.

 

This year’s presentations promise to be extremely interesting and I ask that you take a look at the attached Advanced Program schedule to see what’s in store for attendees. You’ll find all of your favorite companies presenting at Hot Chips 23 including ARM, Intel, AMD, Cavium, IBM, Xilinx, and Cisco.

 

Also please note the keynotes: Simon Segars of ARM will speak about ARM Processor Evolution and Steve Cousins of Willow Garage will be speaking about the challenge of building personal robots. You may be familiar with Segars. He’s an excellent speaker and we’re delighted to have him at Hot Chips 23. You may not be familiar with Cousins and Willow Garage. The company develops hardware and open-source software for personal robot experimentation and has put out several incredible videos of its PR2 robot doing mundane things like folding towels, shopping, and cooking breakfast—autonomously. I am certain you will find both keynotes fascinating.

 

You might also note our panel on Ecosystem Wars. We’ve known for some time that it’s no longer just about pipelines, interrupt latency, memory bandwidth, out-of-order execution, etc. It’s also about how you get the software developed in a multicore world. The panel discussion should prove both lively and informative.

 

Frankly, I’ll be shocked if you aren’t able to find something interesting (actually many interesting somethings) from the presentations made this year at the Hot Chips conference.

 

I expect to see you there. Why not register now, while you’re thinking about it? www.hotchips.org

 

Regards,

 

Steve Leibson

Hot Chips 23 Press Chair and EDA360 Evangelist

Hot Chips 23 program and registration now live!

I know you’ve been waiting for the Hot Chips 23 program to go up and the registration to go live. The program’s now viewable here and you can now register here.

 

Companies presenting at Hot Chips include (in presentation order) Cavium, IBM, ICT/The Chinese Academy of Science, Intel, Tilera, ARM, MoSys, Micron, Xilinx, XMOS, Tensilica, Microsoft, Aquantia, Fulcrum, Cisco, Movidius, Trident, Willow, Oracle, and AMD. That’s a whole lotta’ new chips to hear about.

 

Register here.

Hot Chips 23: How does August 17-19 work for you?

After an interesting dance with Stanford’s logistics people, Hot Chips 23 has announced the dates for this year’s annual unveiling of the hottest chips of the year.

The day for tutorials is Wednesday, August 17. The two-day conference program will take place on Thursday, August 18 and Friday, August 19.

Some of you will want to see the program before deciding to attend. However, there’s a core group of die-hards who never miss a Hot Chips conference. If you’re one of those people, you now have the dates to reserve on your social calendar. That’s about all you can do at the moment, because registration isn’t up and running just yet. Stay tuned to this blog. We’ll let you know when you can be one of the first to register.