Seminar Topics: Crusoe Processor

INTRODUCTION

Mobile computing has been a buzzword for quite a long time, and mobile computing devices such as laptops and notebook PCs are becoming common. The heart of every PC, whether desktop or mobile, is the microprocessor. Several microprocessors are available for desktop PCs from companies like Intel, AMD and Cyrix. The mobile computing market, however, has never had a microprocessor designed specifically for it: the microprocessors used in mobile PCs are optimized versions of desktop PC microprocessors.

Mobile computing makes very different demands on processors than desktop computing. Desktop PC processors consume a lot of power and get very hot. When you're on the go, a power-hungry processor means you have to pay a price: run out of power before you've finished, or run through the airport with pounds of extra batteries. A hot processor also needs fans to cool it, making the resulting mobile computer bigger, clunkier and noisier. Even so, the market will reject a newly designed low-power microprocessor if its performance is poor, so any attempt in this direction must strike a proper 'performance-power' balance to ensure commercial success. A newly designed microprocessor must also be fully x86 compatible, that is, it should run x86 applications just like conventional x86 microprocessors, since most of the presently available software has been designed for the x86 platform.

Crusoe is a new microprocessor designed specifically for the mobile computing market with the above-mentioned constraints in mind. It was developed by Transmeta Corp., a small Silicon Valley startup company.

The concept of Crusoe is well understood from a simple sketch of the processor architecture, called the 'amoeba'. In this concept, the x86 architecture is an ill-defined amoeba containing features like segmentation, ASCII arithmetic, variable-length instructions etc. Crusoe was therefore conceptualized as a hybrid microprocessor, i.e. it has a software part and a hardware part, with the software layer surrounding the hardware unit. The role of the software is to act as an emulator that translates x86 binaries into native code at run time. Crusoe is a 128-bit microprocessor fabricated using the CMOS process. The chip's design is based on a technique called VLIW to ensure design simplicity and high performance. The other two technologies it uses are Code Morphing Software and LongRun Power Management. The Crusoe hardware can be changed radically without affecting legacy x86 software. For the initial Transmeta products, models TM3120 and TM5400, the hardware designers opted for minimal space and power.

CRUSOE PROCESSOR VLIW HARDWARE

1. Basic principles of VLIW Architecture

VLIW stands for Very Long Instruction Word. VLIW is a method that combines multiple standard instructions into one long instruction word. This word contains instructions that can be executed at the same time on separate chips or on different parts of the same chip. It provides explicit parallelism, i.e. executing more than one basic (primitive) instruction at a time. With VLIW, the compiler, not the chip, determines which instructions can be run concurrently. This is an advantage because the compiler knows more about the program than the chip does by the time the code gets to the chip.

Trace scheduling is an important technique in VLIW processing: the compiler processes the code, determines which path is the most frequently traveled, and then optimizes this path. The basic blocks that compose the path are separated from the other basic blocks. The path is then optimized and rejoined with the other basic blocks using split and rejoin blocks.
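The first step described above, picking the most frequently traveled path, can be sketched as follows. This is a minimal illustration, not a full trace scheduler: the control-flow graph, the profile counts and the function name hot_trace are all hypothetical, and the optimization and rejoining steps are not shown.

```python
def hot_trace(entry, successors, exec_count):
    """Follow the most frequently executed successor from `entry` until the
    path ends or would revisit a block already on the trace."""
    trace, seen, block = [], set(), entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        nexts = successors.get(block, [])
        # Pick the successor with the highest (assumed) profile count.
        block = max(nexts, key=lambda b: exec_count.get(b, 0)) if nexts else None
    return trace


# Hypothetical diamond-shaped control flow: A branches to B or C, both join at D.
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
exec_count = {"A": 100, "B": 90, "C": 10, "D": 100}
trace = hot_trace("A", successors, exec_count)
```

Here the trace is A, B, D; block C is off the hot path and would be left for the split and rejoin handling.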

Dynamic scheduling is another important method when compiling VLIW code. A process called split-issue splits the code into two phases, phase one and phase two. This allows multiple instructions, including instructions with certain delays, to execute at the same time. Hardware support is needed to implement this, in the form of delay buffers and temporary variable space (TVS). The TVS is needed to store results when they come in. The results computed in phase two are stored in temporary variables and are loaded into the appropriate phase one register when they are needed.

VLIW, whose instruction set consists of simple, RISC-like instructions, has been described as a natural successor to RISC because it moves complexity from the hardware to the compiler, allowing simpler, faster processors. One objective of VLIW is to eliminate complicated instruction scheduling in hardware. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy.

2. Crusoe VLIW in Microprocessor

With the Code Morphing software handling x86 compatibility, Transmeta hardware designers created a very simple, high-performance VLIW engine with two integer units, a floating-point unit, a memory (load/store) unit, and a branch unit. A Crusoe processor long instruction word, called a molecule, can be 64 bits or 128 bits long and contain up to four RISC-like instructions, called atoms. All atoms within a molecule are executed in parallel, and the molecule format directly determines how atoms get routed to functional units; this greatly simplifies the decode and dispatch hardware. Figure 1 shows a sample 128-bit molecule and the straightforward mapping from atom slots to functional units. Molecules are executed in order, so there is no complex out-of-order hardware. To keep the processor running at full speed, molecules are packed as fully as possible with atoms. In a later section, we describe how the Code Morphing software accomplishes this.
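The packing constraint can be illustrated with a small sketch. It assumes a simplified model that is not taken from Transmeta documentation: atoms are packed greedily in their given order, a molecule is closed when it is full, when the required functional unit has no free slot, or when an atom reads or rewrites a register written earlier in the same molecule. The Atom class and pack function are hypothetical names; the real Code Morphing software also reorders atoms, which this sketch does not.

```python
from dataclasses import dataclass

# Slot limits follow the functional units listed in the text
# (two integer units, one each of the others).
UNIT_SLOTS = {"alu": 2, "fpu": 1, "mem": 1, "branch": 1}
MAX_ATOMS = 4  # a molecule holds up to four atoms


@dataclass
class Atom:
    name: str          # hypothetical label, used only for display
    unit: str          # functional unit that executes the atom
    dest: str = ""     # register written ("" if none)
    srcs: tuple = ()   # registers read


def pack(atoms):
    """Greedy in-order packing of atoms into molecules (simplified model)."""
    molecules, current, used, written = [], [], {}, set()
    for atom in atoms:
        full = len(current) == MAX_ATOMS
        no_slot = used.get(atom.unit, 0) >= UNIT_SLOTS[atom.unit]
        # Atoms in one molecule run in parallel, so an atom may not depend on
        # (or overwrite) a result produced inside the same molecule.
        depends = any(s in written for s in atom.srcs) or atom.dest in written
        if current and (full or no_slot or depends):
            molecules.append(current)
            current, used, written = [], {}, set()
        current.append(atom)
        used[atom.unit] = used.get(atom.unit, 0) + 1
        if atom.dest:
            written.add(atom.dest)
    if current:
        molecules.append(current)
    return molecules


atoms = [
    Atom("ld_a", "mem", "r30", ("esp",)),
    Atom("sub", "alu", "ecx", ("ecx",)),
    Atom("add_a", "alu", "eax", ("eax", "r30")),
    Atom("add_b", "alu", "ebx", ("ebx", "r30")),
    Atom("ld_b", "mem", "esi", ("ebp",)),
]
molecules = pack(atoms)
```

With this input the sketch produces two molecules: the first holds the load and the subtract, the second holds both adds and the second load.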

The integer register file has 64 registers, %r0 through %r63. By convention, the Code Morphing software allocates some of these to hold x86 state while others contain state internal to the system, or can be used as temporary registers, e.g., for register renaming in software.

Superscalar out-of-order x86 processors, such as the Pentium II and III processors, also have multiple functional units that can execute RISC-like operations (micro-ops) in parallel. Figure 2 depicts the hardware these designs use to translate x86 instructions into micro-ops and schedule (dispatch) the micro-ops to make best use of the functional units. Since the dispatch unit reorders the micro-ops as required to keep the functional units busy, a separate piece of hardware, the in-order retire unit, is needed to effectively reconstruct the order of the original x86 instructions and ensure that they take effect in the proper order. Clearly, this type of processor hardware is much more complex than the Crusoe processor's simple VLIW engine.

Because the x86 instruction set is quite complex, the decoding and dispatching hardware requires large quantities of power-hungry logic transistors; the chip dissipates heat in rough proportion to their numbers.

CODE MORPHING SOFTWARE

The Code Morphing software (CMS) is fundamentally a dynamic translation system, a program that compiles instructions for one ISA (here, the x86 target ISA) into instructions for another ISA (the VLIW host ISA). The Code Morphing software resides in a ROM and is the first program to start executing when the processor boots. It supports the x86 ISA and is the only thing x86 code sees. The only program written directly for the VLIW engine is the CMS itself. The figure shows this.

Because the CMS insulates x86 programs, including a PC's BIOS and operating system, from the hardware engine's native instruction set, that native instruction set can be changed arbitrarily without affecting any x86 software at all. The only program that needs to be ported is the CMS itself. The feasibility of this concept has been demonstrated: the native ISA of the model TM5400 is an enhancement (neither forward nor backward compatible) of the model TM3120's ISA and runs a different version of the CMS. The model TM3120 is aimed at Internet appliances and ultra-light mobile PCs, while the model TM5400 supports high-performance, full-featured 3-4 lb mobile PCs.

Hiding the chip's ISA behind a software layer also avoids a problem that has hampered VLIW machines in the past. A traditional VLIW exposes details of the processor pipeline to the compiler, hence any change to that pipeline would require all existing binaries to be recompiled to make them run on the new hardware. Even traditional x86 processors suffer from a related problem: while old applications will run correctly on a new processor, they usually need to be recompiled to take full advantage of the new processor implementation. This is not a problem on Crusoe processors, since in effect the Code Morphing software always transparently "recompiles" and optimizes the x86 code it is running.

The flexibility of the s/w translation approach comes at a price: the processor has to dedicate some of its cycles to running the Code Morphing s/w, cycles that a conventional x86 processor could use to execute appln code.

1. Decoding and Scheduling

Conventional x86 processors fetch x86 binary instructions from memory and decode them into micro-ops, which are then reordered by out-of-order dispatch hardware and fed to the functional units for parallel execution.

Code Morphing can translate an entire group of x86 instns at once, creating a translation, whereas a superscalar x86 translates single instns in isolation. Moreover, while a traditional x86 translates each x86 instn every time it is executed, Transmeta’s software translates instns once, saving the resulting translation in a translation cache. The next time the (now translated) x86 code is executed, that saved information can be used.
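The translate-once, reuse-many-times behaviour can be sketched as a lookup table keyed by the x86 address of a block. This is only an illustration of the idea: the TranslationCache class, its translate callback and the address 0x4000 are hypothetical and do not describe Transmeta's actual data structures.

```python
class TranslationCache:
    """Maps the x86 address of a code block to its saved translation."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical translator callback
        self._cache = {}
        self.translations_done = 0    # how often the translator actually ran

    def lookup(self, x86_addr):
        # Translate only on the first request; later requests reuse the result.
        if x86_addr not in self._cache:
            self._cache[x86_addr] = self._translate(x86_addr)
            self.translations_done += 1
        return self._cache[x86_addr]


cache = TranslationCache(lambda addr: f"native code for block {addr:#x}")
for _ in range(1000):
    native = cache.lookup(0x4000)
```

After the loop the translator has run once even though the block was requested 1000 times, which is the saving the text describes.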

Because an out-of-order processor has to translate and schedule instructions every time they execute, it must do so very quickly. This limits the kinds of transformations it can perform. The Code Morphing approach, on the other hand, can spread the cost of translation over many executions, allowing it to use much more sophisticated translation and scheduling algorithms.

2. Caching

The translation cache, along with the Code Morphing code, resides in a separate memory space that is inaccessible to x86 code. (For better performance, the Code Morphing s/w copies itself from ROM to DRAM at initialization time.) The size of this memory space can be set at boot time, or the operating system can make the size adjustable.

The CMS’s technique of reusing translations takes advantage of “locality of reference”. Specifically, the translation system exploits the high repeat rates (the number of times a translated block is executed on average) seen in real-life applications.

Furthermore, as an application executes, Code Morphing "learns" more about the program and improves it so that it executes faster and faster. On typical applications, due to their high repeat rates, Code Morphing has the opportunity to optimize execution and reduce any initial translation overhead. As an example, consider a multimedia application such as a DVD player: before the first video frame has been drawn, the DVD decoder will have been fully translated and optimized, incurring no further overhead during the playing time of the DVD.

3. Filtering

In typical applications, a very small fraction of the application's code (often less than 10%) accounts for more than 95% of execution time. The optimizer should therefore focus its effort on the most frequently executed code rather than waste it on code that executes only once.

The CMS includes in its arsenal a wide choice of execution modes for x86 code, ranging from interpretation (which has no translation overhead, but executes x86 code more slowly), through translation using very simple-minded code generation, all the way to highly optimized code (which takes longest to generate, but which runs fastest once translated).
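One way to picture the choice among execution modes is a per-block execution counter with promotion thresholds. The sketch below is an assumption for illustration only: the source does not say how the CMS chooses a mode, and the threshold values and the ModeSelector name are hypothetical.

```python
from collections import Counter

# Hypothetical thresholds; the source does not give actual values.
TRANSLATE_AFTER = 10     # executions before a simple translation is produced
OPTIMIZE_AFTER = 1000    # executions before the block is highly optimized


class ModeSelector:
    def __init__(self):
        self.counts = Counter()

    def mode_for(self, block_addr):
        """Record one execution of the block and return the mode to use."""
        self.counts[block_addr] += 1
        n = self.counts[block_addr]
        if n > OPTIMIZE_AFTER:
            return "optimized"
        if n > TRANSLATE_AFTER:
            return "simple-translation"
        return "interpret"


selector = ModeSelector()
modes = [selector.mode_for(0x100) for _ in range(1001)]
```

Code that runs only once stays in the interpreter and never pays any translation cost, while a hot block is promoted step by step.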

4. Prediction and Path Selection

The Code Morphing software can gather feedback about the x86 program by instrumenting translations to collect information such as block execution frequencies or branch history. This data can be used later to decide when and what to optimize and translate. For example, if a given conditional x86 branch is highly biased, the system can likewise bias its optimizations to favor the most frequently taken path. Similarly, knowing how often a piece of x86 code is executed helps to decide how much effort to spend optimizing that code.
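A branch-history counter of the kind instrumented code might maintain can be sketched as below. The BranchProfile class, the 90% bias threshold and the minimum sample count are assumptions made for this example, not values from the source.

```python
class BranchProfile:
    """Taken / not-taken counters that instrumented code would update."""

    def __init__(self):
        self.taken = 0
        self.not_taken = 0

    def record(self, was_taken):
        if was_taken:
            self.taken += 1
        else:
            self.not_taken += 1

    def favored_path(self, bias=0.9, min_samples=100):
        """Return "taken", "fallthrough", or None when the branch is not
        biased strongly enough (both thresholds are assumed values)."""
        total = self.taken + self.not_taken
        if total < min_samples:
            return None
        if self.taken / total >= bias:
            return "taken"
        if self.not_taken / total >= bias:
            return "fallthrough"
        return None


profile = BranchProfile()
for i in range(200):
    profile.record(i % 20 != 0)   # taken 190 times out of 200
```

With 95% of the samples taken, favored_path() reports "taken", so an optimizer could lay out and schedule code for that path first.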

Current Intel and AMD x86 processors convert x86 instructions into RISC-like micro-ops that are simpler and easier to handle in a superscalar microarchitecture. The micro-op translation adds at least one pipeline stage and requires the decoder to call a microcode routine to translate some of the most complex x86 instructions. Implementing the equivalent of that front-end translation in software saves Transmeta a great deal of control logic and simplifies the design of its chips.

Crusoe interposes an abstraction layer that hides internal details from the outside world. Thus x86 programmers can write s/w without needing any knowledge about a Crusoe system’s VLIW or CMS.

CRUSOE H/W SUPPORT FOR CODE MORPHING

Dynamic translation on conventional processors would result in unsatisfactory performance. In contrast, the Crusoe hardware can achieve excellent performance because it has been designed specifically with dynamic translation in mind. Since the Crusoe microprocessor relies on both the VLIW technique and the CMS, special hardware support is needed for them to work well. Three simple hardware features, each addressing a different problem, are described below:

1. Exceptions and Speculation

Without special hardware support, it is in general very difficult for a dynamic translation system to correctly model the exception semantics of the target ISA while at the same time achieving high performance. The reason is that exception semantics impose severe constraints on instruction scheduling. Consider an example in which the following x86 code:

A. addl %eax,(%esp)

B. addl %ebx,(%esp)

C. movl %esi,(%ebp)

D. subl %ecx,5

was translated into the following two molecules:

1. ld %r30,[%esp]; sub.c %ecx,%ecx,5

2. ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30

In the x86 ISA, exceptions are precise: when one instruction causes an exception, all instructions preceding it must complete before the exception is reported, and none of the subsequent instructions may complete. Observe that in the translation above, atoms occur out of order with respect to the original x86 code order. Imagine that during execution, the load instruction in molecule 2, corresponding to x86 instruction (C), takes a page fault. By that time, the processor has already executed code in molecule 1 corresponding to instruction (D), which violates the rules of precise exceptions. Solving this problem without special hardware support costs performance. Out-of-order processors have this problem too.

The Crusoe host processor provides a much simpler h/w solution that works hand-in-hand with the CMS. All registers holding x86 state are shadowed, i.e., there exist two copies of each register, a working and a shadow copy. Normal atoms only update the working copy of the register. When execution reaches the end of a translation without encountering an exception, a special commit operation copies all working registers into their corresponding shadow registers, committing the work done in the translation. On the other hand, if any x86-level exception occurs inside the translation, the runtime system undoes the effects of all molecules executed since the start of the translation. This is done via a rollback operation, which copies the shadow register values back into the working registers. At this point, the CMS re-executes the x86 instns conservatively, that is to say in their original program order, to determine the actual location of the exception.

Undoing changes to memory is slightly more complicated. The Crusoe processor handles x86 store operations by holding store data in a “gated store buffer”, from which they are only released to the memory system at the time of a commit. On a rollback, stores not yet committed can simply be dropped from the store buffer.
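The commit and rollback mechanism for registers and stores can be modelled in a few lines. This is a software sketch of the behaviour described above, not of the hardware itself; the class and method names are hypothetical, and the forwarding of pending stores to later loads is an assumption added so that the model is self-consistent.

```python
class CommitRollbackState:
    def __init__(self, registers, memory):
        self.shadow = dict(registers)    # last committed x86 register state
        self.working = dict(registers)   # copy that normal atoms update
        self.memory = memory             # committed memory (address -> value)
        self.store_buffer = []           # gated stores awaiting a commit

    def write_reg(self, name, value):
        self.working[name] = value

    def store(self, addr, value):
        # Stores are held back; memory is untouched until commit().
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Assumption: a load sees the newest pending store to the same address.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def commit(self):
        """End of a translation without exceptions: make the work permanent."""
        self.shadow = dict(self.working)
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def rollback(self):
        """Exception inside a translation: restore the last committed state."""
        self.working = dict(self.shadow)
        self.store_buffer.clear()


memory = {}
state = CommitRollbackState({"eax": 1}, memory)
state.write_reg("eax", 5)
state.store(0x10, 7)
state.rollback()          # eax is 1 again, memory was never touched
```

After a rollback the x86 code can be re-executed in its original order from the last committed state, as the text describes.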

2. Alias Hardware

The more freedom the scheduler has to move atoms around to fill molecules, the better code it can usually generate. One of the biggest limits on this freedom comes from potential dependencies between memory operations. In particular, it is often desirable to be able to reorder load instructions ahead of store instructions. However, doing that is incorrect if the load happens to use data from the preceding store, and since it is generally hard to prove otherwise at translation time, a translator often has to make overly conservative assumptions. (This is also a problem for traditional compilers and microprocessors.)

The Crusoe host provides innovative alias hardware that addresses this problem. When the translator moves a load operation ahead of a store operation, it converts the load into a load-and-protect (which in addition to loading data also records the address and size of the data loaded) and the store into a store-under-alias-mask (which checks for protected regions). In the (unlikely) event that the store operation overwrites the previously loaded data, the processor raises an exception and the runtime system can take corrective action. Using this mechanism, it is always safe to reorder memory loads and stores.

The alias hardware can be put to even better use than moving atoms around: it can help to eliminate redundant load/store atoms. Consider the case where a datum is loaded from memory twice, but there is an intervening store operation (a code sequence that is actually fairly common in processors with few registers, like the x86):

ld %r30,[%x] // first load from location X

...

st %data,[%y] // might overwrite location X

ld %r31,[%x] // this accesses location X again

use %r31.

As long as the intervening store operation does not overlap with the first load, the second load is redundant, but all too often a translator or compiler cannot prove that this is the case. Using the alias hardware, it is a simple matter to protect the first load, have the store check pending aliases, and eliminate the second load:

ldp %r30,[%x] // load from X and protect it

...

stam %data,[%y] // this store traps if it writes X

use %r30 // can use data from first load

Notice that the use of the loaded data can now also be scheduled earlier, further speeding up the generated code.
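The check performed by the alias hardware amounts to an address-range overlap test. The sketch below models it in software under simplifying assumptions: every protected region is checked on every store (the real store-under-alias-mask selects regions with a mask), the fault is raised before memory is modified, and the names AliasHardware and AliasFault are hypothetical.

```python
class AliasFault(Exception):
    """Raised when a checked store overlaps a protected load."""


class AliasHardware:
    def __init__(self, memory):
        self.memory = memory      # bytearray standing in for RAM
        self.protected = []       # (address, size) of protected loads

    def load_and_protect(self, addr, size):
        # Load the data and remember which bytes were read.
        self.protected.append((addr, size))
        return bytes(self.memory[addr:addr + size])

    def store_under_alias_mask(self, addr, data):
        end = addr + len(data)
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < end:   # ranges overlap
                raise AliasFault(f"store at {addr:#x} hits a protected load")
        self.memory[addr:end] = data

    def clear(self):
        """Drop all protections, e.g. at the end of a translation."""
        self.protected.clear()


hw = AliasHardware(bytearray(64))
value = hw.load_and_protect(16, 4)           # protects bytes 16..19
hw.store_under_alias_mask(32, b"\x01\x02")   # no overlap, store proceeds
```

A store that touched any of bytes 16 to 19 would raise AliasFault instead, at which point a runtime system could discard the speculative work and fall back to a conservative translation.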

3. Coping with Self-modifying Code

At times, x86 instructions in memory get overwritten, either because the operating system is loading a new program, or because an application is using self-modifying code. When this happens to code that has already been translated, the Code Morphing software needs to be notified to keep it from erroneously executing a translation for the old code. To make this possible, whenever the system translates a block of x86 code, it write-protects the page of x86 memory containing that code. It does so by setting a dedicated "translated" bit in that page's entry in the processor's memory management unit (that bit is invisible to x86 software). When a protected page is written to, the simplest remedy is to invalidate the affected translation(s).

As the runtime system dynamically learns more about the program’s behavior, it switches to more sophisticated strategies.
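The simplest remedy, invalidating every translation on a written page, can be sketched as follows. The bookkeeping shown is an assumption for illustration: the TranslatedPages class is hypothetical, the presence of a page in the dictionary stands in for the "translated" bit, and blocks are assumed not to span pages.

```python
PAGE_SIZE = 4096  # standard x86 page size


class TranslatedPages:
    def __init__(self):
        # page number -> addresses of translated blocks on that page;
        # a page being present here plays the role of the "translated" bit.
        self.by_page = {}

    def note_translation(self, block_addr):
        """Record a new translation and mark its page as protected."""
        self.by_page.setdefault(block_addr // PAGE_SIZE, set()).add(block_addr)

    def on_write(self, addr):
        """Handle a write to x86 memory; return the invalidated blocks."""
        return self.by_page.pop(addr // PAGE_SIZE, set())


pages = TranslatedPages()
pages.note_translation(0x1000)
pages.note_translation(0x1040)
pages.note_translation(0x3000)
invalidated = pages.on_write(0x1FF0)   # write lands on the page at 0x1000
```

Both translations on the written page are invalidated while the one at 0x3000 survives. Page granularity is coarse, which is one reason a runtime system might move on to more sophisticated strategies.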

LONGRUN POWER MANAGEMENT

Although the CMS’s primary responsibility is ensuring x86 compatibility, it also provides interfaces to capabilities available only in Crusoe processor models. LongRun power management is one example—a facility in the TM5400 model that can further minimize that processor’s already low power consumption.

In a mobile setting, most conventional x86 CPUs regulate their power consumption by rapidly alternating between running the processor at full speed and turning the processor off. Different performance levels can be obtained by varying the on/off ratio. However, with this approach, the processor may be shut off just when a time-critical application needs it.

In contrast, the TM5400 can adjust its power consumption without turning itself off—instead, it can adjust its clock frequency on the fly. It does so extremely quickly, and without requiring an operating system reboot. As a result, software can continuously monitor the demands on the processor and dynamically pick just the right clock speed (and hence power consumption) needed to run the application—no more and no less. Since the switching happens so quickly, it is not noticeable to the user.

1. LongRun extends Battery Life

The TM5400's innovative technology, called LongRun, can scale the CPU's voltage in as many as 32 steps. Voltage and frequency have individually controllable but codependent ranges. In the current version of the TM5400, voltage can vary from 1.1 V to 1.6 V, and frequency can vary from 200 MHz to 700 MHz in increments of 33 MHz.

When the LongRun software detects a change in the CPU load, it signals the chip to adjust the voltage and frequency up or down. If the CPU needs to handle a heavier load, LongRun tells the chip to start ramping up its voltage. When the voltage stabilizes at the higher level, the chip scales up its clock frequency.

If the LongRun software determines that the CPU can save power by running more slowly, the chip starts scaling down its frequency. By always keeping the clock frequency within the limits required by the voltage, it avoids any undesirable effects. It never needs more than one frequency step to reach a different target. To scale from 600 to 700 MHz, it doesn’t have to take three 33-MHz steps. Instead, it raises the voltage to 1.6 V in multiple steps, then boosts the frequency to 700 MHz in one big jump.
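The ordering rule can be written down as a small planning function. This is a sketch under stated assumptions: the 50 mV voltage step and the 1500 mV operating voltage used for 600 MHz are hypothetical, the function names are invented for the example, and lowering the voltage after the frequency when slowing down is inferred from the rule that the clock must stay within what the voltage allows.

```python
VOLT_STEP_MV = 50  # assumed step size; the text gives only the 1.1 V to 1.6 V range


def voltage_ramp(cur_mv, new_mv):
    """Voltage settings to step through, ending exactly at new_mv."""
    if new_mv == cur_mv:
        return []
    step = VOLT_STEP_MV if new_mv > cur_mv else -VOLT_STEP_MV
    points = list(range(cur_mv + step, new_mv, step)) + [new_mv]
    return [("set_voltage_mv", mv) for mv in points]


def transition(cur_mhz, cur_mv, new_mhz, new_mv):
    """Ordered actions for moving between two operating points."""
    volts = voltage_ramp(cur_mv, new_mv)
    # The frequency changes in a single jump, however far apart the targets are.
    freq = [("set_frequency_mhz", new_mhz)] if new_mhz != cur_mhz else []
    # Speeding up: voltage first. Slowing down: frequency first.
    return volts + freq if new_mhz > cur_mhz else freq + volts


speed_up = transition(600, 1500, 700, 1600)
slow_down = transition(700, 1600, 600, 1500)
```

In speed_up the two voltage steps come before the single frequency change; in slow_down the frequency drops first and the voltage follows.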

One concern is that LongRun might not react quickly enough to accommodate the fast-changing demands of some programs. When the computer is playing MPEG-compressed video, for example, a transition from a relatively static frame to an action-filled frame might overwhelm a CPU that is loitering at a low clock speed. LongRun software can detect a change in the CPU load in about half a microsecond, and it can scale the voltage up or down in less than 20 microseconds per step. The worst case, a full swing from 1.1 V to 1.6 V and from 200 to 700 MHz, takes only 280 microseconds. Furthermore, the CPU doesn't stop during the swing. The processor stalls only while the PLL relocks onto the new frequency, which takes no longer than 20 microseconds in the worst case.

LongRun isn't the only reason that Crusoe processors appear to consume much less power than comparable x86 chips. The TM3120 doesn't have LongRun, yet its power consumption is impressive too. The simplicity of Transmeta's VLIW architecture is evidently a larger factor. Even so, LongRun is a genuine innovation that gives Crusoe an extra edge.

Because power varies linearly with clock speed and with the square of the voltage, adjusting both can produce cubic reductions in power consumption, whereas conventional CPUs can adjust power only linearly.
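The relationship can be checked with the TM5400 figures quoted earlier. The sketch assumes the idealized model in which dynamic power is proportional to frequency times voltage squared and ignores leakage and other effects; the function name relative_power is hypothetical.

```python
def relative_power(freq_mhz, volts, ref_mhz=700.0, ref_volts=1.6):
    """Dynamic power relative to the reference point, using P ~ f * V^2."""
    return (freq_mhz / ref_mhz) * (volts / ref_volts) ** 2


both = relative_power(200, 1.1)        # frequency and voltage both lowered
freq_only = relative_power(200, 1.6)   # frequency lowered, voltage unchanged
```

Under this model, running at 200 MHz and 1.1 V needs about 13.5% of the power used at 700 MHz and 1.6 V, whereas lowering only the frequency would still need about 28.6%.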

CRUSOE PROCESSOR ARCHITECTURE

The Crusoe microprocessor is available in the following versions: TM3120, TM3200, TM5400 and TM5600. The basic architecture of all these models is the same except for some minor changes, since the various models were introduced for different segments of the mobile computing market. The following architectural description takes the Crusoe TM5400 as its reference.

The Crusoe Processor incorporates integer and floating-point execution units, separate instruction and data caches, a level-2 write-back cache, a memory management unit, and multimedia instructions. In addition to these traditional processor features, there are some additional units that are usually part of the core system logic surrounding the microprocessor. The VLIW processor, in combination with the Code Morphing software and the additional system core logic units, allows the Crusoe Processor to provide a highly integrated, ultra-low power, high-performance platform solution for the x86 mobile market.

1. Processor Core

The Crusoe Processor core architecture is relatively simple by conventional standards. It is based on a VLIW 128-bit instn set. Within this VLIW architecture, the control logic of the processor is kept very simple and s/w is used to control the scheduling of instns. This allows a simplified and very straightforward h/w implementation with an in-order 7-stage integer pipeline and a 10-stage floating-point pipeline. By streamlining the processor h/w and reducing the control logic transistor count, the performance-to-power consumption ratio can be greatly improved over traditional x86 architectures.

The Crusoe Processor includes an 8-way set-associative Level 1 (L1) instn cache, and a 16-way set associative L1 data cache. It also includes an integrated Level 2 (L2) write-back cache for improved effective memory bandwidth and enhanced performance. This cache architecture assures maximum internal memory bandwidth for performance intensive mobile applications, while maintaining the same low-power implementation that provides a superior performance-to-power consumption ratio relative to previous x86 implementations.

Besides having execution hardware for logical, arithmetic, shift, and floating-point instructions, as in conventional processors, the Crusoe has features that set it apart from traditional x86 designs. To ease the translation process from x86 to the core VLIW instruction set, the hardware generates the same condition codes as conventional x86 processors and operates on the same 80-bit floating-point numbers. Also, the TLB has the same protection bits and address mapping as x86 processors. The software component of this solution is used to emulate all other features of the x86 architecture. The software that converts x86 programs into the core VLIW instructions is the CMS.

2. Integrated DDR SDRAM Memory Controller

The DDR SDRAM interface is the highest-performance memory interface available on the Crusoe. The DDR SDRAM controller supports only Double Data Rate (DDR) SDRAM and transfers data at a rate that is twice the clock frequency of the interface. This feature is absent in the model TM3200.

The DDR SDRAM controller supports up to four banks, the equivalent of two Dual In-line Memory Modules (DIMMs), of DDR SDRAM using a 64-bit wide interface. The DDR SDRAM memory can be populated with 64M-bit, 128M-bit, or 256M-bit devices. The frequency setting for the DDR SDRAM interface is initialized during the power-on boot sequence.

3. Integrated SDR SDRAM Memory Controller

The SDR SDRAM memory controller supports up to four banks, equivalent to two Small Outline Dual In-line Memory Modules (SO-DIMMs), of Single Data Rate (SDR) SDRAM that can be configured as 64-bit or 72-bit SO-DIMMs. These SO-DIMMs can be populated with 64M-bit, 128M-bit or 256M-bit devices. All SO-DIMMs must use the same frequency SDRAMs, but there are no restrictions on mixing different SO-DIMM configurations into each SO-DIMM slot. The frequency setting for the SDR SDRAM interface is initialized during the power-on boot sequence.

4. Integrated PCI Controller

The Crusoe Processor includes a PCI bus controller that is PCI 2.1 compliant. The PCI bus is 32 bits wide, operates at 33 MHz, and is compatible with 3.3V signal levels. It is not 5V tolerant, however. The PCI controller provides a PCI host bridge, the PCI bus arbiter, and a DMA controller.

5. Serial ROM Interface

The Crusoe serial ROM interface is a five-pin interface used to read data from a serial flash ROM.

The flash ROM is 1M-byte in size and provides non-volatile storage for the CMS. During the boot process, the Code Morphing code is copied from the ROM to the Code Morphing memory space in SDRAM. Once transferred, the Code Morphing code requires 8 to 16M-bytes of memory space. The portion of SDRAM space reserved for CMS is not visible to x86 code. Transmeta supplies programming information for the flash ROM device. This interface may also be used for in-system reprogramming of the flash ROM.
