Features of Modern Intel Microprocessors

Software & Services Group, Developer Products Division. Copyright © 2011, Intel Corporation. All rights reserved.

*Other brands and names are the property of their respective owners.



Prepared by: Krunal P Siddhapathak (10BEC097)

http://software.intel.com/en-us/articles/optimization-notice/



Core and Multi-Core Processor

What is a Core?
A standard processor has one core (single-core). A single-core processor runs one instruction stream at a time. Internally it uses pipelines, which let several instructions from that stream be in flight together, but it still works through a single stream.

What is a Multi-Core Processor?
A multi-core processor consists of two or more independent cores, each capable of running its own instruction stream. A dual-core processor contains two cores, a quad-core processor contains four cores, and a hexa-core processor contains six cores.




Need Of Multi-Core Processors

Multiple cores can run two programs side by side: while an intensive program (AV scan, video conversion, CD ripping etc.) keeps one core busy, another core can run your browser, your email client and so on.

Multiple cores really shine when you're using a program that can spread its work over more than one core (called parallelization) to improve the program's efficiency. Programs such as graphics software, games etc. can run multiple instruction streams at the same time and deliver faster, smoother results.
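As a rough illustration of parallelization, the sketch below splits one CPU-bound job into independent chunks and hands one chunk to each worker. The function names (chunk_ranges, sum_squares, parallel_sum_squares) and the sum-of-squares workload are hypothetical choices made for this example, not something taken from the slides.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def chunk_ranges(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs."""
    step, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        stop = start + step + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def sum_squares(bounds):
    """CPU-bound work for one chunk: sum of i*i over [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_squares(n, workers, executor_cls=ProcessPoolExecutor):
    """Give one chunk to each worker and add up the partial results."""
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_squares, chunk_ranges(n, workers)))
```

With the default ProcessPoolExecutor each chunk can run on a different core; call parallel_sum_squares from under an `if __name__ == "__main__":` guard so that worker processes can start cleanly. Passing executor_cls=ThreadPoolExecutor keeps the same structure but, in CPython, will not speed up CPU-bound work, so it is mainly useful for quick checks.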

If you use CPU-intensive software, multiple cores will likely provide a better computing experience. If you use your PC to check emails and watch the occasional video, you really don’t need a multi-core processor.




Core 2 Duo vs. Core i3 vs. Core i5

Feature              Core 2 Duo        Core i3      Core i5
Number of Threads    Two               Four         Four
Socket               775 (45/65 nm)    1156 (nm)    1156 (nm)
Compatible RAM       DDR2              DDR3         DDR3
Turbo Boost          No                No           Yes
Overclocking         No                Yes          No




Do I need an i3, i5, or i7?

As with all computer hardware, the type of processor you need depends on your needs, how long you want your computer to stay current, and your budget.

If you:
Browse the internet, check email, and play the occasional Flash game (like Farmville): get a single-core netbook or desktop.
Do word processing, spreadsheets etc., listen to music often, and watch movies: get an i3 processor (or any dual-core processor, e.g. a Core 2 Duo).
Play the occasional game and are happy with lower resolution and lower quality graphics, watch HD movies etc.: get an i5. (This suggestion assumes the graphics processor on the pre-built PC is well matched to the processor.)
Do graphic publishing, music creation, programming (and compiling), watch HD movies, or like to play visually appealing games: get a quad-core i5 or i7.
Like to have the very best hardware and play the most graphically intense games: get a quad-core or hexa-core i7 Extreme.




Intel Sandy Bridge Microarchitecture

Many of the bottlenecks of previous designs have been dealt with in the Sandy Bridge. Instruction fetch and predecoding has been a serious bottleneck in Intel designs for many years. In the NetBurst architecture they tried to fix this problem by caching decoded µops, without much success. In the Sandy Bridge design, they are caching instructions both before and after decoding. The limited size of the µop cache is therefore less problematic, and the µop cache appears to be very efficient.

The limited number of register read ports has been a serious, and often neglected, bottleneck since the old Pentium Pro. This bottleneck has now finally been removed in the Sandy Bridge.

Previous Intel processors have only one memory read port where AMD processors have two. This was a bottleneck in many math applications. The Sandy Bridge has two read ports, whereby this bottleneck is removed.

The branch prediction has been improved by having bigger buffers and a shorter misprediction penalty, but it has no loop predictor, and mispredictions are still quite common.

The new AVX instruction set is an important improvement. The throughput of floating point addition and multiplication is doubled when the new 256-bit YMM registers are used. The new non-destructive three-operand instructions are quite convenient for reducing register pressure and avoiding register move instructions. There is, however, a serious performance penalty for mixing vector instructions with and without the VEX prefix. This penalty is easily avoided if the programming guidelines are followed, but I suspect that it will be a very common programming error in the future to inadvertently mix VEX and non-VEX instructions, and such errors will be difficult to detect.
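The doubling of floating point throughput follows from register width alone. The small calculation below is only a sketch: it assumes the older SSE registers (XMM) are 128 bits wide, compares them with the 256-bit YMM registers mentioned above, and uses a hypothetical helper named lanes to count how many values fit in one register.

```python
XMM_BITS = 128  # width of an SSE (XMM) register
YMM_BITS = 256  # width of an AVX (YMM) register


def lanes(register_bits, element_bits):
    """Number of elements of the given size that fit in one register."""
    return register_bits // element_bits


for name, bits in (("float32", 32), ("float64", 64)):
    print(name, lanes(XMM_BITS, bits), "->", lanes(YMM_BITS, bits))
# float32 4 -> 8
# float64 2 -> 4
```

One 256-bit instruction therefore processes twice as many values as one 128-bit instruction, for both single and double precision.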




Intel Sandy Bridge Microarchitecture (Contd.)

Whenever the narrowest bottleneck is removed from a system, the next less narrow bottleneck will become the limiting factor. The new bottlenecks that require attention in the Sandy Bridge are the following:

The µop cache: This cache can ideally hold up to 1536 µops. The effective utilization will be much less in most cases. The programmer should pay attention to make sure the most critical inner loops fit into the µop cache.

Instruction fetch and decoding: The fetch/decode rate has not been improved over previous processors and is still a potential bottleneck for code that doesn’t fit into the µop cache.

Data cache bank conflicts: The increased memory read bandwidth means that the frequency of cache conflicts will increase. Cache bank conflicts are almost unavoidable in programs that utilize the memory ports to their maximum capacity.

Branch prediction: While the branch history buffer and branch target buffers are probably bigger than in previous designs, mispredictions are still quite common.

Sharing of resources between threads: Many of the critical resources are shared between the two threads of a core when hyperthreading is on. It may be wise to turn off hyperthreading when multiple threads depend on the same execution resources.




Intel Ivy Bridge Microarchitecture

Ivy Bridge is the codename for an Intel microprocessor using the Sandy Bridge microarchitecture. The name is also applied more broadly to the 22 nm die shrink of the microarchitecture based on tri-gate ("3D") transistors, which is also used in the future Ivy Bridge-EX and Ivy Bridge-EP microprocessors. Ivy Bridge processors are backwards-compatible with the Sandy Bridge platform, but might require a firmware update (vendor specific). Intel has released new 7-series Panther Point chipsets with integrated USB 3.0 to complement Ivy Bridge.

Volume production of Ivy Bridge chips began in the third quarter of 2011. Quad-core and dual-core mobile models launched on April 29, 2012 and May 31, 2012 respectively. Core i3 desktop processors, as well as the first 22 nm Pentium, were launched and became available in the first week of September 2012.




Intel Ivy Bridge Microarchitecture (Contd.)

How much faster are the Ivy Bridge processors?
The base clock frequency of these processors ranges from 2.8 GHz (for the Core i5-3450S) to 3.5 GHz (for the Core i7-3770K).

What different types of Ivy Bridge processors are available?
There are many types of processors in the Ivy Bridge family. The type is indicated by a suffix on the CPU model name. The following list explains these suffixes:
K – Unlocked, ready to be overclocked.
S – Performance optimized. Low power consumption.
T – Power optimized. Ultra low power consumption.
M – Mobile processors for mobile devices.
Q – Quad core processors.
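The suffix list above can be read as a simple lookup table. The sketch below is an illustration only: SUFFIX_MEANING copies the meanings from the list, while the function name decode_suffixes and the assumption that a model name ends in a four-digit number followed by its suffix letters are choices made for this example.

```python
import re

# Meanings copied from the suffix list above.
SUFFIX_MEANING = {
    "K": "Unlocked, ready to be overclocked",
    "S": "Performance optimized, low power consumption",
    "T": "Power optimized, ultra low power consumption",
    "M": "Mobile processor",
    "Q": "Quad core processor",
}


def decode_suffixes(model):
    """Return the meaning of each suffix letter at the end of a model name."""
    # Assumption: the name ends with a four-digit number plus optional letters.
    match = re.search(r"\d{4}([A-Z]*)$", model.strip())
    if match is None:
        raise ValueError(f"no four-digit model number found in {model!r}")
    return [SUFFIX_MEANING.get(letter, "unknown suffix") for letter in match.group(1)]


print(decode_suffixes("Core i7-3770K"))
print(decode_suffixes("Core i5-3450S"))
```

A name with two suffix letters, such as one ending in "QM", yields both meanings in order (quad core, then mobile).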




Intel Ivy Bridge Microarchitecture (Contd.)

Features present in Ivy Bridge:

HD graphics – Ivy Bridge processors have a GPU built in. The GPU supports DirectX 11 (Sandy Bridge supports version 10.1) and OpenGL 3.1 (Sandy Bridge supports version 3.0). Ivy Bridge processors have the Intel HD 4000 or HD 2500 GPU. This means that you do not need an add-on graphics card.

QuickSync Video – This feature, introduced with Sandy Bridge and carried forward into Intel's 3rd generation processors, uses dedicated media processing to make video creation and conversion faster and easier. Whether you want to create DVDs, create, convert and edit 2D/3D videos, or upload to your favorite social networking sites, everything is done in a jiffy.

WiDi 3.0 – Wireless Display technology allows you to stream media content to a multitude of your Wi-Fi connected display devices. You can share 1080p video at 60 FPS using WiDi.

Turbo Boost Technology 2.0 – Using the Turbo Boost technology, you can make your Ivy Bridge processors run faster than their base frequency. For example, a 3.5 GHz Core i7 can be made to run at 3.9 GHz for some time.




Core 2 Duo vs. Core i3

The Core 2 Duo is Intel's veteran, covering a wide range of price and performance sweet spots. It is now being replaced, however, by Intel's rookie Core i3. So, is the Core i3 actually better than the Core 2 Duo, or can you hold off upgrading for a while longer?

The Core 2 Duo has been the processor of choice in laptops for about three years. Over those three years the average speeds of Core 2 Duo processors have advanced significantly and many of today's Core 2 Duo laptops have speeds of around 2.2 GHz or faster. Core 2 Duo processors have also been the go-to for many less expensive desktop systems, with speeds reaching over 3 GHz.

However, there is a newcomer which is challenging the Core 2 Duo. This is the Core i3. It is very similar to the Core 2 Duo in many ways. Both are dual-core processors, and most Core 2 Duo and Core i3 processors have similar clock speeds. However, the processors are based on different architectures.

So, which one is better?




Core 2 Duo vs. Core i3 (Contd.)

Architecture
The Core 2 Duo processors are based on the Core 2 architecture. The Core and Core 2 architectures were arguably Intel's most successful architectures, as they replaced the Pentium 4 processors in desktop systems and made Intel competitive in that space once again.

The Core i3 is based on a new architecture called Nehalem. The Nehalem architecture has numerous advantages over the Core 2 architecture. Nehalem is better constructed for quad-core processors, has hyper-threading available, and can use a feature called Turbo Boost which maximizes processor speed. However, because the Core i3 is the low-end Nehalem variant, most of these features are disabled or not relevant: the Core i3 is a dual-core processor and Turbo Boost is disabled, but hyper-threading is enabled.





Processor Performance
The Core i3 is the slowest variant of the Nehalem based processor. The Core 2 Duo processors, however, don't have the same differentiation between versions of the same architecture. The fastest Core 2 Duo desktop processor has a speed of 3.33 GHz, while the fastest Core i3 desktop processor is clocked at 3.06 GHz.

You might therefore expect that the Core 2 Duo would have the edge - particularly when you consider that the Core 2 Duo costs almost three times as much if you buy it individually - but in fact the Core i3 is faster, and often by no small margin. The Core i3 is faster even in single-threaded applications, but the performance gap really widens in multi-threaded applications. This is because the Core i3 has hyper-threading, which turns the two real cores into four virtual cores. Windows works with the Core i3 as if it is a quad-core processor.

These results remain true in the mobile space, as well. Core i3 processors punch at least 500 MHz above their weight in single-thread applications, and are virtually always faster in multi-threaded applications, no matter the clock speeds of the Core 2 Duo and Core i3 processors you are comparing.





Power Usage and Heat
A look at the technical specifications of the Core i3 processors automatically puts them in a negative light when it comes to power consumption. The desktop Core i3 parts are listed as having a 73 Watt TDP, while most Core 2 Duo desktop parts have a 65 Watt TDP. In laptops the Core i3 has a 35 Watt TDP, while Core 2 Duo mobile processors usually have a 25 Watt TDP.

These differences pan out about how you'd expect them to when it comes to absolute power consumption. The Core i3 processors do consume slightly more power than Core 2 Duo processors at load and at idle. We're talking a difference of around 10 Watts on desktops and a few Watts on laptops: nothing huge, but a difference nonetheless.

However, when it comes to power efficiency the answer becomes less clear. In order for a processor to be power efficient, it needs to not only have low power consumption but also the ability to complete tasks quickly. This lowers the overall "task energy" because a faster processor will be done with a task before a slower processor, and once done it will slip back into an idle state.

When viewed from this perspective, the Core i3 is much more efficient than the Core 2 Duo on both the desktop and the laptop. This means that the Core i3 will probably not use any more power than a Core 2 Duo - and may actually use less - unless your usage patterns place a constant load on your processor.
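The "task energy" argument can be checked with a little arithmetic: energy is power multiplied by time, summed over the busy and idle parts of a fixed window. The sketch below uses made-up wattages and durations purely for illustration (they are not measurements of any Core 2 Duo or Core i3), and the function name task_energy_joules is hypothetical.

```python
def task_energy_joules(load_watts, idle_watts, task_seconds, window_seconds):
    """Energy used over a fixed window: busy for the task, idle afterwards."""
    if task_seconds > window_seconds:
        raise ValueError("task does not finish inside the window")
    idle_seconds = window_seconds - task_seconds
    return load_watts * task_seconds + idle_watts * idle_seconds


# Hypothetical numbers: the faster chip draws more power but finishes sooner.
slower = task_energy_joules(load_watts=55, idle_watts=10, task_seconds=60, window_seconds=100)
faster = task_energy_joules(load_watts=65, idle_watts=12, task_seconds=40, window_seconds=100)
print(slower, faster)  # 3700 3320
```

Under these assumed figures the faster processor uses less energy for the same task even though it draws more power at load and at idle. If both chips were kept at full load for the whole window, the one with the higher load wattage would use more, which matches the constant-load caveat above.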




Various Core Processors Of Intel

Core i3 Series
Intel's Core i3 processor line has always been a budget option. These processors remain dual-core, unlike the rest of the Core line, which is made up of quad-core processors. Intel's Core i3 processors also have many features restricted.

The main feature that is kept from the Core i3 processors is Turbo Boost, the dynamic overclocking available on most Intel processors. This, along with the dual-core design, accounts for most of the performance difference between Core i3 processors and the i5 and i7 options.

One feature that Core i3 has, and i5 doesn't, is hyper-threading. This is Intel's logical-core duplication technology, which allows each physical core to be used as two logical cores. The result of this is that Windows will display a dual-core Core i3 processor as if it were a quad-core.

Finally, Core i3 processors have their integrated graphics processor (IGP) restricted to a maximum clock speed of 1100 MHz, and all Core i3 processors have the 2000 series IGP, which is restricted to 6 execution units. This will result in slightly lower IGP performance overall, but the difference is frankly inconsequential in many situations.




Various Core Processors Of Intel (Contd.)

Core i5 Series
Intel used to split the Core i5 processor brand into two different lines, one of which was dual-core and one of which was quad-core. All Sandy Bridge Core i5 processors are quad-core processors, they all have Turbo Boost, and they all lack Hyper-Threading. Most of the Core i5 processors, besides the K series (explained later), use the same 2000 series IGP with a maximum clock speed of 1100 MHz and six execution units.

In the i3 vs. i5 vs. i7 battle, the Core i5 processor is now obviously the mainstream option no matter which product you buy. The only substantial difference between the Core i5 options is the clock speed, which ranges from 2.8 GHz to 3.3 GHz. Obviously, the products with a quicker clock speed are more expensive than those that are slower.





Core i7 Series
These processors are virtually identical to the Core i5. They have a 100 MHz higher base clock speed, which is inconsequential in most situations. The real feature difference is the addition of hyper-threading on the Core i7, which means that the processor will appear as an 8-core processor in Windows. This improves threaded performance and can result in a substantial boost if you're using a program that is able to take advantage of 8 threads.

Of course, most programs can't take advantage of 8 threads. Those that can are usually meant for enterprise or advanced video editing applications: 3D rendering programs, photo editing programs, and scientific programs are categories of software frequently designed to use 8 threads. The average user is unlikely to see the full benefit of the hyper-threading feature. In the Core i3 vs. i5 vs. i7 battle, the i7 has limited appeal.

The IGP on Core i7 processors can also reach a higher maximum clock speed of 1350 MHz; however, this difference is largely inconsequential when measuring real-world performance.





The K series processor
Late in the lifespan of Intel's previous Core i branded products, Intel introduced the "K" series. These processors had unlocked multipliers, making them easier to overclock.

Intel has kept this line of products alive with the new Sandy Bridge architecture by introducing a K series Core i5 and i7 processor. As before, these processors have unlocked multipliers. However, they also have a new feature: better integrated graphics processors.

This comes in the form of the 3000 series IGP, which has 12 execution units instead of 6. The maximum clock speed remains limited by the processor brand: the Core i5 K is limited to 1100 MHz, while the Core i7 K can reach 1350 MHz. The additional execution units can result in better performance in games, although to be honest, the IGP isn't remotely cut out for desktop gaming.





The IGP Features: Sandy Bridge
The most important new feature added to Intel's Sandy Bridge processors is the inclusion of an IGP on the processor. Intel did this before with Core i3 and some Core i5 processors, but the IGP was still separate from the processor itself: the IGP and CPU were placed in the same package, but didn't physically work together.

Now Intel has taken the IGP integration a step further and worked the IGP into the CPU architecture. It even shares cache with the processor. What this means, in practical terms, is that the on-board graphics of Intel's new processors are superior to anything they've offered before. It also enables Quick Sync, a video transcoding feature that provides blazing performance when converting videos to a different format.

Intel is offering two different types of IGPs on its processors. The 2000 has 6 execution units, while the 3000 has 12 execution units. Obviously, the latter is quicker. Intel hasn't tied the IGP that you receive to the type of processor you choose, however. Instead, it has tied the 3000 series IGP to the "K" series processors. If you see a "K" at the end of the processor's name, it has the 3000 series IGP. So far, Intel doesn't offer a Core i3 K series processor, but that could change in the future.
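The rule of thumb in these slides (a "K" at the end of the name means the 3000 series IGP, and the brand sets the maximum IGP clock) can be written down as a small lookup. This is only a sketch of the rules as stated in this presentation; the function name igp_for is hypothetical and real product lines may contain exceptions that are not covered here.

```python
# Figures as given in this presentation.
IGP_EXECUTION_UNITS = {2000: 6, 3000: 12}
IGP_MAX_CLOCK_MHZ = {"i3": 1100, "i5": 1100, "i7": 1350}


def igp_for(model):
    """Apply the slide's rules to a name like 'Core i5-2500' or 'Core i7-2600K'."""
    brand = next((b for b in IGP_MAX_CLOCK_MHZ if b in model), None)
    if brand is None:
        raise ValueError(f"no Core i3/i5/i7 brand found in {model!r}")
    # A trailing "K" selects the 3000 series IGP, otherwise the 2000 series.
    series = 3000 if model.strip().endswith("K") else 2000
    return {
        "series": series,
        "execution_units": IGP_EXECUTION_UNITS[series],
        "max_clock_mhz": IGP_MAX_CLOCK_MHZ[brand],
    }


print(igp_for("Core i7-2600K"))
print(igp_for("Core i5-2500"))
```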





Laying Out the Chipset
The staggered release of Intel's previous Core i3/i5/i7 products also resulted in a staggered release of processor sockets and their related chipsets. First came the LGA 1366 processor socket, which was tied to some Core i7 processors. Then Intel confused things by releasing the LGA 1156 socket, which was made available on several different chipsets and processor types. Choosing the right socket and chipset for a processor wasn't easy.

Intel has now clarified matters by releasing a single processor socket and two processor chipsets alongside Sandy Bridge. The new socket is LGA 1155, and it isn't backwards compatible with anything Intel has previously offered. The new chipsets are P67 and H67, with the P variant being performance-oriented and the H variant targeted at general use. The main difference is that P67 allows for processor overclocking, while H67 does not. P67 also offers 16 additional PCIe lanes. Both Core i3 and i5 processors are compatible with either chipset.




Core i5 vs. Core i7

Core i5: The New Middle Class
While the hardware has changed, Intel's branding scheme remains the same, and Core i5 remains Intel's primary mid-range processor. It is targeted at the heart of the market, with pricing that is not at budget levels but still affordable, and performance that is extremely quick but not the fastest Intel offers.

Intel's high-end processor line is the Core i7. Many users who are looking for a high-performance part end up considering both i5 and i7 products.

A Unified Socket and Chipset
Perhaps the best news to come out of Intel's new line of i5 and i7 processors is the introduction of a single socket for all Sandy Bridge Core i3/i5/i7 processors. For now, the Sandy Bridge processors all use the LGA 1155 socket. In case you're wondering, this socket is not backwards compatible with previous LGA 1156 processors.




Core i5 vs. Core i7 (Contd.)

Intel Turbo Boost
Intel has made Turbo Boost a standard feature on all Core i5 and i7 processors, from the least to the most expensive. Intel has also reduced the gap between the maximum Turbo Boost frequencies on different processors. Previously, some of the older Core i7 processors actually had a much less effective Turbo Boost feature than some newer Core i5s.

All of Intel's current Core i5 and i7 processors offer a boost of between 300 and 400 MHz. The least expensive i5s offer the 300 MHz boost: for example, the Core i5 2300 has a base clock speed of 2.8 GHz and a maximum Turbo Boost speed of 3.1 GHz. The Intel Core i7 2600, on the other hand, offers a base clock speed of 3.4 GHz and a maximum Turbo Boost of 3.8 GHz.

Besides the clock speed difference, Turbo Boost is essentially the same on the i5 and i7 processors.
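The size of the boost can be worked out from the two clock figures quoted above. The sketch below keeps the clocks in MHz to avoid floating point rounding; the helper name turbo_gain is hypothetical.

```python
# Base and maximum Turbo Boost clocks in MHz, as quoted above.
BASE_AND_TURBO_MHZ = {
    "Core i5 2300": (2800, 3100),
    "Core i7 2600": (3400, 3800),
}


def turbo_gain(base_mhz, turbo_mhz):
    """Return the boost in MHz and as a percentage of the base clock."""
    gain = turbo_mhz - base_mhz
    return gain, 100 * gain / base_mhz


for model, (base, turbo) in BASE_AND_TURBO_MHZ.items():
    gain, percent = turbo_gain(base, turbo)
    print(f"{model}: +{gain} MHz ({percent:.1f}%)")
# Core i5 2300: +300 MHz (10.7%)
# Core i7 2600: +400 MHz (11.8%)
```

In relative terms the two boosts are close, roughly a tenth of the base clock in both cases.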





Difference in Hyper-Threading
Another significant performance difference is how the Core i7 and Core i5 products will be handling hyper-threading. Hyper-threading is a technology used by Intel to simulate more cores than actually exist on the processor. While Core i7 products have all been quad-cores, they appear in Windows as having eight cores. This further improves performance when using programs that make good use of multi-threading.

All Sandy Bridge Core i5 processors have hyper-threading disabled, and all Sandy Bridge Core i7 processors have hyper-threading enabled. This is a major feature difference of Core i5 vs. Core i7 processors, and it will give the Core i7 products an advantage over Core i5 processors in some heavily multi-threaded applications.
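The core counts an operating system reports follow directly from this: each physical core with hyper-threading enabled shows up as two logical processors. The sketch below tabulates the Sandy Bridge line-up as described in this presentation (the helper name logical_processors is hypothetical) and then prints what Python reports for the machine it runs on; os.cpu_count() returns the number of logical CPUs, or None if it cannot be determined.

```python
import os


def logical_processors(physical_cores, hyper_threading):
    """Logical processors seen by the OS: two per core with HT, else one."""
    return physical_cores * (2 if hyper_threading else 1)


# (physical cores, hyper-threading enabled) as described in this presentation.
SANDY_BRIDGE = {
    "Core i3": (2, True),
    "Core i5": (4, False),
    "Core i7": (4, True),
}

for brand, (cores, ht) in SANDY_BRIDGE.items():
    print(brand, "appears as", logical_processors(cores, ht), "processors")

print("this machine reports:", os.cpu_count())
```

Note that a dual-core i3 and a quad-core i5 both appear as four processors, even though the i5 has twice as many physical cores.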





The New IGP
All of Intel's Sandy Bridge processors make use of a new integrated graphics processor (IGP) that is part of the processor architecture. While far from a gaming-grade video solution, the IGP offers reasonable performance without consuming much power. It also enables features like Quick Sync, which can transcode video extremely quickly.

There are two versions of this IGP: the 2000 and the 3000. The only difference between the two is the number of execution units. The 2000 has 6, while the 3000 has 12. This doesn't mean the 3000 is twice as quick, but it does mean the 3000 is about 50% quicker in most benchmarks.





i5 vs. i7: What it means to Consumers and Power Users
Currently, the Core i5 processor brand makes up most of Intel's Sandy Bridge processor line. The prices of these processors range from $177 to $216 with base clock speeds between 2.8 GHz and 3.3 GHz. Intel only offers two Core i7 products, the Core i7-2600 and Core i7-2600K, both of which have a 3.4 GHz base clock speed. The i7-2600 has a price tag of $294.

As you may have guessed, paying about $80 more for the 100 MHz clock speed increase between the fastest i5 and the i7 isn't a great deal. The main reason to pay this additional cash for an i7 is hyper-threading, but this advantage will only be evident if you frequently use programs that can actually make use of 8 threads.

For most users, the i5 is clearly the better deal. The i5-2500 makes the most sense in my opinion, as it offers an extremely quick base clock speed of 3.3 GHz for about $200. Of course, the value of this is subject to change in the future as Intel fleshes out its product line with new models.




Hyper Threading

Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors. The physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.

The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly more complex, have more transistors, and consume more power. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel's Hyper-Threading Technology is one solution.




Hyper Threading (Contd.)

A look at today’s software trends reveals that server applications consist of multiple threads or processes that can be executed in parallel. On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance. Even desktop applications are becoming increasingly parallel. Intel architects have been trying to leverage this so-called thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio.

In both the high-end and mid-range server markets, multiprocessors have been commonly used to get more performance from the system. By adding more processors, applications potentially get substantial performance improvement by executing multiple threads on multiple processors at the same time. These threads might be from the same application, from different applications running simultaneously, from operating system services, or from operating system threads doing background maintenance. Multiprocessor systems have been used for many years, and high-end programmers are familiar with the techniques to exploit multiprocessors for higher performance levels.





In recent years a number of other techniques to further exploit TLP have been discussed and some products have been announced. One of these techniques is chip multiprocessing (CMP), where two processors are put on a single die. The two processors each have a full set of execution and architectural resources. The processors may or may not share a large on-chip cache. CMP is largely orthogonal to conventional multiprocessor systems, as you can have multiple CMP processors in a multiprocessor configuration. Recently announced processors incorporate two processors on each die. However, a CMP chip is significantly larger than the size of a single-core chip and therefore more expensive to manufacture; moreover, it does not begin to address the die size and power considerations.

Another approach is to allow a single processor to execute multiple threads by switching between them. Time-slice multithreading is where the processor switches between software threads after a fixed time period. Time-slice multithreading can result in wasted execution slots but can effectively minimize the effects of long latencies to memory. Switch-on-event multithreading would switch threads on long latency events such as cache misses. This approach can work well for server applications that have large numbers of cache misses and where the two threads are executing similar tasks. However, neither the time-slice nor the switch-on-event multithreading technique achieves optimal overlap of many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.





Finally, there is simultaneous multi-threading, where multiple threads can execute on a single processor without switching. The threads execute simultaneously and make much better use of the resources. This approach makes the most effective use of processor resources: it maximizes the performance vs. transistor count and power consumption. Hyper-Threading Technology brings the simultaneous multi-threading approach to the Intel architecture. In this paper we discuss the architecture and the first implementation of Hyper-Threading Technology on the Intel Xeon processor family.

Hyper-Threading Technology makes a single physical processor appear as multiple logical processors. To do this, there is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multiprocessor system. From a microarchitecture perspective, this means that instructions from logical processors will persist and execute simultaneously on shared execution resources.





There are a few elements of a CPU that need to be understood to know how hyper-threading technology works:

Registers – Registers are basically circuits that hold a single 64-bit value and are the fastest form of storage available on a computer. The x86 architecture provides a number of General Purpose Registers that are used by an executing program. In a multicore chip, registers are unique to each core, so if you have a quad-core processor, there will be 4 sets of general purpose registers.

Cache – Cache is essentially a form of storage that falls between registers and RAM in terms of speed. In modern processors there are generally three levels, and in the case of the i7, Levels 1 and 2 are private to each core and Level 3 is shared by all the cores on a chip. The most important thing to know is that accessing the cache is slower than accessing registers but still faster than accessing RAM.

Execution Unit – This is the section in the CPU responsible for actually executing the instructions. If you tell the computer to add 2 + 3, this is the part where that operation would be performed.

Front-End – This is a unit of the processor that is also known as Instruction Fetch/Decode. Essentially this unit will grab instructions from either cache or RAM and decode them into a form that the execution unit can understand.

Branch Predictor – This unit will attempt to predict branches in program code. If there is an “if-then” statement in a program, it will guess which statements will be executed and prefetch them for the front-end.





In a core with HT, the registers are all duplicated. This means that one core will have 2 sets of registers, and each set is what the operating system will see as a “logical core”, since the contents of the registers represent the processor’s state. We’ll call these sets A and B. Even though it appears as two cores, they will still be sharing the same cache, branch predictor, front-end, and most importantly, execution unit. Because they share so many resources, it is simplest to picture only one thread executing at a time. The advantage of adding the HT logic is that if a thread is executing and stalls for any reason, the other thread can be switched in very fast while the cause of the stall in the first thread is addressed. To better illustrate how this works, consider the following:

Set A is considered the current state of the processor. Thread A starts executing.
Thread A needs a value from memory that isn’t in the cache.
Memory access is very time consuming in CPU terms, so thread A is considered stalled.
Instead of wasting cycles waiting for the memory operation to complete, set B is considered the current state.
Thread B is now executing until it stalls or until thread A can execute again (memory operation finishes).
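The steps above can be played out in a toy cycle counter. This is a deliberately simplified sketch of the switching picture, not a model of a real processor: each thread is a string where "C" is one compute cycle and "M" is a load that misses the cache, the miss latency of 4 cycles is an arbitrary assumption, and the names run_group and run_core are hypothetical. Without HT the threads run back to back; with HT the core holds two register sets and runs thread B whenever thread A is stalled.

```python
def run_group(group, miss_latency):
    """Cycles needed for the threads resident on one core (one per register set)."""
    pos = [0] * len(group)        # index of the next operation of each thread
    ready_at = [0] * len(group)   # cycle at which each thread may run again
    cycle = 0
    while any(p < len(t) for p, t in zip(pos, group)):
        runnable = [i for i, t in enumerate(group)
                    if pos[i] < len(t) and ready_at[i] <= cycle]
        if runnable:
            i = runnable[0]       # lowest index wins, so set A has priority
            if group[i][pos[i]] == "M":
                # The miss is issued now; this thread stalls until the data arrives.
                ready_at[i] = cycle + miss_latency
            pos[i] += 1
        cycle += 1                # with nothing runnable, the cycle is wasted
    return max([cycle] + ready_at)


def run_core(threads, miss_latency, register_sets):
    """Total cycles when the core can hold `register_sets` threads at once."""
    total = 0
    for start in range(0, len(threads), register_sets):
        total += run_group(threads[start:start + register_sets], miss_latency)
    return total


thread_a, thread_b = "CCMCC", "CCCMC"
print(run_core([thread_a, thread_b], miss_latency=4, register_sets=1))  # 16
print(run_core([thread_a, thread_b], miss_latency=4, register_sets=2))  # 13
```

In this toy run, the three cycles that thread A spends waiting on memory are filled with work from thread B, so the pair finishes in 13 cycles instead of 16. Thread B's own miss still leaves idle cycles because no third thread is available to fill them.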





This process basically just continues on constantly. Now, there should be an obvious question: What can cause a thread to stall? There are a few things; the simplest one to understand is a cache miss. This is when the thread goes to access a value that isn’t currently in the cache or any of the registers. Another is a branch misprediction, which occurs when the branch predictor guesses wrong and prefetches the wrong instructions into the cache.

There is another time Hyper-Threading kicks in, and that is if one thread is using Floating-Point resources while the 2nd is only using Integer resources. HT will allow them both to execute simultaneously while they don't conflict.




Does Hyper-Threading actually help?

Hyper-Threading has some interesting performance characteristics as a result of its nature. HT will provide close to zero advantage if instruction decoding or execution is the limiting factor in performance. In the Nehalem architecture this is rarely the case. It performs ideally when there are a lot of cache misses or branch mispredictions, since the execution unit would otherwise be idle waiting for these issues to be resolved.

Basically, certain applications will benefit more than others. A more parallel workload such as rendering or encoding video will see a nice benefit from HT, since it’s likely both threads will be accessing the same data, so they aren’t really competing for cache. Additionally, the relatively small amount of local L2 cache in the i7 (256 KB) means there will be a decent amount of memory access, giving the second thread time to execute. HT can also result in a more responsive machine when not much is going on, since threads will have very low execution time and it’s much faster for the CPU to switch the active register set than to grab another thread from RAM and load it into the registers.




Are there drawbacks?

As with most engineering decisions, there are drawbacks to HT. One of the more obvious ones is that since HT keeps the execution unit fed more efficiently, the unit spends less time idle, which can result in higher operating temperatures. More idle time would mean the CPU got a chance to cool down before the next execution burst, resulting in a lower maximum temperature.

There are also programs that will either not see any benefit from HT or see decreased performance. Typically something that has performance limited by cache, instruction decode, the execution unit, or memory access will see little or even negative improvement from HT (one of the reasons the i7 has so much memory bandwidth).




Are there drawbacks?(Contd.)

Running more than one multithreaded, computationally intensive task at a time can also be a situation where HT doesn’t help performance. If a processor core is running threads from different programs, or threads that are operating on different data, all of the shared resources are effectively halved (data cache, branch prediction, instruction cache). This means branch mispredictions and cache misses become even more common, possibly to the point where both threads are stalled. Depending on the specific program this can mean either lower performance (compared to HT being disabled) or worse scaling than expected.
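A small cache simulation can make the "effectively halved" point concrete. This is a rough sketch, assuming a fully associative cache with LRU replacement and made-up working-set sizes; the function name `hit_rate` and all the numbers are illustrative, not measurements of any real processor.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses that hit in a small LRU cache of `capacity` lines."""
    cache, hits = OrderedDict(), 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(accesses)


# Each thread loops ten times over its own 6 cache lines (hypothetical sizes).
thread_a = [("A", i) for i in range(6)] * 10
thread_b = [("B", i) for i in range(6)] * 10
# Two threads on different data, interleaved on one core's shared cache.
shared = [line for pair in zip(thread_a, thread_b) for line in pair]

print(hit_rate(thread_a, capacity=8))  # 0.9: the working set fits
print(hit_rate(shared, capacity=8))    # 0.0: the two working sets evict each other
```

Either thread alone fits in the 8-line cache, but the combined working set of 12 lines does not, so with this access pattern every access misses.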

The last drawback is probably the most important one: The benefit of HT is inconsistent and dependent upon the specific operating environment and programs being run. Because of the way it works, code that is heavily optimized is likely to show less benefit, as it would be designed to reduce branch mispredictions and cache misses. The inconsistency of HT while multitasking won’t show up on benchmarks, since they’re designed to only test a single task at a time.




Is it worth using Hyper-Threading Technology?

If one does a lot of 3D rendering or video transcoding, then it probably is, since this is the workload HT is best suited for. If you find that you generally run multiple intensive tasks simultaneously (like playing a game while encoding a video or recompiling the Linux kernel in a VM), then HT could have a negative impact on overall performance (though not necessarily). One thing that is for sure is that its impact is exaggerated in synthetic benchmarks, almost to the point where it becomes misleading.




Virtualization

Server virtualization: Huge data centers contain a large number of servers. Workload, user activity, and other factors decide which server is used when, and even for servers that are not being used to their full capacity, companies still spend money, energy, and resources keeping them updated and protecting them from crashes and overheating. Server virtualization consolidates those physical servers onto fewer, more powerful, and more energy-efficient servers, each of which runs virtual machines (VMs) that imitate multiple servers on the network. The virtual server environment is transparent on the network, so each user can interact with the virtual servers as if they were still separate physical servers. The main advantage is that only a few energy-efficient servers have to be maintained instead of many, which saves resources, energy, and money.




Virtualization(Contd.)

As shown in the figure, in the traditional architecture the hardware runs a single operating system, and the different applications all run on that operating system.

Because this arrangement is not energy efficient, a virtual environment was developed that lets a single machine run several different operating systems.




Virtual Machine

A virtual machine monitor (VMM) is a host program that allows a single computer to support multiple, identical execution environments. All the users see their systems as self-contained computers isolated from other users, even though every user is served by the same machine. In this context, a virtual machine is an operating system (OS) that is managed by an underlying control program. For example, IBM's VM/ESA can control multiple virtual machines on an IBM S/390 system.

We do server virtualization to reduce energy cost and to simplify manageability and disaster management.

In server virtualization, we add VMM software that allows the hardware to run more than one OS.

Major components of the server: processor, chipset, network interface.




Virtual Machine(Contd.)

The individual technologies that make up Intel VT are built into these components to boost performance, reliability, and flexibility.

Intel VT supports virtual machine architectures comprised of two principal classes of software:

Virtual-Machine Monitor (VMM): A VMM acts as a host and has full control of the processor(s) and other platform hardware. A VMM presents guest software (see below) with an abstraction of a virtual processor and allows it to execute directly on a logical processor. A VMM is able to retain selective control of processor resources, physical memory, interrupt management, and I/O.

Guest Software: Each virtual machine is a guest software environment that supports a stack consisting of an operating system (OS) and application software. Each operates independently of other virtual machines and uses the same interface to processor(s), memory, storage, graphics, and I/O provided by a physical platform. The software stack acts as if it were running on a platform with no VMM. Software executing in a virtual machine must operate with reduced privilege so that the VMM can retain control of platform resources.




Intel Virtualization Technology-Flex Migration (Intel VT-X)

Obviously, as IT adds new systems, it would be much more convenient and efficient if an IT manager could simply add new resources to existing pools without having to worry about differences in processor generation. For this reason, Intel has developed Intel VT Flex Migration. When combined with support from virtualization software, it ensures that the hypervisor can expose a consistent set of instructions across all servers in the pool. Intel VT Flex Migration support starts with Intel® Core™ microarchitecture and will be available in future generations of the Intel Xeon processor family.

With Intel VT Flex Migration, IT managers can easily add current and future Intel Xeon processor-based systems to the same resource pool when using supporting hypervisor software. This gives IT the power to choose the right server platform when it is needed to optimize performance, cost, power, and reliability, without having to worry about forward and backward compatibility across generations of Intel Xeon processor-based servers starting with Intel Core microarchitecture and extending into future generations of Intel Xeon processors. IT managers can pool server resources using multiple generations of Intel Xeon processors whether they are single, dual- or multi-processor based. This creates a dynamic virtual server infrastructure that enables the use of live VM migration to improve usage models such as failover, load balancing, disaster recovery, and server maintenance.
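The idea of exposing "a consistent set of instructions across all servers in the pool" can be pictured as taking the features common to every host. The sketch below is only an illustration of that idea; the host names, the feature lists, and the helper functions `pool_baseline` and `can_migrate` are hypothetical, and the text does not describe how a hypervisor actually implements this.

```python
def pool_baseline(hosts):
    """Return the instruction-set features common to every host in the pool."""
    feature_sets = [set(features) for features in hosts.values()]
    return set.intersection(*feature_sets) if feature_sets else set()


def can_migrate(vm_features, host_features):
    """A VM can move to a host only if the host offers every feature the VM was shown."""
    return set(vm_features) <= set(host_features)


# Hypothetical pool mixing an older and a newer processor generation.
hosts = {
    "older-server": {"sse2", "sse3", "ssse3"},
    "newer-server": {"sse2", "sse3", "ssse3", "sse4.1"},
}
baseline = pool_baseline(hosts)

# A VM that only ever saw the baseline can move to any host in the pool.
print(all(can_migrate(baseline, f) for f in hosts.values()))       # True
# A VM that was shown the newer host's full feature set cannot move back.
print(can_migrate(hosts["newer-server"], hosts["older-server"]))   # False
```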




Intel VT-X(Contd.)

Current Intel® Xeon® 5400 and 5200 processor series, 3300 and 3100 processor series, as well as future Intel Xeon processors, support Intel VT Flex Migration. Using virtualization software that is enabled to take advantage of this feature, Intel servers based on these processors can be pooled with earlier generations of Intel Core microarchitecture processors. These include Intel® Xeon® 7300, 5300, 5100, 3200, and 3000 series processors. Flex Migration is a major component of Intel VT-x. With this technology, applications can be migrated from one server to another, which also helps when recovering from a disaster.

With Intel VT Flex Migration, one can migrate between processor generations and so react quickly to changing conditions, which makes it much easier to keep servers up and running.




Flex Priority

Intel VT Flex Priority optimizes and accelerates interrupt virtualization by improving virtual machine access to the Task Priority Register thereby enabling efficient Symmetric Multi-Processing (SMP) configurations of 32-bit guest operating systems. For users, this translates into more efficient performance in virtual environments for their critical enterprise applications.

Intel VT Flex Priority was designed to accelerate virtualization interrupt handling thereby improving virtualization performance. Intel VT Flex Priority accelerates interrupt handling by preventing unnecessary VMExits on accesses to the Advanced Programmable Interrupt Controller.

Intel Flex Priority improves virtualization performance by 35%.

A processor is constantly bombarded with interrupts, and not all of them are critical enough to be handled the moment they occur. Intel VT Flex Priority acts somewhat like a receptionist who alerts the processor only when an interrupt is critical, so the processor is interrupted less often and can work more efficiently.
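The effect of avoiding unnecessary VM exits can be illustrated with a simple counter. This is a loose sketch under stated assumptions: the function `count_vm_exits`, the rule that an exit is needed only when the new Task Priority Register value falls below the priority of a pending interrupt, and the sample values are all hypothetical simplifications rather than a description of the actual hardware behavior.

```python
def count_vm_exits(tpr_writes, pending_priority, accelerated):
    """Count how many guest TPR writes force a transition to the VMM.

    Without acceleration, every write is trapped and emulated.
    With acceleration (assumed rule), the write is handled without an exit
    unless it lowers the TPR below the priority of a pending interrupt.
    """
    exits = 0
    for value in tpr_writes:
        if not accelerated:
            exits += 1
        elif value < pending_priority:
            exits += 1
    return exits


writes = [8, 2, 9, 1, 7]  # hypothetical sequence of guest TPR values
print(count_vm_exits(writes, pending_priority=3, accelerated=False))  # 5
print(count_vm_exits(writes, pending_priority=3, accelerated=True))   # 2
```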




Virtualization for directed I/O

A VMM must support virtualization of I/O requests from guest software. I/O virtualization may be supported by a VMM through any of the following models:

Emulation: A VMM may expose a virtual device to guest software by emulating an existing (legacy) I/O device. The VMM emulates the functionality of the I/O device in software over whatever physical devices are available on the physical platform. I/O virtualization through emulation provides good compatibility (by allowing existing device drivers to run within a guest), but poses limitations with performance and functionality.

New Software Interfaces: This model is similar to I/O emulation, but instead of emulating legacy devices, VMM software exposes a synthetic device interface to guest software. The synthetic device interface is defined to be virtualization-friendly to enable efficient virtualization compared to the overhead associated with I/O emulation. This model provides improved performance over emulation, but has reduced compatibility (due to the need for specialized guest software or drivers utilizing the new software interfaces).

Assignment: A VMM may directly assign the physical I/O devices to VMs. In this model, the driver for an assigned I/O device runs in the VM to which it is assigned and is allowed to interact directly with the device hardware with minimal or no VMM involvement. Robust I/O assignment requires additional hardware support to ensure the assigned device accesses are isolated and restricted to resources owned by the assigned partition. The I/O assignment model may also be used to create one or more I/O container partitions that support emulation or software interfaces for virtualizing I/O requests from other guests. The I/O-container-based approach removes the need for running the physical device drivers as part of VMM privileged software.




Virtualization for directed I/O(Contd.)

Models contd.:

I/O Device Sharing: In this model, which is an extension to the I/O assignment model, an I/O device supports multiple functional interfaces, each of which may be independently assigned to a VM. The device hardware itself is capable of accepting multiple I/O requests through any of these functional interfaces and processing them utilizing the device's hardware resources.

Depending on the usage requirements, a VMM may support any of the above models for I/O virtualization. For example, I/O emulation may be best suited for virtualizing legacy devices. I/O assignment may provide the best performance when hosting I/O-intensive workloads in a guest. Using new software interfaces makes a trade-off between compatibility and performance, and device I/O sharing provides more virtual devices than the number of physical devices in the platform.




Overview Of Intel Virtualization

A general requirement for all of the above I/O virtualization models is the ability to isolate and restrict device accesses to the resources owned by the partition managing the device. Intel VT for Directed I/O provides VMM software with the following capabilities:

I/O device assignment: For flexibly assigning I/O devices to VMs and extending the protection and isolation properties of VMs for I/O operations.

DMA remapping: For supporting independent address translations for Direct Memory Accesses (DMA) from devices.

Interrupt remapping: For supporting isolation and routing of interrupts from devices and external interrupt controllers to appropriate VMs.

Reliability: For recording and reporting to system software DMA and interrupt errors that may otherwise corrupt memory or impact VM isolation.




DMA Remapping

DMA remapping facilities have been implemented in a variety of contexts in the past to facilitate different usages. In workstations and server platforms, traditional I/O memory management units (IOMMUs) have been implemented in PCI root bridges to efficiently support scatter/gather operations or I/O devices with limited DMA addressability. Other well-known examples of DMA remapping facilities include the AGP Graphics Aperture Remapping Table (GART), the Translation and Protection Table (TPT) defined in the Virtual Interface Architecture, and subsequently influencing a similar capability in the InfiniBand Architecture and Remote DMA (RDMA) over TCP/IP specifications. DMA remapping facilities have also been explored in the context of NICs designed for low latency cluster interconnects.

Traditional IOMMUs typically support an aperture-based architecture. All DMA requests that target a programmed aperture address range in the system physical address space are translated irrespective of the source of the request. While this is useful for handling legacy device limitations (such as limited DMA addressability or scatter/gather capabilities), such IOMMUs are not adequate for I/O virtualization usages that require full DMA isolation.




DMA Remapping(Contd.)

The VT-d architecture is a generalized IOMMU architecture that enables system software to create multiple DMA protection domains. A protection domain is abstractly defined as an isolated environment to which a subset of the host physical memory is allocated. Depending on the software usage model, a DMA protection domain may represent memory allocated to a VM, or the DMA memory allocated by a guest-OS driver running in a VM or as part of the VMM itself. The VT-d architecture enables system software to assign one or more I/O devices to a protection domain. DMA isolation is achieved by restricting access to a protection domain's physical memory from I/O devices not assigned to it, through address-translation tables.

The I/O devices assigned to a protection domain can be provided a view of memory that may be different than the host view of physical memory. VT-d hardware treats the address specified in a DMA request as a DMA virtual address (DVA). Depending on the software usage model, a DVA may be the Guest Physical Address (GPA) of the VM to which the I/O device is assigned, or some software-abstracted virtual I/O address (similar to CPU linear addresses). VT-d hardware transforms the address in a DMA request issued by an I/O device to its corresponding Host Physical Address (HPA).
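The translation from a DMA virtual address to a host physical address, with one table per protection domain, can be sketched as follows. This is a minimal illustration, assuming a flat single-level table and 4 KB pages; the class `RemappingUnit`, its methods, and the device and domain names are hypothetical and do not reflect the actual VT-d table format.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class RemappingUnit:
    """Toy DMA remapper: each protection domain has its own DVA -> HPA table."""

    def __init__(self):
        self.domain_of = {}  # device id -> domain id
        self.tables = {}     # domain id -> {DVA page number: HPA page number}

    def assign(self, device, domain):
        self.domain_of[device] = domain
        self.tables.setdefault(domain, {})

    def map_page(self, domain, dva_page, hpa_page):
        self.tables.setdefault(domain, {})[dva_page] = hpa_page

    def translate(self, device, dva):
        """Translate a device's DMA virtual address, or refuse the access."""
        domain = self.domain_of.get(device)
        if domain is None:
            raise PermissionError(f"device {device!r} is not assigned to a domain")
        page, offset = divmod(dva, PAGE_SIZE)
        if page not in self.tables[domain]:
            raise PermissionError(f"DVA {dva:#x} is not mapped for domain {domain!r}")
        return self.tables[domain][page] * PAGE_SIZE + offset


unit = RemappingUnit()
unit.assign("device1", "domain1")
unit.assign("device2", "domain2")
unit.map_page("domain1", dva_page=0x10, hpa_page=0x200)
unit.map_page("domain2", dva_page=0x10, hpa_page=0x300)

# The same DVA resolves to different host memory depending on the device's domain.
print(hex(unit.translate("device1", 0x1002C)))  # 0x20002c
print(hex(unit.translate("device2", 0x1002C)))  # 0x30002c
```

A device that issues a DMA request to an address its domain has not mapped gets an error instead of a translation, which is the isolation property the text describes.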




DMA Remapping(Contd.)

Figure 5 illustrates DMA address translation in a multi-domain usage. I/O devices 1 and 2 are assigned to protection domains 1 and 2, respectively, each with its own view of the DMA address space.

Figure 6 illustrates a PC platform configuration with VT-d hardware implemented in the north-bridge component.

Figure 5: DMA remapping

Figure 6: Platform configuration with VT-d




Intel Smart Memory Access

Intel Smart Memory Access improves system performance by optimizing the use of the available data bandwidth from the memory subsystem and hiding the latency of memory accesses. The goal is to ensure that data can be used as quickly as possible and is located as close as possible to where it’s needed to minimize latency and thus improve efficiency and speed.

Intel Smart Memory Access includes a new capability called memory disambiguation, which increases the efficiency of out-of-order processing by providing the execution cores with the built-in intelligence to speculatively load data for instructions that are about to execute before all previous store instructions are executed.

Intel Smart Memory Access also includes an instruction pointer-based prefetcher that “prefetches” memory contents before they are requested so they can be placed in cache and readily accessed when needed. Increasing the number of loads that occur from cache versus main memory reduces memory latency and improves performance.
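One way to picture an instruction pointer-based prefetcher is as a table that remembers, for each load instruction, the last address it touched and the step between its last two addresses. The sketch below assumes simple constant-stride detection; the class name `IPStridePrefetcher`, the table layout, and the rule of prefetching after the same stride is seen twice are illustrative assumptions, since the text does not specify the actual algorithm.

```python
class IPStridePrefetcher:
    """Toy prefetcher keyed by the instruction pointer (IP) of each load."""

    def __init__(self):
        self.table = {}  # ip -> (last address, last stride or None)

    def observe(self, ip, address):
        """Record a load; return an address to prefetch, or None."""
        entry = self.table.get(ip)
        if entry is None:
            self.table[ip] = (address, None)
            return None
        last_address, last_stride = entry
        stride = address - last_address
        self.table[ip] = (address, stride)
        # Prefetch only once the same non-zero stride has repeated.
        if stride != 0 and stride == last_stride:
            return address + stride
        return None


prefetcher = IPStridePrefetcher()
# One load instruction (hypothetical IP 0x400) walking an array in 64-byte steps.
for addr in (100, 164, 228, 292):
    print(prefetcher.observe(0x400, addr))
# Prints: None, None, 292, 356
```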




Intel Smart Memory Access(Contd.)

How does Intel Smart Memory Access improve execution throughput?

The Intel Core microarchitecture memory cluster (level 1 data memory subsystem) is highly out-of-order, non-blocking, and speculative. It has a variety of methods of caching and buffering to help achieve its performance. Included among these are Intel Smart Memory Access and its two key features: memory disambiguation and the instruction pointer-based (IP-based) prefetcher to the level 1 data cache.




Memory Disambiguation

Since the Intel Pentium Pro, all Intel processors have featured a sophisticated out-of-order memory engine allowing the CPU to execute non-dependent instructions in any order. But they had a significant shortcoming: these processors were built around a conservative set of assumptions concerning which memory accesses could proceed out of order. They would not move a load in the execution order above a store having an unknown address (cases where a prior store has not been executed yet). This was because if the store and load end up sharing the same address, it results in an incorrect instruction execution. Yet many loads are to locations unrelated to recently executed stores. Prior hardware implementations created false dependencies if they blocked such loads based on unknown store addresses. All these false dependencies resulted in many lost opportunities for out-of-order execution.

In designing Intel Core microarchitecture, Intel sought a way to eliminate false dependencies using a technique known as memory disambiguation. (“Disambiguation” is defined as the clarification that follows the removal of an ambiguity.) Through memory disambiguation, Intel Core microarchitecture is able to resolve many of the cases where the ambiguity of whether a particular load and store share the same address thwarts out-of-order execution.




Memory Disambiguation(Contd.)

Memory disambiguation uses a predictor and accompanying algorithms to eliminate these false dependencies that block a load from being moved up and completed as soon as possible. The basic objective is to be able to ignore unknown store-address blocking conditions whenever a load operation dispatched from the processor’s reservation station (RS) is predicted to not collide with a store. This prediction is eventually verified by checking all RS-dispatched store addresses for an address match against newer loads that were predicted non-conflicting and already executed. If there is an offending load already executed, the pipe is flushed and execution restarted from that load.

The memory disambiguation predictor is based on a hash table that is indexed with a hashed version of the load’s EIP address bits. (“EIP” is used here to represent the instruction pointer in all x86 modes.) Each predictor entry behaves as a saturating counter, with reset.





The predictor has two write operations, both done during the load’s retirement:

Increment the entry if the load “behaved well.” That is, if it met unknown store addresses but none of them collided.

Reset the entry to zero if the load “misbehaved.” That is, if it collided with at least one older store that was dispatched by the RS after the load. The reset is done regardless of whether the load was actually disambiguated.

The predictor takes a conservative approach. In order to allow memory disambiguation, it requires that a number of consecutive iterations of a load having the same EIP behave well. This isn’t necessarily a guarantee of success though. If two loads with different EIPs clash in the same predictor entry, their prediction will interact.
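The saturating-counter-with-reset behavior described above can be sketched in a few lines. This is a simplified model, not the hardware design: the table size, the saturation threshold, the hash function, and the names `DisambiguationPredictor`, `allows`, and `retire` are all assumptions made for illustration, since the text gives none of these details.

```python
class DisambiguationPredictor:
    """Toy hash table of saturating counters indexed by a hash of the load's EIP."""

    def __init__(self, entries=256, saturation=15):
        self.entries = entries
        self.saturation = saturation
        self.counters = [0] * entries

    def _index(self, eip):
        # Hypothetical hash of the EIP bits; the real function is not given.
        return (eip ^ (eip >> 8)) % self.entries

    def allows(self, eip):
        """Disambiguation is allowed only when the counter is saturated."""
        return self.counters[self._index(eip)] >= self.saturation

    def retire(self, eip, met_unknown_store, collided):
        """Update the entry when the load retires."""
        i = self._index(eip)
        if collided:
            self.counters[i] = 0  # the load misbehaved: reset
        elif met_unknown_store:
            # The load behaved well: increment, but never past saturation.
            self.counters[i] = min(self.counters[i] + 1, self.saturation)


predictor = DisambiguationPredictor(entries=16, saturation=3)
load_eip = 0x401000
for _ in range(3):
    predictor.retire(load_eip, met_unknown_store=True, collided=False)
print(predictor.allows(load_eip))  # True: three well-behaved iterations in a row
predictor.retire(load_eip, met_unknown_store=True, collided=True)
print(predictor.allows(load_eip))  # False: one collision resets the entry
```

Because two different EIPs can hash to the same entry, a collision by one load also resets the prediction for the other, which is the interaction the text warns about.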





Predictor lookup

The predictor is looked up when a load instruction is dispatched from the RS to the memory pipe. If the respective counter is saturated, the load is assumed to be safe and the result is written to the “disambiguation allowed bit” in the load buffer. This means that if the load finds a relevant unknown store address, the condition is ignored and the load is allowed to go on. If the predictor is not saturated, the load will behave like in prior implementations. In other words, if there is a relevant unknown store address, the load will get blocked.

Load dispatch

In case the load meets an older unknown store address, it sets the “update bit” indicating the load should update the predictor. If the prediction was “go,” the load will be dispatched and set the “done” bit indicating that disambiguation was done. If the prediction was “no go,” the load will be conservatively blocked until all older store addresses are resolved.





Prediction verification

To recover in case of a misprediction by the disambiguation predictor, the address of all the store operations dispatched from the RS to the Memory Order Buffer must be compared with the address of all the loads that are younger than the store. If such a match is found, the respective “reset bit” is set. When a load that was disambiguated retires with its reset bit set, the pipe is restarted from that load to re-execute it and all its dependent instructions correctly.
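The verification step can be sketched as a comparison between each dispatched store and the younger loads that have already executed. This is an illustrative model only: the `Load` record, the use of a sequence number for program order, and the function names `check_store` and `must_restart_at_retirement` are assumptions, not the Memory Order Buffer's real structure.

```python
from dataclasses import dataclass


@dataclass
class Load:
    seq: int                 # program order; a larger number means younger
    address: int
    disambiguated: bool      # executed ahead of an unknown store address
    reset_bit: bool = False


def check_store(store_seq, store_address, executed_loads):
    """Set the reset bit on every younger load that matches the store's address."""
    for load in executed_loads:
        if load.seq > store_seq and load.address == store_address:
            load.reset_bit = True


def must_restart_at_retirement(load):
    """The pipe is restarted only for a disambiguated load whose reset bit is set."""
    return load.disambiguated and load.reset_bit


loads = [
    Load(seq=5, address=0x100, disambiguated=True),   # younger, same address
    Load(seq=6, address=0x200, disambiguated=True),   # younger, other address
    Load(seq=2, address=0x100, disambiguated=True),   # older than the store
]
check_store(store_seq=3, store_address=0x100, executed_loads=loads)
print([must_restart_at_retirement(load) for load in loads])  # [True, False, False]
```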

Watchdog mechanism

Because disambiguation is based on prediction and mispredictions can cause an execution pipe flush, it’s important to build in safeguards to avoid rare cases of performance loss. Consequently, Intel Core microarchitecture includes a mechanism to temporarily disable memory disambiguation to prevent cases of performance loss. This mechanism constantly monitors the success rate of the disambiguation predictor.




Advanced smart cache

Intel Advanced Smart Cache is a multi-core optimized cache that improves performance and efficiency by increasing the probability that each execution core of a dual core processor can access data from a higher-performance, more-efficient cache subsystem.

To accomplish this, Intel Core microarchitecture shares the Level 2 (L2) cache between the cores. This better optimizes cache resources by storing data in one place that each core can access. By sharing L2 cache between each core, Intel Advanced Smart Cache allows each core to dynamically use up to 100 percent of available L2 cache. Threads can then dynamically use the required cache capacity.

As an extreme example, if one of the cores is inactive, the other core will have access to the full cache. Intel Advanced Smart Cache enables very efficient sharing of data between threads running in different cores. It also enables obtaining data from cache at higher throughput rates for better performance. Intel Advanced Smart Cache provides a peak transfer rate of 96 GB/sec (at 3 GHz frequency).
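The quoted peak transfer rate can be turned into a per-clock figure with one division. The helper below is just a worked calculation; the name `bytes_per_cycle` is made up for this example, and it assumes that GB means 10^9 bytes and GHz means 10^9 cycles per second.

```python
def bytes_per_cycle(bandwidth_gb_per_s, clock_ghz):
    """Bytes transferred per clock cycle for a given peak bandwidth and frequency."""
    return bandwidth_gb_per_s / clock_ghz


# 96 GB/sec at 3 GHz works out to 32 bytes (256 bits) per clock cycle.
print(bytes_per_cycle(96, 3))  # 32.0
```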




Wide dynamic execution

Intel Wide Dynamic Execution significantly enhances dynamic execution, enabling delivery of more instructions per clock cycle to improve execution time and energy efficiency. Every execution core is 33 percent wider than previous generations, allowing each core to fetch, decode, and retire up to four full instructions simultaneously.

Intel Wide Dynamic Execution also includes a new and innovative capability called Macrofusion. Macrofusion combines certain common x86 instructions into a single instruction that is executed as a single entity, increasing the peak throughput of the engine to five instructions per clock. The wide execution engine, when Macrofusion comes into play, is then capable of a throughput of up to six instructions per cycle for even greater energy-efficient performance.
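Macrofusion can be pictured as a pass over the instruction stream that merges certain adjacent pairs. The sketch below assumes the fusable pair is a compare (`cmp` or `test`) immediately followed by a conditional jump; that choice, the function name `macrofuse`, and the sample instruction stream are illustrative assumptions, since the text only says "certain common x86 instructions".

```python
COMPARES = {"cmp", "test"}  # assumed first half of a fusable pair


def macrofuse(instructions):
    """Merge each compare that is directly followed by a conditional jump."""
    fused, i = [], 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        is_conditional_jump = (following is not None
                               and following.startswith("j")
                               and following != "jmp")
        if current in COMPARES and is_conditional_jump:
            fused.append(f"{current}+{following}")  # handled as a single entity
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


stream = ["mov", "cmp", "jne", "add", "test", "jz", "cmp", "mov"]
print(macrofuse(stream))
# ['mov', 'cmp+jne', 'add', 'test+jz', 'cmp', 'mov']
```

Here eight instructions become six entries for the rest of the pipeline, which is the kind of reduction that lets the engine exceed four instructions per clock at its peak.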

Intel Core microarchitecture also uses extended microfusion, a technique that “fuses” micro-ops derived from the same macro-op to reduce the number of micro-ops that need to be executed. Studies have shown that micro-op fusion can reduce the number of micro-ops handled by the out-of-order logic by more than 10 percent.

Intel Core microarchitecture “extends” the number of micro-ops that can be fused internally within the processor.




Wide dynamic execution(Contd.)

Intel Core microarchitecture also incorporates an updated ESP (Extended Stack Pointer) Tracker. Stack tracking allows safe early resolution of stack references by keeping track of the value of the ESP register. About 25 percent of all loads are stack loads and 95 percent of these loads may be resolved in the front end, again contributing to greater energy efficiency [Bekerman].
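Combining the two percentages quoted above gives the share of all loads that stack tracking may resolve in the front end. The snippet is only a worked calculation using the figures from the text; the function name is made up for this example.

```python
def front_end_resolved_share(stack_load_share=0.25, resolved_share=0.95):
    """Share of all loads that are stack loads resolved in the front end."""
    return stack_load_share * resolved_share


# 25 percent of loads are stack loads, and 95 percent of those may be resolved
# early, so roughly 24 percent of all loads.
print(f"{front_end_resolved_share():.2%}")  # 23.75%
```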

Micro-op reduction resulting from micro-op fusion, Macrofusion, ESP Tracker, and other techniques makes various resources in the engine appear virtually deeper than their actual size and results in executing a given amount of work with less toggling of signals, two factors that provide more performance for the same or less power.

Intel Core microarchitecture also provides deep out-of-order buffers to allow for more instructions in flight, enabling more out-of-order execution to better exploit instruction-level parallelism.




Advanced Digital media boost

Intel Advanced Digital Media Boost helps achieve similar dramatic gains in throughputs for programs utilizing SSE instructions of 128-bit operands. (SSE instructions enhance Intel architecture by enabling programmers to develop algorithms that can mix packed, single-precision, and double-precision floating point and integers, using SSE instructions.)

These throughput gains come from combining a 128-bit-wide internal data path with Intel Wide Dynamic Execution and matching widths and throughputs in the relevant caches. Intel Advanced Digital Media Boost enables most 128-bit instructions to be dispatched at a throughput rate of one per clock cycle, effectively doubling the speed of execution and resulting in peak floating point performance of 24 GFlops (on each core, single precision, at 3 GHz frequency).
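The peak figure quoted above can be checked with a short calculation. This is a sketch under one loud assumption: to reach 24 GFlops, it assumes two 128-bit floating-point operations can start per clock (for example one add and one multiply), which the text does not state. The function name and parameters are hypothetical.

```python
def peak_gflops(clock_ghz, vector_bits, element_bits, vector_ops_per_cycle):
    """Peak floating-point rate: clock x lanes per vector x vector ops per cycle."""
    lanes = vector_bits // element_bits
    return clock_ghz * lanes * vector_ops_per_cycle


# Single precision: 128 / 32 = 4 lanes; assumed 2 vector ops per cycle.
print(peak_gflops(3, 128, 32, vector_ops_per_cycle=2))  # 24
# Double precision under the same assumption: 128 / 64 = 2 lanes.
print(peak_gflops(3, 128, 64, vector_ops_per_cycle=2))  # 12
```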

Intel Advanced Digital Media Boost is particularly useful when running many important multimedia operations involving graphics, video, and audio, and processing other rich data sets that use SSE, SSE2, and SSE3 instructions.




Intelligent power capability

Intel Intelligent Power Capability is a set of capabilities for reducing power consumption and device design requirements. This feature manages the runtime power consumption of all the processor’s execution cores. It includes an advanced power-gating capability that allows for an ultra fine-grained logic control that turns on individual processor logic subsystems only if and when they are needed.

Additionally, many buses and arrays are split so that data required in some modes of operation can be put in a low-power state when not needed. In the past, implementing such power gating has been challenging because of the power consumed in powering down and ramping back up, as well as the need to maintain system responsiveness when returning to full power [Wechsler].

Through Intel Intelligent Power Capability Intel has been able to satisfy these concerns, ensuring significant power savings without sacrificing responsiveness.




References

http://www.brighthub.com

http://mintywhite.com

http://www.flyertalk.com

http://www.overclock.net/a/hyperthreading-explained

http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

http://software.intel.com/sites/default/files/m/3/4/d/6/3/18374-sma.pdf

http://www.youtube.com/watch?v=gqZrarZiHp8

http://www.youtube.com/watch?v=3fcI6G7Scqk

http://www.youtube.com/watch?v=V9AiN7oJaIM

http://www.youtube.com/watch?v=kkrqyEpINSQ

http://www.youtube.com/watch?v=y0Q40pBoIwA





Thank You

