Microprocessor system architectures – IA32 advanced features and rests Jakub Yaghob

Upload: beverly-warner

Post on 12-Jan-2016


Multiple-processor management
  Mechanisms
    Support for atomic operations on system memory
    Serializing instructions
    APIC
    L2 and L3 caches
    Hyper-threading
  Aims
    Maintain system memory coherence
    Maintain cache coherence
    Predictable ordering of writes to memory
    Distribute interrupt handling among processors
    Increase system performance by exploiting multi-threaded OSs and applications

Locked atomic operations
  Three independent mechanisms
    Guaranteed atomic operations
    Bus locking using the LOCK# signal or the LOCK instruction prefix
    Cache coherency protocols ensuring cache coherency for atomic operations on cached data (cache lock) (Pentium Pro+)

Guaranteed atomic operations
  i486+
    R/W a byte
    R/W a word (2B) aligned on a word boundary
    R/W a dword (4B) aligned on a dword boundary
  Pentium+
    R/W a qword (8B) aligned on a qword boundary
    R/W a word from/to uncached memory that fits within a 32-bit data bus
  Pentium Pro+
    Unaligned word, dword, qword R/W from/to cached memory that fits within a cache line

Bus locking
  Automatic locking
    XCHG with memory
    Setting the B (busy) flag of a TSS descriptor
    Updating descriptors (e.g. the A flag)
    Updating page tables
    Interrupt acknowledgement
  Software-controlled locking (prefix LOCK)
    Automatically assumed for XCHG
    Allowed only with BTS, BTC, BTR, XADD, CMPXCHG, CMPXCHG8B, INC, DEC, NOT, NEG, ADD, ADC, SUB, SBB, AND, OR, XOR
    Otherwise #UD exception (invalid opcode)
    Memory access can be unaligned
    Pentium Pro+ serializes locked operations
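A minimal C11 sketch of software-controlled locking, given here only as an illustration. The names `counter_increment` and `try_claim` are hypothetical. The assumption is that an x86 compiler typically turns `atomic_fetch_add` into a LOCK XADD and `atomic_compare_exchange_strong` into a LOCK CMPXCHG; the exact instruction selection is up to the compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared counter; static storage, so it starts at zero. */
static atomic_int counter;

/* Atomically add 1 and return the previous value
   (typically a LOCK XADD on x86; compiler-dependent). */
int counter_increment(void)
{
    return atomic_fetch_add(&counter, 1);
}

/* Atomically replace *slot with desired if it still holds expected.
   Returns true on success (typically a LOCK CMPXCHG on x86). */
bool try_claim(atomic_int *slot, int expected, int desired)
{
    assert(slot != NULL);
    return atomic_compare_exchange_strong(slot, &expected, desired);
}
```

The locked read-modify-write is what makes the update indivisible across CPUs; a plain `counter++` on a non-atomic variable would give no such guarantee.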

Self-modifying code
  Option 1
    Write the modified code as data through a data segment
    Jump to the new code or to an intermediate location
    Execute the new code
  Option 2
    Write the modified code as data through a data segment
    Execute a serializing instruction
    Execute the new code
  Required for Pentium Pro+
  Performance penalty
  Cross-modifying code
    One CPU changes the code and another CPU executes it
    Synchronize the CPUs and execute a serializing instruction

Memory ordering
  Program-ordering
    Alias strong-ordering
    R/W are issued on the bus in the order they occur in the instruction stream under all circumstances
    i386
  Processor-ordering
    Alias speculative-ordering or weak-ordering
    Allows increased instruction execution speed while maintaining memory coherency
    The exact behavior depends on the CPU model; Pentium Pro+
  Pentium and i486
    They use processor-ordering
    In most cases they behave as program-ordered
    A R miss goes ahead of buffered W when all buffered W are cache hits
  I/O is always in the order of the instruction stream (strong-ordering)

Processor-ordering I.
  Single-processor and WB memory
    R can be carried out speculatively and in any order
    R can pass buffered W, but the CPU is self-consistent
    W to memory are always carried out in program order, excluding the instructions CLFLUSH, MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD
    W can be buffered
    W are not speculative; they are performed only for instructions that are actually executed (retired)
    Data from buffered W can be forwarded to waiting R within the CPU
    R/W cannot pass I/O, locked or serializing instructions
    R cannot pass LFENCE and MFENCE
    W cannot pass SFENCE and MFENCE
  Multiple CPUs
    Each CPU behaves as in the single-processor case
    Writes by a single CPU are observed in the same order by all CPUs
    Writes from different CPUs on the bus are NOT ordered with respect to each other

Processor-ordering II.

"Fast string" operation
  "Fast string"
    Pentium Pro+
    MOVS or STOS
    CPU works with cache lines
      Reads are not performed during cache line writes
      Interrupts only on cache line boundaries
  Conditions
    EDI and ESI aligned to 8B (PIII), EDI aligned to 8B (P4)
    Ascending order (DF=0)
    Initial counter ECX>=64
    Source and target must not overlap by less than one cache line (64B for P4+, 32B otherwise)
    Memory type WC or WB

Strengthening or weakening memory ordering
  Strengthening
    I/O instructions, locked instructions, the LOCK prefix and serializing instructions
    SFENCE (PIII), LFENCE and MFENCE (P4+)
      SFENCE – all W finished before this instruction
      LFENCE – all R finished before this instruction
      MFENCE – all R and W finished before this instruction
    PAT (Page Attribute Table) strengthens ordering for pages (PIII+)
  Weakening or strengthening
    MTRR (Memory Type Range Registers) weaken or strengthen ordering for physical memory regions (Pentium Pro+)
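A portable analogue of the fence instructions can be sketched with C11 fences. This is an illustration only, and `payload`, `ready`, `publish` and `try_consume` are hypothetical names. The C11 fences are not the same thing as SFENCE/LFENCE/MFENCE: on x86 a release or acquire fence usually needs no instruction at all because ordinary writes are already kept in program order, while a sequentially consistent fence is typically implemented with MFENCE or a locked instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;         /* ordinary data, published through the flag */
static atomic_bool ready;   /* hypothetical "payload is valid" flag       */

/* Writer: the release fence keeps the payload write ahead of the flag write. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: the acquire fence keeps the payload read after the flag read. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```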

Serializing instructions
  The CPU completes all changes to flags, registers and memory
  The CPU drains all buffered W
  Pentium+
  Privileged instructions
    MOV CRx, MOV DRx, WRMSR, INVD, INVLPG, WBINVD, LGDT, LIDT, LTR
  Non-privileged instructions
    CPUID, IRET, RSM
  Non-privileged memory-ordering instructions
    LFENCE, SFENCE, MFENCE

Propagation of page table entry changes
  "TLB shootdown"
  Simple method
    Send an IPI to all CPUs
    Stop all CPUs except one (spin-lock)
    The active CPU makes the changes (invalidates page tables in memory) and resumes all CPUs
    All CPUs invalidate their TLBs (selectively or all entries)
    All CPUs return from the IPI
  More complicated and faster methods can be developed, provided that either
    Different TLB mappings are not used on different CPUs during the update, or
    The OS is prepared for a situation where CPUs use a stale mapping during the update

MPS 1.4
  Multiprocessor Specification
  Controlled booting of multiple CPUs without dedicated HW
  HW can initiate a boot without a dedicated signal or a predefined boot CPU
  All IA-32 CPUs have the same boot protocol (including HT)
  Different mechanisms for different CPU models (P4 vs. older Xeon vs. newer Xeon)
  BSP = Bootstrap Processor
  AP = Application Processor

Detecting hyper-threading or multi-core
  Hardware Multi-Threading feature flag
    CPUID.1:EDX[28] = 1
  Logical processors per package
    CPUID.1:EBX[23:16]
  Cores per package
    Only when CPUID supports EAX=4, otherwise the package has 1 core
    CPUID.(EAX=4,ECX=0):EAX[31:26]+1
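The bit fields above can be decoded as in the following sketch. It works on register values that the caller has already obtained from CPUID, so it stays independent of any compiler intrinsic. All function names are hypothetical, and `max_basic_leaf` is assumed to be the EAX value returned by CPUID leaf 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of a 32-bit register value. */
static uint32_t bits(uint32_t value, unsigned hi, unsigned lo)
{
    assert(lo <= hi && hi < 32);
    unsigned width = hi - lo + 1;
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((UINT32_C(1) << width) - 1u);
    return (value >> lo) & mask;
}

/* CPUID.1:EDX[28]: Hardware Multi-Threading feature flag. */
bool has_htt(uint32_t leaf1_edx)
{
    return bits(leaf1_edx, 28, 28) != 0;
}

/* CPUID.1:EBX[23:16]: logical processors per package. */
unsigned logical_per_package(uint32_t leaf1_ebx)
{
    return bits(leaf1_ebx, 23, 16);
}

/* CPUID.(EAX=4,ECX=0):EAX[31:26]+1: cores per package.
   When leaf 4 is not supported the package is treated as single-core. */
unsigned cores_per_package(uint32_t max_basic_leaf, uint32_t leaf4_eax)
{
    if (max_basic_leaf < 4)
        return 1;
    return bits(leaf4_eax, 31, 26) + 1u;
}
```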

Hyper-threading – I

One core is able to execute 2 or more instruction streams

Some parts of a core are private for each logical processor, some parts are shared among logical processors

Hyper-threading – II
  Private state of a logical processor
    General purpose registers EAX-ESP (RAX-RSP, R8-R15)
    Segment registers CS-SS
    EFLAGS and EIP (RIP)
    x87 (ST0-ST7), MMX (MM0-MM7), SSE (XMM0-XMM7/XMM15) and their control and status registers
    Control registers CRx, GDTR, IDTR, LDTR, IA32_EFER
    Debug registers DRx
    Time stamp
    Most of MSRs (including PAT)
    Local APIC
    Instruction TLB
  Shared state
    MTRR
    Data TLB
    Cache, the bus
    Some MSRs

Multi-Core

Programming MT-capable CPUs – I
  Requires support from the OS
  Using the PAUSE instruction in spin-locks
    Encoded as REP NOP
    Older IA-32 CPUs interpret PAUSE as NOP
    Older AMD CPUs do NOT understand it
  Using HLT
    An idle logical processor must use HLT and must not actively wait
  Using MONITOR/MWAIT
    SSE3, check CPUID.1:ECX[3] = 1, available only at CPL=0
    MONITOR sets up a memory range monitored for W
    MWAIT places the processor in an optimized state until a W to the monitored range occurs
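A spin-lock that uses PAUSE in its wait loop might look like the sketch below. The type `demo_spinlock`, the `spin_*` functions and the `cpu_relax` macro are hypothetical names. On x86 targets built with GCC or Clang the macro is assumed to map to the `_mm_pause()` intrinsic; on other targets it falls back to a no-op.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits PAUSE (encoded as REP NOP) */
#else
#define cpu_relax() ((void)0)     /* no-op fallback on other targets  */
#endif

struct demo_spinlock {
    atomic_flag locked;
};

void spin_init(struct demo_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->locked);
}

/* Returns nonzero if the lock was free and is now held by the caller. */
int spin_trylock(struct demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait with PAUSE so the sibling logical processor keeps its share
   of the core while this one spins. */
void spin_lock(struct demo_spinlock *l)
{
    while (!spin_trylock(l))
        cpu_relax();
}

void spin_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```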

Programming MT-capable CPUs – II
  Scheduling
    Dispatch tasks to logical processor 0 of every core first, then to logical processors 1, etc.
    Use thread affinity
  Do not measure the speed of a CPU by an active loop
  Each lock or semaphore should be placed, aligned, in its own 128B block of memory
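One way to give every lock its own aligned 128B block in C11 is sketched below. `LOCK_BLOCK`, `struct padded_lock`, `locks` and `block_of` are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_BLOCK 128   /* block size named on the slide */

/* alignas on the member raises the alignment of the whole struct, and the
   compiler pads sizeof(struct padded_lock) up to a multiple of it. */
struct padded_lock {
    alignas(LOCK_BLOCK) atomic_int word;
};

/* Adjacent array elements therefore never share a 128-byte block. */
struct padded_lock locks[4];

/* Number of the 128-byte block that an address falls into. */
uintptr_t block_of(const void *p)
{
    assert(p != NULL);
    return (uintptr_t)p / LOCK_BLOCK;
}
```

Keeping each lock in a separate block means that spinning on one lock does not invalidate the cache line that holds another.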

APIC (Advanced Programmable Interrupt Controller)
  Local APIC
    Internal to the CPU
    Receives interrupts from the CPU's interrupt pins, from internal sources and from an external I/O APIC
    Sends and receives IPIs (InterProcessor Interrupts)
  I/O APIC
    Part of the chipset
    Receives external interrupts and relays them to a local APIC
    Possibility of IPI distribution among CPUs
  xAPIC
    Newer architecture
    EXtended APIC
    P4 and Xeons

APIC – xAPIC

xAPIC system (P4 and Xeon)

APIC – "traditional" APIC

APIC system (Pentium and Pentium Pro+)

Local APIC structure

Internal cache

Cache structure of P4 and Xeon

Characteristics of caches
(wa = way set-associative, fa = fully associative, LP = large pages)

Cache type | Pentium/MMX | Pentium Pro+ | Pentium M, Core/Core2 | P4 and Xeon
Trace cache | N/A | N/A | N/A | 12Kμops; 8wa
L1 instruction | 8K; 2wa/16K; 32B; 4wa | 16K; 32B; 4wa | 32K; 64B; 8wa | N/A
L1 data | 8K; 2wa/16K; 32B; 4wa | 8K; 2wa/16K; 32B; 4wa | 32K; 64B; 8wa | 8K; 64B; 4wa/16K; 8wa
L2 common | external | 128K-2M; 32B; 4wa | <2M; 64B; 8wa/<4M; 64B; 16wa | 256K-2M; 64B; 8wa
L3 common | N/A | N/A | N/A | Xeon 512K-4M; 64B; 8wa
Instr TLB 4K | 32; 4wa/fa | 32; 4wa | 128; 4wa | 128; 4wa
Data TLB 4K | 64; 4wa/fa | 64; 4wa | 128; 4wa/DTLB0:16, DTLB1:256; 4wa | 64; fa
Instr TLB LP | ==ITLB4K | 2; fa | 2; fa/4; 4wa | Fragmented??
Data TLB LP | 8; 4wa/==DTLB4K | 8; 4wa | 8; fa/DTLB0:16; DTLB1:32; 4wa | ==DTLB4K
Store buffer | 2*1/4*4 | 12 | 16/20 | 24
WC buffer | N/A | 4 | 6/8 | 6/8

Cache terminology
  Caches use the MESI protocol for maintaining coherency
  Cache line fill
    An operand is read from cacheable memory
    The entire cache line is read
  Cache hit
    An operand is in a cache
    An access uses the value from the cache
  Cache miss
    An operand is not in a cache
  Write hit
    If a valid cache line exists, the CPU can write into the cache
    If a write misses the cache, a cache line fill occurs
  Snooping
    The CPU checks memory accesses on the bus against its cache lines

MESI
  Each cache line has 2 status bits
  Transparent for programs
  Instruction L1 has only S and I
  Transition by snooping
    A CPU detects a W to a line that it holds in the M state
    Cancel the transaction
    Write the line directly to the other CPU, with a branch to the memory
    Move to the I state

Cache line status | M (Modified) | E (Exclusive) | S (Shared) | I (Invalid)
Is it valid? | yes | yes | yes | no
The memory copy is... | out of date | exact | exact | N/A
Copies in other CPUs? | no | no | maybe | maybe
W to this line... | does not go to the bus | does not go to the bus, moving to M | moving to E | goes directly to the memory
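The MESI table can be encoded as a small lookup, sketched below. It reproduces the rows exactly as given on the slide (including S moving to E on a write) and is not a full protocol model. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state;

/* Row "Is it valid?" */
bool mesi_is_valid(mesi_state s)
{
    return s != MESI_I;
}

/* Row "The memory copy is...": only a Modified line has a stale memory copy. */
bool mesi_memory_stale(mesi_state s)
{
    return s == MESI_M;
}

/* Row "W to this line...": state of the line after a write by the owning CPU. */
mesi_state mesi_after_local_write(mesi_state s)
{
    switch (s) {
    case MESI_M: return MESI_M;   /* stays Modified, no bus traffic        */
    case MESI_E: return MESI_M;   /* no bus traffic, moving to M           */
    case MESI_S: return MESI_E;   /* moving to E, as stated in the table   */
    case MESI_I: return MESI_I;   /* write goes directly to the memory     */
    }
    assert(0 && "invalid MESI state");
    return MESI_I;
}
```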

Cache control
  CR0[CD]
    =0 – caching enabled for the whole of system memory, can be restricted for regions or pages
    =1 – caching disabled on Pentium, restricted on other CPUs
  CR0[NW]
    =0 – WB enabled, can be restricted
    =1 – WB disabled
  PCD and PWT in the page tables and directories
    Disable caching/WB for pages or page tables
  PCD and PWT in CR3
    Disable caching/WB for the page directory
  G in the page tables (Pentium Pro+)
    The TLB entry is not flushed during implicit flushing (task switch, mov cr3,eax)
  CR4[PGE] (Pentium Pro+)
    Enables G in page tables
  MTRR (Pentium Pro+)
    Memory types for regions of physical memory
  PAT (PIII+)
    Memory types for pages

Store buffers
  IA-32 temporarily stores each W to memory in a store buffer
    The CPU continues without waiting for the memory or a cache
  Transparent for software
  Draining store buffers
    An interrupt or an exception
    Serializing instruction (Pentium Pro+)
    I/O operation
    LOCK operation
    BINIT operation (Pentium Pro+) (machine check)
    SFENCE instruction (PIII+)
    MFENCE instruction (P4+)

Memory types – an overview
  Pentium has UC, WT, WB
    Control using NW, CD
  UC- from PIII with PAT

Memory type | Cacheable | Write-back cacheable | Speculative reads | Memory ordering model
Strong Uncacheable (UC) | No | No | No | Strong
Uncacheable (UC-) | No | No | No | Strong, can be overridden by WC in MTRR
Write combining (WC) | No | No | Yes | Weak
Write through (WT) | Yes | No | Yes | Speculative
Write back (WB) | Yes | Yes | Yes | Speculative
Write protected (WP) | Yes (R) | No | Yes | Speculative

Memory types – I
  Strong uncacheable (UC)
    The system memory is not cached
    All R/W have strong-ordering, no speculation
    Useful for memory-mapped I/O
    Greatly reduces system performance
  Uncacheable (UC-)
    Like UC, can be overridden to WC using MTRR
    Only PIII+ using PAT
  Write Combining (WC)
    The system memory is not cached
    No coherency protocol
    Speculative R enabled, W ordering is NOT ensured
    W delayed and combined in WC buffers
    Useful for video frame buffers

Memory types – II
  Write Through (WT)
    R/W from/to the system memory are cached
    R comes from a cache on cache hit; cache line fill on cache miss; speculative R
    W writes to the cache and the main memory on cache hit; does not write to the cache on cache miss
    WC enabled
    Useful for video frame buffers or devices without snooping
  Write Back (WB)
    R/W from/to the system memory are cached
    R comes from a cache on cache hit; cache line fill on cache miss; speculative R
    W writes only to the cache on cache hit (the line is written back to the main memory later); cache line fill on cache miss
    Cache coherency protocol
  Write Protected (WP)
    R comes from a cache on cache hit; cache line fill on cache miss; speculative R
    W directly propagated on the system bus

MTRR (Memory Type Range Registers)
  Assigning memory types to regions of physical memory
  Checking MTRR presence using CPUID
  Read-only MSR register IA32_MTRRCAP
    Support for fixed ranges
    Number of variable ranges (Pentium Pro+)
    Support for WC type
  Default type
    MSR IA32_MTRR_DEF_TYPE defines the memory type for physical memory not covered by fixed and variable ranges
  Fixed ranges
    8 ranges of 64K size in the lowest 512K (00000000-0007FFFF)
    16 ranges of 16K size in the next 256K (00080000-000BFFFF)
    64 ranges of 4K size in the next 256K (000C0000-000FFFFF)
  Variable ranges
    An address belongs to range n when Address & PHYSMASKn = PHYSBASEn & PHYSMASKn
    When a variable range overlaps with a fixed range, the fixed range wins
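The two lookup rules can be written down as a short sketch. The function names are hypothetical, and `physbase` and `physmask` are assumed to be plain address values with the type and valid bits of the real MSRs already stripped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Variable range n matches when
   Address & PHYSMASKn == PHYSBASEn & PHYSMASKn. */
bool mtrr_variable_match(uint64_t addr, uint64_t physbase, uint64_t physmask)
{
    return (addr & physmask) == (physbase & physmask);
}

/* Index (0..87) of the fixed range that covers addr in the first 1 MB:
   8 ranges of 64K, then 16 ranges of 16K, then 64 ranges of 4K. */
unsigned mtrr_fixed_index(uint32_t addr)
{
    assert(addr <= 0xFFFFFu);
    if (addr < 0x80000u)
        return addr >> 16;                       /* 0..7   */
    if (addr < 0xC0000u)
        return 8 + ((addr - 0x80000u) >> 14);    /* 8..23  */
    return 24 + ((addr - 0xC0000u) >> 12);       /* 24..87 */
}
```

The three fixed groups together give 8 + 16 + 64 = 88 ranges, which is why the index runs from 0 to 87.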

PAT (Page Attribute Table)
  Assigning memory types to ranges of the linear address space
  Checking PAT presence using CPUID
  MSR IA32_CR_PAT defines 8 types
  The type for a page is selected from IA32_CR_PAT by an index created from the PAT(4), PCD(2), PWT(1) bits in page tables
  It is always switched on
  The initial setting after RESET is backward compatible with PCD and PWT – 2 * (WB, WT, UC-, UC)
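The index calculation and the reset contents can be illustrated with the sketch below. It models the eight PAT entries as an array of abstract types, not as the binary encoding of the MSR. `mem_type`, `pat_reset`, `pat_index` and `pat_lookup` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MT_WB, MT_WT, MT_UC_MINUS, MT_UC } mem_type;

/* Contents after RESET: 2 * (WB, WT, UC-, UC). */
static const mem_type pat_reset[8] = {
    MT_WB, MT_WT, MT_UC_MINUS, MT_UC,
    MT_WB, MT_WT, MT_UC_MINUS, MT_UC
};

/* Index built from the page-table bits: PAT counts 4, PCD 2, PWT 1. */
unsigned pat_index(bool pat, bool pcd, bool pwt)
{
    return (pat ? 4u : 0u) | (pcd ? 2u : 0u) | (pwt ? 1u : 0u);
}

/* Select the memory type of a page from an 8-entry table. */
mem_type pat_lookup(const mem_type table[8], unsigned index)
{
    assert(table != NULL && index < 8);
    return table[index];
}
```

Because the reset table repeats the same four types twice, the PAT bit makes no difference until software reprograms the MSR, which is what keeps the reset state compatible with plain PCD and PWT.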

Memory type restrictions
  If CR0[CD]=1, then caching is disabled
  If CR0[CD]=0, then caching is restricted using PAT (or PCD and PWT) and MTRR
    The most restrictive type is always selected
    WT "wins" over WB
    WC "wins" over WT and WB

Reset
  Sets the CPU to a well-known state
  CPU in real mode
  Internal caches, TLB and BTB invalidated
  CPU-model-dependent behavior
    Pentium Pro+
      All CPUs start the initialization protocol, one of them is chosen as the BSP and continues with OS initialization, all other APs halt and wait for an IPI ("Wait for Startup")
    i486 and Pentium
      HW knows which CPU is the BSP, other APs halt and wait for a SIPI
  INIT
    Like RESET
    Internal caches, MSR, MTRR, x87, SSE do not change
    Move to real mode

CPU state after RESET, INIT and power-up

Register | Value
EFLAGS | 00000002
EIP | 0000FFF0
CS | F000 (base FFFF0000, limit FFFF)
xS | 0000 (base 00000000, limit FFFF)
GDTR, IDTR | base 00000000, limit FFFF
LDTR, TR | 0000 (base 00000000, limit FFFF)
CR0 | 60000010h
CRx | 0
EAX, ... | 0
EDX | 00000mxxh
STx | +0.0
x87 CW | 0040h
x87 SW | 0000
x87 Tag | 5555h
XMMx | 0
MXCSR | 1F80
DRx | 0
DR6 | FFFF0FF0
DR7 | 00000400h
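The CS base and EIP values above determine where the first instruction is fetched. A small sketch of the arithmetic, with the hypothetical helper `linear_address`:

```c
#include <assert.h>
#include <stdint.h>

/* Linear address = segment base + offset. After RESET the hidden CS base
   (FFFF0000) is used, not the real-mode value selector * 16. */
uint32_t linear_address(uint32_t seg_base, uint32_t offset, uint32_t seg_limit)
{
    assert(offset <= seg_limit);   /* the offset must lie within the limit */
    return seg_base + offset;
}
```

With base FFFF0000, EIP 0000FFF0 and limit FFFF, the result is FFFFFFF0, 16 bytes below the top of the 4 GB physical address space.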

Microcode update
  Pentium Pro+ has an interface for uploading a microcode block with patches to the CPU
  The microcode block is supplied by Intel directly to BIOS vendors
  The microcode block has a header with the CPU model specification
  The CPU model in the microcode header is checked against the current CPU
  The microcode must be uploaded before L2 is enabled, and there are a lot of other constraints (e.g. segment limits must not be exceeded)

Virtual machine extensions (VMX)
  Two classes of software
    Virtual machine monitor (VMM)
      Acts like a host
      Full control of HW
      Presents abstract HW to guests
    Guest software
      Guest software environment with OS and applications

Virtual-machine control data structure (VMCS) – I
  VMX non-root operation and VMX transitions are controlled by a VMCS
  Access through the VMCS pointer (one per logical CPU)
    The pointer is read using VMPTRST and set using VMPTRLD
  VMCS configuration using the VMREAD, VMWRITE and VMCLEAR instructions
  A VMM could use a different VMCS for each virtual CPU
  Each logical CPU associates a physical memory region (one 4KB frame) with each VMCS

Virtual-machine control data structure (VMCS) – II
  VMCS state
    Inactive
      After VMCLEAR
    Active
      Memory region after VMPTRLD
      Maintains CPU state
    Current
      VMPTRLD loads the current VMCS
      VMLAUNCH, VMPTRST, VMREAD, VMRESUME and VMWRITE operate on the current VMCS

Virtual-machine control data structure (VMCS) – III
  VMCS data
    Guest-state area
      CPU state is saved there on VM exits and loaded from there on VM entries
    Host-state area
      CPU state is loaded from there on VM exits
    VM-execution control fields
    VM-exit control fields
    VM-entry control fields
    VM-exit information fields

Guest-state area
  Registers
    CR0, CR3, CR4
    RSP, RIP, RFLAGS
    CS, DS, ES, FS, GS, SS, LDTR, TR
      Selector and the internally cached part of the descriptor
    GDTR, IDTR
    MSRs
      IA32_DEBUGCTL, IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP
  Activity state
    Active, HLT, shutdown, wait-for-SIPI
  Interruptibility state
    Blocking by STI, MOV SS, NMI, SMI
  Pending debug exceptions
  VMCS link pointer

Host-state area
  Registers
    CR0, CR3, CR4
    RSP, RIP
    CS, DS, ES, FS, GS, SS, TR
    Base address for FS, GS, TR, GDTR, IDTR
    MSRs
      IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP

VM-execution control fields
  Pin-based VM-execution controls
    VM-exits on external interrupt or NMI
  CPU-based VM-execution controls
    Instructions and events causing VM-exits
  Exception bitmap
  I/O-bitmap addresses
  Guest/host masks and read shadows for CR0 and CR4
  CR3 target controls
    4 target addresses + counter
  CR8 access control
  MSR bitmap address

VM-exit control fields
  VM-exit controls
    Basic operation of VM-exit
  VM-exit controls for MSRs
    List of MSRs stored and loaded on VM-exit

VM-entry control fields
  VM-entry controls
    Basic operation on VM-entry
  VM-entry controls for MSRs
    List of MSRs to be loaded on VM-entry
  Event injection
    "Executed" before the first guest-mode instruction
    Interrupts, exceptions including error code

VM-exit information fields
  Basic VM-exit information
    Exit reason, exit qualification
  Vectored events
    Interrupts, exceptions
  VM-exits during event delivery
  VM-exits due to instruction execution
    Instruction address, length, detailed information

VMXON region

Physical memory region (4KB frame) for VMX operation

Operand of VMXON instruction

Using VMCS

VMCLEAR should be executed before VM-entry

VMLAUNCH should be used for the first VM-entry using VMCS after VMCLEAR

VMRESUME should be used for any subsequent VM-entry
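The ordering rules above can be pictured as a two-state model of a VMCS launch state. The sketch below only simulates the rules; it does not execute any VMX instruction, and `vmcs_model` and the `model_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { VMCS_CLEAR, VMCS_LAUNCHED } launch_state;

typedef struct {
    launch_state state;
} vmcs_model;

/* VMCLEAR puts the VMCS into the "clear" launch state. */
void model_vmclear(vmcs_model *v)
{
    assert(v != NULL);
    v->state = VMCS_CLEAR;
}

/* VMLAUNCH is accepted only for the first VM-entry after VMCLEAR. */
bool model_vmlaunch(vmcs_model *v)
{
    assert(v != NULL);
    if (v->state != VMCS_CLEAR)
        return false;
    v->state = VMCS_LAUNCHED;
    return true;
}

/* VMRESUME is accepted only for a VMCS that has already been launched. */
bool model_vmresume(const vmcs_model *v)
{
    assert(v != NULL);
    return v->state == VMCS_LAUNCHED;
}
```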

VMX non-root operation
  Instructions which cause a VM-exit
    Unconditionally: CPUID, INVD, MOV from CR3, all VMX instructions
    Conditionally: CLTS, HLT, IN/OUT, INVLPG, LMSW, MONITOR, MOV CR8, MOV to CR0, MOV to CR3, MOV to CR4, MOV DR, MWAIT, PAUSE, RDMSR, RDPMC, RDTSC, RSM, WRMSR
  Other causes
    Exceptions, interrupts, INIT signals, start-up IPI, task switches, system-management interrupts