
Design Tradeoffs for Software-Managed TLBs
Authors: Nagle, Uhlig, Stanley, Sechrest, Mudge & Brown

Definition

The virtual-to-physical address translation operation sits on the critical path between the CPU and the cache. If every memory request issued by the processor required one or more accesses to main memory (to read page table entries), the processor would be very slow. The TLB is a cache for page table entries: it works in much the same way as the data cache, storing recently accessed page table entries.
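
As a rough illustration of this role (a sketch, not the paper's code; the array, field names, and refill policy are assumptions), a fully associative lookup with a page-table walk on a miss might look like this in C:

    #define TLB_ENTRIES 64   /* illustrative, R2000-sized */

    struct tlb_entry { unsigned vpn; unsigned pfn; int valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Hypothetical page-table walk (the slow path): stubbed as an identity
       mapping here; in a real system this reads PTEs from memory. */
    static unsigned walk_page_table(unsigned vpn) { return vpn; }

    /* Translate a virtual page number (VPN) to a physical frame number (PFN). */
    unsigned translate(unsigned vpn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)       /* fully associative: compare every entry */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return tlb[i].pfn;                  /* TLB hit: no page-table access needed */

        unsigned pfn = walk_page_table(vpn);        /* TLB miss: extra memory access(es) */
        tlb[vpn % TLB_ENTRIES] = (struct tlb_entry){ vpn, pfn, 1 };  /* naive refill policy */
        return pfn;
    }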

Operations on an address request by the CPU

Each TLB entry covers a whole page of physical memory, so a relatively small number of TLB entries can cover a large amount of memory. This large coverage of main memory by each TLB entry means that TLBs have a high hit rate.
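
For a concrete sense of scale (an illustrative figure, assuming the 4 KB page size used by the MIPS R2000): a 64-entry TLB maps 64 × 4 KB = 256 KB of memory at any one time.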

TLB types

Fully associative was typical in early TLB designs; set associative is more common in newer designs.
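
A minimal sketch of the difference, with hypothetical sizes and field names: a fully associative TLB compares the virtual page number against every entry (as in the earlier sketch), while a set-associative TLB uses low-order VPN bits to select one small set and compares only within it.

    #define NUM_SETS 16     /* hypothetical organization: 16 sets x 4 ways = 64 entries */
    #define WAYS      4

    struct sa_entry { unsigned vpn; unsigned pfn; int valid; };
    static struct sa_entry set_tlb[NUM_SETS][WAYS];

    /* Set-associative lookup: low-order VPN bits select a set, and only that
       set's WAYS entries are compared (a fully associative TLB compares all). */
    int sa_lookup(unsigned vpn, unsigned *pfn)
    {
        unsigned set = vpn % NUM_SETS;
        for (int way = 0; way < WAYS; way++) {
            if (set_tlb[set][way].valid && set_tlb[set][way].vpn == vpn) {
                *pfn = set_tlb[set][way].pfn;
                return 1;   /* hit */
            }
        }
        return 0;           /* miss */
    }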

The Problem.

This paper examines software-managed TLB design tradeoffs and their interaction with a range of operating systems. Software management can impose considerable penalties, which are highly dependent on the operating system's structure and its use of virtual memory. Namely, memory references that require mappings not in the TLB result in misses that must be serviced either by hardware or by software.

Test Environment

DECstation 3100 with a MIPS R2000 processor. The R2000 contains a 64-entry, fully associative TLB. The R2000 TLB hardware supports partitioning into two sets, an upper and a lower set: the lower set consists of entries 0-7 and is used for page table entries whose retrieval is slow, while the upper set consists of entries 8-63 and contains the more frequently used level 1 user PTEs.

Test Tools.

A system analysis tool called Monster, which enables us to monitor actual miss handling costs in CPU cycles.

A TLB simulator called Tapeworm, which is compiled directly into the kernel so that it can intercept all of the actual TLB misses caused by both user processes and OS kernel memory references. The TLB information that Tapeworm extracts from the running system is used to obtain TLB miss counts and to simulate different TLB configurations.

System monitoring with Monster.

Monster is a hardware monitoring system comprising a monitored DECstation 3100, an attached logic analyzer, and a controlling workstation. It measures the amount of time taken to handle each TLB miss.

TLB Simulation with Tapeworm.

The Tapeworm simulator is built into the operating system and is invoked whenever there is a TLB miss. The simulator uses the real TLB misses to simulate its own TLB configuration.
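
A minimal sketch of this idea (not Tapeworm's actual code; the hook name, structures, and direct-mapped placement are assumptions): the kernel's miss handler calls into the simulator on every real miss, and the simulator replays that miss against a TLB configuration of its own.

    /* Hypothetical simulated TLB, distinct from the 64-entry hardware TLB. */
    #define SIM_ENTRIES 128

    static unsigned      sim_vpn[SIM_ENTRIES];
    static int           sim_valid[SIM_ENTRIES];
    static unsigned long sim_misses;

    /* Called from the kernel's miss handler on every real TLB miss (user or kernel). */
    void tapeworm_hook(unsigned vpn)
    {
        unsigned slot = vpn % SIM_ENTRIES;   /* direct-mapped placement, for illustration only */
        if (!sim_valid[slot] || sim_vpn[slot] != vpn) {
            sim_misses++;                    /* the reference would also miss in the simulated TLB */
            sim_vpn[slot] = vpn;
            sim_valid[slot] = 1;
        }
    }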

Trace-Driven Simulation

Trace-driven simulation has traditionally been used because it is well suited to studying components of computer memory systems, such as TLBs. A sequence of memory references is fed to the simulation model to mimic the way that a real processor might exercise the design.

Problems with trace-driven simulation

It is difficult to obtain accurate traces.
It consumes considerable processing and storage resources.
It assumes that address traces are invariant to changes in the structural parameters of a simulated TLB.

Solution.

Compile the TLB simulator, Tapeworm, directly into the operating system kernel. This allows all system activity to be accounted for, including multiple-process and kernel interactions.
It does not require address traces.
It considers all TLB misses, whether caused by user-level tasks or by the kernel.

Benchmarks

Operating Systems

Test Results

OS Impact on software-managed TLBs

Different operating systems gave different results, even though the same applications were run on each system. There are differences in both the number of TLB misses and the total TLB service time.

Increasing TLB Performance

Additional TLB miss vectors.
Increase lower slots in the TLB partition.
Increase TLB size.
Modify TLB associativity.

TLB Miss Vectors

L1 User - miss on a level 1 user PTE
L1 Kernel - miss on a level 1 kernel PTE
L2 - miss on a level 2 PTE, after a level 1 user miss
L3 - miss on a level 3 PTE, after a level 1 kernel miss
Modify - miss due to a protection violation
Invalid - page fault
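
These categories could be represented as a simple classification in C; this is a hypothetical sketch for keeping per-vector miss counts, not the actual handler dispatch code:

    /* Hypothetical classification mirroring the miss vectors listed above. */
    enum tlb_miss_vector {
        MISS_L1_USER,    /* level 1 user PTE miss (fast uTLB path) */
        MISS_L1_KERNEL,  /* level 1 kernel PTE miss */
        MISS_L2,         /* level 2 PTE miss, after a level 1 user miss */
        MISS_L3,         /* level 3 PTE miss, after a level 1 kernel miss */
        MISS_MODIFY,     /* protection (write) violation */
        MISS_INVALID,    /* page fault */
        MISS_VECTOR_MAX
    };

    /* Per-vector miss counters, as a profiler like Tapeworm might keep them. */
    static unsigned long miss_count[MISS_VECTOR_MAX];

    void record_miss(enum tlb_miss_vector v)
    {
        miss_count[v]++;
    }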

TLB Miss Vector Results

Modifying the Lower TLB Partition

OSF/1 - increasing from 4 to 5 lower slots decreases miss handling time by 50%.
Mach 3.0 - performance improves with up to 8 lower slots.
Microkernels benefit from increasing the lower TLB partition because many system services (e.g. the Unix server on Mach 3.0) are mapped through L2 PTEs.
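
A minimal sketch of a partitioned replacement policy, assuming 8 lower slots (entries 0-7) and 56 upper slots (entries 8-63) as on the R2000; which PTEs go where, the round-robin choice, and the rand()-based upper replacement are illustrative assumptions, not the paper's mechanism.

    #include <stdlib.h>

    #define TLB_SIZE     64
    #define LOWER_SLOTS   8   /* entries 0..7, protected from random replacement */

    /* Choose a TLB slot for a new mapping: keep level 2 PTEs (page-table pages)
       in the lower partition; place level 1 user PTEs randomly in the upper slots. */
    int choose_slot(int is_l2_pte)
    {
        static int next_lower = 0;
        if (is_l2_pte) {
            int slot = next_lower;                    /* round-robin within the lower partition */
            next_lower = (next_lower + 1) % LOWER_SLOTS;
            return slot;                              /* slot in 0..7 */
        }
        return LOWER_SLOTS + rand() % (TLB_SIZE - LOWER_SLOTS);  /* slot in 8..63 */
    }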

Increasing TLB size

• Building TLBs with additional upper slots.
• The most significant component is L1 kernel (L1K) misses, due to the large number of mapped data structures in the kernel.
• Allowing the uTLB handler to service L1K misses reduces the TLB service time.
• In each system there is a noticeable improvement in TLB service time as the TLB size increases.
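
The overall effect can be expressed with a simple cost model: total TLB service time is the sum over miss types of (miss count × per-miss handling cost). The sketch below uses hypothetical cycle costs purely for illustration; the actual per-vector costs were measured with Monster.

    /* Hypothetical per-vector handling costs in CPU cycles (illustrative numbers,
       not the paper's measurements). Index order follows the miss vectors above. */
    static const unsigned cost_cycles[6] = {
        20,    /* L1 user, handled by the fast uTLB handler */
        300,   /* L1 kernel, handled by the general exception path */
        400,   /* L2 */
        400,   /* L3 */
        400,   /* modify */
        400,   /* invalid */
    };

    /* Total TLB service time = sum over vectors of (miss count x per-miss cost). */
    unsigned long total_service_cycles(const unsigned long miss_count[6])
    {
        unsigned long total = 0;
        for (int v = 0; v < 6; v++)
            total += miss_count[v] * cost_cycles[v];
        return total;
    }

In this model, routing L1 kernel misses through the cheap uTLB path corresponds to lowering their per-miss cost, which is why it reduces total service time even when the miss count itself is unchanged.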

Conclusion.

Software management of TLBs magnifies the importance of the interactions between TLBs and operating systems because of the large variation in TLB miss service times that can exist. TLB behavior depends upon the kernel's use of virtual memory to map its own data structures, including the page tables themselves. TLB behavior is also dependent upon the division of service functionality between the kernel and separate user tasks.