Design Tradeoffs for Software-Managed TLBs. Authors: Nagle, Uhlig, Stanley, Sechrest, Mudge & Brown
TRANSCRIPT
Definition
The virtual-to-physical address translation operation sits on the critical path between the CPU and the cache.
If every memory request from the processor required one or more accesses to main memory (to read page
table entries), the processor would be very slow. The TLB is a cache for page table entries. It works in much the
same way as a data cache: it stores recently accessed page table entries.
Operations on an address request by the CPU:
Each TLB entry covers a whole page of physical memory, so
a relatively small number of TLB entries will cover a large amount of memory.
This large coverage of main memory by each TLB entry means that TLBs have a high hit rate.
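As a rough illustration (a sketch we wrote, not code from the paper), the following Python snippet shows TLB-style translation and why a small TLB covers a sizable span of memory; the 4 KB page size and 64 entries are assumed from the R2000 described below:

```python
# Assumed parameters: 4 KB pages and a 64-entry TLB (as on the MIPS R2000).
PAGE_SIZE = 4 * 1024
TLB_ENTRIES = 64

# Each entry maps a whole page, so 64 entries cover 64 * 4 KB = 256 KB at once.
coverage = PAGE_SIZE * TLB_ENTRIES

def translate(vaddr, tlb, page_table):
    """Translate a virtual address, consulting the TLB before the page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                    # TLB hit: no page-table access needed
        return tlb[vpn] * PAGE_SIZE + offset
    pfn = page_table[vpn]             # TLB miss: read the PTE (slow path)
    tlb[vpn] = pfn                    # cache the translation for next time
    return pfn * PAGE_SIZE + offset
```

The `translate` function and its dictionary-based TLB are purely illustrative; a hardware TLB does the lookup associatively in a single cycle.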
The Problem.
This paper discusses software-managed TLB design tradeoffs and their interaction with a range of operating systems. However, software management can impose considerable penalties, which can be highly dependent on the operating system's structure and its use of virtual memory.
Namely, memory references that require mappings not in the TLB result in misses that must be serviced either by hardware or by software.
Test Environment
DECstation 3100 with a MIPS R2000 processor.
The R2000 contains a 64-entry, fully-associative TLB.
The R2000 TLB hardware supports partitioning into two sets, an upper and a lower set.
The lower set consists of entries 0-7 and is used for page table entries with slow retrieval.
The upper set consists of entries 8-63 and contains the more frequently used level 1 user PTEs.
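A toy model of this partitioned TLB can make the split concrete. This is a sketch under assumptions (the class name is ours, and random replacement within a partition is an assumption of the sketch, not a claim about the R2000's exact policy):

```python
import random

class PartitionedTLB:
    """Toy model of the R2000's partitioned, fully-associative TLB:
    8 lower slots (entries 0-7) for slow-to-retrieve PTEs, and 56 upper
    slots (entries 8-63) for frequently used level 1 user PTEs."""
    def __init__(self, lower=8, upper=56):
        self.lower = [None] * lower
        self.upper = [None] * upper

    def insert(self, vpn, pfn, slow=False):
        # Choose the partition by entry type, then replace a random
        # victim within it (replacement policy assumed for this sketch).
        part = self.lower if slow else self.upper
        part[random.randrange(len(part))] = (vpn, pfn)

    def lookup(self, vpn):
        # Fully associative: a match may sit in any slot of either set.
        for entry in self.lower + self.upper:
            if entry is not None and entry[0] == vpn:
                return entry[1]
        return None
```

The point of the partition is protection, not speed: pinning slow-to-refill entries in their own small set keeps frequent level 1 user PTEs from evicting them.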
Test Tools.
A system analysis tool called Monster, which enables us to monitor actual miss-handling costs in CPU cycles.
A TLB simulator called Tapeworm, which is compiled directly into the kernel so that it can intercept all of the actual TLB misses caused by both user processes and OS kernel memory references.
The TLB information that Tapeworm extracts from the running system is used to obtain TLB miss counts and to simulate different TLB configurations.
System monitoring with Monster. Monster is a hardware monitoring system; it
is comprised of a monitored DECstation 3100, an attached logic analyzer, and a controlling workstation.
It measures the amount of time taken to handle each TLB miss.
TLB Simulation with Tapeworm. The Tapeworm simulator is built into the
operating system and is invoked whenever there is a TLB miss.
The simulator uses the real TLB misses to simulate its own TLB configuration.
Trace-Driven Simulation. Trace-driven simulation has traditionally been used because it is
well suited to studying the components of a computer's memory system, such as TLBs.
A sequence of memory references is fed to the simulation model to mimic the way that a real processor might exercise the design.
Problems with trace-driven simulation:
It is difficult to obtain accurate traces.
It consumes considerable processing and storage resources.
It assumes that address traces are invariant to changes in the structural parameters of the simulated TLB.
Solution. Compiling the TLB simulator, Tapeworm, directly
into the operating system kernel. This allows all system activity to be accounted for, including multiple-process and kernel interactions.
It does not require address traces, and it considers all TLB misses, whether caused by user-level
tasks or by the kernel.
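The idea can be sketched in Python. This is a simplification we wrote, not Tapeworm's actual code: it replays only the real misses against simulated TLBs of other sizes, with LRU replacement standing in for whatever policy is under study:

```python
from collections import OrderedDict

class SimTLB:
    """A simulated TLB of a given size with LRU replacement (an
    assumption of this sketch; other policies could be modeled)."""
    def __init__(self, size):
        self.size, self.entries, self.misses = size, OrderedDict(), 0

    def access(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)      # hit in the simulated TLB
            return
        self.misses += 1
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)   # evict the LRU entry
        self.entries[vpn] = True

class TapewormSketch:
    """Stands in for the kernel hook: on every real TLB miss the kernel
    would call miss_hook(vpn), which replays that reference against
    each simulated configuration and tallies simulated misses."""
    def __init__(self, sizes):
        self.sims = {n: SimTLB(n) for n in sizes}

    def miss_hook(self, vpn):
        for sim in self.sims.values():
            sim.access(vpn)

    def miss_counts(self):
        return {n: s.misses for n, s in self.sims.items()}
```

Because the hook sees the live reference stream as misses occur, no trace files need to be collected, stored, or replayed.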
OS Impact on Software-Managed TLBs. Different operating systems gave different results, although
the same applications were run on each system.
There are differences in both TLB miss counts and total TLB miss service time.
Increasing TLB Performance: additional TLB miss vectors; increasing the lower slots in the TLB partition; increasing the TLB size; modifying the TLB associativity.
TLB Miss Vectors
L1 User – miss on a level 1 user PTE
L1 Kernel – miss on a level 1 kernel PTE
L2 – miss on a level 2 PTE, after a level 1 user miss
L3 – miss on a level 3 PTE, after a level 1 kernel miss
Modify – miss on a protection violation
Invalid – page fault
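Since each vector has its own handler, each has its own per-miss cost, and total TLB service time decomposes into counts weighted by costs. A small illustrative calculation (the cycle costs below are placeholders we invented, not the paper's Monster measurements):

```python
# Assumed per-miss handling costs in cycles; these numbers are
# hypothetical placeholders, NOT measurements from the paper.
COST_CYCLES = {
    "L1_user": 20,
    "L1_kernel": 40,
    "L2": 400,
    "L3": 500,
    "modify": 300,
    "invalid": 1000,
}

def total_service_cycles(miss_counts):
    """Weight each miss vector's count by its per-miss handling cost."""
    return sum(COST_CYCLES[vector] * n for vector, n in miss_counts.items())
```

This decomposition is why miss counts alone are misleading: a modest number of expensive L2 or invalid misses can dominate a large number of cheap L1 user misses.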
Modifying the Lower TLB Partition
In OSF/1, increasing from 4 to 5 lower slots
decreases miss-handling time by 50%. In Mach 3.0, performance increases up to 8
slots. Microkernels benefit from increasing the lower TLB
partition because many system services (e.g., the Unix server on Mach 3.0) are mapped by L2 PTEs.
Increasing TLB Size
• Building TLBs with additional upper slots.
• The most significant component is L1K misses, due to the large number of mapped data structures in the kernel.
• Allowing the uTLB handler to service L1K misses reduces the TLB service time.
• In each system there is a noticeable improvement in the TLB service time as the TLB size increases.
Conclusion.
Software-management of TLBs magnifies the importance of the interactions between TLBs and operating systems, because of the large variation in TLB miss service times that can exist. TLB behavior depends upon the kernel’s use of virtual memory to map its own data structures, including the page tables themselves. TLB behavior is also dependent upon the division of service functionality between the kernel and separate user tasks.