computer architecture short note (version 8)
TRANSCRIPT
1 | P a g e
Computer Architecture IN 2320
Lesson 02 – Introduction
Computer architecture:
Deals with the functional behavior of a computer system as viewed by a programmer
Ex: the size of a data type, e.g. 32 bits for an integer
Computer organization:
Deals with structural relationships that are not visible to the programmer
Ex: clock frequency or the size of the physical memory
Levels of a computer:
1. User Level: Application Programs (HIGH LEVEL)
2. High level languages
3. Assembly Language/ Machine Code
4. Microprogrammed/ Hardwired Control
5. Functional Units (Memory, ALU, etc)
6. Logic Gates
7. Transistors and Wires (LOW LEVEL)
Computer Architecture-Definition:
The attributes of the computer system that are visible to programmers i.e. the attributes of the
computer system that have a direct impact on the logical execution of a program
Ex: the instruction set, the size of a data type, techniques of addressing the memory
EX: Architectural issue is whether a computer will have a multiply instruction
Computer Organization-Definition:
The operational units and their interconnection that realize the architectural specifications
Ex: control signals, interface between computer and peripherals, memory technology used
Ex: An organizational issue is whether the multiply instruction is implemented using a separate circuit
or by repeated use of the adder circuit.
Organizational decisions may be based on several parameters, such as the anticipated frequency of
use of the multiply instruction.
Forces on Computer Architecture:
Technology
Programming Languages
Applications
OS
History
The Computer Architect’s view:
Architect is concerned with design & performance
Designs the ISA for optimum programming utility and optimum performance of implementation
Designs the hardware for best implementation of the instructions
Uses performance measurement tools, such as benchmark programs, to see that goals are met
Balances performance of building blocks such as CPU, memory, I/O devices, and
interconnections
Meets performance goals at lowest cost
Factors involved when selecting a better computer are:
1. COST factors
a. Cost of hardware design
b. Cost of software design (OS, applications)
c. Cost of manufacture
d. Cost of end purchaser
2. PERFORMANCE factors
a. What programs will be run?
b. How frequently will they be run?
c. How big are the programs?
d. How many users?
e. How sophisticated are the users (User level)?
f. What I/O devices are necessary?
g. There are two ways to make computers go faster.
i. Wait some time (a year or more); implement in a faster/better/newer technology.
1. More transistors will fit on a single chip.
2. More pins can be placed around the IC.
3. The process used will have electronic devices (transistors) that switch
faster.
ii. New/innovative architectures and architectural features, and clever
implementations of existing architectures.
Higher computer performance may involve one or more of the following:
Short response time for a given piece of work
o The total time taken by a functional unit to respond to a request for service
o A functional unit/execution unit is the part of the CPU that performs the operations and
calculations as instructed by a computer program.
High throughput (rate of processing work)
o Rate at which something can be processed
Low utilization of computing resources
o System resources(practical): physical or virtual entities of limited availability
Ex: memory, processing capacity, network speed
o Computational resources(abstract): resources used for solving a computational problem
Ex: computational time, memory space
Fast data compression and decompression
High bandwidth
Short data transmission time
*note: the performance factors highlighted in red in the original notes are the main areas of interest.
Throughput:
if (no overlap or parallelism)
    throughput = 1/average response time
else
    throughput > 1/average response time
// the number of parallel processing units also matters
Elapsed time/response time:
Elapsed time = Response time = CPU time + I/O wait time
CPU time = time spent running a program
Performance= 1/response time
Since we are more concerned about CPU time,
Performance = 1/CPU time
*note Improve Performance
1. Faster the CPU
Helps to improve both response time and throughput
2. Add more CPUs
Helps to improve throughput and perhaps response time due to less queuing
*Note: The selection depends on what is important to whom, i.e. on the cost factors and the performance factors.
Ex 01: Computer system user
Goal: Minimize elapsed time for program=time_end-time_start
Called response time (counted in ms)
Ex 02: Computer Center Manager
Goal: Maximize completion rate = no. of jobs per second
Called throughput (counted per sec)
Factors driving architecture:
Effective use of new technology
Whether a desired performance improvement can be achieved
Performance Metrics
Values derived from some fundamental measurements:
Count of how many times an event occurs
Duration of a time interval
Size of some parameter
Some basic metrics include:
Response time
o Elapsed time from request to response
o Elapsed time = Response time = CPU time + I/O wait time
CPU time = time spent running a program
Performance = 1/response time
Since we are more concerned about CPU time,
Performance = 1/CPU time
o CPU time is affected by;
Number of instructions in the program
Average number of clock cycles to complete one instruction
Clock cycle time
Throughput
o Jobs or operations completed per unit of time
Bandwidth
o Bits per second
Resource utilization
Standard benchmark metrics
SPEC
TPC
Characteristics of good metrics:
Linear
o Proportional to the actual system performance
Reliable
o Larger value -> better performance
Repeatable
o Deterministic when measured
Consistent
o Units and definition constant across systems
Independent
o Independent from influence of vendors
Easy to measure
Some examples of Standard Metrics:
MIPS
MFLOPS, GFLOPS, TFLOPS, PFLOPS
SPEC metrics
TPC metrics
Parameters of Performance Metrics:
Clock rate (=1/Clock cycle time)
Instructions per program (I/P)
Average clock cycles per instruction (CPI)
Service time
Interarrival time (time between arrivals of successive requests)
Number of users
Think time
*note Execution time (CPU time, runtime) = I/P * CPI * clock cycle time <= Iron Law
All the three factors are combined to affect the metric Execution time.
I/P -> depend on compiler
CPI -> depend on CPU design/organization
Clock cycle time -> processor architecture
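The Iron Law above can be sketched in a few lines of Python; the function and variable names are illustrative, not from the notes:

```python
def execution_time(instr_per_program, cpi, clock_cycle_time_s):
    """Iron Law: runtime = (I/P) * CPI * clock cycle time."""
    return instr_per_program * cpi * clock_cycle_time_s

# 1 million instructions, CPI of 2.0, 1 ns clock cycle (a 1 GHz clock)
runtime = execution_time(1_000_000, 2.0, 1e-9)
print(round(runtime, 6))  # 0.002 seconds
```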
Ex01:
Our program takes 10 s to run on computer A, which has a 400 MHz clock. We want it to run in 6 s. The
designer says that the clock rate can be increased, but this will cause the total number of clock cycles for
the program to increase to 1.2 times the previous value. What is the minimum clock rate required to get
the desired speedup?
Answer:
              Old machine A    New machine A
Runtime       10 s             6 s
Clock rate    400 MHz          CR
Let the total number of clock cycles per program on the old machine A = x
Since clock cycles per program = clock rate * runtime,
x = 400 MHz * 10 s = 4 * 10^9 cycles
Total number of clock cycles per program on the new machine A = 1.2x
1.2 * 4 * 10^9 = 6 * CR
CR = 800 MHz
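The Ex01 arithmetic can be checked directly (units in Hz and seconds; the variable names are my own):

```python
# Ex01: 400 MHz clock, 10 s runtime; cycle count grows 1.2x; target runtime is 6 s.
old_clock_rate = 400e6                  # Hz
cycles = old_clock_rate * 10            # cycles per program = clock rate * runtime
new_cycles = 1.2 * cycles               # designer's penalty: 1.2x the cycles
required_clock_rate = new_cycles / 6    # Hz needed to finish in 6 s
print(round(required_clock_rate / 1e6))  # 800 (MHz)
```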
Workload:
A test case for the system
Benchmark:
A set of workloads which together is representative of ‘my program’; a benchmark should be reproducible.
Ex02:
Which is faster? A or B?
Test Case Machine A Machine B
1 1s 10s
2 100s 10s
Assume Test Case 1 type processes happen 99% of the time
Answer:
We have to obtain the weighted average of the runtimes.
Weighted average for A = (1 x 99 + 100 x 1) / 100 = 1.99 s <= A is faster
Weighted average for B = (10 x 99 + 10 x 1) / 100 = 10 s
*note
The cost of improving the whole processor is high. But if you find that a particular circuit is needed 99% of
the time (ex: the multiplication circuit), then you can improve just that circuit by a factor of 2 or 3. You will
improve the performance as a whole that way.
Performance comparison
Performance = 1/time
There are 2 machines A and B.
Performance(A) = 1/time(A)
Performance(B) = 1/time(B)
Therefore;
Performance(A)/Performance(B) = time(B)/time(A) = 1 + x/100 iff A is x% faster than B
Ex03:
time(A) = 10s, time(B) = 15s
Performance(A)/Performance(B) = time(B)/time(A) = 15/10 = 1.5 = 1 + 50/100, i.e. A is 50% faster than B
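The "A is x% faster than B" relation can be wrapped in a tiny helper (a sketch; the function name is mine):

```python
def percent_faster(time_a, time_b):
    """x such that Performance(A)/Performance(B) = time(B)/time(A) = 1 + x/100."""
    return (time_b / time_a - 1) * 100

print(percent_faster(10, 15))  # 50.0, as in Ex03
```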
Breaking down performance:
A program is broken into instructions.
o Hardware is aware of instructions, not programs.
At a lower level, the hardware breaks instructions into cycles.
o Low-level state machines change state every cycle
For example 500MHz P-III runs 500M cycles/sec, 1 cycle = 2ns
Iron Law
Processor time = Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)
               = (Code size) * (CPI) * (Cycle time)
Instructions/Program (Code size): the Architecture level, the concern of the Compiler Designer
  Instructions executed, not static code size
  Determined by algorithm, compiler, ISA
Cycles/Instruction (CPI): the Implementation level, the concern of the Processor Designer
  Average number of clock cycles to complete one instruction
  Determined by ISA and CPU organization
  Overlap among instructions reduces this term
Time/Cycle (Cycle time): the Realization level, the concern of the Chip Designer
  Determined by technology, organization, clever circuit design
Ex04:
Machine A: clock 1ns, CPI 2.0, for program X
Machine B: clock 2ns, CPI 1.2, for program X
Which is faster, and by how much?
Time(A) = I/P * CPI * clock cycle time = I/P * 2.0 * 1 = 2 I/P
Time(B) = I/P * CPI * clock cycle time = I/P * 1.2 * 2 = 2.4 I/P
Performance(A)/Performance(B) = Time(B)/Time(A) = 2.4 I/P / 2 I/P = 1.2 = 1 + 20/100
i.e. machine A is 20% faster than machine B
Ex05:
Keep clock(A) at 1ns and clock(B) at 2ns.
For equal performance, if CPI(B) = 1.2, what is CPI(A)?
Time(A) = I/P * CPI * clock cycle time = I/P * CPI(A) * 1 = CPI(A) * I/P
Time(B) = I/P * CPI * clock cycle time = I/P * 1.2 * 2 = 2.4 I/P
For equal performance, Performance(A)/Performance(B) = Time(B)/Time(A) = 2.4 I/P / (CPI(A) * I/P) = 2.4/CPI(A) = 1
CPI(A) = 2.4
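Ex04 and Ex05 can be replayed with the Iron Law; since I/P is the same on both machines, comparing CPI times cycle time is enough (names are illustrative):

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    # Iron Law with the common I/P factor cancelled out
    return cpi * cycle_time_ns

t_a = time_per_instruction_ns(2.0, 1)   # machine A: 2.0 ns per instruction
t_b = time_per_instruction_ns(1.2, 2)   # machine B: 2.4 ns per instruction
print(t_b / t_a)   # 1.2 -> A is 20% faster (Ex04)
print(t_b)         # 2.4 -> CPI(A) for equal performance at a 1 ns clock (Ex05)
```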
Other Metrics
MIPS: Million Instructions Per Second
MFLOPS: Million FLOating point operations Per Second
GFLOPS: Giga FLOating point operations Per Second
Since floating point numbers contain 3 parts (sign, mantissa, and exponent), a floating point operation
takes more time than an integer operation, i.e. floating point operations take more cycles per instruction.
Therefore we take the worst case as the metric.
The common case differs from application to application. The difference can be significant if a program
relies predominantly on integers, as opposed to floating point operations.
Ex06:
Without floating point (FP) hardware, an FP operation may take 50 single-cycle instructions. With FP
hardware, it takes only one 2-cycle instruction.
Thus adding FP hardware:
CPI increases.
Instructions/program decreases.
Total execution time decreases.
        without FP hardware    with FP hardware
I/P     50                     1
CPI     1                      2
The Instruction Set Architecture (ISA) changes => CPI changes
The compiler design also has to change => I/P changes
Since there is no change to the clock rate, the clock cycle time remains the same.
CPU time = I/P * CPI * clock cycle time
CPU time without FP hardware = 50 * 1 * clock cycle time
CPU time with FP hardware = 1 * 2 * clock cycle time
CPU time with FP hardware < CPU time without FP hardware
Average
If programs run equally:
Arithmetic mean = (1/n) * SUM[t=1 to n] time(t)
If the programs run in different proportions:
Weighted arithmetic mean = SUM[t=1 to n] (weight(t) x time(t)) / SUM[t=1 to n] weight(t)
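Both means can be computed with one helper, since equal weights reduce the weighted mean to the plain arithmetic mean (the function name is mine):

```python
def weighted_mean(times, weights):
    """Sum of weight(t) * time(t), divided by the sum of the weights."""
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)

print(weighted_mean([1, 1000], [1, 1]))    # 500.5 (equal mix)
print(weighted_mean([1, 1000], [90, 10]))  # 100.9 (90%/10% mix)
```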
Ex07:
            Machine A CPU time    Machine B CPU time
Program 1   1ns                   10ns
Program 2   1000ns                100ns
What is the fastest computer?
If programs run equally:
Mean CPU time of A = (1 + 1000)/2 = 1001/2 = 500.5ns
Mean CPU time of B = (10 + 100)/2 = 110/2 = 55ns
Machine B is the fastest.
If program type 1 runs 90% of the time and program type 2 runs 10% of the time:
Mean CPU time of A = (1 x 90 + 1000 x 10)/100 = 10090/100 = 100.9ns
Mean CPU time of B = (10 x 90 + 100 x 10)/100 = 1900/100 = 19ns
Machine B is the fastest.
Amdahl’s Law
Improving the most used component by a large factor is better than improving everything by a small
factor, i.e. speed up the common case!
Speed-up of a computer:
The definition of the overall/final speed-up is given below.
Speed up = old time taken / new time taken
If you have improved the performance, some parts will run in less time and speed-up > 1; otherwise you
have not improved anything.
According to Amdahl’s law, we do not try to improve the whole processor at once; instead we select a
particular part and improve it.
Ex08:
75% of a program that took 40ns was improved. Therefore 75% of the program runs according to the new
time and 25% of the program runs according to the old time.
Before the improvement, the above-mentioned 75% of instructions were executed in 5ns; after the
improvement, that type of instruction executes in 1ns. The old time taken to execute the program was
40ns.
Assuming that the improvement applies only to a fraction f of the program, the speed-up of that fraction f = 5ns/1ns,
i.e. speed-up of that fraction f = s = 5
new time taken = (1 - f) x old time taken + f x (old time taken / s)
new time taken = (1 - 0.75) x 40 + 0.75 x 40/5 = 16ns
Overall speed up = old time taken / [(1 - f) x old time taken + f x (old time taken / s)]
                 = 1 / [(1 - f) + f/s]
Amdahl’s Law:
Overall speed up = 1 / ((1 - f) + f/s)
“Speed up the common case.”
Amdahl’s Law Limit:
Maximum overall speed up = lim[s -> infinity] 1 / ((1 - f) + f/s) = 1/(1 - f)
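Amdahl's law and its limit can be checked numerically (a sketch; the names are mine):

```python
def amdahl_speedup(f, s):
    """Overall speed-up when a fraction f of the runtime is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

print(amdahl_speedup(0.75, 5))   # 2.5, matching Ex08 (40 ns -> 16 ns)
print(amdahl_speedup(0.8, 1e9))  # approaches the limit 1/(1 - 0.8) = 5 as s grows
```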
Figure 1: Amdahl's Law limit
If (1 - f) is nontrivial (i.e. a non-negligible fraction of the work is left unimproved), the speed up is limited.
If a program is highly sequential, there is no any solution other than increase the speedup of a fraction
of the program.
If parallel, we have the additional option to increase the parallelism.
*note
The performance enhancement possible with a given improvement is limited by the amount the
improved feature is used.
Ex: To make a significant impact on the CPI, identify the instructions that occur more frequently and
optimize the design for them.
Ex09:
A program runs in 100s, and multiplies account for 80% of that time. Designer M can improve the speed
of multiply operations. Now I am a user and I need to make my program 5 times faster. How much
speed-up should M achieve to allow me to reach my overall speed-up goal?
First we need to check whether we can achieve this speed-up practically. So let us find the maximum
speed-up that we can achieve with f of 80%.
Maximum speed up achievable with f of 80% = 1/(1 - 0.8) = 1/0.2 = 5
We can achieve an overall speed up of 5 only if we give an infinite speed up to the multiplication
instructions, i.e. s -> infinity.
The designer M was asked to improve the overall speed up to 5. Theoretically we proved that the maximum
overall speed up is also 5. The practical maximum speed up is always less than the theoretical
maximum speed up. Therefore this goal cannot be achieved by designer M.
Ex10:
The usage frequency and the cycles per operation are given below.
Operation    Frequency    Cycles per operation
ALU          43%          1
Load         21%          1
Store        12%          2
Branch       24%          2
Assume stores can execute in 1 cycle by slowing the clock by 15%. Is it worth implementing this?
Execution time (CPU time, runtime) = I/P * CPI * clock cycle time
CPI = average number of clock cycles per instruction
Old CPI = (43 x 1 + 21 x 1 + 12 x 2 + 24 x 2)/100 = 1.36
New CPI = (43 x 1 + 21 x 1 + 12 x 1 + 24 x 2)/100 = 1.24
Let the old clock cycle time = x
Since the clock will be slowed down by 15%, the clock cycle time will increase by 15% due to the inverse
relationship (clock rate = 1 / clock cycle time).
Therefore the new clock cycle time = 1.15x
Since the instruction set and the compiler remain unchanged, I/P is constant.
                    Old machine    New machine
I/P                 I/P            I/P
CPI                 1.36           1.24
Clock cycle time    x              1.15x
Old CPU time = I/P x 1.36 x x
New CPU time = I/P x 1.24 x 1.15x
Speed up = (I/P x 1.36 x x) / (I/P x 1.24 x 1.15x) = 1.36 / (1.24 x 1.15) = 0.95
Since the speed up < 1,
Old CPU time / New CPU time < 1
Old CPU time < New CPU time
This implementation is not worth doing.
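The Ex10 trade-off can be verified in a few lines. The ALU frequency is taken here as 43% so that the frequencies sum to 100% and the CPI of 1.36 works out; the variable names are my own:

```python
old_cpi = (43*1 + 21*1 + 12*2 + 24*2) / 100   # weighted average cycles per instruction
new_cpi = (43*1 + 21*1 + 12*1 + 24*2) / 100   # stores now take 1 cycle
# The clock cycle time grows by 15%, so the new per-instruction time is new_cpi * 1.15
speedup = old_cpi / (new_cpi * 1.15)
print(old_cpi, new_cpi)   # 1.36 1.24
print(round(speedup, 2))  # 0.95 -> slower than before, so not worth doing
```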
Generations of Computer
Vacuum tube
Transistor
Small scale IC
Medium scale IC
Large scale IC
Very large scale IC
Ultra large scale IC
AI
Moore’s Law
The observation that, over the history of computing hardware, the number of transistors in a dense IC
has doubled approximately every two years.
Figure 2: CPU Transistor Counts 1971 - 2008 and Moore's Law
Consequences:
Higher packing density means shorter electrical paths, giving higher performance in speed
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability
Cost of a chip has remained almost unchanged.
Requirements changed over time:
Image processing
Speech recognition
Video conferencing
Multimedia authoring
Voice and video annotation files
Simulation modeling
Ways of speeding up the processor:
Pipelining
On board cache
On board L1 and L2 cache
Branch prediction
Data flow analysis
Speculative execution
Performance mismatch:
Processor speed increases
Memory capacity increases
But memory speed always lags behind processor speed
Figure 3: DRAM (Main Memory) and Processor characteristics
Solutions:
Increase number of bits retrieved at one time
o Make DRAM wider rather than deeper
Change DRAM interface
o Cache
Reduce frequency of memory access
o More complex cache and cache on chip
Increase interconnection bandwidth
o High speed buses
o Hierarchy of buses
Final Computer Performance is measured in CPU time:
CPU time = Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)
                   Instruction Count    CPI    Clock Rate
Program            x
Compiler           x                    (x)
Instruction set    x                    x
Organization                            x      x
Technology                                     x
Lesson 03 – Computer Memory
A memory unit is a collection of storage cells together with the necessary circuits for information transfer in
and out of storage.
Memory stores binary information in groups of bits called words.
A word in a memory is a fundamental unit of information in a memory.
Hold series of 1’s and 0’s
Represent numbers, instruction codes, characters etc.
A group of 8 bits is called a byte, which is the fundamental unit of measurement.
Usually a word is a multiple of bytes.
Classification of memory due to key Characteristics:
Location
o CPU (registers)
o Internal (Main Memory/ RAM)
o External (backing storage)
Capacity
o Word size (The natural unit of organization)
o Number of words (Or bytes)
Unit of transfer
o Internal (depend on the bus width)
o External (memory block)
o Addressable unit (smallest unit which can be uniquely addressed, word internally)
Access method
o Sequential (ex: tape)
o Direct (ex: disk)
o Random (ex: RAM)
o Associative (ex: cache, within words)
Performance
o Access time (latency)
o Memory cycle time
o Transfer rate
Physical type
o Semiconductor (ex: SRAM, caches)
o Magnetic (ex: disk and tape)
o Optical (CD and DVD)
o Others (ex: bubble)
Physical characteristics
o Decay (leak charges in capacitors in DRAM)
o Volatility
o Erasable
o Power consumption
Organization
Figure 4: Memory Hierarchy
Classification of memory due to key Characteristics:
Location
Whether memory is internal or external to the computer
Internal memory:
Often refers to the Main Memory
But there are other types of internal memory too, which are associated with the processor
o Register memory
o Cache memory
External memory:
Refers to peripheral storage devices, such as disk and tape
Accessible to the processor via I/O controllers
Capacity
Internal memory:
Measured in terms of bytes or words
Order of 1, 2, 4, 8 bytes
External memory:
Measured in terms of hundreds of Mega bytes or Giga bytes
Unit of transfer
Internal memory:
Refers to the number of data lines into and out of the memory module
This may be equal to the word length, but is often larger: 128 or 256 bits
Concepts related to internal memory:
Word
o Natural unit of organization of memory
o The size of the word is typically equal to the number of bits used to represent a number
and to the instruction length. But there are exceptional cases too.
Addressable units
o Refers to the location which can be uniquely addressed
o In some systems addressable unit is the word.
o Many systems allow addressing at byte level.
o In any case, the relationship between the length in bits A of an address and the number N of
addressable units is 2^A = N; the range of addressable units is 0 to (2^A - 1)
External memory:
Data are often transferred in much larger units than a word, and these are referred to as blocks.
Access method
Methods of accessing units of data
Sequential access:
Memory is organized into units of data called records.
Access must be made in a specific linear sequence.
Each intermediate record from the current location to the desired location must be passed over and
rejected.
The time to access an arbitrary record is highly variable, depending on the location of the data and
the previous location of the read head.
Ex: tape
Direct access:
Individual blocks or records have a unique address based on physical location.
Access is accomplished by direct access to reach a vicinity plus sequential searching to reach the
final location.
Access time is variable.
Ex: Disk units
Random access:
Each addressable location in the memory has a unique, physically wired-in addressing mechanism.
The time to access a given location is independent of the sequence of prior accesses and is
constant.
Ex: Main memory, some cache systems
Associative access:
This is a random access type of memory that enables one to make a comparison of desired bit
locations within a word for a specific match, and to do this for all words simultaneously.
Thus a word is retrieved based on a portion of its contents rather than its address.
This is a very high speed searching kind of a memory access.
Ex: cache
Performance
Capacity and performance are the most important characteristics for a user
Access time (latency):
For random access memory:
The time taken to perform a read or write operation, i.e. the time from the instant that an
address is presented to the memory to the instant that data have been stored or made
available for use
For non random access memory:
The time taken to position the read-write mechanism at the desired location
Memory cycle time:
Primarily applied to random access memory
Memory cycle time = access time + time required before a second access can commence
The time required before a second access can commence is the time taken to recover.
Memory cycle time is concerned with the system bus, not the processor.
Transfer rate:
The rate at which data can be transferred into and out of a memory unit
For random access memory:
Transfer rate = 1/cycle time
For non random access memory:
T_N = T_A + N/R
T_N = average time to read or write N bits
T_A = average access time
N = number of bits
R = transfer rate in bps
Memory Access time
Average access time (Ts) = H * T1 + (1 - H) * (T1 + T2)
Access efficiency = T1/Ts
H = fraction of all memory accesses that are found in the faster memory (hit ratio)
T1 = access time to level 1 (L1, the cache)
T2 = access time to level 2 (L2, the main memory)
CPU <-> L1 (cache) <-> L2 (main memory)
Ex01:
Suppose a processor has two levels of memory. Level 1 contains 1000 words and has an access time of
0.01μs. Level 2 contains 100,000 words and has an access time of 0.1μs. The processor can access level 1
directly. If the word is in level 2, it is first transferred into level 1 and then accessed by the processor.
For simplicity we ignore the time taken by the processor to determine whether the word is in level 1 or 2.
For a high percentage of level 1 accesses, the average access time is much closer to that of level 1 than
that of level 2.
Suppose 95% of the memory accesses are found in level 1:
Ts = 0.95 * 0.01 µs + (1 - 0.95) * (0.01 µs + 0.1 µs)
Ts = 0.015 µs
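The two-level access time formula can be expressed as a function (the names are mine):

```python
def average_access_time(hit_ratio, t1, t2):
    """Ts = H*T1 + (1-H)*(T1+T2): a miss pays T1 (the failed level 1 probe) plus T2."""
    return hit_ratio * t1 + (1 - hit_ratio) * (t1 + t2)

# Ex01 values: H = 0.95, T1 = 0.01 us, T2 = 0.1 us
print(round(average_access_time(0.95, 0.01, 0.1), 3))  # 0.015 (microseconds)
```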
Locality of reference
Also known as the principle of locality, the phenomenon in which the same values or related storage
locations are frequently accessed.
Two basic types of reference locality
1. Temporal locality
There is a higher probability of repeated access to any data item that has been accessed
in the recent past.
Ex: for loop
2. Spatial locality
There is a higher probability of access to any data item that is physically close to any
other data item that has been accessed in the recent past.
Ex: arrays
Physical Characteristics
Figure 5: Memory Hierarchy list and how physical characteristics differ accordingly.
Semiconductor Memory
The basic element is the cell.
A cell is able to be in one of two stable states (representing 1 and 0); it can be written to set the state
and read to sense the state.
Random Access Memory (RAM)
Dynamic RAM (DRAM) Static RAM (SRAM)
Bits stored as charge in capacitors Bits stored as on/ off switches (use 6 transistors)
Charges leak No charges to leak
Need refreshing even when powered No refreshing needed when powered
Simpler construction More complex construction
Smaller per bit Larger per bit
Less expensive More expensive
Need refresh circuits No need refresh circuits
Slower Faster
Ex: Main Memory Ex: Cache
It is possible to build a computer which uses only SRAM. But there are problems
This would be very fast
This would need no cache
This would cost a very large amount
DRAM Organization in details
There are many ways that a DRAM (Main memory) could be organized.
Ex02:
List few ways how a 16Mbit DRAM can be organized.
16 chips of 1Mbit cells in parallel, so that 1 bit of each word is in each chip, i.e. the word size is 16 bits
=> 1M x 16
4 chips of 4Mbit cells in parallel, so that 1 bit of each word is in each chip, i.e. the word size is 4 bits
=> 4M x 4
Typical 16Mbit DRAM (4M x 4):
2048 x 2048 x 4bit array
Cache memory
Take a bunch of main memory blocks asked for by the CPU and make a copy of them available to the CPU
in a faster manner. If a requested address is already available within the cache, it is a "hit".
What happens when the CPU requests a main memory address?
If the address is available in cache:
    The content at the address is presented to the CPU.
Else:
    Search for the address in main memory.
    If the cache has enough space for the new block:
        The new block is stored in cache.
    Else:
        An existing block in cache is replaced by the new block.
    The content is presented to the CPU.
Figure 6: Performance of accesses involving only
𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑎𝑐𝑐𝑒𝑠𝑠 𝑡𝑖𝑚𝑒 (𝑇𝑠) = 𝐻 ∗ 𝑇1 + (1 − 𝐻) ∗ (𝑇1 + 𝑇2)
When H ->1, the Average access time(Ts) = T1
When H ->0, the Average access time(Ts) = T1 + T2
Cache:
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or module
Figure 7: Cache memory unit of transfer
Overview of the Cache Design:
Size
o Cost
More cache is expensive
o Speed
A larger cache is faster, but only up to a point
Checking cache for main memory addresses takes time
Mapping Function
o Direct mapping
o Associative mapping
o Set associative mapping
Replacement Algorithms
o LRU
o FIFO
o LFU
o Random
Write Policy
o Write through
o Write back
Block size
Number of caches
Mapping function
Figure 8: CPU, Cache, Cache lines and Main Memory
Cache line:
Each individual block in cache memory is directly connected to the CPU without any
intermediaries; the CPU accesses these blocks using cache lines.
Mapping:
Size(Block of Main Memory) = Size(Block of Cache Memory)
Which Main Memory block maps to which Cache memory Block
1. Direct Mapping
Each block of main memory maps to only one cache line.
Example:
Figure 9: A system with 64KB cache and 16 MB Memory
Assume the block size is 4 words, 1 byte per word, so the size of a block is 4 bytes.
Size of cache memory = 64 KB
Number of blocks in the cache memory = 64 KB / 4 B = 16 K
Number of cache lines = number of blocks in cache = 16 K = 2^4 x 2^10 = 2^14
Therefore we need 14 bits to identify a cache line or cache memory block.
Address size of a cache memory block = 14 bits
Size of main memory = 16 MB
Size of a main memory word = 1 B
Number of main memory words = 16 MB / 1 B = 16 M = 2^4 x 2^10 x 2^10 = 2^24
Therefore we need 24 bits to identify a main memory byte or word.
Since addresses are divided into groups of 4 words,
Number of blocks in the main memory = 16 MB / 4 B = 4 M = 2^2 x 2^10 x 2^10 = 2^22
Therefore we need 22 bits to identify a main memory block.
Size of the block number in a main memory address = 22 bits
The combinations of the remaining 2 bits are used to identify the 4 words belonging to a given main
memory block.
Figure 10: Main Memory Address and its three main components
The cache line number is also equal to the cache block number.
Graphical approach
*note
Green Colours represent cache lines.
Blue colours represent Tags.
Blue + Green represent the Main Memory Block number.
Figure 11: Direct Mapping
When the CPU asks for main memory address 000000010000000000000010,
first it checks the cache line corresponding to line number 00000000000000.
If the line is not empty (it has something in it), it checks whether the tag of the block in the cache line
matches the tag 00000001.
If it matches, it returns the word with 10 as the last two bits of the address;
else the current block is replaced by the required main memory block.
Else, if the line is empty, the required main memory block is loaded into the cache.
Exercise:
Find the cache line and tag of the following Main Memory address with all the above assumptions and
conditions.
000010010011000001001011
Answer
Cache line number: 00110000010010 Tag: 00001001
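The tag/line/word split used in the exercise can be expressed as string slicing on the 24-bit address (8-bit tag, 14-bit line, 2-bit word; the helper is my own):

```python
def split_direct_mapped(addr):
    """Split a 24-bit binary address string into (tag, line, word) fields."""
    return addr[:8], addr[8:22], addr[22:]

tag, line, word = split_direct_mapped("000010010011000001001011")
print(tag)   # 00001001
print(line)  # 00110000010010
print(word)  # 11
```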
Figure 12: Direct Mapping Process
Mathematical Approach
𝑖 = 𝑗 𝑚𝑜𝑑 𝑚
Where 𝑖 = 𝑐𝑎𝑐ℎ𝑒 𝑙𝑖𝑛𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 , 𝑗 = 𝑚𝑎𝑖𝑛 𝑚𝑒𝑚𝑜𝑟𝑦 𝑏𝑙𝑜𝑐𝑘 𝑛𝑢𝑚𝑏𝑒𝑟 and 𝑚 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝐶𝑎𝑐ℎ𝑒 𝑙𝑖𝑛𝑒𝑠
Figure 13: Direct mapping function
Main memory blocks map to cache blocks sequentially, but after cache block 7 (in the figure), main
memory block 8 has no cache block left to map to, so it starts mapping again from block 0 of the cache.
Likewise, a particular block j in main memory will map to block (j mod m) in cache, where m is the number
of blocks in the cache.
Direct mapping cache line table:
m = number of memory blocks in cache
s = number of bits to identify the main memory block number
Main memory block (j)                      Cache line (i)
0, m, 2m, 3m, ..., 2^s - m                 0
1, m+1, 2m+1, 3m+1, ..., 2^s - m + 1       1
...                                        ...
m-1, 2m-1, 3m-1, ..., 2^s - 1              m-1
*note
Use of a portion of the address as line number provides a unique mapping.
When more than one memory block maps to same cache line, it is necessary to distinguish them using
tag.
Pros and Cons of Direct Mapping
Pros:
Simple
Inexpensive
Cons:
One fixed location for given block.
o If a program accesses two blocks that map to the same line repeatedly, cache misses are
very high.
o It leads to thrashing.
2. Associative Mapping
A main memory block can be loaded into any cache block that is available.
There are two parts in a main memory address when we consider associative mapping.
If we take the same example of a 64KB cache and 16MB main memory, the address will be as follows.
Figure 14: Two main Components of a Main Memory address
*note
In associative mapping, a main memory block can be loaded into any cache block. Therefore the main
memory block number is considered as the tag.
Every cache line's tag is examined for a match.
Cache searching is expensive.
Figure 15: Associative Mapping Process
Pros and Cons of Associative Mapping
Pros:
Any main memory block can be mapped to any cache memory block
Less swapping under temporal and spatial locality (no thrashing)
Cons:
Have to search all the cache lines for the tag of the particular main memory address
3. Set Associative Mapping
A combination of direct mapping and associative mapping
Cache is divided into a number of sets.
Each set contains a number of cache blocks/ cache lines.
A given block maps to any line in the particular set that block mapped to.
Example: 2 way associative mapping
Two lines per set
A given block can be in one of 2 lines in the set which that block belongs to
Figure 16: Structure of a Cache memory with sets
Suppose there are m number of cache blocks in the cache memory.
𝑚 = 𝑣 x 𝑘
v = number of sets within the cache
k = number of lines (vacancies or cache blocks) within a set
Every Block in main memory maps to one particular set in the cache.
Within that set there are a number of vacancies available.
The main memory block can be mapped to any vacant block within that particular set.
Replacement mechanisms are needed only if that particular set is full.
Mapping a Main Memory Block to a set
Suppose i is the set number of a given main memory block.
𝑖 = 𝑗 𝑚𝑜𝑑 𝑣
j = main memory block number
v = number of sets available within the cache
Accordingly, the 0th to (v-1)th main memory blocks map to the 0th to (v-1)th sets respectively. The vth
main memory block starts again from the 0th set, and so on.
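The wrap-around behaviour of the formula i = j mod v can be checked with a short sketch; v = 4 sets is an assumed value for illustration.

```python
# Worked sketch of the set-mapping formula i = j mod v from the notes.
v = 4  # assumed number of sets in the cache

def set_index(j):
    """Set number for main-memory block j."""
    return j % v

# Blocks 0..3 map to sets 0..3; block 4 wraps around to set 0, and so on.
mapping = {j: set_index(j) for j in range(6)}
```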
If we have v sets, let v = 2^d.
Now d is the number of bits used to represent the set.
Figure 17: Components of a Main Memory Address in Set Associative Mapping
If the tag of the required main memory address is found in the particular set, the word is returned to the CPU.
Two blocks with identical tags never map to the same set; therefore the tag uniquely identifies a block within the set.
If we take the same example of 64KB cache and 16MB Main Memory for 2 way set associative mapping,
the address can be divided into 3 parts as follows.
Assume the block size is 4 words, 1 byte per word, so the size of a block is 4 bytes.
Size of Cache memory = 64 KB
Number of Blocks in the Cache memory = 64 KB / 4 B = 16 K
Number of Cache lines = Number of blocks in Cache = 16 K
Since 2-way set associative mapping is considered, a set contains 2 lines or 2 cache blocks.
Number of sets in the cache = v = 16 K / 2 = 8 K = 2^13
Now it is in the correct format of v = 2^d.
Therefore we need 13 bits to represent the set number to which a Main memory address belongs.
The remaining 9 bits of the main memory block number are taken as the tag that identifies a particular
main memory block uniquely within the set.
Figure 18: Three main components of Main memory address
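The three-field split from this worked example can be sketched as bit manipulation; the field widths (9 tag, 13 set, 2 word-offset bits, totalling the 24-bit address) come from the calculation above.

```python
# Breakdown of the 24-bit address from the worked example:
# 16 MB main memory, 64 KB 2-way set-associative cache, 4 B blocks.
TAG_BITS, SET_BITS, WORD_BITS = 9, 13, 2
assert TAG_BITS + SET_BITS + WORD_BITS == 24  # the full 24-bit address

def split_address(addr):
    """Split a 24-bit physical address into (tag, set, word) fields."""
    word = addr & ((1 << WORD_BITS) - 1)            # lowest 2 bits
    set_no = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)  # next 13 bits
    tag = addr >> (WORD_BITS + SET_BITS)            # top 9 bits
    return tag, set_no, word
```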
Cache Replacement Algorithms
There is the possibility of a mapped cache memory becoming fully occupied. At such an instance, an
existing block is removed from the cache and the new block is loaded into its place.
The replacement mechanism depends on the mapping mechanism:

Mapping mechanism        When replacement is needed      How replacement is done
Direct Mapping           The mapped cache block is full  No choice: that particular block has to be replaced
Associative Mapping      All the cache blocks are full   Hardware implemented algorithm (fast):
                                                         Least Recently Used (LRU), First In First Out (FIFO),
                                                         Least Frequently Used (LFU), Random
Set Associative Mapping  The mapped set is full          The same hardware implemented algorithms, applied
                                                         within the set
Least Recently Used (LRU)
Replace the block in the set that has been in the cache longest with no reference to it. For two-way
set associative, this is easily implemented. Each cache line includes a USE bit. When a line is referenced,
its USE bit is set to 1 and the USE bit of the other line in that set is set to 0. When a block is to be read into
the set, the line whose USE bit is 0 is used. Because we are assuming that more recently used memory
locations are more likely to be referenced, LRU should give the best hit ratio.
LRU is also relatively easy to implement for a fully associative cache. The cache mechanism maintains a
separate list of indexes to all the lines in the cache. When a line is referenced, it moves to the front of
the list. For replacement, the line at the back of the list is used. Because of its simplicity of
implementation, LRU is the most popular replacement algorithm.
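The list-based scheme just described might be sketched as follows; this is a simplified software model of the idea, not a hardware implementation.

```python
class LRUCache:
    """Software model of the list-based LRU scheme: front = most recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = []  # block numbers, most recently used first

    def access(self, block):
        """Return True on a hit; on a miss, load the block, evicting the LRU line if full."""
        if block in self.lines:
            self.lines.remove(block)
            self.lines.insert(0, block)  # referenced line moves to the front of the list
            return True
        if len(self.lines) == self.capacity:
            self.lines.pop()  # victim is the line at the back of the list
        self.lines.insert(0, block)
        return False
```

A two-line cache then behaves exactly as the USE-bit description above: the line not referenced most recently is the one replaced.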
First In First Out (FIFO)
Replace that block in the set that has been in the cache the longest. FIFO is easily implemented as a
round-robin or circular buffer technique.
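The circular-buffer idea can be sketched like this; the capacity is an assumed illustrative value.

```python
class FIFOCache:
    """FIFO replacement as a round-robin / circular buffer over the cache slots."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next = 0  # circular pointer to the slot that will be replaced next

    def access(self, block):
        """Return True on a hit; on a miss, overwrite the oldest slot round-robin."""
        if block in self.slots:
            return True
        self.slots[self.next] = block
        self.next = (self.next + 1) % len(self.slots)  # advance the circular pointer
        return False
```

Note that, unlike LRU, a hit does not change the replacement order: the oldest-loaded block is evicted regardless of how recently it was referenced.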
Least Frequently Used (LFU)
Replace that block in the set that has experienced the fewest references. LFU could be implemented by
associating a counter with each line.
Random
A technique not based on usage (i.e., not LRU, LFU, FIFO, or some variant) is to pick a line at random
from among the candidate lines. Simulation studies have shown that random replacement provides only
slightly inferior performance to an algorithm based on usage.
Write Policy
When a block that is in the cache is to be replaced, there are 2 cases to consider:
1. If the old block in the cache has not been modified, then overwriting can be done without any
issue.
2. If the old block in the cache has been modified, then main memory must be updated by writing
the line of cache out to the block of main memory before bringing the new block to that place.
There are 2 problems related to writing back to main memory:
1. More than one device may have access to main memory.
Ex: An I/O module may be able to read-write directly to memory. If a word has been
altered only in the cache, then the corresponding memory word is invalid. If the I/O
device has altered main memory, then the cache word is invalid.
2. Multiple processors are attached to the same bus and each processor has its own local cache.
If a word is altered in one cache, it could conceivably invalidate a word in other
caches.
There are 2 techniques for Write Policy:
1. Write through policy
2. Write back policy
Write through policy
All write operations are made to main memory as well as to the cache, ensuring that main
memory is always valid.
Any other processor-cache module can monitor traffic to main memory to maintain consistency
within its own cache.
The main disadvantage of this technique is that it generates substantial memory traffic and may
create a bottleneck. Overall performance will go down this way.
Write back policy
In this technique updates are made only in the cache.
When an update occurs, a dirty bit, or use bit, associated with the line is set. Then, when a block
is replaced, it is written back to main memory if and only if the dirty bit is set.
The problem with write back policy is that portions of main memory are invalid, and hence
accesses by I/O modules can be allowed only through the cache. This makes for complex
circuitry and a potential bottleneck.
Sir did not talk about cache coherency
Line Size (Block Size)
As the block size increases from very small to larger sizes, the hit ratio will at first increase because of
the principle of locality. The hit ratio will begin to decrease as the block becomes even bigger.
Two specific effects come into play as block sizes get larger:
Larger blocks reduce the number of blocks that fit into the cache
Some additional words are farther from the requested word and therefore less likely to be
needed in the near future
Number of caches
When caches were originally introduced, systems used only one cache. More recently, the use of
multiple caches has become the norm.
This design issue concerns two aspects:
1. The number of cache levels
2. The use of unified vs split caches
Cache Performance
The cache has an important effect on overall system performance.
𝐸𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 = (𝐶𝑃𝑈 𝑒𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑐𝑦𝑐𝑙𝑒𝑠 + 𝑀𝑒𝑚𝑜𝑟𝑦 𝑠𝑡𝑎𝑙𝑙 𝑐𝑦𝑐𝑙𝑒𝑠) x 𝐶𝑙𝑜𝑐𝑘 𝑐𝑦𝑐𝑙𝑒 𝑡𝑖𝑚𝑒
𝑀𝑒𝑚𝑜𝑟𝑦 𝑠𝑡𝑎𝑙𝑙 𝑐𝑦𝑐𝑙𝑒𝑠 = 𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛𝑠 𝑖𝑛 𝑡ℎ𝑒 𝑝𝑟𝑜𝑔𝑟𝑎𝑚 x 𝑚𝑖𝑠𝑠𝑒𝑠 𝑝𝑒𝑟 𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛 x 𝑀𝑖𝑠𝑠 𝑝𝑒𝑛𝑎𝑙𝑡𝑦
As CPUs increase in performance, the memory stall cycles have an increasing effect on the overall
performance.
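A worked example of the two formulas above can make them concrete; every input number below is assumed purely for illustration and does not come from the notes.

```python
# Assumed inputs for the execution-time and memory-stall formulas.
instructions   = 1_000_000   # instructions in the program
misses_per_ins = 0.02        # assumed misses per instruction
miss_penalty   = 100         # assumed cycles per miss
cpu_cycles     = 2_000_000   # assumed CPU execution cycles
clock_cycle    = 1e-9        # assumed 1 ns clock cycle time

# Memory stall cycles = instructions x misses per instruction x miss penalty
memory_stall_cycles = instructions * misses_per_ins * miss_penalty

# Execution time = (CPU execution cycles + memory stall cycles) x clock cycle time
execution_time = (cpu_cycles + memory_stall_cycles) * clock_cycle
```

With these numbers the stall cycles equal the CPU cycles, i.e. half the execution time is spent waiting on memory, which is why reducing miss rate and miss penalty matters so much.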
How to reduce the memory stall time:
Reduce miss rate (better cache strategies)
o Multilevel cache with on chip small cache (very fast), possibly set associative, and large
off chip cache, probably direct mapped
Reduce the miss penalty (fast memory)
o Increase bandwidth to main memory (wider bus)
*note
Read Pentium IV cache Organization.
Lesson 04 – Virtual Memory
If the system uses 24-bit addresses, the number of addressable units equals 2^24. How can we have a
larger address space than that? LIMITATION
VM is a concept that emerged to overcome the space limitation in Main Memory.
VM is a technique that allows the execution of processes which are not completely available in memory.
The main advantage of this scheme is that programs can be larger than physical memory. VM is the
separation of logical memory from physical memory.
This separation allows an extremely large virtual memory to be provided for programmers when only a
smaller physical memory is available. Following are situations in which the entire program is not
required to be fully loaded in Main memory:
User-written error handling routines are used only when an error occurs in the data or
computation.
Certain options and features of a program may be used rarely.
Many tables are assigned a fixed amount of address space even though only a small amount of
the table is actually used.
The ability to execute a program that is only partially in memory would confer many benefits:
Fewer I/O operations would be needed to load or swap each user program into main memory.
A program would no longer be constrained by the amount of physical memory that is available.
Each user program could take less physical memory; more programs could be run at the same
time, with a corresponding increase in CPU utilization and throughput.
Since VM was introduced to make Main Memory (MM) appear much larger than its actual size,
programmers can behave as if they have unlimited memory space.
Figure 19: Virtual Memory concept
VM terminology
Page:
o equivalent of a “block”; fixed size
Page faults:
o equivalent of “misses”
Virtual address:
o equivalent of “tag”
No cache-index equivalent: VM is fully associative. A page-table index appears because VM uses a
different (page table) implementation of full associativity.
Physical address:
o translated value of virtual address, can be smaller than virtual address, no equivalent in
caches
Memory mapping (address translation):
o converting virtual to physical addresses, no equivalent in caches
Valid bit:
o Same as in caches
Referenced bit:
o Used to approximate LRU algorithm
Dirty bit:
o Used to optimize write-back
VM
VM fits many programs and their data into the actual MM.
Every program has its own virtual address space starting from zero. Each maintains a separate table,
called the page table, which can be uniquely identified by the Process ID (PID). It maps VM addresses to
cache, MM, and secondary storage addresses. There is another table, called the Translation Lookaside
Buffer (TLB), which keeps the most recently used page numbers. It is a fast semiconductor memory. The
TLBs are identified uniquely by the Process ID (PID). Each program behaves as if it is the only process
running on the CPU.
Figure 20: Virtual address
Figure 21: Virtual address space for the program which has memory blocks of A, B, C, and D
In this manner the program size need not be known beforehand, and the program size can change
dynamically.
Goals of VM:
Illusion of having more physical memory
Program relocation support (relieves programmer burden)
Protection, since one program cannot read/write the data of another
Since this is an indirect mechanism it adds some delay, but the overall performance increases significantly.
Virtual memory implementation techniques:
1. Paged
2. Segmentation
3. Combined
Paged implementation:
The overall program resides in the larger (secondary) memory
Address space divided into virtual pages with equal size
MM is divided into page frames of the same size as the pages in the lower-level memory
Map virtual page to physical page by using page table
TLB is used to keep recently used page numbers
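The page-table mapping at the heart of the paged implementation can be sketched as follows; the page size and table contents are assumed values for illustration.

```python
# Minimal page-table sketch: a virtual address splits into (virtual page
# number, offset), and the page table maps the page number to a physical frame.
PAGE_SIZE = 4096  # assumed 4 KB pages

page_table = {0: 7, 1: 3, 2: 12}  # hypothetical virtual page -> physical frame

def translate(virtual_addr):
    """Translate a virtual address to a physical address via the page table."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError("page fault")  # page is not resident in main memory
    return page_table[vpn] * PAGE_SIZE + offset
```

Because pages and frames share one size, only the page number is translated; the offset passes through unchanged.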
Segmented implementation:
Program is not viewed as a single sequence of instruction and data
Arranged into several modules of code, data, and stacks
Each module is called a segment
Different sizes
Associated with segment registers
o Ex: Stack, Data, Program segment registers
Figure 22: Paging vs Segmentation
*note
A scheme that allows the use of variable size segments can be useful from a programmer's point of
view, since it lends itself to the creation of modular programs, but the operating system now not only
has to keep track of the starting address of each segment, but since they are variable in size, must also
calculate the offset to the end of each segment. Some systems combine paging and segmentation by
implementing segments as variable-size blocks composed of fixed-size pages.
VM design issues:
Miss penalty huge: Access time of disk = millions of cycles
o Highest priority to minimize page faults
o Use write back policy instead of write through. This is called copy-back in VM. For
optimization purposes it uses dirty bit to clarify whether that page is modified and has
to be copied back.
o If there is a page fault, OS schedules another process.
Protection support
o Break up program’s code and data into pages. Add process ID to cache index; use
separate tables for different programs
o OS is called via an exception: handles page faults
How a particular virtual address is mapped to a physical memory address:
Figure 23: Virtual address mapping to physical address
When the CPU requests a certain virtual address of a process, the virtual page number is extracted and
looked up first in the TLB. If found, the content is presented to the CPU. If that page is not in the TLB,
i.e., the content of that address has not been used recently, it is next looked up in the page table. If
found, the content is presented to the CPU; if the entry is invalid in the page table, i.e., the page is not
in MM at all, the page is brought from secondary memory into MM and the content is then presented to
the CPU.
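The lookup order just described (TLB, then page table, then secondary storage) might be sketched as below; all the data structures are hypothetical stand-ins for illustration.

```python
# Sketch of the three-level lookup: TLB hit, page-table hit, or page fault.
tlb        = {2: 9}        # recently used: virtual page -> physical frame
page_table = {1: 4, 2: 9}  # valid entries resident in main memory

def lookup(vpn, trace):
    """Resolve a virtual page number, recording which level satisfied the request."""
    if vpn in tlb:                    # 1. TLB hit: the fastest path
        trace.append("TLB")
        return tlb[vpn]
    if vpn in page_table:             # 2. page-table hit: also refill the TLB
        trace.append("page table")
        tlb[vpn] = page_table[vpn]
        return page_table[vpn]
    trace.append("page fault")        # 3. bring the page from disk into MM
    page_table[vpn] = tlb[vpn] = 99   # assumed frame chosen by the OS
    return 99
```

Each level is slower but larger than the one before it, mirroring the Figure 24/25 hierarchy.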
Figure 24: CPU->TLB->Page table
Figure 25: TLB and caches, action hierarchy
Lesson 04 – Register Transfer Language and Micro-Operations