
M S Ramaiah Institute of Technology

Department of Electronics and Communication Engineering

Embedded System Design and Software

Assignment 1

Darshan Kumar S Yaradoni

USN: 1MS08EC030, Section A

February 20, 2012

Embedded System Design and Software [ECPE17] - Assignment 1

1 What are the characteristics of an Embedded System?

The following are the characteristics of an Embedded System:

1. Single-functioned: An embedded system usually executes a specific program repeatedly. For example, a pager is always a pager. In contrast, a desktop system executes a variety of programs, like spreadsheets, word processors, and video games, with new programs added frequently. An embedded system’s program may be updated with a newer program version, as in the case of cell phones. Several programs may be swapped in and out of a system, as is the case with missiles, which run one program in cruise mode, then load a second program for locking onto a target.

2. Tightly constrained: Design metric constraints on embedded systems are tight. A design metric is a measure of an implementation’s features, such as cost, size, performance and power. Embedded systems often must cost just a few dollars, must be sized to fit on a single chip, must perform fast enough to process data in real time, and must consume minimum power to extend battery life or prevent the necessity of a cooling fan.

3. Reactive and real time: Embedded systems must continually react to changes in the system’s environment and must compute certain results in real time without delay. For example, a car’s cruise controller continually monitors and reacts to speed and brake sensors. It must compute acceleration or deceleration amounts repeatedly within a limited time; a delayed computation could result in a failure to maintain control of the car. In contrast, a desktop system typically focuses on computations, with relatively infrequent (from the computer’s perspective) reactions to input devices. A delay in those computations may cause inconvenience but does not result in system failure.

2 Explain the various metrics that need to be optimized while designing an embedded system.

The various metrics that need to be optimized while designing an embedded system are:

1. NRE cost: NRE (nonrecurring engineering) cost is the one-time monetary cost of designing the system. Once the system is designed, any number of units can be manufactured without incurring any additional design cost; hence the term nonrecurring.

2. Unit Cost: The monetary cost of manufacturing each copy of the system, excluding the NRE cost.

3. Size: The physical space required by the system, often measured in bytes for software, and gates or transistors for hardware.

4. Performance: The execution time of the system.

5. Power: The amount of power consumed by the system, which may determine the lifetime of a battery, or the cooling requirements of the IC, since more power means more heat.

6. Flexibility: The ability to change the functionality of the system without incurring heavy NRE cost. Software is typically considered to be very flexible.



7. Time-to-prototype: The time needed to build a working version of the system, which may be bigger or more expensive than the final system implementation, but it can be used to verify the system’s usefulness and correctness and to refine the system’s functionality.

8. Time-to-market: The time required to develop a system to the point that it can be released and sold to customers. The main contributors are design time, manufacturing time and testing time.

9. Maintainability: The ability to modify the system after its initial release, especially by designers who did not originally design the system.

10. Correctness: Our confidence that we have implemented the system’s functionality correctly. The functionality can be checked throughout the process of designing the system, and test circuitry can be inserted to check that manufacturing was correct.

11. Safety: The probability that the system will not cause harm.

Figure 1: Design metric competition - improving one may worsen others.

Metrics typically compete with one another. Improving one may worsen others. For example, reducing an implementation’s size may have an adverse effect on performance. This can be compared to a wheel with numerous pins, as shown in Fig: 1. If one of the pins is pushed in, the others pop out.

4 Derive an equation for percentage revenue loss for any market rise angle θ. A product was delayed by 4 weeks in releasing to market. The peak revenue for the product for on-time entry to market would occur after 20 weeks for a market rise angle of 45°. Determine the percentage revenue loss.

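Let θ be the rise angle, W the time at which the market peaks (half the product life of 2W) and D the delay in entering the market. Let h1 be the height of the on-time entry triangle and h2 the height of the delayed entry triangle.

From Fig: 2, h1 = W tan θ and h2 = (W − D) tan θ.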

Area of on-time entry triangle:

\[
A_{\text{on-time}} = \tfrac{1}{2} \times \text{base} \times \text{height}
= \tfrac{1}{2} \times 2W \times W\tan\theta
= W^{2}\tan\theta
\]



Figure 2: Simplified Revenue Model for computing revenue loss from delayed entry

Area of delayed entry triangle:

\[
A_{\text{delayed}} = \tfrac{1}{2} \times (2W - D) \times (W - D)\tan\theta
= \frac{(2W^{2} - 3WD + D^{2})\tan\theta}{2}
\]

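Percentage Revenue Loss:

\[
\begin{aligned}
\text{Loss} &= \frac{W^{2}\tan\theta - \dfrac{(2W^{2} - 3WD + D^{2})\tan\theta}{2}}{W^{2}\tan\theta} \times 100\% \\
&= \frac{2W^{2}\tan\theta - 2W^{2}\tan\theta + 3WD\tan\theta - D^{2}\tan\theta}{2W^{2}\tan\theta} \times 100\% \\
&= \frac{D\tan\theta\,(3W - D)}{2W^{2}\tan\theta} \times 100\% \\
&= \frac{D(3W - D)}{2W^{2}} \times 100\%
\end{aligned}
\]

The tan θ terms cancel, so the percentage revenue loss does not depend on the rise angle.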

Given: D = 4 weeks and W = 20 weeks.

\[
\frac{D(3W - D)}{2W^{2}} \times 100\% = \frac{4\,(3(20) - 4)}{2(20)^{2}} \times 100\% = 28\%
\]



6 What is a market window? Why is it important for products to reach market early in the window?

Time-to-market design has become especially demanding in recent years. Introducing an embedded system to the marketplace early can make a big difference in the system’s profitability, since market windows are becoming quite short, with such windows often measured in months.

Figure 3: Market Window

Fig: 3 shows a simple market window, the period during which a product would have its highest sales. Missing this window, which means that the product begins being sold further to the right on the time scale, can mean a significant loss in sales. In some cases, each day that a product is delayed from introduction to market can translate to a one-million-dollar loss. The average time-to-market constraint has been reported as having shrunk to only 8 months. Embedded system complexities are growing due to increasing IC capacities. Such rapid growth in IC capacity translates into pressure on the designer to add more functionality to a system.

As shown in Fig: 2, the peak of the market occurs at the halfway point (denoted as W) of the product life. The peak occurs at the same time even for delayed entry. As derived earlier, delayed entry results in a percentage revenue loss given by:

\[
\frac{D(3W - D)}{2W^{2}} \times 100\%
\]

To understand the importance of on-time entry, consider a product whose lifetime is 52 weeks, so that W = 26 weeks. Let the product enter the market with a delay of 4 weeks. We therefore have:

\[
\frac{4\,(3(26) - 4)}{2(26)^{2}} \times 100\% = 21.89\%
\]

Reaching the market late has a larger negative effect on revenues than development cost overruns or even a product price that is too high.
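The two worked revenue-loss figures above can be checked with a few lines of Python. This is only a sketch of the simplified triangle model; the function name `revenue_loss_percent` and the restriction 0 ≤ D ≤ W are assumptions made here for illustration.

```python
def revenue_loss_percent(delay, peak_time):
    """Percentage revenue loss D(3W - D) / (2W^2) * 100 for the triangle model.

    delay      -- D, delay in entering the market (weeks)
    peak_time  -- W, time of the market peak, half the product life (weeks)
    """
    if not 0 <= delay <= peak_time:
        raise ValueError("the model is assumed to hold only for 0 <= D <= W")
    return 100 * delay * (3 * peak_time - delay) / (2 * peak_time ** 2)


# Question 4: D = 4 weeks, W = 20 weeks
print(round(revenue_loss_percent(4, 20), 2))
# Question 6: product life of 52 weeks, so W = 26 weeks, D = 4 weeks
print(round(revenue_loss_percent(4, 26), 2))
```

Because the rise angle cancels in the derivation, it does not appear as a parameter.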

7 Explain the three main processor technologies that can be used with embedded systems. Also highlight the benefits of each.



Figure 4: Processor Customization and General Purpose Processor Architecture. (a) Processors vary in their customization for the problem at hand. (b) General-purpose processor.

Processor technology relates to the architecture of the computation engine used to implement a system’s desired functionality. Each processor differs in its specialization towards a particular function, thus manifesting design metrics different from those of other processors. This concept is illustrated graphically in Fig: 4a. The application requires a specific embedded functionality, symbolized as a cross, such as the summing of items in an array.

The three main processor technologies are:

1. General-Purpose Processors - Software: The designer of a general-purpose processor, or microprocessor, builds a programmable device that is suitable for a variety of applications, to maximize the number of devices sold. One feature of such a processor is a program memory. The designer of such a processor does not know what program will run on the processor, so the program cannot be built into the digital circuit. Another feature is a general datapath. The datapath must be general enough to handle a variety of computations, so such a datapath typically has a large register file and one or more general-purpose ALUs. An embedded system designer need not be concerned about the design of a general-purpose processor. The designer simply uses it by programming the processor’s memory to carry out the required functionality. This part of the implementation is referred to as "software".

Fig: 4b illustrates the architecture of a general-purpose processor. The functionality is stored in a program memory. The controller fetches the current instruction, as indicated by the program counter (PC), into the instruction register (IR). It then configures the datapath for this instruction and executes the instruction. It then determines the next instruction address, sets the PC to this address, and fetches again.

Benefits of general-purpose processor:

(a) Low time-to-market and low NRE costs: The designer must only write a program and need not do any digital design.

(b) High Flexibility: Changing function requires changing only the program.

(c) Unit cost may be low in small quantities compared with designing our own processor.

(d) Performance may be fast for computation-intensive applications, if using a fast processor.

2. Single-Purpose Processors - Hardware: A single-purpose processor is a digital circuit designed to execute exactly one program. For example, in a digital camera, all components other than the microcontroller are single-purpose processors. The JPEG codec executes a single program that compresses and decompresses video frames.

Figure 5: Single-Purpose and Application-Specific Processor Architectures. (a) Single-purpose processor. (b) Application-specific processor.

An embedded system designer may create a single-purpose processor by:

(a) designing a custom digital circuit, or by

(b) purchasing a predesigned single-purpose processor.

This part of the implementation is the "hardware" portion. Using a single-purpose processor in an embedded system results in several design metric benefits. Fig: 5a shows the architecture of a single-purpose processor. The datapath contains only the essential components for this program (i.e., summing the elements of an array): two registers and an adder. Since the processor executes only this one program, we hardwire the program’s instructions directly into the control logic and use a state register to step through these instructions. Therefore program memory is not necessary.

Benefits of single-purpose processor:

(a) Low unit cost for large quantities.

(b) Fast performance.

(c) Small size and low power.

3. Application-Specific Processors: An application-specific instruction-set processor (ASIP) can serve as a compromise between the other processor options. An ASIP is a programmable processor optimized for a particular class of applications having common characteristics, such as embedded control, digital signal processing or telecommunications. The designer can optimize the datapath for the application class, adding special functional units for common operations and eliminating other infrequently used units. Examples of ASIPs are microcontrollers and DSPs.

• Microcontrollers:

A microcontroller is a microprocessor that has been optimized for embedded control applications. Such applications typically monitor and set numerous single-bit control signals but do not perform large amounts of data computation. Microcontrollers incorporate on the microprocessor chip several peripheral components common in control applications. This enables single-chip implementations and hence smaller and lower-cost products.

• DSPs:

A DSP is a microprocessor designed to perform common operations on digital signals, which are digital encodings of analog signals like video and audio. These operations carry out common signal processing tasks like signal filtering, transformation or a combination of tasks. A DSP may include special hardware to fetch sequential data memory locations in parallel with other operations, to improve speed.

Fig: 5b illustrates the general architecture of an ASIP. The datapath may be customized for the example. It may have an autoincrementing register, a path that allows us to add a register to a memory location in one instruction, fewer registers and a simpler controller.

Benefits of application-specific processors:

(a) Flexibility.

(b) Good performance.

(c) Small size and low power.
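The fetch, decode and execute cycle described for the general-purpose processor can be sketched as a small interpreter. This is only an illustration: the opcode names (`LOADI`, `LOADX`, `ADD`, `ADDI`, `JLT`, `HALT`), the four-register file and the helper `sum_program` are hypothetical and are not taken from any real instruction set. The program sums the items of an array, the example functionality used in Fig: 4.

```python
def run(program, data_memory):
    """Execute `program` on a toy general-purpose processor and return its registers."""
    regs = [0, 0, 0, 0]   # small general-purpose register file
    pc = 0                # program counter
    while True:
        ir = program[pc]  # fetch: instruction addressed by PC goes into IR
        pc += 1           # default next-instruction address
        op, *args = ir    # decode
        if op == "LOADI":             # regs[a] = constant
            regs[args[0]] = args[1]
        elif op == "LOADX":           # regs[a] = data_memory[regs[b]]
            regs[args[0]] = data_memory[regs[args[1]]]
        elif op == "ADD":             # regs[a] = regs[a] + regs[b]
            regs[args[0]] += regs[args[1]]
        elif op == "ADDI":            # regs[a] = regs[a] + constant
            regs[args[0]] += args[1]
        elif op == "JLT":             # if regs[a] < regs[b], set PC to target
            if regs[args[0]] < regs[args[1]]:
                pc = args[2]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown opcode {op!r}")


def sum_program(n):
    """Program that sums data_memory[0..n-1] into register 0 (assumes n >= 1)."""
    return [
        ("LOADI", 0, 0),    # 0: total = 0
        ("LOADI", 1, 0),    # 1: i = 0
        ("LOADI", 2, n),    # 2: limit = n
        ("LOADX", 3, 1),    # 3: tmp = data_memory[i]
        ("ADD", 0, 3),      # 4: total += tmp
        ("ADDI", 1, 1),     # 5: i += 1
        ("JLT", 1, 2, 3),   # 6: loop back to 3 while i < limit
        ("HALT",),          # 7: stop
    ]


data = [3, 1, 4, 1, 5]
print(run(sum_program(len(data)), data)[0])
```

Changing the behaviour only requires a different `program` list, which mirrors the flexibility benefit listed for general-purpose processors. A single-purpose processor would instead hardwire these steps into its control logic.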

12 Explain how the top-down design process improves the productivity.

In the top-down design process, the designer refines the system through several abstraction levels. At the system level, the designer describes the desired functionality in some executable language like C; this is called the system specification. The designer refines this specification by distributing portions of it among several general-purpose and/or single-purpose processors, yielding behavioural specifications for each processor. These are then refined into register-transfer (RT) specifications by converting behaviour on general-purpose processors to assembly code, and by converting behaviour on single-purpose processors to a connection of register-transfer components and state machines. These are further refined into a logic specification consisting of Boolean equations. Finally, these are refined into an implementation, consisting of machine code for general-purpose processors and a gate-level netlist for single-purpose processors.

There are three main approaches to improve the productivity in a design process:

1. Compilation/Synthesis.

2. Libraries/IP.

3. Test/Verification.

1. Compilation/Synthesis: Compilation/Synthesis lets a designer specify desired functionality in an abstract manner and automatically generates lower-level implementation details. Describing a system at high abstraction levels can improve productivity by reducing the amount of detail that a designer must specify. The various tools and their functions are shown in Table 1.

2. Libraries/IP: Libraries involve reuse of preexisting implementations. Using libraries of existing implementations can improve productivity if the time it takes to find, acquire, integrate and test a library is less than that of designing the item oneself. The libraries and their contents are shown in Table 2.



| Sl. No | Tool | Function |
|---|---|---|
| 1 | Logic synthesis tool | Converts Boolean expressions into a connection of logic gates, called a netlist. |
| 2 | Register-transfer (RT) synthesis tool | Converts FSMs and register transfers into a datapath of RT components and a controller of Boolean equations. |
| 3 | Behavioural synthesis tool | Converts a sequential program into FSMs and register transfers. |
| 4 | Software compiler | Converts a sequential program to assembly code. |
| 5 | System synthesis tool | Converts an abstract system specification into a set of sequential programs on general-purpose and single-purpose processors. |

Table 1: Tools used for improving productivity

| Sl. No | Library | Contents |
|---|---|---|
| 1 | Logic-level library | Layouts for gates and cells. |
| 2 | RT-level library | Layouts for RT components. |
| 3 | Behavioural-level library | Commonly used components, such as compression components, bus interfaces etc. |
| 4 | System-level library | Complete systems solving particular problems, such as an interconnection of processors with accompanying operating systems and programs to implement an interface to the Internet. |

Table 2: Libraries and their contents

3. Test/Verification: Testing ensures functionality is correct. It prevents time-consuming debugging at low abstraction levels and iteration back to high abstraction levels. Simulation is the most common method of testing for correct functionality. The various types of simulators and their functions are shown in Table 3.

| Level | Simulators | Function |
|---|---|---|
| Logic level | Gate-level simulators | Output signal timing waveforms, given input signal waveforms. |
| Logic level | General-purpose processor simulators | Execute machine code. |
| RT level | HDL simulators | Execute RT-level descriptions and provide output waveforms, given input waveforms. |
| Behavioural level | HDL simulators | Simulate sequential programs. |
| Behavioural level | Co-simulators | Connect HDL and general-purpose processor simulators to enable hardware/software coverification. |
| System level | Model simulators | Simulate the initial system specification using an abstract computation model, independent of any processor technology, to verify correctness and completeness of the specification. |
| System level | Model checkers | Verify certain properties of the specification, such as ensuring that certain simultaneous conditions never occur or that the system does not deadlock. |

Table 3: Simulators and their functions

14 Given the details in Table 4, draw the graphs of total cost vs. volume and per-product cost vs. volume. Make a table for volumes 400, 800, 1600, 2000 and 2400 for all the three technologies.

| Metric | Technology A | Technology B | Technology C |
|---|---|---|---|
| NRE cost | $2,000 | $30,000 | $100,000 |
| Unit cost | $100 | $30 | $2 |

Table 4: Technology Specifications

The graph of total cost vs. volume is shown in Fig: 6. The graph of per-product cost vs. volume is shown in Fig: 7. The MATLAB code used to generate Fig: 6 and Fig: 7 is shown below:

    % To generate graph of total cost vs. volume
    x = 0:400:3200;            % volume (number of units)
    figure;
    y = 2000 + (100*x);        % Total cost = NRE cost + unit cost * number of units
    plot(x, y, '--rs', 'LineWidth', 2, ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', 'g', ...
        'MarkerSize', 10);
    hold on;
    y = 30000 + (30*x);
    plot(x, y, '--d', 'LineWidth', 2, ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', 'b', ...
        'MarkerSize', 10);
    y = 100000 + (2*x);
    plot(x, y, '--mx', 'LineWidth', 2, ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', 'c', ...
        'MarkerSize', 10);
    legend('Technology A', 'Technology B', 'Technology C');
    hold off;

    % To generate graph of per-product cost vs. volume
    x = 400:400:3200;          % start at 400 units to avoid dividing by zero
    figure;                    % separate window so Fig: 6 is not overwritten
    y = (2000./x) + 100;       % Per-product cost = NRE cost / number of units + unit cost
    plot(x, y, '--rs', 'LineWidth', 2, ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', 'g', ...
        'MarkerSize', 10);
    hold on;
    y = (30000./x) + 30;
    plot(x, y, '--d', 'LineWidth', 2, ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', 'b', ...
        'MarkerSize', 10);
    y = (100000./x) + 2;
    plot(x, y, '--mx', 'LineWidth', 2, ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', 'c', ...
        'MarkerSize', 10);
    legend('Technology A', 'Technology B', 'Technology C');
    hold off;

Figure 6: Graph of total cost vs. volume

Figure 7: Graph of per-product cost vs. volume


The total cost and per-product cost for volumes of 400, 800, 1200, 1600, 2000 and 2400 are given in Table 5.

| Volume | Total cost ($), Technology A | Total cost ($), Technology B | Total cost ($), Technology C | Per-product cost ($), Technology A | Per-product cost ($), Technology B | Per-product cost ($), Technology C |
|---|---|---|---|---|---|---|
| 400 | 42,000 | 42,000 | 100,800 | 105.00 | 105.00 | 252.00 |
| 800 | 82,000 | 54,000 | 101,600 | 102.50 | 67.50 | 127.00 |
| 1200 | 122,000 | 66,000 | 102,400 | 101.67 | 55.00 | 85.33 |
| 1600 | 162,000 | 78,000 | 103,200 | 101.25 | 48.75 | 64.50 |
| 2000 | 202,000 | 90,000 | 104,000 | 101.00 | 45.00 | 52.00 |
| 2400 | 242,000 | 102,000 | 104,800 | 100.83 | 42.50 | 43.67 |

Table 5: Total cost and per-product costs
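The entries of Table 5 follow directly from the two cost formulas, total cost = NRE cost + unit cost × volume and per-product cost = total cost / volume. The sketch below recomputes them from the Table 4 figures; the names `TECHNOLOGIES`, `total_cost` and `per_product_cost` are chosen here for illustration only.

```python
# (NRE cost, unit cost) in dollars, taken from Table 4.
TECHNOLOGIES = {"A": (2_000, 100), "B": (30_000, 30), "C": (100_000, 2)}


def total_cost(nre, unit, volume):
    """Total cost = NRE cost + unit cost * number of units."""
    return nre + unit * volume


def per_product_cost(nre, unit, volume):
    """Per-product cost = total cost / number of units (volume must be > 0)."""
    return total_cost(nre, unit, volume) / volume


for volume in (400, 800, 1200, 1600, 2000, 2400):
    cells = [f"{volume:>6}"]
    for name, (nre, unit) in TECHNOLOGIES.items():
        cells.append(
            f"{name}: {total_cost(nre, unit, volume):>8,} "
            f"({per_product_cost(nre, unit, volume):.2f} each)"
        )
    print(" | ".join(cells))
```

At 400 units technologies A and B cost the same, and beyond that volume B is cheaper than A, which is the crossover visible in the total cost graph.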

Figure 8: Successive Approximation

15 Assume 8-bit encoding of input voltage in the range -5V to +5V. Calculate the encoding for 1.2V and trace the successive approximation approach to find the correct encoding. What is the resolution of the conversion? Extend the ratio and resolution equations to any applied voltage in the range Vmin to Vmax.

The successive approximation approach is illustrated in Fig: 8. In Fig: 8, Vmax′ and Vmin′ are the next-stage Vmax and Vmin values respectively. From the figure, it follows that the encoding for 1.2V is 10011110, whose decimal equivalent is 158.

Resolution of the conversion is: 10V / 255 ≈ 0.0392V.

The ratio and resolution equations are extended to any voltage in the range Vmin to Vmax as:

Ratio equation:

\[
\frac{e - V_{min}}{V_{max} - V_{min}} = \frac{d}{2^{n} - 1}
\]

where Vmax and Vmin represent the maximum and minimum analog signal voltage respectively, n is the number of bits available for conversion, d is the digital encoding and e is the present analog voltage.

Resolution of conversion:

\[
\frac{V_{max} - V_{min}}{2^{n} - 1}
\]

where Vmax and Vmin represent the maximum and minimum analog signal voltage respectively, and n is the number of bits available for conversion.
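The successive approximation trace can also be reproduced in code. The sketch below halves the voltage range once per bit and emits a 1 when the input lies in the upper half; the function names and the choice to treat an input equal to the midpoint as a 1 are assumptions made for illustration.

```python
def successive_approximation(voltage, v_min, v_max, bits):
    """Return the digital encoding found by halving [v_min, v_max] once per bit."""
    code = 0
    low, high = v_min, v_max
    for _ in range(bits):
        mid = (low + high) / 2
        if voltage >= mid:          # input is in the upper half: next bit is 1
            code = (code << 1) | 1
            low = mid               # next-stage Vmin
        else:                       # input is in the lower half: next bit is 0
            code = code << 1
            high = mid              # next-stage Vmax
    return code


def resolution(v_min, v_max, bits):
    """Voltage step per code, (Vmax - Vmin) / (2^n - 1)."""
    return (v_max - v_min) / (2 ** bits - 1)


code = successive_approximation(1.2, -5.0, 5.0, 8)
print(format(code, "08b"), code)
print(round(resolution(-5.0, 5.0, 8), 5))
```

For 1.2V the midpoints visited are 0, 2.5, 1.25, 0.625, 0.9375, 1.09375, 1.171875 and 1.2109375, giving the bits 1, 0, 0, 1, 1, 1, 1, 0. The halving search and the ratio equation agree for this input, although in general they can differ by one code because one divides the range into 2^n intervals and the other into 2^n − 1 steps.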

16 Illustrate how program and data memory fetches can be overlapped in Harvard architecture.

In a Harvard architecture, the program memory space is distinct from the data memory space. Hence, instruction and data fetches can be performed simultaneously.

Consider 2 instructions in a pipelined machine:

MOV R1, 30

ADD R2, R3

Cycle 1: Fetch PM[PC] into IR where PC = 0;

Cycle 2: Fetch DM[30] and store in R1.

Fetch PM[PC] into IR where PC = 1;

Program and data memory fetches are overlapped in cycle 2, i.e., program and data fetches occur simultaneously.
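The overlap can be made explicit by listing which memory is accessed in each cycle. The sketch below assumes a hypothetical two-stage pipeline in which instruction i is fetched from program memory in cycle i + 1 and performs its data memory access, if any, in cycle i + 2; the function `schedule` and the tuple format are illustrative only.

```python
def schedule(instructions):
    """Map each cycle number to the memory accesses performed in that cycle.

    `instructions` is a list of (mnemonic, data_address) pairs, where
    data_address is None when the instruction does not touch data memory.
    """
    cycles = {}
    for i, (mnemonic, data_address) in enumerate(instructions):
        # Instruction fetch uses the program memory bus.
        cycles.setdefault(i + 1, []).append(f"PM[{i}] -> IR ({mnemonic})")
        if data_address is not None:
            # Operand fetch uses the separate data memory bus one cycle later.
            cycles.setdefault(i + 2, []).append(
                f"DM[{data_address}] -> register ({mnemonic})"
            )
    return cycles


program = [("MOV", 30), ("ADD", None)]   # MOV loads DM[30]; ADD uses registers only
timeline = schedule(program)
for cycle in sorted(timeline):
    print(f"Cycle {cycle}: " + " and ".join(timeline[cycle]))
```

Cycle 2 contains two accesses, one to each memory. With a single shared memory these two accesses would have to be serialized.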

18 Write a simple algorithm for finding GCD of two integer numbers. Write the FSMD for this algorithm and explain how it can be optimized and write the optimized FSMD and its advantages.

Algorithm for finding GCD of two integers:

    0: int x, y;
    1: while (1) {
    2:   while (!go_i);
    3:   x = x_i;
    4:   y = y_i;
    5:   while (x != y) {
    6:     if (x < y)
    7:       y = y - x;
           else
    8:       x = x - y;
         }
    9:   d_o = x;
       }

FSMD for the GCD algorithm is shown in Fig: 9a.



Figure 9: FSMD for GCD and its optimization. (a) FSMD for GCD algorithm. (b) (1) Original FSMD for GCD algorithm and (2) optimized FSMD.

Optimizing the FSMD:

Consider the original FSMD for GCD, which is redrawn in Fig: 9b(1). State 1 is not necessary since its outgoing transitions have constant values. States 2 and 2-J can be merged since there are no loop operations. States 3 and 4 can be merged since they perform assignment operations that are independent of one another. States 5 and 6 can be merged. States 6-J and 5-J can be eliminated, with the transitions from states 7 and 8 pointing directly to state 5. State 1-J can be eliminated. The resulting optimized FSMD is shown in Fig: 9b(2).

The advantage of optimizing the FSMD is that the resulting FSMD has fewer states, which simplifies the design of the processor.
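The behaviour of the optimized FSMD can be checked by stepping through it in software. The sketch below is one reading of the merges described above, since the figure itself is not reproduced here: the state names are hypothetical, the wait for go_i (state 2) is assumed to have completed, and the inputs are assumed to be positive integers.

```python
def gcd_fsmd(x_i, y_i):
    """Step through the optimized FSMD once go_i is asserted.

    Returns (d_o, cycles), where cycles counts one clock per state visited.
    """
    if x_i <= 0 or y_i <= 0:
        raise ValueError("subtraction-based GCD assumes positive inputs")
    state, cycles = "load", 0
    x = y = d_o = None
    while state != "done":
        cycles += 1
        if state == "load":          # merged states 3 and 4: independent assignments
            x, y = x_i, y_i
            state = "compare"
        elif state == "compare":     # merged states 5 and 6: loop test and branch
            if x < y:
                state = "sub_y"
            elif x > y:
                state = "sub_x"
            else:
                state = "output"
        elif state == "sub_y":       # state 7, returns directly to the comparison
            y = y - x
            state = "compare"
        elif state == "sub_x":       # state 8, returns directly to the comparison
            x = x - y
            state = "compare"
        elif state == "output":      # state 9
            d_o = x
            state = "done"
    return d_o, cycles


print(gcd_fsmd(12, 8))
print(gcd_fsmd(48, 18)[0])
```

Each loop iteration costs two cycles (compare, then subtract) instead of passing through the extra join states of the original FSMD.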

20 With a neat diagram explain how a pulse width modulator works. What are the considerations in selecting the clock, the prescaler, and the counter? Assuming an 8-bit up-counter, calculate the count to be loaded in the ’cycle-high’ register to get pulses of duty cycle 75%.

A pulse width modulator (PWM) generates an output signal that repeatedly switches between high and low values. The duration of the high value and low value is controlled by indicating the desired period and the desired duty cycle, which is the percentage of time the signal is high compared to the signal’s period. The pulse’s width corresponds to the pulse’s high time, as shown in Fig: 10.

The PWM makes use of the fact that a DC motor does not come to an immediate stop when its input voltage is lowered to 0, but rather it coasts. Using a PWM, we can set the duty cycle to achieve the appropriate average voltage and we can also set the period small enough for smooth operation of the motor. Assuming the PWM’s output is 5V when high and 0V when low, we can obtain an average output of 1.25V by setting the duty cycle to 25%, since 5V*25% = 1.25V. This duty cycle is shown in Fig: 10(a). An average output of 2.50V can be obtained by setting the duty cycle to 50%, as shown in Fig: 10(b). A duty cycle of 75% would result in an average output of 3.75V, as shown in Fig: 10(c).



Figure 10: Operation of a PWM (a) 25% duty cycle (b) 50% duty cycle (c) 75% duty cycle

Figure 11: Controlling a DC motor with a PWM: (a) relationship between applied voltage and DC motor speed, (b) internal PWM architecture, (c) pseudo-code, (d) connection to DC motor.

The speed of a DC motor is proportional to the voltage applied to the motor. Suppose that for a fixed load, the motor yields the revolutions per minute (rpm) shown in Fig: 11(a) for the given input voltages. Suppose that we use a PWM as part of a system that includes two 8-bit registers called clk_div and cycle_high, an 8-bit counter and an 8-bit comparator, as shown in Fig: 11(b). The PWM works as follows. Initially, the value of clk_div is loaded into the register. The clk_div register works as a clock divider. After a specified amount of time has elapsed, a pulse is sent to the counter register. This causes the counter to increment itself. The comparator looks at the values in the counter register and the cycle_high register. When the counter value is less than cycle_high, a 1 (+5V) is output. When the counter value is higher than the value in cycle_high, a 0 (0V) is output. When the counter value reaches 254, the counter is reset to 0 and the process repeats. Thus, clk_div determines the PWM’s period, specifying the number of cycles in the period, and cycle_high determines the duty cycle, indicating how many of a period’s cycles should output a 1.

To determine the value of clk_div, i.e., the prescaler, various values are tried and tested to see if the frequency is too fast or too slow for our particular motor. For the motor to run at 6,900rpm, we need a 75% duty cycle. Therefore, 254*0.75 = 190.5, which is rounded up to 191. This value, i.e., BFh, is loaded into the cycle_high register.
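The 75% calculation and the comparator behaviour can be checked with a short simulation. This is a sketch under stated assumptions: the counter is taken to run from 0 to 254 as described above, a counter value equal to cycle_high is treated as low (the text leaves that case open), and the names `COUNTER_MAX`, `cycle_high_for` and `simulate_period` are made up for illustration.

```python
import math

COUNTER_MAX = 254   # the counter resets to 0 after reaching this value


def cycle_high_for(duty_cycle):
    """Value to load into cycle_high, rounding half up as in 254 * 0.75 -> 191."""
    return math.floor(COUNTER_MAX * duty_cycle + 0.5)


def simulate_period(cycle_high):
    """Comparator output for one counter period: 1 while counter < cycle_high."""
    return [1 if counter < cycle_high else 0 for counter in range(COUNTER_MAX + 1)]


value = cycle_high_for(0.75)
outputs = simulate_period(value)
high_fraction = sum(outputs) / len(outputs)

print(value, hex(value))
print(round(high_fraction, 3))          # fraction of the period spent high
print(round(5.0 * high_fraction, 2))    # average output voltage for a 5V PWM
```

With 191 loaded, the output is high for 191 of the 255 counter values, which is within a fraction of a percent of the requested 75%.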

22 Explain the concept of data path in the embedded systems.

The datapath in embedded systems consists of the circuitry for transforming data and for storing temporary data. The datapath contains an ALU capable of transforming data through operations such as addition, subtraction, logical AND, logical OR, inverting and shifting. The ALU generates status signals, stored in a status register, indicating particular data conditions. Such conditions include indicating whether data is zero or whether an addition of two data items generates a carry. The datapath also contains registers capable of storing temporary data. Temporary data may include data brought in from memory but not yet sent through the ALU, data coming from the ALU that will be needed for later ALU operations or will be sent back to memory, and data that must be moved from one memory location to another. The internal bus carries data within the datapath, while the external data bus carries data to and from the data memory.

The size of a processor is measured as the bit-width of the datapath components. A bit is the processor’s basic data unit, representing either a 0 or a 1, while 8 bits are referred to as a byte. An N-bit processor may have N-bit-wide registers, an N-bit-wide ALU, an N-bit-wide internal bus over which data moves among datapath components, and an N-bit-wide external bus over which data is brought in and out of the datapath.
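The effect of the datapath width on an ALU operation and its status signals can be illustrated with an N-bit addition. The function `alu_add` and the flag names below are hypothetical; the sketch only models the carry and zero conditions mentioned above.

```python
def alu_add(a, b, n_bits):
    """Add two operands on an n_bits-wide datapath.

    Returns (result, flags), where the result is truncated to n_bits and the
    flags mimic a status register with carry and zero bits.
    """
    mask = (1 << n_bits) - 1            # largest value an N-bit register holds
    total = (a & mask) + (b & mask)     # operands are limited to N bits
    result = total & mask               # bits beyond N are lost from the result
    flags = {"carry": total > mask, "zero": result == 0}
    return result, flags


print(alu_add(200, 100, 8))    # does not fit in 8 bits: carry is set
print(alu_add(200, 100, 16))   # fits in 16 bits: no carry
print(alu_add(128, 128, 8))    # wraps to 0: carry and zero are both set
```

The same operands give different results and status signals on an 8-bit and a 16-bit datapath.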

24 Explain the terms: Dhrystone Benchmark, Linker and Moore’s law.

1. Dhrystone Benchmark: A benchmark is a program intended to be run on different processors to compare their performance. The Dhrystone benchmark was originally developed in 1984 by Reinhold Weicker specifically as a performance benchmark; it performs no useful work. It focuses on exercising a processor’s integer arithmetic and string-handling capabilities. Its current version is written in C and is in the public domain. It is typically executed thousands of times, and a processor is said to be able to execute so many Dhrystones per second. Another commonly used speed comparison unit is based on the Dhrystone and is called MIPS. Its origin is based on the speed of Digital’s VAX 11/780, which could execute 1,757 Dhrystones/second. Thus, for a VAX 11/780, 1 MIPS = 1,757 Dhrystones/second. This unit is commonly referred to as Dhrystone MIPS.

2. Linker: A linker allows a programmer to create a program in separately assembled or compiled files; it combines the machine instructions of each into a single program, incorporating instructions from standard library routines. A linker designed for embedded processors will also try to eliminate binary code associated with uncalled procedures and functions, as well as memory allocated to unused variables, in order to reduce the overall program footprint.

3. Moore’s Law: The most important trend in embedded systems is a trend related to ICs: IC transistor capacity has doubled roughly every 18 months for the past several decades. This trend was predicted by Intel co-founder Gordon Moore in 1965. He predicted that semiconductor transistor density would double every 18 to 24 months. The trend is therefore known as Moore’s law. The trend is mainly caused by improvements in IC manufacturing that result in smaller parts, such as transistor parts and wires, on the surface of the IC.
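Two of the figures quoted above lend themselves to a quick calculation: converting a Dhrystones/second score into Dhrystone MIPS, and estimating capacity growth under a fixed doubling period. The sketch below uses the 1,757 Dhrystones/second reference and the 18-month doubling period from the text; the function names and the sample inputs are hypothetical.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757   # reference machine, defined as 1 MIPS


def dhrystone_mips(dhrystones_per_second):
    """Convert a Dhrystones/second score into Dhrystone MIPS."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


def capacity_growth(years, doubling_months=18):
    """Growth factor in transistor capacity after `years` of steady doubling."""
    return 2 ** (years * 12 / doubling_months)


print(dhrystone_mips(1757))       # the VAX 11/780 itself
print(dhrystone_mips(175_700))    # a hypothetical processor 100 times faster
print(capacity_growth(3))         # two doublings in 3 years at 18 months each
```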

28 Design a single purpose processor that outputs Fibonacci numbers up to n. Start with a function computing the desired result, translate it into a state diagram and sketch a possible data path.

The algorithm that outputs Fibonacci numbers up to n is given below:

     0: int n1, n2, count, n, temp;
     1: while (1)
        {
     2:   while (!go_i);
     3:   n1 = 1;
     4:   n2 = 1;
     5:   n = n_i;
     6:   count = 0;
     7:   while (count < n)
          {
     8:     if (count != 0 && count != 1)
            {
     9:       temp = n1;        // save n1
    10:       n1 = n2;
    11:       n2 = n1 + temp;   // next Fibonacci number
            }
    12:     fb_o = n2;          // output n2 as the Fibonacci number
    13:     count = count + 1;  // increment counter
          }
        }

The state diagram is shown in Fig: 12 and the probable datapath is shown in Fig: 13.

29 What is memory hierarchy? How does cache operate? Discuss the cache mapping techniques and bring out their merits and shortcomings.

In most systems, inexpensive and fast memory is required. But inexpensive memory tends to be slow, whereas fast memory tends to be expensive. The solution to this problem is to create a memory hierarchy, as illustrated in Fig: 14.

An inexpensive but slow memory is used to store all the program and data. A small amount of fast but expensive cache memory is used to store copies of likely accessed parts of main memory. A two-level cache scheme is commonly used.



Figure 12: State diagram for generating Fibonacci numbers.

Figure 13: Probable datapath for generating Fibonacci numbers.



Figure 14: An example memory hierarchy

Cache Operation: A cache operates as follows. When we want the processor to access a main memory address, we first check for a copy of that location in the cache. If the copy is in the cache, called a cache hit, then we can access it quickly. If the copy is not there, called a cache miss, then we must first read the address and perhaps some of its neighbours into the cache.

Cache Mapping Techniques: Cache mapping is the method for assigning main memory addresses to the far fewer available cache addresses, and for determining whether a particular main memory address’s contents are in the cache. Cache mapping can be accomplished using one of three basic techniques:

1. Direct mapping: In this technique, the main memory address is divided into two fields, the index and the tag. The index represents the cache address, and thus the number of index bits is determined by the cache size, i.e., index size = log2(cache size). Many different main memory addresses will map to the same cache address. When we store a main memory address’s content in the cache, we also store the tag. To determine if a desired main memory address is in the cache, we go to the cache address indicated by the index, and we then compare the tag there with the desired tag. If the tags match, then we check the valid bit. The valid bit indicates whether the data stored in that cache slot has previously been loaded into the cache from the main memory. We use the offset portion of the memory address to grab a particular word within the cache line. A cache line, also known as a cache block, is the number of (inseparable) adjacent memory addresses loaded from or stored into the main memory at a time. This technique is illustrated in Fig: 15.

2. Fully-associative mapping: In this technique, each cache address contains not only a main memory address’s content, but also the complete main memory address. To determine if a desired main memory address is in the cache, we simultaneously (associatively) compare all the addresses stored in the cache with the desired address. This is illustrated in Fig: 16.

3. Set-associative mapping: This technique is a compromise between direct and fully-associative mapping. As in direct mapping, an index maps each main memory address to a cache address, but now each cache address contains the contents and tags of two or more memory locations, called a set or a line. To determine if a desired main memory address is in the cache, we go to the cache address indicated by the index, and we then simultaneously (associatively) compare all the tags at that location (i.e., of that set) with the desired tag. A cache with a set of size N is called an N-way set-associative cache. 2-way, 4-way and 8-way set-associative caches are common. This is illustrated in Fig: 17.

Figure 15: Direct Mapped Cache Mapping Technique

Figure 16: Fully Associative Cache Mapping Technique

Figure 17: Two-way Associative Cache Mapping Technique
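The direct-mapped lookup described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the names `split_address` and `DirectMappedCache`, the field widths, and the example addresses are hypothetical, and the cache tracks only valid bits and tags, not the data itself.

```python
def split_address(addr, index_bits, offset_bits):
    """Split a main memory address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


class DirectMappedCache:
    """Toy direct-mapped cache that stores only a valid bit and a tag per line."""

    def __init__(self, index_bits, offset_bits):
        self.index_bits = index_bits
        self.offset_bits = offset_bits
        lines = 1 << index_bits          # number of lines = 2^(index bits)
        self.valid = [False] * lines
        self.tags = [0] * lines

    def access(self, addr):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, index, _ = split_address(addr, self.index_bits, self.offset_bits)
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the new line replaces whatever was stored at this index.
        self.valid[index] = True
        self.tags[index] = tag
        return False


# Assumed geometry: 4 lines of 4 words each.
cache = DirectMappedCache(index_bits=2, offset_bits=2)
results = [cache.access(a) for a in (0x00, 0x01, 0x40, 0x00)]
print(results)  # [False, True, False, False]
```

Addresses 0x00 and 0x40 share index 0 but have different tags, so each evicts the other. A set-associative cache would reduce this kind of conflict by keeping several tags per index.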



30 Define the following: 1) Assembler 2) Linker 3) Debugger 4) Emulator

1. Assembler: Assemblers translate assembly instructions to binary machine instructions. In addition to just replacing opcode and operand mnemonics by binary equivalents, an assembler may also translate symbolic labels into actual addresses. For example, a programmer may add a symbolic label END to an instruction A and may reference END in a branch instruction. The assembler determines the actual binary address of A, and replaces references to END by this address. The mapping of assembly instructions to machine instructions is one-to-one.

2. Linker: A linker allows a programmer to create a program in separately-assembled files; it combines the machine instructions of each into a single program, perhaps incorporating instructions from standard library routines.

3. Debugger: Debuggers help programmers evaluate and correct their programs. They run on the development processor and support stepwise program execution, executing one instruction and then stopping, proceeding to the next instruction when instructed by the user. They permit execution up to user-specified breakpoints, which are instructions that when encountered cause the program to stop executing. Whenever the program stops, the user can examine values of various memory and register locations. A source-level debugger enables step-by-step execution in the source program language, whether assembly language or a structured language. A good debugging capability is crucial, as today’s programs can be quite complex and hard to write correctly.

4. Emulator: Emulators support debugging of the program while it executes on the target processor. An emulator typically consists of a debugger coupled with a board connected to the desktop processor via a cable. The board consists of the target processor plus some support circuitry (often another processor). The board may have another cable with a device having the same pin configuration as the target processor, allowing one to plug this device into a real embedded system. Such an in-circuit emulator enables one to control and monitor the program’s execution in the actual embedded system circuit. In-circuit emulators are available for nearly any processor intended for embedded use, although they can be quite expensive if they are to run at real speeds.
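The label translation performed by an assembler, as described in item 1, can be sketched with a toy two-pass routine. The function name `resolve_labels`, the mnemonics (LOAD, JZ, ADD, HALT) and the assumption that every instruction occupies exactly one address are hypothetical and chosen only for illustration; a real assembler would also encode each instruction in binary.

```python
def resolve_labels(lines):
    """Two-pass label resolution for a toy assembly listing."""
    # Pass 1: record the address of the instruction each label is attached to.
    symbols = {}
    instructions = []
    for line in lines:
        text = line.strip()
        if ":" in text:
            label, text = text.split(":", 1)
            symbols[label.strip()] = len(instructions)
            text = text.strip()
        if text:
            instructions.append(text.split())

    # Pass 2: replace every operand that names a label by its address.
    resolved = []
    for mnemonic, *operands in instructions:
        resolved.append((mnemonic, [symbols.get(op, op) for op in operands]))
    return resolved


program = [
    "      LOAD R0",
    "      JZ END",
    "      ADD R0",
    "END:  HALT",
]
print(resolve_labels(program))
# [('LOAD', ['R0']), ('JZ', [3]), ('ADD', ['R0']), ('HALT', [])]
```

The reference to END in the branch is replaced by address 3, the address of the instruction carrying the label. Two passes are used so that a label may be referenced before the line that defines it.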

31 Explain the various events that take place when a processor executes an instruction. Explain how pipelining improves the execution speed.

A microprocessor’s execution of an instruction consists of several basic stages:

1. Fetch instruction: the task of reading the next instruction from memory into the instruction register.

2. Decode instruction: the task of determining what operation the instruction in the instruction register represents (e.g., add, move, etc.).

3. Fetch operands: the task of moving the instruction’s operand data into appropriate registers.

4. Execute operation: the task of feeding the appropriate registers through the ALU and back into an appropriate register.

5. Store results: the task of writing a register into memory.

If each stage takes one clock cycle, then we can see that a single instruction may take several cycles to complete.



Figure 18: Pipelining: (a) nonpipelined dish cleaning (b) pipelined dish cleaning (c) pipelined instruction execution.

Pipelining improves speed: Pipelining is a common way to increase the instruction throughput of a microprocessor. Consider a simple analogy of two people approaching the chore of washing and drying 8 dishes. In one approach, the first person washes all 8 dishes, and then the second person dries all 8 dishes. Assuming 1 minute per dish per person, this approach requires 16 minutes. The approach is clearly inefficient since at any time only one person is working and the other is idle. Obviously, a better approach is for the second person to begin drying the first dish immediately after it has been washed. This approach requires only 9 minutes: 1 minute for the first dish to be washed, and then 8 more minutes until the last dish is finally dry. This latter approach is referred to as pipelining.

Each dish is like an instruction, and the two tasks of washing and drying are like the five stages listed above. By using a separate unit (each akin to a person) for each stage, we can pipeline instruction execution. After the instruction fetch unit fetches the first instruction, the decode unit decodes it while the instruction fetch unit simultaneously fetches the next instruction. The idea of pipelining is illustrated in Fig: 18. For pipelining to work well, instruction execution must be decomposable into roughly equal-length stages, and instructions should each require the same number of cycles.

Branches pose a problem for pipelining, since we don’t know the next instruction until the current instruction has reached the execute stage. One solution is to stall the pipeline when a branch is in the pipeline, waiting for the execute stage before fetching the next instruction. An alternative is to guess which way the branch will go and fetch the corresponding instruction next; if right, we proceed with no penalty, but if we find out in the execute stage that we were wrong, we must ignore all the instructions fetched since the branch was fetched, thus incurring a penalty. Modern pipelined microprocessors often have very sophisticated branch predictors built in.
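The timing of the dish analogy generalizes to any number of items and stages. The sketch below assumes an ideal pipeline: every stage takes the same time and there are no stalls or branch penalties. The function names are hypothetical.

```python
def nonpipelined_time(items, stages, stage_time=1):
    """Each item passes through every stage before the next item starts."""
    return items * stages * stage_time


def pipelined_time(items, stages, stage_time=1):
    """The first item takes `stages` steps; each later item finishes one step after the previous one."""
    return (stages + items - 1) * stage_time


# Dish analogy: 8 dishes, 2 stages (wash, dry), 1 minute per stage.
print(nonpipelined_time(8, 2), pipelined_time(8, 2))  # 16 9

# 8 instructions through the five stages listed above, 1 cycle per stage.
print(nonpipelined_time(8, 5), pipelined_time(8, 5))  # 40 12
```

For a long instruction stream the pipelined time approaches one cycle per instruction, which is why throughput improves even though the latency of a single instruction is unchanged.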

34 Given the cache designs in Table: 6, find out the one with the best performance by calculating the average cost of access.



Bytes   Set-associative   Miss Rate   Hit Cost   Miss Cost
4K      8-way             6%          1 cycle    12 cycles
8K      4-way             4%          2 cycles   12 cycles
16K     2-way             2%          3 cycles   12 cycles

Table 6: Cache Designs

Bytes   Set-associative   Miss Rate   Hit Cost   Miss Cost   Average Access Cost
4K      8-way             6%          1 cycle    12 cycles   0.94 * 1 + 0.06 * 12 = 1.66
8K      4-way             4%          2 cycles   12 cycles   0.96 * 2 + 0.04 * 12 = 2.40
16K     2-way             2%          3 cycles   12 cycles   0.98 * 3 + 0.02 * 12 = 3.18

Table 7: Average Access Cost

The average cost of access is given by:

Average access cost = Hit Rate * Hit Cost + Miss Rate * Miss Cost, where Hit Rate = 1 - Miss Rate

The average costs for the three given cache designs are tabulated in Table: 7. The best performance is given by the cache design with the least average access cost, i.e., 1.66 cycles, which corresponds to the 4 Kbyte, 8-way set-associative cache with a 6% miss rate, each hit costing 1 cycle and each miss costing 12 cycles.
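The calculation in Table: 7 can be checked with a short script. The design names used as dictionary keys are just labels chosen here; the miss rates and cycle counts are those given in Table: 6.

```python
def average_access_cost(miss_rate, hit_cost, miss_cost):
    """Average access cost = hit rate * hit cost + miss rate * miss cost."""
    return (1 - miss_rate) * hit_cost + miss_rate * miss_cost


# (label, miss rate, hit cost in cycles, miss cost in cycles)
designs = [
    ("4K, 8-way", 0.06, 1, 12),
    ("8K, 4-way", 0.04, 2, 12),
    ("16K, 2-way", 0.02, 3, 12),
]

costs = {name: round(average_access_cost(m, h, c), 2) for name, m, h, c in designs}
best = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: {cost:.2f} cycles")
print("best:", best)  # best: 4K, 8-way
```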

43 Explain how UART is used for communication highlighting the advantages of UART

A UART (Universal Asynchronous Receiver/Transmitter) receives serial data and stores it as parallel data (usually one byte), and takes parallel data and transmits it as serial data. Such serial communication is beneficial when we need to communicate bytes of data between devices separated by long distances, or when we simply have few available I/O pins. We must set the transmission and reception rate, called the baud rate, which indicates the frequency at which the signal changes. Common rates include 2400, 4800, 9600, and 19.2k. An extra bit, called parity, may be added to each data word to detect transmission errors: the parity bit is set high or low to indicate whether the word has an even or odd number of 1 bits.

Internally, a simple UART may possess a baud-rate configuration register, and two independently operating processors, one for receiving and the other for transmitting. The transmitter may possess a register, often called a transmit buffer, that holds data to be sent. This register is a shift register, so the data can be transmitted one bit at a time by shifting at the appropriate rate. Likewise, the receiver receives data into a shift register, and then this data can be read in parallel. This is shown in Fig: 19a. In order to shift at the appropriate rate based on the configuration register, a UART requires a timer.

The receiver constantly monitors the receive pin (rx) for a start bit. The start bit is typically signalled by a high-to-low transition on the rx pin. After the start bit has been detected, the receiver starts sampling the rx pin at predetermined intervals, shifting each sampled bit into the receive shift register. The receiver also shifts in an additional bit, the parity bit, which it uses to determine if the received data is correct.

The transmitter sends a start bit over its transmit pin (tx), signaling the beginning of a transmission to the remote UART. Then, the transmitter shifts out the data in its transmit buffer over its tx pin at a predetermined rate. The transmitter also transmits an additional parity bit. At this point, the UART processor signals its host processor, indicating it is ready to send more data.



(a) A PC communicating serially with an embedded device (b) Transmission protocol used by the two UARTs

Figure 19: Serial Communication using UARTs.

In order for two serially connected UARTs to communicate with each other, they must agree on the transmission protocol in use. A sample transmission protocol is shown in Fig: 19b. The transmission protocol used by UARTs determines the rate at which bits are sent and received. This is called the baud rate. The protocol specifies the number of bits of data and the type of parity sent during each transmission. It also specifies the minimum number of bits used to separate two consecutive data transmissions.
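As an illustration of such a protocol, the sketch below builds and checks one frame. The frame layout is an assumption made for this example (one low start bit, 8 data bits sent least-significant bit first, one parity bit, one high stop bit), not necessarily the one in Fig: 19b, and the function names are hypothetical.

```python
def parity_of(data_bits, parity="even"):
    """Parity bit that makes the total number of 1 bits even (or odd)."""
    ones = sum(data_bits) % 2
    return ones if parity == "even" else 1 - ones


def uart_frame(byte, parity="even"):
    """Bits on the line for one byte: start, 8 data bits (LSB first), parity, stop."""
    data = [(byte >> i) & 1 for i in range(8)]
    return [0] + data + [parity_of(data, parity)] + [1]


def check_frame(bits, parity="even"):
    """Receiver side: return (byte, ok); ok is False on a framing or parity error."""
    start, data, parity_bit, stop = bits[0], bits[1:9], bits[9], bits[10]
    ok = start == 0 and stop == 1 and parity_bit == parity_of(data, parity)
    value = sum(bit << i for i, bit in enumerate(data))
    return value, ok


frame = uart_frame(0x41)
print(frame)               # [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(check_frame(frame))  # (65, True)

# A single flipped data bit is caught by the parity check.
corrupted = frame.copy()
corrupted[3] ^= 1
print(check_frame(corrupted)[1])  # False
```

A single parity bit detects any odd number of flipped bits but cannot detect an even number of them, and it cannot correct errors.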

To use a UART, we must configure its baud rate by writing to the configuration register, and then we must write data to the transmit register and/or read data from the receive register. Unfortunately, configuring the baud rate is usually not as simple as writing the desired rate (e.g., 4800) to a register. For example, to configure the UART of an 8051, we must use the following equation:

baud rate = (2^smod / 32) * oscfreq / (12 * (256 - TH1))

smod corresponds to a bit in a special-function register, oscfreq is the frequency of the oscillator, and TH1 is an 8-bit rate register of a built-in timer. We could use a general-purpose processor to implement a UART completely in software. If we used a dedicated general-purpose processor, the implementation would be inefficient in terms of size. We could alternatively integrate the transmit and receive functionality with our main program. This would require creating a routine to send data serially over an I/O port, making use of a timer to control the rate. It would also require using an interrupt service routine to capture serial data coming from another I/O port whenever such data begins arriving. However, as with the timer functionality, adding send and receive functionality can detract from time for other computations.
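The baud rate equation can be evaluated, and inverted to find TH1, with a small script. The 11.0592 MHz oscillator frequency and the function names are assumptions made for this example; that frequency is used here because it divides evenly into 9600 baud.

```python
def baud_rate(smod, osc_freq, th1):
    """Baud rate from the equation above (smod is 0 or 1, th1 is 0..255)."""
    return (2 ** smod / 32) * osc_freq / (12 * (256 - th1))


def th1_for(baud, smod, osc_freq):
    """Nearest TH1 reload value for a desired baud rate."""
    return 256 - round((2 ** smod / 32) * osc_freq / (12 * baud))


osc = 11_059_200  # assumed oscillator frequency in Hz
th1 = th1_for(9600, smod=0, osc_freq=osc)
print(th1, hex(th1))            # 253 0xfd
print(baud_rate(0, osc, th1))   # 9600.0
```

Because TH1 must be an integer, not every oscillator frequency can produce every baud rate exactly; it is worth substituting the rounded TH1 back into `baud_rate` to check the resulting error.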

Advantages: The UART takes parallel data and transmits it as serial data. Such serial communication is advantageous when we need to communicate bytes of data between devices that are separated by long distances, or when those devices simply have few available I/O pins.

47 What is a WDT and what is its use? A 16-bit timer operates at a clock frequency of 20 MHz. Determine the resolution and range of this timer. If a ÷4 prescaler is used, what are the range and resolution of this design?

A watchdog timer (WDT) can be thought of as having the inverse functionality of a regular timer. We configure a watchdog timer with a real-time value, just as with a regular timer. However, instead of the timer generating a signal for us every X time units, we must generate a signal for the timer every X time units. If we fail to generate this signal in time, then the timer generates a signal indicating that we failed.

One common use of a watchdog timer is to enable an embedded system to restart itself in case of a failure. In such a case, we modify the system’s program to include statements that reset the watchdog timer. We place these statements such that the watchdog timer will be reset at least once during every timeout interval if the program is executing normally. We connect the fail signal from the watchdog timer to the microprocessor’s reset pin. Suppose the program has an unexpected failure, such as entering an undesired infinite loop or waiting for an input event that never arrives. The watchdog timer will time out, and thus the microprocessor will reset itself, starting its program from the beginning. In systems where such a full reset during system operation is not practical, we might instead connect the fail signal to an interrupt pin, and create an interrupt service routine that jumps to some safe part of the program. We might even combine these two responses, first jumping to an interrupt service routine to test parts of the system and record what went wrong, and then resetting the system. The interrupt service routine may record information as to the number of failures and the causes of each, so that it can be evaluated to determine if a particular part requires replacement.

Another common use is to support timeouts in a program while keeping the program structure simple. For example, we may desire that a user respond to questions on a display within some time period. Rather than sprinkling response-time checks throughout our program, we can use a watchdog timer to check for us, thus keeping our program neater.
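The behaviour of a watchdog timer can be modelled in software to see when the fail signal fires. This is only a simulation under stated assumptions: the class `WatchdogTimer`, its `kick` and `tick` methods, and the list of "healthy" program steps are hypothetical names, and a real watchdog is a hardware counter clocked independently of the program.

```python
class WatchdogTimer:
    """Counts time units since the last reset by the program."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.count = 0

    def kick(self):
        """Called by the program to reset the watchdog ("I am still alive")."""
        self.count = 0

    def tick(self):
        """Advance one time unit; return True if the fail signal fires."""
        self.count += 1
        if self.count >= self.timeout_ticks:
            self.count = 0  # the system restarts, so the count starts over
            return True
        return False


def run(program_steps, timeout_ticks):
    """Return the time steps at which the watchdog fired."""
    wdt = WatchdogTimer(timeout_ticks)
    fired_at = []
    for t, healthy in enumerate(program_steps):
        if healthy:          # a normally executing program resets the watchdog
            wdt.kick()
        if wdt.tick():
            fired_at.append(t)
    return fired_at


print(run([True] * 5, timeout_ticks=3))                               # []
print(run([True, True, False, False, False, False], timeout_ticks=3)) # [3]
```

In the second run the program stops resetting the watchdog after step 1 (as if stuck in an infinite loop), so the fail signal fires at step 3, which is where a reset or an interrupt service routine would take over.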

Range and resolution of given timer:

Resolution = 1 / frequency = 1 / 20 MHz = 50 ns

Range = 65536 * 50 ns = 3.2768 ms

When a ÷4 prescaler is used:

Resolution = 50 ns * 4 = 200 ns

Range = 65536 * 200 ns = 13.1072 ms
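The same calculation can be written as a small function, which makes it easy to try other timer widths and prescaler values. The function name is hypothetical; the range is taken as 2^bits counts, as in the calculation above.

```python
def timer_resolution_and_range(bits, clock_hz, prescaler=1):
    """Return (resolution, range) in seconds for a timer of the given width."""
    resolution = prescaler / clock_hz        # time per count
    timer_range = (2 ** bits) * resolution   # time for all 2^bits counts
    return resolution, timer_range


for prescaler in (1, 4):
    res, rng = timer_resolution_and_range(16, 20e6, prescaler)
    print(f"prescaler {prescaler}: resolution {res * 1e9:.0f} ns, range {rng * 1e3:.4f} ms")
```

The prescaler trades resolution for range: dividing the clock by 4 makes each count four times coarser and lets the same 16-bit counter cover an interval four times longer.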

52 Explain the features of flash memory, SRAM and OTP ROM

1. Flash Memory: Flash memory is an extension of EEPROM that was developed in the late 1980s. While also using the floating-gate principle of EEPROM, flash memory is designed such that large blocks of memory can be erased all at once, rather than just one word at a time as in traditional EEPROM. A block is typically several thousand bytes large. This fast erase ability can vastly improve the performance of embedded systems where large data items must be stored in nonvolatile memory, systems like digital cameras, TV set-top boxes, cell phones and medical equipment. It can also speed up manufacturing throughput, since programming the complete contents of flash may be faster than programming a similar-sized EEPROM.

Each block in a flash memory can be erased and reprogrammed tens of thousands of times before the block loses its ability to store data, and can store its data for 10 years or more. A drawback of flash memory is that writing to a single word in a flash may be slower than writing to a single word in EEPROM, since an entire block will need to be read, the word within it updated, and then the block written back.

Figure 20: Memory cell internals of SRAM



Figure 21: Write ability and storage permanence of memories, showing relative degrees along each axis (not to scale).

2. SRAM: Static RAM, or SRAM, uses the memory cell shown in Fig: 20, consisting of a flip-flop to store a bit. Each bit thus requires about six transistors. This RAM type is called static because it will hold its data as long as power is supplied, in contrast to dynamic RAM. Static RAM is typically used for high-performance parts of a system (e.g., cache).

3. OTP ROM: User-programmable ROMs are generally referred to as programmable ROMs, or PROMs. The most basic PROM uses a fuse for each programmable connection. To program a PROM device, the user provides a file that indicates the desired ROM contents. A piece of equipment called a ROM programmer then configures each programmable connection according to the file. The ROM programmer blows fuses by passing a large current wherever a connection should not exist. However, once a fuse is blown, the connection can never be reestablished. For this reason, basic PROM is often referred to as one-time-programmable ROM, or OTP ROM.

OTP ROMs have the lowest write ability of all PROMs, as illustrated in Fig: 21, since they can only be written once, and they require a programmer device. However, they have very high storage permanence, since their stored bits won’t change unless someone reconnects the device to a programmer and blows some more fuses. Because of their high storage permanence, OTP ROMs are commonly used in final products, versus other PROMs, which are more susceptible to having their connections inadvertently modified from radiation, maliciousness, or just the mere passage of many years.

OTP ROMs are also cheaper than other PROMs, often costing under a dollar each. This also makes them more attractive in final products versus other types of PROM, and also versus mask-programmed ROM when time-to-market constraints or unit costs make them a better choice.

References

[1] Embedded System Design - A Unified Hardware/Software Introduction, Frank Vahid, Tony Givargis, Third Edition.

