ujjwal kumar dept. of it gaya college,gaya · 2020. 7. 17. · ujjwal kumar dept. of it gaya...
TRANSCRIPT
Ujjwal kumarDept. of IT Gaya College,Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
A vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared to the scalar processors, whose instructions operate on single data items.
Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Early Work
▪ Development started in the early 1960s at Westinghouse
▪ Goal of the Solomon project was to substantially increase arithmetic
▪ performance by using many simple co-processors under the control of
a single master CPU
▪ Allowed single algorithm to be applied to large data set Supercomputers
▪ Dominated supercomputer design through the 1970s into the 1990s Cray platforms were the most notable vector supercomputers
▪ Cray -1: Introduced in 1976
▪ Cray-2, Cray X-MP, Cray Y-MP Demise
▪ In the late 1990s, the price-to-performance ratio drastically increased for conventional microprocessor designs
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
CPU that implements an instruction set that operates on 1-D arrays, called vectors
Vectors contain multiple data elements Number of data elements per vector is
typically referred to as the vector length. Both instructions and data are pipelined to
reduce decoding time
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
A vector is a set of scalar data items, all of the same type, stored in memory.
A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines, processing elements, and register counters, for performing vector operations.
Vector processing occurs when arithmetic or logical operations are applied to vectors.
Vector processing speedup 10..20 compared with scalar processing.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
A process that allows the CPU to execute a single instruction simultaneously on multiple pieces of data, rather than by repetitive looping.
Superscalar designs can take advantage of parallelism in scalar operations, it is possible to take advantage of similar parallelism in vector codes , Thus, it makes sense to provide multiple vector processors in a system.
Here the main issue is memory access.Ujjwal Kumar Dept. of IT Gaya College, Gaya
Memory-to-Memory Architecture (Traditional) Register-to-Register Architecture (Modern)
Ujjwal Kumar Dept. of IT Gaya College, Gaya
For all vector operation, operands are fetched directly from main memory, then routed to the functional unit
Results are written back to main memory Major reason for demise was due to large
startup time
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
All vector operations occur between vector registers.
If necessary, operands are fetched from main memory into a set of vector registers (load-store unit).
Includes all vector machines since the late 1980s:▪ Convex, Cray, Fujitsu, Hitachi, NEC
SIMD processors are based on this architecture
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Add two vectors to produce a third. Subtract two vectors to produce a third Multiply two vectors to produce a third Divide two vectors to produce a third Load a vector from memory Store a vector to memory
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Vector Registers▪ Typically 8-32 vector registers with 64 - 128 64-bit elements▪ Each contains a vector of double-precision numbers▪ Register size determines the maximum vector length▪ Each includes at least 2 read and 1 write ports
Vector Functional Units (FUs)▪ Fully pipelined, new operation every cycle▪ Performs arithmetic and logic operations▪ Typically 4-8 different units
Vector Load-Store Units (LSUs)▪ Moves vectors between memory and registers
Scalar Registers▪ Single elements for interconnecting FUs, LSUs, and registers
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Increase Memory Bandwidth▪ Memory banks are used to reduce load/store latency▪ Allow multiple simultaneous outstanding memory requests
Strip Mining▪ Generates code to allow vector operands whose size is less than or
greater than size of vector registers Vector Chaining
▪ Equivalent to data forwarding in vector processors▪ Results of one pipeline are fed into operand registers of another pipeline
Scatter and Gather▪ Retrieves data elements scattered throughout memory and packs them
into sequential vectors in vector registers▪ Promotes data locality and reduces data pollution
Multiple Parallel Lanes, or Pipes▪ Allows vector operation to be performed in parallel on multiple elements
of the vector.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Require Lower Instruction Bandwith▪ Reduced by fewer fetches and decodes
Easier Addressing of Main Memory▪ Load/Store units access memory with known patterns
Elimination of Memory Wastage▪ Unlike cache access, every data element that is requested by the
Processor is actually used – no cache misses▪ Latency only occurs once per vector during pipelined loading
Simplification of Control Hazards▪ Loop-related control hazards from the loop are eliminated
Scalable Platform▪ Increase performance by using more hardware resources
Reduced Code Size▪ Short, single instruction can describe N operations
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by compute kernels, which can be considered vector processors
Other CPU designs include some multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word)
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Performs same instruction on multiple data points concurrently.
Takes advantage of data level parallelism within an algorithm.
Commonly used in image and signal processing applications
▪ Large number of samples or pixels calculated with the same instruction
Ujjwal Kumar Dept. of IT Gaya College, Gaya
These architectures include instruction set extensions which allow both sequential and parallel instructions to be executed.
Some architectures include separate SIMD coprocessors for handling these instructions
In SIMD computers, ‘N’ number of processors are connected to a control unit and all the processors have their individual memory units. All the processors are connected by an interconnection network.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Developed by Fortune and Wyllie (1978) Objective: Modeling idealized parallel computers with
zero synchronization or memory access overhead
An n-processor PRAM has a globally addressable Memory
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Parallel random access machine; a theoretical model of parallel computation in which an arbitrary but finite number of processors can access any value in an arbitrarily large shared memory in a single time step.
Processors may execute different instruction streams, but work synchronously.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
A set of similar type of processors. All the processors share a common memory
unit. Processors can communicate among themselves through the shared memory only.
A memory access unit (MAU) connects the processors with the single shared memory.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
The PRAM model can apply to SIMD class machines if all processors execute identical instructions on the same cycle, or to MIMD class machines if the processors are executing different instructions.
Load imbalance is the only form of overhead in the PRAM model.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
EREW - Exclusive read, exclusive write : any memory location may only be accessed once in any one step.
CREW - Concurrent read, exclusive write : any memory location may be read any number of times during a single step, but only written to once, with the write taking place after the reads.
ERCW - Exclusive read, Concurrent Write : This allows exclusive read or concurrent writes to the same memory location.
CRCW - Concurrent read, concurrent write: any memory location may be written to or read from any number of times during a single step.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
There are many methods to implement the PRAM model, but the most prominent ones are −
Shared memory model Message passing model Data parallel model
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Shared memory emphasizes on control parallelism than on data parallelism.
In this model, multiple processes execute on different processors independently, but they share a common memory space.
Due to any processor activity, if there is any change in any memory location, it is visible to the rest of the processors.
As multiple processors access the same memory location, it may happen that at any particular point of time, more than one processor is accessing the same memory location.
To avoid this, some control mechanism is implemented to ensure mutual exclusion.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Shared memory programming has been implemented in the following −
▪ Thread libraries
▪ Distributed Shared Memory (DSM) Systems
▪ Program Annotation Packages
Ujjwal Kumar Dept. of IT Gaya College, Gaya
The thread library allows multiple threads of control that run concurrently in the same memory location.
Thread library provides an interface that supports multithreading through a library of subroutine.
It contains subroutines for▪ Creating and destroying threads▪ Scheduling execution of thread▪ passing data and message between threads▪ saving and restoring thread contexts
Examples of thread libraries include − SolarisTM threads for Solaris, POSIX threads as implemented in Linux, Win32 threads available in Windows NT and Windows 2000, and JavaTM threads as part of the standard JavaTMDevelopment Kit (JDK).
Ujjwal Kumar Dept. of IT Gaya College, Gaya
DSM systems create an abstraction of shared memory on loosely coupled architecture in order to implement shared memory programming without hardware support.
They implement standard libraries and use the advanced user-level memory management features present in modern operating systems.
Examples include Tread Marks System, Munin, IVY, Shasta, Brazos, and Cashmere.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
This is implemented on the architectures having uniform memory access characteristics.
The most notable example of program annotation packages is OpenMP.
OpenMP implements functional parallelism. It mainly focuses on parallelization of loops.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Merits of Shared Memory Programming▪ Global address space gives a user-friendly
programming approach to memory.▪ Due to the closeness of memory to CPU, data sharing
among processes is fast and uniform.▪ There is no need to specify distinctly the
communication of data among processes.▪ Process-communication overhead is negligible.▪ It is very easy to learn.
Demerits of Shared Memory Programming▪ It is not portable.▪ Managing data locality is very difficult.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Message passing is the most commonly used parallel programming approach in distributed memory systems. Here, the programmer has to determine the parallelism. In this model, all the processors have their own local memory unit and they exchange data through a communication network.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Processors use message-passing libraries for communication among themselves.
Along with the data being sent, the message contains the following components −▪ The address of the processor from which the message is being sent;▪ Starting address of the memory location of the data in the sending
processor;▪ Data type of the sending data;▪ Data size of the sending data;▪ The address of the processor to which the message is being sent;▪ Starting address of the memory location for the data in the receiving
processor. Processors can communicate with each other by any of the
following methods −▪ Point-to-Point Communication▪ Collective Communication▪ Message Passing Interface
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Point-to-point communication is the simplest form of message passing. Here, a message can be sent from the sending processor to a receiving processor by any of the following transfer modes −▪ Synchronous mode − The next message is sent only
after the receiving a confirmation that its previous message has been delivered, to maintain the sequence of the message.
▪ Asynchronous mode − To send the next message, receipt of the confirmation of the delivery of the previous message is not required.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Collective communication involves more than two processors for message passing. Following modes allow collective communications −
Barrier − Barrier mode is possible if all the processors included in the communications run a particular bock (known as barrier block) for message passing.
Broadcast − Broadcasting is of two types −▪ One-to-all − Here, one processor with a single operation
sends same message to all other processors.
▪ All-to-all − Here, all processors send message to all other processors.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
It may be of three types −▪ Personalized − Unique messages are sent to all
other destination processors.
▪ Non-personalized − All the destination processors receive the same message.
▪ Reduction − In reduction broadcasting, one processor of the group collects all the messages from all other processors in the group and combine them to a single message which all other processors in the group can access.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Merits of Message Passing
▪ Provides low-level control of parallelism;
▪ It is portable;
▪ Less error prone;
▪ Less overhead in parallel synchronization and data distribution.
Demerits of Message Passing▪ As compared to parallel shared-memory code, message-
passing code generally needs more software overhead.Ujjwal Kumar Dept. of IT Gaya College, Gaya
There are many message-passing libraries. Here, we will discuss two of the most-used message-passing libraries −
▪ Message Passing Interface (MPI)
▪ Parallel Virtual Machine (PVM)
Ujjwal Kumar Dept. of IT Gaya College, Gaya
It is a universal standard to provide communication among all the concurrent processes in a distributed memory system.
Most of the commonly used parallel computing platforms provide at least one implementation of message passing interface.
It has been implemented as the collection of predefined functions called library and can be called from languages such as C, C++, Fortran, etc.
MPIs are both fast and portable as compared to the other message passing libraries.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Merits of Message Passing Interface▪ Runs only on shared memory architectures or
distributed memory architectures;▪ Each processors has its own local variables;▪ As compared to large shared memory computers,
distributed memory computers are less expensive. Demerits of Message Passing Interface
▪ More programming changes are required for parallel algorithm;
▪ Sometimes difficult to debug; and▪ Does not perform well in the communication network
between the nodes.Ujjwal Kumar Dept. of IT Gaya College, Gaya
PVM is a portable message passing system, designed to connect separate heterogeneous host machines to form a single virtual machine.
It is a single manageable parallel computing resource. Large computational problems like superconductivity
studies, molecular dynamics simulations, and matrix algorithms can be solved more cost effectively by using the memory and the aggregate power of many computers.
It manages all message routing, data conversion, task scheduling in the network of incompatible computer architectures.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Very easy to install and configure; Multiple users can use PVM at the same time; One user can execute multiple applications; It’s a small package; Supports C, C++, Fortran; For a given run of a PVM program, users can
select the group of machines; It is a message-passing model, Process-based computation; Supports heterogeneous architecture.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
The major focus of data parallel programming model is on performing operations on a data set simultaneously.
The data set is organized into some structure like an array, hypercube, etc.
Processors perform operations collectively on the same data structure.
Each task is performed on a different partition of the same data structure.
It is restrictive, as not all the algorithms can be specified in terms of data parallelism. This is the reason why data parallelism is not universal.
Data parallel languages help to specify the data decomposition and mapping to the processors.
It also includes data distribution statements that allow the programmer to have control on data – for example, which data will go on which processor – to reduce the amount of communication within the processors.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale switching networks.
Nowadays, VLSI technologies are 2-dimensional. The size of a VLSI chip is proportional to the amount of storage (memory) space available in that chip.
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
We can calculate the space complexity of an algorithm by the chip area (A) of the VLSI chip implementation of that algorithm.
If T is the time (latency) needed to execute the algorithm, then A.T gives an upper bound on the total number of bits processed through the chip (or I/O).
For certain computing, there exists a lower bound, f(s), such that A.T2 >= O (f(s))
Where A=chip area and T=time
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya
Ujjwal Kumar Dept. of IT Gaya College, Gaya