GRAPHICS PROCESSING UNIT
A Seminar Report
Submitted by
AJMAL P M
In partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
ELECTRONICS & COMMUNICATION
at
SCMS SCHOOL OF ENGINEERING AND TECHNOLOGY
VIDYA NAGAR, KARUKUTTY, COCHIN, KERALA
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
October 2013
SCMS SCHOOL OF ENGINEERING & TECHNOLOGY
VIDYA NAGAR, KARUKUTTY, COCHIN, KERALA
(Affiliated to Mahatma Gandhi University, Kottayam)
Department Of Electronics and Communication Engineering
CERTIFICATE
This is to certify that the seminar report entitled “GRAPHICS PROCESSING UNIT” is the
bona fide report of the work done by AJMAL P M, Reg. No: 10015231 of Seventh Semester,
Electronics & Communication Engineering towards the partial fulfillment of the requirement
for the award of the Degree of Bachelor of Technology in Electronics & Communication
Engineering by the Mahatma Gandhi University during the academic year 2013.
Guide HOD
SEMINAR REPORT 2013 GRAPHICS PROCESSING UNIT
ACKNOWLEDGEMENT
I take this opportunity to thank, firstly, the Principal of SSET
Karukutty, Prof. P Madhavan, for providing all the facilities that allowed me to complete this
seminar. I also extend my gratitude towards the HOD of the ECE department, Prof. R
Sahadevan, for his constant support.
This seminar would not have been a success if not for the timely
advice, guidance and support of my guides, Assistant Professor Mr. Sabu G Panicker, and
Mr. Sunil Jacob and Mr. Vinoj P.G, lecturers in the ECE Department. I express my
sincere gratitude towards them.
I thank all my friends whose help and support were indispensable for
the successful completion of the seminar. I convey my wholehearted gratitude towards my
parents.
Last but not least, I thank God Almighty for the blessings that helped
me throughout this humble endeavor.
ABSTRACT
A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically
for the processing of 3D graphics. The processor is built with integrated transform, lighting,
triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive
processes per second. GPUs allow products such as desktop PCs, portable computers, and game
consoles to process real-time 3D graphics that only a few years ago were available only on
high-end workstations. Used primarily for 3D applications, a graphics processing unit is a
single-chip processor that creates lighting effects and transforms objects every time a 3D scene
is redrawn. These are mathematically intensive tasks which would otherwise put quite a strain on the CPU.
TABLE OF CONTENTS
INTRODUCTION
WHAT'S A GPU?
HISTORY AND STANDARDS
INTERFACING PORTS
COMPONENTS OF A GPU
WORKING
GPU COMPUTING
MODERN GPU ARCHITECTURE
GPU IN MOBILES
ADVANTAGES
APPLICATIONS
CHALLENGES FACED
CONCLUSION
REFERENCES
INTRODUCTION
There are various applications that require a 3D world to be simulated as realistically as possible
on a computer screen. These include 3D animations in games, movies and other real world
simulations. It takes a lot of computing power to represent a 3D world due to the great amount of
information that must be used to generate a realistic 3D world and the complex mathematical
operations that must be used to project this 3D world onto a computer screen. In this situation,
the processing time and bandwidth are at a premium due to large amounts of both computation
and data.
The functional purpose of a GPU, then, is to provide separate, dedicated graphics resources,
including a graphics processor and memory, to relieve the main system resources, namely the
Central Processing Unit, main memory, and the system bus, of some of the burden of graphical
operations and I/O requests that would otherwise saturate them. The abstract goal of a GPU,
however, is to enable a representation of a 3D world as realistically as possible. GPUs are
therefore designed to provide additional computational power that is customized specifically to
perform these 3D tasks.
WHAT’S A GPU?
A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically for
the processing of 3D graphics. The processor is built with integrated transform, lighting, triangle
setup/clipping, and rendering engines, capable of handling millions of math-intensive processes
per second. GPUs form the heart of modern graphics cards, relieving the CPU (central
processing unit) of much of the graphics processing load. GPUs allow products such as desktop
PCs, portable computers, and game consoles to process real-time 3D graphics that only a few
years ago were available only on high-end workstations.
Used primarily for 3D applications, a graphics processing unit is a single-chip processor that
creates lighting effects and transforms objects every time a 3D scene is redrawn. These are
mathematically intensive tasks which would otherwise put quite a strain on the CPU. Lifting
this burden from the CPU frees up cycles that can be used for other jobs.
However, the GPU is not just for playing 3D-intense video games or for creating graphics
(sometimes referred to as graphics rendering or content creation); it is a component critical
to the PC's overall system speed. To fully appreciate that role, the graphics card itself
must first be understood.
Many near-synonyms exist for the graphics processing unit, the most popular being the graphics
card. It is also known as a video card, video accelerator, video adapter, video board, graphics
accelerator, or graphics adapter.
HISTORY AND STANDARDS
The first graphics cards, introduced in August of 1981 by IBM, were monochrome cards
designated as Monochrome Display Adapters (MDAs). The displays that used these cards were
typically text-only, with green or white text on a black background. The monochrome Hercules
Graphics Card (HGC) added bitmapped graphics to such displays, while color for IBM-compatible
computers appeared on the scene with the 4-color Color Graphics Adapter (CGA), followed by
the 16-color Enhanced Graphics Adapter (EGA). During the same time, other computer
manufacturers, such as Commodore, were
introducing computers with built-in graphics adapters that could handle a varying number of
colors.
When IBM introduced the Video Graphics Array (VGA) in 1987, a new graphics standard
came into being. A VGA display could support up to 256 colors (out of a possible 262,144-color
palette) at resolutions up to 720x400. Perhaps the most interesting difference between VGA and
the preceding formats is that VGA was analog, whereas displays had been digital up to that
point. Going from digital to analog may seem like a step backward, but it actually provided the
ability to vary the signal for more possible
combinations than the strict on/off nature of digital.
Over the years, VGA gave way to Super Video Graphics Array (SVGA). SVGA cards were
based on VGA, but each card manufacturer added resolutions and increased color depth in
different ways. Eventually, the Video Electronics Standards Association (VESA) agreed on a
standard implementation of SVGA that provided up to 16.8 million colors and 1280x1024
resolution. Most graphics cards available today support Ultra
Extended Graphics Array (UXGA). UXGA can support a palette of up to 16.8 million colors
and resolutions up to 1600x1200 pixels.
Even though any card you can buy today will offer higher colors and resolution than the basic
VGA specification, VGA mode is the de facto standard for graphics and is the minimum on all
cards. In addition to including VGA, a graphics card must be able to connect to your computer.
While there are still a number of graphics cards that plug into an Industry Standard Architecture
(ISA) or Peripheral Component Interconnect (PCI) slot, most cards moved to the
Accelerated Graphics Port (AGP), which has in turn been superseded by PCI Express.
PERIPHERAL COMPONENT INTERCONNECT (PCI)
There are a lot of incredibly complex components in a computer. And all of these parts
need to communicate with each other in a fast and efficient manner. Essentially, a bus is
the channel or path between the components in a computer. During the early 1990s, Intel
introduced a new bus standard for consideration, the Peripheral Component Interconnect
(PCI). It provides direct access to system memory for connected devices, but uses a bridge
to connect to the front-side bus and therefore to the CPU.
[Illustration: how the various buses connect to the CPU.]
PCI can connect up to five external components. Each of the five connectors for an external
component can be replaced with two fixed devices on the motherboard. The PCI bridge chip
regulates the speed of the PCI bus independently of the CPU's speed. This provides a higher
degree of reliability and ensures that PCI-hardware manufacturers know exactly what to design
for.
PCI originally operated at 33 MHz using a 32-bit-wide path. Revisions to the standard include
increasing the speed from 33 MHz to 66 MHz and doubling the bit count to 64. Currently, PCI-X
provides for 64-bit transfers at a speed of 133 MHz for an amazing 1-GBps (gigabyte per
second) transfer rate!
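The peak rates quoted above follow from simple arithmetic: bytes per clock times clock rate. A quick sketch (a hypothetical helper, using only the figures given in the text):

```python
def bus_bandwidth_mb_s(clock_mhz, width_bits):
    """Peak transfer rate in MB/s: one `width_bits`-wide word per clock."""
    return clock_mhz * (width_bits / 8)

print(bus_bandwidth_mb_s(33, 32))    # original PCI: 132 MB/s
print(bus_bandwidth_mb_s(66, 64))    # revised PCI: 528 MB/s
print(bus_bandwidth_mb_s(133, 64))   # PCI-X: 1064 MB/s, roughly 1 GB/s
```

Real sustained throughput is lower, since the bus also carries addresses and protocol overhead over the same multiplexed lines.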
PCI cards use 47 pins to connect (49 pins for a mastering card, which can control the PCI bus
without CPU intervention). The PCI bus is able to work with so few pins because of hardware
multiplexing, which means that the device sends more than one signal over a single pin. Also,
PCI supports devices that use either 5 volts or 3.3 volts. PCI slots are the best choice for network
interface cards (NIC), 2-D video cards, and other high-bandwidth
devices. On some PCs, PCI has completely superseded the old ISA expansion slots.
Although Intel proposed the PCI standard in 1991, it did not achieve popularity until the arrival
of Windows 95 (in 1995). This sudden interest in PCI was due to the fact that Windows 95
supported a feature called Plug and Play (PnP). PnP means that you can connect a device or
insert a card into your computer and it is automatically recognized and configured to work in
your system. Intel created the PnP standard and incorporated it into the design for PCI. But it
wasn't until several years later that a mainstream operating system, Windows 95, provided
system-level support for PnP. The introduction of PnP
accelerated the demand for computers with PCI.
ACCELERATED GRAPHICS PORT (AGP)
The need for streaming video and real-time-rendered 3-D games requires an even faster
throughput than that provided by PCI. In 1996, Intel debuted the Accelerated Graphics Port
(AGP), a modification of the PCI bus designed specifically to facilitate the use of streaming
video and high-performance graphics.
AGP is a high-performance interconnect between the core-logic chipset and the graphics
controller, providing enhanced graphics performance for 3D applications. AGP relieves the
graphics bottleneck by adding a dedicated high-speed interface directly between the chipset
and the graphics controller, as shown below.
Segments of system memory can be dynamically reserved by the OS for use by the graphics
controller. This memory is termed AGP memory or nonlocal video memory. The net result is
that the graphics controller is required to keep fewer texture maps in local memory.
AGP has 32 lines for multiplexed address and data. There are an additional 8 lines for sideband
addressing. Local video memory can be expensive and it cannot be used for other purposes by
the OS when unneeded by the graphics of the running applications. The graphics controller needs
fast access to local video memory for screen refreshes and various pixel elements, including
Z-buffers, double buffering, overlay planes, and textures.
For these reasons, programmers can always expect to have more texture memory available via
AGP system memory. Keeping textures out of the frame buffer allows larger screen resolution,
or permits Z-buffering for a given large screen size. As the need for more graphics intensive
applications continues to scale upward, the amount of textures stored in system memory will
increase. AGP delivers these textures from system memory to the graphics controller at speeds
sufficient to make system memory usable as a secondary texture store.
PCI EXPRESS
PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe, is a
high-speed serial computer expansion bus standard designed to replace the older PCI, PCI-X,
and AGP bus standards. PCIe has numerous improvements over the aforementioned bus
standards, including higher maximum system bus throughput, lower I/O pin count and smaller
physical footprint, better performance scaling for bus devices, a more detailed error detection
and reporting mechanism (Advanced Error Reporting, AER), and native hot-plug functionality.
More recent revisions of the PCIe standard support hardware I/O virtualization.
The PCIe electrical interface is also used in a variety of other standards, most notably
ExpressCard, a laptop expansion card interface.
Format specifications are maintained and developed by the PCI-SIG (PCI Special Interest
Group), a group of more than 900 companies that also maintain the conventional PCI
specifications. PCIe 3.0 is the latest standard for expansion cards that is in production and
available on mainstream personal computers.
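As a rough comparison with the PCI figures earlier, per-lane PCIe throughput can be computed from the signaling rate and the line-code efficiency. The rates and encodings below come from the published PCIe specifications, not from this report:

```python
def pcie_lane_mb_s(gt_per_s, enc_payload, enc_total):
    """Usable MB/s per lane: raw gigatransfers/s scaled by line-code efficiency."""
    return gt_per_s * 1e9 * (enc_payload / enc_total) / 8 / 1e6

# Spec figures: PCIe 1.0/2.0 use 8b/10b encoding, PCIe 3.0 uses 128b/130b.
print(round(pcie_lane_mb_s(2.5, 8, 10)))     # PCIe 1.0: 250 MB/s per lane
print(round(pcie_lane_mb_s(5.0, 8, 10)))     # PCIe 2.0: 500 MB/s per lane
print(round(pcie_lane_mb_s(8.0, 128, 130)))  # PCIe 3.0: ~985 MB/s per lane
```

Because PCIe is serial and scales by adding lanes, a x16 PCIe 3.0 graphics slot delivers roughly 16 times the per-lane figure, on the order of 15.75 GB/s each way.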
COMPONENTS OF A GPU
There are several components on a typical graphics card:
Graphics Processor
The graphics processor is the brains of the card, and is typically one of
three configurations:
Graphics co-processor: A card with this type of processor can handle all of the graphics chores
without any assistance from the computer's CPU. Graphics co-processors are typically found on
high-end video cards.
Graphics accelerator: In this configuration, the chip on the graphics card renders graphics
based on commands from the computer's CPU. This is the most common configuration used
today.
Frame buffer: This chip simply controls the memory on the card and sends information to the
digital-to-analog converter (DAC). It does no processing of the image data and is rarely used
anymore.
Memory – The type of RAM used on graphics cards varies widely, but the most popular types
use a dual-ported configuration. Dual-ported cards can write to one section of memory while
reading from another, decreasing the time it takes to refresh an image.
Graphics BIOS – Graphics cards have a small ROM chip containing basic information that tells
the other components of the card how to function in relation to each other. The BIOS also
performs diagnostic tests on the card's memory and input/output (I/O) to ensure that everything
is functioning correctly.
Digital-to-Analog Converter (DAC) – The DAC on a graphics card is commonly known as a
RAMDAC because it takes the data it converts directly from the card's memory. RAMDAC
speed greatly affects the image you see on the monitor. This is because the refresh rate of the
image depends on how quickly the analog information gets to the monitor.
Display Connector – Graphics cards use standard connectors. Most cards use the 15-pin
connector that was introduced with Video Graphics Array (VGA).
Computer (Bus) Connector – This is usually Accelerated Graphics Port (AGP). This port
enables the video card to directly access system memory. Direct memory access helps to make
the peak bandwidth four times higher than the Peripheral Component Interconnect (PCI) bus
adapter card slots. This allows the central processor to do other tasks while the graphics chip on
the video card accesses system memory.
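The RAMDAC speed requirement mentioned above can be estimated: the DAC must produce one analog pixel per pixel-clock tick, so its clock must cover resolution times refresh rate plus retrace overhead. A sketch, assuming a common rule of thumb of about 25% blanking overhead (an assumption, not a figure from this report):

```python
def ramdac_mhz(width, height, refresh_hz, blanking=1.25):
    """Approximate pixel clock the RAMDAC must sustain, in MHz.

    `blanking` adds a rough 25% for horizontal/vertical retrace
    (a conventional estimate, not an exact timing).
    """
    return width * height * refresh_hz * blanking / 1e6

print(round(ramdac_mhz(1024, 768, 85)))   # ~84 MHz
print(round(ramdac_mhz(1600, 1200, 85)))  # ~204 MHz
```

This is why high-resolution, high-refresh monitors of the era demanded RAMDACs rated well above 200 MHz.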
WORKING
The working of a GPU can be explained by considering the graphics pipeline. There are different
steps involved in creating a complete 3D scene, performed by different parts of the GPU, each of
which is assigned a particular job. During 3D rendering, different types of data
travel across the bus. The two most common types are texture and geometry data. The
geometry data is the "infrastructure" that the rendered scene is built on. This is made up of
polygons (usually triangles) that are represented by vertices, the end-points that define each
polygon. Texture data provides much of the detail in a scene, and textures can be used to
simulate more complex geometry, add lighting, and give an object a simulated surface.
Many new graphics chips now have an accelerated Transform and Lighting (T&L) unit, which
takes a 3D scene's geometry and transforms it into different coordinate spaces. It also performs
lighting calculations, again relieving the CPU of these math-intensive tasks.
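The coordinate-space transforms performed by the T&L unit boil down to 4x4 matrix multiplies on homogeneous vertex coordinates. A minimal illustrative sketch (plain Python, not real driver or GPU code):

```python
def transform(matrix, vertex):
    """Apply a 4x4 transform to a vertex given as (x, y, z, w)."""
    return tuple(sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4))

# Translation by (2, 3, 4): the kind of per-vertex work T&L hardware offloads.
translate = [[1, 0, 0, 2],
             [0, 1, 0, 3],
             [0, 0, 1, 4],
             [0, 0, 0, 1]]
print(transform(translate, (1, 1, 1, 1)))  # (3, 4, 5, 1)
```

Rotation, scaling, and the final perspective projection are all matrices of the same shape, so the hardware can chain them into a single multiply per vertex.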
Following the T&L unit on the chip is the triangle setup engine. It takes a scene's transformed
geometry and prepares it for the next stages of rendering by converting the scene into a form that
the pixel engine can then process. The pixel engine applies assigned texture values to each pixel.
This gives each pixel the correct color value so that it appears to have surface texture and does
not look like a flat, smooth object. After a pixel has been rendered it must be checked to see
whether it is visible by checking the depth value, or Z value.
A Z check unit performs this process by reading from the Z-buffer to see if there are any other
pixels rendered to the same location where the new pixel will be rendered. If another pixel is at
that location, it compares the Z value of the existing pixel to that of the new pixel. If the new
pixel is closer to the view camera, it gets written to the frame buffer. If it's not, it gets discarded.
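The Z-check described above can be sketched in a few lines (an illustrative model, using the common convention that a smaller Z value means closer to the camera):

```python
def depth_test(zbuffer, framebuffer, x, y, z, color):
    """Write the pixel only if it is nearer than what is already there."""
    if z < zbuffer[y][x]:
        zbuffer[y][x] = z
        framebuffer[y][x] = color
        return True   # pixel written to the frame buffer
    return False      # pixel discarded

# A 2x2 frame; infinity in the Z-buffer means "nothing rendered yet".
zbuf = [[float("inf")] * 2 for _ in range(2)]
fbuf = [[None] * 2 for _ in range(2)]
depth_test(zbuf, fbuf, 0, 0, 0.8, "red")    # empty location: written
depth_test(zbuf, fbuf, 0, 0, 0.3, "blue")   # closer: overwrites red
depth_test(zbuf, fbuf, 0, 0, 0.9, "green")  # farther: discarded
print(fbuf[0][0])  # blue
```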
After the complete scene is drawn into the frame buffer, the RAMDAC converts this digital data
into an analog signal that can be sent to the monitor for display.
[Block diagram of the graphics pipeline.]
GPU COMPUTING
GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate
general-purpose scientific and engineering applications. Pioneered by NVIDIA around 2007,
GPU computing has quickly become an industry standard, enjoyed by millions of users
worldwide and adopted by virtually all computing vendors.
GPU computing offers unprecedented application performance by offloading compute-intensive
portions of the application to the GPU, while the remainder of the code still runs on the CPU.
From a user's perspective, applications simply run significantly faster.
CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial
processing, while GPUs consist of thousands of smaller, more efficient cores designed for
parallel performance. Serial portions of the code run on the CPU while parallel portions run on
the GPU.
MODERN GPU ARCHITECTURE
The G80 Architecture
NVIDIA's GeForce 8800 was the product that gave birth to the new GPU Computing model.
Introduced in November 2006, the G80 based GeForce 8800 brought several key innovations to
GPU Computing:
• G80 was the first GPU to support C, allowing programmers to use the power of the GPU
without having to learn a new programming language.
• G80 was the first GPU to replace the separate vertex and pixel pipelines with a single, unified
processor that executed vertex, geometry, pixel, and computing programs.
• G80 was the first GPU to utilize a scalar thread processor, eliminating the need for
programmers to manually manage vector registers.
• G80 introduced the single-instruction multiple-thread (SIMT) execution model where multiple
independent threads execute concurrently using a single instruction.
• G80 introduced shared memory and barrier synchronization for inter-thread communication.
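The SIMT model in the list above can be emulated on a CPU to show the idea: every thread runs the same code and differs only in its indices. The CUDA-style names (blockIdx, threadIdx, blockDim) are borrowed for illustration; this is plain Python, not CUDA:

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    """One SIMT 'thread': identical code for all, differing only in its index."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # guard for the partial last block
        c[i] = a[i] + b[i]

n, block_dim = 10, 4
a, b, c = list(range(n)), list(range(n)), [0] * n
num_blocks = (n + block_dim - 1) // block_dim    # ceiling divide, as a launch would
for blk in range(num_blocks):
    for t in range(block_dim):                   # hardware runs these concurrently
        vector_add_kernel(blk, t, block_dim, a, b, c)
print(c)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

On the GPU the two loops disappear: the hardware scheduler runs all blocks and threads in parallel, which is why one scalar instruction stream can process an entire array.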
In June 2008, NVIDIA introduced a major revision to the G80 architecture. The second
generation unified architecture—GT200 (first introduced in the GeForce GTX 280, Quadro FX
5800, and Tesla T10 GPUs)—increased the number of streaming processor cores (subsequently
referred to as CUDA cores) from 128 to 240. Each processor register file was doubled in size,
allowing a greater number of threads to execute on-chip at any given time.
Hardware memory access coalescing was added to improve memory access efficiency. Double-precision
floating-point support was also added to address the needs of scientific and high-performance
computing (HPC) applications. When designing each new generation of GPU, NVIDIA's philosophy has
been to improve both existing application performance and GPU programmability; while faster
application performance brings immediate benefits, it is the GPU's relentless advancement in
programmability that has allowed it to evolve into the most versatile parallel processor of our
time. It was with this mindset that NVIDIA set out to develop the successor to the GT200
architecture.
CUDA (Compute Unified Device Architecture)
CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model created by NVIDIA and implemented by the graphics processing units
(GPUs) that they produce. CUDA gives program developers direct access to the virtual
instruction set and memory of the parallel computational elements in CUDA GPUs.
Using CUDA, the GPUs can be used for general purpose processing (i.e., not exclusively
graphics); this approach is known as GPGPU. Unlike CPUs, however, GPUs have a parallel
throughput architecture that emphasizes executing many concurrent threads slowly, rather than
executing a single thread very quickly.
The CUDA platform is accessible to software developers through CUDA-accelerated libraries,
compiler directives (such as OpenACC), and extensions to industry-standard programming
languages, including C, C++ and Fortran. C/C++ programmers use 'CUDA C/C++', compiled
with "nvcc", NVIDIA's LLVM-based C/C++ compiler, and Fortran programmers can use 'CUDA
Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group.
In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA
platform supports other computational interfaces, including the Khronos Group's OpenCL,
Microsoft's DirectCompute, and C++ AMP. In the computer game industry, GPUs are used not
only for graphics rendering but also in game physics calculations (physical effects like debris,
smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate
non-graphical applications in computational biology, cryptography and other fields by an order
of magnitude or more.
CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made
public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was added later,
in version 2.0, which superseded the beta released on February 14, 2008. CUDA works with
all Nvidia GPUs from the G8x series onwards, including the GeForce, Quadro and Tesla lines.
CUDA is compatible with most standard operating systems. Nvidia states that programs
developed for the G8x series will also work without modification on all future Nvidia video
cards, due to binary compatibility.
Advantages
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU)
using graphics APIs:
Scattered reads – code can read from arbitrary addresses in memory.
Shared memory – CUDA exposes a fast shared-memory region (up to 48 KB per multiprocessor)
that can be shared amongst threads. This can be used as a user-managed cache, enabling
higher bandwidth than is possible using texture lookups.
Faster downloads and readbacks to and from the GPU.
Full support for integer and bitwise operations, including integer texture lookups.
FERMI ARCHITECTURE
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA
cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The
512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory
partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM
memory. A host interface connects the GPU to the CPU via PCI Express. The GigaThread
global scheduler distributes thread blocks to SM thread schedulers.
Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the
most powerful SM yet built, but also the most programmable and efficient.
512 High Performance CUDA cores
Each SM features 32 CUDA processors—a fourfold increase over prior SM designs. Each
CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit
(FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture
implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add
(FMA) instruction for both single and double precision arithmetic. FMA improves over a
multiply-add (MAD) instruction by doing the multiplication and addition with a single final
rounding step, with no loss of precision in the addition. FMA is more accurate than performing
the operations separately. GT200 implemented double precision FMA.
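The accuracy difference between FMA and MAD can be demonstrated even on the host: a separate multiply rounds its result before the add, discarding low-order bits that a fused operation keeps. A small experiment, emulating the fused single-rounding path with exact rational arithmetic:

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -27          # exactly representable in double precision
c = -(1.0 + 2.0 ** -26)

# MAD path: the multiply rounds first (losing the 2**-54 term), then adds.
mad = a * a + c

# FMA path emulated exactly: multiply and add with one (here: no) rounding.
fma_exact = Fraction(a) * Fraction(a) + Fraction(c)

print(mad)               # 0.0 -- the low-order term was rounded away
print(float(fma_exact))  # 2**-54, preserved by fusing the operations
```

This single-rounding behavior is exactly what IEEE 754-2008 specifies for FMA and what Fermi implements in hardware for both single and double precision.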
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result,
multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly
designed integer ALU supports full 32-bit precision for all instructions, consistent with standard
programming language requirements. The integer ALU is also optimized to efficiently support
64-bit and extended precision operations. Various instructions are supported, including
Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population
count.
KEPLER ARCHITECTURE
With the launch of the Fermi GPU in 2009, NVIDIA ushered in a new era in the high-performance
computing (HPC) industry, based on a hybrid computing model where CPUs and GPUs work
together on application workloads. In just a couple of years, NVIDIA Fermi GPUs came to power
some of the fastest supercomputers in the world as well as tens of thousands of research
clusters globally.
Now, with the new Kepler GK110 GPU, NVIDIA raises the bar for the HPC industry, yet again.
Comprising 7.1 billion transistors, the Kepler GK110 GPU is an engineering marvel created to
address the most daunting challenges in HPC. Kepler is designed from the ground up to
maximize computational performance with superior power efficiency. The architecture has
innovations that make hybrid computing dramatically easier, applicable to a broader set of
applications, and more accessible. Kepler GK110 GPU is a computational workhorse with
teraflops of integer, single precision, and double precision performance and the highest memory
bandwidth. The first GK110 based product will be the Tesla K20 GPU computing accelerator.
SMX - Next Generation Streaming Multiprocessor
At the heart of the Kepler GK110 GPU is the new SMX unit, which comprises several
architectural innovations that make it not only the most powerful Streaming Multiprocessor (SM)
NVIDIA has ever built but also the most programmable and power-efficient. It delivers more
processing performance and efficiency through a new, innovative streaming-multiprocessor design
that allows a greater percentage of die area to be devoted to processing cores rather than
control logic.
Dynamic Parallelism
Any kernel can launch another kernel and can create the necessary streams, events, and
dependencies needed to process additional work without the need for host CPU interaction. This
simplified programming model is easier to create, optimize, and maintain. It also creates a
programmer friendly environment by maintaining the same syntax for GPU launched workloads
as traditional CPU kernel launches. Dynamic Parallelism broadens what applications can now
accomplish with GPUs in various disciplines. Applications can launch small and medium sized
parallel workloads dynamically where it was too expensive to do so previously.
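The control flow of Dynamic Parallelism can be sketched as ordinary recursion: a kernel decides on its own to launch child work, with no host round-trip. The following is a plain-Python analogy of that flow, not CUDA code, and the interval-splitting workload is invented purely for illustration:

```python
def refine_kernel(interval, depth, results):
    """A 'kernel' that issues its own child launches when more work is needed.

    In CUDA Dynamic Parallelism these recursive launches would run entirely
    on the GPU; here they are modeled as plain recursive calls.
    """
    lo, hi = interval
    if depth == 0 or hi - lo <= 1:
        results.append(interval)          # fine enough: record this region
        return
    mid = (lo + hi) // 2
    # Child "kernel launches" issued from within the kernel, no host involved:
    refine_kernel((lo, mid), depth - 1, results)
    refine_kernel((mid, hi), depth - 1, results)

out = []
refine_kernel((0, 8), 2, out)
print(out)  # [(0, 2), (2, 4), (4, 6), (6, 8)]
```

Before Dynamic Parallelism, each level of such a subdivision required returning to the CPU to launch the next round of kernels, which is the round-trip cost the feature eliminates.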
Hyper-Q — Maximizing the GPU Resources
Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby
dramatically increasing GPU utilization and slashing CPU idle times. This feature increases
the total number of connections between the host and the Kepler GK110 GPU by allowing 32
simultaneous, hardware-managed connections, compared to the single connection available with
Fermi.
Hyper-Q is a flexible solution that allows connections for both CUDA streams and Message
Passing Interface (MPI) processes, or even threads from within a process. Existing applications
that were previously limited by false dependencies can see up to a 32x performance increase
without changing any existing code.
GPU IN MOBILES
Mobile devices are quickly becoming our most valuable personal computers. Whether we’re
reading email, surfing the Web, interacting on social networks, taking pictures, playing games, or
using countless apps, our smartphones and tablets are becoming indispensable. Many people are
also using mobile devices such as Microsoft’s Surface RT and Lenovo’s Yoga 11 because of
their versatile form-factors, ability to run business apps, physical keyboards, and outstanding
battery life.
Amazing new visual computing experiences are possible on mobile devices thanks to ever more
powerful GPU subsystems. A fast GPU allows rich and fluid 2D or 3D user interfaces, high
resolution display output, speedy Web page rendering, accelerated photo and video editing, and
more realistic 3D gaming. Powerful GPUs are also becoming essential for auto infotainment
applications, such as highly detailed and easy to read 3D navigation systems and digital
instrument clusters, or rear-seat entertainment and driver assistance systems.
Each new generation of NVIDIA® Tegra® mobile processors has delivered significantly higher
CPU and GPU performance while improving its architectural and power efficiency. Tegra
processors have enabled amazing new mobile computing experiences in smartphones and tablets,
such as full-featured Web browsing, console class gaming, fast UI and multitasking
responsiveness, and Blu-ray quality video playback.
At CES 2013, NVIDIA announced the Tegra 4 processor, the world's first SoC with four ARM
Cortex-A15 CPU cores, plus a battery-saver Cortex-A15 companion core and a 72-core NVIDIA GPU.
With its increased number of GPU cores, faster clocks, and architectural efficiency
improvements, the Tegra 4 processor's GPU delivers approximately 20x the GPU horsepower of
the Tegra 2 processor. The Tegra 4 processor also combines its CPUs, GPU, and ISP to create a
Computational Photography Engine for near-real-time HDR still frame photo and 1080p 30
video capture.
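To make the idea of a Computational Photography Engine concrete, the following is a toy per-pixel exposure-fusion sketch of the kind of workload such an engine accelerates. The weighting scheme and all function names here are illustrative assumptions, not NVIDIA's actual Chimera pipeline.

```python
# Toy HDR exposure fusion: merge a short and a long exposure per pixel,
# weighting each sample by how far it is from clipping (0.0 or 1.0).
# Hand-rolled illustration only -- not NVIDIA's Chimera implementation.

def weight(v):
    """Prefer well-exposed samples: weight falls off near 0.0 and 1.0."""
    return 1.0 - abs(2.0 * v - 1.0)

def fuse(short_exposure, long_exposure):
    """Per-pixel weighted merge of two exposures (values in 0..1)."""
    fused = []
    for s, l in zip(short_exposure, long_exposure):
        ws, wl = weight(s), weight(l)
        if ws + wl == 0.0:          # both samples clipped: fall back to average
            fused.append((s + l) / 2.0)
        else:
            fused.append((ws * s + wl * l) / (ws + wl))
    return fused

# A bright sky pixel is clipped (1.0) in the long exposure but usable in the
# short one; a shadow pixel is crushed (0.0) in the short exposure.
short = [0.6, 0.0]   # sky, shadow
long_ = [1.0, 0.4]
print(fuse(short, long_))   # recovers detail from whichever exposure kept it
```

On a real SoC this per-pixel merge runs across the GPU's many cores in parallel, which is why GPU throughput matters for near-real-time HDR.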
The Need for Fast Mobile GPUs
One of the most popular application categories that demands fast GPU processing is 3D games.
Mobile 3D games have evolved from simple 2D visuals to now rival console gaming experiences
and graphics quality. In fact, some games that take full advantage of Tegra 4’s GPU and CPU
cores are hard to distinguish graphically from PC games! Not only have the mobile games
evolved, but mobile gaming as an application segment is one of the fastest growing in the
industry today. Visually rich PC and console games such as Max Payne and Grand Theft Auto
III are now available on mobile devices.
High quality, high resolution retina displays (with resolutions high enough that the human eye
cannot discern individual pixels at normal viewing distances) are now being used in various
mobile devices. Such high resolution displays require fast GPUs to deliver smooth UI
interactions, fast Web page rendering, snappy high-res photo manipulation, and of course high
quality 3D gaming. Similarly, connecting a smartphone or tablet to an external high-resolution
4K screen absolutely requires a powerful GPU.
With two decades of GPU industry leadership, the mobile GPU in the NVIDIA® Tegra® 4
Family of mobile processors is architected to deliver the performance demanded by console-class
mobile games, modern user interfaces, and high-resolution displays, while reducing power
consumption to fall within mobile power budgets. The Tegra 4 processor is also claimed to have the most architecturally efficient GPU subsystem in any mobile SoC today.
Tegra 4 Family GPU Features and Architecture
The Tegra 4 processor’s GPU accelerates both 2D and 3D rendering. Although 2D rendering is
often considered a “given” nowadays, it’s critically important to the user experience. The 2D
engine of the Tegra 4 processor’s GPU provides all the relevant low-level 2D composition
functionality, including alpha-blending, line drawing, video scaling, BitBLT, color space
conversion, and screen rotations. Working in concert with the display subsystem and video
decoder units, the GPU also helps support 4K video output to high-end 4K displays.
The 3D engine is fully programmable, and includes high-performance geometry and pixel
processing capability enabling advanced 3D user interfaces and console-quality gaming
experiences. The GPU also accelerates Flash processing in Web pages and GPGPU (General
Purpose GPU) computing, as used in NVIDIA’s new Computational Photography Engine,
NVIDIA Chimera™ architecture, which implements near-real-time HDR photo and video capture, HDR panoramic image processing, and “Tap-to-Track” object tracking.
As shown in the Tegra 4 family diagram in Figure 1 below, the Tegra 4 processor includes a 72-core GPU subsystem. The Tegra 4 processor's GPU has 6x the number of shader processing cores of the Tegra 3 processor, which translates to roughly 3-4x delivered game performance and
sometimes even higher. The NVIDIA Tegra 4i processor uses the same GPU architecture as the
Tegra 4 processor, but a 60 core variant instead of 72 cores. Even at 60 cores it delivers an
astounding amount of graphics performance for mainstream smartphone devices.
ADVANTAGES
Antialiasing
• One of the major problems in computer graphics is that curved or angular lines exhibit jaggedness, almost like the steps on a staircase.
• This jaggedness in 3D graphics is often referred to by gamers simply as 'jaggies', but the effect is formally known as aliasing.
• Aliasing occurs because the image on our screen is only a pixelated sample of the original 3D information that the graphics card has calculated.
• So a technique originally called Full Scene Anti-Aliasing (FSAA), and now simply
Antialiasing (AA), can be used by our graphics card to make the jagged lines in computer
graphics appear smoother without having to increase our resolution.
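The bullets above can be sketched in code. The following minimal supersampling example (SSAA, the brute-force ancestor of modern AA modes) renders a diagonal edge once with one sample per pixel and once with 4x4 samples; the scene and all names are illustrative assumptions.

```python
# Illustrative supersampling antialiasing (SSAA) on a diagonal edge.
# The "scene" is a half-plane: everything below the line y = x is black (0.0),
# everything above is white (1.0).

def scene(x, y):
    """Return the scene colour at a continuous point (x, y)."""
    return 1.0 if y > x else 0.0

def render(width, height, samples_per_axis):
    """Render with samples_per_axis^2 samples per pixel, averaged together."""
    n = samples_per_axis
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            total = 0.0
            for sy in range(n):
                for sx in range(n):
                    # Sample positions spread evenly inside the pixel.
                    x = px + (sx + 0.5) / n
                    y = py + (sy + 0.5) / n
                    total += scene(x, y)
            row.append(total / (n * n))
        image.append(row)
    return image

aliased = render(8, 8, 1)   # one sample per pixel: only pure 0.0 or 1.0
smooth = render(8, 8, 4)    # 4x4 supersampling: intermediate greys appear

print(sorted({v for row in aliased for v in row}))  # → [0.0, 1.0]
print(sorted({v for row in smooth for v in row}))   # → [0.0, 0.375, 1.0]
```

The intermediate grey (0.375) on edge pixels is exactly what makes the jagged staircase look smooth, without raising the display resolution.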
Anisotropic Filtering
• Antialiasing will not resolve another graphical glitch common in 3D graphics: blurry surfaces on more distant objects.
• A technique called Texture Filtering can help to raise the level of detail of distant
textures by increasing the numbers of texture samples to construct the relevant pixels on
the screen.
• These texture filtering methods are 'isotropic' methods, which means that they use a
filtering pattern which is square.
• However texture blurriness occurs on textures which are receding into the distance,
which means they require a non-square (typically rectangular or trapezoidal) filtering
pattern.
• Such a pattern is referred to as 'anisotropic', which leads us to the technique called Anisotropic Filtering (AF).
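The footprint idea can be sketched as follows. A screen pixel on a floor receding into the distance maps to a long, thin region of texture space; the example below averages a stripe texture over a square versus an elongated footprint. The texture, footprint sizes, and function names are illustrative assumptions, not any real GPU's sampling rules.

```python
# Sketch of isotropic vs anisotropic texture-filter footprints.

def texel(u, v):
    """Stripe texture: alternating 0/1 bands, 4 texels wide, along u."""
    return (u // 4) % 2

def sample_footprint(u0, v0, du, dv):
    """Average all texels in a du x dv footprint anchored at (u0, v0)."""
    total = 0
    for v in range(v0, v0 + dv):
        for u in range(u0, u0 + du):
            total += texel(u, v)
    return total / (du * dv)

# A distant pixel's true footprint: 16 texels long in the receding
# direction, 2 texels wide.  Anisotropic filtering uses that shape:
aniso = sample_footprint(0, 0, 16, 2)     # 0.5 -- correct stripe average

# An isotropic (square) filter cannot match the shape.  A small square
# misses most of the footprint (aliasing); a 16x16 square would instead
# average in rows of texture the pixel never covers (blur).
iso_small = sample_footprint(0, 0, 2, 2)  # 0.0 -- wrong, sees only one band
```

The anisotropic footprint returns the correct 0.5 coverage of the stripes, while the small square footprint lands entirely inside one band, which is the texture shimmering/blurring trade-off AF resolves.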
APPLICATIONS
BIOINFORMATICS
Genomic researchers employ GPU-accelerated HPC to reduce processing time.
Earlier, storage and speed problems existed due to the imbalance between DNA sequencing and computer technology.
Second-generation sequencing machines, accelerated by GPUs, enabled genomes to be mapped 50,000 times faster than a decade ago.
The high memory bandwidth and multi-core processing power of GPUs led to developments in this field.
ELECTRONIC DESIGN AUTOMATION
The increase in VLSI design complexity poses a significant challenge to EDA.
The semiconductor industry always seeks faster simulation solutions.
When simulation performance falls short, it leads to incomplete verification and missed functional bugs.
Recent trends in high-performance computing use GPUs as massively parallel co-processors to speed up intensive EDA simulations such as
1. Verilog Simulation
2. Computational Lithography
3. SPICE Circuit Simulation
DEFENCE AND INTELLIGENCE
The D&I community relies on accurate and timely information in its strategic and day-to-day activities.
Image Processing: 100 million fingerprint images are stored in the FBI's database.
GPUs accelerate the image processing workflow, including georectification, filtering algorithms, 3D reconstruction, etc.
Persistent Video Surveillance: defence departments in every country have surveillance cameras fitted almost everywhere.
GPUs are a great tool for achieving real-time performance in video processing and analytics algorithms.
MEDICAL IMAGING
Only 2% of doctors can operate on a beating heart.
Patients stand to lose one IQ point for every 10 minutes the heart is stopped.
GPUs enable real-time motion compensation that virtually 'stops' the beating heart for surgeons.
Image reconstruction becomes much faster when devices are embedded with GPUs.
CHALLENGES FACED
Scaling the performance and capabilities of all parallel-processor chips, including GPUs, is
challenging. First, as power supply voltage scaling has diminished, future architectures must
become more inherently energy efficient. Second, the road map for memory bandwidth
improvements is slowing down and falling further behind the computational capabilities
available on die. Third, even after 40 years of research, parallel programming is far from a
solved problem. Addressing these challenges will require research innovations that depart from
the evolutionary path of conventional architectures and programming systems.
Power and energy
Because of leakage constraints, power supply voltage scaling has largely stopped, causing energy per operation to now scale only linearly with process feature size. Meanwhile, the number of
devices that can be placed on a chip continues to increase quadratically with decreasing feature
size. The result is that all computers, from mobile devices to supercomputers, have or will
become constrained by power and energy rather than area. Because we can place more
processors on a chip than we can afford to power and cool, a chip’s utility is largely determined
by its performance at a particular power level, typically 3 W for a mobile device and 150 W for a
desktop or server component.
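The power-limited argument above reduces to simple arithmetic: the achievable operation rate is bounded by the power budget divided by the energy per operation. The sketch below works through it with an assumed (illustrative, not measured) cost of 10 pJ per operation.

```python
# Back-of-the-envelope throughput bound for a power-limited chip.
# All figures are illustrative assumptions, not measurements.

def max_ops_per_second(power_budget_watts, energy_per_op_joules):
    """Throughput ceiling: ops/s = power budget / energy per operation."""
    return power_budget_watts / energy_per_op_joules

PICO = 1e-12

# Suppose an operation costs 10 pJ on some process node.
mobile = max_ops_per_second(3.0, 10 * PICO)     # ~3 W mobile budget
desktop = max_ops_per_second(150.0, 10 * PICO)  # ~150 W desktop budget

print(f"mobile:  {mobile:.1e} ops/s")   # 3.0e+11 ops/s = 300 Gops/s
print(f"desktop: {desktop:.1e} ops/s")  # 1.5e+13 ops/s = 15 Tops/s

# Halving energy per op doubles achievable throughput at fixed power --
# energy efficiency, not transistor count, now sets performance.
```

This is why a chip's utility is judged by performance at a power level: adding more cores beyond the budget yields silicon that cannot be powered.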
Programmability
The technological challenges we have outlined here and the architectural choices that they necessitate present several challenges for programmers. Current programming practice predominantly focuses on sequential, homogeneous machines with a flat view of memory.
Contemporary machines have already moved away from this simple model, and future machines
will only continue that trend. Driven by the massive concurrency of GPUs, these trends present
many research challenges.
Locality. The increasing time and energy costs of distant memory accesses demand increasingly deep memory hierarchies. Performance-sensitive code must exercise some explicit control over the movement of data within this hierarchy.
However, few programming systems provide any means for programs to express the
programmer’s knowledge about the data access patterns’ structure or to explicitly control data
placement within the memory hierarchy. Current architectures encourage a flat view of memory
with implicit data caching occurring at multiple levels of the machine.
As is abundantly clear in today’s clusters, this approach is not sustainable for scalable machines.
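One classic way a programmer exercises the explicit data-movement control discussed above is cache blocking (tiling). The sketch below reorders a matrix multiply tile by tile so that small sub-blocks stay "hot" while they are reused; it is a pure-Python illustration under assumed tile sizes, not a tuned kernel.

```python
# Explicit locality control: a cache-blocked (tiled) matrix multiply.

def matmul_naive(a, b, n):
    """Textbook triple loop: streams through b with poor reuse."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, n, tile=2):
    """Same arithmetic, reordered to proceed tile by tile."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Within this triple of tiles, the same small sub-blocks
                # of a and b are touched repeatedly: high temporal locality.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c

n = 4
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float((i + j) % n) for j in range(n)] for i in range(n)]
assert matmul_blocked(a, b, n) == matmul_naive(a, b, n)  # same result
```

The point of the example is that the reordering changes nothing about the answer, only about which data is resident when, which is precisely the knowledge current programming systems give programmers few ways to express.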
Memory model. Scaling memory systems will require relaxing traditional notions of memory
consistency and coherence to facilitate greater memory-level parallelism. Although full cache
coherence is a convenient abstraction that can aid in programming productivity, the cost of
coherence and its protocol traffic even at the chip level makes supporting it for all memory
accesses unattractive. Programming systems must let programmers reason about memory
accesses with different attributes and behaviors, including some that are kept coherent and others
that are not.
CONCLUSION
From the introduction of the first 3D accelerator in 1996, these units have come a long way to be truly called a “Graphics Processing Unit”. It is no wonder that this piece of hardware is often referred to as an exotic product as far as computer peripherals are concerned. Observing the current pace of GPU development, we can conclude that we will see better and faster GPUs in the near future, with a wide variety of applications. The role of GPUs in scientific computing is increasing, and the GPU can therefore be considered a co-processor rather than just a special-purpose unit for graphics.
REFERENCES
David Luebke, “GPU Computing: Past, Present and Future”, GPU Technology Conference, San Francisco, October 11-14, 2011.
Stephen W. Keckler et al., “GPUs and the Future of Parallel Computing”, IEEE Micro, 2011.
K. Zhang and J. U. Kang, “Graphics Processing Unit Based Ultra High Speed Techniques”, Volume 18, Issue 4, 2012.
Ajit Datar, “Graphics Processing Unit Architecture (GPU Arch) with focus on NVIDIA GeForce 6800 GPU”, 14 April 2008.
Peter Geldof, “Generic Computing on GPU”, University of Amsterdam, Netherlands, May 2007.
Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008.