GRAPHICS PROCESSING UNIT
A Seminar Report
Submitted by
AJMAL P M
In partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
ELECTRONICS & COMMUNICATION
at
SCMS SCHOOL OF ENGINEERING AND TECHNOLOGY
VIDYA NAGAR, KARUKUTTY, COCHIN, KERALA
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
October 2013
SCMS SCHOOL OF ENGINEERING & TECHNOLOGY
VIDYA NAGAR, KARUKUTTY, COCHIN, KERALA
(Affiliated to Mahatma Gandhi University, Kottayam)
Department Of Electronics and Communication Engineering
CERTIFICATE
This is to certify that the seminar report entitled “GRAPHICS PROCESSING UNIT” is the
bona fide report of the work done by AJMAL P M, Reg. No: 10015231 of Seventh Semester,
Electronics & Communication Engineering towards the partial fulfillment of the requirement
for the award of the Degree of Bachelor of Technology in Electronics & Communication
Engineering by the Mahatma Gandhi University during the academic year 2013.
Guide HOD
SEMINAR REPORT 2013 GRAPHICS PROCESSING UNIT
ACKNOWLEDGEMENT
I take this opportunity to thank, firstly, the Principal of SSET
Karukutty, Prof. P Madhavan, for providing all the facilities that allowed me to complete this
seminar. I also extend my gratitude towards the HOD of the ECE department, Prof. R
Sahadevan, for his constant support.
This seminar would not have been a success if not for the timely
advice, guidance and support of my guides, Assistant Professor Mr. Sabu G Panicker, and
Mr. Sunil Jacob and Mr. Vinoj P.G, lecturers in the ECE Department. I express my
sincere gratitude towards them.
I thank all my friends whose help and support were indispensable for
the successful completion of the seminar. I convey my wholehearted gratitude towards my
parents.
Last but not least, I thank God Almighty for the blessings that helped
me throughout this humble endeavor.
ABSTRACT
A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically
for the processing of 3D graphics. The processor is built with integrated transform, lighting,
triangle setup/clipping, and rendering engines, capable of handling millions of math-intensive
processes per second. GPUs allow products such as desktop PCs, portable computers, and game
consoles to process real-time 3D graphics that only a few years ago were available only on
high-end workstations. Used primarily for 3D applications, a graphics processing unit is a
single-chip processor that creates lighting effects and transforms objects every time a 3D scene
is redrawn. These are mathematically intensive tasks which would otherwise put quite a strain on the CPU.
TABLE OF CONTENTS
INTRODUCTION
WHAT'S A GPU?
HISTORY AND STANDARDS
INTERFACING PORTS
COMPONENTS OF A GPU
WORKING
GPU COMPUTING
MODERN GPU ARCHITECTURE
GPU IN MOBILES
ADVANTAGES
APPLICATIONS
CHALLENGES FACED
CONCLUSION
REFERENCES
INTRODUCTION
There are various applications that require a 3D world to be simulated as realistically as possible
on a computer screen. These include 3D animations in games, movies and other real world
simulations. It takes a lot of computing power to represent a 3D world due to the great amount of
information that must be used to generate a realistic 3D world and the complex mathematical
operations that must be used to project this 3D world onto a computer screen. In this situation,
the processing time and bandwidth are at a premium due to large amounts of both computation
and data.
The functional purpose of a GPU, then, is to provide separate, dedicated graphics resources,
including a graphics processor and memory, to relieve the main system resources, namely the
Central Processing Unit, main memory, and the system bus, of some of the burden of graphical
operations and I/O requests that would otherwise saturate them. The abstract goal of a GPU,
however, is to enable a representation of a 3D world as realistically as possible. GPUs are
therefore designed to provide additional computational power that is customized specifically to
perform these 3D tasks.
WHAT’S A GPU?
A Graphics Processing Unit (GPU) is a microprocessor that has been designed specifically for
the processing of 3D graphics. The processor is built with integrated transform, lighting, triangle
setup/clipping, and rendering engines, capable of handling millions of math-intensive processes
per second. GPUs form the heart of modern graphics cards, relieving the CPU (central
processing unit) of much of the graphics processing load. GPUs allow products such as desktop
PCs, portable computers, and game consoles to process real-time 3D graphics that only a few
years ago were available only on high-end workstations.
Used primarily for 3D applications, a graphics processing unit is a single-chip processor that
creates lighting effects and transforms objects every time a 3D scene is redrawn. These are
mathematically intensive tasks which would otherwise put quite a strain on the CPU. Lifting
this burden from the CPU frees up cycles that can be used for other jobs.
However, the GPU is not just for playing 3D-intense video games or for creating graphics
(sometimes referred to as graphics rendering or content creation); it is a component critical
to the PC's overall system speed. To fully appreciate that role, the graphics card itself
must first be understood.
Many near-synonyms exist for the graphics processing unit, the most popular being the graphics
card. It is also known as a video card, video accelerator, video adapter, video board, graphics
accelerator, or graphics adapter.
HISTORY AND STANDARDS
The first graphics cards, introduced in August of 1981 by IBM, were monochrome cards
designated as Monochrome Display Adapters (MDAs). The displays that used these cards were
typically text-only, with green or white text on a black background. The monochrome Hercules
Graphics Card (HGC) added bitmapped graphics to such displays, while color for IBM-compatible
computers appeared on the scene with the 4-color Color Graphics Adapter (CGA), followed by
the 16-color Enhanced Graphics Adapter (EGA). During the same time, other computer
manufacturers, such as Commodore, were
introducing computers with built-in graphics adapters that could handle a varying number of
colors.
When IBM introduced the Video Graphics Array (VGA) in 1987, a new graphics standard
came into being. A VGA display could support up to 256 colors (out of a possible 262,144-color
palette) at resolutions up to 720x400. Perhaps the most interesting difference between VGA and
the preceding formats is that VGA was analog, whereas displays had been digital up to that
point. Going from digital to analog may seem like a step backward, but it actually provided the
ability to vary the signal for more possible
combinations than the strict on/off nature of digital.
Over the years, VGA gave way to Super Video Graphics Array (SVGA). SVGA cards were
based on VGA, but each card manufacturer added resolutions and increased color depth in
different ways. Eventually, the Video Electronics Standards Association (VESA) agreed on a
standard implementation of SVGA that provided up to 16.8 million colors and 1280x1024
resolution. Most graphics cards available today support Ultra
Extended Graphics Array (UXGA). UXGA can support a palette of up to 16.8 million colors
and resolutions up to 1600x1200 pixels.
Even though any card you can buy today will offer higher colors and resolution than the basic
VGA specification, VGA mode is the de facto standard for graphics and is the minimum on all
cards. In addition to including VGA, a graphics card must be able to connect to your computer.
While there are still a number of graphics cards that plug into an Industry Standard Architecture
(ISA) or Peripheral Component Interconnect (PCI) slot, most cards moved to the
Accelerated Graphics Port (AGP), which has in turn been superseded by PCI Express.
PERIPHERAL COMPONENT INTERCONNECT (PCI)
There are a lot of incredibly complex components in a computer. And all of these parts
need to communicate with each other in a fast and efficient manner. Essentially, a bus is
the channel or path between the components in a computer. During the early 1990s, Intel
introduced a new bus standard for consideration, the Peripheral Component Interconnect
(PCI). It provides direct access to system memory for connected devices, but uses a bridge
to connect to the front-side bus and therefore to the CPU.
[Illustration: how the various buses connect to the CPU.]
PCI can connect up to five external components. Each of the five connectors for an external
component can be replaced with two fixed devices on the motherboard. The PCI bridge chip
regulates the speed of the PCI bus independently of the CPU's speed. This provides a higher
degree of reliability and ensures that PCI-hardware manufacturers know exactly what to design
for.
PCI originally operated at 33 MHz using a 32-bit-wide path. Revisions to the standard include
increasing the speed from 33 MHz to 66 MHz and doubling the bit count to 64. Currently, PCI-X
provides for 64-bit transfers at a speed of 133 MHz for an amazing 1-GBps (gigabyte per
second) transfer rate!
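The peak rates quoted above follow from simple arithmetic: bytes per clock times clock rate. A quick sketch (a hypothetical helper, using only the figures given in the text):

```python
def bus_bandwidth_mb_s(clock_mhz, width_bits):
    """Peak transfer rate in MB/s: one `width_bits`-wide word per clock."""
    return clock_mhz * (width_bits / 8)

print(bus_bandwidth_mb_s(33, 32))    # original PCI: 132 MB/s
print(bus_bandwidth_mb_s(66, 64))    # revised PCI: 528 MB/s
print(bus_bandwidth_mb_s(133, 64))   # PCI-X: 1064 MB/s, roughly 1 GB/s
```

Real sustained throughput is lower, since the bus also carries addresses and protocol overhead over the same multiplexed lines.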
PCI cards use 47 pins to connect (49 pins for a mastering card, which can control the PCI bus
without CPU intervention). The PCI bus is able to work with so few pins because of hardware
multiplexing, which means that the device sends more than one signal over a single pin. Also,
PCI supports devices that use either 5 volts or 3.3 volts. PCI slots are the best choice for network
interface cards (NIC), 2-D video cards, and other high-bandwidth
devices. On some PCs, PCI has completely superseded the old ISA expansion slots.
Although Intel proposed the PCI standard in 1991, it did not achieve popularity until the arrival
of Windows 95 (in 1995). This sudden interest in PCI was due to the fact that Windows 95
supported a feature called Plug and Play (PnP). PnP means that you can connect a device or
insert a card into your computer and it is automatically recognized and configured to work in
your system. Intel created the PnP standard and incorporated it into the design for PCI. But it
wasn't until several years later that a mainstream operating system, Windows 95, provided
system-level support for PnP. The introduction of PnP
accelerated the demand for computers with PCI.
ACCELERATED GRAPHICS PORT (AGP)
The need for streaming video and real-time-rendered 3-D games requires an even faster
throughput than that provided by PCI. In 1996, Intel debuted the Accelerated Graphics Port
(AGP), a modification of the PCI bus designed specifically to facilitate the use of streaming
video and high-performance graphics.
AGP is a high-performance interconnect between the core-logic chipset and the graphics
controller, providing enhanced graphics performance for 3D applications. AGP relieves the
graphics bottleneck by adding a dedicated high-speed interface directly between the chipset
and the graphics controller, as shown below.
Segments of system memory can be dynamically reserved by the OS for use by the graphics
controller. This memory is termed AGP memory or nonlocal video memory. The net result is
that the graphics controller is required to keep fewer texture maps in local memory.
AGP has 32 lines for multiplexed address and data. There are an additional 8 lines for sideband
addressing. Local video memory can be expensive and it cannot be used for other purposes by
the OS when unneeded by the graphics of the running applications. The graphics controller needs
fast access to local video memory for screen refreshes and various pixel elements, including
Z-buffers, double buffering, overlay planes, and textures.
For these reasons, programmers can always expect to have more texture memory available via
AGP system memory. Keeping textures out of the frame buffer allows larger screen resolution,
or permits Z-buffering for a given large screen size. As the need for more graphics intensive
applications continues to scale upward, the amount of textures stored in system memory will
increase. AGP delivers these textures from system memory to the graphics controller at speeds
sufficient to make system memory usable as a secondary texture store.
PCI EXPRESS
PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe, is a
high-speed serial computer expansion bus standard designed to replace the older PCI, PCI-X,
and AGP bus standards. PCIe has numerous improvements over the aforementioned bus
standards, including higher maximum system bus throughput, lower I/O pin count and smaller
physical footprint, better performance scaling for bus devices, a more detailed error detection
and reporting mechanism (Advanced Error Reporting, AER), and native hot-plug functionality.
More recent revisions of the PCIe standard support hardware I/O virtualization.
The PCIe electrical interface is also used in a variety of other standards, most notably
ExpressCard, a laptop expansion card interface.
Format specifications are maintained and developed by the PCI-SIG (PCI Special Interest
Group), a group of more than 900 companies that also maintain the conventional PCI
specifications. PCIe 3.0 is the latest standard for expansion cards that is in production and
available on mainstream personal computers.
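As a rough comparison with the PCI figures earlier, per-lane PCIe throughput can be computed from the signaling rate and the line-code efficiency. The rates and encodings below come from the published PCIe specifications, not from this report:

```python
def pcie_lane_mb_s(gt_per_s, enc_payload, enc_total):
    """Usable MB/s per lane: raw gigatransfers/s scaled by line-code efficiency."""
    return gt_per_s * 1e9 * (enc_payload / enc_total) / 8 / 1e6

# Spec figures: PCIe 1.0/2.0 use 8b/10b encoding, PCIe 3.0 uses 128b/130b.
print(round(pcie_lane_mb_s(2.5, 8, 10)))     # PCIe 1.0: 250 MB/s per lane
print(round(pcie_lane_mb_s(5.0, 8, 10)))     # PCIe 2.0: 500 MB/s per lane
print(round(pcie_lane_mb_s(8.0, 128, 130)))  # PCIe 3.0: ~985 MB/s per lane
```

Because PCIe is serial and scales by adding lanes, a x16 PCIe 3.0 graphics slot delivers roughly 16 times the per-lane figure, on the order of 15.75 GB/s each way.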
COMPONENTS OF A GPU
There are several components on a typical graphics card:
Graphics Processor
The graphics processor is the brains of the card, and is typically one of
three configurations:
Graphics co-processor: A card with this type of processor can handle all of the graphics chores
without any assistance from the computer's CPU. Graphics co-processors are typically found on
high-end video cards.
Graphics accelerator: In this configuration, the chip on the graphics card renders graphics
based on commands from the computer's CPU. This is the most common configuration used
today.
Frame buffer: This chip simply controls the memory on the card and sends information to the
digital-to-analog converter (DAC). It does no processing of the image data and is rarely used
anymore.
Memory – The type of RAM used on graphics cards varies widely, but the most popular types
use a dual-ported configuration. Dual-ported cards can write to one section of memory while
reading from another, decreasing the time it takes to refresh an image.
Graphics BIOS – Graphics cards have a small ROM chip containing basic information that tells
the other components of the card how to function in relation to each other. The BIOS also
performs diagnostic tests on the card's memory and input/output (I/O) to ensure that everything
is functioning correctly.
Digital-to-Analog Converter (DAC) – The DAC on a graphics card is commonly known as a
RAMDAC because it takes the data it converts directly from the card's memory. RAMDAC
speed greatly affects the image you see on the monitor. This is because the refresh rate of the
image depends on how quickly the analog information gets to the monitor.
Display Connector – Graphics cards use standard connectors. Most cards use the 15-pin
connector that was introduced with Video Graphics Array (VGA).
Computer (Bus) Connector – This is usually Accelerated Graphics Port (AGP). This port
enables the video card to directly access system memory. Direct memory access helps to make
the peak bandwidth four times higher than the Peripheral Component Interconnect (PCI) bus
adapter card slots. This allows the central processor to do other tasks while the graphics chip on
the video card accesses system memory.
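The RAMDAC speed requirement mentioned above can be estimated: the DAC must produce one analog pixel per pixel-clock tick, so its clock must cover resolution times refresh rate plus retrace overhead. A sketch, assuming a common rule of thumb of about 25% blanking overhead (an assumption, not a figure from this report):

```python
def ramdac_mhz(width, height, refresh_hz, blanking=1.25):
    """Approximate pixel clock the RAMDAC must sustain, in MHz.

    `blanking` adds a rough 25% for horizontal/vertical retrace
    (a conventional estimate, not an exact timing).
    """
    return width * height * refresh_hz * blanking / 1e6

print(round(ramdac_mhz(1024, 768, 85)))   # ~84 MHz
print(round(ramdac_mhz(1600, 1200, 85)))  # ~204 MHz
```

This is why high-resolution, high-refresh monitors of the era demanded RAMDACs rated well above 200 MHz.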
WORKING
The working of a GPU can be explained by considering the graphics pipeline. There are different
steps involved in creating a complete 3D scene, performed by different parts of the GPU, each of
which is assigned a particular job. During 3D rendering, different types of data
travel across the bus. The two most common types are texture and geometry data. The
geometry data is the "infrastructure" that the rendered scene is built on. This is made up of
polygons (usually triangles) that are represented by vertices, the end-points that define each
polygon. Texture data provides much of the detail in a scene, and textures can be used to
simulate more complex geometry, add lighting, and give an object a simulated surface.
Many new graphics chips now have an accelerated Transform and Lighting (T&L) unit, which
takes a 3D scene's geometry and transforms it into different coordinate spaces. It also performs
lighting calculations, again relieving the CPU of these math-intensive tasks.
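The coordinate-space transforms performed by the T&L unit boil down to 4x4 matrix multiplies on homogeneous vertex coordinates. A minimal illustrative sketch (plain Python, not real driver or GPU code):

```python
def transform(matrix, vertex):
    """Apply a 4x4 transform to a vertex given as (x, y, z, w)."""
    return tuple(sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4))

# Translation by (2, 3, 4): the kind of per-vertex work T&L hardware offloads.
translate = [[1, 0, 0, 2],
             [0, 1, 0, 3],
             [0, 0, 1, 4],
             [0, 0, 0, 1]]
print(transform(translate, (1, 1, 1, 1)))  # (3, 4, 5, 1)
```

Rotation, scaling, and the final perspective projection are all matrices of the same shape, so the hardware can chain them into a single multiply per vertex.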
Following the T&L unit on the chip is the triangle setup engine. It takes a scene's transformed
geometry and prepares it for the next stages of rendering by converting the scene into a form that
the pixel engine can then process. The pixel engine applies assigned texture values to each pixel.
This gives each pixel the correct color value so that it appears to have surface texture and does
not look like a flat, smooth object. After a pixel has been rendered it must be checked to see
whether it is visible by checking the depth value, or Z value.
A Z check unit performs this process by reading from the Z-buffer to see if there are any other
pixels rendered to the same location where the new pixel will be rendered. If another pixel is at
that location, it compares the Z value of the existing pixel to that of the new pixel. If the new
pixel is closer to the view camera, it gets written to the frame buffer. If it's not, it gets discarded.
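The Z-check described above can be sketched in a few lines (an illustrative model, using the common convention that a smaller Z value means closer to the camera):

```python
def depth_test(zbuffer, framebuffer, x, y, z, color):
    """Write the pixel only if it is nearer than what is already there."""
    if z < zbuffer[y][x]:
        zbuffer[y][x] = z
        framebuffer[y][x] = color
        return True   # pixel written to the frame buffer
    return False      # pixel discarded

# A 2x2 frame; infinity in the Z-buffer means "nothing rendered yet".
zbuf = [[float("inf")] * 2 for _ in range(2)]
fbuf = [[None] * 2 for _ in range(2)]
depth_test(zbuf, fbuf, 0, 0, 0.8, "red")    # empty location: written
depth_test(zbuf, fbuf, 0, 0, 0.3, "blue")   # closer: overwrites red
depth_test(zbuf, fbuf, 0, 0, 0.9, "green")  # farther: discarded
print(fbuf[0][0])  # blue
```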
After the complete scene is drawn into the frame buffer, the RAMDAC converts this digital data
into an analog signal that can be sent to the monitor for display.
[Block diagram of the graphics pipeline.]
GPU COMPUTING
GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate
general-purpose scientific and engineering applications. Pioneered by NVIDIA around 2007,
GPU computing has quickly become an industry standard, enjoyed by millions of users
worldwide and adopted by virtually all computing vendors.
GPU computing offers unprecedented application performance by offloading compute-intensive
portions of the application to the GPU, while the remainder of the code still runs on the CPU.
From a user's perspective, applications simply run significantly faster.
CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial
processing, while GPUs consist of thousands of smaller, more efficient cores designed for
parallel performance. Serial portions of the code run on the CPU while parallel portions run on
the GPU.
MODERN GPU ARCHITECTURE
The G80 Architecture
NVIDIA's GeForce 8800 was the product that gave birth to the new GPU Computing model.
Introduced in November 2006, the G80 based GeForce 8800 brought several key innovations to
GPU Computing:
• G80 was the first GPU to support C, allowing programmers to use the power of the GPU
without having to learn a new programming language.
• G80 was the first GPU to replace the separate vertex and pixel pipelines with a single, unified
processor that executed vertex, geometry, pixel, and computing programs.
• G80 was the first GPU to utilize a scalar thread processor, eliminating the need for
programmers to manually manage vector registers.
• G80 introduced the single-instruction multiple-thread (SIMT) execution model where multiple
independent threads execute concurrently using a single instruction.
• G80 introduced shared memory and barrier synchronization for inter-thread communication.
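The SIMT model in the list above can be emulated on a CPU to show the idea: every thread runs the same code and differs only in its indices. The CUDA-style names (blockIdx, threadIdx, blockDim) are borrowed for illustration; this is plain Python, not CUDA:

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    """One SIMT 'thread': identical code for all, differing only in its index."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # guard for the partial last block
        c[i] = a[i] + b[i]

n, block_dim = 10, 4
a, b, c = list(range(n)), list(range(n)), [0] * n
num_blocks = (n + block_dim - 1) // block_dim    # ceiling divide, as a launch would
for blk in range(num_blocks):
    for t in range(block_dim):                   # hardware runs these concurrently
        vector_add_kernel(blk, t, block_dim, a, b, c)
print(c)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

On the GPU the two loops disappear: the hardware scheduler runs all blocks and threads in parallel, which is why one scalar instruction stream can process an entire array.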
In June 2008, NVIDIA introduced a major revision to the G80 architecture. The second
generation unified architecture—GT200 (first introduced in the GeForce GTX 280, Quadro FX
5800, and Tesla T10 GPUs)—increased the number of streaming processor cores (subsequently
referred to as CUDA cores) from 128 to 240. Each processor register file was doubled in size,
allowing a greater number of threads to execute on-chip at any given time.
Hardware memory access coalescing was added to improve memory access efficiency. Double-precision
floating-point support was also added to address the needs of scientific and high-performance
computing (HPC) applications. When designing each new generation of GPU, NVIDIA's philosophy has
been to improve both existing application performance and GPU programmability; while faster
application performance brings immediate benefits, it is the GPU's relentless advancement in
programmability that has allowed it to evolve into the most versatile parallel processor of our
time. It was with this mindset that NVIDIA set out to develop the successor to the GT200
architecture.
CUDA (Compute Unified Device Architecture)
CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model created by NVIDIA and implemented by the graphics processing units
(GPUs) that they produce. CUDA gives program developers direct access to the virtual
instruction set and memory of the parallel computational elements in CUDA GPUs.
Using CUDA, the GPUs can be used for general purpose processing (i.e., not exclusively
graphics); this approach is known as GPGPU. Unlike CPUs, however, GPUs have a parallel
throughput architecture that emphasizes executing many concurrent threads slowly, rather than
executing a single thread very quickly.
The CUDA platform is accessible to software developers through CUDA-accelerated libraries,
compiler directives (such as OpenACC), and extensions to industry-standard programming
languages, including C, C++ and Fortran. C/C++ programmers use 'CUDA C/C++', compiled
with "nvcc", NVIDIA's LLVM-based C/C++ compiler, and Fortran programmers can use 'CUDA
Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group.
In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA
platform supports other computational interfaces, including the Khronos Group's OpenCL,
Microsoft's DirectCompute, and C++ AMP. In the computer game industry, GPUs are used not
only for graphics rendering but also in game physics calculations (physical effects like debris,
smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate
non-graphical applications in computational biology, cryptography and other fields by an order
of magnitude or more.
CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made
public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was added later,
in version 2.0, which superseded the beta released on February 14, 2008. CUDA works with
all Nvidia GPUs from the G8x series onwards, including the GeForce, Quadro and Tesla lines.
CUDA is compatible with most standard operating systems. Nvidia states that programs
developed for the G8x series will also work without modification on all future Nvidia video
cards, due to binary compatibility.
Advantages
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU)
using graphics APIs:
Scattered reads – code can read from arbitrary addresses in memory.
Shared memory – CUDA exposes a fast shared-memory region (up to 48 KB per multiprocessor)
that can be shared amongst threads. This can be used as a user-managed cache, enabling
higher bandwidth than is possible using texture lookups.
Faster downloads and readbacks to and from the GPU.
Full support for integer and bitwise operations, including integer texture lookups.
FERMI ARCHITECTURE
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA
cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The
512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory
partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM
memory. A host interface connects the GPU to the CPU via PCI Express. The GigaThread
global scheduler distributes thread blocks to SM thread schedulers.
Third Generation Streaming Multiprocessor
The third generation SM introduces several architectural innovations that make it not only the
most powerful SM yet built, but also the most programmable and efficient.
512 High Performance CUDA cores
Each SM features 32 CUDA processors—a fourfold increase over prior SM designs. Each
CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit
(FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture
implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add
(FMA) instruction for both single and double precision arithmetic. FMA improves over a
multiply-add (MAD) instruction by doing the multiplication and addition with a single final
rounding step, with no loss of precision in the addition. FMA is more accurate than performing
the operations separately. GT200 implemented double precision FMA.
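The accuracy difference between FMA and MAD can be demonstrated even on the host: a separate multiply rounds its result before the add, discarding low-order bits that a fused operation keeps. A small experiment, emulating the fused single-rounding path with exact rational arithmetic:

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -27          # exactly representable in double precision
c = -(1.0 + 2.0 ** -26)

# MAD path: the multiply rounds first (losing the 2**-54 term), then adds.
mad = a * a + c

# FMA path emulated exactly: multiply and add with one (here: no) rounding.
fma_exact = Fraction(a) * Fraction(a) + Fraction(c)

print(mad)               # 0.0 -- the low-order term was rounded away
print(float(fma_exact))  # 2**-54, preserved by fusing the operations
```

This single-rounding behavior is exactly what IEEE 754-2008 specifies for FMA and what Fermi implements in hardware for both single and double precision.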
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result,
multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly
designed integer ALU supports full 32-bit precision for all instructions, consistent with standard
programming language requirements. The integer ALU is also optimized to efficiently support
64-bit and extended precision operations. Various instructions are supported, including
Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population
count.
KEPLER ARCHITECTURE
With the launch of the Fermi GPU in 2009, NVIDIA ushered in a new era in the high-performance
computing (HPC) industry, based on a hybrid computing model where CPUs and GPUs work
together on application workloads. In just a couple of years, NVIDIA Fermi GPUs came to power
some of the fastest supercomputers in the world as well as tens of thousands of research
clusters globally.
Now, with the new Kepler GK110 GPU, NVIDIA raises the bar for the HPC industry, yet again.
Comprising 7.1 billion transistors, the Kepler GK110 GPU is an engineering marvel created to
address the most daunting challenges in HPC. Kepler is designed from the ground up to
maximize computational performance with superior power efficiency. The architecture has
innovations that make hybrid computing dramatically easier, applicable to a broader set of
applications, and more accessible. Kepler GK110 GPU is a computational workhorse with
teraflops of integer, single precision, and double precision performance and the highest memory
bandwidth. The first GK110 based product will be the Tesla K20 GPU computing accelerator.
SMX - Next Generation Streaming Multiprocessor
At the heart of the Kepler GK110 GPU is the new SMX unit, which comprises several
architectural innovations that make it not only the most powerful Streaming Multiprocessor (SM)
NVIDIA has ever built but also the most programmable and power-efficient. It delivers more
processing performance and efficiency through a new, innovative streaming-multiprocessor design
that allows a greater percentage of die area to be devoted to processing cores rather than
control logic.
Dynamic Parallelism
Any kernel can launch another kernel and can create the necessary streams, events, and
dependencies needed to process additional work without the need for host CPU interaction. This
simplified programming model is easier to create, optimize, and maintain. It also creates a
programmer friendly environment by maintaining the same syntax for GPU launched workloads
as traditional CPU kernel launches. Dynamic Parallelism broadens what applications can now
accomplish with GPUs in various disciplines. Applications can launch small and medium sized
parallel workloads dynamically where it was too expensive to do so previously.
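The control flow of Dynamic Parallelism can be sketched as ordinary recursion: a kernel decides on its own to launch child work, with no host round-trip. The following is a plain-Python analogy of that flow, not CUDA code, and the interval-splitting workload is invented purely for illustration:

```python
def refine_kernel(interval, depth, results):
    """A 'kernel' that issues its own child launches when more work is needed.

    In CUDA Dynamic Parallelism these recursive launches would run entirely
    on the GPU; here they are modeled as plain recursive calls.
    """
    lo, hi = interval
    if depth == 0 or hi - lo <= 1:
        results.append(interval)          # fine enough: record this region
        return
    mid = (lo + hi) // 2
    # Child "kernel launches" issued from within the kernel, no host involved:
    refine_kernel((lo, mid), depth - 1, results)
    refine_kernel((mid, hi), depth - 1, results)

out = []
refine_kernel((0, 8), 2, out)
print(out)  # [(0, 2), (2, 4), (4, 6), (6, 8)]
```

Before Dynamic Parallelism, each level of such a subdivision required returning to the CPU to launch the next round of kernels, which is the round-trip cost the feature eliminates.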
Hyper-Q — Maximizing the GPU Resources
Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby
dramatically increasing GPU utilization and slashing CPU idle times. This feature increases
the total number of connections between the host and the Kepler GK110 GPU by allowing 32
simultaneous, hardware-managed connections, compared to the single connection available with
Fermi.
Hyper-Q is a flexible solution that allows connections for both CUDA streams and Message
Passing Interface (MPI) processes, or even threads from within a process. Existing applications
that were previously limited by false dependencies can see up to a 32x performance increase
without changing any existing code.
GPU IN MOBILES
Mobile devices are quickly becoming our most valuable personal computers. Whether we’re
reading email, surfing the Web, interacting on social networks, taking pictures, playing games, or
using countless apps, our smartphones and tablets are becoming indispensable. Many people are
also using mobile devices such as Microsoft’s Surface RT and Lenovo’s Yoga 11 because of
their versatile form-factors, ability to run business apps, physical keyboards, and outstanding
battery life.
Amazing new visual computing experiences are possible on mobile devices thanks to ever more
powerful GPU subsystems. A fast GPU allows rich and fluid 2D or 3D user interfaces, high
resolution display output, speedy Web page rendering, accelerated photo and video editing, and
more realistic 3D gaming. Powerful GPUs are also becoming essential for auto infotainment
applications, such as highly detailed and easy to read 3D navigation systems and digital
instrument clusters, or rear-seat entertainment and driver assistance systems.
Each new generation of NVIDIA® Tegra® mobile processors has delivered significantly higher
CPU and GPU performance while improving its architectural and power efficiency. Tegra
processors have enabled amazing new mobile computing experiences in smartphones and tablets,
such as full-featured Web browsing, console class gaming, fast UI and multitasking
responsiveness, and Blu-ray quality video playback.
At CES 2013, NVIDIA announced the Tegra 4 processor, the world's first SoC with four ARM
Cortex-A15 CPU cores, plus a battery-saver Cortex-A15 companion core and a 72-core NVIDIA GPU.
With its increased number of GPU cores, faster clocks, and architectural efficiency
improvements, the Tegra 4 processor's GPU delivers approximately 20x the GPU horsepower of
the Tegra 2 processor. The Tegra 4 processor also combines its CPUs, GPU, and ISP to create a
Computational Photography Engine for near-real-time HDR still frame photo and 1080p 30
video capture.
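To make the idea of a Computational Photography Engine concrete, the following is a toy per-pixel exposure-fusion sketch of the kind of workload such an engine accelerates. The weighting scheme and all function names here are illustrative assumptions, not NVIDIA's actual Chimera pipeline.

```python
# Toy HDR exposure fusion: merge a short and a long exposure per pixel,
# weighting each sample by how far it is from clipping (0.0 or 1.0).
# Hand-rolled illustration only -- not NVIDIA's Chimera implementation.

def weight(v):
    """Prefer well-exposed samples: weight falls off near 0.0 and 1.0."""
    return 1.0 - abs(2.0 * v - 1.0)

def fuse(short_exposure, long_exposure):
    """Per-pixel weighted merge of two exposures (values in 0..1)."""
    fused = []
    for s, l in zip(short_exposure, long_exposure):
        ws, wl = weight(s), weight(l)
        if ws + wl == 0.0:          # both samples clipped: fall back to average
            fused.append((s + l) / 2.0)
        else:
            fused.append((ws * s + wl * l) / (ws + wl))
    return fused

# A bright sky pixel is clipped (1.0) in the long exposure but usable in the
# short one; a shadow pixel is crushed (0.0) in the short exposure.
short = [0.6, 0.0]   # sky, shadow
long_ = [1.0, 0.4]
print(fuse(short, long_))   # recovers detail from whichever exposure kept it
```

On a real SoC this per-pixel merge runs across the GPU's many cores in parallel, which is why GPU throughput matters for near-real-time HDR.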
The Need for Fast Mobile GPUs
One of the most popular application categories that demands fast GPU processing is 3D games.
Mobile 3D games have evolved from simple 2D visuals to now rival console gaming experiences
and graphics quality. In fact, some games that take full advantage of Tegra 4’s GPU and CPU
cores are hard to distinguish graphically from PC games! Not only have the mobile games
evolved, but mobile gaming as an application segment is one of the fastest growing in the
industry today. Visually rich PC and console games such as Max Payne and Grand Theft Auto
III are now available on mobile devices.
High quality, high resolution retina displays (with resolutions high enough that the human eye
cannot discern individual pixels at normal viewing distances) are now being used in various
mobile devices. Such high resolution displays require fast GPUs to deliver smooth UI
interactions, fast Web page rendering, snappy high-res photo manipulation, and of course high
quality 3D gaming. Similarly, connecting a smartphone or tablet to an external high-resolution
4K screen absolutely requires a powerful GPU.
With two decades of GPU industry leadership, the mobile GPU in the NVIDIA® Tegra® 4
Family of mobile processors is architected to deliver the performance demanded by console-class
mobile games, modern user interfaces, and high-resolution displays, while reducing power
consumption to fall within mobile power budgets. The Tegra 4 processor is also claimed to have the most architecturally efficient GPU subsystem in any mobile SoC today.
Tegra 4 Family GPU Features and Architecture
The Tegra 4 processor’s GPU accelerates both 2D and 3D rendering. Although 2D rendering is
often considered a “given” nowadays, it’s critically important to the user experience. The 2D
engine of the Tegra 4 processor’s GPU provides all the relevant low-level 2D composition
functionality, including alpha-blending, line drawing, video scaling, BitBLT, color space
conversion, and screen rotations. Working in concert with the display subsystem and video
decoder units, the GPU also helps support 4K video output to high-end 4K displays.
The 3D engine is fully programmable, and includes high-performance geometry and pixel
processing capability enabling advanced 3D user interfaces and console-quality gaming
experiences. The GPU also accelerates Flash processing in Web pages and GPGPU (General
Purpose GPU) computing, as used in NVIDIA’s new Computational Photography Engine,
NVIDIA Chimera™ architecture, which implements near-real-time HDR photo and video capture, HDR panoramic image processing, and “Tap-to-Track” object tracking.
As shown in the Tegra 4 family diagram in Figure 1 below, the Tegra 4 processor includes a 72-core GPU subsystem. The Tegra 4 processor's GPU has 6x the number of shader processing cores of the Tegra 3 processor, which translates to roughly 3-4x delivered game performance and
sometimes even higher. The NVIDIA Tegra 4i processor uses the same GPU architecture as the
Tegra 4 processor, but a 60 core variant instead of 72 cores. Even at 60 cores it delivers an
astounding amount of graphics performance for mainstream smartphone devices.
ADVANTAGES
Antialiasing
• One of the major problems in computer graphics is that curved or angular lines exhibit jaggedness, almost like the steps on a staircase.
• This jaggedness in 3D graphics is often referred to by gamers simply as 'jaggies', but the effect is formally known as aliasing.
• Aliasing occurs because the image on our screen is only a pixelated sample of the original 3D information that the graphics card has calculated.
• So a technique originally called Full Scene Anti-Aliasing (FSAA), and now simply
Antialiasing (AA), can be used by our graphics card to make the jagged lines in computer
graphics appear smoother without having to increase our resolution.
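The bullets above can be sketched in code. The following minimal supersampling example (SSAA, the brute-force ancestor of modern AA modes) renders a diagonal edge once with one sample per pixel and once with 4x4 samples; the scene and all names are illustrative assumptions.

```python
# Illustrative supersampling antialiasing (SSAA) on a diagonal edge.
# The "scene" is a half-plane: everything below the line y = x is black (0.0),
# everything above is white (1.0).

def scene(x, y):
    """Return the scene colour at a continuous point (x, y)."""
    return 1.0 if y > x else 0.0

def render(width, height, samples_per_axis):
    """Render with samples_per_axis^2 samples per pixel, averaged together."""
    n = samples_per_axis
    image = []
    for py in range(height):
        row = []
        for px in range(width):
            total = 0.0
            for sy in range(n):
                for sx in range(n):
                    # Sample positions spread evenly inside the pixel.
                    x = px + (sx + 0.5) / n
                    y = py + (sy + 0.5) / n
                    total += scene(x, y)
            row.append(total / (n * n))
        image.append(row)
    return image

aliased = render(8, 8, 1)   # one sample per pixel: only pure 0.0 or 1.0
smooth = render(8, 8, 4)    # 4x4 supersampling: intermediate greys appear

print(sorted({v for row in aliased for v in row}))  # → [0.0, 1.0]
print(sorted({v for row in smooth for v in row}))   # → [0.0, 0.375, 1.0]
```

The intermediate grey (0.375) on edge pixels is exactly what makes the jagged staircase look smooth, without raising the display resolution.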
Anisotropic Filtering
• Antialiasing will not resolve another graphical glitch common in 3D graphics: blurry surfaces on more distant objects.
• A technique called Texture Filtering can help to raise the level of detail of distant
textures by increasing the numbers of texture samples to construct the relevant pixels on
the screen.
• These texture filtering methods are 'isotropic' methods, which means that they use a
filtering pattern which is square.
• However texture blurriness occurs on textures which are receding into the distance,
which means they require a non-square (typically rectangular or trapezoidal) filtering
pattern.
• Such a pattern is referred to as 'anisotropic', which leads us to the technique called Anisotropic Filtering (AF).
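The footprint idea can be sketched as follows. A screen pixel on a floor receding into the distance maps to a long, thin region of texture space; the example below averages a stripe texture over a square versus an elongated footprint. The texture, footprint sizes, and function names are illustrative assumptions, not any real GPU's sampling rules.

```python
# Sketch of isotropic vs anisotropic texture-filter footprints.

def texel(u, v):
    """Stripe texture: alternating 0/1 bands, 4 texels wide, along u."""
    return (u // 4) % 2

def sample_footprint(u0, v0, du, dv):
    """Average all texels in a du x dv footprint anchored at (u0, v0)."""
    total = 0
    for v in range(v0, v0 + dv):
        for u in range(u0, u0 + du):
            total += texel(u, v)
    return total / (du * dv)

# A distant pixel's true footprint: 16 texels long in the receding
# direction, 2 texels wide.  Anisotropic filtering uses that shape:
aniso = sample_footprint(0, 0, 16, 2)     # 0.5 -- correct stripe average

# An isotropic (square) filter cannot match the shape.  A small square
# misses most of the footprint (aliasing); a 16x16 square would instead
# average in rows of texture the pixel never covers (blur).
iso_small = sample_footprint(0, 0, 2, 2)  # 0.0 -- wrong, sees only one band
```

The anisotropic footprint returns the correct 0.5 coverage of the stripes, while the small square footprint lands entirely inside one band, which is the texture shimmering/blurring trade-off AF resolves.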
APPLICATIONS
BIOINFORMATICS
Genomic researchers employ GPU-accelerated HPC to reduce processing time.
Earlier, storage and speed problems existed due to the imbalance between DNA sequencing and computer technology.
Second-generation sequencing machines, accelerated by GPUs, enabled genomes to be mapped 50,000 times faster than a decade ago.
The high memory bandwidth and multi-core processing power of GPUs led to developments in this field.
ELECTRONIC DESIGN AUTOMATION
The increase in VLSI design complexity poses a significant challenge to EDA.
The semiconductor industry always seeks faster simulation solutions.
When simulation performance falls short, it leads to incomplete verification and missed functional bugs.
Recent trends in high-performance computing use GPUs as massively parallel co-processors to speed up intensive EDA simulations such as
1. Verilog Simulation
2. Computational Lithography
3. SPICE Circuit Simulation
DEFENCE AND INTELLIGENCE
The D&I community relies on accurate and timely information in its strategic and day-to-day activities.
Image Processing: 100 million fingerprint images are stored in the FBI's database.
GPUs accelerate the image processing workflow, including georectification, filtering algorithms, 3D reconstruction, etc.
Persistent Video Surveillance: defence departments in every country have surveillance cameras fitted almost everywhere.
GPUs are a great tool for achieving real-time performance in video processing and analytics algorithms.
MEDICAL IMAGING
Only 2% of doctors can operate on a beating heart.
Patients stand to lose one IQ point for every 10 minutes the heart is stopped.
GPUs enable real-time motion compensation that virtually 'stops' the beating heart for surgeons.
Image reconstruction becomes much faster when devices are embedded with GPUs.
CHALLENGES FACED
Scaling the performance and capabilities of all parallel-processor chips, including GPUs, is
challenging. First, as power supply voltage scaling has diminished, future architectures must
become more inherently energy efficient. Second, the road map for memory bandwidth
improvements is slowing down and falling further behind the computational capabilities
available on die. Third, even after 40 years of research, parallel programming is far from a
solved problem. Addressing these challenges will require research innovations that depart from
the evolutionary path of conventional architectures and programming systems.
Power and energy
Because of leakage constraints, power supply voltage scaling has largely stopped, causing energy per operation to now scale only linearly with process feature size. Meanwhile, the number of
devices that can be placed on a chip continues to increase quadratically with decreasing feature
size. The result is that all computers, from mobile devices to supercomputers, have or will
become constrained by power and energy rather than area. Because we can place more
processors on a chip than we can afford to power and cool, a chip’s utility is largely determined
by its performance at a particular power level, typically 3 W for a mobile device and 150 W for a
desktop or server component.
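The power-limited argument above reduces to simple arithmetic: the achievable operation rate is bounded by the power budget divided by the energy per operation. The sketch below works through it with an assumed (illustrative, not measured) cost of 10 pJ per operation.

```python
# Back-of-the-envelope throughput bound for a power-limited chip.
# All figures are illustrative assumptions, not measurements.

def max_ops_per_second(power_budget_watts, energy_per_op_joules):
    """Throughput ceiling: ops/s = power budget / energy per operation."""
    return power_budget_watts / energy_per_op_joules

PICO = 1e-12

# Suppose an operation costs 10 pJ on some process node.
mobile = max_ops_per_second(3.0, 10 * PICO)     # ~3 W mobile budget
desktop = max_ops_per_second(150.0, 10 * PICO)  # ~150 W desktop budget

print(f"mobile:  {mobile:.1e} ops/s")   # 3.0e+11 ops/s = 300 Gops/s
print(f"desktop: {desktop:.1e} ops/s")  # 1.5e+13 ops/s = 15 Tops/s

# Halving energy per op doubles achievable throughput at fixed power --
# energy efficiency, not transistor count, now sets performance.
```

This is why a chip's utility is judged by performance at a power level: adding more cores beyond the budget yields silicon that cannot be powered.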
Programmability
The technological challenges we have outlined here and the architectural choices that they necessitate present several challenges for programmers. Current programming practice predominantly focuses on sequential, homogeneous machines with a flat view of memory.
Contemporary machines have already moved away from this simple model, and future machines
will only continue that trend. Driven by the massive concurrency of GPUs, these trends present
many research challenges.
Locality. The increasing time and energy costs of distant memory accesses demand increasingly deep memory hierarchies. Performance-sensitive code must exercise some explicit control over the movement of data within this hierarchy.
However, few programming systems provide any means for programs to express the
programmer’s knowledge about the data access patterns’ structure or to explicitly control data
placement within the memory hierarchy. Current architectures encourage a flat view of memory
with implicit data caching occurring at multiple levels of the machine.
As is abundantly clear in today’s clusters, this approach is not sustainable for scalable machines.
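One classic way a programmer exercises the explicit data-movement control discussed above is cache blocking (tiling). The sketch below reorders a matrix multiply tile by tile so that small sub-blocks stay "hot" while they are reused; it is a pure-Python illustration under assumed tile sizes, not a tuned kernel.

```python
# Explicit locality control: a cache-blocked (tiled) matrix multiply.

def matmul_naive(a, b, n):
    """Textbook triple loop: streams through b with poor reuse."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, n, tile=2):
    """Same arithmetic, reordered to proceed tile by tile."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Within this triple of tiles, the same small sub-blocks
                # of a and b are touched repeatedly: high temporal locality.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c

n = 4
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float((i + j) % n) for j in range(n)] for i in range(n)]
assert matmul_blocked(a, b, n) == matmul_naive(a, b, n)  # same result
```

The point of the example is that the reordering changes nothing about the answer, only about which data is resident when, which is precisely the knowledge current programming systems give programmers few ways to express.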
Memory model. Scaling memory systems will require relaxing traditional notions of memory
consistency and coherence to facilitate greater memory-level parallelism. Although full cache
coherence is a convenient abstraction that can aid in programming productivity, the cost of
coherence and its protocol traffic even at the chip level makes supporting it for all memory
accesses unattractive. Programming systems must let programmers reason about memory
accesses with different attributes and behaviors, including some that are kept coherent and others
that are not.
CONCLUSION
From the introduction of the first 3D accelerator in 1996, these units have come a long way to be truly called a “Graphics Processing Unit”. It is no wonder that this piece of hardware is often referred to as an exotic product as far as computer peripherals are concerned. Observing the current pace of GPU development, we can conclude that we will see better and faster GPUs in the near future, with a wide variety of applications. The role of GPUs in scientific computing is increasing, and the GPU can therefore be considered a co-processor rather than just a special-purpose unit for graphics.
REFERENCES
David Luebke, “GPU Computing: Past, Present and Future”, GPU Technology Conference, San Francisco, October 11-14, 2011.
Stephen W. Keckler et al., “GPUs and the Future of Parallel Computing”, IEEE Micro, 2011.
K. Zhang and J. U. Kang, “Graphics Processing Unit Based Ultra High Speed Techniques”, Volume 18, Issue 4, 2012.
Ajit Datar, “Graphics Processing Unit Architecture (GPU Arch) with focus on NVIDIA GeForce 6800 GPU”, 14 April 2008.
Peter Geldof, “Generic Computing on GPU”, University of Amsterdam, Netherlands, May 2007.
Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008.