8/7/2019 Journeyman Tour
ACM Parallel Computing Tech Pack
Journeyman's Programming Tour
November, 2010

Parallel Computing Committee

Paul Steinberg, Intel, Co-Chair
Matthew Wolf, CERCS, Georgia Tech, Co-Chair
Judith Bishop, Microsoft
Clay Breshears, Intel
Barbara Mary Chapman, University of Houston
Daniel J. Ernst, University of Wisconsin-Eau Claire
Andrew FitzGibbon, Shodor Foundation
Dan Garcia, University of California, Berkeley
Benedict Gaster, AMD
Katherine Hartsell, Oracle
Tom Murphy, Contra Costa College
Steven Parker, NVIDIA
Charlie Peck, Earlham College
Jennifer Teal, Intel

Special thanks to Abi Sundaram, Intel
Table of Contents

Introduction
The Basics of Parallel Computing
    Parallelism
    Parallel computing
    Is concurrency the same as parallelism?
Parallel Decompositions
    Introduction
    Task decomposition
    Data decomposition
    Pipeline parallelism
Parallel Hardware
    Memory systems
    Processing characteristics
    Coordination
    Scalability
    Heterogeneous architectures
Parallel Programming Models, Libraries, and Interfaces
    Introduction
    Shared memory model programming
        Posix threads
        Win32 threads
        Java
        OpenMP
        Threading Building Blocks (TBB)
    Distributed memory model programming
        Message passing interface (MPI)
    General purpose GPU programming
        The Open Compute Language (OpenCL)
        CUDA
    Hybrid parallel software architectures
    Parallel languages
        Concurrent ML and Concurrent Haskell
Tools
    Compilers
    Autoparallelization
    Thread debuggers
    Tuners/performance profilers
    Memory tools
PARALLEL COMPUTING: JOURNEYMAN'S PROGRAMMING TOUR
INTRODUCTION

In every domain the tools that allow us to tackle the big problems, and execute the complex calculations that are necessary to solve them, are computer based. The evolution of computer architecture towards hardware parallelism means that software/computational parallelism has become a necessary part of the computer scientist's and engineer's core knowledge. Indeed, understanding and applying computational parallelism is essential to gaining anything like a sustained performance on modern computers. Going forward, performance computing will be even more dependent on scaling across many computing cores and on handling the increasingly complex nature of the computing task. This is true irrespective of whether the domain problem is predicting climate change, analyzing protein folding, or producing the latest animated blockbuster.
The Parallelism Tech Pack is a collection of guided references to help students, practitioners, and educators come to terms with the large and dynamic body of knowledge that goes by the name parallelism. We have organized it as a series of tours; each tour in the Tech Pack corresponds to one particular guided route through that body of knowledge. This particular tour is geared towards those who have some reasonable skills as practitioners of serial programming but who have not yet really explored parallelism in any coherent way. All of the tours in the Parallelism Tech Pack are living documents that provide pointers to resources for the novice and the advanced programmer, for the student and the working engineer. Future tours within the Tech Pack will address other topics.
The authors of this Tech Pack are drawn from both industry and academia. Despite this group's wide variety of experiences in utilizing parallel platforms, interfaces, and applications, we all agree that parallelism is now a fundamental concept for all of computing.
Scope of Tour: This tour approaches parallelism from the point of view of someone comfortable with programming but not yet familiar with parallel concepts. It was designed to ease into the topic with some introductory context, followed by links to references for further study. The topics presented are by no means exhaustive. Instead, the topics were chosen so that a careful reader should achieve a reasonably complete feel for the fundamental concepts and paradigms used in parallel computing across many platforms. Exciting areas like transactional memory, parallelism in functional languages, distributed shared memory constructs, and so on will be addressed in other tours but also
should be seen as building on the foundations put forth here.
Online Readings

Herb Sutter. 2005. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's J. 33, 3 (March). http://www.gotw.ca/publications/concurrency-ddj.htm.

James Larus. 2009. Spending Moore's dividend. Commun. ACM 52, 5 (May). http://doi.acm.org/10.1145/1506409.1506425.
1. THE BASICS OF PARALLEL COMPUTING

Parallelism is a property of a computation in which portions of the calculations are independent of each other, allowing them to be executed at the same time. The more parallelism exists in a particular problem, the more opportunity there is for using parallel systems and parallel language features to exploit this parallelism and gain an overall performance improvement. For example, consider the following pseudocode:

float a = E + A;
float b = E + B;
float c = E + C;
float d = E + D;
float r = a + b + c + d;

The first four assignments are independent of each other, and the expressions E + A, E + B, E + C, and E + D can all be calculated in parallel, that is, at the same time, which can potentially provide a performance improvement over executing them sequentially, that is, one at a time.
Parallel computing is defined as the simultaneous use of more than one processor to solve a problem, exploiting that program's parallelism to speed up its execution time.
Is concurrency the same as parallelism? While concurrency and parallelism are related, they are not the same! Concurrency mostly involves a set of programming abstractions to arbitrate communication between multiple processing entities (like processes or threads). These techniques are often used to build user interfaces and other asynchronous tasks. While concurrency does not preclude running tasks in parallel (and these abstractions are used in many types of parallel programming), it is not a necessary component. Parallelism, on the other hand, is concerned with the execution of multiple operations in parallel, that is, at the same time. The following diagram shows parallel programs as a subset of concurrent ones, together forming a subset of all possible programs:
[Diagram: three nested sets. Parallel programs are a subset of concurrent programs, which in turn are a subset of all programs.]
2. PARALLEL DECOMPOSITIONS

Introduction

There are a number of decomposition models that are helpful to think about when breaking computation into independent work. Sometimes it is clear which model to pick. At other times it is more of a judgment call, depending on the nature of the problem, how the programmer views the problem, and the programmer's familiarity with the available toolsets. For example, if you need to grade final exams for a course with hundreds of students, there are many different ways to organize the job with multiple graders so as to finish in the shortest amount of time.
Tutorials

The EPCC centre at Edinburgh has a number of good tutorials. The tutorials most useful in this context are the following.

Introduction to High Performance Computing and Decomposing the Potentially Parallel. http://www2.epcc.ed.ac.uk/computing/training/document_archive/.
Blaise Barney. An Introduction to Parallel Computing. Lawrence Livermore National Labs. https://computing.llnl.gov/tutorials/parallel_comp/.
Videos

Introduction to Parallel Programming Video Lecture Series: Part 02. Parallel Decomposition Methods. This video presents three methods for dividing computation into independent work: task decomposition, data decomposition, and pipelining. http://software.intel.com/en-us/courseware/course/view.php?id=381.
Introduction to Parallel Programming Video Lecture Series: Part 04. Shared Memory Considerations. This video provides the viewer with a description of the shared memory model of parallel programming. Implementation strategies for domain decomposition and task decomposition problems using threads within a shared memory execution environment are illustrated. http://software.intel.com/en-us/courseware/course/view.php?id=249.
Task decomposition, sometimes called functional decomposition, divides the problem by the type of task to be done and then assigns a particular task to each parallel worker. As an example, to grade hundreds of final exams, all test papers can be piled onto a table and a group of graders can each be assigned a single question or type of question to score, which is the task to be executed. So one grader has the task of scoring all essay questions, another grader would score the multiple choice questions, and another would score the true/false questions.
Videos

Introduction to Parallel Programming Video Lecture Series: Part 09. Implementing a Task Decomposition. http://software.intel.com/en-us/courseware/course/view.php?id=378. This video describes how to design and implement a task decomposition solution. An illustrative example for solving the 8 Queens problem is used. Multiple approaches are presented with the pros and cons for each described. After the approach is decided upon, code modifications using OpenMP are presented. Potential data race errors with a shared stack data structure holding board configurations (the tasks to be processed) are offered and a solution is found and implemented.
Data decomposition, sometimes called domain decomposition, divides the problem into elements to be processed and then assigns a subset of the elements to each parallel worker. As an example, to grade hundreds of final exams, all test papers can be stacked onto a table and divided into piles of equal size. Each grader would then take a stack of exams and grade the entire set of questions.
Tutorials

Blaise Barney. An Introduction to Parallel Computing. Lawrence Livermore National Lab. https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning.
Pipeline parallelism is a special form of task decomposition where the output from one process, or stage, as they are often called, serves directly as the input to the next process. This imposes a much more tightly coordinated structure on the program than is typically found in either plain task or data decompositions. As an example, to grade hundreds of final exams, all test papers can be piled onto a table and a group of graders arranged in a line. The first grader takes a paper from the pile, scores all questions on the first page and passes the paper to the second grader; the second grader receives a paper from the first grader and scores all the questions on the second page and passes the paper to the third grader, and so on, until the exam is fully graded.
3. PARALLEL HARDWARE

The previous section described some of the categories of parallel computation. In order to discuss parallel computing, however, we also need to address the ways that computing hardware itself expresses parallelism.
Memory systems. From a very basic architecture standpoint, there are several general classifications of parallel computing systems:

In a shared memory system, the processing elements all share a global memory address space. Popular shared memory systems include multicore CPUs and manycore GPUs (Graphics Processing Units).

In a distributed memory system, multiple individual computing systems with their own memory spaces are connected to each other through a network.

These system types are not mutually exclusive. Hybrid systems, the class into which modern computational clusters fall, consist of distributed memory nodes, each of which is a shared memory system.
Processing characteristics. In a parallel application, calculations are performed in the same way they are in the serial case, on a CPU of some kind. However, in parallel computing there are multiple processing entities (tasks, threads, or processes) instead of one. This results in a need for these entities to communicate values with each other as they compute. This communication happens across a network of some kind. Coordination, such as managing access to shared data structures in a threaded environment, is also a form of communication. In either case, communication adds a cost to the runtime of a program, in an amount that varies greatly based on the design of
the program. Ideally, parallel programmers want to minimize the amount of communication done (compared to the amount of computation).
Scalability. An important characteristic of parallel programs is their ability to scale, both in terms of the computing resources used by the program and the size of the data set processed by the program. There are two types of scaling we consider when analyzing parallel programs: strong and weak scaling. Strong scaling examines the behavior of the program when the size of the data set is held constant while the number of processing units increases. Weak scaling examines what happens when the size of the data set is increased proportionally as the number of processing units increases. Generally speaking, it is easier to design parallel programs that do well with weak scaling than it is to design programs that do well with strong scaling.
Heterogeneous architectures (e.g., IBM's Cell architecture, AMD's Fusion architecture, and Intel's Sandy Bridge architecture). Heterogeneous systems may consist of many different devices, each with its own capabilities and performance properties, all exposed within a single system. While such systems are not new (embedded system-on-a-chip designs have been around for over two decades), these architectures are becoming more prevalent in mainstream desktop and supercomputing environments. This is due to the emergence of accelerators such as the IBM Cell Broadband Engine and, more recently, the wide adoption of the general-purpose computing on graphics processing units (GPGPU) programming model, where CPUs and GPUs are connected to form a single system. NVIDIA's Compute Unified Device Architecture (CUDA) devices are the most common GPGPUs in use currently.
4. PARALLEL PROGRAMMING MODELS, LIBRARIES, AND INTERFACES

Introduction

This material is grouped by parallel programming model. The first section covers libraries and interfaces designed to be used in a shared memory model; the second covers tools for the distributed memory model; and the third covers tools for the GPGPU model. Another component of this tour covers hybrid models where two or more of these models may be combined into a single parallel application.
Shared memory model programming.

Posix threads are a standard set of threading primitives: a low-level threading method which underlies many of the more modern threading abstractions like OpenMP and TBB. The following are some resources to assist you in understanding Posix threads better.
Tutorials

POSIX thread (pthread) libraries. http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html.
Books

David Butenhof. 1997. Programming with POSIX Threads, Addison-Wesley. http://www.amazon.com/Programming-POSIX-Threads-David-Butenhof/dp/0201633922

This book offers an in-depth description of the IEEE operating system interface standard, POSIX (Portable Operating System Interface) threads, commonly called Pthreads. It's written for experienced C programmers, but assumes no previous knowledge of threads, and explains basic concepts such as asynchronous programming, the life cycle of a thread, and synchronization. Abbreviated Publisher's Abstract.
Bradford Nichols, Dick Buttlar, and Jacqueline Proulx Farrell. 1996. Pthreads Programming: A POSIX Standard for Better Multiprocessing, O'Reilly. http://oreilly.com/catalog/9781565921153

POSIX threads, or pthreads, allow multiple tasks to run concurrently within the same program. This book discusses when to use threads and how to make them efficient. It features realistic examples, a look behind the scenes at the implementation and performance issues, and special topics such as DCE and real-time extensions. Abbreviated Publisher's Abstract.
Joe Duffy. 2008. Concurrent Programming on Windows, Addison-Wesley. http://www.amazon.com/Concurrent-Programming-Windows-Joe-Duffy/dp/032143482X

This book offers an in-depth description of the issues with concurrency, introducing general mechanisms and techniques, and covering details of implementations within the .NET framework on Windows. There are numerous examples of good and bad practice, and details on how to implement your own concurrent data structures and algorithms.
Win32 threads, also called native threading by Windows developers, are still the default method used by many to introduce parallelism into code in Windows environments. Native threading can be difficult to implement and maintain. Microsoft has a rich body of material available on the Microsoft Developer Network. Material that provides more information about threads follows.

Online Resources

Microsoft Developer Network. An online introduction to Windows threading concepts. http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx.
Books

Johnson M. Hart. 2010. Windows System Programming (4th ed.), Addison-Wesley. http://www.amazon.com/Windows-Programming-Addison-Wesley-Microsoft-Technology/dp/0321657748

This book contains extensive new coverage of 64-bit programming, parallelism, multicore systems, and other crucial topics. Johnson Hart's robust code examples have been debugged and tested in both 32-bit and 64-bit versions, on single and multiprocessor systems, and under Windows 7, Vista, Server 2008, and Windows XP. Hart covers Windows externals at the API level, presenting practical coverage of all the services Windows programmers need, and emphasizing how Windows functions actually behave and interact in real-world applications. Abbreviated Publisher's Abstract.
Java: Since version 5.0, concurrency support has been a fundamental component of the Java specification. Java follows a similar approach to POSIX and other threading APIs, introducing thread creation and synchronization primitives into the language as a high-level API, through the package java.util.concurrent. There are a number of approaches to introducing parallelism into Java code, but conventionally they follow a standard pattern. Each thread is created as an instance of the class Thread, defining a class that implements the Runnable interface (think of this as the POSIX entry point for a function), which must implement the method public void run(). Just like POSIX, the thread is terminated when the method returns.
There are a large number of resources to help with understanding Java threads better, and the following is just a small selection.
Tutorials

Oracle has a large number of Java online tutorials, including one that introduces Java threads. http://download.oracle.com/javase/tutorial/essential/concurrency/index.html.

For the developer new to Java and/or concurrency, the Java for Beginners portal provides an excellent set of tutorials, including one specifically on the threading model. http://www.javabeginner.com/learn-java/java-threads-tutorial.
Books

Scott Oaks and Henry Wong. 2004. Java Threads, O'Reilly. http://oreilly.com/catalog/9780596007829

This is a well-developed book that, while not the most up-to-date resource on the subject, provides an excellent reference guide.
OpenMP: OpenMP is a directive-based, shared memory parallel programming model. It is most useful for parallelizing independent loop iterations in both C and Fortran. New
facilities in OpenMP 3.0 allow for independent tasks to execute in parallel. To be used, OpenMP must be supported by your compiler. The standard is limited in scope on the types of parallelism that you can implement, but it is easy to use and a good starting point for learning parallel programming. Some resources to help with understanding OpenMP better are listed below.
Tutorials

Blaise Barney. OpenMP Tutorial, Lawrence Livermore National Lab. https://computing.llnl.gov/tutorials/openMP/. This excellent tutorial is geared to those who are new to parallel programming with OpenMP. Basic understanding of parallel programming in C/C++ or Fortran is assumed.
OpenMP exercises. Tim Mattson and Larry Meadows, Intel Corporation. This tutorial provides an excellent introduction to OpenMP, including code and examples. http://openmp.org/mp-documents/OMP_Exercises.zip and http://openmp.org/mp-documents/omp-hands-on-SC08.pdf.
Getting Started with OpenMP. Text-based tutorial; read and learn with examples. http://software.intel.com/en-us/articles/getting-started-with-openmp/.
An Introduction to OpenMP 3.0. This deck contains more advanced techniques (e.g., inclusion of wait statements) that would need more explanation to be used safely. https://iwomp.zih.tu-dresden.de/downloads/2.Overview_OpenMP.pdf.
Videos

An Introduction to Parallel Programming: Video Lecture Series. http://software.intel.com/en-us/courseware/course/view.php?id=224. This multipart introduction contains many units on OpenMP, and includes coding exercises and code samples.
Community sites

www.openmp.org. Contains the current and past OpenMP language specifications, lists of compilers that support OpenMP, references, and other resources.
Cheat sheets for C/C++ and FORTRAN:

C/C++: http://www.openmp.org/mp-documents/OpenMP3.0-SummarySpec.pdf.
FORTRAN: http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf.
Books

Barbara Chapman, Gabriele Jost, and Ruud van der Pas. 2007. Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press. ACM Members, read it here: http://learning.acm.org/books/book_detail.cfm?isbn=9780262533027&type=24.
Michael J. Quinn. 2004. Parallel Programming in C with MPI and OpenMP, McGraw-Hill.

This book addresses the needs of students and professionals who want to learn how to design, analyze, implement, and benchmark parallel programs in C using MPI and/or OpenMP. It introduces a design methodology with coverage of the most important MPI functions and OpenMP directives. It also demonstrates, through a wide range of examples, how to develop parallel programs that will execute efficiently on today's parallel platforms. Abbreviated Publisher's Abstract.

Clay Breshears. 2009. The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, O'Reilly. http://oreilly.com/catalog/9780596521547
This book contains numerous examples of applied OpenMP code. Written by an Intel engineer with over two decades of parallel and concurrent programming experience, The Art of Concurrency is one of the few resources to focus on implementing algorithms in the shared memory model of multicore processors, rather than just theoretical models or distributed memory architectures. The book provides detailed explanations and usable samples to help you transform algorithms from serial to parallel code, along with advice and analysis for avoiding mistakes that programmers typically make when first attempting these computations.
Rohit Chandra, Ramesh Menon, Leo Dagum, David Kohr, Dror Maydan, and Jeff McDonald. 2001. Parallel Programming in OpenMP, Morgan Kaufmann.

Aimed at the working researcher or scientific C/C++ or Fortran programmer, Parallel Programming in OpenMP both explains what the OpenMP standard is and how to use it to create software that takes full advantage of parallel computing. By adding a handful of compiler directives (or pragmas) in Fortran or C/C++, plus a few optional library calls, programmers can parallelize existing software without completely rewriting it. This book starts with simple examples of how to parallelize loops: iterative code that in scientific software might work with very large arrays. Sample code relies primarily on Fortran (the language of choice for high-end numerical software) with descriptions of the equivalent calls and strategies in C/C++. Abbreviated Publisher's Abstract.
Threading Building Blocks (TBB). Intel Threading Building Blocks (Intel TBB) is a threading library used to introduce parallelism into C/C++. TBB is a relatively easy
way to introduce loop-level parallelism, especially for programmers familiar with templated code. TBB is available both as an open source project and as a commercial product from the Intel Corporation.
Tutorials

Intel Threading Building Blocks Tutorial. http://www.threadingbuildingblocks.org/uploads/81/91/Latest%20Open%20Source%20Documentation/Tutorial.pdf. Written by Intel Corporation, this is a thorough introduction to the threading library. This tutorial teaches you how to use Intel Threading Building Blocks (Intel TBB), a library that helps you leverage multicore performance without having to be a threading expert.
Multicoreinfo.combringstogetheranumberofTBBtutorials.
http://www.multicoreinfo.com/2009/07/parprogpart6/.
Codeexamples
ThisTBB.orgwebsitecontainsathoroughsetofcodingexamples.
http://www.threadingbuildingblocks.org/codesamples.php.
CodeexamplesfromarecentcodingwithTBBcontest.
http://software.intel.com/enus/articles/codingwithinteltbbsweepstakes/.
CommunitysitesContainsproductannouncements(releasesandupdates),linkstocodesamples,blogs,
andforumsonTBB.http://www.threadingbuildingblocks.org.
IntelsiteforcommercialversionofIntelThreadingBuildingBlocks.
http://www.threadingbuildingblocks.com.
Books
James Reinders. 2007. Intel Threading Building Blocks, O'Reilly. http://oreilly.com/catalog/9780596514808
This guide explains how to maximize the benefits of multicore processors through a portable C++ library that works on Windows, Linux, Macintosh, and Unix systems. With it, you'll learn how to use Intel Threading Building Blocks (TBB) effectively for parallel programming, without having to be a threading expert. Written by James Reinders, Chief Evangelist of Intel Software Products, and based on the experience of Intel's developers and customers, this book explains the key tasks in multithreading and how to accomplish them with TBB in a portable and robust manner. Abbreviated Publisher's Abstract.
Clay Breshears. 2009. The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications, O'Reilly. http://oreilly.com/catalog/9780596521547
This book contains numerous examples of applied TBB codes.
Distributed memory model programming. The preceding libraries and interfaces assume that the results from one thread of the overall computation can be made directly available to any other thread. However, some parallel hardware (such as clusters) forbids direct access from one memory space to another. Instead, processes must cooperate by sending each other messages containing the data to be exchanged.
Message passing interface (MPI). MPI is a library specification that supports message passing between program images running on distributed memory machines, typically clusters of some type. A number of different organizations develop and support implementations of the MPI standard, which specifies interfaces for C/C++ and Fortran. However, bindings for Perl, Python, and many other languages also exist. MPI provides routines that manage the transmission of data from the memory space of one process to the memory space of another process. Distributed memory machines require the use of MPI or another message-passing library by the parallel program in order to use multiple processes running on more than one node.
Getting started: Start with the six basic commands:
MPI_Init(): Initialize the MPI world
MPI_Finalize(): Terminate the MPI world
MPI_Comm_rank(): Which process am I?
MPI_Comm_size(): How many processes exist?
MPI_Send(): Send data
MPI_Recv(): Receive data
Move on to more complex communication models as needed, that is, to collective communication (one-to-many, many-to-one, many-to-many) and/or to advanced communication techniques: synchronous vs. asynchronous communication, blocking vs. non-blocking communication.
Online Readings
Moodles with slides and code examples. NCSI parallel and distributed workshop. http://moodle.sc-education.org/course/category.php?id=17.
Tutorials
William Gropp, Rusty Lusk, Rob Ross, and Rajeev Thakur. 2005. Advanced MPI: I/O and One-Sided Communication. http://www.mcs.anl.gov/research/projects/mpi/tutorial/.
SuperComputing in Plain English (SIPE). http://www.oscer.ou.edu/Workshops/DistributedParallelism/sipe_distribmem_20090324.pdf.
Cheat sheets
http://wiki.sc-education.org/index.php/MPI_Cheat_Sheet.
Books
Peter Pacheco. 1997. Parallel Programming with MPI. http://www.amazon.com/Parallel-Programming-MPI-Peter-Pacheco/dp/1558603395
A hands-on introduction to parallel programming based on the Message-Passing Interface (MPI) standard, the de facto industry standard adopted by major vendors of commercial parallel systems. This textbook/tutorial, based on the C language, contains many fully developed examples and exercises. The complete source code for the examples is available in both C and Fortran 77. Students and professionals will find that the portability of MPI, combined with a thorough grounding in parallel programming principles, will allow them to program any parallel system, from a network of workstations to a parallel supercomputer. Abbreviated Publisher's Abstract.
General purpose GPU programming. In contrast to the threading models presented earlier, accelerator-based hardware parallelism (like GPUs) focuses on the fact that although results may be shareable, that is, can be sent from one part of a computation to another, the cost of accessing the memory may not be uniform; CPUs see CPU memory better than GPU memory, and vice versa.
The Open Compute Language (OpenCL) is an open standard for heterogeneous computing, developed by the Khronos OpenCL working group. Implementations are currently available from a broad selection of hardware and software vendors, including AMD, Apple, NVIDIA, and IBM.
OpenCL is intended as a low-level programming model, designed around the notion of a host application, commonly a CPU, driving a set of associated compute devices, where parallel computations can be performed.
A key design feature of OpenCL is its use of asynchronous command queues, associated with individual devices, that provide the ability to enqueue work (e.g., data transfers and parallel code execution) and to build up complex graphs describing the dependencies between tasks.
The execution model supports both data parallel and task parallel styles, but OpenCL was developed specifically with an eye toward today's many-core, throughput, GPU-style architectures, and hence exposes a complex memory structure that provides a number of levels of software-managed memories. This is in contrast to the traditional single-address-space model of languages like C and C++, backed by large caches on general-purpose CPUs.
The OpenCL standard is currently in its second iteration, at version 1.1, and includes both a C API, for programming the host, and a new C++ Wrapper API, added to 1.1 and intended to be used for OpenCL C++ development. By exposing multiple address spaces, OpenCL provides a very powerful programming model to access the full potential of many-core architectures, but this comes at the cost of abstraction!
This is particularly true in the case of performance portability, and it is often difficult to achieve good performance on two different architectures with the same source code. This can be even more evident between the different types of OpenCL devices (e.g., GPUs and CPUs). This should not come as a surprise, as the OpenCL specification itself states that it is a low-level programming language, and given that these devices can have very different types of compute capabilities, careful tuning is often required to get close to peak performance.
Online Resources
OpenCL 1.1 Specification (revision 33, June 11, 2010). http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
OpenCL 1.1 C++ Wrapper API Specification (revision 4, June 14, 2010). http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.1.pdf
OpenCL 1.1 Online Manual Pages. http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
OpenCL Quick Reference Card. http://www.khronos.org/opencl/.
Tutorial
An excellent beginner's "hello world" tutorial introduction using OpenCL 1.1's C++ API. http://developer.amd.com/GPU/ATISTREAMSDK/pages/TutorialOpenCL.aspx
Videos
ATI Stream OpenCL Technical Overview Video Series. http://developer.amd.com/DOCUMENTATION/VIDEOS/OPENCLTECHNICALOVERVIEWVIDEOSERIES/Pages/default.aspx
This five-part video tutorial series provides an excellent introduction to the basics of OpenCL, including its execution and memory models, and the OpenCL C device programming language.
Community sites
The Khronos Group's main page, http://www.khronos.org, keeps track of major events around OpenCL and its other languages such as OpenGL and WebGL, along with some useful discussion forums on these. http://www.khronos.org/opencl.
The websites http://www.beyond3d.com and Mark Harris's http://www.gpgpu.org are full of information about many-core programming, in particular the modern GPUs of AMD and NVIDIA, and provide vibrant discussions on OpenCL and CUDA (the details of this language will follow), among other interesting areas and topics.
There is an ever-growing set of examples that can be found all over the web, and each of the major vendors provides excellent examples with their corresponding SDKs.
CUDA. Compute Unified Device Architecture (CUDA) is NVIDIA's parallel computing architecture that enables acceleration in computing performance by harnessing the power of the GPU (graphics processing unit). CUDA is in essence a data parallel model, sharing a lot in common with the other popular GPGPU language OpenCL, where kernels (similar to functions) are executed over a 3D iteration space; each index is executed concurrently, possibly in parallel.
Online Resources
NVIDIA maintains a collection of featured tutorials, presentations, and exercises on the CUDA Developer Zone. http://developer.nvidia.com/object/cuda_training.html.
Online Readings
For details on NVIDIA CUDA hardware and the underlying programming models, the following articles are relevant:
Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A unified graphics and computing architecture, IEEE Micro 28, 2 (March), 39-55.
NVIDIA GF100. 2010. http://www.nvidia.com/object/IO_89569.htm. (Whitepaper.)
Code examples
The CUDA SDK includes numerous code examples, along with CUDA versions of popular libraries (cuBLAS and cuFFT). http://developer.nvidia.com/object/cuda_3_1_downloads.html
Books
David B. Kirk and Wen-Mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann. http://www.amazon.com/dp/0123814723
This is a recent and popular textbook for teaching CUDA. This book shows both student and professional alike the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs. Abbreviated Publisher's Abstract.
Hybrid parallel software architectures. Programs that use a hybrid parallel architecture combine two or more libraries/models/languages (see Section 3) into a single program; the motivation for this extra complexity is to allow a single parallel program image to harness additional computational resources.
The most common forms of hybrid models combine MPI with OpenMP or MPI with CUDA. MPI/OpenMP is appropriate for use on cluster resources where the nodes are multicore machines; MPI is used to move data and results among the distributed memories, and OpenMP is used to leverage the compute power of the cores on the individual nodes. MPI/CUDA is appropriate for use on cluster resources where the nodes are equipped with NVIDIA's GPGPU cards. Again, MPI is used to move data and results among the distributed memories, and CUDA is used to leverage the resources of each GPGPU card.
Online Resources
MPI/OpenMP: The Louisiana Optical Network Initiative (LONI) has a nice tutorial on building hybrid MPI/OpenMP applications. It can be found at https://docs.loni.org/wiki/Introduction_to_Programming_Hybrid_Applications_Using_OpenMP_and_MPI. This includes pointers to LONI's OpenMP as well as MPI tutorials.
MPI/CUDA: The National Center for Supercomputing Applications (NCSA) has a tutorial that includes information about this; see the section Combining MPI and CUDA in http://www.ncsa.illinois.edu/UserInfo/Training/Workshops/CUDA/presentations/tutorial-CUDA.html.
Parallel Languages
We touch here only briefly on the topic of inherently parallel languages. There are a variety of efforts, ranging from extensions to existing languages to radically new approaches. Many, such as Cilk or UPC, can be relatively easily understood in terms of the libraries and techniques described above. A fuller discussion of the variety of language efforts and tools will be in a further Tech Pack. Because it is sufficiently different, however, a quick look at how parallelism is incorporated into functional languages is helpful.
Concurrent ML and Concurrent Haskell: Functional programming languages, such as Standard ML and Haskell, provide a strong foundation for building concurrent and parallel programming abstractions, for the single reason that they are declarative. Being declarative in a parallel world, i.e., avoiding the issues (e.g., race conditions) that updating a global state for a shared memory model can cause, provides a strong foundation on which to build concurrency abstractions.
Concurrent ML is a high-level message-passing language that supports the construction of first-class synchronous abstractions called events, embedded into Standard ML. It provides a rich set of concurrency mechanisms built on the notion of spawning new threads that communicate via channels.
Concurrent Haskell is an extension to the functional language Haskell for describing the creation of threads that have the potential to execute in parallel with other computations. Unlike Concurrent ML, Concurrent Haskell provides a limited form of shared memory, introducing MVars (mutable variables) which can be used to atomically communicate information between threads. Unlike more relaxed shared memory models (e.g., see OpenMP and OpenCL, presented earlier), Concurrent Haskell's runtime system ensures that the operations for reading from and writing to MVars occur atomically.
Tutorials
Simon Peyton Jones and Satnam Singh. A Tutorial on Parallel and Concurrent Programming in Haskell. Lecture Notes from Advanced Functional Programming Summer School 2008. http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/AFP08-notes.pdf
Books
John H. Reppy. 1999. Concurrent ML, Cambridge University Press.
Simon Peyton Jones. 2007. Beautiful Concurrency. In Beautiful Code, edited by Greg Wilson, O'Reilly. http://research.microsoft.com/en-us/um/people/simonpj/papers/stm/index.htm#beautiful
5. TOOLS
There is a variety of tools available to assist programmers in creating, debugging, and running parallel codes. This section summarizes the categories of tools; a more exhaustive list of tools that run on different hardware and software platforms will be included in a subsequent addition to the Tech Pack.
Compilers
Many of the compiler families today, both commercial and open source, directly support some form of explicit parallelism (OpenMP, threads, etc.).
Autoparallelization
The holy grail of many-core support would be a compiler that could automatically extract parallelism at compile time. Unfortunately, this is still a work in progress. That said, a number of compilers can add utility through vectorization and the identification of obvious parallelism in simple loops.
Online Resources
http://en.wikipedia.org/wiki/Automatic_parallelization.
Thread debuggers
Intel Thread Checker: http://software.intel.com/en-us/intel-thread-checker/.
Intel Parallel Inspector: http://software.intel.com/en-us/intel-parallel-inspector/.
Microsoft Visual Studio 2010 tools: http://www.microsoft.com/visualstudio/en-us/.
Helgrind: http://valgrind.org/docs/manual/hg-manual.html is a Valgrind tool for detecting synchronization errors in C, C++, and Fortran programs that use the POSIX pthreads threading primitives. The main abstractions in POSIX pthreads are a set of threads sharing a common address space, thread creation, thread joining, thread exit, mutexes (locks), condition variables (inter-thread event notifications), reader-writer locks, spin locks, semaphores, and barriers.
Tuners/performance profilers
Intel VTune Performance Analyzer & Intel Thread Profiler 3.1 for Windows. The Thread Profiler component of VTune helps tune multithreaded applications for performance. The Intel Thread Profiler timeline view shows what the threads are doing and how they interact. http://software.intel.com/en-us/intel-vtune/.
Intel Parallel Amplifier. A tool to help find multicore performance bottlenecks without needing to know the processor architecture or assembly code. http://software.intel.com/en-us/intel-parallel-amplifier/.
Microsoft Visual Studio 2010 tools. http://www.microsoft.com/visualstudio/en-us/.
gprof: the GNU Profiler. http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html.
Memory tools
Hoard: http://www.hoard.org/. The Hoard memory allocator is a fast, scalable, and memory-efficient memory allocator. It runs on a variety of platforms, including Linux, Solaris, and Windows. Hoard is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors. No change to your source is necessary. Just link it in or set just one environment variable.