Arvind Krishnamurthy (based on slides from Tom Anderson ... · Life without random writes • ...
GFS
Arvind Krishnamurthy (based on slides from Tom Anderson & Dan Ports)
Google Stack
– GFS: large-scale storage for bulk data
– Chubby: Paxos storage for coordination
– BigTable: semi-structured data storage
– MapReduce: big data computation on key-value pairs
– MegaStore, Spanner: transactional storage with geo-replication
GFS
• Needed: distributed file system for storing results of web crawl and search index
• Why not use NFS? (That is, some existing distributed file system.)
GFS
• Needed: distributed file system for storing results of web crawl and search index
• Why not use NFS?
  – very different workload characteristics!
  – design GFS for Google apps, Google apps for GFS
• Requirements:
  – Fault tolerance, availability, throughput, scale
  – Concurrent streaming reads and writes
GFS Workload
• Producer/consumer
  – Hundreds of web crawling clients
  – Periodic batch analytic jobs like MapReduce
  – Throughput, not latency
• Big data sets (for the time):
  – 1000 servers, 300 TB of data stored
• Later: BigTable tablet log and SSTables
• Even later: Workload now?
GFS Workload
• Few million 100 MB+ files
  – Many are huge
• Reads:
  – Mostly large streaming reads
  – Some sorted random reads
• Writes:
  – Most files written once, never updated
  – Most writes are appends, e.g., concurrent workers
GFS Interface
• app-level library
  – not a kernel file system
  – not a POSIX file system
• create, delete, open, close, read, write, append
  – Metadata operations are linearizable
  – File data eventually consistent (stale reads)
• Inexpensive file, directory snapshots
Life without random writes
• Results of a previous crawl:
    www.page1.com -> www.my.blogspot.com
    www.page2.com -> www.my.blogspot.com
• New results: page2 no longer has the link, but there is a new page, page3:
    www.page1.com -> www.my.blogspot.com
    www.page3.com -> www.my.blogspot.com
• Option: delete old record (page2); insert new record (page3)
  – requires locking, hard to implement
• GFS: append new records to the file atomically
GFS Architecture
• each file stored as 64 MB chunks
• each chunk on 3+ chunkservers
• single master stores metadata
“Single” Master Architecture
• Master stores metadata:
  – File namespace, filename -> chunk list
  – chunk ID -> list of chunkservers holding it
  – Metadata stored in memory (~64 B/chunk)
• Master does not store file contents
  – All requests for file data go directly to chunkservers
• Hot standby replication using shadow masters
  – Fast recovery
• All metadata operations are linearizable
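The two metadata maps and the in-memory budget above can be sketched in a few lines of Python. This is an illustrative sketch only: `ToyMaster`, `add_chunk`, and `lookup` are hypothetical names, the 64 MB chunk size and ~64 B/chunk figure come from the slides, and the memory estimate counts per-chunk metadata only (no namespace, no replica lists).

```python
CHUNK_SIZE = 64 * 2**20      # 64 MB chunks (from the slides)
METADATA_PER_CHUNK = 64      # ~64 bytes of master metadata per chunk (from the slides)


def chunk_index(byte_offset: int) -> int:
    """Index of the chunk within a file that holds the given byte offset."""
    return byte_offset // CHUNK_SIZE


def chunk_count(size_bytes: int) -> int:
    """Number of chunks needed to hold size_bytes of data (rounded up)."""
    return -(-size_bytes // CHUNK_SIZE)


def chunk_metadata_bytes(size_bytes: int) -> int:
    """Rough master memory for per-chunk metadata only."""
    return chunk_count(size_bytes) * METADATA_PER_CHUNK


class ToyMaster:
    """Hypothetical in-memory master: holds the two maps, never file contents."""

    def __init__(self):
        self.file_chunks = {}    # filename -> list of chunk IDs
        self.chunk_servers = {}  # chunk ID -> list of chunkservers holding it

    def add_chunk(self, filename, chunk_id, servers):
        self.file_chunks.setdefault(filename, []).append(chunk_id)
        self.chunk_servers[chunk_id] = list(servers)

    def lookup(self, filename, byte_offset):
        """Return (chunk ID, replica locations); the client then talks to chunkservers."""
        chunk_id = self.file_chunks[filename][chunk_index(byte_offset)]
        return chunk_id, self.chunk_servers[chunk_id]
```

Under these assumptions (and binary units), 300 TB of data is about 4.9 million chunks and roughly 300 MB of per-chunk metadata, which is why keeping it all in one machine's memory was plausible.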
Master Fault Tolerance
• One master, set of replicas
  – Master chosen by Chubby
• Master logs (some) metadata operations
  – Changes to namespace, ACLs, file -> chunk IDs
  – Not chunk ID -> chunkserver; why not?
• Replicate operations at shadow masters and log to disk, then execute op
• Periodic checkpoint of master in-memory data
  – Allows master to truncate log, speed recovery
  – Checkpoint proceeds in parallel with new ops
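The log-then-execute rule and the checkpoint/truncate cycle can be illustrated with a toy model. Everything here is an assumption made for illustration: the class `LoggedMaster`, the three operation kinds, and the JSON checkpoint format are hypothetical, the "log" is an in-memory list standing in for a replicated on-disk log, and the checkpoint is taken synchronously rather than in parallel with new operations.

```python
import json


class LoggedMaster:
    """Toy master that logs each namespace change before applying it."""

    def __init__(self):
        self.namespace = {}      # filename -> list of chunk IDs
        self.log = []            # stand-in for the replicated, on-disk operation log
        self.checkpoint = "{}"   # serialized snapshot of the namespace

    def apply(self, op):
        kind, filename, chunk_id = op
        if kind == "create":
            self.namespace[filename] = []
        elif kind == "add_chunk":
            self.namespace[filename].append(chunk_id)
        elif kind == "delete":
            self.namespace.pop(filename, None)

    def mutate(self, op):
        self.log.append(op)      # 1. record (and, in the real system, replicate) the op
        self.apply(op)           # 2. only then execute it

    def take_checkpoint(self):
        """Snapshot in-memory state; everything logged so far can be dropped."""
        self.checkpoint = json.dumps(self.namespace)
        self.log.clear()

    @classmethod
    def recover(cls, checkpoint, log):
        """Rebuild state from the last checkpoint plus the log suffix."""
        master = cls()
        master.namespace = json.loads(checkpoint)
        master.checkpoint = checkpoint
        for op in log:
            master.apply(op)
        master.log = list(log)
        return master
```

Recovery time in this model is proportional to the log suffix written since the last checkpoint, which is the point of truncating the log.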
Handling Write Operations
• Mutation is write or append
• Goal: minimize master involvement
• Lease mechanism
  – Master picks one replica as primary; gives it a lease
  – Primary defines a serial order of mutations
• Data flow decoupled from control flow
Write Operations
• Application originates write request
• GFS client translates request from (fname, data) --> (fname, chunk-index) and sends it to master
• Master responds with chunk handle and (primary + secondary) replica locations
• Client pushes write data to all locations; data is stored in chunkservers’ internal buffers
• Client sends write command to primary
Write Operations (contd.)
• Primary determines serial order for data instances stored in its buffer and writes the instances in that order to the chunk
• Primary sends serial order to the secondaries and tells them to perform the write
• Secondaries respond to the primary
• Primary responds back to client
• If write fails at one of the chunkservers, client is informed and retries the write/append, but another client may read stale data from chunkserver
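The separation of data flow (push bytes to every replica) from control flow (primary picks the order) can be sketched as follows. This is a simplified, failure-free model with hypothetical names (`Replica`, `PrimaryReplica`, `client_write`, `data_id`); leases, the master lookup, and error handling are omitted.

```python
class Replica:
    """Toy chunkserver replica of a single chunk."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}   # data_id -> bytes pushed by clients, not yet applied
        self.chunk = []    # mutations applied to the chunk, in serial order

    def push(self, data_id, data):
        """Data flow: stash the bytes; nothing is written to the chunk yet."""
        self.buffer[data_id] = data

    def apply(self, data_id):
        """Control flow: move buffered data into the chunk."""
        self.chunk.append(self.buffer.pop(data_id))


class PrimaryReplica(Replica):
    """The lease holder: the order in which it handles write commands is the serial order."""

    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries

    def write(self, data_id):
        self.apply(data_id)                  # primary applies first
        for secondary in self.secondaries:   # then tells secondaries to apply in the same order
            secondary.apply(data_id)
        return "ok"                          # reply to the client once all have applied


def client_write(primary, data_id, data):
    """Push data to every replica, then send the write command to the primary."""
    for replica in [primary, *primary.secondaries]:
        replica.push(data_id, data)
    return primary.write(data_id)
```

Because only the primary decides when each buffered mutation is applied, the order in which data happened to arrive at each replica does not matter.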
At Least Once Append
• If failure at primary or any replica, retry append (at new offset)
  – Append will eventually succeed!
  – May succeed multiple times!
• App client library responsible for
  – Detecting corrupted copies of appended records
  – Ignoring extra copies (during streaming reads)
• Why not append exactly once?
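One way a reader-side library could cope with at-least-once semantics is to give every record a unique ID and a checksum, then filter while streaming. The sketch below assumes exactly that; the record layout `(record_id, payload, crc)` and the function names are hypothetical, not the actual GFS client library format.

```python
import zlib


def make_record(record_id: str, payload: bytes):
    """Writer side: attach a checksum covering the ID and the payload."""
    return (record_id, payload, zlib.crc32(record_id.encode() + payload))


def read_records(file_records):
    """Reader side: skip corrupted copies and duplicates during a streaming read."""
    seen = set()
    for record_id, payload, crc in file_records:
        if zlib.crc32(record_id.encode() + payload) != crc:
            continue            # corrupted or partial copy left by a failed append
        if record_id in seen:
            continue            # extra copy produced by a retried append
        seen.add(record_id)
        yield record_id, payload
```

With this filtering, a retried append that lands twice is harmless to the application, which is what lets the storage layer avoid the coordination needed for exactly-once appends.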
Question
Does the BigTable tablet server use “at least once append” for its operation log?
Caching
• GFS caches file metadata on clients
  – Ex: chunk ID -> chunkservers
  – Used as a hint: invalidate on use
  – TB file => 16K chunks
• GFS does not cache file data on clients
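The "TB file => 16K chunks" figure and the hint-style cache can be sketched as below. The class `LocationCache` and the callback `ask_master` are hypothetical names, and "invalidate on use" is interpreted here as: drop an entry when using it turns out to fail, then ask the master again.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks (from the slides)


def chunks_in_file(size_bytes: int) -> int:
    """Number of chunk-location entries a client might cache for one file."""
    return -(-size_bytes // CHUNK_SIZE)


class LocationCache:
    """Client-side cache of chunk ID -> chunkservers, treated only as a hint."""

    def __init__(self, ask_master):
        self.ask_master = ask_master   # hypothetical callback: chunk_id -> list of servers
        self.entries = {}

    def locate(self, chunk_id):
        if chunk_id not in self.entries:
            self.entries[chunk_id] = self.ask_master(chunk_id)
        return self.entries[chunk_id]

    def invalidate(self, chunk_id):
        """Call when a cached location proved stale; the next locate() asks the master."""
        self.entries.pop(chunk_id, None)
```

For a 1 TB file, `chunks_in_file(2**40)` is 16384, so caching every location for even a very large file costs the client little memory and saves a master round trip per chunk.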
Garbage Collection
• File delete => rename to a hidden file
• Background task at master
  – Deletes hidden files
  – Deletes any unreferenced chunks
• Simpler than foreground deletion
  – What if chunkserver is partitioned during delete?
• Need background GC anyway
  – Stale/orphan chunks
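A toy version of delete-as-rename plus a background sweep might look like this. All names (`GcMaster`, `delete`, `collect`) are hypothetical, and the `grace` parameter is an assumption added for illustration: the slides do not say how long a hidden file survives before the sweep removes it.

```python
class GcMaster:
    """Toy master: deleting a file only hides it; a later sweep reclaims chunks."""

    def __init__(self):
        self.files = {}    # filename -> list of chunk IDs
        self.hidden = {}   # hidden name -> (deletion time, list of chunk IDs)

    def delete(self, name, now):
        """Foreground delete is just a rename to a hidden, timestamped name."""
        self.hidden[f".deleted.{name}.{now}"] = (now, self.files.pop(name))

    def collect(self, now, grace, chunks_on_servers):
        """Background sweep: drop old hidden files, then report unreferenced chunks.

        chunks_on_servers maps server name -> chunk IDs it reports holding.
        Returns server name -> sorted chunk IDs that server may delete.
        """
        expired = [h for h, (t, _) in self.hidden.items() if now - t >= grace]
        for hidden_name in expired:
            del self.hidden[hidden_name]
        live = {c for chunks in self.files.values() for c in chunks}
        live |= {c for _, chunks in self.hidden.values() for c in chunks}
        return {
            server: sorted(set(held) - live)
            for server, held in chunks_on_servers.items()
        }
```

The same sweep also reclaims orphan chunks that no file ever referenced, and a chunkserver that was partitioned during the delete simply learns about its garbage on a later sweep.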
Data Corruption
• Files stored on Linux, and Linux has bugs
  – Sometimes silent corruptions
• Files stored on disk, and disks are not fail-stop
  – Stored blocks can become corrupted over time
  – Ex: writes to sectors on nearby tracks
  – Rare events become common at scale
• Chunkservers maintain per-chunk CRCs (one per 64 KB block)
  – Local log of CRC updates
  – Verify CRCs before returning read data
  – Periodic revalidation to detect background failures
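Verifying checksums before returning read data can be sketched with Python's standard `zlib.crc32`. The 64 KB block size comes from the slides; the function names and the choice of CRC-32 from `zlib` are assumptions for illustration, and reads are assumed to have positive length.

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: one CRC per 64 KB block


def block_crcs(chunk: bytes):
    """Compute one CRC per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]


def verified_read(chunk: bytes, crcs, offset: int, length: int) -> bytes:
    """Check every block the read touches before returning any data."""
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != crcs[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return chunk[offset:offset + length]
```

Only the blocks overlapping the requested range are checked, so a small read pays for at most a couple of 64 KB checksums; the periodic revalidation pass would instead walk every block.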
Discussion
• Is this a good design?
• Can we improve on it?
• Will it scale to even larger workloads?
~15 years later
• Scale is much bigger:
  – now 10K servers instead of 1K
  – now 100 PB instead of 100 TB
• Bigger workload change: updates to small files!
• Around 2010: incremental updates of the Google search index
GFS -> Colossus
• GFS scaled to ~50 million files, ~10 PB
• Developers had to organize their apps around large append-only files (see BigTable)
• Latency-sensitive applications suffered
• GFS eventually replaced with a new design, Colossus
Metadata scalability
• Main scalability limit: single master stores all metadata
• HDFS has same problem (single NameNode)
• Approach: partition the metadata among multiple masters
• New system supports ~100M files per master and smaller chunk sizes: 1 MB instead of 64 MB
Reducing Storage Overhead
• Replication: 3x storage to tolerate the loss of two copies
• Erasure coding more flexible: m pieces, n check pieces
  – e.g., RAID-5: 2 disks, 1 parity disk (XOR of other two) => 1 failure w/ only 1.5x storage
• Sub-chunk writes more expensive (read-modify-write)
• After a failure: get all the other pieces, generate missing one
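The RAID-5 style example (two data pieces, one XOR parity piece) is small enough to show directly. The helper names below are hypothetical, and the pieces are assumed to have equal length.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length pieces."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(d0: bytes, d1: bytes) -> bytes:
    """Check piece for two data pieces: parity = d0 XOR d1."""
    return xor_bytes(d0, d1)


def recover_missing(surviving: bytes, parity: bytes) -> bytes:
    """After losing one data piece, XOR the survivor with the parity to rebuild it."""
    return xor_bytes(surviving, parity)
```

Three stored pieces protect two pieces of data, hence 1.5x storage for one tolerated failure. Recovery has to read all surviving pieces, and a small overwrite has to update the parity as well, which is the read-modify-write cost noted above.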
Erasure Coding
• 3-way replication: 3x overhead, 2 failures tolerated, easy recovery
• Google Colossus: (6,3) Reed-Solomon code, 1.5x overhead, 3 failures
• Facebook HDFS: (10,4) Reed-Solomon, 1.4x overhead, 4 failures, expensive recovery
• Azure: more advanced code (12,4), 1.33x, 4 failures, same recovery cost as Colossus
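The overhead figures above follow from a single ratio: with m data pieces and n check pieces, the system stores m + n pieces for every m pieces of data. A short check, using the hypothetical helper name `storage_overhead`:

```python
def storage_overhead(m: int, n: int) -> float:
    """Stored bytes per byte of user data for an (m, n) code: (m + n) / m."""
    return (m + n) / m


# (m, n) parameters as listed on the slide
schemes = {
    "3-way replication": (1, 2),   # one data copy plus two extra copies
    "Colossus (6,3)": (6, 3),
    "Facebook HDFS (10,4)": (10, 4),
    "Azure (12,4)": (12, 4),
}

for name, (m, n) in schemes.items():
    print(f"{name}: {storage_overhead(m, n):.2f}x")
```

This only reproduces the storage column; the number of failures tolerated and the recovery cost depend on the specific code and are taken from the slide rather than computed here.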