Spark at Bloomberg: Dynamically Composable Analytics
![Page 1: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/1.jpg)
Spark @ Bloomberg: Dynamic Composable Analytics
Partha Nageswaran, Sudarshan Kadambi
BLOOMBERG L.P.
![Page 2: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/2.jpg)
Bloomberg Spark Server
Spark Serverization at Bloomberg has culminated in the creation of the Bloomberg Spark Server
[Diagram: Bloomberg Spark Server components, labeled JVM: (Persistent) Spark Context, Request Handler, Request Processors, Declarative Query Request Processor, Function Transform Registry (FTR), Managed DataFrame Registry, Ingestion Manager]
![Page 3: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/3.jpg)
Spark Serverization – Motivation
• Stand-alone Spark Apps on isolated clusters pose challenges:
– Redundancy in:
» Crafting and managing of RDDs/DFs
» Coding of the same or similar types of transforms/actions
– Management of clusters, replication of data, etc.
– Analytics are confined to specific content sets, making Cross-Asset Analytics much harder
– Need to handle real-time ingestion in each App
[Diagram: Spark Apps on separate Spark Clusters, alongside a Spark Server]
![Page 4: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/4.jpg)
Dynamic Composable Analytics
• Compositional Analytics are commonplace in the Financial Sector
Decile Rank the 14-day Relative Strength Index of Active Equity Stocks:
DECILE(RSI(
Price, 14, ['IBM US Equity', 'VOD LN Equity', …]
))
• Price is data abstracted as a Spark DataFrame
• RSI, DECILE are building block analytics, expressible as Spark transforms and actions
![Page 5: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/5.jpg)
Dynamic Composable Analytics
• Another use case may want to compose Percentile with RSI
Percentile Rank the 14-day Relative Strength Index of Active Equity Stocks:
PERCENTILE(RSI(
Price, 14, ['IBM US Equity', 'VOD LN Equity', …]
))
• Or Percentile with ROC, etc. And the compositions may be arbitrarily complex
![Page 6: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/6.jpg)
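Dynamic Composable Analytics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// The 'col symbol syntax also needs the session implicits in scope (e.g. import spark.implicits._)

def RSI(df: DataFrame, period: Int = 14): DataFrame = {
  // Smoothed-moving-average weight for the i-th most recent row
  val smmaCoeff = udf((i: Double) => scala.math.pow(period - 1, i - 1) / scala.math.pow(period, i))
  // RSI from the weighted sums of up-moves (n) and down-moves (d)
  val rsi_from_rs = udf((n: Double, d: Double) => 100 - 100 * d / (d + n))
  val rsi_window = Window.partitionBy('id).orderBy('date.desc)
  df.withColumn("weight", smmaCoeff(row_number().over(rsi_window)))
    .withColumn("diff", 'value - lead('value, 1).over(rsi_window))
    .withColumn("U", when('diff > 0, 'diff).otherwise(0))
    .withColumn("D", when('diff < 0, abs('diff)).otherwise(0))
    .groupBy('id)
    .agg(rsi_from_rs(sum('U * 'weight), sum('D * 'weight)) as 'value)
}

def Decile(df: DataFrame): DataFrame = {
  df.withColumn("value", ntile(10).over(Window.orderBy('value.desc)))
}
Ack: Andrew Foster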
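Reading the RSI code on this slide as written, and assuming row 1 of each window is the most recent date (the window is ordered by date descending), the computation amounts to a smoothed moving average of up-moves and down-moves. The symbols p (period), v_i (value on the i-th most recent date), and the deltas are notation introduced here for illustration, not taken from the slides:

$$
w_i = \frac{(p-1)^{\,i-1}}{p^{\,i}}, \qquad \sum_{i \ge 1} w_i = \frac{1}{p} \sum_{i \ge 0} \left(\frac{p-1}{p}\right)^{i} = 1
$$

$$
\Delta_i = v_i - v_{i+1}, \qquad U = \sum_i w_i \max(\Delta_i, 0), \qquad D = \sum_i w_i \max(-\Delta_i, 0)
$$

$$
\mathrm{RSI} = 100 - \frac{100\,D}{D + U} = 100 - \frac{100}{1 + U/D}
$$

The last equality shows that rsi_from_rs matches the usual RSI definition with relative strength RS = U/D.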
![Page 7: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/7.jpg)
Function Transform Registry
• Maintain a Registry of Analytic functions (FTR) with functions expressed as Parametrized Spark Transforms and Actions
• Functions can compose other functions, along with additional transforms, within the Registry
• Registry supports 'bind' and 'lookup' operations
Function Transform Registry (FTR)
FUNCTIONS | SPARK IMPL.
Decile | …
Percentile | …
… | …
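The 'bind' and 'lookup' operations can be sketched without Spark. This is a minimal illustration, not the Bloomberg implementation: `Frame` is a toy stand-in for a DataFrame, and `FunctionTransformRegistry`, `compose`, `netChange`, and `rank` are hypothetical names (the last two are simple stand-ins for analytics such as RSI and DECILE).

```scala
object FtrSketch {
  // Toy stand-in for a DataFrame: (id, value) rows.
  type Frame = Seq[(String, Double)]
  type Transform = Frame => Frame

  // Hypothetical registry supporting 'bind' and 'lookup', plus a compose helper.
  final class FunctionTransformRegistry {
    private var functions = Map.empty[String, Transform]

    def bind(name: String, f: Transform): Unit = functions += (name -> f)

    def lookup(name: String): Option[Transform] = functions.get(name)

    // Chain registered functions, innermost first; None if any name is unbound.
    def compose(names: Seq[String]): Option[Transform] = {
      val found = names.map(lookup)
      val start: Transform = frame => frame
      if (found.exists(_.isEmpty)) None
      else Some(found.flatten.foldLeft(start)(_ andThen _))
    }
  }

  // Toy analytic: per id, first value seen minus last value seen.
  val netChange: Transform = rows =>
    rows.groupBy(_._1).toSeq.map { case (id, rs) => (id, rs.head._2 - rs.last._2) }

  // Toy analytic: rank ids by value, 1 = highest.
  val rank: Transform = rows =>
    rows.sortBy(-_._2).zipWithIndex.map { case ((id, _), i) => (id, (i + 1).toDouble) }

  def demo(): Frame = {
    val ftr = new FunctionTransformRegistry
    ftr.bind("NET_CHANGE", netChange)
    ftr.bind("RANK", rank)
    // A function composed from registered functions can itself be bound.
    ftr.compose(Seq("NET_CHANGE", "RANK")).foreach(f => ftr.bind("RANK_OF_NET_CHANGE", f))
    val prices: Frame = Seq(("IBM", 150.0), ("VOD", 210.0), ("IBM", 140.0), ("VOD", 215.0))
    ftr.lookup("RANK_OF_NET_CHANGE").map(_(prices)).getOrElse(Seq.empty)
  }
}
```

Because every entry has the same shape (frame in, frame out), a composed function is indistinguishable from a primitive one, which is what lets compositions nest arbitrarily.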
![Page 8: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/8.jpg)
Bloomberg Spark Server
[Diagram: Bloomberg Spark Server components, labeled JVM: (Persistent) Spark Context, Request Handler, Function Transform Registry (FTR)]
![Page 9: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/9.jpg)
Request Processor
• Request Processors (RPs) are Spark applications that orchestrate composition of analytics on DataFrames
– RPs comply with a specification that allows them to be hosted by the Bloomberg Spark Server
– Each request (such as: compute the Decile Rank of the RSI) is handled by a Request Processor that looks up functions from the FTR, composes them and applies them to DataFrames
[Diagram: Request Handler, Request Processor, Declarative Query Request Processor, FTR]
![Page 10: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/10.jpg)
Bloomberg Spark Server
[Diagram: Bloomberg Spark Server components, labeled JVM: (Persistent) Spark Context, Request Handler, Request Processors, Declarative Query Request Processor, Function Transform Registry (FTR)]
![Page 11: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/11.jpg)
Managed Data Frames
• Besides locating functions from the FTR, Request Processors have to pass in DataFrames to these functions as inputs
• Rather than instantiate DataFrames, look up DataFrames from a DataFrames Registry
– Such DataFrames are called Managed DataFrames (MDF)
– The Registry that manages these DataFrames is the Managed DataFrame Registry (MDF Registry)
![Page 12: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/12.jpg)
Introducing Managed DataFrames (MDFs)
• A Managed DataFrame (MDF) is a named DataFrame, optionally combined with Execution Metadata
– MDFs can be located by name OR by any Column Name defined in the Schema of the corresponding DF
• Execution Metadata includes:
– Data Distribution metadata captures information about the data depth, histogram information, etc.
– E.g.: A managed DataFrame for pricing of stocks, representing 2 years of historical data and another for representing 30 years of historical data
[Diagram: two MDFs, each wrapping a Price DF<ID, Price>]
MDF Name: ShallowPriceMDF; Execution Metadata: 2 Yr Price History
MDF Name: DeepPriceMDF; Execution Metadata: 30 Yr Price History
![Page 13: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/13.jpg)
Managed DataFrames
– Data Derivation metadata, which are mathematical expressions that define how additional columns can be synthesized from existing columns in the schema
– E.g.: adjPrice is a derived Column, defined in terms of the base Price column
– In essence, an MDF with data derivation metadata has a Schema that is a union of the contained DF and the derived columns
[Diagram: two MDFs, each wrapping a Price DF<ID, Price>]
MDF Name: ShallowPriceDF; Execution Metadata: 2 Yr Price History, adjPrice = Price – 3% of Price
MDF Name: DeepPriceDF; Execution Metadata: 30 Yr Price History, adjPrice = Price – 1.75% of Price
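A Spark-free sketch of an MDF whose schema is the union of the contained columns and the derived columns. The types `ExecutionMetadata` and `ManagedDataFrame`, the field names, and the sample price of 100.0 are assumptions made for illustration; a real MDF would wrap a Spark DataFrame and the ID column is omitted here for brevity.

```scala
object MdfSketch {
  // Toy row: column name -> value. A real MDF would wrap a Spark DataFrame.
  type Row = Map[String, Double]

  // Hypothetical execution metadata: history depth plus derived-column formulas.
  final case class ExecutionMetadata(
      historyYears: Int,
      derivations: Map[String, Row => Double] = Map.empty)

  final case class ManagedDataFrame(
      name: String,
      baseColumns: Set[String],
      rows: Seq[Row],
      metadata: ExecutionMetadata) {

    // Schema = columns of the contained DF plus the derived columns.
    def schema: Set[String] = baseColumns ++ metadata.derivations.keySet

    // Synthesize the derived columns on demand from the existing ones.
    def materialized: Seq[Row] =
      rows.map(r => r ++ metadata.derivations.map { case (col, f) => col -> f(r) })
  }

  // adjPrice = Price - 3% of Price, as on the slide.
  val shallow = ManagedDataFrame(
    "ShallowPriceDF", Set("Price"), Seq(Map("Price" -> 100.0)),
    ExecutionMetadata(2, Map("adjPrice" -> ((r: Row) => r("Price") * 0.97))))

  // adjPrice = Price - 1.75% of Price, as on the slide.
  val deep = ManagedDataFrame(
    "DeepPriceDF", Set("Price"), Seq(Map("Price" -> 100.0)),
    ExecutionMetadata(30, Map("adjPrice" -> ((r: Row) => r("Price") * 0.9825))))
}
```

Keeping the derivation as metadata rather than as a stored column means two MDFs can share the same base schema while defining adjPrice differently.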
![Page 14: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/14.jpg)
The MDF Registry
• The MDF Registry within the Bloomberg Spark Server provides support for:
– Binding MDFs by Name
– Looking up MDFs by Name
– Looking up MDF by a Column Name (an element of the MDF Schema), etc.
• The MDF Registry maintains a 'table' that associates the Name of the MDF with the DF reference and Columns in the DF
MDF Registry
Name | Columns | DF Ref. | MetaData
ShallowPriceDF | Price, adjPrice | … | …
DeepPriceDF | Price, adjPrice | … | …
… | … | … | …
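The registry 'table' and its three operations can be sketched as follows. `Entry`, `MdfRegistry`, and the method names are hypothetical, and the DF reference is an opaque placeholder rather than a real DataFrame handle.

```scala
object MdfRegistrySketch {
  // Hypothetical registry entry mirroring the 'table' columns:
  // Name, Columns, DF Ref., MetaData.
  final case class Entry(
      name: String,
      columns: Set[String],
      dfRef: Any,
      metadata: Map[String, String])

  final class MdfRegistry {
    // Insertion-ordered so lookups return MDFs in the order they were bound.
    private val byName = scala.collection.mutable.LinkedHashMap.empty[String, Entry]

    def bind(entry: Entry): Unit = byName.update(entry.name, entry)

    def lookupByName(name: String): Option[Entry] = byName.get(name)

    // Several MDFs may expose the same column, so all matches are returned.
    def lookupByColumn(column: String): Seq[Entry] =
      byName.values.filter(_.columns.contains(column)).toSeq
  }
}
```

A column lookup can return more than one MDF (both price MDFs expose adjPrice), so a caller would still need a rule, for example based on the execution metadata, to choose among them.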
![Page 15: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/15.jpg)
Bloomberg Spark Server
[Diagram: Bloomberg Spark Server components, labeled JVM: (Persistent) Spark Context, Request Handler, Request Processors, Declarative Query Request Processor, Function Transform Registry (FTR), Managed DataFrame Registry]
![Page 16: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/16.jpg)
Flow of Requests
• Request Processors within the Spark Server orchestrate analytics
– These RPs have access to the Registry and FTRs
– Are responsible for composing transforms and actions on one or more MDFs
– May dynamically bind additional MDFs (materialized or otherwise) for use by other Apps
[Diagram: Request Handler, Request Processor, MDF Registry, FTRs, MDFs; flow labels: lookup MDFs, applyFunction, decorate with Transforms, collect]
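The flow on this slide (look up MDFs, look up functions, apply them, collect) might be orchestrated roughly like this. It is a sketch under stated assumptions: `Request`, `RequestProcessor`, and `handle` are hypothetical, both registries are reduced to plain maps, and `Frame` is a toy stand-in for a DataFrame, so the final "collect" step is simply returning the in-memory result.

```scala
object RequestFlowSketch {
  // Toy stand-in for a DataFrame: (id, value) rows.
  type Frame = Seq[(String, Double)]
  type Transform = Frame => Frame

  // Hypothetical request: the MDF to read and the functions to apply, innermost first.
  final case class Request(mdfName: String, functions: Seq[String])

  final class RequestProcessor(
      mdfRegistry: Map[String, Frame],
      ftr: Map[String, Transform]) {

    // Look up the MDF, look up every function, then apply them in order.
    def handle(req: Request): Either[String, Frame] =
      for {
        mdf <- mdfRegistry.get(req.mdfName).toRight(s"unknown MDF: ${req.mdfName}")
        fns <- lookupAll(req.functions)
      } yield fns.foldLeft(mdf)((frame, f) => f(frame))

    // Fails on the first unbound function name.
    private def lookupAll(names: Seq[String]): Either[String, Seq[Transform]] =
      names.foldLeft[Either[String, Seq[Transform]]](Right(Vector.empty)) { (acc, n) =>
        for {
          fs <- acc
          f <- ftr.get(n).toRight(s"unknown function: $n")
        } yield fs :+ f
      }
  }
}
```

Returning an `Either` keeps lookup failures (an unknown MDF or function name) separate from the analytic result, so a Request Handler could report them without running anything.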
![Page 17: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/17.jpg)
Bloomberg Spark Server
[Diagram: Spark Context, Request Handler, Request Processors, Declarative Query Request Processor, MDF Registry holding MDFs, Function Transform Registry (FTR) containing RSI …; label: use MDF]
![Page 18: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/18.jpg)
Bloomberg Spark Server
[Diagram: Spark Context, Request Handler, Request Processors, Declarative Query Request Processor, MDF Registry, Function Transform Registry (FTR) containing RSI …, Ingestion Manager, MDF1, MDF2; labels: use, 1, 2]
![Page 19: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/19.jpg)
Schema Repository
• Enterprise-wide data pipeline
• External (to Spark) schema repository and service
• Enables MDF lookup by a dataset schema element
• Analytic expressions can now be composed over data elements
![Page 20: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/20.jpg)
Execution Metadata
• Dataset Source Connection Identifiers
• Backing Stores
• Real-time Topics
• Storage Level & Refresh Rate
• Subset Predicate, etc.
![Page 21: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/21.jpg)
Ad-hoc Cross-Domain Analytics
• Registration of pre-materialized DataFrames
• Collaborative analytics between application workflows
• Dynamic creation of Managed DataFrames
• Spark Servers have data pertaining to a single domain materialized
• Ad-hoc cross-domain analytics requires capability to synthesize MDFs on demand
![Page 22: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/22.jpg)
Content Subsetting
• High value data sub-setted within Spark
• Reduce cost of querying external datastore
• Specified as a filter predicate at time of registration
• E.g. Member companies of popular indices [Dow 30, S&P 500, …] have records placed within Spark
![Page 23: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/23.jpg)
Content Subsetting
• Seamless unification of data in Spark (DFsubset) and backing store (DFsubset’)
(DFsubset ∪ DFsubset’).filter(query) = DFsubset.filter(query) ∪ DFsubset’.filter(query)
• Dataset owners provided knobs for cost vs performance.
• LRU cache like mechanism planned in the future
• Makes sense as a capability native to Spark dataframes
Ingestion: Periodic Refresh
• Periodic data pull into Spark from the backing store
• Subset criteria applied during data retrieval
• Used when a dataset has a backing store, but no real-time update stream that we can tap into
• Dataset owners have control over storage level of the dataframes created within a given MDF
![Page 25: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/25.jpg)
Ingestion: Stream Reconciliation
• Analytics needs to be low-latency with respect to queries, but also data freshness
• Since data is being sub-setted within Spark, need to keep the subset up to date
• Datasets published to different Kafka topics.
• 1:1 mapping between datasets, topics and DStreams.
![Page 26: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/26.jpg)
Ingestion: Stream Reconciliation
[Diagram: MDF - PriceHistory. Elements: Backing Store, DFsubset, Real-Time Stream, updates U1, U2, U3, … UN, states S1, S2, S3, … SN, DFN; step labels: (Avro Deserialize, Subset Predicate), (convert to DF-seq), (update state)]
Similar intent as Structured Streaming, to be introduced in Spark 2.0
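One way to read the diagram: each batch of updates from the real-time stream is filtered by the subset predicate and folded into the current state, producing a sequence of states. The sketch below models that with plain collections. `Update`, `State`, `applyBatch`, and `reconcile` are hypothetical names, and last-write-wins per id is an assumption, since the slides do not say how conflicting updates are resolved.

```scala
object StreamReconciliationSketch {
  final case class Update(id: String, price: Double)

  // State of the in-Spark subset: latest price per id.
  type State = Map[String, Double]

  // Apply one batch: drop updates outside the subset, last write wins per id.
  def applyBatch(state: State, batch: Seq[Update], inSubset: Update => Boolean): State =
    batch.filter(inSubset).foldLeft(state)((s, u) => s.updated(u.id, u.price))

  // Fold batches U1..UN over the initial subset, yielding states S1..SN.
  def reconcile(
      initial: State,
      batches: Seq[Seq[Update]],
      inSubset: Update => Boolean): Seq[State] =
    batches.scanLeft(initial)((s, b) => applyBatch(s, b, inSubset)).tail
}
```

Each intermediate state is kept rather than overwritten, which matches the idea of an MDF holding several generations at once.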
![Page 27: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/27.jpg)
Ingestion: Data Transformation
• Data in backing stores may need representation transforms before being used in queries
• Data in multiple tables denormalized into a single DF within Spark
• Or, quickly see effect of different storage representations on performance, without changing the representation in the backing store
• Implemented via user transforms associated with a given MDF
![Page 28: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/28.jpg)
Spark Server: Memory Management
• An MDF contains multiple generations of DFs, being generated and destroyed
• Multiple generations operated upon by RPs at a given point in time
• Reference counting to keep track of what DFs are being used and by whom
• Long-running queries aborted for forced reclamation
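Reference counting over DataFrame generations could look like the following. `GenerationTracker` and its methods are hypothetical, generations are identified by a plain counter, and the forced abort of long-running queries is not modeled.

```scala
object GenerationRefCountSketch {
  // Hypothetical tracker for the DataFrame generations inside one MDF.
  final class GenerationTracker {
    // generation id -> number of queries currently using it; generation 0 exists at start.
    private val refCounts = scala.collection.mutable.Map(0 -> 0)
    private var current = 0

    // A refresh publishes a new generation; older ones linger while in use.
    def publish(): Int = synchronized {
      current += 1
      refCounts(current) = 0
      current
    }

    // A Request Processor pins the current generation for the duration of a query.
    def acquire(): Int = synchronized {
      refCounts(current) += 1
      current
    }

    def release(gen: Int): Unit = synchronized {
      refCounts(gen) -= 1
    }

    // Generations that are neither current nor referenced can be freed.
    def reclaimable(): Set[Int] = synchronized {
      refCounts.collect { case (g, n) if g != current && n == 0 => g }.toSet
    }
  }
}
```

A query that never releases its generation would block reclamation indefinitely, which is presumably why the slide pairs reference counting with aborting long-running queries.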
![Page 29: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/29.jpg)
Query Consistency
• Multiple queries need to operate on same snapshot of data
• How to achieve, if data constantly changing underneath?
• Each DF within MDF associated with time epoch
• Registry lookup with a reference time
• Time-align sub-setted dataframes with data in backing store
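A lookup with a reference time can be sketched as picking the latest generation whose epoch is not after that time. `Generation`, `EpochedMdf`, and millisecond epochs are assumptions made for illustration.

```scala
object EpochLookupSketch {
  // Hypothetical: each DataFrame generation is tagged with the epoch
  // at which it became visible.
  final case class Generation[A](epochMillis: Long, data: A)

  final class EpochedMdf[A](generations: Seq[Generation[A]]) {
    private val sorted = generations.sortBy(_.epochMillis)

    // Latest generation whose epoch is not after the reference time.
    def lookup(referenceTime: Long): Option[Generation[A]] =
      sorted.takeWhile(_.epochMillis <= referenceTime).lastOption
  }
}
```

Queries that carry the same reference time resolve to the same generation, even if newer generations are published while they run.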
![Page 30: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/30.jpg)
Spark for Online Analytics
– High Availability of Spark Driver
• High bootstrap cost of reconstructing cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query inconsistency
– High Availability of RDD Partitions
• With subset or universe cached, lost RDD partitions kill query performance
– Performance Consistency
• Performance gated by slowest executor
• High Availability and Low Tail Latency closely related
– Interaction effects between low-latency queries and low-latency updates
• No to minimal sandboxing between jobs sharing executor JVMs
First Bloomberg contribution: SPARK-15352
![Page 31: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/31.jpg)
Spark Server Acknowledgements
Andrew Foster Joe Davey Shubham Chopra
Nimbus Goehausen Tracy Liang
![Page 32: Spark at Bloomberg: Dynamically Composable Analytics](https://reader031.vdocuments.site/reader031/viewer/2022021502/58f9a934760da3da068b6c31/html5/thumbnails/32.jpg)
THANK [email protected]@bloomberg.net