IS DATA PREPARATION THE NEXT
BIG DATA DISRUPTION?
The 22nd International Conference on Distributed Multimedia Systems, DMS 2016
Grand Hotel Salerno, Salerno, Italy, November 25-26, 2016
Agenda
• SCENARIO
• BIG DATA IN THE DATA-DRIVEN ENTERPRISE
• WHAT DATA PREPARATION SHOULD COVER
• CREATING READY DATA USING FRACTALS
• CASE STUDY
Source: Forrester 2016
1. DOES THE BUSINESS ANALYST UNDERSTAND THE DATA SCIENTIST?
2. WHY ARE DATA-DRIVEN COMPANIES HIRING DATA JOURNALISTS?
3. WHY DOES DARK DATA EXTERNAL TO DATA LAKES CONTINUE TO GROW?
4. WHY DOES MAKING DATA TAKE SO LONG?
5. DATA PLAY AND NARRATIVES?
HOW MUCH TIME IS AVAILABLE TO EXPLOIT DATA PROCESSING OUTPUT?
77% Data Processing
23% Data Analysis
Source: Bloor 2016
90% IS DARK
12% AVAILABLE FOR BUSINESS INSIGHTS
88% IS JUST STORED
80% RECORDINGS, PDFs AND TEXTS
Source: IDC 2016
+4300% ANNUAL DATA GENERATION
Data preparation is an iterative process for exploring and transforming raw data into forms suitable for data science, data discovery, and analytics. Self-service data preparation (SSDP) tools are user-oriented tools that enable data preparation capabilities such as data cataloging and inventorying, data discovery, data exploration, data transformation, data structuring, surfacing of sensitive attributes, and anomaly detection. These tools are aimed at reducing the time and complexity of preparing data and at improving analyst productivity.
[Figure: data preparation pipeline. Stages: Preprocess, Prepare, Discover, Exploit. Data states: Raw, Technically correct, Formatted, Ready Data, Patterns. Annotations: Multimedia domain, Missing Multimedia.]
Depending on how you count them, there are anywhere from 20 to 50 providers of self-service data preparation tools. However, they're not all equal, and users should carefully examine the offering to make sure they're getting what they expect. Many BI and advanced analytics vendors (Tableau, Qlik, SAS, etc.) have jumped onto SSDP, even if their capabilities aren't separate from their core offerings and show limitations in terms of performance, neutrality, and custom processing. The key reason why self-service data prep will survive as its own category is the growing realization that data preparation needs to be kept separate from analysis and discovery. The volumes and the number of data sources will not be decreasing, and neither will the number of BI tools. To that end, it's likely that self-service data prep will remain a product category unto itself for the foreseeable future.
Source: Bloor 2016
BUT, WE'RE AFRAID, TO CREATE THEM LORDS ARE TAKING LONGER THAN 7 DAYS
AND, UNFORTUNATELY, WORSE… IT SEEMS THAT HUMANS DON'T HAVE ACCESS TO THOSE WORLDS
Bottom line:
Is data preparation the bridge between the planets of data and the user?
Big Data is not just technology; responsibility should be allocated on the basis of the following critical factors:
1. Will raw data be transferred to the preparation unit (push), or
2. does the preparation unit have to read data from the data lake (pull)?
3. Has the data lake been designed to stage or to store raw data?
4. What about the variability of the context and of the data?
[Figure: responsibility matrix. Axes: data communication mode (PUSH, PULL) and data lake purpose (STORE, STAGE). Cells assign responsibility to IT or to the END USER, with low-variability and high-variability cases.]
Bottom line: Usage of data should be faster and cost less, with minimum data movement requirements.
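The push and pull modes listed above can be contrasted in a few lines. This is only an illustrative sketch: the `PreparationUnit` class, its `receive` and `pull` methods, the `tidy` step, and the sample records are all hypothetical names invented here, not part of any tool mentioned in the talk.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


class PreparationUnit:
    """Hypothetical preparation unit that accepts data in either mode."""

    def __init__(self, prepare: Callable[[Record], Record]) -> None:
        self.prepare = prepare
        self.ready: List[Record] = []

    def receive(self, record: Record) -> None:
        """Push mode: the producer sends every raw record to the unit."""
        self.ready.append(self.prepare(record))

    def pull(self, lake: Iterable[Record], wanted: Callable[[Record], bool]) -> int:
        """Pull mode: the unit reads the lake and takes only what it needs.

        Returns the number of records moved out of the lake.
        """
        moved = 0
        for record in lake:
            if wanted(record):
                self.ready.append(self.prepare(record))
                moved += 1
        return moved


def tidy(record: Record) -> Record:
    # Example preparation step (an assumption): trim whitespace from every value.
    return {key: value.strip() for key, value in record.items()}


# Illustrative stand-in for a data lake holding raw records.
lake = [{"city": " Miami "}, {"city": "NYC"}, {"city": " Rome"}]

pushed = PreparationUnit(tidy)
for raw in lake:  # push: every record is moved to the unit
    pushed.receive(raw)

pulled = PreparationUnit(tidy)
moved = pulled.pull(lake, wanted=lambda r: r["city"].strip() != "NYC")
print(len(pushed.ready), moved)
```

In this toy setup the pull mode moves fewer records because the selection happens at the unit's request, which is one way to read the "minimum data movement" requirement.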
• materialize reality and language in a consistent database
• couple language and reality using keyback features
• bind external algorithms using Open (Standard?) User Exits
• foster holistic views of data through Grid Data Unification
rowId  Nname  Ncity
1      1      1
2      2      2
3      3      3
4      2      2
Key   Value  NValue
Name  Aldo   1
Name  Sara   2
Name  Anna   3
City  Miami  1
…     …      …
DateBirth  UDateB   Age
11/1/90    1/11/90  26
12/2/89    2/12/89  26
1.1.68     1/1/68   48
31-1-61    1/31/61  56
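The DateBirth to UDateB mapping in the table above is consistent with reading each raw value as day, month, year (with "/", "." or "-" as separator) and writing it back as month/day/year. A minimal sketch under that assumption follows; the function name `unify_birth_date` and the default century of 1900 are hypothetical choices made here. The Age column is not reproduced because the slide does not state the reference date used to compute it.

```python
import re
from datetime import date


def unify_birth_date(raw: str, century: int = 1900) -> str:
    """Normalize a day-first date with '/', '.' or '-' separators to M/D/YY."""
    parts = re.split(r"[./-]", raw.strip())
    if len(parts) != 3:
        raise ValueError(f"unrecognized date: {raw!r}")
    day, month, year = (int(p) for p in parts)
    # Building a date object validates the day/month combination.
    parsed = date(century + year, month, day)
    return f"{parsed.month}/{parsed.day}/{parsed.year % 100:02d}"


raw_dates = ["11/1/90", "12/2/89", "1.1.68", "31-1-61"]
unified = [unify_birth_date(d) for d in raw_dates]
print(unified)  # ['1/11/90', '2/12/89', '1/1/68', '1/31/61']
```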
Ncity  city   state
1      Miami  Fl
2      NYC    NY
3      Rome   Italy
[Diagram labels: Map, Dictionary, Luggage, hierarchy, Data complex, Storage group]
name  city   DateBirth
Aldo  Miami  11/1/90
Sara  NYC    12/2/89
Anna  Rome   1.1.68
Sara  NYC    31-1-61
Data source
Fractal conversion
Transform DateBirth
Add Geo classification
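The Key/Value/NValue and rowId/Nname/Ncity tables above look like a plain dictionary encoding of the data source, and the Ncity/city/state table like a lookup joined on the city code. The sketch below reproduces those tables with ordinary dictionary encoding; it is not the ADC algorithm itself, and the function name `dictionary_encode` and the in-memory layout are assumptions made for illustration.

```python
# Data source rows, copied from the slide.
rows = [
    {"name": "Aldo", "city": "Miami", "DateBirth": "11/1/90"},
    {"name": "Sara", "city": "NYC", "DateBirth": "12/2/89"},
    {"name": "Anna", "city": "Rome", "DateBirth": "1.1.68"},
    {"name": "Sara", "city": "NYC", "DateBirth": "31-1-61"},
]


def dictionary_encode(rows, columns):
    """Replace each value by a per-column integer code, assigned in order of first appearance."""
    dictionary = {col: {} for col in columns}
    encoded = []
    for row_id, row in enumerate(rows, start=1):
        codes = {"rowId": row_id}
        for col in columns:
            col_codes = dictionary[col]
            # setdefault stores the next code only the first time a value is seen.
            codes["N" + col] = col_codes.setdefault(row[col], len(col_codes) + 1)
        encoded.append(codes)
    return dictionary, encoded


dictionary, encoded = dictionary_encode(rows, ["name", "city"])

# Geo classification lookup keyed by city code, copied from the slide.
geo = {1: ("Miami", "Fl"), 2: ("NYC", "NY"), 3: ("Rome", "Italy")}
states = [geo[r["Ncity"]][1] for r in encoded]

print(dictionary)
print(encoded)
print(states)
```

Repeated values such as "Sara" and "NYC" are stored once in the dictionary, and rows 2 and 4 share the same codes, which matches the encoded table on the slide.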
ADC is a fractal-like algorithm that converts input raw data and the related data processing into a set of chained binary blocks, formulas, and long pointers.
We show that ADC represents an important set of computations… The advantages of ADC are that:
• it is described by a small number of parameters and has a priori known sizes of the views,
• the views can be generated independently,
• the overhead of combining the generated views is predictable,
• the data set can be partitioned into a number of independently generated subsets,
• the elements of the data set are pseudorandom.
These properties make ADC a strong candidate for a data-intensive grid benchmark <M. Frumkin, NASA NAS Division>
MATERIAL TESTING
• Complex JSON, Oracle, CSV, and WMV data
• Manual data processing executed using MATLAB
• Hours of scientist work to detect outliers
• Impossibility of replicating tests with the same results
• Scarce know-how capitalization
• Blending of data happens at narrative writing time
Bottom line: Every day we hear from entrepreneurs doing their best to turn their big ideas into a consistent and successful online business. Here IT is the enabler but, unfortunately, sometimes the T part has a negative influence on the development of the core idea.
The ideal toolkit is made for those who wish to exploit the I part of IT, so that entrepreneurs with great ideas can craft their business themselves. And they should!