IRMAC April 2015 - DMBOK2 DWBI New Content



DMBOK2 DWBI New Content
Martin Sykora
IRMAC Toronto, April 2015


Agenda
Introduction
In the beginning
Reports
Data Warehouse: 2 schools
BI Tools v1
DMBOK v1
DMBOK v2 New Content
Data Vault
In a Minute
Drama: Exa Forklift
Yellow Elephants
Visualizing Content
Data Sciences
Virtualization
DMBOK v2 Conceptual Architecture


Introduction
Martin Sykora
25 years in data management
Oracle, SAP, BusinessObjects
Currently Director, Analytics at NexJ Systems
DMBOK2 DWBI and Big Data Sciences
Queen's Masters of Analytics 2016


Hello everyone! Thanks for making time to attend today's briefing. I am Martin Sykora, and I have been in the data management profession for 25 years now, with career highlights working for Oracle, SAP and BusinessObjects. I am now at an exciting local Toronto-based company called NexJ Systems, focused on customer data management.

Last year I volunteered to write the DWBI chapter for DMBOK, thinking it might take a few weeks; in fact I spent many months on it. The research was extensive, including 40 or so books and hundreds of white papers, presentations and webinars. When working on the data sciences portion, especially artificial intelligence, I found there are not too many publications in the data sciences area; what exists is very technical, math-centric and full of statistical theory, so I wound up reading many masters and doctoral papers. I did a presentation last year at BOCX on data sciences, the interest level continued to grow, and it finally compounded with my application to Queen's. I am proud to be starting the Masters of Analytics this June.

They say a good story makes a good presentation, and this is my story as well as the DMBOK chapters'.

In the beginning
Google started ten years later
Apple: first PC with GUI + mouse
Internet without www

Return to an analog world, roughly when my career began. SQL and Unix were just getting started. Google? It was born ten years later (I started ten years before Google). It is pictured here as a Bulletin Board Service (www.masswerk.at/googleBBS/), and it does actually work! If you need Google on a VT terminal, you can have it; seriously, if it could have existed, this is likely what it would have looked like. NEXT: Apple was selling us the Mac, the first PC with a GUI and mouse and a generous 1 MB of RAM. But at 2,600 USD it was expensive; restated in today's dollars, 5,300 USD, you could buy several unlocked iPhones! NEXT: The internet without the world wide web: no content, no browsers, no searching. You connected to the university via modem and read news groups (akin to blogs). It was very much an analog world!

Reports
Simple process, complex report development
Sources read, integrated, translated and aggregated
Aggregate results stored in a table
Table contents were read, sub-totalled and sent to the report file or printer

How did we build reports? This was one of my assignments as a co-op student. The process was simple, the development was difficult: many 3GL tools like PL/1, COBOL, Fortran and JCL, and extremely time-consuming manual efforts. The logic was bound to each program and copied for each report, and the table content was aligned to satisfy individual report needs, not overall user needs. NEXT: Is there a better way? Can we share resources? Minimize storage? Consolidate logic?
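For contrast with the shared-resource approach that follows, here is a toy re-creation of that two-step aggregate-then-report pattern in modern terms; this is a pandas sketch, and the file and column names are invented for illustration.

```python
# Toy re-creation of the old two-step report pattern, with invented names:
# step 1 writes the aggregate table, step 2 reads it back and sub-totals it.
import pandas as pd

# Step 1: sources read, integrated and aggregated into a stored table.
orders = pd.read_csv("orders.csv")            # hypothetical operational extract
agg = (orders.groupby(["region", "product"], as_index=False)
             .agg(total_sales=("amount", "sum")))
agg.to_csv("report_table.csv", index=False)   # the stored aggregate table

# Step 2: table contents read, sub-totalled, and written to the report file.
report = pd.read_csv("report_table.csv")
subtotals = report.groupby("region", as_index=False)["total_sales"].sum()
with open("report.txt", "w") as f:
    f.write(report.to_string(index=False))
    f.write("\n\nRegion sub-totals\n")
    f.write(subtotals.to_string(index=False))
```

The pain point in the 3GL era was that this logic was re-implemented, by hand, inside every report program.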

Data Warehouse: 2 schools
Inmon
Corporate Information Factory
Normalized tables
Enterprise Data Model

Kimball
Marts that satisfy a business process
Dimensional Data Model
Conformed Dimensions & Facts

Yes we can! There were two schools of thought at the time. Inmon: the Corporate Information Factory. Quite remarkably, that architecture still resonates today. The focus is on table normalization for cleansing and integration, and it relies heavily on an enterprise data model. Dr. Kimball: likely best remembered for dimensional modelling, and for requirements alignment and progress (the bus). Two layers, one to stage, the second for presentation; the warehouse is formed by virtue of conformed dimensions and facts. Both schools addressed the design and population of a data warehouse, but not its consumption or usage.


BI Tools v1
Focus on tool alignment and usage complexity

My favorite slide: an excellent depiction of the tool struggles of the 1990s. A plethora of tools satisfying very specific niches, all of them leveraging the data warehouse to increase worker performance, reduce costs, and recover the investment. Two-thirds of the work and cost was spent on data construction, the remainder on BI tools. Note that statistics, now part of data sciences, sits in the IT domain.

DMBOK v1

Data Warehouse loading process (Kimball or Inmon)
Operational Reporting & Analysis (OLAP)
Performance Management (Dashboarding & Scorecarding)

This is roughly where DMBOK v1 left off and v2 picked up from. Excellent coverage of the data warehousing process: load data from operational systems, then cleanse, standardize and integrate it. Persist those elements necessary for history, create surrogate keys for dimensional data, and retain atomic data, not aggregates. Integrate with Reference & Master Data, conform dimensions. Publish content out to data stores, data marts and/or OLAP cubes. The BI tools were largely rear-view-mirror centric: operational reporting, ad hoc query or self-service analysis, and dashboard instrumentation for performance management.
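As one small illustration of the surrogate-key step mentioned above (this is not DMBOK's prescribed method; the table, file and column names are invented), a minimal sketch of adding new customers to a dimension:

```python
# Minimal sketch of loading a customer dimension with surrogate keys,
# retaining the natural (operational) key alongside. All names are invented.
import pandas as pd

source = pd.read_csv("crm_customers.csv")       # operational extract
existing_dim = pd.read_csv("dim_customer.csv")  # current dimension with customer_sk

# Standardize, then keep only customers not already in the dimension.
source["name"] = source["name"].str.strip().str.title()
new_rows = source[~source["customer_id"].isin(existing_dim["customer_id"])].copy()

# Assign new surrogate keys continuing from the current maximum.
next_sk = existing_dim["customer_sk"].max() + 1
new_rows["customer_sk"] = range(next_sk, next_sk + len(new_rows))

dim_customer = pd.concat([existing_dim, new_rows], ignore_index=True)
dim_customer.to_csv("dim_customer.csv", index=False)
```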

Internet Traffic
What Happens in an Internet Minute?

5 Exabytes of data transferred monthly
An Exabyte is a unit of information equal to one quintillion (10^18) bytes
5,000,000,000,000,000,000 bytes, or 5 x 10^18
In non-math speak: a dime is 1.22 mm thick
5 Exa of dimes stacked would reach from the Earth to the Sun 40,775 times

The single largest change initiator since v1: the internet. The internet brought us great leaps in communication and connectivity; it truly shrank the world. And yes, we remember it for email, web content, e-commerce and business-to-business. But the truly profound effect in progress is the combination of mobile devices, web connectivity and social media. Customers today interact, and expect to interact, with organizations over the internet, they share those interactions publicly on social sites, and that can influence behaviours both positively and negatively.

What does this mean for an internet minute? 204 million emails, 61,000 hours of music, 100,000 tweets, 6 million Facebook views, 1.3 million YouTube views. Question: how big a wake would this traffic generate? Show of hands please: 100 GB per month or more? 1,000? 100,000 GB per month globally? A million? 100 million? NEXT: How about 5 exabytes, or 5 billion GB, per month globally. How big is an exabyte? Well, if you could stack dimes... show of hands: to the moon? Venus? The sun? NEXT: Try 40,000 times!!

And that's just one month! Now that is truly BIG.
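The dime analogy does work out; here is a minimal arithmetic check in Python, using the dime thickness and Earth-Sun distance quoted on the reference slide at the end of the deck.

```python
# Sanity check of the dime-stack analogy using the reference-slide figures:
# a Canadian dime is 1.22 mm thick, Earth-Sun distance is 149,600,000 km.
dimes = 5 * 10**18                     # one dime per byte of 5 exabytes
stack_km = dimes * 1.22 / 1_000_000    # 1.22 mm per dime, converted to km
earth_sun_km = 149_600_000
print(f"{stack_km / earth_sun_km:,.0f} times the Earth-Sun distance")  # ~40,775
```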


Drama: Forklift Load

This slide is only for dramatic effect: no data was harmed during the assembly or provisioning process.

The drama slide. Every good story must have some drama, and Hollywood says a good chase, and a crash and explosion. I felt compelled to work that into the deck; budget constraints meant no cars, just forklifts. Let's say we are interested in analyzing some of this content. OK, we have a tried and true design pattern: load the data and then we can analyze it. NEXT: That's not going to work!

No data harmed - that's good. We need an alternative. Analyzing in place sounds great, but how do we do that?


Hadoop - The Yellow Elephant

Example: branch customer churn e-mail query
Query in place
Hadoop can read many different file types without transformation
No need to transport the files to a processing or database server
Hadoop requests processed with MapReduce jobs

Apache Hive QL interface
Invoke a SQL-like script
Creates MapReduce jobs
Compiled results returned

MapReduce
Sends the algorithm to the data
Applied on the best available node-file pair
Results then compiled

Scalable, Durable, Commodity Hardware

Thankfully, Google and Yahoo had the same challenges, which spawned what is now open source software called Hadoop. Hadoop is a scalable architecture built on commodity hardware (JBOD - Just About Any Disk). AND the key differentiator: it can query the data in place - no need to copy, load, transform, etc. Just apply a read schema to whatever contents are loaded. NEXT: The customer churn example. Branch emails are collected and stored on Hadoop. Write a SQL-like query in Hive to fetch those emails, apply filters and perhaps some column maths too. NEXT: The algorithm is sent to the data. This is CRITICALLY DIFFERENT - the data is no longer transported or loaded; the ALGORITHM is sent to the DATA. Each MapReduce job applies that query directly to the files stored on that node. NEXT: The results are combined and returned for analysis. And that's how we query content in place.
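As a rough sketch of what that Hive step could look like (this is not the exact query from the talk; the table, columns and host name are invented), a Python script submitting HiveQL through the PyHive client, which Hive then compiles into MapReduce jobs running where the files live:

```python
# Minimal "query in place" sketch: a hypothetical Hive table `branch_emails`
# (columns branch_id, customer_id, body) defined over raw e-mail files in HDFS.
from pyhive import hive  # thin Python client for HiveServer2

CHURN_EMAIL_QUERY = """
    SELECT branch_id,
           COUNT(*) AS churn_mentions
    FROM   branch_emails
    WHERE  lower(body) LIKE '%close my account%'
    GROUP  BY branch_id
    ORDER  BY churn_mentions DESC
"""

# The query runs on the cluster; only the small aggregate result comes back.
conn = hive.Connection(host="hadoop-edge-node", port=10000)  # invented host
cursor = conn.cursor()
cursor.execute(CHURN_EMAIL_QUERY)
for branch_id, churn_mentions in cursor.fetchall():
    print(branch_id, churn_mentions)
```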


Viewing Hadoop Content
Naturally we want to view the data
Very difficult to infer any relationships
Graphing requires understanding of the data elements
But we don't really know what's there
How can we pick a chart without comprehending the data elements?

What are we going to do with that content? How can we analyze it? Let's try our favorite desktop spreadsheet tool. But there are so many columns and rows that it's very hard to infer any relationships. And the graphing tool forces you to understand the data first: pick your chart, then provide the data for that chart. But I DON'T KNOW WHAT THE DATA IS! I don't understand it!! Let's try a few sample charts... well, that's not really helping. Should I keep trying chart and data combinations until a pattern jumps out? NEXT: What if we visualize it first?

Tree Map - Churn: Age by Wealth
Immediately we can see that customers 40 or below are leaving
But why is this happening? Is this a regional issue?

Visualization tools provide a guided interface to discovering relationships in dark data - data you don't have a deep and thorough understanding of. Here I used a visualization tool to create this tree map based on SQL sources; it took less than an hour to prepare the data set and minutes to create the actual visualization. The tree map is scalable: the number of rows does not matter - hundreds, thousands, millions! It uses boxes to express size or magnitude; here the box size is customer wealth, or total savings, investments and loans. The box grouping is based on age bands, and these will not change too much either. Now I can immediately see a pattern: customers that are staying are marked with No on the left, and churning (the boxes with Yes) is occurring for customers less than 44 years of age (the youngsters). In minutes I have an insight which raises a question: is this churn a regional issue? Is there some localization issue driving this age bracket to leave?
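The talk does not name the visualization tool; as a stand-in, a tree map like this could be sketched with pandas and plotly, assuming a churn extract with these invented columns and values.

```python
# Not the tool used in the talk - a hypothetical equivalent of the tree map.
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "churned":  ["No", "No", "Yes", "Yes", "No", "Yes"],
    "age_band": ["45-54", "55-64", "25-34", "35-44", "65+", "18-24"],
    "wealth":   [250_000, 410_000, 60_000, 120_000, 530_000, 15_000],
})

# Box size = customer wealth, grouped first by churn flag and then by age band,
# mirroring the "Churn: Age by Wealth" tree map on the slide.
fig = px.treemap(df, path=["churned", "age_band"], values="wealth")
fig.show()
```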

Geographic Churn by Wealth
Churn customers scattered across multiple regions
Appears age related; could it be the product/service offering?

Let's try a geographic or map visualization. On the map interface, churn No customers are those remaining, and churn Yes below that are those who have left. The size of the circle is based on wealth. It appears to be across the board, not tied to a specific city or region. Another insight in minutes, raising another question: could this be product or service related?

Box Plot - Churn: Product by Age
Most impact from churn is to investments
Actionable insights from data visualizations

Let's try a box plot, also called a candle-stick or high-low chart. It is filtered for churn customers only (the ones that are leaving) and focused on wealth, now by product. It appears that the younger age bracket is churning on investments: they are leaving with material investment dollars, and it is not tied to a regional issue. Now we have an actionable insight, and it took minutes or hours, not weeks or months.
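Again as a stand-in for the unnamed tool, a minimal pandas/matplotlib sketch of the same kind of box plot, with invented column names and data:

```python
# Not the talk's tool - a hypothetical "Churn: Product by Age" style box plot
# over customers who have already left.
import pandas as pd
import matplotlib.pyplot as plt

churned = pd.DataFrame({
    "product": ["Investments", "Savings", "Loans", "Investments", "Savings", "Investments"],
    "age":     [28, 31, 42, 35, 26, 39],
    "balance": [180_000, 20_000, 45_000, 220_000, 12_000, 95_000],
})

# One box per product, showing the distribution of balances walking out the door.
churned.boxplot(column="balance", by="product")
plt.suptitle("Churned customers: balance by product")
plt.show()
```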

But this is still a manual process.

Data Sciences
Visualization is a manual process; data sciences applies mathematical methods to analyze, process and manage big data results
Decomposed into the following sections:
Data Mining - unsupervised learning: Profiling, Data Reduction, Association, Clustering
Predictive Analytics - supervised learning: Classification, Decision Trees, CHAID, Regression
Advanced Supervised Learning: Ensemble, Neural Networks, Support Vector Machines
Data Scientists typically develop, train and manage many algorithmic programs referred to as models
The inputs, or variables/parameters, vary from model to model, and the computed outcomes have many business consumers
How do we bring these pieces together?

Visualizations are great for data insight, but they are still a manual process and still reactive in nature. How can we be better prepared? What events are driving the churn? Can we predict it before it actually happens? How can we measure that? This is the field of data sciences: statistical methods, predictive analytics and learning algorithms provide scalable tools to process large data sets. In DMBOK we have split this into three areas of interest: 1. Data Mining, aka unsupervised learning - ideal for discovery and exploration. 2. Predictive Analytics, aka supervised learning - event likelihood, scoring models. 3. Advanced learning - the MIT handwriting digitizer, the Facebook image recognition API, Google's computer-driven cars. Building a probability model needs construction, testing or training, and monitoring in production. Different phases need access to different data, and many elements may not survive initial exploration. The process aligns well with most SDLCs. How do we govern without losing control of data or losing delivery agility?
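To make the supervised-learning idea concrete, here is a minimal, hypothetical churn-scoring sketch with scikit-learn; the file, columns and features are invented, and this is not the model from the talk.

```python
# Train a decision-tree churn classifier, then score a hold-out set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

customers = pd.read_csv("customers.csv")      # hypothetical training extract
X = customers[["age", "wealth", "products_held", "branch_visits"]]
y = customers["churned"]                      # 1 = left, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=4)   # shallow tree keeps the rules explainable
model.fit(X_train, y_train)

# The business consumes these probabilities (scores), not the tree itself.
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```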

Virtualization
Virtualization refers to technologies designed to provide a layer of abstraction between computer hardware systems and the software running on them
Common virtualization technologies:
Server virtualization - a single physical server supplies multiple user environments; ideal for resource optimization
Database virtualization - multiple copies of a single database image; ideal for testing activities
Data virtualization - integration of any data from disparate data sources into coherent data services

Virtualization is an abstraction layer between hardware and the provisioned software components. Common virtualization technologies: Server virtualization - many of us connect to virtual servers on a regular basis; widespread adoption, very cost effective and ideal for resource optimization, especially in a data centre context, where many operating system mounts can be provisioned by a single physical host. Another is database environment virtualization; here we take one copy of a database and spawn multiple running instances of it. The database behaves just as it normally would, but a footprint reduction can be achieved - ideal for testing, where a single copy can be cloned or seeded from production and multiple project teams can perform their destructive tests. And, relatively new, is data virtualization: combining any data for analytic consumption.

And we're taking a deeper dive into that.

Data Virtualization

Let's walk through this from left to right. Any data: it could be structured content, like a relational database or flat file; unstructured or semi-structured content, like an XML file, JSON or a web page; or streaming content, like a real-time feed. The next layer is data provisioning. Here we attempt to access any data; connectors between the sources and the provisioning layer provide the conduit for access. Each data element could be loaded by query, by process or by workflow: query provisioning is data federation, ETL or process uses a program to fetch the data and optionally apply transformations to it, and workflow could leverage message content on the Enterprise Service Bus. Now that we have any data (likely in memory), the next step is to render it: make it appear as a table; tables need to be related or relatable; and we may want to aggregate, group or compute elements to shape it. Access to rendered objects is restricted to privileged users - governance. Tools: a variety of connection methods can access the rendered data. What does this mean? Existing tools connect natively to virtualized data, queries (where appropriate) operate just like any other data source, and users are mostly unaware of the data lineage.
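As a toy illustration of those provision, render, relate and shape steps (this is pandas, not a data virtualization product; the sources and columns are invented):

```python
# Toy stand-in for the provision -> render -> relate -> shape flow described
# above, combining three hypothetical sources in memory.
import sqlite3
import pandas as pd

# Provision "any data": a relational table, a flat file and a JSON document.
crm        = pd.read_sql("SELECT customer_id, region FROM customers",
                         sqlite3.connect("crm.db"))
balances   = pd.read_csv("balances.csv")        # customer_id, product, balance
web_events = pd.read_json("web_events.json")    # customer_id, page_views

# Render each source as a table, relate them on a shared key, then shape
# (aggregate) the result into the view a BI tool would query.
view = (crm.merge(balances, on="customer_id")
           .merge(web_events, on="customer_id")
           .groupby(["region", "product"], as_index=False)
           .agg(total_balance=("balance", "sum"),
                avg_page_views=("page_views", "mean")))
print(view.head())
```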


BIW Release Virtualization
Use virtualization to:
Foster agile delivery
Prove concepts with the business
Materialize only the necessary components

How can we use data virtualization in a BIW environment? Agile delivery: data science initiatives need to evaluate hundreds of data elements to determine model lift or information entropy, leveraging any data source, structured or unstructured. Many virtualization tools are equipped with search capabilities; if you can search it, so can your business users. Quickly start and stop initiatives without compromising reuse: a data science initiative wants 300 variables this month and 600 next month; you don't have to throw it all away, just start another virtualization session and re-use the components. Prove concepts directly with business users by creating meaningful prototypes based on any data source, and respond quickly to data discoveries and changes in requirements. Materialize the necessary components: hand over to the existing materialization team, start with the virtual model to drive the design, harden only the required elements in the warehouse, and users can still access the virtual layer during materialization.

DMBOK v2 Conceptual Architecture

This is the last slide here, but it drove both chapters. The basis from DMBOK v1 drives the warehousing component. The BI portfolio is extended with Data Visualization, Data Mashups (end-user tools for combining disparate data sets) and Data Science (the field of Data Mining, Predictive Analytics and Machine Learning). Big data sits beside the warehouse: Hadoop queries in place, and query results can be stored in a digest, perhaps a NoSQL database, for further analysis. A full analytics portfolio. And that's all I have today!


Thank You
DMBOK2: http://www.dama.org/content/body-knowledge

IRMAC Data Management Education: http://www.irmac.ca/

NEXJ Customer Data Management: http://www.nexj.com/products/financial-services/enterprise-customer-view/

Questions?
E-mail: [email protected]
ca.linkedin.com/in/martingsykora

Thank you, it has been my pleasure! Reminders: DMBOK2 is on the DAMA site, with plenty more content in the book - roughly 150+ pages in the DWBI and Big Data Sciences chapters alone. IRMAC is posting a letter to universities pleading for better education programs. NexJ Customer Data Management - lots of exciting product offerings! Please do keep in touch: [email protected], and on LinkedIn too.

OPEN FLOOR FOR QUESTIONS - any questions?

Reference
http://en.wikipedia.org/Dime_(Canadian_coin) - Canada. Value: 0.10 CAD. Mass: 1.75 g. Diameter: 18.03 mm. Thickness: 1.22 mm.
Sun distance to Earth: 149,600,000 km
http://www.masswerk.at/googleBBS/
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html - Annual global IP traffic will surpass the zettabyte (1,000 exabytes) threshold in 2016. Global IP traffic will reach 1.1 zettabytes per year, or 91.3 exabytes (one billion gigabytes) per month, in 2016. By 2018, global IP traffic will reach 1.6 zettabytes per year, or 131.6 exabytes per month.
