
BIG DATA GOVERNANCE: Modern Data Management Principles for Hadoop, NoSQL & Big Data Analytics

Peter K. Ghavami, PhD

Copyright © 2016 Peter Ghavami

All rights reserved.

ISBN: 1519559720

ISBN-13: 978-1519559722


Peter K. Ghavami, Ph.D.

[email protected]

Copyright © 2016 Peter K. Ghavami, Washington, D.C. All rights reserved. This publication is protected by copyright, and permission must be obtained from the copyright holder prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording or likewise. For information regarding permissions, write to or email to:

Peter K. Ghavami, [email protected]

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or designs contained herein.

Keywords: 1. Big Data Analytics, 2. Data Governance, 3. Hadoop, 4. Data Security, 5. Data Management, 6. Data Lifecycle Management


BIG DATA GOVERNANCE

Modern Data Management Principles for Hadoop, NoSQL & Big Data Analytics

Peter K. Ghavami, PhD

First Edition

2016


Acknowledgements

This book was only possible as a result of my collaboration with many world-renowned data scientists, researchers, CIOs and leading technology innovators who have taught me a great deal about scientific research, innovation and, more importantly, about the value of collaboration. To all of them I owe a huge debt of gratitude.

Peter Ghavami

January 2016


To my beautiful wife Massi,

whose unwavering love and support make these accomplishments possible and worth pursuing.

CONTENTS

INTRODUCTION
Purpose

SECTION I: INTRODUCTION TO BIG DATA
Introduction to Big Data
The Three Dimensions of Analytics
The Distinction between BI and Analytics
Analytics Platform Framework
Data Connection Layer
Data Management Layer
Analytics Layer
Presentation Layer
The Diverse Applications of Big Data
Data Management Body of Knowledge (DMBOK)
DMBOK2: What is it?
Eight Reasons for DMBOK2
Data Maturity Model (DMM)

SECTION II: BIG DATA GOVERNANCE FUNDAMENTALS
Introduction
Top 10 Data Breaches
What is Governance?
Why Big Data Governance?
Data Steward Responsibilities
Corporate Governance
Big Data Governance Certifications
Case for Big Data Governance
Strategic Data Governance, or Tactical Data Governance?
TOGAF View of Data Governance
Data Lake vs. Data Warehouse
History of Hadoop
Hadoop Overview
HDFS Overview
MapReduce
Security Tools for Hadoop
The Components of Big Data Governance
Myths about Big Data & Hadoop Lake
Enterprise Data Governance Directive
Data Governance is More than Risk Management
First Steps Toward Big Data Governance
What Your Data Governance Model Should Address
Data Governance Tools
Big Data Governance Framework: A Lean & Effective Model
Organization
Data Quality Management
Metadata Management
Compliance, Security, Privacy Policies
The Enterprise Big Data Governance Pyramid
Introduction to Big Data Governance Rules
Organization
Data Governance Council
Data Stewardship
Authority: Policy, Decision-Making, Governance
Governance Activities: Monitoring, Support, and More
Users: Developers, Data Scientists, End-Users
Master Data Management
Metadata Management
Lake Data Classification
Security, Privacy & Compliance
Apply Tiered Data Management Model
Big Data Security Policy
Big Data Security – Access Controls
Big Data Security: Key Policies
Data Usage Agreement Policy
Security Operations Policies
Information Lifecycle Management
Quality Management
Big Data Quality & Monitoring
Data Classification Rules
Data Quality Policies
Metadata Best Practices
Big Data Governance Rules: Best Practices
Sample Data Governance & Management Tools
Data Governance in the GRC Context
The Costs of Poor Data Governance
Big Data Governance Budget Planning
What Are Other Companies Doing?

SECTION III: BIG DATA GOVERNANCE BEST PRACTICES
Data Governance Best Practices
Data Protection
Sensitive Data Protection
Data Sharing Considerations
Adherence to Governance Policies
Data Protection at the High Level
Low-Level Data Access Protection
Security Architecture for Data Lake
Data Lake Classification
Hadoop Lake Data Classification and Governance Policies
Metadata Rules
Data Structure Design
Sandbox Functionality Overview
Detailed Access Policies
Split Data Design

SECTION IV: BIG DATA GOVERNANCE FRAMEWORK PROGRAM
Big Data Governance Framework Program Overview
Integrity of Information and Data Resources and Assets
Benefits of Compliance and Risks of Non-Compliance
Definitions
I. Organization
II. Metadata Management
III. Data Classification
IV. Big Data Security, Privacy & Compliance
V. Data Usage Agreement (DUA)
VI. Security Operations Considerations & Policies
VII. Information Lifecycle Management
VIII. Quality Standards
Data Quality Reporting
Summary


INTRODUCTION


Purpose

The future of business is Big Data. While the wealth of an organization may be displayed in balance sheets and electronic ledgers, the real wealth of the organization is in its information assets – in data and how well it harnesses value from it.

While Hadoop promises the ultimate flexibility and power in storing and analyzing data, it was not designed with security and governance in mind, so we face new and additional challenges in managing its data to meet corporate and IT governance standards. After sampling the best and most successful policies and processes from around the world, I offer a simplified, low-cost but highly effective handbook of best practices in big data governance. That's why this book is indispensable to implementing big data analytics.

Knowledge is information, and information is derived from data. Without data governance and data quality, without adequate data integration and information lifecycle management, the chance of harnessing this value and leveraging data will be very limited.

According to expert reports, in 2010 there were around 1.2 zettabytes (1.2 trillion gigabytes) of data worldwide. In 2011 that number jumped to 1.8 zettabytes. The global volume of stored data is expected to double every 12-18 months, and by 2015 the world was expected to accumulate nearly 8 zettabytes of data. The majority of this data is unstructured, in the form of PDFs, spreadsheets, images, multimedia (audio, video), emails, social content, web pages and machine data, as well as GPS and sensor data.

The purpose of this book is to present a practical, effective, no-frills and yet low-cost data governance framework for big data. You'll find this book to be concise and to the point, highlighting the important and salient topics in big data that you can implement to achieve an effective data governance structure at a low cost. The policies and recommendations in this book are based on best practices from around the world in big data governance. I've included best practices from some of the most respected and leading-edge companies who have successfully implemented big data and governance.

To learn more about Big Data Analytics, you can read two other companion books by the same author. The first book is titled "Clinical Intelligence - The Big Data Analytics Revolution in Healthcare: A Framework for Clinical and Business Intelligence". It can be found at: https://www.createspace.com/4772104 . The second companion book is titled "Big Data Analytics Methods: Modern Analytics Techniques for the 21st Century". It can be found on Amazon.

This book consists of four major sections: Section I offers an overview of big data and Hadoop. Section II is an overview of big data governance concepts, structure, architecture, policies, principles and best practices. Section III presents the best practices in big data governance policies. Finally, Section IV includes a ready-to-use template for a governance structure, written in a flexible format such that you can easily adapt it to your organization.

The contents of this book are presented in a lecture-like manner using apresentation slide deck style. There is an electronic slide deck available which can bepurchasedonlineatAmazon.com.Youmayadapttheseslidesandmodifythemtofityourparticularenterpriseandsituationmakingityourown.

Now,let’sstartourjourneythroughthebook.


SECTION I: INTRODUCTION TO BIG DATA


Introduction to Big Data

DATA is the new GOLD. And ANALYTICS is the machinery that mines, molds and mints it. Big Data analytics is a set of computer-enabled analytics methods, processes and disciplines for extracting and transforming raw data into meaningful insight, new discovery and knowledge that supports more effective decision making. Another definition describes big data analytics as the discipline of extracting and analyzing data to deliver new insight about past performance, current operations and predictions of future events.

Before there was big data analytics, the study of large data sets was called data mining. But big data analytics has come a long way in the past decade and is now gaining popularity thanks to the eruption of five new technologies: big data analytics, cloud computing, mobility, social networking and smaller sensors. Each of these technologies is significant in its own way, both to how business decisions and performance can be improved and to how vast amounts of data are being generated.

Big Data is known by its three key attributes, known as the three V's: Volume, Velocity, and Variety. The world's storage volume is increasing at a rapid pace, estimated to double every year. The velocity at which this data is generated is rising, fueled by the advent of mobile devices and social networking. In medicine and healthcare, the cost and size of sensors have shrunk, making continuous patient monitoring and data acquisition from a multitude of human physiological systems an accepted practice.

With the advent of smaller, inexpensive sensors and the volume of data collected from people, the internet and machines, we're challenged with quickly making increasingly analytical decisions from the large data sets being collected. This trend is only increasing, giving rise to what's known in the industry as the "big data problem": the rate of data accumulation is rising faster than people's cognitive capacity to analyze increasingly large data sets to make decisions. The big data problem offers an opportunity for improved predictive analytics and fact-based decisions.

It’sreportedthatonaveragemostlargecompaniesintheU.S.haveaccumulatedanaverageof1.5PetaBytes(1.5PB)ofdataby2015.AfewcompaniesnowofferHadoopdatavaultsandwarehousingproducts includingCloudera,MapRandHortonWorks.TheNoSQL(NotonlySQL)andnon-relationaldatabasemovementsuchasHadoop,LuceneandSparkoffernewdatamanagementtoolsthatallowstorageandmanagementoflargestructuredandnon-structureddatasets. Butthebiggestchallenges,bigdatagovernanceremainsmostlyunchartedterritoryatthistime.

The variety of data is also increasing. For example, medical data was confined to paper for too long. As governments around the world, such as the United States, pushed medical institutions to transform their practice into electronic and digital formats, patient data became digital and took on diverse forms. It's now common for electronic medical records (EMR) to include diverse forms of data such as audio recordings; MRI, ultrasound, computed tomography (CT) and other diagnostic images; videos captured during surgery or directly from patients; color images of burns and wounds; digital images of dental x-rays; waveforms of brain scans; electrocardiograms (EKG); and the list goes on.

Big Data is characterized by the 3 V's: Velocity, Volume and Variety. This underscores the large volume of data that is being collected and stored by our digital society; the rapid pace at which data is generated by consumers, smart devices, sensors and computers; and the variety of data formats, which in almost every industry spans a wide range from text to images, web and multimedia formats.

IDC[1] predicts that the worldwide volume of data will increase by 50X from 2010 to 2020, reaching 40 ZB (zettabytes)[2]. The firm predicts that by 2020, 85% of all this data will be new data types and formats, and that there will be a 15X growth in machine-generated data by 2020. The notion of all devices and appliances generating data has led to the idea of the Internet of Things, where all devices communicate freely with each other and with other applications through the internet. McKinsey & Company predicts that by 2020, Big Data will be one of the five game changers in the US economy, and one third of the world's data will be generated in the US.

How will we manage and govern this vast and complex sea of data? What will be the costs of poor or no data governance? This book contemplates the costs of doing nothing and presents a framework to bring data governance to big data.

New types of data will include structured and unstructured text. They will include server logs and other machine-generated data; data from sensors, websites, machines, mobile and wearable devices; streaming data and customer sentiment data about you; and social media data, including Twitter, Facebook and local RSS feeds about your organization, your people and products. All these varieties of data types can be harnessed to provide a more complete picture of what is happening in the delivery of value to your customers.

Traditional data warehouse strategies based on relational databases suffer from a latency of up to 24 hours. These data warehouses can't scale quickly with large data growth, and because they impose relational and data normalization constraints, their use is limited. In addition, they provide retrospective insight, not real-time or predictive analytics. Big data analytics will become more of a real-time, "in-the-moment" decision support tool than the traditional business intelligence process of generating batch-oriented reports.

Semantics are critical to data analytics. As much as 60% - 80% of all data is unstructured data in the form of narrative text, emails or audio recordings. Correctly extracting the pertinent terms from such data is a challenge. Tools such as Natural Language Processing (NLP) methods, combined with subject-specific libraries and ontologies, are used to extract useful data from the vast amount of data stored in the Hadoop Data Lake. Hadoop Data Lake is a common term that describes the vast storage of data in the Hadoop file system. Data Lake is a term increasingly used to refer to the new generation of big data warehouses where all data are stored using open source and Hadoop technology. However, understanding sentence structure, context and relationships between business terms is critical to detecting the semantics and meaning of data.


Given the rising volume of data and the demand for high-speed data access that can handle analytics, some IT leaders are contemplating investing in dedicated analytics platforms like Hadoop, NoSQL databases and open source tools for data transformation. Transactional systems and reporting platforms are not designed to handle high-speed access to big data for analytics and thus are inadequate. As a result, specialized data analytics platforms are needed to handle the high-volume data storage and high-speed access required for analytics.

While Big Data Analytics promises phenomenal improvements in every industry, as with any technology acquisition, we want to take prudent steps toward adoption. We want to define criteria for success and gauge Return on Investment (ROI) and its data-centric counterpart, a metric that I define as Return on Data (ROD). A successful project must demonstrate palpable benefits and value derived from new insights. Implementing Big Data Analytics will become necessary and standard procedure for most organizations as they strive to identify any remaining opportunities for improving efficiency, gaining strategic advantage and cutting costs. Managing data and data governance are critical to the success of a big data analytics initiative as the volume of data increases.

In addition to implementing a robust data governance structure as a key to big data analytics success, as I mentioned in my earlier book titled "Lean, Agile and Six Sigma IT Management", successful implementation of big data analytics also requires team effort combined with lean and agile practices. Team effort is required because no single vendor, individual or solution satisfies all analytics needs of the organization, and data governance is no different. Collaboration and partnerships among all user communities in the organization, as well as among vendors, are needed for successful implementation. A lean approach is required in order to avoid duplication of effort, process waste and discordant systems.

The future of big data analytics is bright and will be so for many years to come. We're finally able to shed light on the data that has been locked up in the darkness of our electronic systems for years. When you consider other applications of analytics which are yet to be discovered, there are endless opportunities to improve business performance using data analytics.


The Three Dimensions of Analytics

Data analytics efforts may focus on any of three temporal dimensions of analysis: retrospective analysis, real-time (current-time) analysis and predictive analysis. Retrospective analytics can explain and provide knowledge about the events of the past, show trends and help find root causes for those events. Real-time analysis shows what is happening right now. It works to present situational awareness, raise alarms when data reaches a certain threshold, or send reminders when a certain rule is satisfied. Prospective analysis presents a view into the future. It attempts to predict what will happen and what the future values of certain variables will be. The figure below shows the taxonomy of the 3 analytics dimensions.

The 3 temporal dimensions in data analysis


The Distinction between BI and Analytics

The purpose of Business Intelligence (BI) is to transform raw data into information, insight and meaning for business purposes. Analytics is for discovery, knowledge creation, assertion and communication of patterns, associations, classifications and learning from data. While both approaches crunch data and use computers and software to do that, the similarities end there.

WithBI,we’reprovidingasnapshotoftheinformation,usingstaticdashboards.We’re working with normalized and complete data typically arranged in rows andcolumns. The data is structured and assumed to be accurate. Often, data that is out ofrangeoroutlierareremovedbeforeprocessing.Dataprocessingusessimple,descriptivestatistics such as mean, mode and possibly trend lines and simple data projections toextrapolationaboutthefuture.

In contrast, analytics deals with all types of data, both structured and unstructured. In medicine, about 80% of data is unstructured in the form of medical notes, charts and reports. Analytics does not mandate data to be clean and normalized. In fact, it makes no assumption about data normalization. It can analyze many varieties of data to provide a view into patterns and insights that are not humanly possible. Analytics methods are dynamic and provide dynamic and adaptive dashboards. They use advanced statistics, artificial intelligence techniques, machine learning, feedback and natural language processing (NLP) to mine through the data. They detect patterns in data to provide new discovery and knowledge. The patterns have a geometric shape, and these shapes, as some data scientists believe, have mathematical representations that explain the relationships and associations between data elements.

Unlike BI dashboards, which are static and show snapshots of data, analytics methods provide data visualization and adaptive models that are robust to changes in data and in fact learn from changes in data. While BI uses simple mathematical and descriptive statistics, analytics is highly model-based. A data scientist builds models from data to show patterns and actionable insight. Feedback and machine learning are concepts found in analytics. The table below illustrates the distinctions between BI and Analytics.


The differences between Business Intelligence and Data Analytics


Analytics Platform Framework

When considering building analytics tools and applications, a data analytics strategy and governance is recommended. One of the strategies is to avoid implementing point solutions, stand-alone applications that do not integrate with other analytics applications. Consider implementing an analytics platform that supports many analytics applications and tools integrated in the platform. A 4-layer framework is proposed here as the foundation for the entire enterprise analytics application portfolio. The 4-layer framework consists of a data connection layer, a data management layer, an analytics engine layer and a presentation layer, as shown in the figure below.

The 4-layer Data Analytics Framework

In practice, you’llmake choices aboutwhat software and vendors to adopt forbuildingthisframework.Thedatamanagementincludesthedistributedorcentralizeddatarepository. This framework assumes that the modern enterprise data warehouses willconsistofdistributedandnetworkeddatawarehousesusingopensourceenvironmentlikeHadoop.

The analytics layer may be implemented using SAS and the R statistical language, or solutions from other vendors who provide the analytics engines in this layer.

The presentation layer may consist of various visualization tools such as Tableau, QlikView, Spotfire, or other data visualization applications. In a proper implementation of this framework, the visualization layer offers analytics-driven workflows, and therefore tight integration between the presentation layer and the other two layers (data and analytics) is critical to successful implementation.

The key management framework that spans all 4 layers is data management and governance. This framework, which is the subject of this book, must include the organizational structure, operational policies and rules for security, privacy and regulatory compliance.

Let’sexamineeachlayerabitmoreclosely:

Data Connection Layer

In the data connection layer, data analysts set up data ingestion pipelines and data connectors to access data. They might apply methods to identify metadata in all source data repositories. This layer starts with making an inventory of where the data is created and stored. The data analysts might implement Extract, Transform, and Load (ETL) software tools to extract data from its source. Tools such as Talend and data exchange standards such as X.12 might be used to transfer data to the data management layer.
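As an illustration, here is a minimal ETL sketch in Python with pandas, assuming a hypothetical patients.csv export and a SQLite staging database; the file, table and column names are invented for the example, not taken from any specific product:

```python
# A minimal ETL sketch: extract from a CSV export, transform column
# names and types, load into a staging table. All names are illustrative.
import sqlite3
import pandas as pd

# Extract: read raw data from a source system export.
raw = pd.read_csv("patients.csv")

# Transform: normalize column names and parse dates before loading.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["admit_date"] = pd.to_datetime(raw["admit_date"], errors="coerce")

# Load: write the cleaned frame into a staging database table.
with sqlite3.connect("staging.db") as conn:
    raw.to_sql("patients_stage", conn, if_exists="replace", index=False)
```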

Data Management Layer

Once the data has been extracted, data scientists must perform a number of functions that are grouped under the data management layer. The data may need to be normalized and stored in certain database architectures to improve data query and access by the analytics layer. We'll cover the taxonomy of database tools, including SQL, NoSQL, Hadoop, Shark and other architectures, in the upcoming sections.

In this layer, we must pay attention to HIPAA standards for security and privacy. The data scientist will use the tools in this layer to apply security controls, such as those from HITRUST (Health Information Trust Alliance). HITRUST offers a Common Security Framework (CSF) that aligns HIPAA security controls with other security standards.

Data scientists may apply other data cleansing programs in this layer. They might write tools to de-duplicate (remove duplicate records) and resolve any data inconsistencies. Once the data has been ingested, it's ready to be analyzed by engines in the next layer.
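A small cleansing sketch along these lines, assuming pandas and an illustrative record_id business key:

```python
# De-duplication and a simple consistency fix; data is synthetic.
import pandas as pd

df = pd.DataFrame({
    "record_id": [1, 1, 2, 3, 3],
    "name": ["Ann", "Ann", "Bo", "Cy", "cy"],
})

# Remove duplicate records on the business key, keeping the first.
df = df.drop_duplicates(subset="record_id", keep="first")

# Resolve a simple inconsistency: standardize name casing.
df["name"] = df["name"].str.title()
print(df)
```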

Since big data requires fast retrieval, several organizations, in particular the open source foundations, have developed alternate database architectures that allow parallel execution of queries, reads, writes and data management.

Most analytics models require access to the entire data because often theyannotatethedatawithtagsandadditionalinformationwhicharenecessaryformodelstoperform.

Traditional data management approaches have adopted centralized data warehouse architectures. The traditional data warehouses, namely a central repository or a collection of data from disparate applications, have not been as successful as expected. One reason is that data warehouses are expensive and time-consuming to build. The other reason is that they are often limited to structured data types and difficult to use for data analytics, in particular when unstructured data is involved. Finally, traditional data warehouses insist on relational databases, tabular structures and normalized data. Such architectures are too slow to handle the large volume of data queries required by data analytics models. They require normalized relations between tables and data elements.

Star Schema: The more advanced data warehouses have adopted Kimball's Star Schema or Snowflake Schema to overcome the normalization constraints. The Star schema splits the business process into Fact tables and Dimension tables. Fact tables describe measurable information about entities, while Dimension tables store attributes about entities. The Star schema contains one or more Fact tables referencing any number of Dimension tables. The logical model typically puts the Fact table in the center with the Dimension tables surrounding it, resembling a star (hence the name). The Snowflake schema is similar to the Star schema, but its tables are normalized.

Star schemas are de-normalized: the normalization rules typical of transactional relational databases are relaxed during design and implementation. This approach offers simpler and faster queries and access to cube data. However, they share the same disadvantage with the NoSQL databases (discussed below): the rules of data integrity are not strongly enforced.
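To make the fact/dimension split concrete, here is a minimal star-schema sketch using Python's sqlite3 module; the fact_sales and dim_product tables and all column names are assumptions for illustration:

```python
# A tiny star schema: one fact table referencing one dimension table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (sale_id INTEGER PRIMARY KEY,
                          product_id INTEGER,   -- reference to dimension
                          amount REAL);         -- the measurable fact
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales  VALUES (10, 1, 9.99), (11, 1, 4.50), (12, 2, 7.00);
""")

# A typical star query: aggregate measures from the fact table,
# grouped by attributes taken from a dimension table.
for row in conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name"""):
    print(row)
```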

Non-SQL Database Schemas: In order to liberate data from relational constraints, several alternate architectures have been devised in the industry, as explained in the next sections. These non-traditional architectures include methods that store data in a columnar fashion, or store data in distributed and parallel file systems, while others use simple but highly scalable tag-value data structures. The more modern big data storage architectures are known by names like NoSQL, Hadoop, Cassandra, Lucene, SOLR, Spark and other commercial adaptations of these solutions.

NoSQL database means Not Only SQL. A NoSQL database provides a storage mechanism for data that is modeled in a form other than the tabular relations of relational databases like SQL Server. The data structures are simple and designed to match specific types of data, so the data scientist has the choice of selecting the best-fit architecture. The database is structured as a tree, columnar store, graph or key-value pair store. However, a NoSQL database can support SQL-like queries.

Hadoop is an open-source database framework for storing and processing large data sets on low-cost, commodity hardware. Its key components are the Hadoop Distributed File System (HDFS) for storing data over multiple servers and MapReduce for processing the data. Written in Java and developed at Yahoo, Hadoop stores data with redundancy and speeds up searches over multiple servers. Commercial versions of Hadoop include Hortonworks and Cloudera. I'll cover Hadoop architecture and data management in more detail in the coming sections.
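The MapReduce programming model itself can be sketched in a few lines of plain Python. This toy word count only imitates, on one machine, the map, shuffle and reduce phases that Hadoop distributes across many servers:

```python
# Word count in the MapReduce style; single-process illustration only.
from collections import defaultdict

docs = ["big data governance", "big data analytics"]

# Map phase: emit (word, 1) pairs from each input record.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'governance': 1, 'analytics': 1}
```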

Cassandra is another open-source distributed database management system designed to handle large data sets at high performance. It provides redundancy over distributed server clusters with no single point of failure. Developed at Facebook to power the search function at higher speeds, Cassandra has a hybrid data structure that is a cross between a column-oriented structure and a key-value pair. In the key-value pair structure, each row is uniquely identified by a row key. The equivalent of an RDBMS table is stored as rows in Cassandra, where each row includes multiple columns. But, unlike a table in an RDBMS, different rows may have different sets of columns, and a column can be added to a row at any time.

Lucene is an open-source database and data retrieval system that is especially suited for unstructured data or textual information. It provides full-text indexing and searching capability on large data sets, and it's often used for indexing large volumes of text. Data from many file formats such as PDF, HTML, Microsoft Word, and OpenDocument can be indexed by Lucene as long as their textual content can be extracted.

SOLR is a high-speed, full-text search platform available as an open-source (Apache Lucene project) program. It's highly scalable, offering faceted search and dynamic clustering. SOLR (pronounced "solar") is reportedly the most popular enterprise search engine. It uses the Lucene search library as its core and is often used in conjunction with Lucene.

Hive is another open-source Apache project, designed as a data warehouse system on top of Hadoop to provide data query, analysis and summarization. Developed initially at Facebook, it's now used by many large content organizations, including Netflix and Amazon. It supports a SQL-like query language called HiveQL. A key feature of Hive is indexing to provide accelerated queries, working on compressed data stored in the Hadoop database.

Spark is a modern data analytics platform, a modified version of Hadoop. It's built on the notion that distributed data collections can be cached in memory across multiple cluster nodes for faster processing. Spark fits into the Hadoop distributed file system, offering 10 times (for on-disk queries) to 100 times (for in-memory queries) faster processing for distributed queries. It offers tools for queries distributed over in-memory cluster computers that allow applications to run repeated in-memory queries rapidly. Spark is well suited to certain applications such as machine learning (which will be discussed in the next section).
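A minimal PySpark sketch of this caching behavior, assuming a local Spark installation and a hypothetical events.csv file with a status column:

```python
# Cache a distributed collection in memory so repeated queries
# avoid re-reading from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()                                          # keep in cluster memory

print(df.count())                                   # first pass fills the cache
print(df.filter(df["status"] == "error").count())   # served from memory

spark.stop()
```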

Real-time vs. Batch Analytics: Much of traditional business intelligence and analytics happens on batch data: a set of data is collected over time and the analytics is performed on the batch of data. In contrast, real-time analysis refers to techniques that update information and perform analysis at the same rate as they receive data. Real-time analysis enables timely decision making and control of systems. With real-time analysis, data and results are continuously refreshed and updated.

Analytics Layer

In this layer, a data scientist uses a number of engines to carry out the analytical functions. Depending on the task at hand, a data scientist may use one or multiple engines to build an analytics application. A more complete layer would include engines for optimization, machine learning, natural language processing, predictive modeling, pattern recognition, classification, inferencing and semantic analysis.


An optimization engine is used to find the best possible solution to a given problem. It identifies the best combination of variables to give an optimal result. Optimization is often used to find the lowest cost, the highest utilization or the optimal level of care among several possible decision options.
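As a small illustration, here is a linear program solved with SciPy; the cost coefficients and the constraint are invented for the example:

```python
# Minimize cost = 2*x0 + 3*x1 subject to x0 + x1 >= 10, x >= 0.
# linprog only handles <= constraints, so the inequality is negated.
from scipy.optimize import linprog

res = linprog(c=[2, 3], A_ub=[[-1, -1]], b_ub=[-10],
              bounds=[(0, None), (0, None)], method="highs")
print(res.x, res.fun)   # optimal mix and its minimum cost
```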

Machine learning is a branch of artificial intelligence that is concerned with the construction of programs that learn from data. This is the basis for building adaptive models and algorithms that learn from data and adapt their performance as data changes over time or when applied to one population versus another. For example, models based on machine learning can automatically classify patients into groups having a disease or not having a disease.

Natural language processing (NLP) is a field of computer science and artificial intelligence that builds computer understanding of spoken language and texts. NLP has many applications, but in the context of analyzing unstructured data, an NLP engine can extract relevant structured data from unstructured text. When combined with other engines, the extracted text can be analyzed for a variety of applications.
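A toy sketch of this kind of term extraction, using plain Python and a tiny hand-made vocabulary in place of the subject-specific libraries and ontologies a real NLP engine would use:

```python
# Pull known vocabulary terms out of free-text narrative; the
# three-term "ontology" is purely illustrative.
import re

VOCAB = {"hypertension", "diabetes", "asthma"}

def extract_terms(note: str) -> dict:
    """Return the vocabulary terms found in an unstructured note."""
    tokens = re.findall(r"[a-z]+", note.lower())
    return {"terms": sorted(set(tokens) & VOCAB)}

print(extract_terms("Patient has a history of Diabetes and asthma."))
# {'terms': ['asthma', 'diabetes']}
```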

A predictive modeling engine provides the algorithms used for making predictions. This engine would include several statistical and mathematical models that data scientists can use to make predictions. An example is making predictions about patient re-admissions after discharge. Typically, these engines ingest historical data and learn from the data to make predictions.
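A hedged sketch of such a prediction, using scikit-learn's logistic regression on synthetic discharge data; the features and the label rule are invented for illustration:

```python
# Learn from synthetic historical discharges, then score a new case.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Columns: [age, length_of_stay]; label: readmitted within 30 days.
X = rng.normal(loc=[65, 5], scale=[10, 2], size=(200, 2))
y = (X[:, 0] + 3 * X[:, 1] + rng.normal(0, 5, 200) > 85).astype(int)

model = LogisticRegression().fit(X, y)

# Predicted readmission probability for a new discharge.
print(model.predict_proba([[70, 6]])[0, 1])
```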

Pattern recognition engines, also known as data mining programs, provide tools for data scientists to discover associations and patterns in data. These tools enable data scientists to identify shapes and patterns in data, and to perform correlation analysis and clustering of data in multiple dimensions. Some of these methods identify outliers and anomalies in data, which help data scientists identify black-swan events in their data or spot suspicious or unlikely activity and behavior. Using pattern recognition algorithms, data scientists are able to identify inherent association rules from the data associations. This is called association rule learning.
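For the outlier-detection case, a brief sketch with scikit-learn's IsolationForest on synthetic data; the contamination rate is an assumption:

```python
# Flag unlikely records in a synthetic two-dimensional data set.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(300, 2))       # typical behavior
outliers = rng.uniform(6, 8, size=(5, 2))      # unlikely activity
X = np.vstack([normal, outliers])

clf = IsolationForest(contamination=0.02, random_state=1).fit(X)
labels = clf.predict(X)                        # -1 marks anomalies
print("flagged:", int((labels == -1).sum()))
```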

Another technique is building a regression model, which works to define a mathematical relationship between data variables with minimum error. When the data consists of numeric values, regression models work fine. But when data includes a mix of numbers and categorical data (textual labels), then logistic regression is used. There are linear and non-linear regression models, and since many data associations in biological systems are inherently non-linear, the more complete engines provide non-linear logistic regression methods in addition to linear models.

Classification engines solve the problem of identifying to which set of categories a subject or data element belongs. There are two approaches: a supervised method and an unsupervised method. The supervised methods use a historical set of data as the training set, where prior category membership is known. The unsupervised methods use the data associations to define classification categories. Unsupervised classification is also referred to as clustering of data. Classification engines help data scientists group patients, procedures, physicians and other entities based on their data attributes.
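A short unsupervised-classification sketch using scikit-learn's KMeans on synthetic two-dimensional data:

```python
# Cluster entities described by two attributes into two groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two synthetic groups, without any labels given up front.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=2).fit(X)
print(km.cluster_centers_)   # discovered category centers
print(km.labels_[:5])        # cluster membership per record
```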


Inference is the process of reasoning using statistics and artificial intelligence methods to draw a conclusion from data sets. Inference engines include tools and algorithms for data scientists to apply artificial intelligence reasoning methods to their data. Often the result of the inferencing analysis answers the question "what should be done next?", where a decision is to be made from observations of data sets. Some inference engines use rule-based approaches that mimic an expert person's process of decision making, collected into an expert system. Rules can be applied in a forward-chain or backward-chain process. In a forward-chain process, inference engines start with the known facts and assert new facts until a conclusion is reached. In a backward-chain process, inference engines start with a goal and work backward to find the facts that need to be asserted so the goal can be achieved.
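A tiny forward-chaining sketch in plain Python; the facts and rules are illustrative:

```python
# Rules fire on known facts and assert new facts until nothing changes.
facts = {"fever", "cough"}
rules = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected"}, "order_flu_test"),
]

changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)   # assert a new fact
            changed = True

print(facts)  # includes 'flu_suspected' and 'order_flu_test'
```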

Semantic analyzers are analytics engines that build structures and extract concepts from a large set of textual documents. These engines do not require prior semantic understanding of the documents. Semantic analysis is used to codify unstructured data or extract meaning from textual data. One application of semantic analytics is consumer sentiment analysis.

Machine Learning: Machine learning is a discipline that grew out of Artificial Intelligence. Machine learning methods learn from data patterns and can imitate intelligent decisions, perform complex reasoning and make predictions. Predictive analytics uses machine learning techniques to make predictions. These algorithms process data at the same rate as they receive data in real time, but they also have a feedback loop mechanism. The feedback loop takes the output of the analytics system and uses it as input. By processing and comparing their output to the real-world outcome, they can fine-tune their learning algorithms and adapt to new changes in data.

Statistical Analysis: Statistical analysis tools would include descriptive functions such as min, max, mode and median, plus the ability to define distribution curves, scatter plots, z-tests, percentile calculations, and outlier identification. Additional statistical analysis methods would include regression (trending) analysis, correlation, chi-square, maxima and minima calculations, t-tests and F-tests, and other methods. For more advanced tools, one can use an open source tool developed at Stanford called MADlib (www.madlib.net). Also, Apache Mahout includes a library of open source data analytics functions.
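A compact sketch of these descriptive functions with pandas and SciPy on a synthetic sample; the t-test's population mean of 10 is an assumption:

```python
# Descriptive statistics, percentiles, a simple test, and outlier flags.
import numpy as np
import pandas as pd
from scipy import stats

s = pd.Series(np.random.default_rng(3).normal(10, 2, 500))

print(s.min(), s.max(), s.median())        # basic descriptives
print(s.quantile([0.25, 0.75]).to_dict())  # percentile calculations
print(stats.ttest_1samp(s, popmean=10))    # one-sample t-test

# Flag outliers more than 3 standard deviations from the mean.
print(int((np.abs(stats.zscore(s)) > 3).sum()))
```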

Forecasting & Predictive Analytics: Forecasting and predictive analytics are the new frontiers in medical data analytics. The simplest approach to forecasting is to apply regression analysis to calculate the regression line and the parameters of the line's equation, such as the slope and intercept values. Other forecasting methods use interpolation and extrapolation. Advanced forecasting tools offer other types of analyses, such as multiple regression, non-linear regression, analysis of variance (ANOVA) and multi-variable analysis of variance (MANOVA), Mean Square Error (MSE) calculations, and residual calculations.
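The simplest case, fitting a regression line and extrapolating one step ahead, can be sketched with NumPy on a synthetic series:

```python
# Fit a regression line (slope and intercept) and forecast the next value.
import numpy as np

t = np.arange(12)                        # e.g., months
y = 2.5 * t + 4 + np.random.default_rng(4).normal(0, 1, 12)

slope, intercept = np.polyfit(t, y, deg=1)   # regression line parameters
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# Extrapolate: forecast the next period from the fitted line.
print("forecast for t=12:", slope * 12 + intercept)
```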

Predictive analytics is intended to provide insight into future events. It is model-driven and includes methods that produce predictions using supervised and unsupervised learning. Some of the methods include neural networks, PCA and Bayesian network algorithms. Predictions require the user to select the predictive variables and the dependent variables from the prediction screen.

A number of algorithms are available in this category that together provide a rich set of analytics functionality. These algorithms include logistic regression, Naive Bayes, decision trees and random forests, regression trees, linear and non-linear regression, time series ARIMA[3], ARTXp, and Mahout analytics (collaborative filtering, clustering, categorization). Additional advanced statistical analysis tools are often used, such as multivariate logistic regression, Kalman filtering, association rules, LASSO and Ridge regression, Conditional Random Fields (CRF) methods, and Cox proportional hazards models. For more details, refer to the data analytics book written by the same author.

Pattern Analysis: Using machine learning algorithms, researchers can detect patterns in data, perform classification of patient populations, and cluster data by various attributes. The algorithms used in this analysis include various neural network methods, Principal Component Analysis (PCA), Support Vector Machines, and supervised and unsupervised learning methods such as k-means clustering, logistic regression and decision trees.
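A short PCA sketch with scikit-learn, projecting correlated synthetic data onto its principal components to expose the dominant pattern:

```python
# PCA on two correlated attributes; the first component dominates.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 200)
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, 200)])  # correlated

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # e.g., ~[0.999, 0.001]
```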

Presentation Layer

This layer includes tools for building dashboards and user-facing applications that display the results of analytics engines. Data scientists often mash up several dashboards (called "mashboards") on the screen to display the results using infographic graphs. These dashboards are active and display data dynamically as the underlying analytics models continuously update the results for the dashboards.

Infographic dashboards allow us to visualize data in a more relevant way with better illustrations. These dashboards may combine a variety of charts, graphs and visuals together. Some examples of infographic components include heat maps, tree maps, bar graphs, a variety of pie charts and parallel charts, and many more visualization forms.

Advanced presentation layers include data visualization tools that allow data scientists to easily visualize the results of various analyses (classification, clustering, regression, anomaly detection, etc.) in an interactive and graphical user interface.

Several companies provide rapid data visualization programs, including Tableau, QlikView, Panopticon, Pentaho and Logi, which are revolutionizing how we view data.

Nowwe’rereadytoturnourattentiontodatagovernanceandbestpracticesinbigdatamanagement.

The Diverse Applications of Big Data

While Hadoop was designed to handle big data and distributed processing of massive amounts of data, it has other diverse use cases. In fact, there are six use cases for Hadoop as a distributed file system that you can utilize:

1. Store all data. Use Hadoop to store and combine all data of all types and formats. Store both structured and unstructured data. Store human-generated data (application data, emails, transactional data, etc.) and machine-generated data (like system logs, network security logs, access logs and so on). Since storage is inexpensive and Hadoop does not mandate any relational structure up front, you are free to bring any data into Hadoop.

This approach is called the "data lake." With this freedom, of course, comes great responsibility: to maintain and track data in the data lake. You should be able to answer questions like: What type of data do we have, and where is it stored in the data lake? Do we have such and such data in the lake? And where was the source of the data?

The ability and confidence to answer these questions lie in data governance, which is the premise of this book.

2. Stage data. Use Hadoop as a staging platform to transform, re-transform, extract and load data before loading it into the data warehouse. You can create a sandbox in Hadoop specifically to handle interim and intermediate data in a temporary storage area for processing. Tracking the lineage of data and what type of processing was applied to it are important pieces of information to maintain as part of data governance. In this type of use case, Hadoop provides a powerful platform to transform raw data into relational data formats that fit the data warehouse structure.

3. Process unstructured & external data. Use Hadoop to archive and update two types of data: 1) the unstructured data in your business, and 2) external data for analysis. Instead of using costly data warehouse resources to store these types of data, why not send them to Hadoop? This is now a common practice for many organizations. In particular, when you want to process social media data, which is typically semi-structured, Hadoop offers the ideal tools to parse and process textual information that delivers customer sentiment analysis.

4. Process any data. Use Hadoop to combine both internal and external data for processing. The Hadoop ecosystem offers a rich set of tools and processing options to join, normalize and transform data of all types and sources. More generally, you can bring any data that's currently not in the data warehouse into Hadoop and process it with internal data, providing additional insights about your customers, products and services.

5. Store & process fast data. Analyzing real-time data is a challenge that Hadoop can nicely accommodate. Consider a use case to analyze security access logs across all systems in your enterprise. One global company reports generating 2 petabytes of access log data every day that needs to be analyzed in real time. There are millions of transactions per second that need to be analyzed for a variety of use cases, ranging from real-time fraud detection (in the case of credit card usage) to pattern detection (in the case of identifying hacker activity). These are known as fast data use cases. Use Hadoop components and tools to store fast data on a petabyte scale and analyze it on the fly as it gets stored in Hadoop in real time.

6. Virtualize data. The goal of data virtualization is to provide a seamless and unified view of all enterprise data no matter where the data resides, be it in data warehouses, in the big data lake, or even spread among many Excel spreadsheets. This vision is intended to address an endemic IT problem: not knowing where the information in the organization resides and how we can access it to monetize it.

While Hadoop does not directly provide a data virtualization platform, it does contribute as a significant resource to the data virtualization architecture building block. There are open source tools and commercial solutions that over time will provide robust solutions for data virtualization. Keeping pristine data management and data governance over Hadoop data will be a major factor in achieving successful data virtualization across the enterprise.

As we continue our discussion about big data management, there are two sources of standards and best practices that deserve to be referenced. One is the Data Management Association (DAMA), which has published a set of guidelines for data management in a library called the Data Management Body of Knowledge (DMBOK). The other source of information is the Data Maturity Model. In the next segments, I'll review these two references briefly to set the stage for the next sections.


Data Management Body of Knowledge (DMBOK)

The Data Management Association (now referred to as DAMA International) is an organization that devotes its energy to advancing data management principles, guidelines and best practices. One of the organization's goals is to be the go-to source for data management concepts, education and collaboration on an international scale. It has published a set of guidelines that are publicly available as the Data Management Body of Knowledge Version 2.0 (DMBOK2)[4].

DMBOK2: What is it?

DMBOK2 is a collection of processes, principles and knowledge areas about the proper management of data. It includes many of the best practices and standards in data management. DMBOK may be used as a blueprint to develop strategies for data management for the entire enterprise or even at the division level. It was developed by the Data Management Association (DAMA), which has been providing data management guidance since 1991. The DMBOK1 guide was released in 2009 and DMBOK2 in 2011. The guidelines were developed for IT professionals, consultants, data stewards, analysts, and data managers.

There are eight key reasons to consider DMBOK2 in your organization. The key words are in caps and italicized:

Eight Reasons for DMBOK2

1. It brings Business and IT together at the table with a common understanding of data management principles and vocabulary.

2. It connects Process and Data together so the organization can develop the appropriate processes for managing and governing its data.

3. Defines Responsibility for data and accountability for the safeguarding, protection and integrity of data.

4. Establishes guidelines and removes ambiguity around change and Change management.

5. Provides a Prioritization framework to focus the business on what matters the most.

6. Considers regulatory and compliance requirements to provide an intelligent Regulatory response for the organization.

7. Helps identify Risks and risk mitigation plans.

8. It considers and promotes the notion of Data as an Asset for the organization.

9. It links data management guidelines to process Maturity and Capability.

The DMBOK2 covers 9 domains of data management, as shown in the next diagram. These 9 domains are:

1. Data Architecture Management

2. Data Development

3. Database Operations Management

4. Data Security Management

5. Reference & Master Data Management

6. Data Warehousing & Business Intelligence Management

7. Document & Content Management

8. Meta-Data Management

9. Data Quality Management

DAMA International DMBOK2

The contents of this book all relate to these 9 domains from the perspective of Big Data, Hadoop, NoSQL and Big Data Analytics.

Data Governance is at the center of the diagram. It is concerned with planning, oversight, and control over data management and the use of data. The topics under Data Governance cover data strategy, organization and roles, policies and standards, issues management and valuation.

Data Architecture Management is an integral part of the enterprise architecture. It includes activities such as enterprise data modeling, value chain analysis and data architecture.

Data Development is concerned with data modeling and design, data analysis, database design, and the building, testing, deployment and maintenance of data stores.

Document & Content Management deals with data storage principles such as acquisition & storage of data, backup and recovery procedures, content management, retrieval, retention and physical data asset storage management. These guidelines address storing, protecting, indexing, and enabling access to data found in unstructured sources (electronic files and physical records), and making this data available for integration and interoperability with structured (database) data.

Data Security Management ensures privacy, confidentiality and appropriate access. It includes guidelines for security standards, classification, administration, authentication, logging and audits.

Data Warehousing and Business Intelligence provides guidelines for managing analytical data processing and enabling access to decision support data for reporting and analysis. It contains policies regarding architecture, implementation, training and support, and data store performance monitoring and tuning.

Meta-Data Management is about integrating, controlling and delivering meta-data. It also includes guidelines on meta-data architecture, data integration, control and delivery.

Data Quality Management is concerned with defining, monitoring and improving data quality. It covers guidelines on quality specification, analysis, quality measurement and quality improvement.

Database Operations Management covers data acquisition, transformation and movement; managing ETL, federation, or virtualization issues. It offers guidelines for data acquisition, data recovery, performance tuning, data retention and purging.

Reference & Master Data Management focuses on managing gold versions of data, replicas of data and data lineage. It provides guidelines for standardized catalogues for external codes, internal codes, customer data, product data and dimension management.


Data Maturity Model (DMM)

The CMMI Institute, a branch of Carnegie Mellon University, is funded by DoD and US Government contracts, initially in response to the need to compare process maturity among DoD contractors. Capability Maturity Model Integration (CMMI) was developed as a set of process improvement and appraisal programs by Carnegie Mellon University to assess an organization's capability maturity.

In 2014, the CMMI Institute released a model with a set of principles to enable organizations to improve data management practices across the full spectrum of their business. Known as the Data Maturity Model (DMM), it provides organizations with a standard set of best practices to build a better data management structure and align data management with the organization's business goals.

In my opinion, the premise of process maturity is well grounded in the belief that better processes and higher process maturity are likely to produce higher quality and more predictable IT solutions.

The DMM model includes a Data Management Maturity Portfolio, which includes DMM itself plus supporting services, training, partnerships, assessment methods and professional certifications.

DMM was modeled after the CMMI maturity model to measure the process maturity of an organization, improve efficiency and productivity, and reduce the risks and costs associated with enterprise data.

DMM is an excellent framework to answer questions such as:

• How mature are your data management processes?

• What level of DMM maturity is your organization at? How do you raise your maturity level?

• How do I capture the maximum value and benefits from data for the business?

The DMM model is comprised of 5 levels of maturity, as shown in the following table. The source of the table is the CMMI Institute's standard for the Data Maturity Model. For more information, I recommend that you visit their website[5].

At the lowest level, Level 1, the organization's activities around data management are ad hoc, unstable and not repeatable.

At Level 2, the organization has data management processes, plans, some adherence to standards, and management objectives in place.

At Level 3, the organization has matured to establish and implement a standardized set of processes, roles and responsibilities and a proactive process measurement system.

At Level 4, data management activities and performance are measured, and quantitative objectives for quality and process management are defined and implemented.

At the highest level, Level 5, the organization optimizes its internal processes through continuous improvement, change management and alignment with business objectives as business needs change.

Level 1 (Initial): Ad hoc, inconsistent, unstable, disorganized, not repeatable. Any success is achieved at the individual level.

Level 2 (Managed): Planned and managed. Sufficient resources assigned, training provided, responsibilities allocated. Limited performance evaluation and checking of adherence to standards.

Level 3 (Defined): Standardized set of process descriptions and procedures used for creating individual processes. Activities are defined and documented in detail: roles, responsibilities, measures, process inputs and outputs, entry and exit criteria. Proactive process measurement and management. Process interrelationships defined.

Level 4 (Quantitatively Managed): Quantitative objectives are defined for quality and process performance. Performance and quality practices are defined and measured throughout the life of the process. Process-specific measures are defined. Performance is controlled and predictable.

Level 5 (Optimized): Emphasis on continuous improvement based on understanding of organization business objectives and performance needs. Performance objectives are continually updated to align with and reflect changing business objectives and organizational performance. Focus on overall organizational performance. Defined feedback loop between measurement and process change.

The DMM model offers six domains (principle areas) for maturity assessment and process improvement. The six domains are: Data Management Strategy, Data Governance, Data Quality, Data Operations, Platform & Architecture, and Supporting Services.

The following table presents the highlights of each domain, and a brief description of each domain is provided below:

Data Management Strategy: data management strategy; data management function; business case for data management; funding; communications.

Data Governance: metadata management; business glossary; governance management.

Data Quality: data quality strategy; data profiling; data quality assessment; data cleansing.

Data Operations: data requirements definition; Data Lifecycle Management; provider management.

Platform & Architecture: overall data architectural approach; data integration; architecture standards; data management platform; historical data, archiving and retention.

Supporting Processes: measurement and analysis; process management; process quality assurance; risk management; configuration management.

Data Management Strategy is concerned with how the organization views the data management function, the business case for data management, funding, communications among data management roles and adherence to the data management strategy.

Data Governance considers principles such as metadata management, implementation of a business glossary and overall governance management.


Data Quality is concerned with the maturity of strategy and processes for data quality, data profiling, data quality assessment, data cleansing and validation.

Data Operations focuses on the operations aspect of the DMM and evaluates the maturity of Data Lifecycle Management, data requirements definition and provider management.

Platform & Architecture is concerned with activities and procedures associated with data integration, overall data architecture creation, architecture standards, the data management platform and data lifecycle practices such as data archiving and retention.

Supporting Services is intended to empower the other five domains by implementing measurement and analysis, process management, process quality assurance, risk management and configuration management practices.


SECTION II: BIG DATA GOVERNANCE FUNDAMENTALS


Introduction

Big Data analytics is now a common strategy for many corporations to identify hidden patterns and insights from data. The 3 V's of Big Data analytics are stretching the limits of, if not breaking, the traditional frameworks around data governance, data management and data lifecycle management.

In the current era of Big Data, organizations are collecting, analyzing and differentiating themselves based on the analysis of massive amounts of information: data that comes in many formats, from various sources and at a very high pace.

Central to big data analytics is how data is stored in a distributed fashion that allows processing that data in parallel on many computing nodes, each typically consisting of one or more virtual machine configurations. The entire collection of parallel nodes is called a cluster.


Top 10 Data Breaches

The list of security breaches is a changing and dynamic list, as every day some new breach is reported in the news. These data breaches are not just demonstrative of poor IT security measures, but reflective of failures in data governance or poor data governance management. According to studies published by Health IT News (May 5, 2015), 42% of serious data breaches in 2014 occurred in the healthcare sector. Here are some of the most egregious breaches that led to substantial data loss:

Blue Cross Blue Shield of New Jersey lost data affecting roughly 800,000 individuals, simply by losing a laptop. The data was not encrypted. This occurred in Jan. 2014.

In early 2009, Heartland Payment Systems, a payment processing company in New Jersey, reported the largest-ever data breach to affect an American company. Around 130 million credit and debit cards were lost to hackers. Malware planted in the company's network recorded card data that arrived from more than 250,000 retailers across the country, so the impact was substantial.

Sony Online Entertainment Services and Sony experienced repeated breaches. In 2011, more than 102 million records were compromised when hackers attacked the PlayStation Network that links Sony's home gaming consoles and the servers that host the large multi-player online PC games. After the breach was discovered, the Sony PlayStation Network went dark worldwide for more than 3 weeks. Some 23 million accounts, including credit card numbers, phone numbers and addresses, were compromised in Europe alone. Soon after, about 65 class action lawsuits were brought against the company, at a cost of $171 million.

This is not to mention another breach, of Sony Pictures Entertainment in 2014, when their computer systems were hijacked and the company's 6,800 employees plus an estimated 40,000 other individuals were impacted when their data, emails and personal information were exposed.

In 2008, the National Archives and Records Administration experienced a data breach that affected 76 million records of government employees and U.S. military veterans, including their Social Security Numbers. The organization shipped a failed disk drive containing this information out to be scrapped, but the drive was never destroyed. Since then, NARA has changed its policies about protecting PII (Personally Identifiable Information).

May 2014: Thieves stole 8 computers from an office, which gave them access to the personal information of roughly 350,000 individuals. The data was not encrypted.

Feb. 2015: Anthem, formerly known as WellPoint, the 2nd largest health insurer in the U.S., reported 80 million records stolen by hackers. Stolen data included names, addresses, dates of birth, employment histories and Social Security numbers.

In 2011, Epsilon reported a breach that compromised 60 million to 250 million records. The Texas-based marketing firm, which handles email communications for more than 2,500 clients worldwide, reported that its databases related to 50 of its clients had been stolen.

Email addresses of over 60 million clients were stolen, including customers from a dozen key banks, retailers and hotels including Best Buy, JPMorgan Chase and Verizon.

Another incident, affecting 56 million payment card customers of Home Depot, occurred in 2014. The major hardware and building supplies retailer admitted to what it had suspected for weeks: sometime in April or May of 2014, the company's point-of-sale systems in its US and Canada stores were infected with malware that pretended to be antivirus software but instead stole customer credit and debit card information.

Similar to Home Depot, Target reported a breach of its databases which caused tremendous customer complaints and damage to the company's image and consumer trust.

Evernote, the popular note-taking and archiving online company, reported in 2013 that the email addresses, usernames and passwords of its clients had been exposed by a security breach. While no financial data was stolen, the notes of the company's clients and all their content had been compromised. The clients later became targets of massive spam emails, phishing scams and traps by hackers who even pretended to be from Evernote itself.

In 2013, LivingSocial, the daily-deals social site partly owned by Amazon, reported that the names, email addresses, birth dates and encrypted passwords of some 50 million customers had been stolen by hackers.

One of the worst security breach cases occurred in 2007 at TJX Companies, the parent company of the retailers TJ Maxx and HomeGoods. At least 45 million credit and debit card numbers were stolen over an 18-month period, though some put the estimate closer to 90 million. Some 450,000 customers had their personally identifiable information stolen, including driver's license numbers. The breach eventually cost the company $256 million.

With the advent of Hadoop and Big Data, and the accumulation of vast amounts of data on customers, products and social media, the risks associated with data breaches will only get more widespread and costly. Hence, investing in big data governance infrastructure, processes and policies will prevent costly exposure to data breaches in the future.

The point to keep in mind is that data governance must pay close attention to security and privacy. It must enable the organization to implement security controls and measures that provide the proper access that serves the business needs, but also a clear and hard defense against hackers. For example, policies might state that data at rest and in transit should be encrypted and that certain data elements (such as Social Security Numbers or Credit Card Numbers) are to be tokenized. Simply put, even if the data were hacked, the hackers cannot obtain the underlying values without the token keys.
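
As a minimal illustration of the tokenization idea (a sketch, not a production design), the following Python snippet replaces each Social Security Number with a random token and keeps the mapping in a separate vault. The function names and the in-memory vault are hypothetical stand-ins for a hardened, access-controlled token store:

```python
import secrets

# Hypothetical token vault: in practice this mapping would live in a
# separate, access-controlled store (e.g., an encrypted database or HSM),
# never alongside the tokenized data itself.
token_vault = {}

def tokenize_ssn(ssn: str) -> str:
    """Replace an SSN with a random, non-reversible token."""
    token = "TOK-" + secrets.token_hex(8)   # random token, no math relation to SSN
    token_vault[token] = ssn                # only the vault can map back
    return token

def detokenize(token: str) -> str:
    """Recover the original value; requires access to the vault."""
    return token_vault[token]

record = {"name": "Jane Doe", "ssn": "123-45-6789"}
record["ssn"] = tokenize_ssn(record["ssn"])
print(record)  # the SSN is now a token; a stolen copy reveals nothing
```

The key design point is that, unlike encryption, there is no key that decrypts the token; the only way back to the original value is through the separately governed vault.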

Many of these data breaches could have been averted if the basic principles of security and data governance had been applied by the organization. Studies of prior breaches have shown that if the affected organizations had applied simple and basic principles of security and data governance, they could have prevented 80% of their data loss events.

By the time you read this book, you can, unfortunately, think of many additional examples. I raise this point early in this chapter to highlight the much-needed focus on, and investment in, data governance across organizations globally.

What is Governance?

There are many definitions of data governance. One popular definition is this: Data Governance is the execution and enforcement of authority over the management of data and data-related assets. Data governance is, and should be, in alignment with corporate governance policies as well as with the IT governance framework.

Along with data governance, you might have seen references to data stewardship, and sometimes the two are used synonymously. The two notions are related but different. Data stewardship is the formalization of accountability for the management of data and data-related assets.

A data governance framework relies on several other data management roles, which will be defined later in the book. But the role of data stewards is critical to the operational integrity and success of a big data governance framework.

There are two perspectives on who a data steward is. One definition views everyone in the organization who deals with data as being accountable for how they treat data. The other describes a dedicated job role: a data steward is someone whose sole responsibility is to be accountable for their organization's treatment of data.

In the first definition, a data steward can be anyone in the organization, whether business-minded or technical. The data steward is tasked with the responsibility and accountability for what they do with the data as they define, produce and/or use data as part of their work.

In the second definition (and this is the definition that I use in this book, and the one typically implied in data governance policies), a data steward is one who defines, produces or uses data as part of their job and has a defined level of accountability for assuring quality in the definition, production or usage of that data.

The best practices in big data governance promote a partnership between IT and other functional areas, divisions and lines of business when it comes to data governance. This is a departure from the heavy, iron-fist approach that some IT organizations have employed in the past. Instead, the best practices bring clarity of purpose, accountability and policies, and the philosophy that data stewardship is not solely an IT responsibility but a shared responsibility across the enterprise. Every employee is a data steward, responsible for the proper use, storage and protection of data.

Data governance best practices apply formal accountability and set the bar for the right behaviors towards data governance in an organization. A data governance framework must support existing and new processes to ensure the proper management of data, and the proper production and usage of data through its lifecycle.


Big data governance improves regulatory compliance, security, privacy and protection of data across the enterprise. Best practices prescribe non-threatening data governance in a collaborative, federated approach to managing valuable data assets. The goal of best practices is to create a big data governance structure and framework that is A) transparent, B) supportive and C) collaborative.

Why Big Data Governance?

Another perspective on big data governance is to think of it as the convergence of policies and processes for managing ALL data in the enterprise: data quality, security, privacy and accountability for data integrity.

It's a set of processes that ensures important data assets are formally and consistently managed throughout the enterprise. It ensures data can be trusted and people are held accountable for low data quality.

Big data governance brings people and technology together to prevent costly data issues and to ensure enterprise data is more efficiently managed and utilized. Finally, it enables using data as a strategic asset, maximizing the return on investment by leveraging the power and value of big data.

Some of the reasons for data governance include:

• Compliance and fewer regulatory problems. Compliance and regulatory reports are unreliable without metadata and quality data.
• Increased assurance and dependability of knowledge assets. With governance, you can trust your data and base decisions on solid data.
• Improved information security and privacy. Data governance will reduce the risks and exposures associated with data loss or breaches.
• Across-the-enterprise accountability. Data governance enables more accountability and consistency of data handling across the enterprise.
• Consistent data quality. Improved quality reduces rework, waste and delays. It improves the quality of decision making.
• Maximizing asset potential. With data governance, the assets are known, discoverable and usable across the enterprise, thus raising the value of data and the return on data (ROD).

Data Steward Responsibilities

Data stewards are formally accountable for defining rules about new data vs. existing data. Their responsibilities include defining rules around logical and physical data modeling, metadata definition and the business glossary, and recognizing the system of record.

The responsibilities of data stewards extend to defining production rules for data produced internally or data acquired from outside, data movement across the enterprise, and ETL (Extract, Transform, Load) functions.

Another key responsibility includes defining and administering data usage rules and enforcing them: in particular, defining business rules, classification of data, protection of data, retention rules around data, and compliance and regulatory requirements.

Corporate Governance

Corporate governance refers to the mechanisms, processes, policies and controls by which corporations are controlled and directed. These mechanisms include monitoring the actions, policies and decisions of corporations, their officers and employees. Corporate governance and IT governance are the overarching structures with which big data governance must be in alignment.

Big Data Governance Certifications

Many organizations opt to apply for and receive certification of their data governance structure. Certification can be granted by an outside organization. Certifying your data governance structure has many tangible and intangible benefits.

The value and benefits of data governance increase with the number of employees and the amount of data that is maintained in the enterprise. Certification implies people in the organization understand the "value" of data governance; they understand data and are competent to protect it. Competency is created by leadership through setting direction, governance structure and training.

The enterprise value of certification can be profound. Certification implies that people are educated in proper data management policies and follow the rules: the risk management rules, the business rules, and the data change management rules. Certification also means that people are consistent in how they define data, produce data and use data across the enterprise.

With a big data governance certification, your organization can promote that it is "good with its data"! It can make declarations such as "We protect our data better than our competition", or "We improve the quality of our data"; "Our data is known, trustworthy and managed"; or "We centralize how we manage our data".

Imagine an opposite and not so desirable situation: let's assume you're presenting the results of a randomized clinical trial to an FDA panel to receive approval for releasing a new drug to market. Imagine if your statement said: "we have the results, but we don't know much about the quality of our data, where it came from and who has had access to it. Furthermore, we can't even reproduce these results because we never safeguarded the source data nor took snapshots of our analysis". This would be a low point for a company that can't govern and manage its big data.


Case for Big Data Governance

A number of incidents indicate the seriousness of security breaches that have caused substantial damage and cost to organizations.

A study released by Symantec found that on average each security breach costs the organization $5.4M. Another study claims that the cost of cybercrime to the US economy is $140B per year.

Sony has experienced more than its share of breaches by hackers. In 2011, one of the largest breaches in recent history involved Sony's PlayStation Network, which cost as much as $24B according to some experts.

Some organizations, like Sony, have experienced multiple breaches of security at a very high cost in terms of reputation, financial impact and loss of consumer trust. Data governance has become an important strategy for both enterprise governance and IT governance. It is the vehicle to ensure the data is properly managed, used and protected.

Without proper security controls, Big Data can become Big Breaches. Put differently, the more data you have collected, the bigger the magnitude and cost of the risk, and the more important it is to protect the data. This implies that we must have effective security controls on data in storage, inside our networks and as it leaves our networks.

Depending on the sensitivity of the data, we must control who has access to the data, who can see what analysis is performed, and who has access to the resulting analysis.

Hadoop has become a popular platform for big data processing. But Hadoop was originally designed without security in mind. While Hadoop's security model has evolved over the years, security professionals have sounded the alarm about Hadoop's security vulnerabilities and potential risks. Increasingly, new third-party security solutions come to market, but it's not easy to craft an architecture integrating these solutions to create a comprehensive and complete security infrastructure.

Strategic Data Governance, or Tactical Data Governance?

With Big Data and the proliferation of Hadoop, we're at the dawn of a new era: the age of data democracy, or the democratization of data. Data democratization delivers a lot of power and value, but with it comes responsibility, and a bigger need for governance. There are 3 primary levels, or perspectives, of data governance: tactical, strategic and operational.

Operational data governance focuses on the daily operations and safeguarding of data, security, privacy and implementing policies, typically influenced by the IT organization with little involvement from the business owners of data. Data is not treated as a strategic asset. There is no staff dedicated to the roles of data stewards or data guardians in the organization.

Tactical data governance is concerned with the current and immediate issues associated with the management of data, the implementation of governance, and accountability for data governance. This approach spends its energy on governance measures, policies, roles and responsibilities. It regulates the metadata and business glossary, and the standardization of terms, names and data dictionaries.

Tactical data governance is often regarded as Data Governance 1.0. It concentrates on daily activities and the maintenance of policies through data stewards and other members who support data governance decision making. There is typically little involvement or designation of an accountable executive or data risk officer(s) who provide ground cover and executive leadership for data governance, an organizational structure typically found in strategic data governance.

Sometimes referred to as Data Governance 2.0, strategic data governance is concerned with the propagation of data assets, the forthcoming objectives and challenges that the organization will encounter as the data grows, and the long-term maintenance of it; in short, the future perspective of data governance. Strategic data management treats data as a corporate asset and works to add as much value from this asset to the organization as possible.

Strategic data governance brings a holistic view of standardization into the conversation, including evaluating the organization's current standard terms and ontology, identifying the gaps, and introducing advanced tools and techniques such as semantic ontologies and semantic concepts across the enterprise. In this approach, you'll find roles like the Chief Data Officer (CDO), Accountable Executive (AE) and Data Risk Officers (DRO), and a central data council (also referred to as the Data Governance Council) in the organization.


TOGAF View of Data Governance

The Open Group Architecture Framework (TOGAF) has published a set of architecture best practices and principles in its latest standard, TOGAF Version 9.1, released in 2011. The goal of TOGAF is to provide guidance and an open forum to enable interoperability and a boundary-less flow of information among vendor applications and systems.

TOGAF promotes architectural best practices through 10 distinct sets of guidelines and activities. These guidelines include:

The Preliminary phase
Architecture vision
Business architecture
Information systems architecture
Technology architecture
Opportunities & solutions
Migration planning
Implementation governance
Architecture change management
Requirements management

Specifically, TOGAF promotes the adoption of key principles by the organization before drafting an architecture. Each principle defines a high-level aspiration, vision and policy. The principle-driven approach to architecture is a powerful way to bring common understanding and guidance through the architecture and design phases.

TOGAF Version 9.1 provides a set of sample principles covering a wide range of information architecture topics, but for the sake of brevity I outline only a few guiding principles that pertain to data governance in the following section. These principles are excerpts from TOGAF training courseware [6]. For more details and a complete overview of TOGAF Version 9.1, I recommend visiting the site and reading the entire set of TOGAF documents.

Sample TOGAF-Inspired Data Principles

Principle 1: Data is an Asset - Data is an asset that has value to the enterprise and is managed accordingly.

Rationale: Data is a valuable corporate resource; it has real, measurable value. In simple terms, the purpose of data is to aid decision-making. Accurate, timely data is critical to accurate, timely decisions. Most corporate assets are carefully managed, and data is no exception. Data is the foundation of our decision making, so we must also carefully manage data to ensure that we know where it is, can rely upon its accuracy, and can obtain it when and where we need it.


Implications:

• This is one of three closely-related principles regarding data: data is an asset; data is shared; and data is easily accessible. The implication is that there is an education task to ensure that all organizations within the enterprise understand the relationship between the value of data, sharing of data, and accessibility to data.
• Data stewards must have the authority and means to manage the data for which they are accountable.
• We must make the cultural transition from "data ownership" thinking to "data stewardship" thinking.
• The role of data steward is critical because obsolete, incorrect, or inconsistent data could be passed to enterprise personnel and adversely affect decisions across the enterprise.
• Part of the role of the data steward, who manages the data, is to ensure data quality. Procedures must be developed and used to prevent and correct errors in the information and to improve those processes that produce flawed information. Data quality will need to be measured and steps taken to improve it; it is probable that policy and procedures will need to be developed for this as well.
• A forum with comprehensive enterprise-wide representation should decide on process changes suggested by the steward.
• Since data is an asset of value to the entire enterprise, data stewards accountable for properly managing the data must be assigned at the enterprise level.

Principle 2: Data is Shared - Users have access to the data necessary to perform their duties; therefore, data is shared across enterprise functions and organizations.

Rationale:

Timely access to accurate data is essential to improving the quality and efficiency of enterprise decision-making. It is less costly to maintain timely, accurate data in a single application, and then share it, than it is to maintain duplicative data in multiple applications. The enterprise holds a wealth of data, but it is stored in hundreds of incompatible stovepipe databases. The speed of data collection, creation, transfer, and assimilation is driven by the ability of the organization to efficiently share these islands of data across the organization.

Shared data will result in improved decisions since we will rely on fewer (ultimately one virtual) sources of more accurate and timely managed data for all of our decision-making. Electronically shared data will result in increased efficiency when existing data entities can be used, without re-keying, to create new entities.

Implications:

• There is an education requirement to ensure that all organizations within the enterprise understand the relationship between the value of data, sharing of data, and accessibility to data.
• To enable data sharing we must develop and abide by a common set of policies, procedures, and standards governing data management and access for both the short and the long term.
• For the short term, to preserve our significant investment in legacy systems, we must invest in software capable of migrating legacy system data into a shared data environment.
• We will also need to develop standard data models, data elements, and other metadata that defines this shared environment, and develop a repository system for storing this metadata to make it accessible.
• For the long term, as legacy systems are replaced, we must adopt and enforce common data access policies and guidelines for new application developers to ensure that data in new applications remains available to the shared environment and that data in the shared environment can continue to be used by the new applications.
• For both the short term and the long term we must adopt common methods and tools for creating, maintaining, and accessing the data shared across the enterprise.
• Data sharing will require a significant cultural change.
• This principle of data sharing will continually "bump up against" the principle of data security. Under no circumstances will the data sharing principle cause confidential data to be compromised.
• Data made available for sharing will have to be relied upon by all users to execute their respective tasks. This will ensure that only the most accurate and timely data is relied upon for decision-making. Shared data will become the enterprise-wide "virtual single source" of data.

Principle 3: Data is Accessible - Data is accessible for users to perform their functions.

Rationale: Wide access to data leads to efficiency and effectiveness in decision-making, and affords timely response to information requests and service delivery. Using information must be considered from an enterprise perspective to allow access by a wide variety of users. Staff time is saved and consistency of data is improved.

Implications:

• Accessibility involves the ease with which users obtain information.
• The way information is accessed and displayed must be sufficiently adaptable to meet a wide range of enterprise users and their corresponding methods of access.
• Access to data does not constitute understanding of the data. Personnel should take caution not to misinterpret information.
• Access to data does not necessarily grant the user access rights to modify or disclose the data. This will require an education process and a change in the organizational culture, which currently supports a belief in "ownership" of data by functional units.

Principle 4: Data Trustee & Data Steward - Each data element has a trustee accountable for data quality and a data steward accountable for the proper management and usage of the data.


Rationale: One of the benefits of an architected environment is the ability to share data (e.g., text, video, sound, etc.) across the enterprise. As the degree of data sharing grows and business units rely upon common information, it becomes essential that only the data trustee makes decisions about the content of data. Since data can lose its integrity when it is entered multiple times, the data trustee will have sole responsibility for data entry, which eliminates redundant human effort and data storage resources.

A data trustee is different than a data steward: a trustee is responsible for the accuracy and currency of the data, while the responsibilities of a steward are broader and include data standardization, data usage policies, data access management and definition tasks.

Implications:

• Data Trustee and Data Steward roles resolve ambiguity around data "ownership" issues and allow the data to be available to meet all users' needs. This implies that a cultural change from data "ownership" to data "trusteeship" and "stewardship" may be required.
• The data trustee will be responsible for meeting the quality requirements levied upon the data for which the trustee is accountable.
• It is essential that the trustee has the ability to provide user confidence in the data based upon attributes such as "data source".
• It is essential to identify the true source of the data in order that the data authority can be assigned this trustee responsibility. This does not mean that classified sources will be revealed, nor does it mean the source will be the trustee.
• Information should be captured electronically once and immediately validated as close to the source as possible. Quality control measures must be implemented to ensure the integrity of the data.
• As a result of sharing data across the enterprise, the trustee is accountable and responsible for the accuracy and currency of their designated data element(s) and, subsequently, must recognize the importance of this trusteeship responsibility.
• The Data Steward is responsible for ensuring the proper use of data according to access rules, regulatory and compliance rules, data sovereignty, data jurisdictional rules and contractual agreements with customers and/or third-party data vendors.
• Sometimes the roles of data steward and data trustee are combined into one individual. You need to define the role as it fits your organization.

Principle 5: Common Vocabulary and Data Definitions - Data is defined consistently throughout the enterprise, and the definitions are understandable and available to all users.

Rationale: The data that will be used in the development of applications must have a common definition throughout the Headquarters to enable sharing of data. A common vocabulary will facilitate communications and enable dialog to be effective. In addition, it is required to interface systems and exchange data.

Implications:

• Data definitions, common vocabulary and business glossaries are key to the success of efforts to improve the information environment. This is separate from, but related to, the issue of data element definition, which is addressed by a broad community; this is more like a common vocabulary and definition.
• The enterprise must establish the initial common vocabulary for the business. The definitions will be used uniformly throughout the enterprise.
• Whenever a new data definition is required, the definition effort will be coordinated and reconciled with the corporate "glossary" of data descriptions. The enterprise data administrator will provide this coordination.
• Ambiguities resulting from multiple parochial definitions of data must give way to accepted enterprise-wide definitions and understanding.
• Multiple data standardization initiatives need to be coordinated.
• Functional data administration responsibilities must be assigned.

Principle 6: Data Security - Data is protected from unauthorized use and disclosure. In addition to the traditional aspects of security classification (for example, Classified Secret, Confidential, Proprietary, Public), this includes, but is not limited to, protection of pre-decisional, sensitive, source selection-sensitive, and other classifications that fit your organization's line of business.

Rationale: Open sharing of information and the release of information via relevant legislation must be balanced against the need to restrict the availability of classified, proprietary, and sensitive information.

Existing laws and regulations require the safeguarding of national security and the privacy of data, while permitting free and open access. Pre-decisional (work-in-progress, not yet authorized for release) information must be protected to avoid unwarranted speculation, misinterpretation, and inappropriate use.

Implications:

• Aggregation of data, both classified and not, will create a large target requiring review and de-classification procedures to maintain appropriate control. Data owners and/or functional users must determine whether the aggregation results in an increased classification level. We will need appropriate policy and procedures to handle this review and declassification.
• Access to information based on a need-to-know policy will force regular reviews of the body of information.
• The current practice of having separate systems to contain different classifications needs to be rethought. It is more expensive to manage unclassified data on a classified system. Up until now the only way to combine the two was to place the unclassified data on the classified system, where it remained. However, as we shall see in the coming sections, we can use Hadoop to create multiple data sandboxes that offer fit-for-purpose configurations. We can create different sandboxes to keep classified and unclassified data separate but on the same Hadoop distributed file system. In general, it's recommended that all data in the Hadoop lake be classified to maintain a manageable data lake.
• In order to adequately provide access to open information while maintaining secure information, security needs must be identified and developed at the data level, not just at the application level.
• Data security safeguards can be put in place to restrict access to "view only" or "never see". Sensitivity labeling for access to pre-decisional, decisional, classified, sensitive, or proprietary information must be determined.
• Security must be designed into data elements from the beginning; it cannot be added later. Systems, data, and technologies must be protected from unauthorized access and manipulation. Headquarters information must be safeguarded against inadvertent or unauthorized alteration, sabotage, disaster, or disclosure.
• New policies are needed on managing the duration of protection for pre-decisional information and other works-in-progress, in consideration of content freshness.


Data Lake vs. Data Warehouse

We often hear references to the data lake and data warehouse together, as if they were synonymous and the data lake were just another incarnation of a data warehouse. The truth is that there are major differences between a data lake and a data warehouse. The data lake can be thought of as a large storage space that can house any kind of data, data structures and formats at a very low cost. A data lake is a storage repository that holds a vast amount of raw and processed (refined) data in its native format, including structured, unstructured and semi-structured data. The data structure and requirements are not defined until the data is needed.

A data lake is not a data warehouse. The two are optimized for different purposes, each aiming to be the best tool for its purpose.

Tamara Dull, in one of her blogs, provides a table to highlight the differences, as shown below:

Data Warehouse vs. Data Lake

Data: structured, processed (warehouse) vs. structured, semi-structured, unstructured, raw (lake)
Processing: schema-on-write (warehouse) vs. schema-on-read (lake)
Storage: expensive for large data volumes (warehouse) vs. designed for low-cost storage (lake)
Agility: less agile, fixed configuration (warehouse) vs. highly agile, configure and reconfigure as needed (lake)
Security: mature (warehouse) vs. maturing (lake)
Users: business professionals (warehouse) vs. data scientists, data engineers, et al. (lake)

So, how do these differences affect our choice of data lake vs. data warehouse? Here is a discussion of these points:

Data: A data warehouse only stores data that has been modeled and structured, with clear entity relationships defined. A data lake stores any type of data: raw and unstructured as well as structured data.

Processing: Before loading data into a data warehouse, we must give it some structure and shape; i.e., we need to model it, set it up in tables and identify primary and foreign keys. This is called schema-on-write. It can take a long time and a lot of resources to understand the schema of data up front. With a data lake, you load the data as-is, raw, without any need to know or make any assumptions about its structure. You can enforce a structure with your programs when you read the data for analysis. That's called schema-on-read. These are two very different approaches.
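
To make the distinction concrete, here is a minimal, hypothetical PySpark sketch of schema-on-read: raw JSON files land in the lake untouched, and a structure is declared only at read time. The path and field names are illustrative, not a prescribed layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: the files in the lake were written as-is, with no upfront
# modeling. We declare the structure we care about only when we read.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type",  StringType()),
    StructField("amount",      DoubleType()),
])

# Hypothetical landing path in the data lake
events = spark.read.schema(schema).json("hdfs:///lake/raw/events/")

# A different team could read the same raw files with a different schema
events.groupBy("event_type").count().show()
```

Note the design trade-off: the warehouse pays the modeling cost once, at write time, for everyone; the lake defers that cost to each reader, which is what makes governance and metadata so important there.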

Storage: One of the advantages of a data lake is using open source technologies such as Hadoop that offer extremely low-cost storage and a distributed computing capability that supports complex data analysis. Hadoop is not only free but also runs on low-cost commodity hardware. Since Hadoop can scale up as your data grows, you can simply add hardware only when you need it. These features provide a lower cost of ownership for storage.

Agility: A data warehouse is inherently designed to be a structured repository. While it's possible to change the data warehouse structure, it will consume a lot of time and resources to do so. In contrast, the data lake relaxes the structural rules of a data warehouse, which gives developers and data scientists the ability to easily load data and to configure and reconfigure their models, queries and data transformations on the fly.

Security: Data warehouses have matured over the last two decades and are able to offer detailed and robust security and access control features. Data lake technologies are still evolving, and hence much of the burden of security and data governance falls on the user community's shoulders. Until the data lake security tools catch up to data warehouse levels, significant effort and attention must be spent towards securing data.

Users: Data warehouses are designed to support hundreds if not thousands of business users and queries as the platform of choice for business intelligence (BI) reports. However, the data lake is typically used by a handful of data scientists and data engineers. While the number of data lake users is steadily on the rise, a data lake is not designed to handle thousands of user accounts and log-ins without several front-end and middleware applications to handle the user interface. The data lake at this time is best suited for use by data scientists and data engineers. But, as the data lake paradigm shifts from data science to data products, we can anticipate a larger volume of business users utilizing the data lake through new analytics products and data pipelines developed by data scientists and data engineers.

I've mentioned Hadoop several times in the previous sections, and this is a good segue to dig deeper into understanding Hadoop and its architecture.


History of Hadoop

It's a well-known fact that the original developers of Hadoop did not consider security a key component of their solution at the time. The emphasis for Hadoop was to create a distributed computing environment that could manage large amounts of public web data. At the time, confidentiality was not a major requirement. Initially it was argued that since Hadoop uses a trusted cluster of servers, serving trusted users and trusted applications, security would not be a concern. However, more careful inspection of the architecture has shown that all those assumptions are inadequate in a real world where threats are rampant.

The initial Hadoop infrastructure had no security model: it did not authenticate users, it did not control access, and there was no data privacy. Since Hadoop was designed to efficiently distribute work over a distributed set of servers, any user could submit work to run.

While the earlier Hadoop versions contained auditing and authorization controls, including HDFS file permissions, access control could easily be circumvented, since any user could impersonate any other user with a simple command-line switch. Since such impersonations were common and could be done by most users, the security controls were really ineffective.

During that time, organizations concerned with security adopted different strategies to overcome these limitations, most commonly by segregating Hadoop clusters into private networks and restricting access to authorized users on those segments.

Overall there were few security controls in Hadoop, and as a result many accidents and mishaps occurred. Because all users and programmers had the same level of privilege to all the data in the cluster, any job could access any data in the cluster and any user could potentially read any data set. It was possible for one user to delete massive amounts of data within seconds with a distributed delete command.

One of the key components of Hadoop, called MapReduce (we will study MapReduce in more detail later), was not aware of authentication and authorization. It was possible for a mischievous user to lower the priorities of other Hadoop jobs in the cluster to make his job run faster, or even worse, kill other jobs.

Over the years, Hadoop became more popular in commercial organizations and security became a high priority. Security professionals voiced concerns about insider threats by users or applications that could act maliciously. For example, a malicious user could write a program to impersonate other users and their Hadoop services, or impersonate the HDFS or MapReduce jobs, deleting everything in Hadoop.

Eventually, the Hadoop community started to focus on authentication, and a team at Yahoo! chose Kerberos as the authentication mechanism for Hadoop. The Hadoop 0.20.20x release added several important security features, such as:

Mutual Authentication with Kerberos – Incorporates the Simple Authentication & Security Layer and Generic Security Service APIs (SASL/GSSAPI) to implement Kerberos and mutually authenticate users and their applications on Remote Procedure Call (RPC) connections.

Pluggable authentication for HTTP web consoles – This feature allows implementers of web applications and web consoles to create their own authentication mechanism for HTTP connections.

Enforcement of HDFS file permissions – Hadoop administrators could use this feature to control access to HDFS files by file permissions, namely providing Access Control Lists (ACLs) of all users and groups.

Delegation Tokens – When a Hadoop job starts after the initial authentication through Kerberos, it needs a mechanism to maintain access on the distributed environment as it gets split over several tasks. Delegation Tokens allow these tasks to gain access to data blocks where needed, based on the initial HDFS file permissions, without the need to check with Kerberos again.
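
For reference, on more recent Hadoop versions this Kerberos authentication is switched on in core-site.xml. The snippet below is a minimal sketch of the two core properties involved; a real deployment also requires principals, keytabs and per-service settings, which are omitted here:

```xml
<!-- core-site.xml: minimal sketch of turning on Kerberos security.
     A real deployment also needs principals, keytabs and per-service
     configuration, which are omitted in this sketch. -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>   <!-- the default is "simple" (no authentication) -->
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>       <!-- enforce service-level authorization -->
  </property>
</configuration>
```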


Hadoop Overview

Hadoop is a technology that enables storing data and processing that data in parallel over a distributed set of computers in a cluster. It's a framework for running applications on a distributed environment built of commodity hardware. The Hadoop framework provides both the reliability and the data movement necessary for the processing to appear as a single environment. Hadoop consists of two major components: a file system called the Hadoop Distributed File System (HDFS) and MapReduce, a tool to manage the execution of applications over the distributed computing environment.

The following segments offer a more detailed view into Hadoop, MapReduce and how Hadoop distributed processing works. You can skip the following segments without loss of relevant information.

HDFS Overview

HDFS is the utility that stores data on the distributed nodes with inherent redundancy and data reliability. Since HDFS comes with its own RAID-like architecture, you don't need to configure your data in RAID format. The HDFS RAID module allows a file to be divided into stripes consisting of several blocks. This file striping increases protection against data corruption. Using this RAID mechanism, the need for data replication can be lowered while maintaining the same level of data availability. The net result is lower storage space costs.
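
As a point of reference, the replication that provides this redundancy is governed by a configurable factor. The hdfs-site.xml sketch below shows the property; 3 is the common default, and the value can also be overridden per file:

```xml
<!-- hdfs-site.xml: sketch of the block replication setting.
     Each HDFS block is stored on this many DataNodes; 3 is the
     usual default, trading extra storage for fault tolerance. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```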

The NameNode is the main component of the HDFS filesystem. It keeps the directory tree of all files in the filesystem and tracks where the data files are stored across the cluster. It does not itself store the data, but tracks where the data is stored.

A user's application (client application) talks to the NameNode whenever it wants to locate a file, or when it wants to take any file action (Add/Copy/Move/Delete). The NameNode responds to such requests by returning the list of DataNode servers where the data is stored.

The NameNode is a single point of failure in Hadoop HDFS. Currently HDFS is not a high-availability system. When the NameNode is down, the entire filesystem goes offline. However, a secondary NameNode process can be created on a separate server as a backup. The secondary NameNode only creates a second filesystem copy and does not provide real redundancy. The Hadoop High Availability name service is currently being developed by a community of active contributors.

It's recommended that the NameNode server be configured with a lot of RAM to allow bigger filesystems. In addition, listing more than one NameNode directory in the configuration helps create multiple copies of the filesystem metadata. As long as the directories are on separate disks, a single disk failure will not corrupt the filesystem metadata.
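
As an illustration, multiple metadata directories are listed in hdfs-site.xml. The sketch below uses two hypothetical mount points on separate physical disks; HDFS writes a full copy of the filesystem image to each:

```xml
<!-- hdfs-site.xml: sketch of redundant NameNode metadata directories.
     The two paths are hypothetical mount points on separate physical
     disks; HDFS keeps a full copy of the filesystem metadata in each. -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///disk1/hdfs/name,file:///disk2/hdfs/name</value>
  </property>
</configuration>
```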

Other best practices for the NameNode include:

• Configure the NameNode to save one set of transaction logs on a separate disk from the image, and a second copy of the transaction logs to network-mounted storage.
• Monitor the disk space available to the NameNode. If the available space is getting low, add more storage.
• Do not host JobTracker, DataNode and TaskTracker services on the same server.

A DataNode stores data in the Hadoop File System. A filesystem has more than one DataNode, with data replicated across them. When the DataNode process starts, it connects to the NameNode and spins until that service comes up. Then it acts on requests from the NameNode for filesystem operations.

DataNode instances are capable of talking to each other, which is what occurs when they are replicating data. Because of this DataNode replication, there is typically no need to use RAID storage, since data is designated to be replicated across multiple servers.

The JobTracker is the service inside Hadoop that assigns MapReduce tasks to specific nodes in the cluster and keeps track of the jobs and the capacity of each server in the cluster. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs stop.

A TaskTracker is a node in the cluster that accepts tasks, such as Map, Reduce and Shuffle operations, from the JobTracker. Every TaskTracker contains several slots, which indicate the number of tasks that the TaskTracker can accept. When the JobTracker looks to find where it can schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data. If no slots are available on that server, it looks for an empty slot on a server in the same rack.

To perform the actual task, the TaskTracker spawns a separate JVM process to do the work. This ensures that a process failure does not take down the TaskTracker. The TaskTracker monitors these spawned processes, watching the output and exit codes from their execution. When a process completes, whether successful or not, the TaskTracker notifies the JobTracker of the result. The TaskTracker also sends heartbeat messages to the JobTracker every few minutes to inform the JobTracker that it's still alive. In these messages, the TaskTracker also informs the JobTracker of its available slots.

MapReduce

MapReduce is a tool that divides the application into many small fragments, each of which can be executed on any node in the cluster, and then re-aggregates the results to produce the final result. Hadoop uses MapReduce to distribute work around a cluster. It consists of a Map process and a Reduce process.

The Map process is a transformation of input data containing a Key and a Value into an output Key/Value pair. It's important to note that A) the output may have a different key from the input; and B) the output may have multiple entries with the same key.

The Reduce process is also a transformation: it takes all values for a specific key and generates a new list of the transformed output data (the reduced output). The MapReduce engine is designed such that all Map and Reduce transformations run independently of each other, so the operations can run in parallel on different keys and lists of data. In a large cluster, you can run the Map operations on the servers where the data resides. This saves you from having to copy data over the network; instead you send the program to the servers where the data lives. The output list can be saved by the distributed filesystem, ready for the Reduce process to merge the results.

As we saw earlier, the distributed filesystem spreads multiple copies of data across multiple servers. The result is higher reliability without the need for RAID disk configurations; storing data in multiple locations also comes in handy for running applications in parallel. If a server with a copy of the data is busy or offline, another server can run the application.

The JobTracker in Hadoop keeps track of which MapReduce jobs are running, and schedules individual Map or Reduce processes and the necessary merging operations on specific servers. The JobTracker monitors the success or failure of these tasks to complete the entire client application job.

A typical distributed operation follows these steps:

1. Client applications submit jobs to the JobTracker.
2. The JobTracker communicates with the NameNode to determine the location of the data.
3. The JobTracker then finds the available processing slots among TaskTracker nodes.
4. The JobTracker submits the task to the selected TaskTracker nodes.
5. The TaskTracker monitors the nodes and the status of the tasks (processes) running in the server slots. If these tasks fail to submit heartbeat signals, they're deemed to have failed and the work is scheduled on a different TaskTracker.
6. Each TaskTracker notifies the JobTracker when a job fails. The JobTracker determines what should be done next: it may resubmit the job on another node, it may mark that specific record as data to avoid, or it may block the specific TaskTracker as being unreliable.
7. When the work is finished, the JobTracker updates its status, and the client application can poll the JobTracker for information about the status of the work.

Therefore, Hadoop is an ideal environment to run applications that can run in parallel. For maximum parallelism, the Map and Reduce operations should be stateless and not depend on any specific data. You can't control the sequence in which the Map or Reduce processes run.

It would be inefficient to use Hadoop to repeat the same searches over and over in a database. A database with an index will be faster than running a MapReduce job over unindexed data. But, if that index needs to be regenerated when new data is added, or if data is continuously added (such as incoming streaming data), then MapReduce will have an advantage. The efficiency gains from Hadoop in these situations are measured in both CPU time and power consumption.

Another nuance to keep in mind is that Hadoop does not start any Reduce process until all Map processes have completed (or failed). Your application will not return any results until the entire Map is completed.

Other Hadoop-related projects and tools include: HBase, Hive, Pig, ZooKeeper, Kafka, Lucene, Jira, Dumbo, and, more importantly for our discussion, Falcon, Ranger, Atlas and many others. Next we'll review each of these projects to get a more complete picture of the Hadoop environment.

Page 70: BIG DATA GOVERNANCElab.ilkom.unila.ac.id/ebook/e-books Big Data/2016... · Modern Data Management Principles for Hadoop, NoSQL & Big Data Analytics Peter K. Ghavami, PhD First Edition

Security Tools for Hadoop

Solutions like Cloudera Sentry, Hortonworks Ranger, DataGuise, Protegrity, Voltage, Apache Ranger, and Apache Falcon are examples of products that are available to Hadoop system administrators for implementing security measures.

Other open source projects, such as Apache Accumulo, work to provide additional security for Hadoop. Other projects, like Project Rhino and Knox Gateway, promise to make internal changes in Hadoop itself.


The Components of Big Data Governance

There are seven key components of a big data governance framework. These include Organization, Metadata, Compliance, Data Quality, Business Process Integration, Master Data Management (MDM), and Information Lifecycle Management (ILM).

The 7 Components of Big Data Governance


Myths about Big Data & the Hadoop Lake

Despite the huge investments made in technology, staffing the data analytics office and the costs of acquiring data, governance is overlooked in many organizations. Part of the issue is that the new technology is a bright shiny object, often distracting planners and practitioners alike from the challenges of managing the data. Big data has come with some myths which overlook the management issues associated with managing, keeping track of and making sense of all the data.

Here are some examples of myths that are prevalent in the industry: "We're data scientists… we don't need IT… we'll make the rules as we go". Or it's common to hear data people say: "We can put 'anything' in the lake" and "we can use any tool to analyze data".

Though perhaps not expressed openly, other myths seem to frame the wrong conversations around data governance. You might have heard myths such as: "Data quality is not important since my model can handle anything"; "We'll build the environment first and then apply governance".

But the truth is that without Big Data Governance from the very start, within less than 2 years the Hadoop data lake will be in complete chaos: a costly "wild west" experiment! Many organizations that did not plan governance early in their adoption of Hadoop have turned their data lake into a data swamp. After spending millions of dollars on their big data infrastructure, they don't know what data they have, how it's being used and where to find the data they need.


Enterprise Data Governance Directive:

The goal of this book is to show the components of, and tips on establishing, both effective and lean data governance. To make a governance structure effective, it must contain all the right components and reference designs for a successful implementation. For the structure to be lean, it must be low cost to implement, low overhead to maintain and free of excess baggage. In other words, you might ask: "What is the minimum viable framework that will get the biggest bang for the buck, so to speak, for my investment in big data governance?"

The answer is in this book. I've looked to bring agile and lean data governance together in this book. To reiterate our goals, a lean and minimally viable governance framework must include:

1. Organization

2. Metadata management

3. Security/Privacy/Compliance Policies

4. Data Quality Management

I'll cover these components plus more in this book, starting from a tactical data governance framework and building toward a strategic data governance structure.

Data Governance is More than Risk Management

The risk management perspective of data governance defines data governance as a model to manage risks associated with data management and activity. The risk management approach promotes developing a single governance model for all types of data, processes, usage & analytics across the enterprise, not just limited to big data or Hadoop infrastructure. It's important to consider that for a big data governance model to be effective in managing overall risk, it must be part of the bigger enterprise-wide data governance and management framework.

First Steps Toward Big Data Governance

The first step towards a practical data governance model is to recognize and highlight the difference between traditional data and big data governance policies. Hadoop and open source tools stretch the current envelope on big data governance models such that our current models are neither adequate nor adaptable to meet the new challenges introduced by big data. There are serious gaps in data governance when we introduce big data infrastructure.

If you consider a big data repository like Hadoop as being the data lake where all of your enterprise data will eventually reside, you can recognize the overwhelming task of bringing the entire set of data activities across the enterprise into governance. This step includes gathering data governance requirements and identifying the gaps for your organization before you get too far into implementing the Hadoop infrastructure.


The second step is to establish basic rules of governance and define where they are applied. The next step is to establish processes for governance and then to graduate your data scientists' products into governance. This requires clarity around policies and proper training across the user community in your organization.

The final step is to identify the tools for data governance that meet your requirements and can be integrated into your existing infrastructure.

What Your Data Governance Model Should Address:

For a big data governance model to be effective, there are four elements that it must address:

1. Accountability: Your plan should establish clear lines and roles of accountability and responsibility for data governance across the enterprise and for each division, affiliate and country of operation.

2. Inventory: This is one of the most important features of a data governance model and one that will deliver one of the highest returns on your investment. The goal of inventory is for Data Stewards to maintain an inventory of all data in the Hadoop data lake using metadata tools. The level of documentation in the metadata tool may vary depending on the importance and criticality of the data, but nevertheless, some documentation is required. The metadata tool is known by many names, such as "data registry" tool or Meta Data Hub (MDH), among other references. But they all point to the central registry where all data is inventoried. In this book I'll most often refer to the metadata tool as the Meta Data Hub (MDH).

3. Process: Establish processes for managing and using data in the Hadoop Data Lake, big data analysis tools, and big data resource usage & quality monitoring. The model must define data lifecycle policies, procedures for data quality monitoring, data validation, data transformations and data registry.

4. Rules: Establish & train all users on rules & a code of conduct. Creating clear rules for data usage, such as Data Usage Agreements (DUA), data registry and metadata management, quality management and data retention practices, is important. I've included a set of rules as examples at the end of this section to illustrate a sample list of rules that your governance model might include.

Data Governance Tools

Hadoop was not designed for enterprise data management and governance. For quite some time there were major gaps in bringing Hadoop to the same level of governance as the rest of the organization. However, there are more tools available today and more tools are about to come to market soon.

All this implies that there are two challenges for the enterprise:

The first challenge is that there are many products, but they do not cover the entire functional span to handle all aspects of governance covering security, privacy, data management, quality and metadata management. That leaves you with choices of which products to select and how to integrate them into a seamless and consistent governance infrastructure. The advice and challenge for you is to select no more than 2-3 tools that completely support the entire span of data governance policies and integrate them. Any more than 3 products can become hard to integrate, manage and support.

The second challenge is that the pace of innovation in the big data world is quite rapid. New products emerge almost weekly that supersede prior products. So, be prepared to throw away your hard-earned work on the existing tools, revamp them and adopt these new tools every 2-3 years for some time to come.

The diagram below shows an ecosystem of potential tools that perform certain functional requirements of a big data governance infrastructure. The choice and challenge of selecting the right tool is still on your shoulders.

The Hadoop & Big Data Governance Tools Ecosystem


Big Data Governance Framework: A Lean & Effective Model

As alluded to before, the major elements of an effective, lean and agile governance model for big data include four pillars: Organization, Data Quality Management, Metadata Management and Policies. Here are some more points about each pillar:

Organization

The big data governance model must encompass the entire organization. Data governance is not an IT-only concern, nor limited to data scientists in the organization. Everyone in the organization is responsible for good data governance, the safeguarding of data and its proper usage. Furthermore, to bring more strategic attention and clear lines of accountability, we must identify the necessary roles and organizational structures to ensure data governance success.

The first organizational task is to establish a Data Governance Council (Data Council) that is the focal point in the organization to establish and enforce policies. Data Councils are typically comprised of IT executives (the Chief Data Officer and the Chief Information Officer to name a few), the Accountable Executives (AE) from various divisions and affiliates, the core data governance leadership and possibly the Chief Executive Officer (CEO), depending on how strategically data management is viewed in the organization. In some organizations, an Accountable Executive from each line of business may be selected (or appointed) to be an official Data Risk Officer (DRO) for that division or line of business. The role of DRO is a formal recognition of the importance of data risk management for that line of business.

The other component of organizational structure is to develop the overall governance framework. The framework outlines the expectations, accountability and components of the governance. Section IV presents a framework that can be easily adopted as a starting point to form your customized governance model tailored to your organization.

In addition to the governance framework, we need to establish guidelines around data lifecycle management. Lifecycle management must address data retention policies, data usage practices, how to handle 3rd party data and how the data must be purged upon its expiration.

Finally, the organization must identify and measure the level of data risks, exposure to data risks and possible mitigation plans.

Data Quality Management

An activity that's often overlooked in many organizations is Data Quality. Across the enterprise, we find data that's either missing, duplicated or simply filled with errors. The cost of data cleansing and the resources it consumes are substantial. So, why don't we ensure that our data is trusted and of high quality right from the beginning?

We can start anew with the Hadoop Data Lake by enforcing the proper data quality and data validation tests to maintain a pristine and trusted data lake.


The Data Council must define the guidelines for minimum data quality standards and required data check points. Check points in the form of data validation checks are necessary every time data is imported into Hadoop or transformed into a new format.

To effectively manage data quality, process automation can go a long way. So, we want to implement products such as Drools and comparable solutions to maintain data engineers' productivity and data analytics momentum.

Metadata Management

The governance policies must define minimum metadata requirements and rules for registering data in the lake. In the following segments, I'll explain several metadata attributes and structures that can be adopted as a starting point.

Registering data in a metadata store requires proper tools. So far, the open source tools like HCatalog, Hive and Metastore meet the bare minimum functionality needed to handle metadata management. But other solutions like Loom, SuperLuminate, Apache Atlas and other tools are available that can offer more utility.

The key to extracting value from data is in the ability to discover the data, find it and use it properly. We need to know what data we have and how to find it. We need to maintain information about our data, namely metadata.

Metadata is data about data. It's any information that assists in understanding the structure, meaning, provenance or usage of an information asset. Metadata is sometimes defined as "the information needed to make information assets useable", or "information about the physical data, technical and business processes, data rules and constraints, and logical and physical structures of the data, as used by an organization".

It's often said that "you can't manage what you can't measure", which resonates with executive leadership. Similarly, to support the efforts for metadata management, there is plenty of truth to the saying that "you can't govern what you don't understand".

A Business Glossary is a definitive dictionary of business terms, their attributes and relationships used across the organization. The definitions must be designed to deliver a common and consistent understanding of what is meant by each term for all staff and partners regardless of business function. For example, a seemingly simple term like "customer" can have many meanings, such as "Prospect", "VIP_Customer", "Partner", "Retailer", where each applies differently with possibly different business rules. Without a consistent and common definition of each customer type, it's difficult to know and segment the market by customers. For example, if your firm intends to send discount coupons to prospects, it should target a different segment and not all customers.
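To illustrate, a business glossary entry for these customer types might look like the following sketch; the definitions and the coupon rule are hypothetical examples, not a standard taxonomy:

```python
# A hypothetical business-glossary fragment for the "customer" example above.
GLOSSARY = {
    "Prospect":     "A potential buyer who has not yet purchased; eligible for coupons.",
    "VIP_Customer": "A high-value existing customer; excluded from prospect promotions.",
    "Partner":      "An organization that resells or co-markets our products.",
    "Retailer":     "A storefront that stocks and sells our products to consumers.",
}

def coupon_targets(customers):
    """Apply the business rule above: coupons go to prospects only."""
    return [c for c in customers if c["type"] == "Prospect"]
```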

There are 3 primary categories of metadata:

1. Business metadata: Consists of the business perspective of data, like business definitions, business processes, steward and ownership information about the data, regulatory and compliance elements and data lineage.

2. Technical metadata: Provides technical specifications of the data, rules about recovery, backups, transformations, extract information, audit controls, and version maintenance.

3. Operational metadata: Includes operational aspects of data such as data process tracking, data events, results of data operations, dates/times of updates, access times, change control information and who (or what applications) consumes the data.

Metadata management begins with creating a centralized and searchable collection of definitions in a data registry system which I call the Meta Data Hub (MDH). Provide training to Data Stewards on how to use the metadata hub so all data entered into the metadata conforms to the necessary quality requirements. In addition, training Data Stewards to maintain consistent quality of the data is a critical success factor of the program.

Companies such as Kyvos Insights (Los Gatos, CA) offer tools that present an OLAP cube and dimensional view of your data in Hadoop regardless of the format or the database it's stored in. Knowing which OLAP cube relations you wish to derive from data is enabled by metadata information that maintains the data relationships with each other.

Compliance, Security, Privacy Policies

Policies must clearly define accountability in the organization. It's critical that the governance framework identifies the roles and responsibilities of Data Stewards, data scientists, data engineers and general users for the organization.

The policies must also define security standards for encryption, tokenization, data masking and data classification. As part of accountability policies, they must define the process of assigning new data sets to the right Data Steward in the organization.

There is no rule of thumb as to how many Data Stewards are needed for an enterprise. In fact, it's just as difficult to say what the ratio of storage volume per Data Steward should be. But, after grinding through and learning from several big data initiatives around the world, I think you can count on needing 10 Data Stewards for a PetaByte (PB) size data lake, namely at least 1 Data Steward per 100 TB of data.

Finally, be prepared to conduct internal audits. Conducting internal audits is a good measure to manage and identify gaps and risks. Internal audits should be regular (at least once a year). It is best practice that the results of internal audits are reported to the Data Council for review.


The Four Pillars of Big Data Management


The Enterprise Big Data Governance Pyramid

One advantage of viewing big data governance as a pyramid is that it exposes governance as a hierarchy and layers of policies. Consider the following diagram that shows a four-layer pyramid for governance. At the peak of the pyramid, we enjoy the benefits and value of governance by having fully governed and trusted data. This data is available to the entire enterprise: data scientists and data engineers, subject matter experts and privileged users who can run their data analytics models, business intelligence reports, ad-hoc queries, data discovery, and data insight extraction.

The success of the top layers depends on the effectiveness of the lower layers. The fully governed and trusted enterprise data platform (layer 4) requires a functioning and effective sandbox for data scientists and the user community (layer 3). Layer 4 is the level where users conduct their modeling and big data analytics, and set up data analytics pipelines and data products. Layer 3 is the level where users refine their data: process it, cleanse, transform and curate it, preparing it for analysis.

The sandbox functionality requires a robust and well configured data lake to thrive (layer 2). In both layers 2 and 3, it's critical to implement a metadata registry so users can search its catalog and locate the data they need. Information Lifecycle Management (ILM) policies should govern access control, retention policy and data usage models. In these layers, quality management policies that govern monitoring and data testing should be implemented.

Finally, at the lowest level (layer 1), we have the raw data and landing area from source systems. We want to ensure "full fidelity data", meaning that we want all of the data as-is from the source. In this layer we need policies that define data ingestion pipeline rules, data registration in the Meta Data Hub (MDH) and access controls.

The Big Data Governance Pyramid


Introduction to Big Data Governance Rules

There are 4 types of rules that apply to big data in a unique way, in addition to the traditional data governance models:

1. Data Protection Rules

2. Data Classification Rules

3. Compliance Rules

4. Process (Business) Rules

I'll discuss each of these rules in more detail in this section. Additional policies and configuration recommendations appear in Section IV.

The Four Pillars of Big Data Governance Rules


Organization

Data Governance Council

In order to manage and enforce governance policies across the enterprise, we need to establish a cadre that supports the operations and implementation of the big data governance framework. The cadre is a dedicated core team of IT engineers who understand Hadoop security, privacy and compliance standards. Extended from this core team are the "virtual" members of the Data Council: the Data Stewards, who sit in their respective divisions (or lines of business) but collaborate collectively under the governance framework. While the Data Council meets on a regular basis (a monthly cadence is recommended), the Data Stewards also meet, but more frequently (preferably once a week), to share information and implement data governance policies in a federated model.

The diagram below illustrates a possible organizational chart that defines the roles and interaction model between various data governance roles.

A Sample Organization Chart for Data Governance

Here is a brief description of the job roles:

Data Council: The Data Council is a cross-functional team of business and process managers who oversee the compliance and operational integrity of the Data Governance Framework.

Accountable Executive: Also known as the AE, a VP or higher executive responsible for overall compliance with enterprise data governance policies.

Data Risk Officer: The DRO is a Director or higher level manager responsible for identifying and tracking data risk issues related to big data. DROs are appointed by AEs.

Data Steward: Data Stewards are typically data analysts who represent their respective divisions, country offices, subsidiaries or affiliates to apply and monitor data governance policies. Data Stewards are the point of accountability for enforcing data governance policies and data quality management. They are often selected by the DROs.

Data Forum Managers: Some organizations create a Data Forum Management group, while others form a Core Data Management team consisting of IT staff responsible for policy updates, technical implementation and operations of the systems, tools and policies of each of the 4 pillars of the Data Governance Framework.


Data Stewardship

Data Stewards are the people who define, cleanse, archive, analyze, share and validate the data that is charted into the metadata of their organization. Data Stewards are the people who run the data governance operations. Data Stewardship is responsible for making certain the information assets of the enterprise are reliable. They update the metadata repository hub, add new databases and applications, introduce new lexicons and glossary definitions, and test and validate the accuracy of data as it moves through the organization and its transformations.

Data Stewardship is the practical execution and activities of what the Data Governance structure is enabling. A Data Steward is accountable for the appropriate, successful and cost-effective availability and use of some part of an organization's data portfolio.

Data Stewards are the agents of trustable data within the organization. They're responsible for tracking, improving, guaranteeing, supporting, producing, purging and archiving data for their organization. They work with other data user roles (data scientists, data engineers, data analysts, etc.) in the user community as the spokes for data communication between an organization's business and IT divisions and the Data Council. They help align the data requirements coming from the business community with the IT teams that are supporting their data. Data Stewards understand both the technical aspects of the data and its business applications. So they're essential to the functional operations of data movement, user access management and data quality management within their organization (be it an entire company, or a division, country, affiliate, or other sub-segment of the enterprise).

Data Stewards oversee the metadata definitions and work daily to create, document and update data definitions; verify completeness and accuracy of data assets under their oversight; establish data lineage; report data quality issues and work with IT staff (or Hadoop technical staff) to fix them; work with data modelers and data scientists to provide context for the data; work to define governance policies; and broadcast and train users on policies for greater data governance awareness.

Data Stewards do not have to be in a centralized group under a single manager. In fact, to make the work of Data Stewards more federated, they can be distributed across the enterprise and report to different managers. It's recommended that each organization within the enterprise (division level or line of business, etc.) fund its own staffing resources for Data Stewards. However, it's conceivable that the IT organization provides such staffing resources as well. In a matrix organization, Data Stewards may report to their business managers in their organization with a dotted line to the IT organization.

Authority: Policy, Decision-Making, Governance

The data governance framework must establish clear lines of accountability and authority: who will make the decisions related to policy, governance, rules and the process for decision making relative to big data governance. So, the objective of the discussion around authority must be to define and regulate policies and answer questions like:


Which user roles and user groups can access Personally Identifiable Information (PII)?

What are our data classifications & rules for metadata management related to each class of data?

It must also define and create business rules for user access. There is a need for a "Controller" who must approve all new employee access to data. In addition, we need to track and report on data lineage to determine (and be able to track and audit) where the data came from and what the data source was.

Governance policies must also define processes for reporting on data quality rules. Data quality policies address questions related to:

Metadata management and business glossaries

Data quality profiling

Master data stewardship

We want to ensure that different departments do not duplicate data across the data lake.

Governance Activities: Monitoring, Support, and More…

Let's address some policies around security and privacy monitoring. Implementing a big data governance framework must include measures to:

Ensure Hadoop access controls and usage patterns are in compliance

Ensure compliance with regulatory standards: HIPAA, HITRUST, and other standards that pertain to your data and industry

Data Management: Much of the work of data management is carried out by Data Stewards. Data Stewards will handle lineage tracking. This implies that we need to set up a tagging framework for tracking sources. In addition, there are governance policies related to data jurisdiction, in particular as the organization works on data across country boundary lines. We need to apply data jurisdictional rules by affiliate, country and data provider agreements.

Compliance audits are important to determine how well the organization is performing against its intended governance policies. The goal of compliance audits is to prove controls are in place as specified by regulations. In some cases, you might want to apply for external certification of your governance framework to assure external organizations that you apply adequate data management and governance policies to your data.

Data Stewards perform a lot of metadata scanning. They certify that data has complete metadata information. Furthermore, Data Stewards perform data quality checks and defect resolution, and escalate data quality issues to the Data Governance Council.


Finally, Data Stewards collaborate with the Core Governance Team to define a Data Usage Agreement (DUA) for the organization. The DUA defines and controls the usage model of the data sets as documented in the Meta Data Hub (MDH).

Users: Developers, Data Scientists, End-Users

The user community of a big data environment can be diverse, consisting of developers, data scientists, data product managers, data engineers and simply "end-users" who deal with data in some form or another.

In order to properly manage access controls, it's prudent to classify data into several classes: for example, sensitive (data that includes personally identifiable information), critical (data that is used to make critical business decisions) and normal (data that's not critical to business operations). For a pharmaceutical company, social media data may be regarded as normal, non-critical data. But randomized clinical trial data collected as part of a new FDA drug trial will highly likely be Critical data.

Users have obligations and responsibilities too. They provide input into the data classification process. They request data access by following the process (e.g., submitting a request to the Data Steward for access to historical data containing PII).

Users are responsible for registering their data upon loading it into Hadoop. They're responsible for maintaining the metadata information for their data. Also, they may request information and verification about their data. An example of requesting data verification is to request data lineage/history if reports appear incorrect.

Users may run (or request from data engineers) specific data transformations.However,policiesrequireuserstoadheretodatalineagerulesonwheretogetdata,wheretostoreresults. Finally,usersareexpected toworkwithin theirsandbox,where they’reallowedtocreatedataanalyticspipelinesanddataanalyticsproducts.


Master Data Management

Master Data Management (MDM) is the process of standardizing definitions and glossaries of business entities about data. There are some very well-understood and easily identified master-data items, such as "customer", "sales" and "product." In fact, many define master data by simply using a commonly agreed upon master-data item list, such as: customer, product, location, employee, and asset. The systems and processes required to maintain this data are known as Master Data Management.

I define Master Data Management (MDM) as the technology, tools, and processes required to create and maintain consistent and accurate lists of master data. Many off-the-shelf software systems have lists of data that are shared and used by several of the applications that make up the system. For example, a typical ERP system will have, as a minimum, a Customer Master, an Item Master, and an Account Master.

Master data are the critical nouns of a business and generally cover four categories: people, things, places, and concepts. Further sub-categories within those groupings are called subject areas, also known as domain areas or entity types. For example, sub-categories of People might include Customer, Employee, and Salesperson. Within Things, entity types are Product, Part, Store, and Equipment. Entity types for Concepts might include things like Contract, Warranty, and Licenses. Finally, sub-categories for Places might include Store Locations and Geographic Franchises.

Some of these entity types may be further sub-divided. Customer may be further segmented, based on incentives and history, into "Prospect", "Premier_Customer", "Executive_Customer" and so on. Product may be further segmented by sector and industry. The usage of data, requirements, lifecycle, and CRUD (Create, Read, Update, Destroy) cycle for a product in the pharmaceutical sector is likely very different from those of the clothing industry. The granularity of domains is essentially determined by the degree of difference between the attributes of the entities within that domain.

There are many types and perspectives of master data, but here are five types of data in corporations:

Unstructured – This is data found in e-mail, social media, magazine articles, blogs, corporate intranet portals, product specifications, marketing collateral, and PDF files.

Transactional – This is data related to business transactions like sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.

Metadata – This is data about other data and may reside in a formal repository or in various other forms such as XML documents, report definitions, column descriptions in a database, log files, connections, and configuration files.

Hierarchical – Hierarchical data maintains the relationships between other data. It may be stored as part of an accounting system or separately as descriptions of real-world relationships, such as company organizational structures or product lines. Hierarchical data is sometimes considered a super MDM domain, because it is critical to understanding and sometimes discovering the relationships between master data [7].

Master – The critical nouns of the business described above (people, things, places, and concepts) that the transactional, hierarchical and metadata types all refer to.

Master data can be described by business processes and the way that it interacts with other data. For example, in transactional systems, master data is almost always involved with transactional data. The relationships can be determined via stories that depict the business processes. For example, a customer buys a product. A supplier sells a part, and a vendor delivers a shipment of materials to a location. An employee is hierarchically related to their manager, who reports up to another manager (another employee). A product may be part of multiple hierarchies describing its placement within a store. This relationship between master data and transactional data may be fundamentally viewed as a noun/verb relationship. Transactional data captures the verbs, such as sale, delivery, purchase, email, and revocation; while master data embodies the nouns.

Master data is also described by the way that it is created, read, updated, deleted, and searched through its life cycle. This life cycle, called the CRUD cycle for short, defines the meaning and purpose of that data at each stage of the life cycle. For example, how Customer or Product data is created depends on a company's business rules, business processes, and industry.

One company may have multiple customer-creation methods, such as through the Internet, directly by call center agents, sales representatives, or through retail stores.

As the number of records about an element decreases, the likelihood of that element being treated as a master-data element decreases. But, depending on your business, you may opt to elevate a less frequently used entity to a master data element.

Generally, master data is less volatile than transactional data. As it becomes more volatile, it's considered more transactional. For example, consider an entity like "contract". Some may consider it a transaction if the lifespan of a "contract" is very short. The more valuable the data element is to the company, the more likely it will be considered a master data element.

One of the primary drivers of master-data management is reuse. For example, in a simple world, the CRM system would manage everything about a customer and never need to share any information about the customer with other systems. However, in today's complex environments, customer information needs to be shared across multiple applications. That's where the trouble begins. Because, for a number of reasons, access to a master datum is not always available, people start storing master data in various locations, such as spreadsheets and application private stores.

Since master data is often used by multiple applications, an error in master data can cause errors in all the applications that use it. For example, an incorrect address in the customer master might mean orders, bills, and marketing literature are all sent to the wrong address. One company intended to send discount coupons to prospects for its product, but mistakenly sent coupons to all customers (including existing customers who had just purchased the product). As a result, the company experienced lower profits when existing customers returned to redeem their coupons. Similarly, an incorrect price on an item master can cause a marketing disaster, and an incorrect account number in an Account Master can lead to huge fines.

Maintaining a high-quality, consistent set of master data for your organization has become a necessity. Master data management must encompass big data and Hadoop. An important step towards enterprise master data management is to create a metadata system in Hadoop that is in synch with the metadata in the rest of the enterprise. Some companies have a single metadata system. Others have created two metadata registries: one that serves the traditional data stores (Oracle, Teradata, MySQL, etc.), and another that is dedicated to Hadoop. However, having two metadata registries out of synch with each other can lead to duplication of effort and errors. These companies are integrating their two metadata systems into one, or enabling them to update each other upon a change in one system or the other.

There are three basic steps to creating master data: 1) clean and standardize the data, 2) consolidate duplicates by matching data from all the sources across the enterprise, and 3) before adding new master data from data in the Hadoop Lake, ensure you leverage existing master data.

Before cleaning and normalizing your data, you must understand the data model for the master data. As part of the modeling process, the contents of each attribute must be defined, and a mapping must be defined from each source system to the master-data model. This information is used to define the transformations necessary to clean your source data.

The process of cleaning the data and transforming it into the master data model is very similar to the Extract, Transform, and Load (ETL) processes used to populate the Hadoop Lake. If you already have ETL and transformation tools defined, you can just modify these to extract master data, instead of learning a new tool. Here are some typical data-cleansing functions [8], with a short sketch of a few of them after the list:

Normalize data formats. Make all the phone numbers look the same, transform addresses (and so on) to a common format.

Replace missing values. Insert defaults, look up ZIP codes from the address, look up the Dun & Bradstreet number.

Standardize values. Convert all measurements to metric, convert prices to a common currency, change part numbers to an industry standard.

Map attributes. Parse the first name and last name out of a contact-name field, move Part# and Part_No to the Part_Number field.
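Here is a minimal Python sketch of a few of these functions; the field names and formats are hypothetical, not tied to any particular system:

```python
import re

def normalize_phone(raw):
    """Normalize data formats: make all US phone numbers look the same."""
    digits = re.sub(r"\D", "", raw)[-10:]  # keep the last 10 digits
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}" if len(digits) == 10 else raw

def replace_missing(record, field, default):
    """Replace missing values: insert a default when a field is absent or blank."""
    if not record.get(field):
        record[field] = default
    return record

def map_attributes(record):
    """Map attributes: parse first and last name out of a contact-name field."""
    first, _, last = record.pop("contact_name", "").partition(" ")
    record["first_name"], record["last_name"] = first, last
    return record

row = {"contact_name": "Jane Doe", "phone": "206.555.0100", "country": ""}
row = map_attributes(replace_missing(row, "country", "US"))
row["phone"] = normalize_phone(row["phone"])  # -> "(206) 555-0100"
```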

Most tools will cleanse the data automatically to the extent that they can, and put the rest into an error table for manual processing. Depending on how the matching tool works, the cleansed data will be put into a master table or a series of staging tables. As each source is cleansed, the output should be examined to ensure the cleansing process is working correctly.


Meta Data Management

Metadata is not limited to just names, definitions & labels. In fact, it should include more comprehensive information about the dataset, the data table and the data column (field). The data attributes stored in the metadata should include the following (a sketch of such a record appears after this list):

Data lineage – Record the source of the data and the history of its movement and transformations through the data lifecycle.

Tracking usage – Record who creates this data and who (which programs) consume the data.

Relationship to products, people & business processes – Define how the data is used in business processes or business decision making. Answer questions like: how is the data used in product development and by people?

Privacy, access & confidentiality information – Record the privacy, regulatory and compliance requirements of the data.
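As an illustration, a registration record in the Meta Data Hub covering these attributes might look like the sketch below; the field names are hypothetical, not a product schema:

```python
# A hypothetical MDH registration record for one data set.
dataset_entry = {
    "name": "claims_2016",
    "steward": "jdoe",                            # accountable Data Steward
    "lineage": {                                  # source and movement history
        "source": "oracle://erp/claims",
        "transformations": ["landed_raw", "keyed_to_hive", "validated"],
    },
    "usage": {                                    # who creates and consumes it
        "created_by": "ingest_pipeline_v2",
        "consumed_by": ["heor_reports", "fraud_model"],
    },
    "business_context": "Claims history used in reimbursement decisions",
    "privacy": {                                  # regulatory requirements
        "classification": "sensitive",
        "controls": ["tokenize:ssn", "mask:credit_card"],
        "retention_years": 10,
    },
    "status": "Trusted",                          # vs. "Pending"
}
```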

In order to manage metadata, you must implement a metadata system, also known by other names such as Meta Data Hub (MDH) or metadata registry application. There are several possible choices for implementing a metadata registry system. You may consider open source and proprietary tools such as:

HCatalog

Hive Metastore

Superluminate

Apache Atlas

In addition, you may consider integrating other tools such as Oozie, Redpoint and Apache Falcon to complement the metadata registry system.

Lake Data Classification

The key to successful data management and data governance is data classification. Data may be classified around several topics and contexts, as you've seen so far in this book. A classification that defines the stage of data in the Hadoop lake is the four stages of Raw, Keyed, Validated and Refined.

Raw Data: When data lands in the Hadoop lake, it's regarded as Raw data. Best practices denote that no analysis or reports are to be generated directly from this data. Raw data must have limited access, typically by a few data engineers and system Administrators (Admins). Raw data should be encrypted before landing in the Hadoop lake.

Keyed Data: This is data that has been transformed into a data structure with a schema. Keyed data might be put into a key-value paired structure, or stored in Hive, HBase or Impala structures. Keyed data is derived from Raw data, but is physically a different data file. Access to Keyed data is also limited to a few data engineers and administrators. No analysis or reporting is allowed from this data.

Validated Data: When data sets go through transformations (such as filtering, extraction, integration and blending with other data sets), the resulting data sets must be reviewed and validated against the original data sets. This is an important step to ensure quality, completeness and accuracy of data. Data checkpoints work to validate data after each transformation. Data engineers and Data Stewards monitor any error logs from data transformation jobs for failures. There are tests (such as counting the number of records before and after the transformation) to ensure the resulting data is accurate. Once data passes these tests, it's regarded as validated. Validated data may be used for analysis, reporting and exporting out of the Hadoop lake.
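A minimal checkpoint of the record-count test might look like the following Spark sketch; the paths and the failure handling are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("validation-checkpoint").getOrCreate()

# Hypothetical stage locations for one data set.
keyed = spark.read.parquet("/lake/keyed/claims_2016")
validated = spark.read.parquet("/lake/validated/claims_2016")

# Compare record counts before and after the transformation; on a mismatch
# the data set stays in Pending status and the failure is logged for review.
if keyed.count() != validated.count():
    raise RuntimeError("Record count mismatch: data set remains in Pending status")
```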

Governance policies for validated data make it possible for users to access the data (as long as they're privileged to have access to that particular data set) and perform analysis or reporting. Validated data is derived from Keyed data but it's a different file. At this stage the Raw and Keyed data may be discarded, as they've been transformed into the validated stage.

Refined Data: Validated data may undergo additional transformations and structure changes. For example, a validated data set may be transformed into a new format and blended with other data to be ready for analysis. The result is refined data. Refined data may be exported out of Hadoop and shared with other users in the enterprise.

The diagram below depicts the distinctions and usage rules for each classification of data in Hadoop.


The Four Stages & Classifications of Data in the Hadoop Lake

It's important to note that the data files in each stage of this taxonomy are structured in directory file structures that are most appropriate for their stage of refinement. As I'll explain in more detail later, in the first two stages, the data directory structure still mimics the directory structure of the source system that it was extracted from. But in the final two stages (Validated and Refined), the directory structure is set up according to the user domain or the use-case. Per best practices, only data from these final two stages may be published, analyzed or shared outside of Hadoop.
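A hypothetical sketch of this layout and its usage rules follows; the paths are examples, not a prescribed standard:

```python
# Stage -> example location, organizing principle, and whether analysis is allowed.
STAGES = {
    "raw":       {"path": "/lake/raw/erp/claims",        "organized_by": "source",   "analysis": False},
    "keyed":     {"path": "/lake/keyed/erp/claims",      "organized_by": "source",   "analysis": False},
    "validated": {"path": "/lake/validated/heor/claims", "organized_by": "use-case", "analysis": True},
    "refined":   {"path": "/lake/refined/heor/claims",   "organized_by": "use-case", "analysis": True},
}

def may_publish(stage):
    """Only Validated and Refined data may be analyzed or shared outside Hadoop."""
    return STAGES[stage]["analysis"]
```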


Security, Privacy & Compliance

The purpose of Security and Privacy controls is to determine who sees what. This implies certain policies to secure as many data types as possible, and to define data classifications. Data classifications have multiple dimensions. For privacy and security controls, consider classifying data into Critical, High Priority, Medium Priority and Low Priority, or a similar model that fits your organization. Critical and High Priority data are those data items that are materially significant to your business and business decision making.

Another classification is to determine which data sets are sensitive vs. non-sensitive. Sensitive data is a classification that includes Personally Identifiable Information (PII), Patient Health Information (PHI), and credit card information (subject to PCI security standards).

As part of this process, you must determine which HITRUST controls are required for the Hadoop Lake. You must define data encryption, tokenization & data masking rules. For example, large files that are coarse-grained might get encrypted upon landing in the Hadoop Lake. But more detailed data columns that contain PII, such as a Social Security number or credit card number, would get tokenized. Finally, your model should determine how to mask certain data. For example, if your call center agents need to know the last 4 digits of a credit card number, you can mask the first 12 digits of the credit card so only the last 4 digits are visible.
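The sketch below illustrates masking and tokenization in Python; the key handling and helper names are illustrative only, not a production cryptographic design:

```python
import hashlib, hmac, os

# Assumes the tokenization key would come from a real secrets store in practice.
SECRET_KEY = os.environ.get("TOKEN_KEY", "example-key").encode()

def mask_card(card_number):
    """Mask the first 12 digits so agents see only the last 4."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

def tokenize(value):
    """Replace a fine-grained PII column (e.g., SSN) with a keyed token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

print(mask_card("4111111111111111"))  # ************1111
```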

Whether you use an existing tool like Active Directory or build a new tool for user access controls, it's imperative to define user groups and user categories. Using user groups and user roles, you can control access to specific data in the lake. For proper management of security and privacy controls, establish a centralized and integrated security policy toolset and process.

In addition, since Hadoop is a cluster of servers, you must define intra-application security rules on the cluster. To ensure that the security and privacy controls are being properly followed and applied, we need to consider performing periodic internal audits and reporting gaps to the Data Council.

Apply a Tiered Data Management Model

The data governance policies don't need to apply equally to all data. Low-priority data might include social media data that doesn't rise in importance to other critical data; hence its governance policies might be less stringent. But critical data, such as patient health information, will require the strictest governance policies. The diagram below illustrates the distinctions that you can consider when crafting and administering data governance policies by data classification.


More Strict Policies & Monitoring Apply to Critical & High Priority Data

Big Data Security Policy

Because Hadoop was not designed with security in mind, two new challenges are added with Hadoop in the cloud: 1) securing the cloud, whether it's Amazon, S3, Azure or any other cloud architecture, and 2) securing Hadoop & the open source apps (MapReduce, etc.).

Securing the cloud is made possible by several standards and solutions. For example, one can decide to apply HITRUST and Cloud Security Alliance STAR certifications.

In order to secure Hadoop (Cloudera, HortonWorks, MapReduce, etc.) and the open source apps, the solution is to apply the four pillars of Hadoop environment security:

1. Perimeter: Secure Hadoop at its borders (such as building intrusion detection tools)

2. Access: Limit access to data by role and group membership

3. Visibility: Maintain data visibility through metadata management

4. Protection: Apply encryption, tokenization and data masking

Big Data Security Policy

To illustrate the four pillars of Hadoop security in more detail, consider the diagram below. I've included the purpose and strategies for implementing each pillar, along with tools that make those strategies possible.


The Four Pillars of Hadoop Data Lake Security

Big Data Security – Access Controls

The access control mechanism for the Hadoop Lake is best served when it's integrated and compatible with the existing security procedures and policies. A user's access to Hadoop data need not be different than access to conventional data sources.

We can view a user's access as a chain of policies and procedures. To enable proper access controls, we define the user's membership in certain groups. Next, the role of the user is defined. Together, group membership and role give granular control of access (Read, Write, Delete, and Execute) privileges to a particular data element.

Consider a typical user, John Doe, a member of the HEOR group who is a Data Scientist. Based on this information, a Hadoop security tool such as Sentry can be configured to give Read-only access to sensitive data. John Doe's access is coordinated with his access privileges in Microsoft Active Directory so it's consistent across the enterprise and centrally managed.

Access Control Chain in the Hadoop Data Lake
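As a toy illustration of this chain, the sketch below resolves privileges from group membership plus role; it is not Sentry's actual API, and the names and mapping are hypothetical:

```python
# (group, role) -> data class -> privileges
GROUP_ROLES = {("heor", "data_scientist"): {"sensitive": {"read"}}}

def privileges(group, role, data_class):
    """Resolve a user's privileges from group membership and role."""
    return GROUP_ROLES.get((group, role), {}).get(data_class, set())

# John Doe: HEOR group + Data Scientist role -> read-only on sensitive data
assert privileges("heor", "data_scientist", "sensitive") == {"read"}
assert "write" not in privileges("heor", "data_scientist", "sensitive")
```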

Big Data Security: Key Policies

Let's consider some of the high-level and key security policies for big data. I'll introduce a list of sample policies that you can adopt and adapt to your organizational requirements.


1. Security controls on Hadoop must be managed centrally and automated.

2. Integrate Hadoop security with existing IT tools such as Kerberos, Microsoft Active Directory, LDAP, logging and governance infrastructure.

3. All intra-Hadoop services must mutually authenticate each other.

4. Apply data access controls to Hadoop by using a tool like Apache Sentry to enforce access controls to data before and after any data extract, transform and load (ETL).

5. Enforce process isolation by associating each Task's code with the Job owner's UID. This ensures that the user's access privileges are consistently applied to their jobs in the cluster.

6. Maintain a full audit history for Hadoop HDFS and for the tools and applications used, including user access logs.

7. Never allow insecure data transmission via tools such as HTTP, FTP and HSFTP. Enforce IP address authentication on Hadoop bulk data transfers.

8. Apply Hadoop HDFS permissions for coarse data controls: directories, files and ACLs, such that each directory or file can be assigned multiple users or groups, with file permissions of Read, Write and Execute.

9. Apply encryption to coarse-grained data (files) and tokenization to fine-grained data (table columns).

Data Usage Agreement Policy

One of the policies that data governance must establish is a Data Usage Agreement. This is an agreement between the Data Governance Council and users that allows users access to data and, in exchange, lists a number of policies that users agree to abide by. Below is a list of policies that can be considered in a Data Usage Agreement:

1. All Critical & High-priority data must be of "Trusted Source" quality. In other words, users will refrain from conducting analysis or reporting on data that is not trusted. Similarly, users agree to make their data in the Hadoop Lake "trusted" data, meaning they will complete the metadata for their data and validate their data after each transformation for quality and accuracy.

2. Maintain data asset change control, requiring advance notice of any data schema changes. This implies that users will notify Data Stewards if they modify any schema or add data columns to their data tables.

3. Data Stewards agree to monitor data quality and metadata compliance. Data Stewards inspect metadata for user data periodically to ensure data with "trusted" status meets metadata specifications for completeness.

4. Users agree to register their data in, and update, the Meta Data Hub (MDH) when loading data into Hadoop. Data Stewards may notify users and issue warnings if their data does not have adequate metadata information.

5. Third-party and 4th-party data is to be managed and used in accordance with the data agreements with the 3rd-party data provider. If your organization purchases data from external 3rd-party or 4th-party sources, the contract usually comes with certain clauses and usage constraints. For example, one data provider of consumer data might allow analysis of data for economic purposes, but not for marketing purposes. This policy ensures that users will abide by these data usage clauses.

6. Data retention policies must be applied according to the DUA policies. Data retention policies vary from country to country and even from state to state in the U.S. Some states might have retention policies of 10 years and some even longer. Users must be aware of and adhere to these retention policies.

7. Users agree to complete the Enterprise Big Data Governance annual training. Users and training are often considered the first line of defense. Data Stewards or Hadoop Administrators should only provide access to users after they have completed their training on data governance policies.

Security Operations Policies

Operationalizing security policies and coordinating the people, policies, tools and configurations to deliver cohesive security on Hadoop is challenging. A slight change to one aspect of the policy, tool or configuration can introduce huge variations or deviations from the intended course. Hence, it's important to devise operational policies for operating the security measures for big data. I've listed some sample policies that can be adopted for maintaining best-in-class security operations:

1. Data Stewards are tasked and held responsible for notifying the Hadoop security Admin regarding onboarding or off-boarding users for their area of business.

2. Enable fine-grained access controls to data elements. Create user groups and assign user IDs to the appropriate groups.

3. When requesting access to Hadoop data, users must submit an Access Request Form showing they've completed the pre-requisites before being granted access. Access Request Forms must be approved by the line of business (division, franchise, etc.) Data Risk Officer.

4. Apply random data quality checks for validity, consistency and completeness. Report data quality issues to the Data Council.

5. All IP addresses must be anonymized to protect user data.

6. All sensitive data, such as Personally Identifiable Information (PII), must be de-identified in Raw form or be encrypted. Tokenization is preferred when data columns (fields) are known.

7. Develop and monitor performance metrics for services and role instances on the Hadoop clusters. Monitor for sudden deviations. For example, if a user has historically used a small fraction of data and resources in Hadoop but you discover a sudden increase in data and CPU resource consumption, it should alert you of a possible rogue or insidious attempt to breach your data.
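A minimal sketch of the deviation check in policy 7 follows, assuming hypothetical per-user daily usage metrics exported from cluster monitoring:

```python
def flag_deviation(history, today, factor=3.0):
    """Flag usage that jumps beyond `factor` times the historical average."""
    baseline = sum(history) / len(history)
    return today > factor * baseline

cpu_hours = [1.2, 0.9, 1.1, 1.0, 1.3]  # a user's recent daily CPU-hour usage
if flag_deviation(cpu_hours, today=9.5):
    print("ALERT: sudden resource-consumption spike; investigate possible breach")
```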

Information Lifecycle Management

As I mentioned earlier, managing data through its life cycle spans a process of CRUD: Create, Read, Update and Delete. There are certain policies that apply to data during its life cycle that govern allowed usage, handling processes and retention rules. Here are some high-level rules that I recommend as the basis of an Information Lifecycle Management (ILM) governance:

The rules for minimum data retention may vary from country to country. The longest retention period must be updated in the Meta Data Hub (MDH) tool for all data sets.

If your industry requires that you reproduce the data and analytics results at another time, you must take snapshots of data and analytics models and safeguard them. One approach is to maintain a separate sandbox for data snapshots that are required for compliance, legal and medical reasons.

The data snapshot sandbox is configured with Admin-only access privileges. In other words, only the Hadoop Administrators may have access to this sandbox, in order to minimize access to its data.

Define Hadoop disaster recovery, downtime and business continuity policies. Protecting data may require duplication of the data lake across different geographies. For example, storing data in two separate Amazon locations, say Amazon East as well as Amazon Asia, offers redundancy for better business continuity and faster recovery.

Data recovery procedures should define how fail-over and fail-back procedures would work, how replication would be restored, and the process for restoring data, recovering data and synchronizing data after the failed site is back online.


Quality Management

Big Data Quality & Monitoring

Data quality monitoring involves inspection and testing of data as it goes through transformation and movement throughout the enterprise. The goal of data quality monitoring is to ensure accuracy & completeness of data.

According to best practice governance guidelines, all data must be monitored. Consider the data extraction, transfer, load (ETL) jobs that run as part of a data processing pipeline. Each step of the data pipeline is carried out by a data process which produces an error log. One data monitoring task is to inspect the error logs to identify if any data processing jobs failed. Other data monitoring techniques include testing, such as conducting basic counts and comparing the number of records or volume of data from source to target (do we have the same number of records before and after the transformation?). Another technique is to conduct boundary value tests on data fields. If a data field is expected to have a range, you can test the new data set against that range. Finally, other techniques look for the suspicious presence of too many NULL values, and other data anomalies, to detect quality issues.
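Two of these checks (boundary values and NULL density) are sketched below; the records, field and thresholds are illustrative assumptions:

```python
def out_of_range(records, field, lo, hi):
    """Boundary value test: return records whose field falls outside its range."""
    return [r for r in records if r[field] is not None and not (lo <= r[field] <= hi)]

def null_ratio(records, field):
    """Fraction of NULL values in a field; a spike suggests a quality issue."""
    return sum(1 for r in records if r[field] is None) / len(records)

rows = [{"age": 34}, {"age": None}, {"age": 212}, {"age": 41}]
print(out_of_range(rows, "age", 0, 120))  # [{'age': 212}]
print(null_ratio(rows, "age"))            # 0.25
```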

The Hadoop ecosystem offers many data pipeline processing tools which report interim error logs for review. These tools and some common practices include:

1. Use Oozie to orchestrate data pipelines and track processing errors at each stage of data processing

2. Use HBase to record and maintain errors related to metadata errors, data processing job errors and event errors

3. Configure the tools to provide automatic quality error notifications and alerts if errors are reported in the logs

Data Classification Rules

The key to granular data access management lies with data classification. You can define the data classes for your organization, as each organization is different. But typically, good data classification should consider multiple dimensions and uses of the data, not just format and frequency.

We've seen data already classified into sensitive (PII, PCI, PHI, etc.) and non-sensitive. We also looked at data classes in Hadoop as Raw, Keyed, Validated and Refined. In addition, there is the distinction between the "Pending" status of data (data that is not primed for usage and analysis) vs. "Trusted" data (data that has been validated, has complete metadata and lineage information, and is certified by quality checks). Furthermore, here we can define four additional data classes (a sketch of applying tiered policies to these classes appears after the list):

Critical Data – The type of data that has the highest regulatory, reporting and compliance requirements defined by FDA and/or GxP (Good Manufacturing Practices, Good x Practices; just place your key business process for x) and is materially significant to the decision making at the Enterprise.

High Priority Data – The type of data that is materially significant to the decision making at the Enterprise, regarded as classified information and confidential data. This data is required for analytics but is not subject to GxP or FDA regulations.

Medium Priority Data – Data that is typically generated or used as a result of day-to-day operations and is not classified as confidential or classified.

Low Priority Data – Data that is typically not material and does not have retention requirements.
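The sketch below illustrates the tiered application of policies to these classes; the obligations shown are examples, not mandated values:

```python
# Stricter monitoring, audits and retention as the classification rises.
TIER_POLICIES = {
    "critical":        {"quality_monitoring": "continuous", "audit": "quarterly", "retention": "regulatory"},
    "high_priority":   {"quality_monitoring": "continuous", "audit": "annual",    "retention": "defined"},
    "medium_priority": {"quality_monitoring": "sampled",    "audit": "annual",    "retention": "defined"},
    "low_priority":    {"quality_monitoring": "none",       "audit": "none",      "retention": "none"},
}

def policy_for(data_class):
    """Look up the governance obligations for a data set's classification."""
    return TIER_POLICIES[data_class]
```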

Data Quality Policies

Policies that define proper quality management of data require a partnership between users and Data Stewards. Data quality issues may be escalated to the Data Governance Council, but the organization ultimately must commit to tracking and resolving them. One of the functions that Data Stewards fulfill is to ensure that their organization's data meets data quality specifications. The following are sample policies as guidelines for quality-driven standards, strategies and activities to ensure data can be trusted:

A Trusted Source is defined as a critical or high-priority data set that has a complete and accurate metadata definition, including data lineage, and whose quality is known.

Sources that are not fully meeting the definition of Trusted Source will be deemed "Pending" status until the data set is brought into compliance. No reporting or analysis of Pending status data is allowed.

DataStewardscanassignthestatusofPendingorTrustedtodatasetsinthemetadata.EachAccountableExecutiveandDataRiskOfficer(DRO)mustbeaware and be able to identify which data elements (those that are in theirdomainofresponsibility)areTrustedSourcesorinPendingStatus.

Data quality issues are defined as gaps between the data elementcharacteristics and deviations from this data governance standard. Dataqualitygaps typically include: data that hasnot beenvalidated for accuracyandcompleteness; data that is incompletemetadata information in theMetaDataHub(MDH);anddatalineageisunknown.

Policiesshouldrequiretieredapplicationofdatagovernancerulestodatabythedataclassification.Forexample,applydataqualitymonitoringtoCriticaldataandHighPriorityelementswithinthedataassets. DataStewardsmustdocumentandreporttoDROofanyissuesrelatedtodataquality.

TheMetaDataHuboranotherdatabasecanmaintaininventoryandrecordofmodelsusedbydata scientists.Amodelmaybeusedbymultipleusersanddataproductsintheorganization.Managingmodelsfromacentralrepository

Page 102: BIG DATA GOVERNANCElab.ilkom.unila.ac.id/ebook/e-books Big Data/2016... · Modern Data Management Principles for Hadoop, NoSQL & Big Data Analytics Peter K. Ghavami, PhD First Edition

has many advantages. This policy ensures that all models used by theorganizationareknownfortrackingpurposes,documentedforreproducibilityat a later time and also to centrally manage andmodify them if errors arediscoveredthatrequirefix.

The Data Governance Council may define data quality metrics for the organization. These metrics might include the percentage of data in Pending vs. Trusted status, the number of open quality issues, and data volume by each division or user group. Data Stewards are responsible for providing metrics reports to the DRO, AE and Data Council on the level of compliance of critical data sets. Data Stewards must, on a regular basis, provide a data quality issues list with the expected date of remediation and a remediation plan to resolve the issues.

Data Stewards must maintain evidence of compliance with the data management policies and supporting standards, procedures and processes.

Data Stewards, along with Data Risk Officers, annually review the inventory of critical or high-priority usages and Data Usage Agreements within their domain of responsibility.

Data that is in "Pending" status can only be moved into "Trusted" status (or "Certified" status, depending on the terminology that you wish to adopt) by the Data Steward for their area of responsibility if the following conditions hold (a sketch of an automated promotion check follows the list):

• Metadata is complete in the Meta Data Hub (MDH) tool

• Lineage to origination or to the Trusted Source is documented

• Data quality monitoring and checks have passed

• Retention period is defined

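As a sketch of how a Data Steward's tooling might automate this gate (the record layout and field names below are assumptions, not a prescribed MDH schema):

    # A hypothetical metadata record as it might be retrieved from the MDH.
    dataset_metadata = {
        "name": "sales_transactions",
        "metadata_complete": True,
        "lineage_documented": True,
        "quality_checks_passed": True,
        "retention_period_days": 2555,
    }

    def can_promote_to_trusted(md):
        """Return True only if all four promotion conditions are satisfied."""
        return (
            md.get("metadata_complete", False)
            and md.get("lineage_documented", False)
            and md.get("quality_checks_passed", False)
            and md.get("retention_period_days") is not None
        )

    # Promote or hold the data set based on the checklist.
    dataset_metadata["status"] = (
        "Trusted" if can_promote_to_trusted(dataset_metadata) else "Pending"
    )
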

Metadata Best Practices

The data quality standards at your company under the big data governance policy should require business metadata to be captured and maintained for all important data (Critical and High Priority data).

This section provides guidance on what high-quality metadata looks like. Data Stewards are encouraged to use this section to improve their metadata. As your Hadoop data lake grows in size, the value of metadata will grow over time. Descriptive names are required for all important data elements (Critical and High Priority data). Generally, Data Stewards determine the descriptive names.

When completing the metadata, try to answer two key questions:

1. What do I know about this data (about tables/columns or key-value pairs) that someone else would need to know in order to understand it and properly use it?

2. If you had never seen this data before, what would you want to know to have confidence you were using it properly?

The purpose of metadata is to explain the data so that it can be understood by anyone who is business-aware but does not know the data. The goal is not to give a general brush stroke of information about the data but to give enough specifics so its meaning is clear and it can be interpreted and used. The intent is not to write for someone who already knows the data, but for someone who is new to the company.

Descriptive names and definitions offer more details and/or more explanation than is present in the physical data element name. While you may not be able to change the database column name, you can change the description of the data element with a descriptive name and be clear in the definition.

You should make proper assignment of descriptive names for Codes and Indicators (a small validation sketch follows these rules):

If a physical name contains "_ind", meaning an Indicator, then the possible values should be Yes and No, or True and False, not a list of 5 possible values.

If a physical name takes more than two values, say 50 values to represent state abbreviations, then the physical data element name must contain "_cd" to indicate a Code type of data element.

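As a sketch, a metadata review script could enforce these two naming conventions automatically; the column names below are illustrative assumptions:

    def check_naming_conventions(column_name, distinct_values):
        """Flag columns whose suffix disagrees with their value domain."""
        issues = []
        if column_name.endswith("_ind") and len(distinct_values) > 2:
            issues.append(f"{column_name}: indicators must be two-valued (Yes/No or True/False)")
        if len(distinct_values) > 2 and not column_name.endswith("_cd"):
            issues.append(f"{column_name}: multi-valued element should carry the '_cd' suffix")
        return issues

    # Hypothetical examples:
    print(check_naming_conventions("active_ind", {"Y", "N", "U"}))
    print(check_naming_conventions("state", {"WA", "OR", "CA"}))
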
Data Stewards must clarify any misleading names. For example, even though the physical data element name is Customer_Identifier, if the field is actually a point of contact, the descriptive name and definition should be changed to reflect this.

Another rule of thumb is to avoid acronyms. Don't assume that everyone will understand what the acronym means. Make the effort to spell out the full name and place the acronym after the name in parentheses.

If the comment in the metadata description is true at a point in time but may change in the future, be specific and indicate the date range for which the description is valid. Avoid using references like "currently", "now", "before", or "after".


Carefully select the terms that you intend to use. While conversationally people use loose terms to mean the same thing, in metadata descriptions they need to be clarified. Keep in mind that a data element definition must stand on its own. For example, "Customer" and "Account" are two different things. Add qualifiers to make the description more precise. As an example, clarify "Customer" by the possible type of customer, such as "Account Holder", "Prospect", "Co-Signer" or "Guarantor".

Avoid using pronouns in general. Pronouns are confusing, vague and can lead to misinterpretation. Best practice guidelines for column- and field-level metadata include the following policies:

Descriptive names should end in a word that classifies the data. For illustration, consider date, amount, number, and code, which are examples of data classifiers. The classifying word should immediately follow the name or phrase that it is describing. This implies that descriptive names should be at least two words long. For example, use Customer Balance and Primary Account, not Balance or Account.

Descriptive names should also be as short as possible while retaining their meaning and uniqueness. But clarity is always more important than brevity.

A good descriptive name must reflect the description and definition of the data element. Typically the descriptive name should not contain articles, prepositions, or conjunctions such as "an, to, of, after, before".

As needed, modifiers may be used to create independent attributes. For example, Application Status Code uses a modifier that allows the data element to stand independent of the table/file in which it is stored.

Try to use more generalized definitions and descriptions of data elements. For example, if a field is used for both Social Security Number and Employer Identification Number, don't call the field Social Security Number. Instead call it Tax Identification Number or National Tax Identifier, or something more general in description.

Data descriptions should also contain the business meaning and why the data element is important to the business. Often a data element is computed, and if so, it's important to show the math or calculation method in the definition.

Each definition must be independent of any other definition, such that the reader should not need to jump around to understand the data element. There are, however, legitimate situations where it's appropriate to refer readers to other columns/fields. Here are some possible scenarios for cross-referencing descriptions: if there are interrelationships between physical data elements, the relationship would be reflected in the data description; or a hierarchy of data classification exists which forms a tree. For example, when a class of data has sub-classes, the top-level column should show how the hierarchy works.

Metadata descriptions must include contextual information. The context should paint a picture of how the data is used and what it means situated in the context of its usage. The context is often clear to a subject matter expert but not to the average user.


If a physical data element is a code, then its description must contain a sample set of codes and their meanings. If a data element can have more than one meaning, that implies that another data element determines which definition applies to it. Therefore, the other data element must be mentioned in the description.

If a data element is allowed to contain an anomaly in the data, the meaning of the anomaly must be indicated in the description. For example, if a Social Security Number field contains a value of -1 as a flag to look up the SSN in a special file, then the description must define that rule.

Furthermore, the organization must maintain a valid value list, namely a code table defining the values for Codes, Indicators and Flags. If a data element is defined as a Code, Indicator or Flag but does not have a defined value in the code table, the description must provide a valid value list for that element. A brief sketch of such a code table and anomaly rule follows.

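A minimal sketch of a code table with an anomaly rule, assuming an illustrative status code element and the -1 SSN sentinel described above:

    # Hypothetical valid value list for a code element, kept in a code table.
    application_status_codes = {
        "AP": "Approved",
        "PD": "Pending",
        "RJ": "Rejected",
    }

    # Anomaly rule: -1 in the SSN field means "look up the SSN in a special file".
    SSN_LOOKUP_SENTINEL = -1

    def describe_status(code):
        return application_status_codes.get(code, "Unknown code: add it to the code table")

    def resolve_ssn(ssn_value):
        if ssn_value == SSN_LOOKUP_SENTINEL:
            return "See special SSN lookup file"  # rule documented in the description
        return ssn_value
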
If a data element is regarded as Critical or High Priority, it should include an Indicator to identify it as such. Similarly, if the data element is regarded as sensitive (meaning it contains personally identifiable data), it should also include an indicator to identify it as such. The Indicator might take values such as "Yes", "No" or "Unspecified".

Finally, data tables and files need to be described in the metadata registry system. The descriptions must identify which applications use the table/file. Table definitions should describe the contents of the data in the table. They must also include information about uniqueness keys. A uniqueness key is any column or combination of columns which can uniquely identify a row in a table. A table may contain multiple uniqueness keys, so they all need to be defined in the table description.

The primary key (PK) is a uniqueness key that is chosen by designers to enforce uniqueness. If a uniqueness key involves multiple columns, each column must refer to the combination which produces the uniqueness.

The contextual information about the table/file is also important. The context should indicate retention and the metadata contact for that table/file. Furthermore, the table/file descriptor must include an Indicator to determine whether the table/file is a reference file. When the Indicator is "Y", it indicates that the table/file contains data that explains the meaning of other data. Typically, this is used as a code table. If the Indicator is "N" or "Unspecified", then the table/file is not a reference table or file.

Finally, regular and periodic review of metadata definitions and correction of errors in metadata are important activities for Data Stewards to improve the overall quality and usage of data in Hadoop.

Big Data Governance Rules: Best Practices

As a sample of typical big data governance policies and best practices, I've compiled a list in the form of rules. These are, in my opinion, the twelve golden rules to keep in mind when drafting a data governance program.

The goal of these rules is to provide an effective, lean and tangible set of policies that are actionable and can be put into execution. The Data Governance Council may review these rules at least on an annual basis to make changes and modifications. Here is the sample of rules:

Rule #1: Governed data must have:

• A known schema with metadata

• A known and certified lineage

• A monitored, quality-tested, managed process for ingestion & transformation

Rule #2: Governed usage policies are necessary to safeguard data

Rule #3: Schema and metadata must be applied before analysis

• Even in the case of unstructured data, schema must be extracted before analysis

Rule #4: Apply a Data Quality Certification program

• Apply ongoing data quality monitoring that includes random quality checks and tests

• Data certification process by Data Stewards & Data Scientists

Rule #5: Do not dump data into the lake without a repeatable process. Establish the process for landing data into the lake (in order to fill the lake). Define guidelines for data transformation and data movement activities within Hadoop.

Rule #6: Establish the data pipeline categories and data classifications that we reviewed already. For example, the data stages: Raw, Keyed, Validated and Refined.

Rule #7: Register data in the metadata registry upon importing it into the lake. In some organizations, it is common practice that Hadoop Administrators may issue a warning and purge any data that is not registered in the metadata registry system within 90 days of loading into Hadoop (a sketch of such an age check follows Rule #12).

Rule #8: The higher the data classification and stage, the more complete the metadata must be. For example, a data set in Refined status is expected to contain all elements of data management (definition, lineage, owner, users, retention, etc.).

Rule #9: Refined data must adhere to a format or structure. For example, Refined data must be in Hive data format and/or HBase, or in a specific schema.

Rule #10: Classify data by multiple attributes and record the classification in the metadata registry: data owner, type of data, granularity, structure, country jurisdictions, schema and retention are some examples. The Data Governance Council must define privacy, security and usage policies for each classification.

Rule #11: One of the cardinal rules of securing data in Hadoop is to encrypt and tokenize. The common rule is to encrypt coarse-grained data (data files, or unstructured data), and tokenize fine-grained data (data sets that include data fields in data tables).

Rule #12: An alternative policy and extension to Rule #11 for protecting sensitive data is to mandate that all sensitive data be encrypted, de-identified, anonymized and tokenized before being loaded into Hadoop.

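Returning to Rule #7, here is a minimal sketch of the 90-day registration age check. The paths, the registry extract and the threshold are illustrative assumptions; in practice the lookup would query your metadata hub rather than a hard-coded set:

    import subprocess
    import datetime

    REGISTERED = {"/data/raw/sales", "/data/raw/inventory"}  # assumed registry extract
    MAX_AGE_DAYS = 90

    # List top-level datasets in the raw zone; 'hdfs dfs -ls' prints one line per entry.
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", "/data/raw"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    today = datetime.date.today()
    for line in listing:
        parts = line.split()
        if len(parts) < 8:
            continue  # skip the "Found N items" header line
        load_date, path = datetime.date.fromisoformat(parts[5]), parts[7]
        age = (today - load_date).days
        if path not in REGISTERED and age > MAX_AGE_DAYS:
            print(f"WARN: {path} unregistered for {age} days; candidate for purge")
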
Data protection strategies vary by type and classification of data. Some data may need to be anonymized or de-identified. For example, a company policy might require that the names of individual customers or their locations, company names, product names and the like should never be openly visible. Other types of data, such as credit-card information, should be tokenized, while other information may be encrypted.

The implication of these rules in practice is a policy that never allows any sensitive data into the lake without applying the data protection transformations first. Implementing a rule to never load sensitive data as Raw into the data lake without first applying encryption and tokenization is a step towards effective data protection.

Sample Data Governance & Management Tools

Fortunately, new data governance tools are emerging almost every month for managing big data in the Hadoop environment. There are three main challenges in selecting and implementing tools these days: 1) identifying and selecting a minimal set of tools which together provide complete coverage of your organization's data governance strategy and requirements; 2) integrating these tools into a seamless environment; and 3) being prepared to revamp the tool set in 2-3 years, as the pace of technology and innovation in this area is quite dynamic, with new tools immediately supplanting the last.

Below is a sample of data governance software tools (the tools that appear with a "*" are open source):

Collibra – Active data governance and data management

Sentry* – Central access management tool

Knox* – Central authentication APIs for Hadoop

Apache Atlas* – Metadata registry and management

Cloudera Navigator – Central security auditing

BlueTalon – Central access & security management tool

Centrify – Identity management


Zaloni – Data governance & data management tool

Dataguise – Auto-discovery, masking, encryption

Protegrity – Data encryption, tokenization, access control

Voltage – Data encryption, tokenization, access control

Ranger* – Similar to Sentry but provided by Hortonworks

Falcon* – Data management & pipeline orchestration

Oozie* – Data pipeline orchestration and management

Data Governance in the GRC Context

One perspective on big data governance is to view it in the context of Governance-Risk-Compliance (GRC). In this framework, big data governance is viewed as an element of IT governance and must be based on and in alignment with IT governance policies. In turn, IT governance is a sub-segment of, and must be in alignment with, the broader corporate governance policies.

Whether your organization has substantial experience with corporate and IT governance or not, this book can still provide you with adequate governance strategies and tips so that you can stand up your own big data governance framework for your organization.

The GRC Perspective of Data Governance

The Costs of Poor Data Governance

The costs of poor or no big data governance can grow substantially, even exponentially, with the volume of data in your data lake. Aside from the direct and tangible costs that the organization will incur for lack of data governance, there are intangible costs as well.

Poor data governance increases the likelihood of:

Negative impact on brand quality

Negative impact on markets

Negative impact on credit rating

Negative impact on consumer trust

In summary, poor data governance raises the reputation risk for your organization and can severely tarnish your brand identity.

Big Data Governance Budget Planning

Big data governance relies on staffing, tools and processes, all of which require an adequate and consistent budget. I raise this issue since many organizations have overlooked the need for data governance investment while planning their big data initiatives.

Some of the line items that a big data governance budget must include are:

Resource requirements:

• Big Data Governance Council managers

• Security/compliance managers

• IT and technical staff

• Data Stewards

Budget for tools:

• Consultant support

• Software tools licensing

• Software tools integration

• Software tool maintenance budget

Training budget:

• Annual training budget for users, Data Stewards and Data Governance core team staff

What Are Other Companies Doing?

Many large enterprises are developing robust data governance policies and frameworks. One example is the Hadoop security initiative by Hortonworks. This initiative includes Merck, Target, Aetna, SAS, and a few other companies. The goal of this initiative is to bring security policy & tools development together to deliver a robust data governance infrastructure for businesses in 3 phases. This infrastructure is based on Apache Ranger, the HCatalog Metastore & Apache Falcon.

Fortunately, additional initiatives are emerging that include companies like Teradata, Cloudera, and MapR, which are forming a viable ecosystem for big data governance.

Hortonworks and the Apache project are working on additional metadata management tools under the Atlas project. I suggest keeping a watch on the Atlas project to stay up to date about its latest releases and functionality. The link is: http://atlas.incubator.apache.org/


SECTION III: BIG DATA GOVERNANCE BEST PRACTICES


Data Governance Best Practices

Hadoop was initially developed without security or privacy considerations. Hence, there has been a huge gap in the data management tools, structure and operations of Hadoop in the past. Without proper data governance tools and measures, the Hadoop data lake can become a data swamp.

However, many best practices, tools & methods are emerging. Many companies mentioned in this book are offering solutions to address these challenges and gaps. Hadoop vendors such as Hortonworks, Cloudera and MapR have released or announced new tools. This is a very dynamic technology area.

This section contains some best practices for the practical management of data in Hadoop. You'll find many sample policies, configuration rules and recommendations for best governance of your Hadoop infrastructure. But this is only the tip of the iceberg, as technology continuously evolves, and security in and of itself is a race, an unending race with hackers.


Data Protection

Data protection is architected, as shown in the diagram below, to be handled via two access paths: Hadoop HDFS or SQL access. Both paths must be protected, but each path requires a different design and approach. The HDFS access path includes the physical layer, folders and files. The SQL access path is concerned with the logical layer and database constructs and tools such as Hive or Impala.

This design promotes data classification into several statuses and types. In the Hadoop sandbox, I've shown four data statuses: Raw, Keyed, Validated, and Refined. Both data paths come to the Hadoop sandbox. When data is on-boarded (or imported) into the sandbox, it's initially Raw data. Then it is transformed into Keyed, such as into a Hive table and format. Then the data is validated for quality. At this stage the data is validated and Trusted. (I'll explain the Trusted data classification later.) Finally, data might get further transformation and filtering to reach the Refined status, which indicates that the data is ready for analysis.

At the highest level, data protection involves 3 categories of considerations: sensitive data protection, data sharing considerations, and adherence to governance policies.

Sensitive Data Protection

Sensitive data protection is concerned with policies that ensure your operations meet regulatory and compliance requirements for sensitive data. This is predicated on another type of data classification activity: classifying your data into Critical, High Priority, Medium Priority and Low Priority. All Critical and High Priority data are regarded as sensitive. This type of data includes Personally Identifiable Information (PII), Personal Health Information (PHI) and credit card (Payment Card Industry, PCI) protected information. Some organizations distinguish such sensitive information as Private Information (PI) from Non-Private Information (NPI), which does not contain personally identifiable or sensitive information. In the following diagrams, I'll refer to the various types of sensitive data simply as PHI to represent all sensitive types of data, including PII and PCI protected information.

Protecting sensitive data requires implementing centralized authorization and access management, data usage policies, restrictions on data usage, and protection of PHI data.

Data Sharing Considerations

Data sharing is a critical governance consideration, in particular in large organizations, multi-division and global companies whose data generation and consumption span not only functional lines but countries, divisions and affiliates such as corporate partners and franchises.

Much of big data comes from external sources and by contracting with 3rd party (and sometimes 4th party) data providers. These purchases come with usage rules and do's-and-don'ts clauses. Data policies related to the proper safeguarding, access and usage of 3rd party data are needed to ensure that we're compliant with the initial 3rd party data purchasing agreements and contracts.

Data Usage Agreements (DUAs) are policies that define the proper use of and access to data. On the other hand, the usability of data and making data easily consumable by end-users is also important.

Both access paths require that users adhere to existing policies, that users be provisioned by Data Stewards, and that training for handling sensitive data be planned prior to granting access to such data.

Adherence to Governance Policies

The flip side of enforcing governance policies is adherence. Adherence increases with education and training of all users in the organization. Access to the lake must be approved by Data Stewards, who are accountable for enforcing data governance policies for their respective group, division or affiliates.

All authentication tools must support logs for audits. Data Stewards are responsible for ensuring that new users receive proper access and for disabling access for employees who have left the organization.

Certain industries dictate additional standards and compliance requirements regarding data. One of these standards is Good Manufacturing Practices (GMP) and, for that matter, all other practices typically known as GxP (x stands for any activity or process) for the organization.


Privacy & Security Architectural Considerations for the Hadoop Environment

Finally, it is crucial that data stewards and data scientists be empowered to quickly provision storage and data space in their sandbox in the lake.

Data Protection at the High Level

The Hadoop file system, including the data lake, represents data sets at the file level (coarse data) and coarse-grained access to data. The best practices recommend that we unify access through Active Directory integration. There are tools such as Centrify for HDFS, plus the operating system security measures (for satellite/edge nodes in the Hadoop cluster). The goal of this approach is to enable users to use their Windows credentials for access to the lake. This approach allows for automated provisioning of users into established groups.

You can provision access through an existing enterprise authorization system such as Microsoft Active Directory. It's recommended that you establish user groups based on data sensitivity (PHI, Partnerships, Clinical Trials, non-PHI, Data Scientist, etc.).

The best practices recommend that you establish a directory structure that supports access by the following classifications of users and usage, with role-based access:

Source & ownership of data by directory, grouped by division, country, affiliate or similar organizational boundary

A flexible directory structure that can be defined to further restrict access to confidential data, projects, teams, etc.

To protect sensitive data at the coarse-grained level, one can and should apply tokenization to sensitive (PHI, PII, PCI) data prior to data ingestion into the lake. Data tokenization is an important strategy to protect data. When you tokenize data, it's advisable to store the token keys on token servers on your premises in a different subnet, in particular if the data resides in the cloud. In the event that hackers gain access to the data in the cloud, that information is useless to them without the token keys.

One of the best practices is to double tokenize the data. Tokenizing sensitive data such as credit card numbers, card expiration dates and security codes, social security numbers and other sensitive information is an increasingly popular technique to protect data from hackers. A financial company might choose to tokenize the first 12 digits of a 16-digit credit card number, leaving the last 4 digits available for call center functions such as verification and authentication of customers.

Tokenization replaces the digits with a randomly generated alphanumeric text ID (known as a "token") which cannot be identified without de-tokenization using the initial key that was generated on the token server. Double tokenization essentially tokenizes the token once more using a separate token server. This is an additional safeguard against hackers. In the event that a hacker gains access to one token server, even if the hacker could de-tokenize the data once, de-tokenizing a second time is impossible without access to the other token server. A minimal sketch of this scheme follows.

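A minimal sketch of single and double tokenization, assuming in-memory dictionaries stand in for the two token servers (a real deployment would use a vaulted tokenization product such as those named earlier in this chapter):

    import secrets

    # Each "token server" is mocked as a vault mapping token -> original value.
    vault_a, vault_b = {}, {}

    def tokenize(value, vault):
        token = secrets.token_hex(8)  # randomly generated alphanumeric token
        vault[token] = value
        return token

    def detokenize(token, vault):
        return vault[token]

    card_number = "4111111111111111"
    # Tokenize only the first 12 digits; keep the last 4 for call-center use.
    token_once = tokenize(card_number[:12], vault_a) + card_number[-4:]
    # Double tokenization: tokenize the token again on a separate server.
    token_twice = tokenize(token_once, vault_b)

    # Recovery requires access to BOTH vaults, in reverse order.
    recovered = detokenize(token_twice, vault_b)
    original = detokenize(recovered[:-4], vault_a) + recovered[-4:]
    assert original == card_number
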
In high-level governance, data stewards are vigilant and actively monitor data files regularly for PHI, sensitive data and data labeled with "shared everyone" access.


Managing Security & Privacy on the Physical & Logical Layers

Low-Level Data Access Protection

Representing fine-grained access, this access control involves SQL access, the logical layer of data management. To ensure proper access and security, you must funnel Hive SQL access through a specific HiveServer that controls access. Sentry is a tool often used to secure data accessed via SQL, and you can provision users through Sentry on SQL. The advantages of this configuration include:

It enables the creation of views and logical tables

It provides a secondary access layer for logical objects

It enables safe sharing of non-sensitive data within a sensitive data file


Security Architecture for the Data Lake

The following architectural diagrams depict a modular design where the functional data security and protection concerns are separated by design into layers and modules to ensure all aspects of data security are considered. The key aspects of this security design are:

Authentication, enabled by standards such as Kerberos or similar solutions

Authorization, enabled by Microsoft Active Directory (AD) and Centrify or a comparable product

Perimeter security, which can be provided by Secure Socket Layer (SSL) and the Knox product or a similar solution

Monitoring, which can be enabled by products such as BlueTalon

Within the architecture, the building blocks for governance, data quality management, data protection, and both fine-grained and coarse-grained access control are shown. The next diagram matches the same functions with specific tools or standards that accomplish each function:

Quality management requires data auditing and system auditing. These are possible through products such as Drools/Navigator and McAfee or other solutions.

Governance management can be enabled by solutions such as Zaloni or similar solutions.

Data protection includes data encryption and tokenization of the data in the lake. Some of the solutions for tokenization & encryption include Voltage and Protegrity. To apply data protection at the folder, file and attribute level, solutions such as Gazzang, Protegrity and Voltage represent some of the options.

Metadata management can be implemented using HCatalog and the Hive Metastore. While these are somewhat nascent tools, they're becoming more functional and powerful solutions.

In order to manage data lineage, you may implement solutions such as Collibra.

Coarse-grained and fine-grained data protection can be enabled by a solution such as Voltage or Protegrity.

HDFS access control can be implemented using Access Control Lists (ACLs).

Access to Hive, Impala, HBase and other fine-grained database controls can be enabled by a product like Sentry.


A Model Hadoop Security & Privacy Management Stack


A Hadoop Security & Privacy Model Enabled by Tools

Data Lake Classification

Lake data classification promotes better access control, data quality management and usability features of your big data repository. Here are the four data classifications again, explained in more detail. Within the source-formatted data, there are 3 stages of data formats:

Raw Data

Raw data has had no transformation performed on it. It's the data that is on-boarded from the source outside of the lake. All sensitive data is to be encrypted or tokenized before it lands in the lake. Access to this data is restricted to privileged users such as data engineers only.

There are no quality checks performed on raw data. This data represents the source file format as-is; it retains the source directory structure. Raw data must be registered with metadata, even with minimal information about the source, owner, format and type of data. Any data that is registered in the metadata registry (also referred to as the Meta Data Hub, MDH) can stay in Hadoop.

Keyed Data

Keyed data represents data that has been imported into some data structure such as HBase or Hive or a similar data format. Access to Keyed data is also restricted to privileged users such as data engineers.

Keyed data represents standardized file format and compression. It maintains data keys, and schema is applied. At this stage, data quality checks are applied. Reports from quality checks are generated, and any defects are escalated to data stewards and even up to the Data Governance Council.

Keyed data is not cleansed and no filtering is applied to it. Keyed data can stay in Hadoop but still maintains its source directory structure. One important fact to note is that Keyed data is derived from Raw and is a physically different file from the raw file.

Validated Data

Validated data represents the class of data that has been through data quality checks and has been fully registered in the metadata registry; basic cleansing & data fixes have been applied; data protection rules are applied and data quality rules are applied.

In this stage, files may be duplicated across the organization for other divisions, partners and affiliates, subsidiaries, or shared with divisions in other countries, as long as data jurisdiction policies allow.

Certain data masking may apply to this data based on data usage policies, but Validated data may be shipped out of Hadoop. This data is derived from Keyed or Raw and is a physically different file from its prior stages. Once data reaches this level, the Keyed/Raw files are deleted. At this stage, broader user access to Validated data is allowed.

Refined Data

When data can be merged with other data, filtered and parsed, it reaches the Refined data stage. In this stage, operational processing is performed, analysis is applied and reporting is generated. Users have access to this data and typically create it from Validated data as they apply structure to the data. They may parse & merge files and records, add attributes and ship the data out of Hadoop.

However, additional data quality checks may be applied or required at this stage. Refined data is derived from Keyed or Validated data and is physically a different file. The diagram that shows the four stages of data in Hadoop is shown again below for reference.


The Four Stages of Data in the Hadoop Data Lake

Hadoop Lake Data Classification and Governance Policies

Different data governance policies apply to different stages and categories of data. As the next diagram shows, various access and usage controls apply differently to the data classes. Let's look at the rules and usage policies for these data classifications in more detail:

Raw Data Usage Rules

Sensitive Raw data must be encrypted and/or tokenized before arrival into the lake. Only a few privileged users may access this data; thus the goal is to have very limited access to it. "Privacy Suppression" is needed to access and transform data in this stage, but "Privacy Suppression" is granted to a very small group of data engineers or data scientists. All data on-boarded into the Hadoop data lake must be registered in the metadata hub, including data lineage, schema, attribute metadata and the owner (business division or line of business) of the data. But detailed field-level metadata is not always necessary, so you may decide to make it optional.


Keyed Data Usage Rules

Keyed data will also have limited user access, available to a few privileged users such as data engineers or IT personnel who manage the data lake. In this stage keys are added to the data. The application and use cases of this data are for exploration only. Any sensitive data is split into a separate segment and folder in the lake. Data splitting is a key method of keeping sensitive data (PHI, PII, PCI information) in a separate container so that specific access policies can be applied to that folder.

Registering this data in the metadata hub is required, including attribute metadata and lineage information down to the file and field level.

Validated/Refined Data Usage Rules

Data functions in this stage vary from file cleansing, curation and preparation of data for analysis to transformations, merging with other data, filtering and extracting subsets of the data sets. This class of data is defined to have broader user access, but data stewards must be vigilant to continue splitting sensitive data into separate folders from the rest.

Access control for this stage of data is more open to users but is still controlled by the designated Data Stewards for each group in the organization. Keeping the metadata for this data at high quality and completeness is important and is expected from users. Lineage metadata at the coarse level (folders & files) and granular level (data tables and fields) is required.

Metadata Rules

Some attributes are required for metadata registration across all files, including those in Raw, which must be registered with data set and file information. Here is a list of some possible attributes and recommendations, as illustrated in the next two diagrams (a sketch of a registration record follows the list):

Register the following for a data set: data set name, source, owner, privacy considerations, description

Register the following for a file: file name, file location, entity type, creation date

Before moving any data from Raw to Keyed, it must have schema and field-level metadata

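As a sketch, a registration call into a metadata registry might carry a record like the one below; the field names and the register_dataset helper are illustrative assumptions rather than a specific MDH API:

    # Hypothetical registration record for a newly landed Raw data set.
    registration = {
        "dataset": {
            "name": "pos_transactions",
            "source": "Teradata",
            "owner": "Converse",             # business division that owns the data
            "privacy": "PHI-free",
            "description": "Daily point-of-sale transaction extract",
        },
        "file": {
            "name": "pos_2016_01_15.csv",
            "location": "/data/raw/teradata/pos",
            "entity_type": "flat file",
            "creation_date": "2016-01-15",
        },
    }

    def register_dataset(record):
        # Placeholder for the call into your metadata registry (MDH) of choice.
        print(f"Registered {record['dataset']['name']} at {record['file']['location']}")

    register_dataset(registration)
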

Metadata Rules Matched to Data Stages in the Lake


Data Management & Configuration Rules Matched to Data Stages in the Lake


Data Structure Design

Some of the best practices, which we've already discussed, include splitting data and storing sensitive data in a separate folder and directory path. The following diagram shows a sample data lake file storage structure.

It's important to note how the folder path at the root is designated and how paths branch into leaves that make the paths and folders more manageable from an access and privilege perspective. Note that you don't want to build the directory path by data classification. Your directory design for Raw data will vary from that for Refined data.

When data is in the Raw, Keyed or Validated stages, your directory structure should start first with the source of the data

When data is Refined and set up for broader use, your directory structure should be set up by the organization that owns (and uses) that data

Let’s assume you’re looking to design file structure for a large global retailcompany,sayNike.Nikeisthelargestofathleticfootwareandapparelcompaniesintheworldwith sales over $20B and over 30,000 employeesworldwide. The company hasaffiliatedbrandslikeShaq,Hurley,Converse,OneStar,ColeHaan,andotherbrands.

Let’s assume the company has some pre-existing stored in a SAS server, priordata on an Oracle database, Teradata data and several subsidiaries, divisions andfranchises. The company has 3 key product categories like Footwear, Apparel, andEquipment.FootwearproductsincludeRunning,Soccer,Tennis,Golfshoes,etc.

SAS,OracleandTeradataareregardedassourceofdata.However,organizations(ORGs)arethecompany’ssubsidiariesandaffiliateslikeConverse,ColeHaan,etc.Theproductsformthesub-foldersinthedirectorydesign.

In the diagram below, the data movement paths labeled A, are productionprocesses,whilethedatamovementslabeledB,areuserdatamovementprocesses.

Forexample,aRefinedfilefolderpathmightcapturethedivisionaswellastheproductcategoryasitspath.AdatafileinRefinedstagerelatedtoTennisshoesaffiliatedwithConversemighthaveadirectorypathas:/sbx/Converse/Tennis/file.

Thekeypointfromthisdesigntomaintainisthatdataclassificationisrecordedinmetadata registry not in the directories. This approach allows directory design to besimpler,more consistent across the enterprisewith bettermanagement functionality foraccesscontrol.

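As a sketch of the two conventions (the zone names and the Nike-style org/product values are illustrative assumptions):

    def raw_zone_path(source, dataset):
        # Raw/Keyed/Validated data: organized first by source of the data.
        return f"/data/raw/{source}/{dataset}"

    def refined_path(org, product, filename):
        # Refined data: organized by the owning organization, then product line.
        return f"/sbx/{org}/{product}/{filename}"

    print(raw_zone_path("teradata", "pos_transactions"))        # /data/raw/teradata/pos_transactions
    print(refined_path("Converse", "Tennis", "sales.parquet"))  # /sbx/Converse/Tennis/sales.parquet
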

Data Directory Structure in the Hadoop Data Lake


Sandbox Functionality Overview

The sandbox folder design follows the <ORG>/<Line of Business>/<Product> path, as illustrated in the diagram above. Users have read/write access to the sandbox designed and assigned to their organization. Users may read data from the Hadoop lake and bring it over to the sandbox; this is the common practice in data lake processes. If you are using Active Directory for authentication/authorization, then it's recommended that you implement Active Directory (AD) group nesting to simplify access requests for Hadoop.

Detailed Access Policies

Specific access policies can be defined for the sandbox and directory structure. For example, user access to a specific sandbox should be approved by the Data Steward or Data Risk Officer of each organization. (A Data Risk Officer is a VP or higher executive, appointed by a division, country, or similar level line of business to represent that organization on the Data Governance Council.)

The data governance best practice policies dictate that there should never be any reporting, analysis or publication directly from Raw data. There should be no marketing activities, analyses or reports generated directly from the Raw data.

To gain access to data in the sandbox, privacy rules training is required for users. Depending on whether the users will access data from the US, Europe, Canada or another country of jurisdiction, training designed for those countries must be required.

Data Stewards must, on a regular basis (quarterly, half-yearly or annually), review user training competency and renew access for users. To keep user access consistent, it's common practice and common sense that user access privileges to Hadoop data be identical to their pre-existing data access levels for other data stores in the organization.

Top-Level Sandbox Protection

The top-level sandbox directory is protected by a sandbox-specific group such as:

phdp_<resource>_sbx

For example, the /sbx/Converse directory has the phdp_Converse_sbx group assigned. The sandbox group ensures that only authorized users access the sandbox. To manage sandbox owner-only access, consider the following configuration techniques (a sketch follows below):

The HDFS umask value ensures that files are created with "owner only" privileges

Example: Use default 600 permissions on files so that only the file owner has access to the data by default

Example: In order to share data with others, the user must change permissions to 640 and assign the appropriate group

Users are expected to assign the AD group policy according to the sensitivity of the data, similar to the Hadoop lake policy.

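A sketch of the owner-only-then-share flow using the standard HDFS shell commands (the path and group name are the illustrative ones used above):

    import subprocess

    def hdfs(*args):
        # Thin wrapper over the 'hdfs dfs' shell commands.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    path = "/sbx/Converse/Tennis/sales.csv"

    # Files land with owner-only access (600), per the default umask policy.
    hdfs("-chmod", "600", path)

    # To share: open group read (640) and assign the sandbox AD group.
    hdfs("-chmod", "640", path)
    hdfs("-chgrp", "phdp_Converse_sbx", path)
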
A policy that you can apply to data in the lake is to require that users register data in their sandbox within 90 days after the data has arrived or been created.

Managing SELECT Access to Hive Tables

In order to limit SELECT access to Hive tables, users must have READ access on the underlying data in HDFS. Hive databases are to be designed for the respective lines of business (division, country, etc.) or by subject areas (products, factories, etc.). Active Directory (AD) group membership is required to get access to a database. For casual users, you can provide READ-only access to tables in the lake database.

As part of the Data Usage Agreement, you may allow users to create or copy Hive tables from the Hadoop lake into their sandbox. Users may run Hive queries using the beeline command line utility or the Hue GUI. However, at the time of writing this manuscript, the Hive command line has security vulnerabilities and should not be used. A sketch of a beeline invocation follows.

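As a sketch, a beeline session against HiveServer2 might look like the following; the host, port, database name and Kerberos principal are placeholders for your environment:

    import subprocess

    # Placeholder JDBC URL; substitute your HiveServer2 host and principal.
    jdbc_url = ("jdbc:hive2://hiveserver.example.com:10000/lakedb;"
                "principal=hive/_HOST@EXAMPLE.COM")
    query = "SELECT product, SUM(amount) FROM sales GROUP BY product"

    # beeline is the recommended CLI; the legacy 'hive' CLI should be avoided.
    subprocess.run(["beeline", "-u", jdbc_url, "-e", query], check=True)
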

Split Data Design

Many users ask: why split data in Hadoop? Part of the reason is that there is no concept of "views" in Hadoop to limit access by data type, as we've traditionally enjoyed in conventional RDBMS systems.

Current Hadoop data protection is at the file level (coarse-grained), which may be inadequate for many applications and for implementing an enterprise-wide, well-governed data lake to support Hadoop and a variety of access mechanisms.

It's encouraging to note that a 2015 Hadoop enhancement is expected to deliver a "views" capability to separate data files logically. Until this "views" capability is implemented, physical separation of sensitive data is necessary.

The best practices require files to be split physically in Hadoop to separate PHI/sensitive from non-PHI/non-sensitive data.

Another reason for physical splitting of sensitive data is that certain legal, regulatory, 3rd party contractual data and GxP (Good x Practices) requirements or compliance policies apply to sensitive data, which must therefore be stored separately.

One of the key data governance designs is to separate the data schema and access control for sensitive data. It's recommended that you create a sub-folder in each directory to contain the sensitive split data. Proper data governance policies dictate that strict access controls on this folder be applied by the data lake administrators.

In Section IV, I'll introduce a document that can be a starting point and baseline document for your big data governance program.


SECTION IV:

BIG DATA GOVERNANCE FRAMEWORK PROGRAM


Big Data Governance Framework Program Overview

Data governance refers to the rules, structures, processes and practices that allocate the roles, responsibilities, and rights of participants in big data analytics. This governance policy is intended to meet all relevant compliance with applicable laws and the objectives of operating the big data platform in a safe and sound manner.

This governance framework is a federated model with distributed roles and responsibilities that encompass the big data management framework across franchises, countries and wholly owned subsidiaries of the organization. The framework is designed to protect the organization's big data assets effectively at the lowest overhead cost. Its policies are intended to conform to the organization's corporate governance policies.

The framework consists of four pillars:

1. Organization
2. Metadata management standards
3. Security/privacy/compliance standards
4. Data quality management

Enterprise Data Management is responsible for establishing requirements, guidelines and procedures for the proper management and handling of data across the enterprise. This framework abides by Enterprise Data Management. Since in the Hadoop environment the data and applications are tightly coupled, this framework includes application management policies in addition to data management.

The primary governing body of this policy is the Data Council, which consists of executive management from cross-functional organizations (franchises, countries, subsidiaries, affiliates) who are responsible for ensuring your data is reliable, accurate, trustworthy, safe and protected. The Data Council makes recommendations to executive management with respect to data management matters.

Integrity of Information and Data Resources and Assets

Information and data resources are the organization's assets, used by the organization's personnel to execute critical processes that deliver on its strategies. Because of the risks and costs of using or providing the wrong information, poor-quality data or a breach, management must deliberately control the quality of and access to data and applications.

Well-designed data and application management controls must enable cost-efficient self-identification and self-correction of data and information risks, such as using inadequate data or selecting the wrong data for a defined purpose.

This data governance framework and its policies are intended as supporting standards and controls that consistently prevent these breakdowns from happening or provide alerts of breakdowns, plus identify and remedy control exceptions.

To plan and operate the big data analytics platform and to enable its compliance with the laws and regulations of the organization, executive management must identify and control data and information and their usage in a manner that is commensurate with the direction of the Enterprise Risk Management Policy. To manage the risks, data and applications must include controls that meet the requirements of this policy and its supporting standards.

The policies and standards in this framework list the requirements that need to be met when designing and operating controls for big data and applications.

An adequately controlled information systems environment includes the following:

1. A method for consistently identifying and defining the information that is important to operate the big data analytics platform.

2. Clear accountability and authority for managing information resources so that their quality is adequate for their intended uses at an appropriate cost to the business.

3. Required and auditable procedures for delivering and publishing information and analysis results.

4. A clear line of responsibility between management and what it takes to meet regulatory and legal expectations for data and applications.

5. Policies that ensure data integrity by prioritizing, defining and documenting requirements.

6. Clear authority to make decisions about data management, including a process for applying cost-effective controls so that those decisions are acted on, and the ability to demonstrate that the controls are effective.

7. An up-to-date and complete inventory of important data sources for the big data platform and business uses, plus data stewardship principles that result in adequate data integrity for the organization.

8. Ongoing monitoring of standards and procedures for managing big data platform and application assets.

Benefits of Compliance and Risks of Non-Compliance

Risks associated with non-compliance must be assessed, recorded and reported to the Data Council on a regular basis. Complying with the data governance policy has the following benefits:

A decreased probability of impact to customers, patients and business partners, and of operational loss events

An ability to acquire, share, and improve data and information with high efficiency

An ability to meet business demands quickly through re-use of accurate, complete and functioning data resources

An ability to reliably improve critical clinical and business processes and leverage superior information

The strategic advantage of better information sources and efficient compliance with regulations


Failure to comply poses the following risks:

An increased probability of an unplanned loss or customer impact due to poor quality or wrong use of data or information

Waste in critical operations where data is poorly defined, badly documented or inaccurate, and costly analytical results that lead to inconsistent and misleading information

Failed strategic and product objectives due to bad information or use of the wrong information

Reduced goodwill and shareholder value through misrepresented or inaccurate public information

Loss of public trust and a tarnished brand in the event of a data breach or loss of data

Unsatisfactory audit findings, regulatory judgments or legal actions that pose direct costs

Definitions

Data Usage – A model, report or analytics process.

Data Levels – Describe the importance and the regulatory significance of a data element. There are four levels: Critical, High Priority, Medium Priority and Low Priority.

Data Element – Consists of one or more data sets stored and used in the big data platform for analysis.

Data Usage Agreement – Defined as a contract between big data platform management and an Accountable Executive defining the data quality requirements that must be adhered to for the critical or high-priority data being consumed.

AE – Accountable Executive, a VP or higher level individual who is designated as the executive responsible for execution of and compliance with the governance policies for their franchise.

DRO – Data Risk Officer, the executive designated at the division or business unit level who is accountable for ensuring that critical or high-priority data is identified, monitored, and kept in compliance with the requirements of the governance policies. The DRO may delegate tasks related to governance standards to Data Stewards or Subject Matter Experts, but the DRO maintains overall accountability for the effective management of such data within their area of responsibility.

DS – Data Steward, designated by each AE for each franchise to ensure governance policies are applied, including policies on security, privacy, data quality monitoring, reporting, and certification of compliance. A franchise may have one or multiple Data Stewards. A Data Steward is a manager or above.

SME – Subject Matter Experts who bring their knowledge and expertise in security, privacy, data management, compliance, and technology. SMEs are designated by Data Stewards in each franchise to operate the data analytics in a manner that is compliant with this data governance framework.


Data Quality Issue – A data issue is defined as a gap in compliance with the Data Quality standard, a failure in a Data Usage Agreement, or a deviation from this data governance policy, privacy, security or audits.

Third-Party Data Provider – A company or individual providing data to the organization for its analytical use. A fourth-party data provider is a company or individual providing data to the third party.

Trusted Data – Data elements that meet the data governance and data quality standards defined in this governance framework.

Data Council – The primary executive forum for the organization's divisional (franchise/country/subsidiary) data risk officers, who are responsible for ensuring that the company's data is reliable, accurate, trustworthy, properly safeguarded and used, to communicate, prioritize, assess and make recommendations to executive management with respect to data governance and risk management issues.


I. Organization

BigDatagovernanceisfederatedandorganizedbytheDataCouncil. TheDataCouncil meets regularly, typically monthly to address changes in policy, manage dataissues,auditresultsandissuesrelatedtocompliance,security&privacy.

TheDataCounciladvises,regulates,andestablishesguidelinesandpoliciesinthemanagementofbigdatafortheorganizationanditssubsidiariesandaffiliates.TheDataCouncilservesasa forumfordiscussingandsharing information,escalatingkey issues,coordinatingandensuringaccountability.Itscharterisdedicatedtoprovidingconsistentwell-managedleadershipondataquality,governanceandriskfortheenterprise.TherearefourdatagovernanceforumsthataremanagedbytheDataCouncil:

Metadata Management
Quality Management
Data Governance
Security/Privacy & Compliance

Each forum will have a designated leader appointed by the Data Council, as listed below.

Metadata Forum
Leader: LeaderName
Scope: Develop & lead the creation of metadata capabilities, tools & related topics
Materials & Tools: MetaData Hub (MDH) tool & data registration procedures

Data Quality Forum
Leader: LeaderName
Scope: Develop & lead standards of data quality capabilities, tools & related topics
Materials & Tools: Data quality monitoring, testing and issue remediation procedures

Data Governance Forum
Leader: LeaderName
Scope: Establish data governance capabilities, tools & related topics
Materials & Tools: Data Governance policies and procedures

Security, Compliance & Privacy Forum
Leader: LeaderName
Scope: Develop & lead data security, compliance & privacy capabilities, tools & related topics
Materials & Tools: Security, privacy and compliance policies, processes & procedures


The Data Council members who comprise the decision-making body of the group consist of the IT leadership, Big Data Platform Management leaders and Accountable Executives from each line of business (franchise, country, subsidiary, affiliate) that participates in using the big data platform.

The federated roles in the Data Council include the Accountable Executives, Data Risk Officers and Data Stewards. These roles are not full-time positions, but rather responsibilities assigned to individuals with a vested interest in their use of the big data platform.

The Accountable Executives (AEs) are VPs or higher-level managers who are responsible for the overall data governance for their area of business.

The Data Risk Officers (DROs) are appointed by the Accountable Executives. They are Director-level or above managers who ensure data governance policies are followed for their area of responsibility.

The DROs appoint Data Stewards from their organization. Data Stewards own the responsibility to ensure the users in their areas follow the data governance standards and policies for data quality, data integrity, security, privacy and regulatory compliance.

The following diagram depicts the Data Council structure.


Data Governance Council and Stakeholder Organization


II. Metadata Management

An inventory of data usage will be maintained in the platform's metadata management hub (MDH: MetaData Hub). The data usage inventory will be reviewed annually for any updates and changes by each DRO for their area of responsibility.

Required metadata information about the data element includes:

Data Asset Name (name of the database, data set, spreadsheet or any other Trusted Source of data)
Data Asset/File Type (e.g. flat file, Oracle, SAS, DB2, CSV, XML, JSON, Hive, XLS, etc.)
Table/File Name (name of the physical table/data set, or spreadsheet tab where the data field/column is located)
Column/Field Name
Data Type/Length (e.g. number, char, varchar, date, and the length of the column/field, i.e. Numeric(8), Char(25), Varchar(20), Date)
Name of the analysis (model), report or process
Data Owner (the business entity, franchise or country that owns the data)
Description of data usage (stakeholders, consumers of reports, data scientists using the data)
Description of the AE, DRO and Data Steward for the data element
Criticality of the data (Critical, High-, Medium- or Low-priority)
Frequency of review/refresh (frequency of report creation or analysis)
Process document location (the physical location of the folder containing the data element processes)
Primary user (consumer) of the report (or analysis)
Date of report publication
Name of the report (or analysis) as registered in the MetaData Hub (MDH) tool
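To make the registration requirement concrete, below is a minimal sketch, in Python, of what a single data-element record might look like and how its completeness could be checked before submission. The field names and the validate_registration helper are illustrative assumptions, not the MetaData Hub tool's actual schema or API.

# Illustrative sketch only: models an MDH registration record as a plain
# Python dictionary. Field names are assumed for illustration.
REQUIRED_FIELDS = {
    "data_asset_name", "data_asset_type", "table_or_file_name",
    "column_or_field_name", "data_type_length", "usage_name",
    "data_owner", "usage_description", "ae_dro_steward",
    "criticality", "review_frequency", "process_doc_location",
    "primary_user", "publication_date", "mdh_registered_name",
}

def validate_registration(record: dict) -> list:
    """Return the required metadata fields missing from a record."""
    return sorted(REQUIRED_FIELDS - record.keys())

example_record = {
    "data_asset_name": "claims_dw",
    "data_asset_type": "Hive",
    "table_or_file_name": "claims_2016_q1",
    "column_or_field_name": "member_id",
    "data_type_length": "Varchar(20)",
    "usage_name": "readmission_risk_model",
    "data_owner": "US Franchise",
    "usage_description": "Data scientists scoring readmission risk",
    "ae_dro_steward": "AE: J. Doe; DRO: A. Roe; DS: K. Lee",
    "criticality": "Critical",
    "review_frequency": "Quarterly",
    "process_doc_location": "/governance/processes/readmission/",
    "primary_user": "Clinical analytics team",
    "publication_date": "2016-03-31",
    "mdh_registered_name": "readmission_risk_model_v1",
}

assert validate_registration(example_record) == []   # all required fields present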

Required metadata information about the data processes (both business and analytics) includes:

Descriptive Name – Common name for the data element
Description – Clearly defines the data element and its intended purpose
Valid values, or range of valid values and their meanings, for data elements that are codes, flags or indicators
Priority level – Valid values are Critical, High, Medium, Low or Null priority
Source Indicator – Indicates the Trusted Source for this data element
Associated Data Usages – Identifies the data usages consuming the data element where the data element is identified as high-priority
Name of the AE, DRO and Data Steward
Lineage information – At a minimum, it must indicate the source data asset's name(s), the date obtained or most recently updated, and the transformation performed on the source data. Transformation information includes any processing or calculations performed and the expected output, at the column/field level if possible


Retention period
Business/regulatory/legal/country-specific laws and regulations that apply to the data element
Method of archiving/purging the data once the retention period is exceeded
Required level and standard for security (e.g. HITRUST, etc.)
Required level and standard for privacy (e.g. HIPAA, etc.)
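Process-level metadata, including lineage, can be captured in the same spirit. The record below is a hypothetical sketch of one such entry; every field name and value is made up for illustration.

# Hypothetical lineage record for one derived data element; not the actual
# MDH schema. Values are examples only.
lineage_entry = {
    "descriptive_name": "monthly_claim_total",
    "priority_level": "High",
    "source_assets": ["claims_dw.claims_2016_q1"],
    "date_obtained": "2016-04-02",
    "transformation": "SUM(claim_amount) grouped by member_id and month",
    "expected_output": "Numeric(12,2), one row per member per month",
    "retention_period_days": 2555,   # the longest applicable retention rule
    "archive_method": "purge via IT data purging procedures",
    "security_standard": "HITRUST",
    "privacy_standard": "HIPAA",
}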


III. Data Classification

Data Classification is an important step toward data and application management. The four data classes and their definitions are listed in the next diagram:

Data Classification Stages & Controls for Hadoop Lake

The Data Steward for each organization is responsible for monitoring and ensuring the data is in the proper data classification category in the MetaData Hub (MDH) tool.

Additional security and governance policies related to metadata management apply to the Hadoop environment for all data in the Raw, Keyed, Validated and Refined classifications:

Location
Directory owner
Group


Resource Group
Classification (Raw, Keyed, Validated or Refined)
PHI indicator (protected health information)
Affiliate Group (franchise, country, subsidiary or affiliate)
Jurisdictional Controls (specify the control and country or geography)
Additional Access Restrictions (any specific access restrictions regulated by law or negotiated with the data provider)
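One practical way to attach such attributes to data in the lake is through extended attributes (xattrs) on the classification directories, which standard HDFS supports via hdfs dfs -setfattr. The sketch below assumes an hdfs command-line client on the path; the attribute names and the path are illustrative conventions, not a mandated schema, and a MapR cluster would apply equivalent settings to its own file system.

import subprocess

# Illustrative only: tag a lake directory with governance attributes using
# HDFS extended attributes. Attribute names and path are assumed conventions.
def tag_directory(path: str, attributes: dict) -> None:
    for name, value in attributes.items():
        subprocess.run(
            ["hdfs", "dfs", "-setfattr", "-n", f"user.{name}", "-v", value, path],
            check=True,
        )

tag_directory("/lake/raw/claims", {
    "classification": "Raw",
    "phi_indicator": "true",
    "affiliate_group": "US Franchise",
    "jurisdiction": "US",
})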


IV. Big Data Security, Privacy & Compliance

The security policies & processes outlined in this governance framework are aligned with the enterprise data security, privacy and compliance rules. This governance framework establishes a structure to gather and centralize information security from different security capabilities and processes in order to perform effective risk management.

The additional security domains introduced by the big data platform to the enterprise consist of two elements: use of the public cloud (Amazon) and open source big data (Hadoop) environments.

Implement Apache Ranger and monitor Ranger information to maintain a central data security administration capability. Connect Apache Ranger to the security tools in order to manage fine-grained authorization to specific Tasks or Jobs in the Hadoop environment. Use Apache Ranger to standardize the authentication method across all Hadoop components.

Implement different authorization methods, including role-based access control, attribute-based access control, etc., using Ranger.

The general security policies applicable to the cloud consist of best practices from ITIL, HITRUST and the Cloud Security Alliance STAR (Security Trust and Assurance Registry) for formal third-party cloud security certification.

The goals of this security policy are:

Develop and update approaches to safeguard data at the application level, sandbox level, file level and individual field/column level.
Threat intelligence – Gather, correlate, and provide intelligence to IT security from platform usage and access logs.
Advanced threat monitoring – Monitor for advanced threats within the environment using specialized tools and processes conformant with enterprise IT security procedures.
Develop audit capability for all activity at the user and data cell level, with the ability to provide data forensics analysis.

The Hadoop environment does not come with built-in security, and therefore security tools and processes must be bolted on to this environment. Since data flows in both directions, from the traditional data management environment to Hadoop and vice versa, consistent policies are required. The four pillars of Big Data security are:

1. Perimeter – Guarding access to the Hadoop cluster itself. This is possible via MapR Manager.

2. Access – Defines what users and applications can do with data. A tool of choice is Apache Sentry, but Ranger [9] and other tools are also possible.

3. Visibility – Reporting on how the data is being used. Support for this function comes from a combination of MetaData Hub (MDH) and Protegrity.

4. Data Protection – Protecting data in the cluster from unauthorized access by users or applications. The technical methods require encryption, tokenization and data masking. The tool to perform this function can be a solution like Protegrity or Voltage.

Hadoop Data Security & Access Control Considerations

Security controls on Hadoop must be managed centrally and automated across multiple systems and data elements, with consistency of policy, configuration and tools according to enterprise IT policies.

Integrate Hadoop security with existing IT tools such as Kerberos, Microsoft Active Directory, LDAP, logging and governance infrastructure, and Security Information and Event Management (SIEM) tools.

Authentication – Manage access to the Hadoop environment via a single system such as LDAP and Kerberos. Kerberos features include the capability to render intercepted authentication packets unusable to an attacker, eliminating the threat of impersonation, and never sending a user's credentials in the clear over the network. The security policy requires using Kerberos for authentication and LDAP for user access privilege control.

All intra-Hadoop services must mutually authenticate each other using Kerberos RPC. This internal check prevents rogue services from introducing themselves into the cluster activities via impersonation. Ensure the Kerberos Ticket Granting Token feature is properly configured.
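As a minimal illustration of keytab-based authentication, the sketch below obtains a Kerberos ticket for a service principal before any Hadoop access. The keytab path and principal are hypothetical examples; kinit and klist are the standard Kerberos client commands.

import subprocess

# Minimal sketch: obtain a Kerberos ticket from a keytab before touching
# Hadoop services. The keytab path and principal are hypothetical examples.
KEYTAB = "/etc/security/keytabs/etl-svc.keytab"
PRINCIPAL = "etl-svc@EXAMPLE.COM"

subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

# Verify the ticket cache; klist exits non-zero if no valid ticket is held.
subprocess.run(["klist"], check=True)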

Authorization – Apply data access controls in Hadoop consistent with the traditional data management policies, using Apache Sentry (or Ranger) to enforce user access controls to data in Hadoop before and after any data extract, transform and load (ETL).

The specific security rules listed below apply to big data management:

To apply user access controls to Hadoop data, apply POSIX-style permissions on files and directories, Access Control Lists (ACLs) for management of services and resources, and Role-based Access Control (RBAC) for certain services that access the data.
Configure Apache Sentry for consistent access authorization to data across shared jobs and tools including Hive, MapReduce, Spark, Pig and any direct access to HDFS (specifically to MapR RDFS), to ensure a single set of permissions controls for that data is in place.
Security controls must be consistent across the entire cluster to ensure intra-process accesses to data on various VMs are enforced. This is of particular importance within MapReduce, as the Task of a given Job can execute UNIX processes (i.e. MR Streaming), individual Java VMs, and arbitrary code on the host server.
Ensure the Hadoop configuration is complete so that internal tokens, including the Delegation Token, Job Token and Block Access Token, are enabled.
Ensure Simple Authentication and Security Layer (SASL) is enabled with the RPC Digest mechanism configuration.
Never allow insecure data transmission via tools such as HTTP, FTP and HSFTP. Find alternate tools to HSFTP, since Hadoop proxies use the HSFTP protocol for bulk data transfers.
Enforce IP address authentication on Hadoop bulk data transfers.
Enforce process isolation by associating each Task's code with the Job owner's UID. MapReduce offers such features to isolate a Task's code to a specific host server using the Job owner's UID, thus providing reliable process isolation and resource segmentation at the OS level of the cluster.
Apply Sentry tool configurations to create central management of access policies for the diverse set of users, applications and access paths in Hadoop.
Apply Sentry's fine-grained role-based access controls (RBAC) on all Hadoop tools consistently, such as Impala [10], Hive, Spark, MapReduce, Pig, etc. This creates a traceable chain of controls that governs access at the row and column level of data, as shown in the figure below:

Access Control Model Example for a Typical User

Apply Sentry's features to use Active Directory (AD) to determine a user's group assignments, so any changes to group assignments in AD are automatically updated in Sentry, resulting in updated and consistent role assignments.
Apply RDFS permissions for coarse data controls on directories, files and ACLs, such that each directory or file can be assigned multiple users and groups, with file permissions of Read, Write and Execute. Additionally, configure RDFS directory access controls to determine access permissions to child directories.
Apply additional Access Control Lists (ACLs) to control data-level authorization of various operations (Read, Write, Create, Admin) by column, column family and column family qualifier, down to the specific cell level.
Configure ACL permissions at both Group and User levels.
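The coarse-grained, file-level layer of these controls can be illustrated with the standard HDFS permission and ACL commands, as in the sketch below. The paths, users and groups are hypothetical; on a MapR cluster the equivalent operations would be applied to MapR RDFS, and the finer row-, column- and cell-level controls described above would still come from Sentry and per-service ACLs.

import subprocess

# Sketch of the coarse-grained layer: POSIX permissions plus ACL entries on
# a lake directory. Paths, users and groups are hypothetical examples.
def run(args: list) -> None:
    subprocess.run(args, check=True)

path = "/lake/validated/claims"

# Owner full access, owning group read/execute, all others denied.
run(["hdfs", "dfs", "-chown", "etl-svc:claims-stewards", path])
run(["hdfs", "dfs", "-chmod", "750", path])

# Grant an analyst group read-only access via an ACL, and add a default
# ACL so newly created child directories inherit the same entry.
run(["hdfs", "dfs", "-setfacl", "-m", "group:claims-analysts:r-x", path])
run(["hdfs", "dfs", "-setfacl", "-m", "default:group:claims-analysts:r-x", path])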

Visibility & Audit Reports – The goal of visibility is transparency, and the goal of audits is to capture a complete and immutable record of all data access activity in the big data platform. The rules for audit and transparency include:

Implement access-log audit trails on log-ins and data access in order to track changes to data by users.
Define data categories that contain personally identifiable information (PII) and are subject to HIPAA or PCI Data Security standards. Ensure the privacy category is complete in the MetaData Hub (MDH) tool for all data sets.
Implement the capability to furnish audit log reports of all users, including Admin activity, within the context of historical usage and data forensics.
Maintain full audit history for RDFS, Impala, Hive, HBase and Sentry in a central repository, including user ID, IP address, and full query text.
Periodically (on a monthly basis) review access logs, and report anomalies, issues and gaps associated with security policies to the Data Council.
Adopt workflow manager tools such as Oozie as a standard to maintain error and access logging at each stage of data processing, transformation and analysis. Combine Oozie with Apache Falcon to track and manage job streams within specified data governance policies.
Store Hadoop JobTracker and TaskTracker logs for all users and review the logs on a regular (monthly) basis.
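As a simple illustration of the monthly access-log review, the sketch below flags users whose access count jumps well above their own historical baseline. The log format assumed here (one JSON object per line with a user field) is an illustrative convention, not a Hadoop standard; real reviews would draw on the central audit repository.

import json
from collections import Counter

def monthly_anomalies(log_path: str, baseline: dict, factor: float = 3.0) -> dict:
    """Flag users whose monthly access count exceeds factor x their baseline."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            record = json.loads(line)   # assumed format: one JSON object per line
            counts[record["user"]] += 1
    return {u: c for u, c in counts.items() if c > factor * baseline.get(u, 1)}

# Example call (hypothetical file and baseline figures):
# flagged = monthly_anomalies("access-2016-03.log", baseline={"jdoe": 40})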


V. Data Usage Agreement (DUA)

The Data Usage Agreement is a set of rules and policies that apply to uses of data elements, whether they are internally or externally obtained. All Critical and High-priority data is expected to be of "Trusted Source" data quality. The best practice policies and rules associated with the DUA specifically include:

Usage of all Critical and High-priority data will be subject to data quality monitoring, testing and data certification.
The Data Usage Agreement defines appropriate and acceptable use of Critical and High-priority data for each data element.
Data retention requirements – The DUA may indicate the data retention period and acceptable usage time period.
Data Asset change control management requires advance notice of any data changes, review and approval, and a defined escalation process.
SMEs, Data Engineers and Data Scientists ("Users") agree to maintain and update the MetaData Hub (MDH) (by registering their data) with the latest information about the data elements that they load into the Hadoop lake, or consume or produce as a result of their analysis, for their area of responsibility.
SMEs, Data Engineers and Data Scientists ("Users") agree to follow privacy, confidentiality and security standards for the data in their areas of responsibility.
SMEs, Data Engineers and Data Scientists ("Users") agree to complete the Big Data Governance annual training.
Third-party data is monitored and managed in accordance with the Third-Party Data Agreement specified with the third-party data provider. Lineage of data elements must indicate the third-party and fourth-party data providers. (Fourth-party data providers are those entities who provide data to the third-party data provider.)


VI. Security Operations Considerations & Policies

Anonymization has been applied and studied extensively by leading IT management experts. A number of general open source tools for anonymization are available, such as the Cornell Anonymization Toolkit and ARX. Toolkits that work with Big Data tools, like the Hadoop Anonymization Toolkit, have emerged, as have commercial solutions such as Protegrity and Voltage. Policies should determine and provide guidance on anonymizing IP addresses and data in the cloud, and the quality of anonymization should be monitored and measured regularly.

Both encryption and tokenization are techniques widely used in securing big data in the cloud. Data governance security best practices should indicate that while tokenized data may be stored in the cloud, the tokens must be maintained on premises and not in the same logical and physical domain as the data itself.

Medical data has traditionally been anonymized to enable research while protecting privacy. Policies must indicate whether anonymized data sources can be brought together and the information joined while the data is anonymized. Similarly, the policies should indicate when simple key encryption can be used and what type of data must be tokenized.
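To illustrate one common anonymization technique, the sketch below pseudonymizes IP addresses with a keyed hash (HMAC). The same address always maps to the same token, so anonymized data sets can still be joined, while the key stays on premises, separate from the data. This is one illustrative approach, not a mandated method.

import hmac
import hashlib

# Illustrative keyed-hash anonymization of IP addresses. The secret key must
# be kept on premises, away from the anonymized data, so the mapping cannot
# be reversed by anyone holding the data alone.
SECRET_KEY = b"replace-with-a-managed-secret"   # hypothetical placeholder

def anonymize_ip(ip_address: str) -> str:
    """Map an IP address to a stable pseudonym that still supports joins."""
    digest = hmac.new(SECRET_KEY, ip_address.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

print(anonymize_ip("10.20.30.40"))   # same input always yields the same token
print(anonymize_ip("10.20.30.41"))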

Additional policies and operational rules are as follows:

Ensure all users have completed prerequisite training, such as the Hadoop security and governance training material, before granting them access.
Data Stewards are responsible for notifying the Hadoop security Admin regarding on-boarding or off-boarding users for their area of business.
Create user groups and assign user IDs to the appropriate groups.
Users must submit an Access Request Form showing they have completed the prerequisites before access is granted.
Access Request forms must be approved by the LOB Data Risk Officer.
Create training material for users to include, as a minimum, the following: Hadoop environment overview, data and application governance policies, policies about PII and sharing data with affiliates and publication, information security of PII, and compliance and regulatory rules specific to the big data platform.
Streaming data into Hadoop can produce large and fast data loads that make data quality checks of every data item difficult. Instead, apply random data quality checks for validity, consistency and completeness.
Ensure data quality metrics are reported for streaming data, including metadata and lineage information.
All IP addresses must be anonymized to protect the user data.
All personally identifiable data must be de-identified in Raw or be encrypted. Tokenization is preferred.
Develop and monitor performance metrics for services and role instances on the Hadoop clusters. These metrics include HDFS I/O latency, number of concurrent jobs, average CPU load across all VMs, CPU/memory usage of the cluster, disk usage, etc. Other metrics include Activity metrics, Agent (application) specific metrics, fail-over metrics, and Attempt metrics.


Apply the highest security control to the Kerberos keytab file, which contains the pairs of Kerberos principals and tickets used to access Hadoop services.
Configure the Hadoop cluster to monitor NameNode resources and generate alerts when any specified resources deviate from desired parameters.
Use Hadoop metrics to set alerts that capture sudden changes in system resources.


VII. Information Lifecycle Management

Information Lifecycle Management governs the standards and policies of data from creation and acquisition to deletion. Due to regulatory requirements and well-managed practices, several rules for retention and management of data are applicable to Big Data Governance. Some of the best practice policies that govern information lifecycle management are included below:

The retention rules for data may vary from country to country. The longest retention period must be recorded in the MetaData Hub (MDH) tool for all data sets.
Certain analytics data snapshots are necessary for compliance, legal and medical reasons. Such snapshots may include a mirror copy of the data, the algorithms, models, their versions and results.
The snapshot data will be saved in a separate protected Hadoop sandbox with Admin access privileges.
Users may request copies of data snapshots by requesting such information from the Hadoop Admin.
Data that reaches its end of life (upon maturity of the retention period) should be purged according to the IT data purging procedures.
Define a Hadoop disaster recovery, downtime and business continuity policy. A downtime means that the Hadoop infrastructure is not accessible. The policy should indicate how replicated sites will be managed and the process for either automatic or manual fail-over in the event of a downtime. The business recovery plan should indicate how replication will resume, and the process for restoring data, recovering data and synchronizing data after the failed site is back online.
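As an illustration of retention-driven purging, the sketch below lists files in a lake directory, compares each file's modification time against a retention period, and removes the expired ones. The path and retention value are hypothetical, and a production purge would run inside the IT purging procedure, with approvals and audit logging, rather than deleting directly.

import subprocess
import time

# Illustrative retention sweep over one lake directory. Path and retention
# value are hypothetical; the longest applicable country rule should be used.
RETENTION_DAYS = 2555   # e.g. roughly seven years
CUTOFF = time.time() - RETENTION_DAYS * 86400

listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "/lake/refined/claims"],
    capture_output=True, text=True, check=True,
)

for line in listing.stdout.splitlines():
    parts = line.split()
    if len(parts) < 8:
        continue   # skip the "Found N items" header line
    modified = time.mktime(
        time.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
    )
    if modified < CUTOFF:
        subprocess.run(["hdfs", "dfs", "-rm", "-skipTrash", parts[7]], check=True)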


VIII. Quality Standards

The Data Quality standard defines requirements to classify, document and manage data to ensure the big data platform's critical and high-priority data meets established quality requirements. The objective of this standard is to ensure that the platform's data is managed and is of sufficient quality.

The Data Quality Standard requires the Data Steward to manage data issues. This responsibility includes identifying data issues, assigning severity and notifying impacted stakeholders; developing data remediation plans; providing regular status updates on data remediation efforts; and communicating data issue remediation plans and status with the DRO, AE and the Data Council.

Data issues can be detected from various sources, including but not limited to compliance monitoring activities, control testing, data quality and metadata monitoring, project work, quality assurance testing and audits.

All data, whether internally or externally obtained, is to be classified into four categories:

Critical Data – The type of data that has specific regulatory and compliance requirements, such as HIPAA, FDA, FISMA, and financial regulations, and is materially significant to decision making in the organization.

High-Priority Data – The type of data that is materially significant to decision making at the organization, or is regarded as classified information and confidential data. This data is required for analytics but does not carry the high compliance requirements such as FDA, CDI or HIPAA.

Medium-Priority Data – Data that is typically generated or used as a result of day-to-day operations and is not classified as confidential or classified.

Low-Priority Data – Data that is typically not material to decision making and analytics and does not have specific retention or security/privacy requirements.

The Data Quality Standard applies to Critical Data and High-Priority Data. The standard:

Is applicable to data usage of critical and high-priority data elements as designated by each franchise or country Accountable Executive.
Sets criteria for the identification of data deemed most significant and impactful to the business.
Defines the accountable executive and performer roles responsible for data management across the organization.
Sets the minimum requirements for managing risk due to data-related issues.

Enforcing the consistent creation, definition, and usage of critical and high-priority data reduces reputational, financial, operational and regulatory risks and prevents the subsequent costs of presentation and clinical errors, as well as the potential for making decisions based on incorrect, inconsistent, or poor-quality data. Applying these requirements enables the organization to effectively demonstrate the quality of the data, raising confidence in, and regulatory compliance of, the data used.


The Data Quality standard for each division, franchise or country would consist of the following:

Critical and high-priority data are to be maintained and managed as data assets, which are collections of one or more data sets. A data set includes a data table, file, spreadsheet or any collection of data elements.
A Trusted Source is defined as a critical or high-priority data set that has a complete and accurate metadata definition, including data lineage, and whose quality is known.
Sources that do not fully meet the definition of Trusted Source will be deemed in Pending Status until the data set is brought into compliance.
A plan, with commitment dates for compliance with the governance standards, is in place and is being executed.
Each AE and DRO must identify which data elements contained in data assets in their area of responsibility are Trusted Sources or in Pending Status.
Validate lineage from Trusted Sources to the point of origination for critical or high-priority data elements.
Maintain accurate and complete metadata for critical and high-priority data elements.
Apply data quality monitoring to critical data elements within data assets. Document and report to the DRO any issues related to data quality.
Data quality issues are defined as gaps between the data element characteristics and deviations from this data governance standard.
Understand how the critical and high-priority data within each area of responsibility is being consumed, and inform the DRO of material changes that impact their specific usages as defined in the Data Usage Agreement.
Create and maintain data retention plans according to Data Lifecycle Management policies.
Maintain an inventory and certification of models used by data scientists.
Provide metrics reports to the DRO, AE and Data Council related to the level of compliance of critical data sets. Report the issues list with the expected date of remediation and a remediation plan.
Provide evidence of compliance with the data management policies described in this document and supporting standards, procedures and processes.
Manage and prioritize activities to mitigate data quality risks and issues, including oversight by the AE and DRO of an execution plan to apply these requirements across critical or high-priority data usages within their domain.
Annually review the inventory of critical or high-priority usages and Data Usage Agreements within their domain.
Document processes for identification and classification of data elements by criticality level and adherence to quality standards.
Critical and High-priority data are to be inventoried, maintained and linked to the appropriate model(s), process(es), and report(s) in the MetaData Hub (MDH) tool.

Data that is in “Pending” status can only be moved into “Certified” status by the Data Steward for its area of responsibility if it meets the following criteria (Compliance Requirements):

Metadata is complete in the MetaData Hub (MDH) tool
Lineage to origination or to the Trusted Source is validated
Data quality monitoring & checks are passed
The retention period is defined

Data Quality Reporting

It is a requirement that AEs, DROs and Data Stewards regularly monitor and report on the quality of critical and high-priority data elements within their assigned areas of responsibility, on a quarterly basis, to the Data Council.

Each Data Quality Report must contain the following components:

Results of monitoring
List of issues identified (i.e. data quality failures)
Pass/Fail status based on thresholds
Assessment of the reliability of the data for the usages as specified in the Data Usage Agreement

The following metrics are used when defining data quality rules for critical and high-priority data elements:

Accuracy – Measures the degree to which data correctly reflects its true value. It is a determination of how correctly the data reflects the real-world object or event being described. Data accuracy is most often assessed by comparing data with a reference source.
Validity – Measures the degree to which the data is within an appropriate range or boundaries, conforms to a set of valid values, or reflects the structure and content of a value which could be valid.
Completeness – Measures the degree to which data is not missing and is of sufficient breadth and depth for the task in hand. In other words, it fulfills formal requirements and expectations with no gaps.
Timeliness – Measures the degree to which data is available when it is required. Additionally, this measure ensures data is combined from consistent time periods.
Consistency – A measure of information quality expressed as (a) the degree to which redundant instances of data, present in more than one location, reconcile and (b) the degree to which the data between a Trusted Source and usage(s) can be reconciled during the movement and/or transformation of the data.
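Several of these metrics reduce to simple ratios over a data set. The sketch below computes completeness and validity for one field of a small record set; the records and the set of valid values are made-up examples.

# Illustrative completeness and validity checks for one field. The records
# and the set of valid state codes are made-up examples.
records = [
    {"member_id": "A1", "state": "WA"},
    {"member_id": "A2", "state": "OR"},
    {"member_id": "A3", "state": None},   # missing value lowers completeness
    {"member_id": "A4", "state": "ZZ"},   # invalid value lowers validity
]
VALID_STATES = {"WA", "OR", "CA"}

present = [r["state"] for r in records if r["state"] is not None]
completeness = len(present) / len(records)                          # 0.75
validity = sum(s in VALID_STATES for s in present) / len(present)   # about 0.67

print(f"completeness={completeness:.2f}, validity={validity:.2f}")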

When data issues are detected from various sources, including compliance reporting, control testing, audit findings, data quality and metadata monitoring activities, or third-party data management audits and activities, they must be documented and reported by the Data Steward for their area of responsibility. Such reports for Critical and High-priority data must include the remediation plan and target date for remediation.


Data issues come in different severity levels. The rubric below contains the criteria for triage of data quality issues:

Extreme (Level 4)
Criteria: Issue has prevented or will prevent timely preparation of business or regulatory reports, or will negatively impact the business by more than $1M.
Issue Remediation Plan documented within: 1 week
Recommended timeline for remediation: less than 1 month
Recommended frequency of status communication: Weekly

High (Level 3)
Criteria: Issue impacts business processing or reporting. Remediation activity is required.
Issue Remediation Plan documented within: 2 weeks
Recommended timeline for remediation: 1-3 months
Recommended frequency of status communication: Bi-weekly

Medium (Level 2)
Criteria: Issue has minimal impact on business decision making or reporting. Data remediation activity is recommended.
Issue Remediation Plan documented within: 3 weeks
Recommended timeline for remediation: 3-6 months
Recommended frequency of status communication: Monthly

Low (Level 1)
Criteria: Issue has minor impact on business decision making or reporting. Remediation is recommended but not time sensitive.
Issue Remediation Plan documented within: 4 weeks
Recommended timeline for remediation: 6-12 months
Recommended frequency of status communication: Quarterly

When an issue is resolved, the Data Steward is responsible for communicating closure of the data issue to the impacted executives, DRO and AE in their area of responsibility. However, in the event that the issue is not resolved in a timely fashion, the DRO and AEs are responsible for escalating the issue to the Data Council and including it in the enterprise Risk Management Quarterly Risk Report.


Summary

This book has covered a wide range of topics and areas, including modern data governance principles for more effective management of big data. While the amount of effort and scope is substantial, the art and science of this field is continuously changing, and new tools and methods emerge almost daily. It is important to continue staying on top of the latest techniques in data governance to deliver effective, best practice methods to your organization.

Similarly, best practices in big data governance are evolving. But this book has laid a foundational and practical groundwork to design and implement a data governance structure for any industry and for a company of any size.

Big data will only grow bigger, and with it the importance and visibility of data governance will increase. Leading C-suite executives have come to realize that investing in data governance will protect their investments in data in the long run. Fortunately, building a big data governance structure doesn't need to be expensive or complex. It starts with implementing the basic, common sense practices explained in this book.


ABOUT THE AUTHOR

Peter K. Ghavami received his PhD in Systems and Industrial Engineering from the University of Washington in Seattle, specializing in big data analytics. He then served as Head of Data Analytics at Capital One Financial, Bank line of business. He received his BA from Oregon University in Mathematics with emphasis in Computer Science. He received his M.S. in Engineering Management from Portland State University. His career started as a software engineer, with progressive responsibilities at IBM Corp. as Systems Engineer. Later he became Director of Engineering, Chief Scientist, and VP of Engineering and Product Management at various high technology firms.

Before coming to Capital One Financial, he was Director of Informatics at UW Medicine, leading numerous clinical system implementations and new product development projects. He has been a strategic advisor and VP of Informatics at various analytics companies.

He has authored several papers, books and book chapters on software process improvement, vector processing, distributed network architectures, and software quality. His first book, titled Lean, Agile and Six Sigma Information Technology Management, was published in 2008. He has also published a book on data analytics titled Clinical Intelligence: The Big Data Analytics Revolution in Healthcare, which has become an industry standard for big data analytics in biosciences.

Peter is on the advisory board of several clinical analytics companies and is often invited as a lecturer and speaker on this topic. He is certified in ITIL and TOGAF 9. He is a member of the IEEE Reliability Society, IEEE Life Sciences Initiative and HIMSS. He has been an active member of the HIMSS Data Analytics Task Force and advises Fortune 500 executives on data strategy.


[1] International Data Corporation (IDC) is a premier provider of research, analysis, advisory and market intelligence services.

[2] Each Zettabyte is roughly 1000 Exabytes, and each Exabyte is roughly 1000 Petabytes. A Petabyte is about 1000 Terabytes.

[3] AutoRegressive Integrated Moving Average.

[4] https://www.dama.org/content/body-knowledge

[5] http://cmmiinstitute.com/data-management-maturity

[6] http://pubs.opengroup.org/architecture/togaf9-doc/arch/

[7] "The What, Why and How of Master Data Management", by Roger Wolter, Kirk Haselden, Microsoft Corporation, Nov. 2006.

[8] "The What, Why and How of Master Data Management", by Roger Wolter, Kirk Haselden, Microsoft Corporation, Nov. 2006.

[9] Ranger is promoted by Hortonworks, while Sentry was initially promoted by Cloudera. If a different Hadoop infrastructure such as MapR is selected, then neither tool may be necessary, as MapR includes its own internal file structure security tool, but you can opt to use Sentry as the central authorization tool as well.

[10] Impala, initially developed by Cloudera, requires Sentry for application token security. Hence, if you select Impala then you will need to implement Sentry. However, if you choose MapR and Drill (instead of Impala) then Sentry is optional and not mandatory.