ieg 201402 intuit building big data analytics platform
Information Excellence Group 2014 Spring "Business Analytics Industry Summit", Building Big Data Analytics Platform, Neeta Pande, Data Architect, Intuit
INTUIT:
Building Big Data Analytics Platform at Intuit
Neeta Pande
8/Feb/2014
Roadmap
• Setting Context and Introduction to the Analytics Platform at Intuit
• Key highlights that differentiate the platform
• Sharing experiences building the platform
• Wish-list of capabilities for the future of Big Data technologies
Setting Context and Intro to the Analytics Platform
Quick look into Intuit Offerings
• Central repository of Analytical Data from
  – Intuit products
  – Intuit Business Systems
  – Intuit Master Systems
  – External Data Sources
• Caters to
  – Product Managers
  – Product Developers
  – Data Analysts
  – Data Scientists
  – Experience Designers
Enterprise Wide Platform for cross Intuit Data Analytics
Introduction to the Analytical Platform
Technologies used to build the platform
HCatalog
Key highlights that differentiate the platform
Capability View of the Platform
• Data sources: Product User Entered Data, Product Usage Data, Business Data, Master Data, External Data
• Data Integration
• Central Analytics Platform: Batch, Near Realtime, Realtime
• Policy based Access Control
• Users: Management, PM, PD, Data Analyst, Data Scientist
Key differentiators of the Platform
• Enterprise wide data across all offerings and cross-offerings
• Cohost Sensitive Information on same infrastructure
• Batch, Near Real Time, Real time on the same infrastructure
• Mobile, Web, Desktop Offerings
• DWH Semantic layers on Hadoop
Data Pipeline and Challenges
1. Data Acquisition
2. Data Cleansing
3. Data Standardization
4. Data Securitization
5. Incremental load
6. Entity Mastering
7. DWH load
8. Data Consumption
• DWH patterns like SCD, surrogate key, fact updates are challenging
• Cleansing and Standardization need third-party libraries
• They are part of the same flow and need Hadoop integration
• Encryption of sensitive information
• Tokenization for join optimization on sensitive fields
• Extract Analytical information before encryption
• Loading data from transactional sources is a challenge
• MDM solutions from major vendors do not provide mastering in Hadoop.
• Interactive exploration in MPP-RDBMS because of Advanced SQL and query performance
• Sampling and extraction for building models in R
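The securitization bullets above (tokenization for joins, extracting analytical information before encryption) can be illustrated with a minimal sketch. It assumes deterministic tokenization via HMAC-SHA256, which is one common way to keep sensitive fields joinable without exposing them; the deck does not say which scheme was used. The key, the field names, and the `tokenize` and `secure_record` helpers are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would come from a key manager.
TOKEN_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = TOKEN_KEY) -> str:
    """Deterministic token: equal inputs give equal tokens, so joins still work."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def secure_record(record: dict) -> dict:
    """Derive analytical attributes first, then drop the raw sensitive field."""
    email = record["email"]
    return {
        "user_id": record["user_id"],
        # Analytical attribute extracted before the raw value is removed.
        "email_domain": email.strip().lower().rsplit("@", 1)[-1],
        # Token replaces the raw email and serves as the join key.
        "email_token": tokenize(email),
    }


# Two hypothetical datasets that share the sensitive field.
profiles = [{"user_id": 1, "email": "Ann@Example.com"}]
orders = [{"order_id": 10, "email": "ann@example.com "}]

secured_profiles = [secure_record(p) for p in profiles]
orders_by_token = {tokenize(o["email"]): o["order_id"] for o in orders}

# Join on the token without ever comparing raw email addresses.
joined = [
    (p["user_id"], orders_by_token[p["email_token"]])
    for p in secured_profiles
    if p["email_token"] in orders_by_token
]
```

Because the token is a keyed hash rather than reversible ciphertext, a design like this would still need a separately encrypted copy of the field wherever the original value must be recovered.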
Sharing Experiences building the platform
• Batch Data Integration – Evaluated and found Big Data Integration capabilities of Informatica relevant for the Platform
• Real time – Using Flume for real-time use cases. Found Kafka and Storm to be a good fit for several requirements.
Evaluated and found Informatica Data Quality a good fit for Data Cleansing and Standardization, integrated in the same flow as Batch Data Integration
• Custom Implementation of symmetric key Encryption/Decryption.
• Hadoop does not provide an out-of-the-box solution
• Evaluated third-party solutions; not mature enough
• Key management using HSM (SafeNet)
• Decryption UDFs in MR, Pig, Hive shield developers/users from the security implementation
Custom Implementation of Mastering solution in Hadoop.
• Leading MDM solutions do not have Hadoop Integration
• Some open source tools have MDM capabilities, but are not mature or widely adopted.
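As a rough illustration of what a custom mastering step involves, the sketch below groups records from several sources by a match key and merges each group into one golden record. The deck does not describe the actual match or survivorship rules, so both are assumptions here: matching is on normalized email, and the latest non-empty value of each field wins. The grouping step corresponds conceptually to a shuffle by match key in MapReduce, and the merge to a reducer. All names are hypothetical.

```python
from collections import defaultdict

MERGED_FIELDS = ("name", "phone", "updated")


def match_key(record: dict) -> str:
    """Assumed match rule: same normalized email means same entity."""
    return record["email"].strip().lower()


def master(records: list) -> list:
    """Group records by match key and merge each group into a golden record."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    golden = []
    for key in sorted(groups):
        merged = {
            "email": key,
            "sources": sorted({r["source"] for r in groups[key]}),
        }
        # Assumed survivorship rule: walk oldest to newest so that the
        # latest non-empty value of each field overwrites earlier ones.
        for rec in sorted(groups[key], key=lambda r: r["updated"]):
            for field in MERGED_FIELDS:
                if rec.get(field) not in (None, ""):
                    merged[field] = rec[field]
        golden.append(merged)
    return golden


# Hypothetical input from two source systems.
records = [
    {"source": "crm", "email": "Ann@Example.com", "name": "Ann Lee",
     "phone": "", "updated": "2014-01-05"},
    {"source": "billing", "email": "ann@example.com", "name": "",
     "phone": "555-0100", "updated": "2014-01-20"},
    {"source": "crm", "email": "bob@example.com", "name": "Bob Ray",
     "phone": "555-0111", "updated": "2014-01-10"},
]
golden = master(records)
```

Exact-key matching is the simplest possible rule; fuzzy matching on names or addresses would replace `match_key` with a blocking and scoring step.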
• Traditional DWH and incremental loads are challenging on Hadoop.
• Upserts and SCD are handled best in HBase and exposed via HCatalog for querying.
• Ad hoc query capabilities are still not mature or widely adopted, hence MPP-RDBMS is still preferred.
• Large-scale machine learning infrastructure is still being adopted, hence widely used technology options are not yet in place.
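To show why upserts and SCD are awkward on append-oriented storage, here is a minimal in-memory sketch of a Type 2 slowly changing dimension load. It illustrates only the logic (expire the current row, insert a new version with a fresh surrogate key); it does not use the HBase or HCatalog APIs, and the column names, `OPEN_END` sentinel, and `apply_scd2` function are hypothetical.

```python
OPEN_END = "9999-12-31"  # assumed sentinel for "still valid"


def apply_scd2(dimension: list, incoming: list, load_date: str) -> list:
    """Apply an incremental batch to a Type 2 dimension; returns new rows."""
    result = [dict(row) for row in dimension]  # copy so the input stays untouched
    current = {row["customer_id"]: row for row in result if row["is_current"]}
    next_key = max((row["sk"] for row in result), default=0) + 1

    for rec in incoming:
        existing = current.get(rec["customer_id"])
        if existing is not None and existing["plan"] == rec["plan"]:
            continue  # attribute unchanged, nothing to do
        if existing is not None:
            # Update in place: this is the step that needs row-level mutation.
            existing["valid_to"] = load_date
            existing["is_current"] = False
        new_row = {
            "sk": next_key,  # surrogate key
            "customer_id": rec["customer_id"],
            "plan": rec["plan"],
            "valid_from": load_date,
            "valid_to": OPEN_END,
            "is_current": True,
        }
        result.append(new_row)
        current[rec["customer_id"]] = new_row
        next_key += 1
    return result


# Hypothetical dimension with one current row, and one incremental batch.
dimension = [
    {"sk": 1, "customer_id": "c1", "plan": "basic",
     "valid_from": "2014-01-01", "valid_to": OPEN_END, "is_current": True},
]
incoming = [
    {"customer_id": "c1", "plan": "plus"},   # changed attribute
    {"customer_id": "c2", "plan": "basic"},  # new customer
]
updated = apply_scd2(dimension, incoming, "2014-02-08")
```

The in-place expiry of the old row is what a store with row-level updates handles naturally, whereas immutable files would require rewriting the affected data.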
Wish-list for future of Hadoop
Data Security support built into the platform
MDM solutions integrated and optimized for the platform
Interactive querying capabilities on the big data platforms (Impala, Tez)
Better support for traditional DWH capabilities
Integrated Real time, Near real time and Batch processing pipelines
Distributed machine learning technologies with comprehensive and advanced capabilities
Open source end-to-end data quality solutions integrated with the platform
Q & A
Thank you
About Information Excellence Group
Progress Information Excellence
Towards an Enriched Profession, Business and Society
• Community Focused
• Volunteer Driven
• Knowledge Share
• Accelerated Learning
• Collective Excellence
• Distilled Knowledge
• Shared, Non Conflicting Goals
• Validation / Brainstorm platform
• Mentor, Guide, Coach
• Satisfied, Empowered Professional
• Richer Industry and Academia
Reach us at:
blog: http://informationexcellence.wordpress.com/
presentations: http://www.slideshare.net/informationexcellence
linked in: http://www.linkedin.com/groups/Information-Excellence-3893869
Facebook: http://www.facebook.com/pages/Information-excellence-group/171892096247159
Google+: https://plus.google.com/u/0/communities/102316155996060621595
twitter: #infoexcel
email: [email protected]
Have you enriched yourself by contributing to the community Knowledge Share?