The EDW Ecosystem
Leveraging Hadoop within an existing Data Warehouse environment
LUKE KAY, STEVE O’NEILL
Department of Immigration and Border Protection 2
Question: What is this?
Answer: A data lake…
Image source: LinkedIn, 2016
Agenda
1. Overview of DIBP business
2. Why Hadoop in DIBP?
3. Overview of EDW environment
4. Technical implementation of Hadoop
5. Next steps
6. Questions
The Department
Each week on average…
• Previously two separate organisations:
  • ACBPS (Australian Customs and Border Protection Service)
  • DIAC (Department of Immigration and Citizenship)
• DIBP (Department of Immigration and Border Protection) and ABF (Australian Border Force) formed on 1 July 2015
DIBP Annual Report 2014-2015; ACBPS Annual Report 2014-2015
Hadoop drivers in DIBP?
Data Archival / ETL Offload
- Significant legacy ‘application’ debt
- Historical data sources kept for ‘one-off’ querying or audit purposes
- The right platform for the right workload
Business Requirements
- Load Big Data, but make it easily accessible
- Store it for long periods of time
Analytics
- Advanced Analytics (Spark)
- Text mining
- Unstructured data
New Data Types & Functionality
- ‘Free text’ searching (Solr)
- Key value store (HBase)
EDW / Hadoop Strategy
The right data in the right place for the right outcome:
• Teradata EDW for the structured
• Hortonworks Hadoop for the unstructured
• Teradata Aster for discovery
All connected over a high-speed InfiniBand network for quick data transfer and seamless connectivity.
Our Strategy:
• Walk before you run. Start small and expect rework
• Ability to query and access the two environments efficiently and effectively (more on this)
• Free text / unstructured searches (Solr) compared to full table scans in EDW
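The contrast in the last bullet can be illustrated with a toy example. Solr is built on Lucene, which answers text queries from an inverted index rather than by reading every row. The sketch below is plain Python, not Solr or Teradata code; the sample records and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records (id -> free text); not real departmental data.
RECORDS = {
    1: "passenger arrived on flight from Sydney",
    2: "cargo manifest flagged for inspection",
    3: "passenger visa application approved",
}


def full_scan(records, term):
    """Check every record for the term, the way a full table scan would."""
    term = term.lower()
    return sorted(
        rid for rid, text in records.items() if term in text.lower().split()
    )


def build_inverted_index(records):
    """Map each token to the set of record ids that contain it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            index[token].add(rid)
    return index


def index_lookup(index, term):
    """Answer the query from the index without re-reading the records."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_inverted_index(RECORDS)
```

Both functions return the same ids for a given term; the difference is that `full_scan` touches every record on every query, while `index_lookup` pays the cost once when the index is built.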
The Enterprise Data Warehouse
• Embedded Operationally in the Department
• Get everything / Integrate Data
• As close to the event as possible
• Used everywhere
• Intelligence, Operational, Analytics and Reporting capabilities
EDW ENVIRONMENT
Our Requirements
Unstructured Repository
• Tens of millions of BLOBs
• Years of backlog / historical data
• New projects
• Multiple source database types
• Fast and convenient retrieval
Logging Repository
• Billions of rows of infrequently accessed logging data
Timeframe
• Now
Our Solution
[Architecture diagram. Components: Source, Sqoop, Avro, HDFS, Pig, HiveQL, Hive, HBase. Zones: Landing, Staging, Archive.]
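One way the Sqoop, Avro and Landing pieces named on this slide could fit together is a Sqoop import that writes Avro files into an HDFS landing directory. The sketch below only assembles the command line and does not run it. The JDBC URL, table name and landing path are hypothetical, and the slides do not say which options the team actually used.

```python
import shlex


def sqoop_import_command(jdbc_url, table, landing_dir, mappers=4):
    """Build a `sqoop import` argument list that lands one table as Avro on HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--table", table,                  # source table to copy
        "--as-avrodatafile",               # write Avro data files
        "--target-dir", f"{landing_dir}/{table.lower()}",  # HDFS landing path
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]


# Hypothetical source and paths, for illustration only.
cmd = sqoop_import_command(
    "jdbc:oracle:thin:@//source-db.example:1521/ORCL",
    "CASE_NOTES",
    "/data/landing",
)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments makes it straightforward to pass to a scheduler or to `subprocess.run` without shell quoting problems.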
The Challenges
Learning Curve
• Once the basics are covered, the curve gets steep.
Toolsets
• Vast number of tools
• Behaviour of tools
Robustness
• e.g., HBase regions crashing under load
The Business
• Fear of the new
Advice and Lessons
Keep it Simple
• Start with your existing skillsets
• Start with a defined need
Expect Rework
• Partly from bugs
• Mostly from experience
Help is out There
• Not always easy to find, but generally good quality
• Be careful of old advice
Advice and Lessons
Run Your Own Race
• Don’t be afraid to not like tool X or approach Y
• Don’t feel the need to always jump to the latest and greatest
Use Your Existing Standards
• e.g., our Landing and Staging concept
Enjoy!
What’s next
- Departmental EDW consolidation (leverage lessons learned)
- Hadoop 2.4 just completed (now we will take new functionality, bug fixes, security)
- Advanced Analytics work on the Hadoop platform
- Hadoop Roadmap (next slide)
Questions