Debunking Common Myths of Hadoop Backup & Test Data Management
TRANSCRIPT
Confidential and Proprietary
Debunking Common Myths About Hadoop Backup and Test Data Management
Hari Mankude, CTO
November 2016
My Background
Why Bother With Backup and Test Data Mgmt?
• The average cost of a data loss incident is $900,000
• 90% of enterprises delay applications because of a lack of test data
Source: EMC, Talena
Myth #1: Data Replicas Prevent Data Loss
[Diagram: one Name Node and five Data Nodes]
Myth #2: Hadoop Replication Prevents Data Loss
[Diagram: Data Center #1 and Data Center #2, each with one Name Node and three Data Nodes, with DistCp between them]
Myth #3: Hadoop Snapshots Are An Effective Backup Strategy
PROBLEM: Snapshots result in storage amplification
PROBLEM: Need a scheduler to take timely snapshots and delete older restore points
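The scheduling burden named above can be sketched as follows. This is a minimal illustration, not a recommended backup tool: the function names, the snapshot naming scheme, and the retention window are assumptions made for the example. It only builds the `hdfs dfs -createSnapshot` and `hdfs dfs -deleteSnapshot` command lines and does not run them; it also assumes the directory has already been made snapshottable by an administrator.

```python
from datetime import datetime, timedelta

# Hypothetical naming scheme for this sketch: "s" + timestamp.
NAME_FORMAT = "s%Y%m%d-%H%M%S"


def snapshot_name(taken_at: datetime) -> str:
    """Build a snapshot name that encodes when the snapshot was taken."""
    return taken_at.strftime(NAME_FORMAT)


def plan_snapshot_commands(path, existing_names, now, keep_days):
    """Return the HDFS CLI commands for one scheduler run.

    One new snapshot is created, and every existing snapshot older than
    `keep_days` is scheduled for deletion. Nothing is executed here.
    """
    cutoff = now - timedelta(days=keep_days)
    commands = [["hdfs", "dfs", "-createSnapshot", path, snapshot_name(now)]]
    for name in sorted(existing_names):
        taken_at = datetime.strptime(name, NAME_FORMAT)
        if taken_at < cutoff:
            commands.append(["hdfs", "dfs", "-deleteSnapshot", path, name])
    return commands


if __name__ == "__main__":
    plan = plan_snapshot_commands(
        "/data",  # hypothetical snapshottable directory
        ["s20161101-000000", "s20161107-000000"],
        now=datetime(2016, 11, 8),
        keep_days=3,
    )
    for command in plan:
        print(" ".join(command))
```

Even this small sketch has to be run on a timer, monitored, and kept in step with the retention policy, which is the operational overhead the slide points to.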
Myth #4: Restoring From Snapshots Is Trivial
PROBLEM: Requires metadata and data to be restored in sync
PROBLEM: Versioning complicates the restore process
Myth #5: DistCp Is Good Enough
• DistCp only copies data, not metadata or attributes
• Very resource intensive: takes up MapReduce slots on production
• Error recovery is not robust and can lead to failed jobs
• No restore point management (i.e., no point-in-time recovery)
Myth #6: The Traditional Backup/Restore Process Works
Weekly Fulls and Daily Incrementals
• 500 TB with 5% daily change = 650 TB moved per week
Agents
• Impact on CPU
• Management overhead of agents on 100s of nodes
Restores
• Involves going back to last full backup and applying all the incrementals
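The 650 TB figure above can be reproduced with a short calculation. The sketch below assumes one full backup plus six daily incrementals per week, with each incremental equal to the daily change; that schedule is an assumption chosen because it matches the slide's number, and the function names are illustrative. The second function lists what a restore has to replay under the same schedule.

```python
def weekly_data_moved_tb(total_tb, daily_change_pct, incrementals_per_week=6):
    """Data moved in one week: one full backup plus the daily incrementals."""
    incremental_tb = total_tb * daily_change_pct / 100
    return total_tb + incrementals_per_week * incremental_tb


def restore_chain(days_since_full):
    """Backups to replay, in order, for a restore N days after the last full."""
    return ["full"] + [f"incremental-{day}" for day in range(1, days_since_full + 1)]


if __name__ == "__main__":
    # 500 TB full + 6 x 25 TB incrementals
    print(weekly_data_moved_tb(500, 5))
    # Worst case under this schedule: the full plus six incrementals
    print(restore_chain(6))
```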
Myth #7: Test Data Management Is A Simple Process
• Change Request - 1 week
• Provision Production Data - 1 week
• Create Test DB and Mask Data - 1 week
• Create Samples of Production Data - 2 days
• Push Production Data To Test - Hours
• Repeat Process - 3-4 weeks
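As a rough check, the step durations above add up to the quoted 3-4 weeks. The sketch assumes the steps run one after another, that a week is 7 calendar days, and that the final push (listed only as "Hours") rounds to zero days; the dictionary and function names are illustrative.

```python
# Durations in days, taken from the steps listed above.
STEP_DAYS = {
    "Change request": 7,
    "Provision production data": 7,
    "Create test DB and mask data": 7,
    "Create samples of production data": 2,
    "Push production data to test": 0,  # "hours", rounded down (assumption)
}


def total_days(step_days):
    """Total elapsed days if every step runs sequentially."""
    return sum(step_days.values())


def total_weeks(step_days):
    """Same total expressed in 7-day weeks."""
    return total_days(step_days) / 7


if __name__ == "__main__":
    print(total_days(STEP_DAYS), "days")
    print(round(total_weeks(STEP_DAYS), 1), "weeks")
```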
The Evolution of Data Management
THE NEXT 25 YEARS
THE TRADITIONAL WORLD
Data Management
Data Platforms
Talena in Production
Test Cluster
Research Cluster
Talena GUI
Hadoop/Spark Cluster
Cassandra Cluster
Vertica Cluster
Couchbase Cluster
Talena Smart Storage Cluster
The Talena Architecture
• Deep de-duplication and compression with app-aware architecture
• Incremental-forever backup architecture
• High availability via erasure coding in distributed cluster architecture
Smart Storage Optimizer
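For readers unfamiliar with de-duplication, the general idea can be sketched as below. This is a generic, fixed-size-chunk illustration and not a description of Talena's implementation: the function names and chunk size are assumptions, and the app-aware logic, compression, and erasure coding named on the slide are not modeled.

```python
import hashlib


def dedup_store(blobs, chunk_size=4):
    """Split each blob into fixed-size chunks and store each unique chunk once.

    Returns (store, recipes): `store` maps a SHA-256 digest to chunk bytes,
    and `recipes` holds, per blob, the ordered digests needed to rebuild it.
    """
    store = {}
    recipes = []
    for blob in blobs:
        recipe = []
        for start in range(0, len(blob), chunk_size):
            chunk = blob[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def rebuild(store, recipe):
    """Reassemble a blob from its ordered list of chunk digests."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    backups = [b"aaaabbbbcccc", b"aaaabbbbdddd"]  # two backups sharing 8 bytes
    store, recipes = dedup_store(backups)
    stored_bytes = sum(len(chunk) for chunk in store.values())
    print(len(store), "unique chunks,", stored_bytes, "bytes stored")
```

Only chunks that differ between successive backups consume new space, which is also why an incremental-forever scheme pairs naturally with this kind of store.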
The Talena Architecture
• Native querying and analytics via active compute layer
• Unbounded scale with a Hadoop-native architecture
Smart Storage Optimizer
Active Compute Services, Distributed File System
The Talena Architecture
• Google-like catalog shortens data recovery time
• Automatic schema generation for mirroring and backups
• Granular recovery at an object level
• Recovery to multiple topologies
• Native integration with LDAP and Kerberos for authentication
• Role-based access control defines specific privileges
• Transparent data encryption
• Masking for PII data
Smart Storage Optimizer
Active Compute Services, Distributed File System
Metadata Catalog, Data Orchestration Services, Security Services
The Talena Architecture
• ‘Single pane of glass’ for multiple use cases and data platforms
• Agentless architecture minimizes management overhead
• GUI, CLI, REST-based Talena API options
Smart Storage Optimizer
Active Compute Services, Distributed File System
Metadata Catalog, Data Orchestration Services, Security Services
GUI, CLI, API
Hadoop Support
• Supports and/or is certified against multiple distributions: Apache, Cloudera, Hortonworks, IBM BigInsights
• Supports multiple applications: HDFS, Hive, HBase, Impala, Presto
• Deployed either on-premises or in private/public clouds
Q&A
We’ll send you a link to our eBook “The Hadoop Backup Guide”
Additional resources: talena-inc.com/resources and talena-inc.com/blog
Ping us with any additional questions: [email protected]
Q and A