Hadoop, HBase, and Healthcare
![Page 1: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/1.jpg)
Hadoop, HBase, and Healthcare
Ryan Brush
![Page 2: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/2.jpg)
Topics
- The Why
- The What
- Complementing MapReduce with streams
- HBase and indexes
- The future
![Page 3: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/3.jpg)
Health data is fragmented
![Page 4: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/4.jpg)
Pieces of a person’s health spread across many systems
![Page 5: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/5.jpg)
How many times have you filled out a clipboard?
![Page 6: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/6.jpg)
We need to put the pieces together again
- Better-informed decisions
- Application of best available evidence
- Systemic improvement of care
- Health recommendations
![Page 7: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/7.jpg)
Some ways Hadoop is helping solve this
![Page 8: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/8.jpg)
Chart Search
![Page 9: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/9.jpg)
Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results
- Processing latency: tens of minutes
![Page 10: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/10.jpg)
Medical Alerts
![Page 11: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/11.jpg)
Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge
- Processing latency: single-digit minutes
![Page 12: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/12.jpg)
Exploring live data
![Page 13: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/13.jpg)
Exploring live data
- Novel ways of exploring records
- Pre-computed models matching users’ access patterns
- Very fast load times
- Processing latency: seconds or faster
![Page 14: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/14.jpg)
And many others
- Population analytics
- Care coordination
- Personalized health plans

Scale
- Data sets growing at hundreds of GBs per day
- More than 500 TB total storage
- Rate is increasing; expecting multi-petabyte data sets
![Page 15: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/15.jpg)
A trend towards competing needs
- Analyze all data holistically
- Quickly apply incremental updates
![Page 16: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/16.jpg)
A trend towards competing needs

MapReduce
- (Re-)process all data
- Move computation to data
- Output is a pure function of the input
- Assumes a static set of input

Stream
- Incremental updates
- Move data to computation
- Needs to clean up outdated state
- Input may be incomplete or out of order

Both processing models are necessary, and the underlying logic must be the same
![Page 17: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/17.jpg)
A trend towards competing needs
- Speed Layer
- Batch Layer

Source: http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
![Page 18: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/18.jpg)
A trend towards competing needs
- Batch Layer
  - High latency (minutes or hours to process)
  - Move computation to data
  - Years of data
  - Bulk loads
- Speed Layer
  - Low latency (seconds to process)
  - Move data to computation
  - Hours of data
  - Incremental updates
![Page 19: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/19.jpg)
A trend towards competing needs
- Speed Layer: Storm (stream-based)
- Batch Layer: MapReduce (Hadoop)
![Page 20: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/20.jpg)
Into the rabbit hole
- A ride through the system
- Techniques and lessons learned along the way
![Page 21: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/21.jpg)
Data ingestion
- Stream data into HTTPS service
- Content stored as Protocol Buffers
- Mirror the raw data as simply as possible
![Page 22: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/22.jpg)
Process incoming data
- Initially modeled after Google Percolator
- “Notification” records indicate changes
- Scan for notifications (scan for updates)

Data Table
- source:1/document:123
- source:2/allergy:345
- source:2/document:456
- . . .
- source:150/order:71

Notification Table
- source:1/document:123
- source:150/order:71
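
The notification pattern above can be sketched in a few lines. This is a minimal in-memory illustration, assuming plain Python dicts stand in for the HBase data and notification tables; the function names `write_record` and `scan_notifications` are hypothetical and do not come from the talk.

```python
# In-memory stand-ins for the two tables (assumption: the real ones live in HBase).
data_table = {}
notification_table = {}

def write_record(row_key, content):
    """Store the record and leave a notification marking it as changed."""
    data_table[row_key] = content
    notification_table[row_key] = True

def scan_notifications():
    """Return changed records in row-key order, then clear their notifications."""
    changed = [(row_key, data_table[row_key]) for row_key in sorted(notification_table)]
    notification_table.clear()
    return changed

write_record("source:1/document:123", "doc bytes")
write_record("source:150/order:71", "order bytes")
pending = scan_notifications()
```

Only rows with a notification are visited, so the consumer does not need to scan the full data table to find what changed.
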
![Page 23: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/23.jpg)
But there’s a catch…
- Percolator-style notification records require external coordination
- More infrastructure to build, maintain
- …so let’s use HBase’s primitives
![Page 24: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/24.jpg)
Process incoming data
- Consumers scan for items to process
- Atomically claim lease records (CheckAndPut)
- Clear the record and notifications when done
- ~3000 notifications per second per node

| Row Key | Qualifiers (lease record and keys of updated items) |
| --- | --- |
| split:0 | 0000_LEASE, source:2/allergy:345, source:150/order:71, … |
| split:1 | 0000_LEASE, source:4/problem:78, source:205/document:52, … |
| . . . | |
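
A rough sketch of claiming a lease with an atomic compare-and-set follows. It assumes an in-memory class standing in for the table; `LeaseTable`, `try_claim`, and `release` are hypothetical names, and the sketch omits lease expiry, which a real system would need to recover splits held by a crashed consumer.

```python
import threading

class LeaseTable:
    """In-memory stand-in for a table offering an atomic check-and-put."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def check_and_put(self, row, qualifier, expected, new_value):
        """Write new_value only if the cell currently holds expected."""
        with self._lock:
            if self._cells.get((row, qualifier)) != expected:
                return False
            self._cells[(row, qualifier)] = new_value
            return True

def try_claim(table, split, consumer_id):
    """Claim the lease for a split; succeeds only if nobody holds it."""
    return table.check_and_put(split, "0000_LEASE", None, consumer_id)

def release(table, split, consumer_id):
    """Release the lease, but only if this consumer still holds it."""
    return table.check_and_put(split, "0000_LEASE", consumer_id, None)

table = LeaseTable()
first = try_claim(table, "split:0", "consumer-a")   # wins the lease
second = try_claim(table, "split:0", "consumer-b")  # loses: already held
released = release(table, "split:0", "consumer-a")
third = try_claim(table, "split:0", "consumer-b")   # wins after release
```

Because the check and the write happen as one step, two consumers racing for the same split cannot both succeed.
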
![Page 25: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/25.jpg)
Advantages
- No additional infrastructure
- Leverages HBase guarantees
  - No lost data
  - No stranded data due to machine failure
- Robust to volume spikes of tens of millions of records
![Page 26: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/26.jpg)
Downsides
- Weak ordering guarantees
- Must be robust to duplicate processing
- Lots of garbage from deleted cells
  - Schedule major compactions!
- Simpler alternatives if latency isn’t an issue
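
One common way to tolerate duplicate and out-of-order delivery is to make each update idempotent. The sketch below assumes every update carries a version number, which is an assumption for illustration and not something the slides state; `apply_update` is a hypothetical name.

```python
def apply_update(state, update):
    """Keep only the highest version per record; replays and stale updates are no-ops."""
    record_id, version, value = update
    current = state.get(record_id)
    if current is None or version > current[0]:
        state[record_id] = (version, value)
    return state

state = {}
updates = [
    ("allergy:345", 1, "penicillin"),
    ("allergy:345", 2, "penicillin, latex"),
]
# Deliver everything once, then again in reverse order to simulate duplicates.
for update in updates + updates[::-1]:
    apply_update(state, update)
```

With a handler like this, processing the same notification twice leaves the stored state unchanged.
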
![Page 27: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/27.jpg)
Measure Everything
- Instrumented HBase client to see effective performance
- We use Coda Hale’s Metrics API and Graphite Reporter
- Revealed impact of hot HBase regions on clients
![Page 28: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/28.jpg)
The story so far
![Page 29: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/29.jpg)
Into the Storm
- Storm: scalable processing of data in motion
- Complements HBase and Hadoop
- Guaranteed message processing in a distributed environment
- Notifications scanned by a Storm Spout
![Page 30: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/30.jpg)
Processing with Storm
![Page 31: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/31.jpg)
Challenges of incremental updates
- Incomplete data
- Outdated previous state
- Difficult to reason about changing state and timing conditions
![Page 32: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/32.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | | page:1 |

- Process (map) components into a staging family

Incoming data
![Page 33: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/33.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | | page:1, page:3 |

- Process (map) components into a staging family

Incoming data
![Page 34: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/34.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | | page:1, page:2, page:3 |

- Process (map) components into a staging family

Incoming data
![Page 35: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/35.jpg)
Handling Incomplete Data
| Row Key | Summary Family | Staging Family |
| --- | --- | --- |
| document:1 | document_summary | page:1, page:2, page:3 |

- Process (map) components into a staging family
- Merge (reduce) components when everything is available
- Many cases need no merge phase: consuming apps simply read all of the components

Incoming data
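
The staging-then-merge flow shown across these slides might look like the following. It is a sketch under stated assumptions: a dict stands in for one HBase row with two column families, the expected page count is known in advance, and `stage_page` and `merge_if_complete` are hypothetical names.

```python
def stage_page(row, page_number, text):
    """Map step: write one processed component into the staging family."""
    row["staging"][f"page:{page_number}"] = text

def merge_if_complete(row, expected_pages):
    """Reduce step: build the summary only once every page has arrived."""
    keys = [f"page:{n}" for n in range(1, expected_pages + 1)]
    if not all(key in row["staging"] for key in keys):
        return False
    row["summary"]["document_summary"] = " ".join(row["staging"][key] for key in keys)
    return True

row = {"summary": {}, "staging": {}}
results = []
# Pages arrive out of order, as on the slides: 1, then 3, then 2.
for page_number, text in [(1, "intro"), (3, "plan"), (2, "findings")]:
    stage_page(row, page_number, text)
    results.append(merge_if_complete(row, expected_pages=3))
```

The summary family stays empty until the last missing page lands, so readers never see a summary built from partial input.
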
![Page 36: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/36.jpg)
Different models, same logic
- Incremental updates like a rolling MapReduce
- Write logic as pure functions
- Coordinate with higher libraries
  - Storm
  - Apache Crunch
- Beware of external state
  - Difficult to reason about and scale
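
To illustrate sharing one pure function between both models, here is a small sketch. The function `merge_counts` and the record shape are invented for the example; in the talk's setting the batch caller would be a MapReduce job and the incremental caller a Storm topology.

```python
from functools import reduce

def merge_counts(summary, record):
    """Pure function: returns a new summary and never mutates its inputs."""
    kind = record["kind"]
    return {**summary, kind: summary.get(kind, 0) + 1}

records = [{"kind": "allergy"}, {"kind": "order"}, {"kind": "allergy"}]

# Batch style: recompute the summary from all records at once.
batch_summary = reduce(merge_counts, records, {})

# Incremental style: fold each record into the previous summary as it arrives.
incremental_summary = {}
for record in records:
    incremental_summary = merge_counts(incremental_summary, record)
```

Because the logic holds no external state, both paths produce the same result from the same input.
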
![Page 37: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/37.jpg)
Getting complicated?
- Incremental logic is complex and error prone
- Use MapReduce as a failsafe
![Page 38: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/38.jpg)
Reprocess during uptime
- Deploy new incremental processing logic
- “Older” timestamps produced by MapReduce
- The most recently written cell in HBase need not be the logical newest

| Row Key | Document Family |
| --- | --- |
| document:1 | {doc, ts=50} |
| document:2 | {doc, ts=100} |

- Real-time incremental update: {doc, ts=300}
- MapReduce outputs: {doc, ts=200}, {doc, ts=200}
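
The timestamp behavior described here can be sketched with a tiny versioned cell. The `VersionedCell` class is a hypothetical in-memory stand-in, not the HBase client API; it only mimics the idea that a read returns the version with the highest timestamp.

```python
class VersionedCell:
    """In-memory stand-in for a cell that keeps versions keyed by timestamp."""

    def __init__(self):
        self._versions = {}

    def put(self, timestamp, value):
        self._versions[timestamp] = value

    def latest(self):
        """Return the value with the highest timestamp, regardless of write order."""
        return self._versions[max(self._versions)]

cell = VersionedCell()
cell.put(100, "original")
cell.put(300, "real-time update")       # incremental update is written first
cell.put(200, "reprocessed by batch")   # batch output is written later, with an older timestamp
current = cell.latest()
```

The batch job can therefore backfill corrected data while the system is live without overwriting fresher real-time results.
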
![Page 39: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/39.jpg)
Completing the Picture
![Page 40: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/40.jpg)
Completing the Picture
![Page 41: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/41.jpg)
Building indexes with MapReduce
- A shard per task
- Build index in Hadoop
- Copy to index hosts
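
The slides do not say how documents are assigned to shards. One plausible approach, shown here purely as an assumption, is a stable hash of the document id so that each task builds exactly one shard; `shard_for` and `partition` are hypothetical names.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Pick a shard with a stable hash so every run assigns the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def partition(doc_ids, num_shards):
    """Group document ids into one list per shard (one list per index-building task)."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

docs = [f"document:{n}" for n in range(20)]
shards = partition(docs, 4)
```
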
![Page 42: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/42.jpg)
Pushing incremental updates
- POST new records
- Bursts can overwhelm target hosts
- Consumers must deal with transient failures
![Page 43: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/43.jpg)
Pulling indexes from HBase
- Custom Solr plugin scans a range of HBase rows
- Time-based scan to get only updates
- Pulls items to index from HBase
- Cleanly recovers from volume spikes and transient failures
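
A time-based pull can be sketched as a scan that returns only rows newer than a high-water mark. This is an in-memory illustration, not the Solr plugin or HBase scan API; `scan_since` and the row layout are assumptions made for the example.

```python
def scan_since(rows, last_seen_ts):
    """Return keys of rows modified after last_seen_ts (oldest first) and the new high-water mark."""
    updated = sorted(
        (ts, key) for key, (ts, _doc) in rows.items() if ts > last_seen_ts
    )
    new_mark = updated[-1][0] if updated else last_seen_ts
    return [key for _ts, key in updated], new_mark

# Each row maps to (timestamp, document); values are illustrative.
rows = {
    "document:1": (50, "a"),
    "document:2": (100, "b"),
    "document:3": (300, "c"),
}
to_index, mark = scan_since(rows, last_seen_ts=50)
```

Since the indexer advances its mark only after a successful pull, a failed or delayed pull simply means the next scan covers a larger time range.
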
![Page 44: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/44.jpg)
A note on schema: simplify it!
- Heterogeneous row keys great for hardware but hard on wetware
  - Must inspect row key to know what it is
  - Mismatches tools like Pig or Hive

| Row Key | Qualifiers |
| --- | --- |
| person:1/name | <content> |
| person:1/address | <content> |
| person:1/friend:1 | <content> |
| person:1/friend:2 | <content> |
| person:2/name | <content> |
| … | |
| person:n/name | <content> |
| person:n/friend:m | <content> |
![Page 45: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/45.jpg)
Logical parent per row
- The row is the unit of locality
- Tabular layout is easy to understand
- No lost efficiency for most cases
- HBase Schema Design, Ian Varley at HBaseCon

| Row Key | Qualifiers |
| --- | --- |
| person:1 | name:<…> address:<…> friend:1:<…> friend:2:<…> |
| person:2 | name:<…> address:<…> friend:1:<…> |
| . . . | |
| person:n | name:<…> address:<…> friend:1:<…> |
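
The move from heterogeneous row keys to one row per logical parent can be sketched as a regrouping step. Dicts stand in for tables, the sample names are made up, and `to_parent_rows` is a hypothetical helper.

```python
def to_parent_rows(flat_rows):
    """Group 'parent/child' row keys into one row per parent, with the child as qualifier."""
    grouped = {}
    for row_key, content in flat_rows.items():
        parent, _, qualifier = row_key.partition("/")
        grouped.setdefault(parent, {})[qualifier] = content
    return grouped

flat = {
    "person:1/name": "Ann",
    "person:1/friend:1": "Bob",
    "person:2/name": "Cy",
}
wide = to_parent_rows(flat)
```

After regrouping, every row has the same meaning (one person), so a reader no longer has to parse the row key to know what a row contains.
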
![Page 46: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/46.jpg)
The path forward
![Page 47: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/47.jpg)
This pattern has been successful…but complexity is our biggest enemy
![Page 48: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/48.jpg)
We may be in the assembly language era of big data
![Page 49: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/49.jpg)
Higher-level abstractions for these patterns will emerge
It’s going to be fun
![Page 50: Hadoop, HBase, and Healthcare](https://reader036.vdocuments.site/reader036/viewer/2022062810/56815c3d550346895dca3aa5/html5/thumbnails/50.jpg)
Questions?
@ryanbrush