Accelerate Pharmaceutical R&D with MongoDB
Mongo Boston 2013
Jason Tetrault
Architect - AstraZeneca
Accelerate Pharmaceutical R&D with
Big Data and MongoDB
AstraZeneca at a glance
We are a global, innovation-led biopharmaceutical company
with a mission to make a meaningful difference to patient health
through great medicines and a belief that health connects us all
Global
• 57,000 people
• Sales in 100 countries
• Manufacturing in 16
• R&D across 3 continents
• $4 bn invested in R&D
• $33 bn sales in 2011
Constantly anticipating and adapting to the needs of a changing world.
Targeted
• Cancer
• Cardiovascular
• Gastrointestinal
• Infection
• Neuroscience
• Respiratory & inflammation
Driving continued innovation where we can make the most difference.
Collaborative
• HCPs
• Patients
• Payers
• Regulators
• Partners
• Local communities
Connecting with others to achieve common goals in improving healthcare.
Committed to driving business success responsibly
Architect: R&D Information
What does this mean?
• Support the Researchers
  • AstraZeneca has multiple iMeds that are focused on different areas of R&D
  • Specifically, I work with the Oncology and Infection iMeds here in Waltham
  • Support different software and system builds and/or purchases
  • Looking to apply new technologies to enable Researchers
• Core Focus:
  • Next Generation Sequencing Scaling
  • IAAS
  • Big Data Pilots and Exploration
Introduction of Disruptive Technology:
Step 1: Introduce Concepts
• What
  • Unstructured Data
  • NoSQL
    • Categories (Document, Key Value, Graph)
  • Hadoop
  • Map Reduce
  • Horizontal Scalability
  • Cloud (IAAS and SAAS)
• How
  • Lunch and Learns
  • Examples (Craigslist uses this)
  • “Big Cookies for Big Data”
  • Demonstrations
Introduction of Disruptive Technology:
Step 2: Pilots
• Goals:
  • We needed to show what “Unstructured Data” actually means.
  • We needed to prove what these technologies can and cannot do for us.
  • Find something difficult and make it easy!
  • We needed to find the best way to enable researchers.
How quickly can I make indirect associations between gene sequence
features and structural fingerprints?
Iterative Agile Analytics
[Diagram labels]
Stages: Gather, Decorate, Aggregate, Analyze
Data sources: Compound, AssayResults, Target mappings, GeneCatalog
JSON:
• Compound with Fingerprints
• Gene sequence
• Target mappings
• Assay results
Pivot (Map Reduce): Fingerprint with compounds
Matrix: Tanimoto matrix, Gene matrix
• Easily convert to JSON and import an initial cut of data from different sources (e.g.
spreadsheets, RDBMS, …)
• Embrace unstructured data, massage it into a more useful format: Rinse, Wash, Repeat!
• Ability to decorate data, adding fields and additional datastores quickly
• 300K compounds: 200 GB
• 1.4M fingerprints: 1 GB
• 500M pairs: 81 GB
Now scale up to 4M compounds, 20K assays … and more decoration: 5 to 50 TB
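The Tanimoto matrix named on this slide is built from pairwise fingerprint similarities. As a minimal sketch, assuming each fingerprint is stored as an array of "on" bit positions (the slides do not say how the pilot stored them), the coefficient is the number of shared bits divided by the number of distinct bits across both fingerprints. The compound names and bit values below are made up for illustration.

```javascript
// Tanimoto similarity between two fingerprints, each given as an array of
// "on" bit positions. This representation is an assumption for illustration.
function tanimoto(bitsA, bitsB) {
  const a = new Set(bitsA);
  const b = new Set(bitsB);
  let shared = 0;
  for (const bit of a) {
    if (b.has(bit)) shared += 1;
  }
  const union = a.size + b.size - shared;
  // Two empty fingerprints share nothing; treat their similarity as 0.
  return union === 0 ? 0 : shared / union;
}

// Full pairwise matrix for a list of compounds (quadratic in list length).
function tanimotoMatrix(compounds) {
  return compounds.map((row) =>
    compounds.map((col) => tanimoto(row.fingerprint, col.fingerprint))
  );
}

// Hypothetical compounds, not taken from the pilot data.
const compounds = [
  { name: "cmpd-1", fingerprint: [1, 2, 3] },
  { name: "cmpd-2", fingerprint: [2, 3, 4] },
  { name: "cmpd-3", fingerprint: [7, 8] },
];
const matrix = tanimotoMatrix(compounds);
```

Because the matrix grows with the square of the number of compounds, a pairwise store reaches hundreds of millions of entries quickly, which is in line with the pair count quoted on the slide.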
Introduction of Disruptive Technology:
Pilot Findings
• Tech Findings:
  • GSON can help with weird character conversions.
  • Per-node write limits (500 per second), but you can save a bunch of documents at once (change to bulk insert).
  • Users think that even though they could do it relationally, this was way quicker.
  • Using arrays for multiple results in a doc can be interesting.
  • JSON and JavaScript are fairly natural to technical researchers (python).
• We are not alone…
  • Davy Suvee
  • tranSMART
  • Seven Bridges
  • …
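The bulk-insert finding above can be sketched as follows: rather than saving one document per call, group documents into batches and hand each batch to the database in a single call. This is only an illustration. The batch size of 1000, the `compoundId` field, and the `insertBatch` callback are assumptions, with the callback standing in for whatever bulk insert the driver offers.

```javascript
// Split an array of documents into fixed-size batches.
function toBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Send each batch through `insertBatch`, a hypothetical callback that stands
// in for the real bulk insert. Returns the number of calls that were made.
function saveInBatches(docs, batchSize, insertBatch) {
  let calls = 0;
  for (const batch of toBatches(docs, batchSize)) {
    insertBatch(batch);
    calls += 1;
  }
  return calls;
}

// Made-up documents; the recorder below only notes each batch's size.
const docs = Array.from({ length: 2500 }, (_, i) => ({ compoundId: i }));
const received = [];
const calls = saveInBatches(docs, 1000, (batch) => received.push(batch.length));
```

In the mongo shell, the callback could pass the batch straight to `insert`, which accepts an array of documents, for example `(batch) => db.compounds.insert(batch)`, where `compounds` is a made-up collection name.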
Next Generation Sequencing:
Driving Question:
How many other cancer types
that I have processed have the
same variation as the cancer
type I am working on?
Can we predict which drug is
most effective against
specific tumors?
Fairly Inaccurate Overview of Genetics
Processing
A 2-Minute Oversimplification of a Really Hard
Problem
Fairly Inaccurate Overview of Genetics
Processing
Sequencing
Fairly Inaccurate Overview of Genetics
Processing
Alignment
HG19
Fairly Inaccurate Overview of Genetics
Processing
Downstream Processing (Variant)
HG19
Can I Process 88 Whole Human Genomes?
Researcher: I would like to process 88 public genomic samples from cancer patients. They are whole human
genomes. Each patient has 2 genomic sequences, one from the tumor and one from a normal cell.
Amazon StarCluster
Elastic HPC Infrastructure
Shared Storage
Scripts, programs, reference
Elastic Node Expansion
Compute
Local Storage Processing
Result offload to S3
Transition to Glacier
Tech:
• 200 GB raw uncompressed fastq per
experiment
• 176 Genome Pipelines to process
• Each “pipeline” runs on an m1.xlarge
• We ran 4 runs of ~3.5 days on 50 nodes
• Total processed data in the pipeline may be 5X
per experiment
• Could expand to 10X or more for more complex
pipelines
• ~86 GB result average to save
• Stored in S3 / Glacier
• Totals:
• ~171 TB Total Processed Storage
• ~14,784 hours of processing
• ~15 TB of results
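The totals above can be reproduced from the figures on this slide. The sketch below assumes one pipeline per genomic sample (88 patients, 2 samples each), each pipeline occupying one node for about 3.5 days, and 1024 GB per TB. These readings are inferred from the numbers, not stated on the slide.

```javascript
// Figures taken from the slide.
const patients = 88;
const samplesPerPatient = 2;      // one tumor genome, one normal genome
const rawGbPerSample = 200;       // raw uncompressed fastq per experiment
const processedFactor = 5;        // processed data may be 5X the raw size
const resultGbPerSample = 86;     // average result size to save
const hoursPerPipeline = 3.5 * 24;
const runs = 4;
const nodesPerRun = 50;

// Assumption: the slide's TB figures use 1024 GB per TB.
const GB_PER_TB = 1024;

const pipelines = patients * samplesPerPatient;
const processedTb = (pipelines * rawGbPerSample * processedFactor) / GB_PER_TB;
const computeHours = pipelines * hoursPerPipeline;
const resultTb = (pipelines * resultGbPerSample) / GB_PER_TB;
const nodeSlots = runs * nodesPerRun;

console.log({ pipelines, processedTb, computeHours, resultTb, nodeSlots });
```

Under these assumptions the script gives 176 pipelines, about 171.9 TB of processed storage, 14,784 hours of processing, and about 14.8 TB of results, with 200 node slots across the four runs to cover the 176 pipelines.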
A Possible Vision for Experiment Management Services
[Diagram labels]
Components: Inbound, GenePattern, Seven Bridges, Storage, Big Data Store, Partners
Services:
• Genome Upload / Curation
• Pipeline Engines
• Long Term Storage
• Experiment Management / Metadata Management
• Partner Integration
• Big Data Storage and Analytics
NGS Data
Explants
Tumors - FFPE
Tumors - fresh frozen
Cell lines
RNA-Seq
Expression
Variants
DNA-Seq
Amplicon
Coding and non-coding variants
Whole exome
Coding variants
Whole genome
New Target ID
Patient stratification
Biomarkers for
prognosis, drug
response, safety
Mechanism of drug
action
Mechanism of disease
Let's look at a Variant …
Another Area Mongo May Help
VCF Format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
VCF as JSON
Header and Variant Information
{
  "_id" : ObjectId("52617b613004b77f64efed62"),
  "ALT" : [ "A" ],
  "QUAL" : "29",
  "NA00001" : "0|0:48:1:51,51",
  "POS" : 14370,
  "NA00002" : "1|0:48:8:51,51",
  "FILTER" : "PASS",
  "CHROM" : "20",
  "NA00003" : "1/1:43:5:.,.",
  "FORMAT" : "GT:GQ:DP:HQ",
  "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
  "ID" : "rs6054257",
  "INFO" : {
    "DP" : "14",
    "AF" : "0.5",
    "NS" : "3"
  },
  "REF" : "G"
}
{
  "_id" : ObjectId("52617b613004b77f64efed67"),
  "phasing" : "partial",
  "fileformat" : "VCFv4.1",
  "fileDate" : "20090805",
  "source" : "myImputationProgramV3.1",
  "FORMAT" : {
    "Description" : "\"Haplotype Quality\"",
    "Type" : "Integer",
    "Number" : "2",
    "ID" : "HQ"
  },
  "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
  "contig" : {
    "species" : "\"Homo sapiens\"",
    "assembly" : "B36",
    "md5" : "f126cdf8a6e0c7f379d618ff66beb2da",
    "length" : "62435964",
    "ID" : "20",
    "taxonomy" : "x"
  },
  "INFO" : {
    "Description" : "\"HapMap2 membership\"",
    "Type" : "Flag",
    "Number" : "0",
    "ID" : "H2"
  },
  "reference" : "file:///seq/references/1000GenomesPilot-NCBI36.fasta",
  "FILTER" : {
    "Description" : "\"Less than 50% of samples have data\"",
    "ID" : "s50"
  }
}
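To illustrate how a VCF data line could map onto a variant document of the shape shown above, here is a minimal sketch. It is not the converter used for these slides. The function name `parseVcfDataLine` is hypothetical, and details such as keeping `QUAL` and the `INFO` values as strings and dropping valueless `INFO` flags are assumptions chosen to mirror the sample document.

```javascript
// The nine fixed VCF columns, in order. Sample columns follow them.
const COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"];

// Convert one VCF data line into a plain object resembling the variant
// document on the slide. `sampleNames` comes from the #CHROM header line.
function parseVcfDataLine(line, sampleNames) {
  const fields = line.trim().split(/\s+/); // VCF is tab-delimited; spaces also accepted here
  const doc = {};
  COLUMNS.forEach((name, i) => {
    doc[name] = fields[i];
  });
  doc.POS = Number(doc.POS);    // numeric, so range queries on POS work
  doc.ALT = doc.ALT.split(","); // one array entry per alternate allele

  // Keep only key=value INFO entries. Valueless flags (e.g. DB, H2) are
  // skipped, which is an assumption based on the sample document.
  const info = {};
  for (const entry of doc.INFO.split(";")) {
    const [key, value] = entry.split("=");
    if (value !== undefined) info[key] = value;
  }
  doc.INFO = info;

  // Per-sample genotype strings are stored under the sample name.
  sampleNames.forEach((name, i) => {
    doc[name] = fields[COLUMNS.length + i];
  });
  return doc;
}

const doc = parseVcfDataLine(
  "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.",
  ["NA00001", "NA00002", "NA00003"]
);
```

Storing `POS` as a number and `ALT` as an array is what lets a range filter on position and an equality match on a single alternate allele work against these documents.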
Query
Search Variant Ranges
// Range definition: positions with begin <= POS < end
var begin = 10000;
var end = 10200;
// Chromosome names vary in format, so match with a regex
// (any CHROM value ending in "17" matches)
var chromosome = ".*17$";
var variant = "A";
// Query for all variants on the chromosome within the range
db.publicvariants.find({
  "POS": {$gte: begin, $lt: end},
  "CHROM": {$regex: chromosome}
});
db.variants.find({
  "POS": {$gte: begin, $lt: end},
  "CHROM": {$regex: chromosome}
});
// Query for a specific variant (alternate allele) within the range
db.publicvariants.find({
  "POS": {$gte: begin, $lt: end},
  "CHROM": {$regex: chromosome},
  "ALT": variant
});
db.variants.find({
  "POS": {$gte: begin, $lt: end},
  "CHROM": {$regex: chromosome},
  "ALT": variant
});
Wrap Up and Panel
• Thanks
  • Todd Nelson, Rajan Desai
  • Sebastien Lefebvre, Robin Brouwer
  • Sara Dempster
• Panel
  • Deniz Kural: Founder and CEO – Seven Bridges
• Code:
  • https://github.com/jjtetrault/bio-mongo
The Panel
…