Accelerate Pharmaceutical R&D with MongoDB

Mongo Boston 2013

Jason Tetrault

Architect - AstraZeneca

Accelerate Pharmaceutical R&D with Big Data and MongoDB

AstraZeneca at a glance

We are a global, innovation-led biopharmaceutical company with a mission to make a meaningful difference to patient health through great medicines and a belief that health connects us all

Global Targeted Collaborative

Committed to driving business success responsibly

57,000 people

Sales in 100 countries

Manufacturing in 16

R&D across 3 continents

$4 bn invested in R&D

$33 bn sales in 2011

Constantly anticipating and adapting to the needs of a changing world.

Cancer

Cardiovascular

Gastrointestinal

Infection

Neuroscience

Respiratory & inflammation

Driving continued innovation where we can make the most difference.

HCPs

Patients

Payers

Regulators

Partners

Local communities

Connecting with others to achieve common goals in improving healthcare.

Architect: R&D Information

What does this mean?

• Support the Researchers
  • AstraZeneca has multiple iMeds that are focused on different areas of R&D
  • Specifically, I work with the Oncology and Infection iMeds here in Waltham
  • Support different software and system builds and/or purchases
  • Looking to apply new technologies to enable Researchers
• Core Focus:
  • Next Generation Sequencing Scaling
  • IaaS
  • Big Data Pilots and Exploration

Introduction of Disruptive Technology:

Step 1: Introduce Concepts

• What
  • Unstructured Data
  • NoSQL
    • Categories (Document, Key Value, Graph)
  • Hadoop
  • Map Reduce
  • Horizontal Scalability
  • Cloud (IaaS and SaaS)
• How
  • Lunch and Learns
  • Examples (Craigslist uses this)
  • “Big Cookies for Big Data”
  • Demonstrations


Step 2: Pilots

• Goals:
  • We needed to show what “Unstructured Data” actually means.
  • We needed to prove what these technologies can and cannot do for us.
  • Find something difficult and make it easy!
  • We needed to find the best way to enable researchers.

How quickly can I make indirect associations between gene sequence features and structural fingerprints?


Iterative Agile Analytics

Diagram stages: Gather, Aggregate, Analyze, Decorate

Data sources
• Compound
• Assay Results
• Target mappings
• Gene Catalog

JSON
• Compound with Fingerprints
• Gene sequence
• Target mappings
• Assay results

Pivot
• Fingerprint with compounds

Map Reduce
• Tanimoto matrix
• Gene matrix

Matrix

• Easily convert to JSON and import an initial cut of data from different sources (e.g. spreadsheets, RDBMS, …)
• Embrace unstructured data, massage it into a more useful format: Rinse, Wash, Repeat!
• Ability to decorate data, adding fields and additional datastores quickly

Pilot data sizes: 300K compounds – 200 GB; 1.4M fingerprints – 1 GB; 500M pairs – 81 GB

Now scale up to 4M compounds, 20K assays … and more decoration – 5 to 50 TB
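
The Pivot and Tanimoto matrix steps listed above can be sketched in plain JavaScript. This is a minimal illustration, not the pilot's code: the document shape (`_id`, `fingerprints`), the sample values, and the function names are assumptions, and the pivot is written as in-memory map and reduce steps rather than as a MongoDB map-reduce job. The Tanimoto coefficient of two sets is the size of their intersection divided by the size of their union.

```javascript
// Hypothetical compound documents: each lists the structural fingerprint
// keys present in the compound (shape and values are assumptions).
const compounds = [
  { _id: "C1", fingerprints: ["fp1", "fp2", "fp3"] },
  { _id: "C2", fingerprints: ["fp2", "fp3", "fp4"] },
  { _id: "C3", fingerprints: ["fp5"] },
];

// Map step: emit one [fingerprint, compound id] pair per array element.
function mapCompound(doc) {
  return doc.fingerprints.map((fp) => [fp, doc._id]);
}

// Reduce step: group the emitted pairs by fingerprint, which turns
// "compound with fingerprints" into "fingerprint with compounds".
function pivot(docs) {
  const byFingerprint = new Map();
  for (const [fp, id] of docs.flatMap(mapCompound)) {
    if (!byFingerprint.has(fp)) byFingerprint.set(fp, []);
    byFingerprint.get(fp).push(id);
  }
  return byFingerprint;
}

// Tanimoto coefficient: |A ∩ B| / |A ∪ B| over two fingerprint sets.
function tanimoto(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  for (const fp of setA) if (setB.has(fp)) shared += 1;
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

const fingerprintIndex = pivot(compounds);
const similarity = tanimoto(compounds[0].fingerprints, compounds[1].fingerprints);
```

Here C1 and C2 share two of four distinct fingerprints, so their coefficient is 0.5. Computing it for every pair is what makes the pair count, and the storage, grow so quickly as compounds are added.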

http://www.mongodb.org/




Pilot Findings

• Tech Findings:
  • GSON can help with weird character conversions.
  • Per-node write limits (500 per second), but you can save a bunch of documents at once (change to bulk insert).
  • Users think that even though they could do it relationally, this was way quicker.
  • Using arrays for multiple results in a doc can be interesting.
  • JSON and JavaScript are fairly natural to technical researchers (Python).
• We are not alone…
  • Davy Suvee
  • tranSMART
  • Seven Bridges
  • …
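
Returning to the bulk insert finding in the tech list above, the idea can be illustrated with a small batching helper. This is a sketch only: `toBatches`, `loadInBatches`, the batch size of 1000, and the `insertBatch` callback are hypothetical names and values. In a real load, the callback would hand each array to the driver's bulk insert call; here an in-memory array stands in for the collection.

```javascript
// Split an array of documents into fixed-size batches so that each batch
// can be written with one bulk insert instead of one write per document.
function toBatches(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical loader: insertBatch stands in for the real bulk insert call.
// Returns the number of write operations issued.
function loadInBatches(docs, batchSize, insertBatch) {
  let writes = 0;
  for (const batch of toBatches(docs, batchSize)) {
    insertBatch(batch);
    writes += 1;
  }
  return writes;
}

// Example with an in-memory stand-in for a collection.
const stored = [];
const docs = Array.from({ length: 2500 }, (_, i) => ({ compoundId: i }));
const writeCount = loadInBatches(docs, 1000, (batch) => stored.push(...batch));
```

With these numbers, 2,500 documents go out in 3 writes rather than 2,500, which is how a per-node limit on write operations stops being the bottleneck.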

Next Generation Sequencing:

Driving Question:

How many other cancer types that I have processed have the same variation as the cancer type I am working on?

Can we predict which drug is most effective against specific tumors?

Fairly Inaccurate Overview of Genetics

Processing

A 2-Minute Oversimplification of a Really Hard Problem



Processing

Sequencing


Processing

Alignment


HG19


Processing

Down Stream Processing (Variant)


HG19

Can I Process 88 Whole Human Genomes?

Researcher: I would like to process 88 public genomic samples from cancer patients. They are whole human genomes. Each patient has 2 genomic sequences, one from the tumor and one from a normal cell.

Amazon StarCluster

Elastic HPC Infrastructure

Shared Storage

Scripts, programs, reference

Elastic Node Expansion

Compute

Local Storage Processing

Result offload to S3

Transition to Glacier

Tech:

• 200 GB raw uncompressed fastq per experiment
• 176 genome pipelines to process
• Each “pipeline” runs on an m1.xlarge
• We ran 4 runs of ~3.5 days on 50 nodes
• Total processed data in the pipeline may be 5X per experiment
  • Could expand to 10X or more for more complex pipelines
• ~86 GB result average to save
  • Stored in S3 / Glacier
• Totals:
  • ~171 TB total processed storage
  • ~14,784 hours of processing
  • ~15 TB of results
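
The totals above can be reproduced from the per-experiment figures. The calculation below is a worked check, not something from the deck. It assumes one pipeline per genome, each running for the full ~3.5 days, and a conversion of 1 TB = 1024 GB; under those assumptions the numbers line up with the slide.

```javascript
// Figures taken from the slide.
const patients = 88;
const genomesPerPatient = 2;     // one tumor and one normal sequence
const rawGbPerGenome = 200;      // raw uncompressed fastq per experiment
const processingMultiplier = 5;  // processed data may be 5X the raw size
const daysPerPipeline = 3.5;
const resultGbPerGenome = 86;

const pipelines = patients * genomesPerPatient;                        // 176
const processedGb = pipelines * rawGbPerGenome * processingMultiplier; // 176000
const processingHours = pipelines * daysPerPipeline * 24;              // 14784
const resultGb = pipelines * resultGbPerGenome;                        // 15136

// Assumption: the slide's TB figures use 1 TB = 1024 GB.
const processedTb = processedGb / 1024; // about 171.9
const resultTb = resultGb / 1024;       // about 14.8
```

Four runs on 50 nodes give 200 pipeline slots, enough for the 176 genomes.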

Diagram labels: Partners, Storage, GenePattern, Big Data Store, Inbound, Seven Bridges

• Genome Upload / Curation
• Pipeline Engines
• Long Term Storage
• Experiment Management / Metadata Management
• Partner Integration
• Big Data Storage and Analytics

A Possible Vision for Experiment Management Services

NGS Data

Explants

Tumors – FFPE

Tumors – fresh frozen

Cell lines

RNA-Seq

Expression

Variants

DNA-Seq

Amplicon

Coding and non-coding variants

Whole exome

Coding variants

Whole genome

New Target ID

Patient stratification

Biomarkers for prognosis, drug response, safety

Mechanism of drug action

Mechanism of disease


Let's look at a Variant …

Another Area Mongo May Help


VCF Format


##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

VCF as JSON

Header and Variant Information


{
  "_id" : ObjectId("52617b613004b77f64efed62"),
  "ALT" : [ "A" ],
  "QUAL" : "29",
  "NA00001" : "0|0:48:1:51,51",
  "POS" : 14370,
  "NA00002" : "1|0:48:8:51,51",
  "FILTER" : "PASS",
  "CHROM" : "20",
  "NA00003" : "1/1:43:5:.,.",
  "FORMAT" : "GT:GQ:DP:HQ",
  "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
  "ID" : "rs6054257",
  "INFO" : {
    "DP" : "14",
    "AF" : "0.5",
    "NS" : "3"
  },
  "REF" : "G"
}

{
  "_id" : ObjectId("52617b613004b77f64efed67"),
  "phasing" : "partial",
  "fileformat" : "VCFv4.1",
  "fileDate" : "20090805",
  "source" : "myImputationProgramV3.1",
  "FORMAT" : {
    "Description" : "\"Haplotype Quality\"",
    "Type" : "Integer",
    "Number" : "2",
    "ID" : "HQ"
  },
  "__vcfid" : "40770f6f-165a-4930-8092-05e98e4e0b27",
  "contig" : {
    "species" : "\"Homo sapiens\"",
    "assembly" : "B36",
    "md5" : "f126cdf8a6e0c7f379d618ff66beb2da",
    "length" : "62435964",
    "ID" : "20",
    "taxonomy" : "x"
  },
  "INFO" : {
    "Description" : "\"HapMap2 membership\"",
    "Type" : "Flag",
    "Number" : "0",
    "ID" : "H2"
  },
  "reference" : "file:///seq/references/1000GenomesPilot-NCBI36.fasta",
  "FILTER" : {
    "Description" : "\"Less than 50% of samples have data\"",
    "ID" : "s50"
  }
}
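
A VCF data line could be mapped to a variant document of the shape shown above roughly as follows. This is a hedged sketch, not the code from the bio-mongo repository: the function name `vcfLineToDoc` is hypothetical, fields are assumed to be whitespace-separated, and INFO flags without a value (such as DB and H2) are skipped, which matches the sample document. The `_id` and `__vcfid` fields are left out.

```javascript
// Convert one VCF data line into a plain object, given the column names
// from the "#CHROM ..." header line.
function vcfLineToDoc(headerLine, dataLine) {
  const columns = headerLine.replace(/^#/, "").trim().split(/\s+/);
  const values = dataLine.trim().split(/\s+/);
  const doc = {};
  columns.forEach((name, i) => {
    doc[name] = values[i];
  });

  doc.POS = parseInt(doc.POS, 10); // numeric, so range queries work
  doc.ALT = doc.ALT.split(",");    // one array entry per alternate allele

  // INFO is a ';'-separated list of key=value pairs; bare flags are skipped.
  const info = {};
  for (const entry of doc.INFO.split(";")) {
    const eq = entry.indexOf("=");
    if (eq > 0) info[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  doc.INFO = info;
  return doc;
}

const header =
  "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003";
const line =
  "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ " +
  "0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.";
const variantDoc = vcfLineToDoc(header, line);
```

Storing POS as a number and ALT as an array is what lets a range query and a single-allele match work without any string handling at query time.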

Query

Search Variant Ranges


// Here is our range definition
var begin = 10000;
var end = 10200;

// The chromosome value is fuzzy in format, so we use a regex
var chromosome = ".*17$";
var variant = "A";

// Query for range and chromosome position.
db.publicvariants.find(
  {"POS": {$gte: begin, $lt: end},
   "CHROM": {$regex: chromosome}
  })
db.variants.find(
  {"POS": {$gte: begin, $lt: end},
   "CHROM": {$regex: chromosome}
  })

// Query for a specific variant in a range
db.publicvariants.find(
  {"POS": {$gte: begin, $lt: end},
   "CHROM": {$regex: chromosome},
   "ALT": variant})
db.variants.find(
  {"POS": {$gte: begin, $lt: end},
   "CHROM": {$regex: chromosome},
   "ALT": variant})

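To make the filter semantics of the queries above concrete, the same conditions can be applied to an in-memory array. This is an illustrative stand-in, not MongoDB itself: `matchesRange` and `matchesVariant` are hypothetical helpers and the sample documents are invented. It relies on the fact that an equality match against an array field such as ALT succeeds when any element equals the value.

```javascript
const begin = 10000;
const end = 10200;
const chromosome = new RegExp(".*17$"); // matches both "17" and "chr17"
const variant = "A";

// Invented sample documents in the shape of the variant collection.
const sample = [
  { CHROM: "chr17", POS: 10050, ALT: ["A"] },
  { CHROM: "17", POS: 10199, ALT: ["G", "T"] },
  { CHROM: "17", POS: 10200, ALT: ["A"] },   // excluded: POS is not < end
  { CHROM: "chr7", POS: 10100, ALT: ["A"] }, // excluded: wrong chromosome
];

// Mirrors {"POS": {$gte: begin, $lt: end}, "CHROM": {$regex: chromosome}}.
function matchesRange(doc) {
  return doc.POS >= begin && doc.POS < end && chromosome.test(doc.CHROM);
}

// Mirrors the additional "ALT": variant condition.
function matchesVariant(doc) {
  return matchesRange(doc) && doc.ALT.includes(variant);
}

const inRange = sample.filter(matchesRange);
const withVariant = sample.filter(matchesVariant);
```

The range is half-open, so a variant at position 10200 falls outside it, and the regex accepts chromosome names with or without a prefix.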
Wrap Up and Panel


• Thanks
  • Todd Nelson, Rajan Desai
  • Sebastien Lefebvre, Robin Brouwer
  • Sara Dempster
• Panel
  • Deniz Kural: Founder and CEO – Seven Bridges
• Code:
  • https://github.com/jjtetrault/bio-mongo

The Panel

…

