MedChemica BigData
‘What is that ALL about?’
Al Dossetter, MedChemica Limited. Macclesfield Sci Bar, 25th April 2016
Big Data – ‘What is that all about?’
• Introduction to Big Data
• Examples from History
• Big Data and science
• MedChemica – advancing drug design through actionable knowledge
About Us
Passionate about generating better decisions from data
• Dr Andrew G. Leach, Technical Director, Liverpool John Moores. 12 years experience. Applied computational and medicinal chemistry
• Dr Ed Griffen, Technical Director. 21 years experience. Medicinal chemistry and large scale statistical analysis methods
• Dr Al Dossetter, Managing Director. 17 years. Medicinal chemistry and extensive cloud computing experience
• Dr Ali Griffen, Business Analyst. PhD Fungal Vascular wilt disease. 21 years experience. Team leader bioscientist and biological data curation
• Dr Shane Montague, Lead Data Scientist. PhD Computer Science. 13 years experience. Data science and information security
• Dr Jia Wu, Consultant Data Scientist. PhD Machine Learning. 12 years experience in data mining and machine learning. Projects in finance, energy and criminology.
Best Definition of Big Data
• Any analysis of a data set that is too large to do by hand
– Requires computational techniques
– Requires statistical techniques
• Yields
– Knowledge
– Knowledge that can be counterintuitive
It got ‘Big’ because:
- The internet made a lot of data available very quickly (often for free)
It got interesting because:
- Knowledge yields real benefits to the bottom line
- Reduced costs or increased sales
You the consumer benefit…
- Cheaper goods, available online
- Flights on time, trains on time, deliveries on time
Big Data “The Revolution that will change the world we live in”
• Principles of Big Data
– Use ALL of the data
• however noisy
– Analyse in an unbiased way
– “DO WHAT” it tells you
• Do not worry about “WHY”
– KEEP everything
• ‘you never know what question you want to ask’
The 4 Vs
• Picture from Google or someone
• What does it mean?
• Mostly it is about using lots of computers
Most issues are sorted out by more CPUs, more drive space, and better stats
It's actually been around quite a while…
• It was genius to break the codes
• It was further genius to collate the data and reduce it so that analysts could use it in a timely manner (volume / velocity / veracity)
• …saved many, many lives on both sides
….and banking, finance and trading
What do Nappies and Beer have in common?
• Analysis of shopping habits found these two things were bought together
• Put them close together in the store and sell more
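The shopping-habits analysis above can be sketched as a count of how often two items share a basket, compared with how often that would happen if purchases were unrelated (a ratio usually called "lift"). This is only a minimal illustration: the baskets are invented, and `pair_lift` is a hypothetical helper name, not something from the talk.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, made up for illustration only.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "beer", "milk"},
    {"milk", "bread"},
    {"bread", "crisps"},
    {"milk", "bread", "crisps"},
]

def pair_lift(baskets):
    """Return {(a, b): lift} for every pair of items bought together at least once.

    lift = P(a and b in the same basket) / (P(a) * P(b)).
    A lift above 1 means the pair turns up together more often than chance.
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }

lift = pair_lift(baskets)
best_pair = max(lift, key=lift.get)
print(best_pair, round(lift[best_pair], 2))
```

With these made-up baskets the strongest pairing is beer and nappies, which is the kind of counterintuitive result the slide describes. A real analysis would also need enough baskets for the counts to be statistically meaningful.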
UPS delivery service
• Fitted sensors to all delivery trucks and gathered data
• Analysed data to detect early engine issues BEFORE breakdown
• Therefore FIX early and keep the van on the road
• The customer benefits because:
• Deliveries on time
• Even larger dataset – high degree of prediction of delivery times
Jet Engines – reliable service
• Sensors on jet engines – monitored in flight
• Similar to UPS
• Therefore FIX early and keep the planes in the air
• The customer benefits because:
• Flights on time and reliable
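The "detect it before it breaks" idea behind the UPS and jet engine examples can be sketched as a comparison of each new sensor reading against a baseline of known-good readings. This is a toy under stated assumptions: the temperatures are invented, `first_alert` is a hypothetical name, and real monitoring systems use far richer models than a single threshold.

```python
from statistics import mean, stdev

def first_alert(readings, baseline_size=5, threshold=3.0):
    """Return the index of the first reading that lies more than `threshold`
    standard deviations from the baseline mean, or None if nothing is flagged.

    The first `baseline_size` readings are assumed to come from a healthy engine.
    """
    baseline = readings[:baseline_size]
    centre, spread = mean(baseline), stdev(baseline)
    for index, value in enumerate(readings[baseline_size:], start=baseline_size):
        if abs(value - centre) > threshold * spread:
            return index
    return None

# Hypothetical engine temperatures in degrees C, invented for illustration.
temperatures = [90.0, 91.0, 89.0, 90.0, 90.0, 90.5, 91.0, 94.0, 97.0, 101.0]
print(first_alert(temperatures))
```

Here the alert fires at the reading of 94.0, several readings before the temperature reaches its worst value, which is the window in which the vehicle could be pulled in for an early fix.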
Google Translate – The Unreasonable Effectiveness of Data
“Because of a huge shared cognitive and cultural context, linguistic expression can be highly ambiguous and still often be understood correctly.”
• https://en.wikipedia.org/wiki/File:Google_Translate_Icon.png
• https://en.wikipedia.org/wiki/Google_Translate
• https://www.youtube.com/watch?v=yvDCzhbjYWs
• University of British Columbia Distinguished Lecture Series - Sept 23rd 2011
• Groups or pairs of words are associated together on websites around the internet
• Statistical analysis of the frequency of pairing
• Therefore this word (or group) probably translates into this word
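The pairing idea can be illustrated with a toy: count how often a word in one language appears alongside a word in the other, then pick the partner with the strongest association. This is not how Google Translate is implemented; the sentence pairs are invented, and the Dice score is just one simple association measure chosen for the sketch.

```python
from collections import Counter

# Hypothetical aligned sentence pairs (English, French), invented for illustration.
pairs = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
    ("a dog", "un chien"),
]

def likely_translations(pairs):
    """Map each source word to the target word it is most strongly paired with."""
    source_counts, target_counts, joint_counts = Counter(), Counter(), Counter()
    for source, target in pairs:
        source_words, target_words = set(source.split()), set(target.split())
        source_counts.update(source_words)
        target_counts.update(target_words)
        joint_counts.update((s, t) for s in source_words for t in target_words)

    def dice(s, t):
        # 1.0 when the two words always appear together, 0.0 when they never do.
        return 2 * joint_counts[(s, t)] / (source_counts[s] + target_counts[t])

    return {s: max(target_counts, key=lambda t: dice(s, t)) for s in source_counts}

translations = likely_translations(pairs)
print(translations)
```

No grammar or dictionary is supplied: the mapping falls out of the counts alone, which is the "do what the data tells you" principle from earlier in the talk.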
What about science? We need to be accurate (don’t we?)
• The Large Hadron Collider shows how we can gather a lot of data very accurately
• A large amount of data is needed to reduce the errors – very, very big data
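Why more data reduces the error can be shown with the standard error of the mean: for independent measurements with the same random scatter, the uncertainty on the average shrinks with the square root of the number of measurements. The sketch below assumes exactly that simple case; the function names are hypothetical, and the statistics of a real collider experiment are far more involved.

```python
import math

def standard_error(sigma, n):
    """Uncertainty on the mean of n independent measurements, each with scatter sigma."""
    return sigma / math.sqrt(n)

def measurements_needed(sigma, target_error):
    """Smallest number of measurements that brings the standard error down to target_error."""
    return math.ceil((sigma / target_error) ** 2)

# Each tenfold improvement in precision costs a hundredfold more data.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(1.0, n))
```

The square-root relationship is the reason the data volumes become so large: halving the error needs four times as many measurements.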
The Life Science industry has woken up to Big Data
• Human Genome
• Biological systems
• Kinome
• Metabolomics
• Proteomics
• 3D structural information (CDC / Protein Data Bank)
• Literature and Patents (GVK Bio, ChEMBL, Pubmed, PubChem)
• Reaction informatics – what works, what doesn’t
• Document management
• Regulatory submissions
Huge Opportunity in this area
What about life sciences?
• Harder and harder to discover drugs
• They have to work
• They have to be safe
• People want them cheaply
• A description of the drug research and development process
Company | Ticker | Number of drugs approved | R&D Spending Per Drug ($Mil) | Total R&D Spending 1997-2011 ($Mil)
AstraZeneca | AZN | 5 | 11,790.93 | 58,955
GlaxoSmithKline | GSK | 10 | 8,170.81 | 81,708
Sanofi | SNY | 8 | 7,909.26 | 63,274
Pfizer Inc. | PFE | 14 | 7,727.03 | 108,178
Roche Holding AG | RHHBY | 11 | 7,803.77 | 85,841
Johnson & Johnson | JNJ | 15 | 5,885.65 | 88,285
Eli Lilly & Co. | LLY | 11 | 4,577.04 | 50,347
Abbott Laboratories | ABT | 8 | 4,496.21 | 35,970
Merck & Co Inc | MRK | 16 | 4,209.99 | 67,360
Bristol-Myers Squibb Co. | BMY | 11 | 4,152.26 | 45,675
Novartis AG | NVS | 21 | 3,983.13 | 83,646
Amgen Inc. | AMGN | 9 | 3,692.14 | 33,229
Sources: InnoThink Center For Research In Biomedical Innovation; Thomson Reuters Fundamentals via FactSet Research Systems
The Truly Staggering Cost Of Inventing New Drugs Matthew Herper - Forbes
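The per-drug figure in the table is simply total R&D spending divided by the number of drugs approved over the same period. The short check below recomputes it from the table's own numbers; the small differences from the quoted values are presumably rounding in the published totals, which is an assumption on my part rather than something the source states.

```python
# (company, drugs approved, quoted spend per drug in $Mil, total R&D spend 1997-2011 in $Mil)
rows = [
    ("AstraZeneca", 5, 11790.93, 58955),
    ("GlaxoSmithKline", 10, 8170.81, 81708),
    ("Sanofi", 8, 7909.26, 63274),
    ("Pfizer Inc.", 14, 7727.03, 108178),
    ("Roche Holding AG", 11, 7803.77, 85841),
    ("Johnson & Johnson", 15, 5885.65, 88285),
    ("Eli Lilly & Co.", 11, 4577.04, 50347),
    ("Abbott Laboratories", 8, 4496.21, 35970),
    ("Merck & Co Inc", 16, 4209.99, 67360),
    ("Bristol-Myers Squibb Co.", 11, 4152.26, 45675),
    ("Novartis AG", 21, 3983.13, 83646),
    ("Amgen Inc.", 9, 3692.14, 33229),
]

def per_drug_spend(total_spend, drugs_approved):
    """Total R&D spend divided by approvals, in the same units as total_spend."""
    return total_spend / drugs_approved

for company, drugs, quoted, total in rows:
    computed = per_drug_spend(total, drugs)
    print(f"{company:<26} quoted {quoted:>10,.2f}  computed {computed:>10,.2f}")
```

Because the figure divides all spending by approvals only, it folds the cost of every failed project into each success, which is why the numbers are so much larger than the cost of any single development programme.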
Drug failures later in development are mainly due to EFFICACY and SAFETY
Actual spending – taken together, LO (lead optimisation) projects are the biggest spend
Paul, S. M. et al How to improve R&D productivity: the pharmaceutical industry’s grand challenge, Nat. Rev. Drug Discovery 2010, 9, 203
Snapshot of a medium-sized company's R&D spend in one year - $1.7 billion
For a period large pharma set targets at each stage of the process – an attrition model - unsuccessful and very wasteful
Better chemistry: reduce the number of projects
Chemistry influences success and speed: methods that really work, new formulations
What Causes Attrition in Development?
• Lack of efficacy in man 46%
• Adverse effects in man 17%
• Animal toxicity 16%
• PK 7%
• Commercial reasons 7%
• Miscellaneous 7%
Many compounds fail in development through inadequate pharmacokinetics / bioavailability and unacceptable toxicological profiles in addition to lack of efficacy in man
Oral Dosing of Drugs
[Figure: route of an orally dosed drug through the body. Steps labelled: Dissolve; Survive pH range 1.5-8; Cross Membranes; Metabolism; Avoid Excretion. Locations labelled: liver; kidneys; bladder; BBB (Blood Brain Barrier); Target (maybe in the brain)]
Absorption Distribution Metabolism Excretion Toxicity
Grand Rule database: Better medicinal chemistry by sharing knowledge not data & structures
[Figure: AZ, Roche, Genentech, Pharma 4 and Pharma 5 each run the rule finder on their own data; MedChemica holds the Grand Rule Database; each company receives the Grand Rule Database for its own exploitation]
>500 million pairs from companies + 12 million from public data
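One way the "share knowledge, not data & structures" idea could work is sketched below. Each company reduces its own matched pairs to a summary per chemical transform (how many pairs, and the average change in a measured property), and only those summaries are pooled. Everything here is an assumption made for illustration: the transform labels, the numbers, the choice of a count-weighted mean, and the function names `find_rules` and `merge_rules`. The talk does not describe how the actual rule finder is implemented.

```python
from collections import defaultdict

def find_rules(matched_pairs):
    """Summarise matched pairs as {transform: (number of pairs, mean change)}.

    Only these summaries leave the company; the compounds themselves do not.
    """
    changes = defaultdict(list)
    for transform, change in matched_pairs:
        changes[transform].append(change)
    return {t: (len(values), sum(values) / len(values)) for t, values in changes.items()}

def merge_rules(rule_sets):
    """Combine per-company summaries into one count-weighted summary per transform."""
    totals = defaultdict(lambda: [0, 0.0])
    for rules in rule_sets:
        for transform, (count, mean_change) in rules.items():
            totals[transform][0] += count
            totals[transform][1] += count * mean_change
    return {t: (n, total / n) for t, (n, total) in totals.items()}

# Hypothetical data: (transform, change in some measured property) for each matched pair.
company_a = [("H>>F", -0.5), ("H>>F", -0.25), ("CH3>>CF3", 0.5)]
company_b = [("H>>F", -0.75), ("H>>F", -0.5)]

grand_rules = merge_rules([find_rules(company_a), find_rules(company_b)])
print(grand_rules)
```

The pooled rule for a transform rests on more pairs than either company holds alone, yet neither company has seen the other's compounds. That trade-off is what the slide's headline is pointing at.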
…so what are you going to make next…?
Who is GOOD at Big Data? The people making the money!
Chemical transform to improve metabolism
Chemists who wanted to fix metabolism also made these…
[Figure: chemical structures with alternative R groups]
SaltTraX©
What about clinical safety?
SAFE DRUGS
‘Potency’ Do not sacrifice
The better it is the lower the dose
Improved testing in-vivo with fewer animals
Clinical linkage to protein target
Can test In-Vivo Anti SAR
e.g. hERG, Nav1.5, 5-HT2a…
Analysis of In-Vivo data Pfizer – rat data
<0.2mg/Kg Dose
Metabolism & Pharmacokinetics
Better design so dose is lower
Grand Rule Database
Hughes et al, Bioorg Med Chem Lett. 2008, 18(17), 4872
Collaborators and Users
The ‘Internet of Things (IoT)’
• A higher diversity of devices connected to the internet, with data flowing to and from them
• For example, smart watches
• Lifestyle device – marketed on fitness / wellness
• Like UPS vans and RR jet engines, can we detect illness pre-symptomatically?
Big Data – ‘What is that all about?’
• Introduction to Big Data
– Big enough to need a computer / advanced stats
• Examples from History
– Bletchley Park, UPS, Beer and Nappies…
• Big Data and science
– Large Hadron Collider…
• MedChemica – Advancing drug design through actionable knowledge
– Allows sharing of knowledge to accelerate and reduce costs of finding new, safe medicines