lies, damn lies, and big data

Lies, Damn Lies, and Big Data Applications, Limitations, Misconceptions Brian Bissett Senior Member Institute of Electrical and Electronics Engineers (IEEE) 03/15/2022 1

Upload: brian-bissett

Post on 30-Jul-2015


Lies, Damn Lies, and Big Data

Applications, Limitations, Misconceptions

Brian Bissett, Senior Member, Institute of Electrical and Electronics Engineers (IEEE)

04/15/2023


Overview

What is Big Data

Common Attributes of Big Data

Challenges of Working with Big Data

Validity Space

Outliers

Variance

Correlation and Causality

Summary

What is “Big Data”?

Depends on who you ask...

Gartner – defined by the “three Vs”: Volume, Velocity, and Variety.

Oracle – the derivation of value from traditional relational database-driven business decision making, augmented with new sources of unstructured data.

Intel – the generation of a median of 300 terabytes of data a week.


What is “Big Data”?

Microsoft - the process of applying serious computing power—the latest in machine learning and artificial intelligence—to seriously massive and often highly complex sets of information.

The Method for an Integrated Knowledge Environment (MIKE) project argues that big data is not a function of the size but of complexity. (A high degree of permutations and interactions within a data set defines big data.)

National Institute of Standards and Technology (NIST) - big data “exceed(s) the capacity or capability of current methods and systems.”


The Current 8 V’s of Big Data

Volume

Velocity

Variety

Value – is this worth something to someone?

Validity – is this correct?

Viability – can this stand independently?

Variability – is the same result reported consistently?

Verifiability – do we know where this came from?


The Original Three V’s Define “Big” Data

Not Actionable

The 5 P’s for Biomedical Big Data

Evidence Based, Outcome Driven, and Affordable Health Care will Require the Five P’s:

Predictive

Precise

Preventive

Personalized

Patient-Centric

The Cancer Genome Atlas (TCGA)


Challenges of Dealing with Big Data

Management – In 10 Years at Zettabyte Levels!

Infrastructure

Performance Analytics – TBD.

Unstructured – Lacks any Meaningful Standards.

Data Visualization – Humans see in 3D Only.

Navigation – Siloed Data is Difficult to Access.

Missing Data – Average of 30% from HIT Data.

Incorrect Data – Average of 25% to 30%.

The Three C’s (Challenges)

Collection

– Is it worth saving?

– Value = Actionable

Consolidation – Clean it up! "Not Collected Here"

Consumption

– Easy to Add Processors

– Difficult to Move Data


Transactions: Real Time & Queued

Real Time – must be done ASAP

– Retail: Credit Card Transactions

– Security: Is Passenger on the “no fly list”

– NICS Checks for Firearms Purchases

– Stock Purchases

Queued – Everything else that can wait

– Traffic Data: process images from Traffic Cameras to determine speed and volume.

– Daily Customer Counts

– Daily or Monthly Volume for Stock Transactions


When are the Conclusions Drawn from Big Data Most Accurate?

Big Data is most reliable when working in Two and sometimes Three Dimensional Matrices.

Where the Assumption to be derived is Boolean.

Where the Data Acquired is known to be of Good Quality.

Example: Traffic Data at Checkpoint

– Record: Number of Cars, Time, Maybe Speed

– Derive: Is Traffic Flowing without Delay?
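The checkpoint example can be sketched in a few lines of Python. The record layout, the 45 mph cutoff, and the names `CheckpointRecord` and `traffic_flowing` are assumptions chosen for illustration; the slides specify only what is recorded and the Boolean question to answer.

```python
from dataclasses import dataclass


@dataclass
class CheckpointRecord:
    """One observation window at the checkpoint (hypothetical layout)."""
    minute: int           # minutes since midnight
    cars: int             # cars counted in the window
    avg_speed_mph: float  # mean speed of those cars


def traffic_flowing(records, min_speed_mph=45.0):
    """Boolean answer: is the car-weighted mean speed at or above the cutoff?

    The 45 mph default is an assumed threshold, not a value from the slides.
    """
    total_cars = sum(r.cars for r in records)
    if total_cars == 0:
        return True  # an empty road has no delay
    weighted_speed = sum(r.cars * r.avg_speed_mph for r in records) / total_cars
    return weighted_speed >= min_speed_mph


# Made-up sample windows for a congested and a free-flowing period.
rush_hour = [CheckpointRecord(480, 120, 22.0), CheckpointRecord(481, 115, 25.0)]
midday = [CheckpointRecord(720, 40, 58.0), CheckpointRecord(721, 35, 61.0)]
print(traffic_flowing(rush_hour), traffic_flowing(midday))
```

Few dimensions, a yes/no conclusion, and directly measured inputs are what keep this kind of derivation reliable.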


Big Data = Big Problems

More Excess Data as Compared to Real Signals = More Spurious Relationships.


Source: N.N. Taleb
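A small simulation, offered as a sketch rather than a reproduction of Taleb's analysis, illustrates the point. Every column below is pure random noise, yet the number of strongly correlated pairs tends to grow as more columns are added. The sample size of 10, the |r| >= 0.8 cutoff, the seed, and the function names are all assumptions made for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def count_spurious(n_variables, n_observations=10, threshold=0.8, seed=0):
    """Count pairs of pure-noise variables whose |r| meets the threshold."""
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    return sum(
        1
        for a, b in combinations(columns, 2)
        if abs(pearson_r(a, b)) >= threshold
    )


for n in (10, 50, 100):
    pairs = n * (n - 1) // 2  # number of variable pairs examined
    print(n, "variables,", pairs, "pairs,", count_spurious(n), "spurious")
```

The number of pairs grows roughly with the square of the number of variables, so the opportunities for a chance relationship grow much faster than the data set itself.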

Outliers: Goldmine or Nuisance

An Outlier can either be a Goldmine (the needle in the haystack sought) or a Nuisance (an artifact to be ignored)

Example: Lipinski’s Rule of 5 (Ro5)

16% of oral drugs violate at least one of the criteria, and 6% fail two or more.

Billion Dollar Drugs that have failed the Ro5 criteria:

Lipitor, Singulair
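For reference, the Rule of 5 flags a compound for each of four thresholds it exceeds: more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors, molecular weight over 500 daltons, and logP over 5. A minimal sketch of a violation counter follows. The `Compound` class, the function name, and both example compounds are hypothetical, and the property values are invented for illustration rather than taken from any real drug.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient


def ro5_violations(c):
    """Return how many of the four Rule of 5 criteria the compound violates."""
    checks = [
        c.h_bond_donors > 5,
        c.h_bond_acceptors > 10,
        c.molecular_weight > 500,
        c.log_p > 5,
    ]
    return sum(checks)


# Hypothetical compounds with made-up property values.
compliant = Compound("hypothetical A", 2, 6, 350.4, 3.1)
outlier = Compound("hypothetical B", 4, 5, 560.0, 5.7)

for c in (compliant, outlier):
    print(c.name, "violations:", ro5_violations(c))
```

A screening pipeline that discarded every compound with a nonzero count would discard the "goldmine" outliers along with the nuisances.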


Outliers: Goldmine or Nuisance

Example: Nuisance Outlier

The speed of the Motorcycle in no way reflects the true speed of the Traffic.
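A short sketch shows how one such reading distorts a summary statistic. The speeds below are made-up values, with the last entry standing in for the motorcycle; the comparison of mean and median is an illustration, not a method taken from the slides.

```python
from statistics import mean, median

# Hypothetical speeds (mph) at a congested checkpoint; the last reading
# is a motorcycle riding between lanes.
speeds_mph = [24, 26, 25, 27, 23, 25, 95]

print("mean:", mean(speeds_mph))      # pulled upward by the single outlier
print("median:", median(speeds_mph))  # still reflects the bulk of traffic
```

With these values the mean is 35 mph while the median stays at 25 mph, so a robust statistic is one way to keep a nuisance outlier from misrepresenting the flow.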


No rigid mathematical definition exists of what constitutes an outlier, or when an Outlier may be omitted from an analysis.

Mahalanobis Distance – the distance between a data point and the centroid (overall mean) of a multivariate space, measured relative to the spread of the data. (Commonly used in Linear Regression)
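The Mahalanobis distance can be sketched for the two-variable case without any external libraries. The function name `mahalanobis_2d` and the sample points are assumptions for illustration. The sketch uses the sample covariance (dividing by n - 1) and inverts the 2x2 covariance matrix by hand.

```python
from math import sqrt


def mahalanobis_2d(point, data):
    """Mahalanobis distance from `point` to the centroid of 2-D `data`."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Entries of the sample covariance matrix S.
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^T S^-1 d, using S^-1 = (1/det) * [[syy, -sxy], [-sxy, sxx]].
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(max(squared, 0.0))


# Made-up data with a strong upward trend (y is roughly 2x).
trend = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]

on_trend = (4, 8)   # follows the pattern of the data
off_trend = (4, 4)  # same Euclidean distance from the centroid (3, 6)
print(mahalanobis_2d(on_trend, trend), mahalanobis_2d(off_trend, trend))
```

Both test points sit equally far from the centroid in ordinary Euclidean terms, but the off-trend point scores far higher because it contradicts the correlation in the data. Where to draw the cutoff remains a judgment call, which is the slide's point.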

Outliers – Bonedigger and Milo

Bonedigger the lion and Milo the sausage dog are inseparable. A friendship between an 11-pound wiener dog and a 500-pound lion is the only one of its kind ever seen in the world.


Melanoma Example: Dealing with Variance

Impossible to Positively Discern without Biopsy

[Figure: example lesion images labeled MELANOMA and BENIGN, with a REAL WORLD TEST CASE]

Melanoma ~ 80% Diagnostic Rate with Current Image Algorithms

Because Melanoma can present in all Colors, Shapes, Granularities, and Textures, More Data is unlikely to improve Current Diagnostic Image Algorithms.

Sensitivity – Rule out Condition when Negative

= true positives / (true positives + false negatives)

80% Sensitive Test will Detect 8 out of 10 Cancers.

Specificity – Rule in Condition when Positive

= true negatives / (true negatives + false positives)

95% Specific Test -> False Positive Rate of 5%

Sensitivity and Specificity Trade Off: for a given test, raising one typically lowers the other.
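The two formulas translate directly into code. In this sketch the function names and the patient counts are assumptions, with the counts chosen only so that the results match the 80% and 95% figures quoted above.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual cases that the test detects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of healthy subjects that the test correctly clears."""
    return true_negatives / (true_negatives + false_positives)


def false_positive_rate(true_negatives, false_positives):
    """Fraction of healthy subjects wrongly flagged; equals 1 - specificity."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical counts: 10 cancers (8 detected), 100 benign lesions (5 flagged).
print(sensitivity(true_positives=8, false_negatives=2))
print(specificity(true_negatives=95, false_positives=5))
print(false_positive_rate(true_negatives=95, false_positives=5))
```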

Variance – The Batch Effect

High-throughput technologies.

Batch Effects occur when measurements are affected by laboratory conditions, reagent lots, and personnel differences.

Pharmaceutical Mergers - Particularly troubling when merging data sets from different labs.

Normalization for Batch Effects is extremely difficult.

“What level is your pain on a scale from 1 to 10?”


Qualitative Variance

Massachusetts General Hospital and Harvard Medical School investigated discrepancy rates in the interpretation of Radiology Films.

Each Radiologist reviewed 60 examinations: 30 previously interpreted by themselves and 30 interpreted by their peers.

Interobserver Disagreement Rate = 26%.

Intraobserver Disagreement Rate = 32%.

Radiologists agreed with other Radiologists more than with themselves.


Correlation vs. Causation

Correlation is easy to prove.

The Strength of a Correlation is Easy to Measure.

R² = 1.0 – Perfect Correlation.

R² = 0.0 – No Correlation.

Causation is nearly Impossible to Prove.

US Spending on Science, Space, and Technology correlates Nearly Perfectly (R² = 0.99208) with Suicides by Hanging, Strangulation and Suffocation.
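How easy it is to measure a correlation can be shown in a few lines. The two series below are synthetic values invented for illustration (they are not the spending or suicide figures), and the function name is an assumption. Both series simply rise over time, which is enough to produce a high R² with no causal link between them.

```python
def r_squared(x, y):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Two unrelated, made-up yearly series that both happen to trend upward.
series_a = [1, 2, 3, 4, 5, 6]
series_b = [10, 12, 15, 16, 19, 21]

print(round(r_squared(series_a, series_b), 4))
```

Any two quantities that share a trend will score well on this measure, which is why a high R² by itself says nothing about causation.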


Bradford Hill Criteria for Causation

Strong – Five or Ten Fold Increase

Consistent – Holds Across Populations and Time

Specific – A Link (a location, mechanism, etc.)

Temporal – Cause Precedes Effect

Gradient – Association Increases with Exposure

Plausible – A Credible Mechanism Exists

Coherent – Experimental Evidence Supports

Analogous – Similar Behavior in Analogous Situations


Big Data Governance Does not Exist

No laws exist to address the utilization of big data.

Concerns about citizen privacy and business liability have yet to be addressed.

Critical Challenge to the Federal Government.

Federal Agencies that Utilize Big Data do so on an ad-hoc basis.

Little guidance exists on using petabyte sizes of private citizen data for predictive analytics.

– Privacy Act of 1974 and HIPAA (1996).


Hierarchy of Evidence


Big Data = Observational Study

Data is not Collected to Examine a Specific Problem using a Protocol.

The Treatment Group and the Control Group are outside the control of the Investigator.

Groups Differing in Outcome are identified and compared on the basis of a supposed causal attribute.

Longitudinal - repeated observations of the same variables over long periods of time.


Summary

The World is Accumulating a Lot of Data.

Nobody Agrees on What “Big” is.

On Average, 30% of the Data is Incorrect.

On Average, 30% of the Data is Missing.

Correlation is the Easy Part.

Bradford Hill gives Guidance on Proving Causation.

There is a Hierarchy of Evidence, and Expert Opinion and Big Data are at the Bottom of it.


Selected Publications

Automated Data Analysis with Excel

– Softcover: 442 Pages

– Chapman & Hall (June 2007)

– Second Edition Coming in 2016

– ISBN: 1-58488-885-7

Practical Pharmaceutical Laboratory Automation

– Hardcover: 464 pages

– Publisher: CRC Press (May 2003)

– ISBN: 0849318149


References

Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL. Quantifying the chemical beauty of drugs. Nat Chem. 2012;4:90–98. doi: 10.1038/nchem.1243. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3524573/

The Big Data Conundrum: How to Define It? http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/

Abujudeh HH, Boland GW, Kaewalai R, et al. Abdominal and Pelvic Computed Tomography (CT) Interpretation: discrepancy rates among experienced radiologists. Eur Radiol. 2010;20(8):1952–7.

Ramezani M, Karimian A, Moallem P. Automatic Detection of Malignant Melanoma using Macroscopic Images. J Med Signals Sens. 2014;4(4):281–290. PMCID: PMC4236807
