Previous Lecture: Exploring Data

Post on 04-Jan-2016


Introduction to Biostatistics and Bioinformatics

This Lecture: Descriptive Statistics

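Process of Statistical Analysis

Population → Random Sample → Sample Statistics

• Describe: summarize the random sample with sample statistics.

• Make Inferences: use the sample statistics to draw conclusions about the population.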

Distributions

[Figure: four example distributions: Complex, Normal, Skewed, Long tails]

Randomly Sample from any Distribution

1. Generate a pair of random numbers within the range.

2. Assign them to x and y.

3. Keep x if the point (x, y) is within the distribution, that is, under the distribution curve.

4. Repeat 1-3 until the desired sample size is obtained.

5. The values x obtained in this way will be distributed according to the original distribution.
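The procedure above is usually called rejection sampling. A minimal sketch follows, assuming the distribution is given as a non-negative function on a bounded range and that an upper bound on its height is known. The names `rejection_sample` and `bell`, and the bell-shaped example curve, are illustrative and not part of the lecture.

```python
import math
import random


def rejection_sample(density, x_min, x_max, y_max, n, rng=None):
    """Draw n values from an unnormalized density on [x_min, x_max].

    y_max must be at least the largest value the density takes on the range.
    """
    rng = rng or random.Random()
    samples = []
    while len(samples) < n:
        # Steps 1-2: a random point (x, y) inside the bounding rectangle.
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(0.0, y_max)
        # Step 3: keep x only if the point lies under the density curve.
        if y <= density(x):
            samples.append(x)
    # Step 4 is the loop itself; step 5 is the returned list.
    return samples


def bell(x):
    # Hypothetical example density: an unnormalized bell curve with peak height 1.
    return math.exp(-0.5 * x * x)


samples = rejection_sample(bell, -5.0, 5.0, 1.0, 2000, random.Random(0))
print(sum(samples) / len(samples))  # should be close to 0 for this symmetric curve
```

The method only needs the shape of the curve, not its normalization, which is why it works for the complex multi-peaked example as well as for the standard distributions. It becomes slow when most of the bounding rectangle lies above the curve.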

Mean

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Mean

[Figure: sample mean versus Sample Size for four distributions: Complex, Normal, Skewed, Long tails]

Median, Quartiles and Percentiles

Sample: $x_1, x_2, \ldots, x_n$

Quartiles

$Q_1$: $x_i < Q_1$ for 25% of the sample

$Q_2$: $x_i < Q_2$ for 50% of the sample (median)

$Q_3$: $x_i < Q_3$ for 75% of the sample

Percentiles

$P_m$: $x_i < P_m$ for m% of the sample

Inter Quartile Range

$IQR = Q_3 - Q_1$
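As a small worked example of these definitions, the sketch below computes the quartiles, one percentile and the interquartile range with Python's standard `statistics` module. The sample values are made up, and the `"inclusive"` interpolation rule is only one of several common conventions for quantiles, so other tools may give slightly different cut points.

```python
import statistics

# Made-up sample with one large value to show how mean and median differ.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 9.5, 3.1]

# Three cut points that split the sorted sample into four equal parts.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# With n=100 the function returns 99 cut points; index 89 is the 90th percentile.
p90 = statistics.quantiles(sample, n=100, method="inclusive")[89]

mean = statistics.fmean(sample)
median = statistics.median(sample)

print(q1, q2, q3, iqr, p90)
print(mean, median)
```

Here the single large value pulls the mean above the median, while the quartiles and the interquartile range are barely affected by it.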

Median and Mean

[Figure: median and mean versus Sample Size for four distributions: Complex, Normal, Skewed, Long tails. Median - Gray]

Quartiles and Mean

[Figure: quartiles and mean versus Sample Size for four distributions: Complex, Normal, Skewed, Long tails. Q3 - Purple, Q1 - Gray]

Central Limit Theorem

The sum of a large number of values, drawn from almost any kind of distribution, converges to a normal distribution if:

• The values are drawn independently;

• The values are all drawn from the same distribution; and

• The distribution has a finite mean and variance.
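A quick simulation can make the theorem concrete. The sketch below assumes an exponential distribution as the skewed starting point (an arbitrary choice, not taken from the lecture) and uses skewness as a rough measure of how far a set of values is from the symmetric normal shape.

```python
import random
import statistics

rng = random.Random(1)


def sum_of_draws(k):
    # Sum of k independent draws from one skewed distribution (exponential, mean 1).
    return sum(rng.expovariate(1.0) for _ in range(k))


def skewness(values):
    # Third standardized moment: 0 for a symmetric distribution such as the normal.
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


skew_single = skewness([sum_of_draws(1) for _ in range(5000)])
skew_sum = skewness([sum_of_draws(50) for _ in range(5000)])

print(skew_single)  # strongly skewed single draws
print(skew_sum)     # sums of 50 draws are much closer to symmetric
```

Skewness near zero does not prove normality, but the drop from single draws to sums of 50 shows the direction of the convergence.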

Variance

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
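A short worked example of the mean and the variance, using made-up values. It assumes the form of the variance that divides by n, and compares it with the n - 1 form that many tools use by default for samples.

```python
import statistics

sample = [4.0, 7.0, 5.0, 9.0, 5.0]  # made-up values

n = len(sample)
mean = sum(sample) / n
# Mean squared deviation from the mean, dividing by n.
variance = sum((x - mean) ** 2 for x in sample) / n
std_dev = variance ** 0.5

print(mean, variance, std_dev)
print(statistics.pvariance(sample))  # same divide-by-n definition
print(statistics.variance(sample))   # divides by n - 1 instead
```

The two definitions differ noticeably for small samples and become nearly identical as n grows.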

Variance

[Figure: variance versus Sample Size for four distributions: Complex, Normal, Skewed, Long tails]

Inter Quartile Range and Standard Deviation

[Figure: standard deviation and IQR/1.349 versus Sample Size for four distributions: Complex, Normal, Skewed, Long tails. IQR/1.349 - Gray]
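The factor 1.349 is the width of the interquartile range of a normal distribution measured in standard deviations, so IQR/1.349 estimates the standard deviation when the data are roughly normal. The sketch below checks the factor and compares the two estimates on simulated normal data. The mean of 10 and standard deviation of 2 are arbitrary choices.

```python
import random
import statistics

# For a normal distribution the quartiles sit about 0.6745 sigma either side
# of the mean, so the IQR spans about 1.349 sigma.
unit_normal = statistics.NormalDist(0.0, 1.0)
iqr_per_sigma = unit_normal.inv_cdf(0.75) - unit_normal.inv_cdf(0.25)

rng = random.Random(2)
sample = [rng.gauss(10.0, 2.0) for _ in range(1000)]

q1, _, q3 = statistics.quantiles(sample, n=4)
sd_from_iqr = (q3 - q1) / iqr_per_sigma
sd_direct = statistics.pstdev(sample)

print(iqr_per_sigma)
print(sd_from_iqr, sd_direct)  # both should be near the true value of 2
```

For skewed or long-tailed data the two estimates can disagree, because extreme values inflate the standard deviation far more than they move the quartiles.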

Uncertainty in Determining the Mean

[Figure: distribution of the Average of n values for four distributions: Complex, Normal, Skewed, Long tails. Panels for n=3, n=10 and n=100; one distribution is shown for n=10, n=100 and n=1000]

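Standard Error of the Mean

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

Standard Error of the Mean: $\mathrm{s.e.m.} = \frac{\sigma}{\sqrt{n}}$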
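The standard error of the mean describes how much the mean of a sample of size n varies from one sample to the next. A small simulation, assuming normally distributed values with an arbitrary standard deviation of 2 and n = 25, compares the observed spread of many sample means with the standard deviation divided by the square root of n.

```python
import math
import random
import statistics

rng = random.Random(3)
sigma, n = 2.0, 25  # arbitrary illustration values


def sample_mean():
    # Mean of one random sample of size n.
    return statistics.fmean([rng.gauss(0.0, sigma) for _ in range(n)])


means = [sample_mean() for _ in range(4000)]

observed_spread = statistics.pstdev(means)
predicted_sem = sigma / math.sqrt(n)

print(observed_spread, predicted_sem)  # both should be close to 0.4
```

Because of the square root, quadrupling the sample size only halves the uncertainty in the mean.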

Error bars

M. Krzywinski & N. Altman, Error Bars, Nature Methods 10 (2013) 921

In 2012, error bars appeared in Nature Methods in about two-thirds of the figure panels in which they could be expected (scatter and bar plots). The type of error bars was nearly evenly split between s.d. and s.e.m. bars (45% versus 49%, respectively). In 5% of cases the error bar type was not specified in the legend. Only one figure used bars based on the 95% CI.

None of the error bar types is intuitive. An alternative is to select a value of CI% for which the bars touch at a desired P value (e.g., 83% CI bars touch at P = 0.05).
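The 83% figure can be reproduced with a short calculation. The sketch assumes two independent sample means with equal standard errors and a normal approximation; these assumptions are added here for illustration and are not spelled out in the quoted text.

```python
import math
from statistics import NormalDist

z = NormalDist()  # standard normal distribution
alpha = 0.05

# Two-sided test: the means differ at P = alpha when their difference equals
# z_test * sqrt(2) * SE, since the standard error of a difference of two
# independent means with equal SE is sqrt(2) * SE.
z_test = z.inv_cdf(1 - alpha / 2)

# CI bars of half-width z_ci * SE just touch when the difference is 2 * z_ci * SE.
z_ci = z_test * math.sqrt(2) / 2

# Confidence level whose bars have that half-width.
ci_level = 2 * z.cdf(z_ci) - 1

print(z_test, z_ci, ci_level)  # ci_level is about 0.83
```

The same calculation shows why 95% CI bars that just touch correspond to a much smaller P value than 0.05.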

Box Plot

M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119

Box Plots

[Figure: box plots of samples of size n=5, n=10 and n=100 from four distributions: Complex, Normal, Skewed, Long tails]

Box Plots with All the Data Points

[Figure: box plots with every data point overlaid, for samples of size n=5, n=10 and n=100 from four distributions: Complex, Normal, Skewed, Long tails]

Box Plots, Scatter Plots and Bar Graphs: Normal Distribution

[Figure: two sets of panels, one with error bars showing the standard deviation and one with error bars showing the standard error]

Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution

[Figure: two sets of panels, one with error bars showing the standard deviation and one with error bars showing the standard error]

Box Plots, Scatter Plots and Bar Graphs: Distribution with Fat Tail

[Figure: two sets of panels, one with error bars showing the standard deviation and one with error bars showing the standard error]

Application: Analytical Measurements

[Figure: Measured Concentration versus Theoretical Concentration]

A Few Characteristics of Analytical Measurements

Accuracy: Closeness of agreement between a test result and an accepted reference value.

Precision: Closeness of agreement between independent test results.

Robustness: Test precision given small, deliberate changes in test conditions (preanalytic delays, variations in storage temperature).

Lower limit of detection: The lowest amount of analyte that is statistically distinguishable from background or a negative control.

Limit of quantification: Lowest and highest concentrations of analyte that can be quantitatively determined with suitable precision and accuracy.

Linearity: The ability of the test to return values that are directly proportional to the concentration of the analyte in the sample.
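Three of these characteristics map directly onto simple statistics. The sketch below uses made-up measurements and treats accuracy as the bias of the mean relative to the reference value, precision as the standard deviation of replicates, and linearity as the slope of a least-squares line through measured versus theoretical concentration. These are common simplifications, not definitions taken from the lecture.

```python
import statistics

# Accuracy and precision from replicate results at one reference concentration.
reference = 50.0                              # accepted reference value (made up)
replicates = [51.2, 49.8, 50.9, 51.5, 50.6]   # made-up repeated test results

bias = statistics.fmean(replicates) - reference   # accuracy: closer to 0 is better
spread = statistics.stdev(replicates)             # precision: smaller is better

# Linearity from measurements across a range of theoretical concentrations.
theoretical = [1.0, 2.0, 5.0, 10.0, 20.0]
measured = [1.1, 2.0, 5.2, 9.9, 20.3]             # made-up measurements

mean_t = statistics.fmean(theoretical)
mean_m = statistics.fmean(measured)
# Ordinary least-squares slope and intercept of measured against theoretical.
slope = sum((t - mean_t) * (m - mean_m) for t, m in zip(theoretical, measured)) / sum(
    (t - mean_t) ** 2 for t in theoretical
)
intercept = mean_m - slope * mean_t

print(bias, spread)
print(slope, intercept)  # a slope near 1 and intercept near 0 suggest good linearity
```

A test can be precise without being accurate: tightly clustered replicates with a large bias score well on the second number and badly on the first.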

Measuring Blanks

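Coefficient of Variation

Sample: $x_1, x_2, \ldots, x_n$

Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$

Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$

Coefficient of Variation (CV): $CV = \frac{\sigma}{\mu}$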
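The coefficient of variation expresses the spread relative to the mean, which makes it useful for comparing measurement noise at different concentrations. A minimal sketch with made-up replicate measurements, where the two sets have the same absolute spread but very different CVs:

```python
import statistics


def coefficient_of_variation(values):
    # Standard deviation (dividing by n) relative to the mean.
    return statistics.pstdev(values) / statistics.fmean(values)


low = [1.0, 1.4, 0.6, 1.2, 0.8]       # made-up replicates near concentration 1
high = [10.0, 10.4, 9.6, 10.2, 9.8]   # same absolute spread near concentration 10

cv_low = coefficient_of_variation(low)
cv_high = coefficient_of_variation(high)

print(f"{cv_low:.1%} {cv_high:.1%}")
```

The CV is only meaningful for quantities measured on a scale with a true zero and a mean well away from zero.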

Lower Limit of Detection

The lowest amount of analyte that is statistically distinguishable from background or a negative control.

Two methods to determine lower limit of detection:

1. Find the lowest concentration of the analyte at which the CV is less than a chosen threshold, for example 20%.

2. Determine the level of the blank by taking the 95th percentile of the blank measurements, then add a constant times the standard deviation of measurements at the lowest concentration.

K. Linnet and M. Kondratovich, Partly Nonparametric Approach for Determining the Limit of Detection, Clinical Chemistry 50 (2004) 732–740.
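Both methods can be sketched in a few lines. All data below are made up, the function names are hypothetical, and the constant `k = 1.645` is an assumed placeholder: the slide says only "a constant", so consult the cited paper for the value appropriate to a given number of measurements.

```python
import statistics


def lod_by_cv(replicates_by_conc, max_cv=0.20):
    """Method 1: lowest concentration whose replicate CV is below max_cv."""
    for conc in sorted(replicates_by_conc):
        values = replicates_by_conc[conc]
        cv = statistics.stdev(values) / statistics.fmean(values)
        if cv < max_cv:
            return conc
    return None  # no tested concentration met the threshold


def lod_by_blank(blanks, low_conc_values, k=1.645):
    """Method 2: 95th percentile of blanks plus k times the SD at the lowest concentration.

    k is an assumed placeholder constant, not a value given in the lecture.
    """
    # 99 cut points are returned; index 94 is the 95th percentile.
    limit_of_blank = statistics.quantiles(blanks, n=100, method="inclusive")[94]
    return limit_of_blank + k * statistics.stdev(low_conc_values)


# Made-up replicate measurements keyed by theoretical concentration.
replicates_by_conc = {
    0.5: [0.2, 0.9, 0.5, 0.7],
    1.0: [0.8, 1.3, 0.9, 1.1],
    2.0: [1.9, 2.2, 2.0, 2.1],
}
# Made-up blank measurements.
blanks = [0.00, 0.05, 0.10, 0.02, 0.08, 0.03, 0.12, 0.04, 0.06, 0.01, 0.07]

lod_cv = lod_by_cv(replicates_by_conc)
lod_blank = lod_by_blank(blanks, replicates_by_conc[0.5])

print(lod_cv, lod_blank)
```

The two methods need not agree. The first depends on which concentrations were tested, while the second depends on how well the blank distribution and the low-concentration spread are characterized.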

Limit of Detection and Linearity

[Figure: two panels of Measured Concentration versus Theoretical Concentration]

Precision and Accuracy

[Figure: two panels of Measured Concentration versus Theoretical Concentration]

Descriptive Statistics - Summary

• Example distributions:
  • Normal distribution
  • Skewed distribution
  • Distribution with long tails
  • Complex distribution with several peaks

• Mean, median, quartiles, percentiles

• Variance, Standard deviation, Inter Quartile Range (IQR), error bars

• Box plots, bar graphs, and scatter plots

• Application: Analytical measurements:
  • Accuracy and precision
  • Limit of detection and quantitation
  • Linearity
  • Robustness

Descriptive Statistics - Recommended Reading

http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html

http://greenteapress.com/thinkstats/

Next Lecture: Data types and representations in Molecular Biology

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG

@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII@SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGGAAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152#.,')2/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHIHIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG

##gff-version 3#!gff-spec-version 1.20##species_http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7425NC_015867.2 RefSeq cDNA_match 66086 66146 . - . ID=aln0;Target=XM_008204328.1 1 61 +; for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1NC_015867.2 RefSeq cDNA_match 65959 66007 . - . ID=aln0;Target=XM_008204328.1 62 110 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1NC_015867.2 RefSeq cDNA_match 65799 65825 . - . ID=aln0;Target=XM_008204328.1 111 137 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1

[The three examples above are, in order: FASTA, FASTQ, GFF3]

Next Tutorial: Python Programming

Saturday 9/13 at 3 PM in TRB 120
