Previous Lecture: Exploring Data
This Lecture: Descriptive Statistics
Introduction to Biostatistics and Bioinformatics
Process of Statistical Analysis
[Diagram: a random sample is drawn from the population; sample statistics describe the sample and are used to make inferences about the population.]
Distributions
[Figure: four example distributions: complex, normal, skewed, long tails.]
Randomly Sample from any Distribution
1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x, y) is within the distribution, that is, under its curve.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values of x obtained in this way will be distributed according to the original distribution.
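The five steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name `rejection_sample`, the bounding-box parameters, and the triangular example density are all assumptions made for the example.

```python
import random

def rejection_sample(pdf, x_min, x_max, y_max, n, rng=random):
    """Draw n values distributed according to pdf on [x_min, x_max].

    y_max must be at least the maximum of pdf on the interval.
    """
    samples = []
    while len(samples) < n:
        # Steps 1-2: a random point (x, y) inside the bounding box.
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(0.0, y_max)
        # Step 3: keep x only if the point lies under the curve.
        if y <= pdf(x):
            samples.append(x)
    # Step 4 is the loop itself; step 5 is the returned list.
    return samples

def triangle(x):
    """Hypothetical example density: rises linearly from 0 at x=0 to 2 at x=1."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = rejection_sample(triangle, 0.0, 1.0, 2.0, 5000, rng)
print(sum(xs) / len(xs))  # should be close to the true mean of this density, 2/3
```

The method works for any bounded density on a finite range, but it becomes slow when most of the bounding box lies above the curve, because most points are then rejected.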
Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Mean
[Figure: sample mean versus sample size (up to 100) for the complex, normal, skewed and long-tailed distributions.]
Median, Quartiles and Percentiles
Sample: $x_1, x_2, \ldots, x_n$
Quartiles:
$x_i < Q_1$ for 25% of the sample
$x_i < Q_2$ for 50% of the sample (median)
$x_i < Q_3$ for 75% of the sample
Percentiles:
$x_i < P_m$ for m% of the sample
Inter Quartile Range: $\mathrm{IQR} = Q_3 - Q_1$
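As a small illustration of these definitions, the sketch below computes the median, quartiles, a percentile and the IQR with Python's standard `statistics` module. The data values are invented for the example, and the "inclusive" interpolation rule is only one of several common conventions for sample quantiles, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical sample with one large value, to show how mean and median differ.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]

# Three cut points that split the sorted sample into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# 99 cut points; index m-1 holds the m-th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]

print("mean:", statistics.fmean(data))
print("median:", statistics.median(data), "Q2:", q2)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("90th percentile:", p90)
```

The single large value pulls the mean above the median, while the quartiles and the IQR are barely affected by it.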
Median and Mean
[Figure: sample median (gray) and mean versus sample size (up to 100) for the complex, normal, skewed and long-tailed distributions.]
Quartiles and Mean
[Figure: quartiles Q1 (gray) and Q3 (purple) and the mean versus sample size (up to 100) for the complex, normal, skewed and long-tailed distributions.]
Central Limit Theorem
The sum of a large number of values drawn from many kinds of distributions converges to a normal distribution if:
• The values are drawn independently;
• The values are from the same distribution; and
• The distribution has a finite mean and variance.
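A quick numerical check of this statement is sketched below. It sums independent draws from an exponential distribution (chosen here only as a convenient skewed example with finite mean and variance) and measures how the skewness of the sums shrinks toward 0, the value for a normal distribution, as more terms are added. The helper names are hypothetical.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sums_of_draws(n_terms, n_sums, rng):
    """Return n_sums sums, each of n_terms independent exponential draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n_terms)) for _ in range(n_sums)]

rng = random.Random(1)  # fixed seed so the run is repeatable
skew_by_n = {}
for n_terms in (1, 10, 100):
    sums = sums_of_draws(n_terms, 5000, rng)
    skew_by_n[n_terms] = skewness(sums)
    print(n_terms, round(skew_by_n[n_terms], 2))
```

Skewness is only one way to judge closeness to a normal distribution; a histogram or a quantile-quantile plot of the sums would show the same trend visually.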
Variance
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
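The mean and variance formulas on this slide translate directly into code. The sketch below writes them out by hand and compares the result with Python's `statistics` module; the sample values are made up. Note, as an aside, that the formula as it appears here divides by n (which matches `statistics.pvariance`), whereas many tools default to the n-1 divisor (`statistics.variance`).

```python
import statistics

def mean(xs):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def variance(xs):
    """Mean squared deviation from the mean (divisor n)."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

sample = [4.1, 5.0, 5.6, 4.8, 6.2, 5.3]  # hypothetical measurements

print("mean:", mean(sample))
print("variance (divisor n):", variance(sample))
print("statistics.pvariance:", statistics.pvariance(sample))
print("statistics.variance (divisor n-1):", statistics.variance(sample))
```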
Variance
[Figure: sample variance versus sample size (up to 100) for the complex, normal, skewed and long-tailed distributions.]
Inter Quartile Range and Standard Deviation
[Figure: standard deviation and IQR/1.349 (gray) versus sample size (up to 100) for the complex, normal, skewed and long-tailed distributions.]
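The factor 1.349 comes from the normal distribution, whose quartiles lie about 0.6745 standard deviations on either side of the mean, so IQR/1.349 estimates the standard deviation when the data are normal. The sketch below checks the factor and compares the two spread measures on simulated data with and without a few extreme values. The helper name `robust_sd` and the simulated data are assumptions for illustration.

```python
import random
import statistics

# Distance between the 25% and 75% points of a standard normal distribution.
factor = 2 * statistics.NormalDist().inv_cdf(0.75)  # about 1.349

def robust_sd(xs):
    """Estimate the standard deviation from the inter quartile range."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return (q3 - q1) / factor

rng = random.Random(2)  # fixed seed so the run is repeatable
normal = [rng.gauss(0.0, 1.0) for _ in range(2000)]
with_outliers = normal + [50.0] * 20  # add 1% extreme values

print("factor:", round(factor, 3))
print("normal data:   sd", round(statistics.stdev(normal), 2),
      " IQR/1.349", round(robust_sd(normal), 2))
print("with outliers: sd", round(statistics.stdev(with_outliers), 2),
      " IQR/1.349", round(robust_sd(with_outliers), 2))
```

For the clean normal data both estimates should agree closely; with the extreme values added, the standard deviation grows several-fold while IQR/1.349 hardly moves.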
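Uncertainty in Determining the Mean
[Figure: distributions of the average of n values drawn from the complex, normal, skewed and long-tailed distributions (panel labels: n = 3, 10, 100 for three of the distributions and n = 10, 100, 1000 for the fourth).]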
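Standard Error of the Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
Standard Error of the Mean: $\mathrm{s.e.m.} = \frac{\sigma}{\sqrt{n}}$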
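The sketch below illustrates what the standard error of the mean measures: it computes s.e.m. from a single sample and compares it with the actual spread of the means of many repeated samples of the same size. The normal population, the sample sizes and the number of repetitions are arbitrary choices for the example.

```python
import math
import random
import statistics

def sem(xs):
    """Standard error of the mean: sigma / sqrt(n), with sigma computed using divisor n."""
    return statistics.pstdev(xs) / math.sqrt(len(xs))

rng = random.Random(3)  # fixed seed so the run is repeatable
spread_of_means = {}
for n in (10, 100, 1000):
    # Means of 300 independent samples of size n from a standard normal population.
    means = [statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
             for _ in range(300)]
    spread_of_means[n] = statistics.pstdev(means)
    one_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, "s.e.m. from one sample:", round(sem(one_sample), 3),
          " spread of repeated means:", round(spread_of_means[n], 3))
```

Both columns should shrink roughly as one over the square root of n, which is why a tenfold smaller uncertainty in the mean needs a hundredfold larger sample.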
Error bars
M. Krzywinski & N. Altman, Error Bars, Nature Methods 10 (2013) 921
In 2012, error bars appeared in Nature Methods in about two-thirds of the figure panels in which they could be expected (scatter and bar plots). The type of error bars was nearly evenly split between s.d. and s.e.m. bars (45% versus 49%, respectively). In 5% of cases the error bar type was not specified in the legend. Only one figure used bars based on the 95% CI.
None of the error bar types is intuitive. An alternative is to select a value of CI% for which the bars touch at a desired P value (e.g., 83% CI bars touch at P = 0.05).
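To make the three error bar types concrete, the sketch below computes the half-length of s.d., s.e.m. and 95% CI bars for one sample. It is an illustration under stated assumptions: the measurements are invented, the standard deviation uses the n-1 divisor that is usual for error bars, and the confidence interval uses a normal approximation (for small samples a t-distribution quantile would give a wider interval).

```python
import math
import statistics

def error_bars(xs, ci=0.95):
    """Return the half-lengths of s.d., s.e.m. and CI error bars for a sample."""
    n = len(xs)
    sd = statistics.stdev(xs)            # n-1 divisor
    sem = sd / math.sqrt(n)
    # Normal-approximation multiplier, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + ci / 2)
    return {"sd": sd, "sem": sem, "ci": z * sem}

measurements = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.9, 10.6]  # hypothetical
bars = error_bars(measurements)
print("mean:", round(statistics.fmean(measurements), 3))
for name, half_length in bars.items():
    print(name, round(half_length, 3))
```

The three bars differ considerably in length for the same data, which is why a figure legend needs to state which type is shown.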
Box Plot
M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119
Box Plots
[Figure: box plots of samples of size n = 5, 10 and 100 from the complex, normal, skewed and long-tailed distributions.]
Box Plots with All the Data Points
[Figure: box plots with the individual data points overlaid, for samples of size n = 5, 10 and 100 from the complex, normal, skewed and long-tailed distributions.]
Box Plots, Scatter Plots and Bar Graphs: Normal Distribution
[Figure: box plots, scatter plots and bar graphs of the same samples; panels with error bars showing the standard deviation and panels with error bars showing the standard error.]
Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution
[Figure: box plots, scatter plots and bar graphs of the same samples; panels with error bars showing the standard deviation and panels with error bars showing the standard error.]
Box Plots, Scatter Plots and Bar Graphs: Distribution with Fat Tail
[Figure: box plots, scatter plots and bar graphs of the same samples; panels with error bars showing the standard deviation and panels with error bars showing the standard error.]
Application: Analytical Measurements
[Figure: measured concentration versus theoretical concentration.]
A Few Characteristics of Analytical Measurements
Accuracy: Closeness of agreement between a test result and an accepted reference value.
Precision: Closeness of agreement between independent test results.
Robustness: Test precision given small, deliberate changes in test conditions (preanalytic delays, variations in storage temperature).
Lower limit of detection: The lowest amount of analyte that is statistically distinguishable from background or a negative control.
Limit of quantification: Lowest and highest concentrations of analyte that can be quantitatively determined with suitable precision and accuracy.
Linearity: The ability of the test to return values that are directly proportional to the concentration of the analyte in the sample.
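Some of these characteristics can be summarized with simple statistics. The sketch below is one possible way to do so, not a prescribed procedure: it uses the mean bias against a reference value as a measure of accuracy, the standard deviation of replicates as a measure of precision, and a least-squares line through measured versus theoretical concentrations as a check on linearity. All function names and data values are hypothetical.

```python
import statistics

def bias(measured, reference):
    """Accuracy summary: mean of the measurements minus the accepted reference value."""
    return statistics.fmean(measured) - reference

def precision_sd(measured):
    """Precision summary: standard deviation of independent replicate results."""
    return statistics.stdev(measured)

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

replicates = [10.3, 10.1, 10.4, 10.2]   # hypothetical repeats of a 10.0 reference
theoretical = [1.0, 2.0, 4.0, 8.0, 16.0]
measured = [1.1, 2.0, 4.3, 7.9, 16.2]   # hypothetical dilution series

print("bias:", round(bias(replicates, 10.0), 3))
print("replicate sd:", round(precision_sd(replicates), 3))
slope, intercept = fit_line(theoretical, measured)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

A slope near 1 with an intercept near 0 is consistent with a linear response over the tested range; systematic curvature in the residuals would suggest otherwise.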
Measuring Blanks
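Coefficient of Variation
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
Coefficient of Variation (CV): $\mathrm{CV} = \frac{\sigma}{\mu}$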
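The coefficient of variation is the standard deviation expressed as a fraction of the mean, which makes the scatter of measurements at very different concentrations comparable. A minimal sketch, with invented replicate values and the divisor-n standard deviation used on this slide:

```python
import statistics

def coefficient_of_variation(xs):
    """Standard deviation divided by the mean (divisor n for the variance)."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return sigma / mu

low = [0.9, 1.2, 0.8, 1.1]           # hypothetical replicates near 1 unit
high = [99.0, 101.0, 100.5, 99.5]    # hypothetical replicates near 100 units

print("CV at low concentration: ", round(coefficient_of_variation(low), 3))
print("CV at high concentration:", round(coefficient_of_variation(high), 4))
```

In this example the absolute scatter is larger at the high concentration, yet the relative scatter (CV) is far larger at the low one.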
Lower Limit of Detection
The lowest amount of analyte that is statistically distinguishable from background or a negative control.
Two methods to determine lower limit of detection:
1. Take the lowest concentration of the analyte at which the CV is below a chosen threshold, for example 20%.
2. Determine the level of the blank as the 95th percentile of the blank measurements, then add a constant times the standard deviation of the lowest-concentration sample.
K. Linnet and M. Kondratovich, Partly Nonparametric Approach for Determining the Limit of Detection, Clinical Chemistry 50 (2004) 732–740.
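Both methods can be sketched in a few lines. The code below is an illustration only: the function names, the replicate and blank values, and the constant `k=1.645` are assumptions made for the example (the slide says only "a constant"; see the cited paper for how the constant is chosen).

```python
import statistics

def cv(xs):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.pstdev(xs) / statistics.fmean(xs)

def lod_by_cv(replicates_by_conc, max_cv=0.20):
    """Method 1: lowest concentration whose replicate CV is below max_cv."""
    passing = [c for c, xs in replicates_by_conc.items() if cv(xs) < max_cv]
    return min(passing) if passing else None

def lod_by_blank(blanks, low_sample, k=1.645):
    """Method 2: 95th percentile of the blanks plus k times the SD of the
    lowest-concentration sample. The value of k is an assumed placeholder."""
    p95 = statistics.quantiles(blanks, n=20, method="inclusive")[-1]
    return p95 + k * statistics.stdev(low_sample)

# Hypothetical replicate measurements keyed by theoretical concentration.
replicates = {
    0.5: [0.2, 0.9, 0.4, 0.7],
    1.0: [0.7, 1.4, 0.8, 1.2],
    2.0: [1.9, 2.2, 2.0, 2.1],
}
# Hypothetical blank measurements.
blanks = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06, 0.01, 0.07, 0.05, 0.03]

print("method 1:", lod_by_cv(replicates))
print("method 2:", round(lod_by_blank(blanks, replicates[0.5]), 3))
```

The two methods answer slightly different questions, so they need not give the same value: the first depends on the spacing of the tested concentrations, the second on the blank distribution and the scatter of the lowest sample.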
Limit of Detection and Linearity
[Figure: two panels of measured concentration versus theoretical concentration.]
Precision and Accuracy
[Figure: two panels of measured concentration versus theoretical concentration.]
Descriptive Statistics - Summary
• Example distributions:
  • Normal distribution
  • Skewed distribution
  • Distribution with long tails
  • Complex distribution with several peaks
• Mean, median, quartiles, percentiles
• Variance, standard deviation, Inter Quartile Range (IQR), error bars
• Box plots, bar graphs, and scatter plots
• Application: Analytical measurements:
  • Accuracy and precision
  • Limit of detection and quantitation
  • Linearity
  • Robustness
Descriptive Statistics – Recommended Reading
http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html
http://greenteapress.com/thinkstats/
Next Lecture: Data types and representations in Molecular Biology
FASTA
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
FASTQ
@SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC
+SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152
+50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>
[email protected] MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
NTGTGATAGGCTTTGTCCATTCTGGAAACTCAATATTACTTGCGAGTCCTCAAAGGTAATTTTTGCTATTGCCAATATTCCTCAGAGGAAAAAAGATACAATACTATGTTTTATCTAAATTAGCATTAGAAAAAAAATCTTTCATTAGGTGT
+SRR350953.7 MENDEL_0047_FC62MN8AAXX:1:1:1724:932 length=152
#.,')2/@@@@@@@@@@<:<<:778789979888889:::::99999<<::<:::::<<<<<@@@@@::::::IHIGIGGGGGGDGGDGGDDDIHIHIIIII8GGGGGIIHHIIIGIIGIBIGIIIIEIHGGFIHHIIIIIIIGIIFIG
GFF3
##gff-version 3
#!gff-spec-version 1.20
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7425
NC_015867.2 RefSeq cDNA_match 66086 66146 . - . ID=aln0;Target=XM_008204328.1 1 61 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65959 66007 . - . ID=aln0;Target=XM_008204328.1 62 110 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
NC_015867.2 RefSeq cDNA_match 65799 65825 . - . ID=aln0;Target=XM_008204328.1 111 137 +;for_remapping=2;gap_count=1;num_ident=8766;num_mismatch=0;pct_coverage=100;pct_coverage_hiqual=100;pct_identity_gap=99.9886;pct_identity_ungap=100;rank=1
Next Tutorial: Python Programming
Saturday 9/13 at 3 PM in TRB 120