09/05/13 K-INBRE Bioinformatics Core KSU
Introduction to the field of bioinformatics
Jennifer Shelton, Sept. 2013
K-INBRE Bioinformatics Core KSU
Outline
I. Basic concepts
  i. Definition of bioinformatics
  ii. Databases (flat-file and relational)
  iii. Assembly (Overlap-layout-consensus)
II. Steps you can take on your own
Definition of bioinformatics
Figure: the tasks applied to biological, medical, behavioral, or health data: acquire data, store/archive data, organize data, analyze data, and visualize data.
“Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”
-NIH Biomedical Information Science and Technology Initiative Consortium 2000
Problem with volume
“We believe the field of bioinformatics for genetic analysis will be one of the biggest areas of disruptive innovation in life science tools over the next few years,”
-Isaac Ro, Goldman Sachs
Mark Smiciklas, Flickr.com/photos/intersectionconsulting
Per year, worldwide, we can generate ~13,000,000,000,000,000 bp of data.
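To get a feel for that number, here is a rough back-of-the-envelope sketch in Python. Only the ~13,000,000,000,000,000 bp figure comes from the slide. The human genome size (~3.2 billion bp) and the 2-bits-per-base packing are assumptions made for illustration; real sequencing files also carry quality scores and read names, so they are considerably larger.

```python
# Back-of-the-envelope arithmetic for the yearly sequencing volume.
BASES_PER_YEAR = 13_000_000_000_000_000  # from the slide: ~13 quadrillion bp
HUMAN_GENOME_BP = 3_200_000_000          # assumption: ~3.2 Gbp haploid human genome
BITS_PER_BASE = 2                        # assumption: A, C, G, T packed into 2 bits


def genome_equivalents(bases, genome_size=HUMAN_GENOME_BP):
    """How many genomes of `genome_size` the given number of bases would cover once."""
    return bases / genome_size


def packed_size_petabytes(bases, bits_per_base=BITS_PER_BASE):
    """Storage needed for the bare sequence, in petabytes (10**15 bytes)."""
    return bases * bits_per_base / 8 / 1e15


print(f"{genome_equivalents(BASES_PER_YEAR):,.0f} human-genome equivalents")
print(f"{packed_size_petabytes(BASES_PER_YEAR):.2f} PB at {BITS_PER_BASE} bits per base")
```

Under these assumptions the yearly output is on the order of millions of human-genome equivalents and several petabytes of bare sequence.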
Problem with volume
“This unprecedented amount of sequencing information poses bottlenecks that vary, depending on application, at the level of data extraction, analysis, and interpretation”
“These challenges have become part and parcel of the biomedical research community where investigators have increasingly needed to incorporate bioinformatics and biostatistics into their armamentarium.”
-Opportunities and Challenges Associated with Clinical Diagnostic Genome Sequencing: A Report of the Association for Molecular Pathology. The Journal of Molecular Diagnostics, November 2012
Problem with volume
“It sounds like an analog solution in a digital age,”
-Sifei He, head of cloud computing for BGI (referring to FedExing disks of data because internet connections are often too slow)
NY Times 2011 article: DNA Sequencing Caught in a Deluge of Data http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?pagewanted=all&_r=0
Examples of bioinformatics tools
Flat-file databases
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
‘records’: all the data about one unique object
‘fields’: the same kind of data about different objects
Any flat-file database, like GenBank, can be thought of as a single spreadsheet, called a ‘table’, made up of ‘fields’ and ‘records’.
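The table, field, and record idea can be sketched in a few lines of Python. This is a toy tab-separated table, not GenBank's actual file format; the accession values, field names, and lengths are made up for illustration.

```python
import csv
import io

# Hypothetical flat file: one table, tab-separated, first row names the fields.
FLAT_FILE = (
    "accession\torganism\tlength_bp\n"
    "ACC0001\tHomo sapiens\t1200\n"
    "ACC0002\tTribolium castaneum\t850\n"
    "ACC0003\tHomo sapiens\t430\n"
)


def load_records(text):
    """Each row becomes one record: a dict mapping field name to value."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def field(records, name):
    """One field: the same kind of data taken from every record."""
    return [record[name] for record in records]


records = load_records(FLAT_FILE)
print(records[0])                   # one record: everything about one object
print(field(records, "organism"))   # one field: the same attribute across objects
```

Reading across a row gives a record; reading down a column gives a field.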
Relational databases
Relational databases have multiple tables, with some shared ‘fields’ and some different ones.
‘fields’: the same kind of data about different objects
KEGG PATHWAY: http://www.genome.jp/kegg/pathway.html
Relational databases are like multiple tables that are linked by a shared field. This shared field acts like a “key” between them.
KEGG PATHWAY: hsa05204 (www.genome.jp/dbget-bin/www_bget?pathway+hsa05204)
Organism: Homo sapiens (human) [GN:hsa]
Gene:
1543 CYP1A1; cytochrome P450, family 1, subfamily A, polypeptide 1 (EC:1.14.14.1) [KO:K07408] [EC:1.14.14.1]
1576 CYP3A4; cytochrome P450, family 3, subfamily A, polypeptide 4 (EC:1.14.13.67 1.14.13.97 1.14.13.32) [KO:K07424] [EC:1.14.14.1]
1577 CYP3A5; cytochrome P450, family 3, subfamily A, polypeptide 5 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
1551 CYP3A7; cytochrome P450, family 3, subfamily A, polypeptide 7 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
64816 CYP3A43; cytochrome P450, family 3, subfamily A, polypeptide 43 (EC:1.14.14.1) [KO:K07424] [EC:1.14.14.1]
5743 PTGS2; prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase) (EC:1.14.99.1) [KO:K11987] [EC:1.14.99.1]
10 NAT2; N-acetyltransferase 2 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
9 NAT1; N-acetyltransferase 1 (arylamine N-acetyltransferase) (EC:2.3.1.5) [KO:K00622] [EC:2.3.1.5]
1544 CYP1A2; cytochrome P450, family 1, subfamily A, polypeptide 2 (EC:1.14.14.1) [KO:K07409] [EC:1.14.14.1]
6799 SULT1A2; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 2 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
6817 SULT1A1; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 1 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
6818 SULT1A3; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 3 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
445329 SULT1A4; sulfotransferase family, cytosolic, 1A, phenol-preferring, member 4 (EC:2.8.2.1) [KO:K01014] [EC:2.8.2.1]
1545 CYP1B1; cytochrome P450, family 1, subfamily B, polypeptide 1 (EC:1.14.14.1) [KO:K07410] [EC:1.14.14.1]
1558 CYP2C8; cytochrome P450, family 2, subfamily C, polypeptide 8 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1562 CYP2C18; cytochrome P450, family 2, subfamily C, polypeptide 18 (EC:1.14.14.1) [KO:K07413] [EC:1.14.14.1]
1557 CYP2C19; cytochrome P450, family 2, subfamily C, polypeptide 19 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
1559 CYP2C9; cytochrome P450, family 2, subfamily C, polypeptide 9 (EC:1.14.13.48 1.14.13.49 1.14.13.80) [KO:K07413] [EC:1.14.14.1]
2052 EPHX1; epoxide hydrolase 1, microsomal (xenobiotic)
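The idea of a shared field acting as a key can be sketched with Python's built-in sqlite3 module. The gene IDs, symbols, KO numbers, and pathway ID below are taken from the listing above, but the two-table layout and the table and column names are hypothetical; this is not KEGG's actual schema.

```python
import sqlite3


def build_db():
    """Create two small in-memory tables that share the `gene_id` field."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, ko TEXT);
        CREATE TABLE pathway_gene (pathway_id TEXT, gene_id INTEGER);
        """
    )
    con.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            (1543, "CYP1A1", "K07408"),
            (1544, "CYP1A2", "K07409"),
            (1576, "CYP3A4", "K07424"),
            (10, "NAT2", "K00622"),
        ],
    )
    con.executemany(
        "INSERT INTO pathway_gene VALUES (?, ?)",
        [("hsa05204", 1543), ("hsa05204", 1544), ("hsa05204", 1576), ("hsa05204", 10)],
    )
    return con


def symbols_in_pathway(con, pathway_id):
    """Follow the shared key from the pathway table into the gene table."""
    rows = con.execute(
        "SELECT gene.symbol FROM pathway_gene "
        "JOIN gene ON gene.gene_id = pathway_gene.gene_id "
        "WHERE pathway_gene.pathway_id = ? ORDER BY gene.symbol",
        (pathway_id,),
    )
    return [row[0] for row in rows]


print(symbols_in_pathway(build_db(), "hsa05204"))
```

Neither table alone can answer "which gene symbols are in this pathway?"; the join on `gene_id` is what links them.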
Assembly
Of the ~13,000,000,000,000,000 bp of sequence data we can generate each year, most is not the full length of the DNA or RNA molecule.
Instead, scientists get back multiple copies of their genome (or transcriptome), but all in short segments (between 50 bp and several kb).
OLC Assembly
Steps of Overlap-Layout-Consensus (OLC):
1) Let's think of a genome like the text of a book. We get back multiple copies of the book, but instead of being nicely bound, we get randomly shredded text, all mixed together from our multiple copies:
ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do
Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once
lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
2) We look for lines that overlap for more than some minimum number of letters. (In these programs all overlaps are found first, then a single “path” is found through this “graph” of overlaps.)
  ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do
Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once
 lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
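The overlap and layout steps can be sketched in Python as follows. This is a deliberately simplified illustration, not how any particular assembler is implemented: it only accepts exact suffix-to-prefix matches, it walks the overlap graph greedily, and it assumes at least one fragment has nothing overlapping into it. The three fragments and the `min_len` default are made up for the example; real assemblers tolerate errors like the "vory" and "tister" typos above.

```python
from itertools import permutations


def overlap(a, b, min_len=5):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when the best match is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=5):
    """Map each ordered pair (a, b) with a usable overlap to that overlap's length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k > 0:
            graph[(a, b)] = k
    return graph


def layout(reads, min_len=5):
    """Greedily follow one path through the overlap graph, merging as we go."""
    graph = overlap_graph(reads, min_len)
    has_incoming = {b for (_, b) in graph}
    # Start from a fragment that nothing overlaps into (assumes one exists).
    current = next(r for r in reads if r not in has_incoming)
    used = {current}
    text = current
    while True:
        candidates = [(k, b) for (a, b), k in graph.items() if a == current and b not in used]
        if not candidates:
            return text
        k, current = max(candidates)  # follow the longest overlap
        used.add(current)
        text += current[k:]  # append only the part that is not already covered


# Hypothetical, error-free fragments given in shuffled order.
fragments = [
    "very tired of sitting by her sister",
    "Alice was beginning to get",
    "beginning to get very tired",
]
print(layout(fragments))
```

The graph here has just two edges, so the path is unambiguous; with repeated text (or repeated DNA) several paths compete, which is what makes real assembly hard.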
3) We move column by column, counting the letters in each column, and make a note of the most common letter (take the consensus).
  ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do
Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once
 lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do
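The column-by-column vote can be sketched in Python using the three shredded lines from the slides. The offsets (2, 0, 1) are an assumption about how the lines sit against each other once the overlaps are found, and the function name is hypothetical. The sketch uses a plain majority vote with no gap handling, so it only works when the lines differ by substitutions.

```python
from collections import Counter

# (offset, text): where each line starts once laid out against the others.
# The offsets are assumed from the overlap step; the typos are in the original lines.
PLACED_READS = [
    (2, "ice was beginning to get very tired of sitting by her tister on the bank, and of having nothing to do"),
    (0, "Alice was beginning to get vory tired of sitting by her sister on the bank, and of having nothing to do: once"),
    (1, "lice was beginning to get very tired of sitting by her sister on the bank, and of having nothing"),
]


def consensus(placed_reads):
    """Take the most common letter in every column of the layout."""
    width = max(offset + len(text) for offset, text in placed_reads)
    letters_out = []
    for col in range(width):
        # Collect the letter from every line that covers this column.
        column = [
            text[col - offset]
            for offset, text in placed_reads
            if offset <= col < offset + len(text)
        ]
        if column:  # skip columns that no line covers
            letters_out.append(Counter(column).most_common(1)[0][0])
    return "".join(letters_out)


print(consensus(PLACED_READS))
```

The "vory" and "tister" errors are each outvoted two to one. Columns covered by a single line simply keep that line's letter, so this sketch also retains the trailing ": once", which the consensus shown on the slide leaves out.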
Assemblies
Figure: relative reflectance of EWC versus wavelength (400 to 800 nm) for big bluestem and sand bluestem, each shown as (removed) and (intact). Bischof B.
Figures: for sand bluestem, balsam, and bittersweet, one panel plots cumulative length of sequences (Mb) and number of sequences (x 10^5) against assembly k-mer value or name, and a second panel plots contig length (kb) at N75, N50, and N25. The sand bluestem assemblies use k-mer values 23 to 61 plus MIRA (454) and MIRA cluster; the balsam and bittersweet assemblies use k-mer values 27, 37, 47, and 57 plus merge, CDH cluster, and MIRA cluster. Image sources: homenursery.com, gardeninginsomnia.com.
Two assembly-statistics tables accompany the balsam and bittersweet plots:
k-mer         N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27            1.219     2.028     3.126     142.633358                           1.28113
37            1.206     2.008     3.087     128.100083                           1.1091
47            1.195     1.977     3.051     113.176134                           0.93839
57            1.271     2.035     3.096     102.507455                           0.82755
merge         1.41      2.211     3.331     345.752982                           2.31102
CDH cluster   1.44      2.27      3.422     84.202533                            0.59174
MIRA cluster  1.804     2.69      3.941     105.920843                           0.50279

k-mer         N75 (kb)  N50 (kb)  N25 (kb)  Cumulative length of sequences (Mb)  Number of sequences x 10^5
27            1.213     2.11      3.221     175.505163                           1.61952
37            1.176     2.026     3.068     154.222168                           1.36947
47            1.168     1.948     2.932     129.331497                           1.07545
57            1.218     1.974     2.95      111.672465                           0.90385
merge         1.404     2.23      3.299     418.762352                           2.77833
CDH cluster   1.399     2.274     3.339     96.411479                            0.70852
MIRA cluster  1.825     2.676     3.856     123.666263                           0.59598
Red flour beetle
Day E.
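The slides do not define the N75, N50, and N25 values plotted above. Under the usual definition (an assumption here, though it is consistent with N75 < N50 < N25 in the tables), N50 is the contig length at which, counting from the longest contig down, at least 50% of the total assembly length has been covered. A minimal Python sketch, with made-up contig lengths and a hypothetical function name:

```python
def n_value(lengths, fraction):
    """Contig length at which the longest contigs first cover `fraction` of the total.

    `fraction` is 0.5 for N50, 0.25 for N25, 0.75 for N75.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be a non-empty list of contig lengths")


# Hypothetical contig lengths in bp (total 2500).
contigs = [100, 200, 300, 400, 500, 1000]
for label, fraction in [("N25", 0.25), ("N50", 0.5), ("N75", 0.75)]:
    print(label, n_value(contigs, fraction))
```

A higher N50 means more of the assembly sits in long contigs, which is why it is a common way to compare assemblies built with different k-mer values.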
What can you do to get prepared?
If you are interested in studying bioinformatics, here is an outline of increasingly complex levels of skills you might work towards:
•Layer 1 – Using web to analyze biological data
•Layer 2 – Ability to install and run new programs
•Layer 3 – Writing own scripts for analysis in PERL, python or R
•Layer 4 – High level coding in C/C++/Java for implementing existing algorithms or modifying existing codes for new functionality
•Layer 5 – Thinking mathematically, developing own algorithms and implementing in C/C++/Java
-Manoj Samanta http://www.homolog.us/blogs/2011/07/22/a-beginners-guide-to-bioinformatics-part-i/
K-INBRE resources
Over the fall semester the Bioinformatics Core and Virginia Rider from Pittsburg State University will be hosting an undergraduate bioinformatics club.
Our first topic will be command-line blast. Students will get an account on Beocat (Kansas’ largest compute cluster).
http://bioinformaticsk-state-undergrad.blogspot.com
K-INBRE hosts a journal club, Wednesday at noon, via PolyCom to discuss current bioinformatics tools.
http://bioinformaticsk-state.blogspot.com/
Bradley Olson and K-INBRE – Perl
Justin Blumenstiel et al. – Python
http://bioinformaticskstateperl.blogspot.com/
K-INBRE and i5K have begun a GitHub script-sharing organization to archive and share scripts.
https://github.com/i5K-KINBRE-script-share
Figure: layout of the i5K-KINBRE-script-share GitHub organization. Scripts are grouped by category of ‘omics’ tool (RNA-Seq annotation and comparison; genome annotation and comparison; genome and transcriptome assembly; read cleaning and format conversion), then by lab or research group (for example KSU bioinfo lab, Olson lab), each with a readme giving a list and description of scripts.
-Git has very well developed version control built in http://git-scm.com/video/what-is-version-control
-Easy to search
-More advantages are reviewed in this quick introduction http://git-scm.com/video/quick-wins
-Provides continuity within labs (as students and postdocs rotate out)
-Increases collaboration and sharing of workflows within our community
-It is also a good way to distribute the code you describe in a publication.
-Git is also widely used by beginners as well as by developers of technology and software in the omics community, including:
https://github.com/broadinstitute (The Broad Institute)
https://github.com/lh3 (Li H., developer of BWA etc.)
https://github.com/dzerbino (Daniel Zerbino, developer of Oases and Velvet)
https://github.com/PacificBiosciences
Questions?
Contact information: [email protected]
K-INBRE Bioinformatics Core:
http://www.kumc.edu/kinbre/bioinformatics.html
http://bioinformatics.k-state.edu/