Bioinformatics Tools in Context
How Informatics Improves Research Every Day
Adelaide Fletcher, MLIS
Tzu L. Phang Ph.D.
July 27, 2012
True or False?
You have to be a collaborator on someone’s clinical trial to make discoveries with their genetic data...
• Stanford School of Medicine's Atul Butte identified a new drug target for diabetes by downloading data from 130 gene-expression studies in mice, rats, and humans that were done by other researchers, then running a meta-analysis to look for a common link.
• Wet lab experiments are more for validating hypotheses than for making discoveries.
Meet Our Hero...
• Name: Hunter
• Research Interests: The role of mammary epithelial cells in breast cancer
• Goal: Develop a genetic drug target for breast cancer
• Post-grad experience: < 1 year
• Funding: $0
Finding out what’s known
• Google Scholar - http://scholar.google.com
• Web of Science - http://isiknowledge.com/WOS (http://hsl-ezproxy.ucdenver.edu/login?url=http://isiknowledge.com/WOS)
Google Scholar
• http://scholar.google.com - search “mammary epithelial cells”
Now that we’ve found “Data” what are we going to “Tzu”?
http://cctsi.ucdenver.edu/RIIC
GPL (Geo PLatform)
• Describes the list of elements on the array (cDNAs, oligonucleotide probesets, ORFs, antibodies)
• Each platform is assigned a unique and stable GEO accession number (GPLxxx)
• Example: GPL570, Affymetrix GeneChip Human Genome U133 Plus 2.0 Array
GSM (Geo SaMple)
• Describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it
• A Sample entity must reference only one Platform and may be included in multiple Series
• Example: GSM300166 (remember HW 2??!) PostcentralGyrus_female_91yrs_indiv10
GSE (Geo SEries)
• Defines a set of related Samples considered to be part of a group
• Provides a focal point and description of the experiment as a whole
• Example:
Let’s look at an example
• Go to the GEO site
• Under “GEO accession”, type:
– GSE11882
• Find these terms:
– GPL
– GSM
– GSE
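For anyone who would rather script the lookup than type into the search box, here is a minimal sketch in Python. It only sorts an accession into the record types named on these slides and builds a link to the GEO accession viewer. The function name `describe_accession` is made up for illustration, and the regular expression assumes an accession is just a prefix followed by digits.

```python
import re

# GEO accession prefixes described on the slides.
GEO_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

# Accession viewer on the NCBI GEO site.
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="


def describe_accession(accession):
    """Return (record type, viewer URL) for an accession such as 'GSE11882'.

    Hypothetical helper: assumes accessions look like PREFIX + digits.
    """
    acc = accession.strip().upper()
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)(\d+)", acc)
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    return GEO_TYPES[match.group(1)], GEO_ACC_URL + acc


# The three accessions mentioned on the slides.
for acc in ["GPL570", "GSM300166", "GSE11882"]:
    kind, url = describe_accession(acc)
    print(f"{acc}: {kind} -> {url}")
```

Nothing here touches the network; it just prints links you can open in a browser.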
GDS (Geo DataSet)
• Curated sets of GEO Sample data
• Represents a collection of biologically and statistically comparable GEO Samples
– Same platform
– Shared common set of probe elements
– Samples’ intensities calculated in an equivalent manner (background correction, normalization, etc.)
• Example: GDS200 (see next page)
What’s wrong with the approach?
• Only shows one gene at a time
• Hard to select a gene set for downstream analysis such as clustering
• Hard to output a gene list
BRB-ArrayTools
• Free, open-source software
• Microsoft Excel plug-in
• Only works on the Windows platform
• Subject to all of Excel's limitations
http://linus.nci.nih.gov/BRB-ArrayTools.html
BRB-ArrayTools
• Biometric Research Branch (BRB)
– Statistical/biomathematical component
– Division of Cancer Treatment and Diagnosis (NCI)
• Richard Simon & BRB-ArrayTools Development Team
• BRB-ArrayTools
– Visualization and statistical analysis of DNA microarray gene expression data
– Developed by statisticians
– Excel add-in
– Analytic/visualization tools: R statistical system, C and Fortran programs, Java applications
– Visual Basic for Applications integrates components
Objectives
• “provide scientists with software … without requiring them to learn a programming language”
• “encapsulate into software the experience of professional statisticians”
• “facilitate education of scientists in statistical methods for the analysis of DNA microarray data”
Installing BRB-ArrayTools
• Windows 98/2000/NT/XP/Vista/7
• Loads package as add-in to Microsoft Excel
– Excel 2000 or later
– Creates ArrayTools menu on Excel menu bar
• Intensive computations performed in R or compiled programs
Installation
• Go to “http://linus.nci.nih.gov/BRB-ArrayTools.html”
• Click on “All required components in ONE file”
Installation
• Click on “Download Standard Version 3.7.1 (All in one file)”
• When prompted, enter User name and Password (these will be sent to you after your FREE registration)
Demonstration
Installation
• Follow the step-by-step procedures
• In the interest of time, the software has already been installed on your machine
Demonstration
List of 220 or so genes with potential indications for treatment or further understanding of breast cancer pathways
List of 6 or so genes with a shared biological pathway (transcription factor activity)
Do these genes have a cancer connection?
• In NCBI GENE search: “(TBX6 OR ZNF423 OR NR4A3 OR SCAND2 OR CEBPE OR SIX2) AND Cancer”
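The same search can also be put together in a script. The sketch below, in Python, assembles the Boolean query from the six gene symbols and builds a request URL for NCBI's E-utilities `esearch` endpoint with `db=gene`. The helper names (`build_gene_query`, `build_esearch_url`) and the `retmax` value are illustrative assumptions; the code only constructs the URL and does not send the request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(symbols, topic):
    """Join gene symbols with OR, then AND the result with a topic word."""
    return f"({' OR '.join(symbols)}) AND {topic}"


def build_esearch_url(term, db="gene", retmax=20):
    """Return an esearch URL for the given term (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# The six genes from the slide.
genes = ["TBX6", "ZNF423", "NR4A3", "SCAND2", "CEBPE", "SIX2"]
term = build_gene_query(genes, "Cancer")
print(term)
print(build_esearch_url(term))
```

Building the query from a list makes it easy to swap in the longer list of 220 or so genes later, although a query that long may need to be split into batches.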
Browsing Genes and Genomes
• NCBI
• Ensembl
• UCSC Genome Browser
– Which one to use?
• http://cctsi.ucdenver.edu/RIIC/Pages/TranslationalInformaticsVideos.aspx#GenomeBrowsers
– A full day of Ensembl training: http://hsl2.ucdenver.edu/ensembl/
BLASTing
• To what gene does this nucleotide sequence most likely belong?
• gggtgaacag ccgcacggga gtaggtacgc acctgacctc gctggcactg ccgggcaagg cagagggtgt ggcgtcgctc accagccagt gcagctacag cagcaccatc gtccatgtgg gagacaagaa gccgcagccg gagttagaga tggtggaaga tgctgcgagt gggccagaat
• http://blast.ncbi.nlm.nih.gov/Blast.cgi
• http://www.ensembl.org/Danio_rerio/blastview
• http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
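Sequences copied from a record often come in space-separated blocks of ten, like the one on this slide. The web forms above generally cope with that, but if you want a clean FASTA record first, a small Python sketch such as the following can help. The function names and the `query_1` label are made up for illustration, and only the first three blocks of the slide's sequence are used as sample input.

```python
def clean_sequence(raw):
    """Strip all whitespace and upper-case a pasted nucleotide sequence."""
    seq = "".join(raw.split()).upper()
    invalid = set(seq) - set("ACGTN")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return seq


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def gc_fraction(seq):
    """Fraction of bases that are G or C (a quick sanity check)."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# First three ten-base blocks of the query on the slide.
raw = "gggtgaacag ccgcacggga gtaggtacgc"
seq = clean_sequence(raw)
print(to_fasta("query_1", seq), end="")
print("length:", len(seq), "GC fraction:", round(gc_fraction(seq), 2))
```

The cleaned record can then be pasted into any of the three search forms listed above.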
BLASTing
• What about this one?
• acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc taataaaaaa catttatttt
Genetics in Literature
• What does this sequence:
• ATTAAAGATGATTTTTACAGTCAATGAGCCACGTCAGGGAGCGATGGCACCCGCAGGCGGTATCAACTGATGCAAGTGTTCAAGCGAATCTCAACTCGTTTTTTCCGGTGACTCATTCCCGGCCCTGCTTGGCAGCGCTGCACCCTTTAACTTAAACCTCGGCCGGCCGCCCGCCGGGGGCACAGAGTGTGCGCCGGGCCGCGCGGCAATTGGTCCCCGCGCCGACCTCCGCCCGCGAGCGCCGCCGCTTCCCTTCCCCGCCCCGCGTCCCTCCCCCTCGGCCCCGCGCGTCGCCTGTCCTCCGAGCCAGTCGCTGACAGCCGCGGCGCCGCGAGCTTCTCCTCTCCTCACGACCGAGGCAGGTAAACGCCCGGGGTGGGAGGAACGCGGGCGGGGGCAGGGGAGCCGCGGGGGCCGAGTGAGGACCCCGGGCCTCGGGTCCCAGGCGCAAGGGTGCCCGGCCGGGCGGGGTCGGGACCCCAGTGAGGAGGGGCCGGGGGCTGCCCCGCGGGCGCGTGACGCGTCTCGGGCCTGCCCGGCTGCGCTGGTCTCCGCTCGGGTGAGGCGGCTTGGCTTCGCTTTTCAGGTTAGGAAAGCTCCCTTTACTGCGCGTTGGGGGGCTGGGGGAGCTGGCGGAGCCCCGTTAGGGAGGTCGGTGGCGCCGGGGTGTCTCAGCGCCCCCTGCACCCCGCGCGGGTCCGGCCCAGCGGGCGATCGCTGGCGCCCAGGGAACTCCGGGAGGGCCGCCAGCGGGCTCCGCAGGGCGCGGGGCGGGGAGGGGCGCCTGGGGGCCGCGGGGCTCGCGCTCCCCGCCCGTTGGCCGCCCCTCGGAGGCCGAGATCGGGGCCCAGAACGCCCCTTGGCAAGGCCTGGCGCTTCCGCGATGCCCAGAGGGTGCTTGGGGGGATGGAGAGAGGGGCGCCCGCCGGGGGAGTTCCGGGAGCCTCGGTGCCTCCCGCCGCAGCTGCAGCGTTCCTCCCGGGAGGCGGCCCAGCCCTTCATCCTCGCCGCCTGAGCTTCTCCGAGGGGGGCTGCAGCCTTGCGGCCGTTGCCACCGCCTGGAGAAGCGGCCCACGCGGACTGACGGGCGGGGGCGGGGCCTCGGGCCTCGGCGGGGGCGGGGTCCGGGGAGGCCCCACCCTCTGTTCTCCAGGGGCGGGGAGAGAGGAGCTGCAGGTCTGCGGCCTGGC
• Have to do with this book? http://www.amazon.com/The-Family-That-Couldnt-Sleep/dp/1400062454
Phylogenetics
• Scientific procedure to reconstruct the evolutionary history of organisms or sequences
• Evolutionary theory: groups of similar organisms are descended from a common ancestor
• Cladistics:
– Developed by Willi Hennig, German entomologist (1950)
– Phylogenetic systematics: a mathematical approach
– Method of taxonomic classification of organisms based on their evolution
• So, why do we study phylogenetics?
What can phylogenetics tell you?
• Discovering the function of a gene
– Is your gene of interest orthologous to another well-characterized gene from another species?
• Retracing the origin of a gene
– Most genes travel together through evolutionary time
– Determine if genes undergo genomic modification such as mutation, deletion, duplication, speciation, loss and gain of function, inactivation, etc.
DNA: a good measurement
• Advantages over morphological taxonomic characters:
– Character states are unambiguous
– Large number of characters can be used to perform the analysis
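To make "unambiguous character states" concrete: each aligned position is a character whose state is A, C, G, or T, so two sequences can be compared by simply counting mismatches. The Python sketch below computes that proportion (often called a p-distance) for every pair in a toy alignment. The three sequences are invented for illustration, not real data, and this is only a first step, not a tree-building method.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip sites with a gap in either sequence
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


# Made-up toy alignment (hypothetical sequences, already aligned).
alignment = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTACGTTC",
    "species_C": "ACGAACCTTC",
}

for (name1, seq1), (name2, seq2) in combinations(alignment.items(), 2):
    print(name1, name2, p_distance(seq1, seq2))
```

With many more sites than any morphological data set could offer, distances like these become stable enough to feed into a tree-building program.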
Find collaborators
• Colorado Profiles: http://profiles.ucdenver.edu/Search.aspx
– Search: “mammary epithelial cells”
• Colorado Translational Informatics Community on Facebook: http://www.facebook.com/pages/Colorado-Translational-Informatics-Community/136023206424789
Get Informatics Help
• http://cctsi.ucdenver.edu/RIIC
– 5 x 5 Videos
– Find informatics experts
– Monthly podcast
– SeDLAC (Secondary Database Library and Analysis Center)
– Consultation and Data Analysis
Get $$
• NLM Professional Development Repository: http://cnx.org/content/m37008/latest/
• CCTSI Funding: http://cctsi.ucdenver.edu/Funding/Pages/default.aspx
• UC Denver Office of Grants and Contracts: http://www.ucdenver.edu/academics/research/AboutUs/GrantsContractsOffice/Pages/default.aspx
Find a Journal to Publish Findings
• http://www.biosemantics.org/jane/ - Example Search:
• “cDNA microarrays and a clustering algorithm were used to identify patterns of gene expression in human mammary epithelial cells growing in culture and in primary human breast tumors. Clusters of coexpressed genes identified through manipulations of mammary epithelial cells in vitro also showed consistent patterns of variation in expression among breast tumor samples. By using immunohistochemistry with antibodies against proteins encoded by a particular gene in a cluster, the identity of the cell type within the tumor specimen that contributed the observed gene expression pattern could be determined. Clusters of genes with coherent expression patterns in cultured cells and in the breast tumors samples could be related to specific features of biological variation among the samples. Two such clusters were found to have patterns that correlated with variation in cell proliferation rates and with activation of the IFN-regulated signal transduction pathway, respectively. Clusters of genes expressed by stromal cells and lymphocytes in the breast tumors also were identified in this analysis. These results support the feasibility and usefulness of this systematic approach to studying variation in gene expression patterns in human cancers as a means to dissect and classify solid tumors.”