where we start? we will start with the biological problem translate that to what the data looks like...


Upload: suzanna-sparks

Post on 28-Dec-2015


Page 1

Where we start?

• We will start with the Biological Problem
• Translate that to what the data looks like
• Think about issues with pre-processing data
• Ways to Analyze data
  – Using existing methods
  – Adapting existing methods
  – Using newer ideas

Page 2

Some Basic Biology

Let's get familiar with some terms:
• Cell
• Nucleus
• DNA
• Genome
• Gene
• Central Dogma of Molecular Biology

• DISCLAIMER: My Biology is VERY rudimentary, so don’t count on it TOO much.

Page 3

Biological Hierarchies

Molecule

Cell

Tissue

Organ

Organism

Page 4

For Example

• Organism: human
• Organ: say liver
• Cell
• Organelles: say nucleus, ribosome
• Macromolecules
  – DNA: 22+XX(Y) chromosomes: 3x10^9 bp
  – RNA: ~2000 molecules
  – Proteins: ~30,000-50,000
• Building blocks: Nucleotides (ATGC) and amino acids

Page 5

What is a Cell?

Page 6

Cell Nucleus

Page 7

DNA

• Genetic material (DNA) is present in the nucleus, as a DNA-protein complex called chromatin.

• The DNA is present as a number of discrete units known as chromosomes.

• Each DNA strand wraps around groups of small protein molecules called histones, forming a series of bead-like structures, called nucleosomes, connected by the DNA strand.

Page 8

DNA

                             

Page 9

Genome

The sum of all information contained in the DNA for any living thing. The sequence of all the nucleotides in all the chromosomes of an organism.

Page 10

Gene

• A hereditary unit consisting of a sequence of DNA that occupies a specific location on a chromosome and determines a particular characteristic in an organism.

• The nucleus of each eukaryotic (nucleated) cell has a complete set of genes.

• Each gene provides a blueprint for the synthesis (via RNA) of enzymes and other proteins and specifies when these substances are to be made.

• Genes undergo mutation when their DNA sequence changes.

Page 11

Gene: More Facts

• Genes govern both the structure and metabolic functions of the cells, and thus of the entire organism; when located in reproductive cells, they pass their information to the next generation.

• Chemically, each gene consists of a specific sequence of DNA building blocks called nucleotides. Each nucleotide is composed of three subunits: a nitrogen-containing compound, a sugar, and phosphoric acid. Genes may vary in their precise makeup from person to person, including, for example, one nucleotide in a certain location in some people but another nucleotide in that location in others.

Page 12

Genes: More Facts

• Geometrically, the gene is a double helix formed by the nucleotides.

• Gene loci are often interspersed with segments of DNA that do not code for proteins; these segments are termed “junk DNA.”

• When junk DNA occurs within a gene, the coding portions are called exons and the noncoding (junk) portions are called introns.

• Junk DNA makes up about 97% of the DNA in the human genome and, despite its name, parts of it (for example, regulatory sequences) are necessary for the proper functioning of the genes.
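To make the exon/intron distinction concrete, here is a minimal Python sketch. The toy sequence and the exon coordinates are made up for illustration, and `splice` is a hypothetical helper, not a function from any library.

```python
def splice(gene, exons):
    """Join the exon segments of a gene, dropping everything in between.

    gene  -- DNA sequence as a string
    exons -- list of (start, end) pairs, 0-based, end exclusive
    """
    return "".join(gene[start:end] for start, end in sorted(exons))


# Toy gene: uppercase = exons, lowercase = intron (made-up sequence).
gene = "ATGGCC" + "gtaagtttag" + "AAATGA"
exons = [(0, 6), (16, 22)]

coding = splice(gene, exons)
print(coding)  # ATGGCCAAATGA

# Fraction of this toy gene that is noncoding (intron).
noncoding_fraction = 1 - len(coding) / len(gene)
```

In this toy gene 10 of the 22 bases are intron; in real genes the intron share is often much larger.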

Page 13

Some more facts about genes:

• Almost every cell of the body of any organism contains the same set of genes.

• Only a fraction of these genes are "expressed" (turned on), and these confer unique properties to each cell type.

• Scientists study the kinds and amounts of expressed genes in a cell, which in turn provides insights into how the cell responds to its changing needs.

• Gene expression is a highly complex and tightly regulated process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs.

Page 14

Central dogma of molecular biology

• Each gene is transcribed (at the appropriate time) from DNA into mRNA, which then leaves the nucleus and is translated into the required protein.

• Any gene which is active in this way at any particular time is said to be expressed.

• THIS IS CRUCIAL TO REMEMBER FOR MICROARRAYS

Page 15

Breakthrough: Sequencing

• Sequencing: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA.

• So when the genome was sequenced there was a flurry of research in this area.

Page 16

Some Questions that are being asked:

• What genes contribute to cancer?
• Are these genes similar in mice, rats, humans and other species?
• What genes are involved in depression?
• What genes respond to cocaine?
• What genes are present in a particular cancer cell type and not in others?
• How does human thinking differ from that of monkeys (given 99.2% genome homology)?

Page 17

The National Center for Biotechnology Information states that:

• The proper and harmonious expression of a large number of genes is a critical component of normal growth and development and the maintenance of proper health.

• Disruptions or changes in gene expression are responsible for many diseases.

Page 18

How DO we answer the questions asked?

• What is the BEST way to study Genes?

• How can we effectively answer questions related to genes?

• Should we focus on a FEW genes and follow them through time or across conditions, i.e., run a focused study?

• Should we look at many genes at once (sometimes the whole genome) and compare them all across conditions?

Page 19

Forward and Reverse genetics approaches in biology

Biological System (Organism)

Building Blocks (Genes/Molecules)

• Reverse Genetics Approach (Bioinformatics)
  – Discover all genes that are different in cancer cells as compared to control. (n=300) (t=1 month)

• Forward Genetics Approach (Experiments), e.g. the ras oncogene
  – Hypothesis: Specific alterations in genes lead to cancer. What are these genes? (t=10 years/lab)

Page 20

Reverse Genetics Approach

• Requires almost complete information of the genome: Sequences annotated and stored in a database

• THIS WILL BE OUR FOCUS FOR THIS CLASS.

• Hence we will have many genes (few conditions) and often few replicates.

• This falls under the general heading of GENOMICS.
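As a rough picture of what "many genes, few conditions, few replicates" means for the data, the sketch below builds a toy expression table in plain Python. All sizes, names and values are assumptions chosen for illustration; the numbers are random, not real measurements.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are repeatable

n_genes = 1000                       # assumed; real studies often have many more
conditions = ["control", "cancer"]   # few conditions
n_replicates = 3                     # assumed; few replicates per condition

# One column per sample, one row per gene.
samples = [f"{c}_{r}" for c in conditions for r in range(1, n_replicates + 1)]
expression = {
    f"gene_{g}": [random.gauss(8.0, 1.0) for _ in samples]
    for g in range(n_genes)
}

print(len(expression), "genes x", len(samples), "samples")
```

The table is far longer than it is wide (1000 rows against 6 columns here), which is the shape the statistical methods in this class have to deal with.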

Page 21

Biological Information Processing

• DATA: Genomics
• Storage and Retrieval: Database
• Summary, Analysis and Visualization: Statistics

Page 22

Outline for this class

• What types of data are we interested in?
  – Microarrays
  – RNA-seq
  – GWAS
• What types of experiments do they come from?
• What are the similarities and differences?
• What statistical models and methods are used to understand the structure of the data?
• What statistical techniques are used to analyze this data?
• Assumptions about distributions
• Pitfalls of the three different data types

Page 23

Types of Data

• To decide whether to run a Microarray or an RNA-seq experiment, the following have to be taken into account.
• Potential Deciding Factors:
  – What genome info do I have?
  – How much money do I have?
  – What statistical methods are we familiar with?
• Potential Goals are also important in the decision. The goal is?
  – Differential Expression
  – Absolute Quantification
  – Discovering Novel Genes
  – Isoform Expression
  – Low-Level Expression
  – Alternative Splicing
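For the first goal in the list, differential expression, a common summary is the log2 fold change between two conditions. The sketch below is a minimal illustration on made-up values. `log2_fold_change` is a hypothetical helper, and a real analysis would add normalization and a statistical test on top of it.

```python
import math
from statistics import mean


def log2_fold_change(treated, control):
    """log2 of the ratio of mean expression, treated over control."""
    return math.log2(mean(treated) / mean(control))


control = [100.0, 120.0, 80.0]   # made-up expression values, 3 replicates
treated = [400.0, 380.0, 420.0]

print(log2_fold_change(treated, control))  # 2.0
```

A value of 2 means the gene is expressed at four times the control level; 0 means no change, and negative values mean lower expression.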

Page 24

Common to both platforms

• Data are more or less reproducible
• Both are subject to high background noise
• Both supposedly have high correlation to gene content
• Similar statistical methods
• Both are subject to biases

Page 25

Pros and Cons

• Microarray: Pros
  – Reliable, robust, around for a while
  – Easily automated
  – Some consensus on statistical analysis
  – Quick turnaround
  – Lower cost

• Microarray: Cons
  – Dependent on prior knowledge
  – Cannot detect structural forms
  – No isoforms or low-level expression detected
  – Relative expression, NOT absolute quantification

• RNA-seq: Cons
  – NOT as reliable or robust; new
  – HARD to automate
  – Little consensus on statistical analysis
  – LONG turnaround
  – Higher cost

• RNA-seq: Pros
  – NOT dependent on prior knowledge
  – Can detect structural forms, alternative splicing
  – Isoforms or low-level expression detected
  – Absolute quantification
  – Increased dynamic range

Page 26

How they are different

• Microarray measures PROBE intensity
  – NEED to know the sequence prior to the experiment

• RNA-seq measures counts: the number of reads for a particular sequence
  – SO new sequences can be found
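The difference in data type can be sketched as follows. This is an illustration on made-up values, not output from a real platform: the gene names, the intensities and the read labels are all assumptions, and "novel_1" stands in for a sequence that has no probe on the array.

```python
import math
from collections import Counter

# Microarray: one continuous intensity per probe, and only for known sequences.
probe_intensity = {"geneA": 1520.4, "geneB": 310.7}
log_intensity = {gene: math.log2(v) for gene, v in probe_intensity.items()}

# RNA-seq: each read is assigned to a sequence and the reads are counted,
# so the measurement is a non-negative integer per sequence.
reads = ["geneA", "geneB", "geneA", "novel_1", "geneA", "novel_1"]
read_counts = Counter(reads)

print(read_counts["geneA"])    # 3
print(read_counts["novel_1"])  # 2

# Sequences seen in the reads that the array has no probe for.
novel = sorted(set(read_counts) - set(probe_intensity))
print(novel)  # ['novel_1']
```

Continuous intensities and integer counts call for different distributional assumptions, which is one reason the two platforms are analyzed differently.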