cs 466 introduction to bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/lecture1.pdf ·...

41
CS 466 Introduction to Bioinformatics Saurabh Sinha

Upload: nguyentruc

Post on 16-May-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

CS 466Introduction to Bioinformatics

Saurabh Sinha

Page 2: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

What is the course about?

• Algorithmic concepts, applied to sample (toy)problems in “bioinformatics”– Follows the text book

• “Real” bioinformatics research– Follows the best journals

• Not about practical training in the use ofpopular bioinformatics software

Page 3: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Grading

• Assignments: 40%– About one every two weeks

• Mid Term: 30%• Final: 30%

Page 4: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Expectations

• Some programming skills– Any programming language is fine

Page 5: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Administrative Details• Instructor:

– Saurabh Sinha– Room 2122, Siebel Center– Email: [email protected]

• Class hrs: Tue & Thu, 3:30pm-4:45pm, 1131SC• Office hrs: Tue, before class (2:30 - 3:30 pm) 2122SC• Web site: http://veda.cs.uiuc.edu/courses/fa08/cs466/• Welcome to sit in, if not taking for credit

Page 6: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Text books

• Jones and Pevzner: required

• Durbin et al.: recommended

Page 7: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Other course

• CS 591 BIO: weekly seminar onbiomedical informatics

• Fridays at 10:30 am.

Page 8: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Motivating bioinformatics

Page 9: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Special issue of journal Science, July 1, 2005.

Page 10: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

>What Is the Universe Made Of?>What is the Biological Basis ofConsciousness?>Why Do Humans Have So Few Genes?>ToWhat Extent Are Genetic Variation and Personal HealthLinked?>Can the Laws of Physics Be Unified?>How Much CanHuman Life Span Be Extended?>What Controls OrganRegeneration?>How Can a Skin Cell Become a NerveCell?>How Does a Single Somatic Cell Become a WholePlant?>How Does Earth's Interior Work?>Are We Alone in theUniverse?>How and Where Did Life on Earth Arise?>WhatDetermines Species Diversity?>What Genetic Changes MadeUs Uniquely Human?>How Are Memories Stored andRetrieved?>How Did Cooperative Behavior Evolve?>How WillBig Pictures Emerge from a Sea of Biological Data?>How FarCan We Push Chemical Self-Assembly?>What Are the Limits ofConventional Computing?>Can We Selectively Shut OffImmune Responses?>Do Deeper Principles Underlie QuantumUncertainty and Nonlocality?>Is an Effective HIV VaccineFeasible?>How Hot Will the Greenhouse World Be?>What CanReplace Cheap Oil -- and When?>Will Malthus Continue to BeWrong?

Page 11: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Many of the most profound scientificquestions of today are within therealm of bioinformatics research

Page 12: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

“Why do humans have so fewgenes ?”

Page 13: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

A simple organism

GENE

Raw

mat

eria

lsEnvironmental signal

Response (protein)

Page 14: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

A simple organism

GENE1

GENE2

GENE3

Page 15: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

A simple organism

GENE1

GENE2

GENE3

GENE4

GENE5

GENE6

GENE7

GENE8

GENE9

GENE10

Page 16: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

A complex organism

GENE1

GENE2

GENE3

GENE4

GENE5

GENE6

GENE7

GENE8

GENE9

GENE10

Complex circuit of interactions

Page 17: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Regulatory networks• This may be the reason why humans

have so few genes (the circuit, not thenumber of switches, carries thecomplexity)

• Bioinformatics can unravel suchnetworks, given the genome (DNAsequence) and gene activity information

Page 18: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Decoding the regulatory network

• Find patterns (“motifs”) in DNAsequence that occur more often thanexpected by chance

• Statistics on DNA sequences and words

• Knowing these can tell us about edgesin the regulatory network

Page 19: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Decoding the regulatory network

• An example computational problem:• Given a string of length 10,000 over the

alphabet {A,C,G,T}• Count the number of occurrences Nw

ofevery 6 letter word w

• Are there specific words that occurmore frequently than expected bychance?

Page 20: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Decoding the regulatory network

• What is expected by chance?• What is “more frequently”?• Interesting mathematical questions

• The Moby Dick example

Page 21: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How
Page 22: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Decoding the regulatory network

• What is expected by chance?• What is “more frequently”?• Interesting mathematical questions

• The Moby Dick example• This is called “motif finding”, and helps

decode the regulatory network!

Page 23: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Comparing DNA• Humans are about 99.9% identical to each

other, DNA-wise.

• How do we know that ?

• Compare the genome of two individuals.

• The computational problem: Are twosequences similar ?

Page 24: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Sequence alignment• Why is this a problem?

• The two sequences will differ by “substitutions”,“insertions” and “deletions” accumulated duringevolution

• The comparison algorithm has to be robust tosuch possibilities.– A special technique called “dynamic programming”

does all this, and is “efficient”

Page 25: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Sequence alignment

• Why should we care?

• Compare human genome with fish. You’ll seesome portions that are highly similar.

• These “conserved” portions are often genes…

• … or regulatory sequences! The regulatorynetwork again.

Page 26: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

On counting genes

• The original question was “Why do humanshave so few genes?”

• How do we know how many genes there arein the human genome ? (And where they arein the genome)

• Experiments can be designed, butbioinformatics plays a major role

Page 27: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Gene prediction

• The task of predicting the locations ofgenes in a new genome (“annotation”)

• Gene prediction software

• The more sophisticated ones use“Hidden Markov models” (HMM) andmultiple species comparison

Page 28: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

HMM for Gene Prediction

http://researchweb.watson.ibm.com/journal/rd/453/birney.html

What is this graph?

It captures the “architecture” of a gene.

It translates into a “probabilisticmodel”.

It leads directly to a gene finding algorithm

Page 29: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

“What controls organregeneration ?”

“How does a single somaticcell become a whole plant ?”

Page 30: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Development and Regeneration

• Developmental biology

• The timeline from a single cell (with geneticmaterial from mother and father) to amulticellular embryo, and to an adult

• A paradox : All cells in the adult body havethe same DNA, then how come different cellsare different ?

Page 31: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Regulatory networks again

• Bioinformatics used to scan entire genomefor regions that participate in “segmenting”the embryo

• Hidden Markov models used to detect suchregions

• Multiple species comparison aids discovery

Page 32: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

“How did cooperative behaviorevolve?”

Page 33: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Social behavior and bioinformatics?

• Social behavior in honey bees

• Young worker bees are nurses in the hive;older ones go out to forage

• This behavioral maturation is determined byneeds of colony. What is the genetic basis ofthis ?

Page 34: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Social behavior and bioinformatics

• Illinois team scanned the genome tounderstand this (2006)

• Regulatory network of social behavior

• Statistical tools, such as Hypergeometric test

• Machine learning tools such as “supportvector machine classification”

Page 35: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Other challenges

Page 36: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Protein structure prediction

http://www.denizyuret.com/students/vkurt/thesis-main_dosyalar/image006.gif

Page 37: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Protein structure prediction• Can we predict the 3-D structure of a protein

from its amino acid sequence ?

• Why ?– One good reason: structure gives clues about function. If we

can tell the structure, we can perhaps tell the function

– We can design amino acid sequences that will fold intoproteins that do what we want them to do. Drug design !!

• Neural networks, a popular technique incomputer science, applied to this problem

Page 38: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Metagenomics• Most studies to date are on genomes of one

species

• A sample from the soil contains hundreds ofbacteria, thousands of viruses. Can we studyall of these ?

• The Sorcerer II expedition• http://www.sorcerer2expedition.org/version1/HTML/main.htm

Page 39: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Many more challenges

• New types of data come due totechnological breakthroughs in biology

• High throughput data carriesunprecedented amount of information

• Too much noise• Bioinformatics removes the noise and

reveals the truth

Page 40: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Bioinformatics

• Is not about one problem (e.g., designingbetter computer chips, better compilers,better graphics, better networks, betteroperating systems, etc.)

• Is about a family of very different problems,all related to biology, all related to each other

• How can computers help solve any of thisfamily of problems ?

Page 41: CS 466 Introduction to Bioinformaticsveda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture1.pdf · Introduction to Bioinformatics ... –Any programming language is fine. ... •How

Bioinformatics and You

• You can learn the tools of bioinformatics• These tools owe their origin to computer

science, information theory, probabilitytheory, statistics, etc.

• You can learn the language of biology,enough to understand what the problems are

• You can apply the tools to these problemsand contribute to science