Bioinformatics: Theory and Practice (生物信息学理论和实践), 唐继军
![Page 1: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/1.jpg)
Bioinformatics: Theory and Practice (生物信息学理论和实践)
唐继军, [email protected]
Center for Computational Biology, Beijing Forestry University (北京林业大学计算生物学中心)
www.bjfuccb.edu
![Page 2: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/2.jpg)
Hash
• Initialize: my %hash = ();
• Add a key/value pair: $hash{$key} = $value;
• Set several keys at once (list assignment replaces the existing contents):
    %hash = ( 'key1', 'value1', 'key2', 'value2' );
    %hash = ( key1 => 'value1', key2 => 'value2', );
• Delete: delete $hash{$key};
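The hash operations above can be tried on a small example. This is a minimal sketch; the %codon table and its entries are illustrative data, not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical example data: a few codons and their amino acids.
my %codon = (
    ATG => 'Met',
    TGG => 'Trp',
);

# Add one key/value pair, then delete another.
$codon{TAA} = 'Stop';
delete $codon{TGG};

# Walk the hash in sorted key order.
for my $key (sort keys %codon) {
    print "$key => $codon{$key}\n";
}

# exists tests for a key without creating it.
print "TGG is gone\n" unless exists $codon{TGG};
```

Note that exists, unlike reading $codon{TGG} inside a nested structure, never creates the key as a side effect.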
![Page 3: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/3.jpg)
Print to file
• Open a file for writing:
    open FILE, ">filename.txt";
    open (FILE, ">filename.txt");
• Print to the file:
    print FILE $str;
![Page 4: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/4.jpg)
#Append
open(FILE, ">>out") or die "Cannot open file to write";
print FILE "Test\n";
close FILE;
exit;
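The same write and append steps can also be written with the three-argument form of open and a lexical filehandle, which many Perl programmers prefer today. This is a sketch under assumptions: the helper names write_lines and read_lines are hypothetical, and a temporary file stands in for a real output file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write lines to a file; $mode is '>' (overwrite) or '>>' (append).
sub write_lines {
    my ($path, $mode, @lines) = @_;
    open(my $fh, $mode, $path) or die "Cannot open $path: $!";
    print {$fh} "$_\n" for @lines;
    close($fh) or die "Cannot close $path: $!";
}

# Read a whole file back as a list of lines without newlines.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

my (undef, $path) = tempfile(UNLINK => 1);   # scratch file, removed at exit
write_lines($path, '>',  'first');           # create or overwrite
write_lines($path, '>>', 'second');          # append
print join(",", read_lines($path)), "\n";
```

Including $! in the die message reports why the open failed, for example a missing directory or a permission problem.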
![Page 5: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/5.jpg)
#!/usr/bin/perl
print "My name is $0 \n";
print "First arg is: $ARGV[0] \n";
print "Second arg is: $ARGV[1] \n";
print "Third arg is: $ARGV[2] \n";

$num = $#ARGV + 1;
print "How many args? $num \n";
print "The full argument string was: @ARGV \n";
![Page 6: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/6.jpg)
use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);
%rebase_hash = parseREBASE($ARGV[1]);

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );

exit;
![Page 7: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/7.jpg)
Regular Expression
• ^ beginning of string
• $ end of string
• . any character except newline
• * match 0 or more times
• + match 1 or more times
• ? match 0 or 1 times
• | alternative
• ( ) grouping; "storing"
• [ ] set of characters
• { } repetition modifier
• \ quote or special
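A few of the metacharacters listed above, applied to a short DNA string. This is a minimal sketch; the sequence in $dna is made up for illustration.

```perl
use strict;
use warnings;

my $dna = "ATGGCGTTTAAATGA";   # illustrative sequence

# ^ and $ anchor the pattern; [ACGT]+ is a character set repeated 1 or more times.
print "valid DNA\n" if $dna =~ /^[ACGT]+$/;

# ( ) stores what it matched in $1; | separates alternatives.
if ($dna =~ /(TAA|TAG|TGA)$/) {
    print "ends with stop codon $1\n";
}

# { } repeats the preceding item an exact number of times;
# with /g in list context, every captured match is returned.
my @runs = ($dna =~ /(A{3}|T{3})/g);
print "runs: @runs\n";
```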
![Page 8: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/8.jpg)
\
![Page 9: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/9.jpg)
[]
![Page 10: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/10.jpg)
![Page 11: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/11.jpg)
![Page 12: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/12.jpg)
$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/(\d)/) { print "The first digit is $1.";}
if($mystring =~ m/(\d+)/) { print "The first number is $1.";}
if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) { print "The date is $1-$2-$3";}
while($mystring =~ m/(\d+)/g) { print "Found number $1."; }
@myarray = ($mystring =~ m/(\d+)/g); print join(",", @myarray);
![Page 13: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/13.jpg)
![Page 14: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/14.jpg)
![Page 15: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/15.jpg)
Download and install programs
• Unzip or untar
    • unzip
    • If file.tar.gz, tar xvfz file.tar.gz
• Go to the directory and run "./configure"
• Then run "make"
![Page 16: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/16.jpg)
Exercises
• Download clustalw
• Try to install it
![Page 17: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/17.jpg)
System subroutine
system ("ls -ltr");
![Page 18: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/18.jpg)
Exercises 2
• Use pro.fasta
• Find an alignment for each triple of proteins
• Let's design the program together
• Use "system" in Perl
    • system ("command parameters");
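When calling external programs it helps to check whether the command succeeded. The sketch below wraps system in a hypothetical helper named run_command; the demonstration command simply re-invokes the Perl interpreter ($^X) so that it runs anywhere.

```perl
use strict;
use warnings;

# Run a command and report whether it succeeded.
# Passing a list to system avoids the shell, so arguments need no quoting.
sub run_command {
    my @cmd = @_;
    my $status = system(@cmd);
    if ($status == -1) {
        warn "Could not start $cmd[0]: $!\n";
        return 0;
    }
    if ($status != 0) {
        my $exit_code = $status >> 8;   # the high byte holds the exit code
        warn "$cmd[0] failed (exit code $exit_code)\n";
        return 0;
    }
    return 1;
}

# Demonstration: ask perl itself to exit successfully.
run_command($^X, '-e', 'exit 0') or die "command failed";
```

The return value of system is 0 on success, which is why scripts often write system(...) == 0 or die rather than system(...) or die.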
![Page 19: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/19.jpg)
sub ReadFasta {
    my ($fname) = @_;
    open(FILE, $fname) or die "Cannot open $fname\n";
    my $data = "";
    my @dnas = ();
    while (my $line = <FILE>) {
        if ($line =~ /^>/) {
            if ($data ne "") {
                push(@dnas, $data);
            }
            $data = "";
        }
        $data .= $line;
    }
    if ($data ne "") {
        push(@dnas, $data);
    }
    close FILE;
    return @dnas;
}
![Page 20: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/20.jpg)
print "Please input file name:\n";
my $fname = <STDIN>;
chomp $fname;    # remove the trailing newline from the typed name

my @dnas = ReadFasta($fname);
my $len = $#dnas + 1;

# Write every triple of sequences to its own file and align it with clustalw2
for (my $i = 0; $i < $len; $i++) {
    for (my $j = $i+1; $j < $len; $j++) {
        for (my $k = $j+1; $k < $len; $k++) {
            $fname = "$i\_$j\_$k";
            print "$fname\n";
            open(OUT, ">$fname") or die "Cannot open $fname\n";
            print OUT $dnas[$i];
            print OUT $dnas[$j];
            print OUT $dnas[$k];
            close OUT;
            system ("./clustalw2 $fname");
        }
    }
}
![Page 21: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/21.jpg)
Working with Single DNA Sequences
![Page 22: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/22.jpg)
Learning Objectives
• Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR
• Find out about gene-prediction methods, their potential, and their limitations
• Understand how genomes and sequences are assembled
![Page 23: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/23.jpg)
Outline
1. Cleaning your DNA of contaminants
2. Digesting your DNA in the computer
3. Finding protein-coding genes in your DNA sequence
4. Assembling a genome
![Page 24: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/24.jpg)
Cleaning DNA Sequences
• In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmid)
• Sequences of the vector can be mixed with your DNA sequence
• Before working with your DNA sequence, you should always clean it with VecScreen
![Page 25: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/25.jpg)
VecScreen
• http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
• Runs a special version of BLAST
• A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin
![Page 26: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/26.jpg)
![Page 27: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/27.jpg)
![Page 28: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/28.jpg)
![Page 29: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/29.jpg)
![Page 30: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/30.jpg)
What to do if hits are found
• If the hits are at the ends of the sequence, you can just remove them
• If they are in the middle, or the vectors are not the ones you are using, the safest thing is to throw the sequence away
![Page 31: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/31.jpg)
Computing a Restriction Map
• It is possible to cut DNA sequences using restriction enzymes
• Each type of restriction enzyme recognizes and cuts a different sequence:
    • EcoRI: GAATTC
    • BamHI: GGATCC
• There are more than 900 different restriction enzymes, each with a different specificity
• The restriction map is the list of all potential cleavage sites in a DNA molecule
• You can compile a restriction map with www.firstmarket.com/cutter
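A very small restriction map can also be computed directly in Perl. The sketch below is only an illustration: the helper find_sites is hypothetical (it is not the match_positions routine from BeginPerlBioinfo), the test sequence is made up, and it reports where each recognition site starts rather than the exact cut position within the site.

```perl
use strict;
use warnings;

# Recognition sequences for the two enzymes named above.
my %site = (
    EcoRI => 'GAATTC',
    BamHI => 'GGATCC',
);

# Return the 1-based start position of every occurrence of $pattern in $dna.
# The zero-width lookahead lets overlapping occurrences be found as well.
sub find_sites {
    my ($pattern, $dna) = @_;
    my @positions;
    while ($dna =~ /(?=$pattern)/g) {
        push @positions, pos($dna) + 1;
    }
    return @positions;
}

my $dna = "AAGAATTCCGGATCCTTGAATTC";   # illustrative sequence
for my $enzyme (sort keys %site) {
    my @hits = find_sites($site{$enzyme}, $dna);
    print "$enzyme ($site{$enzyme}): @hits\n";
}
```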
![Page 32: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/32.jpg)
Cannot get it to work!
![Page 33: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/33.jpg)
http://biotools.umassmed.edu/tacg4
![Page 34: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/34.jpg)
![Page 35: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/35.jpg)
Making PCR with a Computer
• Polymerase Chain Reaction (PCR) is a method for amplifying DNA
• PCR is used for many applications, including
    • Gene cloning
    • Forensic analysis
    • Paternity tests
• PCR amplifies the DNA between two anchors
• These anchors are called the PCR primers
![Page 36: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/36.jpg)
Designing PCR Primers
• PCR primers are typically 20 nucleotides long
• The primers must hybridize well with the DNA
• On biotools.umassmed.edu, find the best location for the primers:
    • Most stable
    • Longest extension
![Page 37: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/37.jpg)
![Page 38: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/38.jpg)
![Page 39: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/39.jpg)
![Page 40: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/40.jpg)
Analyzing DNA Composition
• DNA composition varies a lot
• Stability of a DNA sequence depends on its G+C content (total guanine and cytosine)
• High G+C makes very stable DNA molecules
• Online resources are available to measure the GC content of your DNA sequence
• Also for counting words and internal repeats
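The G+C content is easy to compute yourself. A minimal sketch, assuming the sequence contains only nucleotide letters; the function name gc_content is hypothetical.

```perl
use strict;
use warnings;

# Percentage of G and C letters in a DNA string (upper or lower case).
sub gc_content {
    my ($dna) = @_;
    my $length = length $dna;
    return 0 if $length == 0;              # avoid dividing by zero
    my $gc = ($dna =~ tr/GCgc//);          # tr with an empty replacement only counts
    return 100 * $gc / $length;
}

printf "GC content: %.1f%%\n", gc_content("ATGGCTGACT");
```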
![Page 41: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/41.jpg)
http://helixweb.nih.gov/emboss/html/
![Page 42: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/42.jpg)
Counting words
• ATGGCTGACT
• A, T, G, G, C, T, G, A, C, T
• AT, TG, GG, GC, CT, TG, GA, AC, CT
• ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
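The word lists above come from sliding a window of fixed length along the sequence, one position at a time. A sketch of that counting, with count_words as a hypothetical helper name:

```perl
use strict;
use warnings;

# Count every overlapping word of length $k in $dna.
sub count_words {
    my ($dna, $k) = @_;
    my %count;
    for my $i (0 .. length($dna) - $k) {
        $count{ substr($dna, $i, $k) }++;
    }
    return %count;
}

my %pairs = count_words("ATGGCTGACT", 2);
for my $word (sort keys %pairs) {
    print "$word: $pairs{$word}\n";
}
```

A sequence of length n contains n - k + 1 overlapping words of length k, so the example yields 9 two-letter words and 8 three-letter words, matching the lists on the slide.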
![Page 43: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/43.jpg)
www.genomatix.de/cgi-bin/tools/tools.pl
![Page 44: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/44.jpg)
EMBOSS servers
• European Molecular Biology Open Software Suite
• http://pro.genomics.purdue.edu/emboss/
![Page 45: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/45.jpg)
![Page 46: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/46.jpg)
![Page 47: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/47.jpg)
![Page 48: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/48.jpg)
ORF
• EMBOSS• NCBI
![Page 49: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/49.jpg)
![Page 50: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/50.jpg)
![Page 51: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/51.jpg)
ncbi.nlm.nih.gov/gorf/gorf.html
![Page 52: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/52.jpg)
![Page 53: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/53.jpg)
Internal repeats
• A word repeated in the sequence, long enough to not occur by chance
• Can be imperfect (regular expression)
• Dot plot is the best way to spot it
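A dot plot compares a sequence with itself and marks every pair of positions where the same word starts; repeats show up as diagonals away from the main diagonal. The following text-only sketch uses exact word matches only, so it does not handle imperfect repeats; the name dot_plot and the example sequence are illustrative.

```perl
use strict;
use warnings;

# Return the rows of a self dot plot: '*' where the words of length $word
# starting at positions i and j are identical, '.' elsewhere.
sub dot_plot {
    my ($seq, $word) = @_;
    my $n = length($seq) - $word + 1;      # number of word start positions
    my @rows;
    for my $i (0 .. $n - 1) {
        my $row = '';
        for my $j (0 .. $n - 1) {
            $row .= substr($seq, $i, $word) eq substr($seq, $j, $word) ? '*' : '.';
        }
        push @rows, $row;
    }
    return @rows;
}

print "$_\n" for dot_plot("ACGTACGT", 3);
```

In the printed grid the main diagonal is always filled; the extra stars four columns away from it reveal that ACGT occurs twice.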
![Page 54: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/54.jpg)
arbl.cvmbs.colostate.edu/molkit
![Page 55: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/55.jpg)
![Page 56: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/56.jpg)
Predicting Genes
• The most important analysis carried out on DNA sequences is gene prediction
• Gene prediction requires different methods for eukaryotes and prokaryotes
• Most gene-prediction methods use hidden Markov Models
![Page 57: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/57.jpg)
Predicting Genes in Prokaryotic Genome
• In prokaryotes, protein-coding genes are uninterrupted
    • No introns
• Predicting protein-coding genes in prokaryotes is considered a solved problem
    • You can expect 99% accuracy
![Page 58: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/58.jpg)
Finding Prokaryotic Genes with GeneMark
• GeneMark is the state of the art for microbial genomes
• GeneMark can
    • Find short proteins
    • Resolve overlapping genes
    • Identify the best start codon
• Use exon.gatech.edu/GeneMark
• Click the "heuristic models"
![Page 59: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/59.jpg)
Predicting Eukaryotic Genes
• Eukaryotic genes (human, for example) are very hard to predict
• Precise and accurate eukaryotic gene prediction is still an open problem
    • ENSEMBL contains 21,662 genes for the human genome
    • There may well be more genes than that in the genome, as yet unpredicted
• You can expect 70% accuracy on the human genome with automatic methods
• Experimental information is still needed to predict eukaryotic genes
![Page 60: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/60.jpg)
Finding Eukaryotic Genes with GenomeScan
• GenomeScan is the state of the art for eukaryotic genes
• GenomeScan works best with
    • Long exons
    • Genes with a low GC content
• It can incorporate experimental information
• Use genes.mit.edu/genomescan
![Page 61: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/61.jpg)
Producing Genomic Data
• Until recently, sequencing an entire genome was very expensive and difficult
• Only major institutes could do it
• Today, scientists estimate that in 10 years, it will cost about $1000 to sequence a human genome
• With sequencing so cheap, assembling your own genomes is becoming an option
• How could you do it?
![Page 62: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/62.jpg)
Sequencing and Assembling a Genome (I)
• To sequence a genome, the first task is to cut it into many small, overlapping pieces
• Then clone each piece
![Page 63: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/63.jpg)
Sequencing and Assembling a Genome (II)
• Each piece must be sequenced
• Sequencing machines cannot do an entire sequence at once
    • They can only produce short sequences smaller than 1 Kb
    • These pieces are called reads
• It is necessary to assemble the reads into contigs
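The core idea behind assembling reads into contigs is to find reads whose ends overlap and join them. The toy sketch below merges two error-free reads by their longest exact suffix/prefix overlap; real assemblers must also cope with sequencing errors, both strands, and repeats. The helper names and the example reads are assumptions made for illustration.

```perl
use strict;
use warnings;

# Length of the longest suffix of $left that equals a prefix of $right,
# or 0 if no overlap of at least $min letters exists ($min must be >= 1).
sub overlap_length {
    my ($left, $right, $min) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    for (my $len = $max; $len >= $min; $len--) {
        return $len if substr($left, -$len) eq substr($right, 0, $len);
    }
    return 0;
}

# Join two reads into one contig, or return nothing if they do not overlap.
sub merge_reads {
    my ($left, $right, $min) = @_;
    my $len = overlap_length($left, $right, $min);
    return if $len == 0;
    return $left . substr($right, $len);
}

my $contig = merge_reads("ATGGCTGA", "CTGACTTA", 3);
print "contig: $contig\n" if defined $contig;
```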
![Page 64: Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 北京林业大学计算生物学中心 bjfuccb](https://reader031.vdocuments.site/reader031/viewer/2022031614/56812cad550346895d915d5a/html5/thumbnails/64.jpg)
Sequencing and Assembling a Genome (III)
• The most popular program for assembling reads is PHRAP
    • Available at www.phrap.org
• Other programs exist for joining smaller datasets
    • For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php