Sequence Analysis with Artemis
and
Artemis Comparison Tool (ACT)
Caribbean Bioinformatics Workshop, 18th-29th January, 2010
Workshop Overview
Overview of Pathogen Genomics, WTSI
Introduction to Artemis
Demonstration of Artemis
Hands-on guided exercise in Artemis
New features in Artemis
Viewing second generation sequencing data in Artemis
Introduction to Artemis Comparison Tool (ACT)
Demonstration of ACT
Hands-on guided exercise in ACT
Viewing new technology sequencing data in Artemis
The Wellcome Trust Sanger Institute
• Funded by The Wellcome Trust, a registered charity.
• Established in 1993 to begin the Human Genome Project.
• First draft of the human genome sequence in 2000; completed in 2004.
Data release policy:
All sequence data is released immediately and is freely available via the internet in order to maximise its benefit for research.
http://www.sanger.ac.uk
ftp://ftp.sanger.ac.uk/
The Genomic Revolution
1977 Sanger and co-workers sequence bacteriophage phiX174 (5386 bp)
1 million years to complete human genome (~3,000 Mbp)
Late 1980s Sanger’s techniques refined
1,000s of years to complete human genome
1990s Race to sequence human genome
10 years to complete human genome
2009 Novel sequencing technologies
$1000 genome?
The Human Genome Project
Human Genome Sequence Contributors
CSHL
TIGR
UTSW
UOKNOR
SDSTDC
SHGC
UWMSC
GTC
Sanger Institute
WUGSC
WIBR
UWGC
JGI
BCM
Keio
RIKEN
Genoscope
Beijing
GBF
MPIMG
IMB
United States
United Kingdom
Japan
France
Germany
China
WHO morbidity and mortality estimates (‘02)
World Health Report, 2004
Population (000): 6 224 985

Cause                                                    Mortality  Morbidity (DALYs*)
                                                     (000) % total       (000) % total
TOTAL                                               57 029     100   1 490 126     100
I. Communicable diseases, maternal and perinatal
   conditions and nutritional deficiencies          18 324    32.1     610 319    41.0
   Infectious and parasitic diseases                10 904    19.1     350 333    23.5
   Respiratory infections                            3 963     6.9      94 603     6.3
   Maternal conditions                                 510     0.9      33 632     2.3
   Perinatal conditions                              2 462     4.3      97 335     6.5
   Nutritional deficiencies                            485     0.9      34 417     2.3
II. Noncommunicable conditions                      33 537    58.8     697 815    46.8
   Malignant neoplasms                               7 121    12.5      75 545     5.1
   Other neoplasms                                     149     0.3       1 749     0.1
   Diabetes mellitus                                   988     1.7      16 194     1.1
   Nutritional/endocrine disorders                     243     0.4       7 961     0.5
   Neuropsychiatric disorders                        1 112     1.9     193 278    13.0
   Sense organ disorders                                 3     0.0      69 381     4.7
   Cardiovascular diseases                          16 733    29.3     148 190     9.9
   Respiratory diseases                              3 702     6.5      55 153     3.7
   Digestive diseases                                1 968     3.5      46 476     3.1
   Diseases of the genitourinary system                848     1.5      15 217     1.0
   Skin diseases                                        69     0.1       3 748     0.3
   Musculoskeletal diseases                            106     0.2      30 169     2.0
   Congenital abnormalities                            493     0.9      27 381     1.8
   Oral diseases                                         2     0.0       7 372     0.5
III. Injuries                                        5 168     9.1     181 991    12.2
   Unintentional                                     3 551     6.2     133 112     8.9
   Intentional                                       1 618     2.8      48 879     3.3
* Disability adjusted life years
Pathogen Sequencing at the Sanger
Mycobacterium tuberculosis
Neisseria meningitidis
Salmonella typhi
Candida albicans
Aspergillus fumigatus
Flu
Dengue
Enteric phage
E. coli Inc plasmids
Tsetse fly
Sandfly
Schistosoma mansoni
Plasmodium falciparum
Leishmania major
Trypanosoma brucei
Pathogen Genomics
What do we do?
Genome sequencing of prokaryotic and eukaryotic pathogens, which typically requires:
• Bioinformatics tools/software development
• Integration of genome analyses and annotation, and in silico analyses
• Comparative genomics/functional genomics
• Web accessible databases
Sequencing strategy and assembly
Figure labels: DNA; pUC clones; end sequences; contiguous sequence; large clone end sequence; physical gap; sequence gap.
‘Draft sequence’: 95% coverage, 4-5x depth. Order of contigs?
Finished sequence: 100% coverage, 10x depth.
Shotgun sequencing – strategy
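The depth and coverage figures quoted for draft and finished sequence can be related with a short calculation. The sketch below uses the idealised Lander-Waterman model, which is not taken from the slides: it assumes reads land uniformly at random, so the expected fraction of bases covered at depth c is 1 - e^(-c). The function names, the 500 bp read length and the 4.6 Mbp genome size are illustrative assumptions only.

```python
import math


def sequencing_depth(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average depth c = (number of reads * read length) / genome length."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(depth: float) -> float:
    """Idealised Lander-Waterman estimate: fraction of bases hit at least once."""
    return 1.0 - math.exp(-depth)


def reads_needed(target_depth: float, read_len: int, genome_len: int) -> int:
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_len / read_len)


if __name__ == "__main__":
    # Hypothetical values: 500 bp reads, 4.6 Mbp bacterial genome.
    read_len, genome_len = 500, 4_600_000
    for depth in (4, 5, 10):
        n = reads_needed(depth, read_len, genome_len)
        frac = expected_fraction_covered(depth)
        print(f"{depth}x depth: {n} reads, {frac:.4%} of bases expected covered")
```

Under this model 4-5x depth already predicts more than 98% of bases covered, which is higher than the 95% quoted for a draft. Real clone libraries are not uniformly random, so the model is best read as an upper bound rather than a prediction.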
Shotgun assembly - Yersinia pestis
Annotation Strategy
Generating the complete genome sequence
Primary DNA sequence
Sequence analyses: Dotter, BlastN, BlastX, gene finders, tRNA scan
Predicted features: repeats, pseudogenes, rRNA, CDSs, tRNA
Protein analyses: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM
Preannotation
Manual curation
Annotated sequence
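To make the "Gene finders" step more concrete, here is a minimal sketch of a naive open reading frame (ORF) scan over all six reading frames. It is not the method used by any gene finder named in the workshop; real gene finders rely on statistical models of coding sequence. The function names and the `min_codons` threshold are hypothetical, and only ATG is treated as a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (start, end) 0-based half-open spans of ATG...stop ORFs.

    The length threshold counts the start and stop codons. Within a frame,
    the first ATG after the previous stop is used as the start.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq: str, min_codons: int = 3):
    """Scan both strands; coordinates are reported on the forward strand."""
    seq = seq.upper()
    n = len(seq)
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    # Map reverse-strand spans back to forward-strand coordinates.
    hits += [("-", n - e, n - s) for s, e in orfs_in_strand(rc, min_codons)]
    return sorted(hits, key=lambda h: (h[1], h[2]))


if __name__ == "__main__":
    for strand, start, end in find_orfs("CCATGAAATTTTAGGG"):
        print(strand, start, end)
```

In an annotation workflow, candidate CDSs of this kind would then be checked against database searches and protein-domain predictions before manual curation.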