Bioinformatics Sequences alignment

Sequence Alignment Practical

Presented by

Kirill Bessonov

Nov 4, 2014

Talk Structure

• Introduction to sequence alignments
• Methods / Logistics
  – Global alignment: Needleman-Wunsch
  – Local alignment: Smith-Waterman
  – Step-by-step illustration
• Computational implementation of alignment
  – Retrieval of sequences using R
  – Alignment of sequences using R

Sequence Alignments

Comparing two objects is intuitive. Sequence comparison is used for:
– Identification of
  • functional motifs / regions
  • protein function
– Genetic manipulation (e.g. alternative splicing)
– Identification of binding sites of primers / transcription factors (TFs)
– De novo genome assembly
  • alignment of the short “reads” from high-throughput sequencers (e.g. Illumina or Roche platforms)

Comparing two sequences

• There are two ways of pairwise comparison
  – Global, using the Needleman-Wunsch algorithm (NW)
  – Local, using the Smith-Waterman algorithm (SW)
• Global alignment (NW)
  – aligns the “whole” sequence
• Local alignment (SW)
  – tries to align portions (e.g. motifs)
  – more flexible: considers “parts” of sequences
  – works well on highly divergent sequences

[Figure: global vs. local alignment. Labels: entire sequence, perfect match, unaligned sequence, aligned portion]

Global alignment (NW)

• Sequences are aligned end-to-end along their entire length
• Many possible alignments are produced
  – The alignment with the highest score is chosen
• The naïve algorithm is very inefficient (exponential time)
  – To align sequences of length 15, it needs to consider
    • 3 choices per position (match/mismatch, insertion, deletion): 3^15 ≈ 1.4 × 10^7 alignments
  – Impractical for sequences of length > 20 nt
• Used to analyze homology / similarity of
  – genes and proteins
  – between species
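
The count quoted for the naïve search is quick to check. A minimal R sketch, assuming three choices at each aligned position as on the slide; the variable names are illustrative only:

```r
# Number of candidate alignments a naive search would enumerate,
# assuming 3 choices at each of n positions.
n <- 15
naive_count <- 3^n
naive_count          # 14348907, i.e. about 1.4 * 10^7

# The same count at n = 20, the length the slide calls impractical.
naive_count_20 <- 3^20
naive_count_20
```

Going from 15 to 20 positions multiplies the work by 3^5 = 243, which is why dynamic programming is used instead.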

Methodology of global alignment (1 of 4)

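• Define a scoring scheme for each event
  – mismatch between a_i and b_j: $s(a_i, b_j) = -1$ if $a_i \neq b_j$
  – gap (insertion or deletion): $s(a_i, -) = s(-, b_j) = -2$
  – match between a_i and b_j: $s(a_i, b_j) = +2$ if $a_i = b_j$
• Place no restriction on the minimal score (negative scores are allowed)
• Start completing the M x N alignment matrix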

Methodology of global alignment (2 of 4)

• The matrix should have an extra column and an extra row
  – M+1 columns, where M is the length of sequence M
  – N+1 rows, where N is the length of sequence N

1. Initialize the matrix
  – introduce the gap penalty at every initial position along the first row and the first column
  – scores at each cell are cumulative: each step adds -2

          W    H    A    T
     0   -2   -4   -6   -8
W   -2
H   -4
Y   -6
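
The initialization step can be sketched in base R as follows. This is only an illustration of the slide's WHAT / WHY example; the names `nw`, `seq_cols` and `seq_rows` are assumptions, not part of the original material:

```r
# Sequences from the slide: columns follow "WHAT", rows follow "WHY".
seq_cols <- strsplit("WHAT", "")[[1]]
seq_rows <- strsplit("WHY", "")[[1]]
gap <- -2                                   # gap penalty used on the slides

# (N + 1) x (M + 1) matrix; the extra row and column mean "no letters consumed yet".
nw <- matrix(NA_real_,
             nrow = length(seq_rows) + 1,
             ncol = length(seq_cols) + 1,
             dimnames = list(c("", seq_rows), c("", seq_cols)))

# Cumulative gap penalties along the first row and the first column.
nw[1, ] <- gap * (0:length(seq_cols))
nw[, 1] <- gap * (0:length(seq_rows))
nw
```

The remaining cells stay `NA` until the fill step assigns them.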

Methodology of global alignment (3 of 4)

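2. Alignment possibilities for each cell
  – Gap (horizontal / vertical move)
  – Match (diagonal move, e.g. W-W)
  – Mismatch (diagonal move, e.g. W-H)

   Example, cell (row W, column W): a gap gives -2 + (-2) = -4; a match gives 0 + 2 = +2
   Example, cell (row W, column H): a mismatch gives -2 + (-1) = -3 -OR- a gap gives +2 + (-2) = 0

3. Select the maximum score
  – Best alignment

          W    H    A    T
     0   -2   -4   -6   -8
W   -2    2    0   -2   -4
H   -4    0    4    2    0
Y   -6   -2    2    3    1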
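
Steps 2 and 3 can be written as a single recurrence. The notation here is an assumption introduced for illustration and does not come from the slides: $F(i,j)$ is the best score for aligning the first $i$ letters of the row sequence $a$ with the first $j$ letters of the column sequence $b$, and $d = -2$ is the gap penalty.

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{diagonal: match } (+2) \text{ or mismatch } (-1) \\
F(i-1,\,j) + d             & \text{vertical: gap} \\
F(i,\,j-1) + d             & \text{horizontal: gap}
\end{cases}
\qquad F(i,0) = i\,d, \quad F(0,j) = j\,d
```

For the first inner cell (W against W) this gives $\max(0 + 2,\ -2 - 2,\ -2 - 2) = 2$, the value shown in the matrix.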

Methodology of global alignment (4 of 4)

4. Start from the bottom-right cell
5. Consider the different path(s) leading back to the top-left cell
  – How was each cell value generated? From which neighbouring cell?
6. Select the best alignment(s)

   WHAT                    WHAT
   WHY-                    WH-Y
   Overall score = 1       Overall score = 1

          W    H    A    T
     0   -2   -4   -6   -8
W   -2    2    0   -2   -4
H   -4    0    4    2    0
Y   -6   -2    2    3    1
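
The whole procedure (initialize, fill, trace back) fits in a short base R function. This is a sketch under the slide's scoring rules (match +2, mismatch -1, gap -2), not the pseudo-code referred to in the homework; the function name and the tie-breaking order (diagonal first, then horizontal, then vertical) are assumptions, so it reports only one of the equally scoring alignments.

```r
needleman_wunsch <- function(x, y, match = 2, mismatch = -1, gap = -2) {
  a <- strsplit(x, "")[[1]]            # sequence along the columns
  b <- strsplit(y, "")[[1]]            # sequence along the rows
  n <- length(b); m <- length(a)

  # Initialize: cumulative gap penalties in the first row and column.
  score <- matrix(0, n + 1, m + 1)
  score[1, ] <- gap * (0:m)
  score[, 1] <- gap * (0:n)

  # Fill: each cell keeps the best of diagonal, vertical and horizontal moves.
  for (i in 1:n) {
    for (j in 1:m) {
      s <- if (b[i] == a[j]) match else mismatch
      score[i + 1, j + 1] <- max(score[i, j] + s,
                                 score[i, j + 1] + gap,
                                 score[i + 1, j] + gap)
    }
  }

  # Trace back from the bottom-right cell to the top-left cell.
  i <- n; j <- m
  top <- character(0); bottom <- character(0)
  while (i > 0 || j > 0) {
    s <- if (i > 0 && j > 0 && b[i] == a[j]) match else mismatch
    if (i > 0 && j > 0 && score[i + 1, j + 1] == score[i, j] + s) {
      top <- c(a[j], top); bottom <- c(b[i], bottom)   # diagonal move
      i <- i - 1; j <- j - 1
    } else if (j > 0 && score[i + 1, j + 1] == score[i + 1, j] + gap) {
      top <- c(a[j], top); bottom <- c("-", bottom)    # horizontal move: gap in row sequence
      j <- j - 1
    } else {
      top <- c("-", top); bottom <- c(b[i], bottom)    # vertical move: gap in column sequence
      i <- i - 1
    }
  }

  list(score = score[n + 1, m + 1],
       matrix = score,
       alignment = c(paste(top, collapse = ""), paste(bottom, collapse = "")))
}

res <- needleman_wunsch("WHAT", "WHY")
res$score       # 1
res$alignment   # "WHAT" "WH-Y"
```

Changing the order of the branches in the traceback would return the other alignment from the slide, WHAT / WHY-, with the same score.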

Local alignment (SW)

• Sequences are aligned to find the regions where the best alignment occurs (i.e. highest score)
• Assumes a local context (aligning parts of sequences)
• Ideal for finding short motifs and DNA binding sites
  – helix-loop-helix (bHLH): motif
  – TATAAT box (a famous promoter region): DNA binding site
• Works well on highly divergent sequences

Methodology of local alignment (1 of 4)

• The scoring system is similar, with one exception
  – The minimum possible score in the matrix is zero
  – There are no negative scores in the matrix
• Let’s define the scoring system as in the global case
  – mismatch between a_i and b_j: $s(a_i, b_j) = -1$ if $a_i \neq b_j$
  – gap (insertion or deletion): $s(a_i, -) = s(-, b_j) = -2$
  – match between a_i and b_j: $s(a_i, b_j) = +2$ if $a_i = b_j$

Methodology of local alignment (2 of 4)

• Construct the M x N alignment matrix with M+1 columns and N+1 rows
• Initialize the 1st row and 1st column; because scores cannot fall below zero, every initial cell is 0 rather than a cumulative gap penalty

         W   H   A   T
     0   0   0   0   0
W    0
H    0
Y    0

s(a,b) ≥ 0 (min value is zero)

Methodology of local alignment (3 of 4)

• For each subsequent cell consider the alignments
  – Vertical: s(I, -)
  – Horizontal: s(-, J)
  – Diagonal: s(I, J)
• For each cell select the highest score
  – If the score is negative, assign zero

         W   H   A   T
     0   0   0   0   0
W    0   2   0   0   0
H    0   0   4   2   0
Y    0   0   2   3   1

Methodology of local alignment (4 of 4)

• Select the initial cell(s) with the highest score
• Consider the different path(s) leading to a score of zero
  – Trace back through the cell values
  – Look at how the values originated (i.e. the path)
• Mathematically
  $S(I, J) = \max\{\, 0,\ S(I-1, J-1) + s(I, J),\ S(I-1, J) + s(I, -),\ S(I, J-1) + s(-, J) \,\}$
  – where S(I, J) is the score for sub-sequences I and J

   WH
   WH
   total score of 4

         W   H   A   T
     0   0   0   0   0
W    0   2   0   0   0
H    0   0   4   2   0
Y    0   0   2   3   1

[Figure labels: sequences A and B on the matrix axes, indexed by I and J]
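
The local fill rule differs from the global one only by the floor at zero. A minimal base R sketch for the WHAT / WHY example, assuming the same scores as before (match +2, mismatch -1, gap -2); the function name is hypothetical:

```r
smith_waterman_matrix <- function(x, y, match = 2, mismatch = -1, gap = -2) {
  a <- strsplit(x, "")[[1]]            # sequence along the columns
  b <- strsplit(y, "")[[1]]            # sequence along the rows
  # First row and first column stay at zero.
  h <- matrix(0, length(b) + 1, length(a) + 1,
              dimnames = list(c("", b), c("", a)))
  for (i in seq_along(b)) {
    for (j in seq_along(a)) {
      s <- if (b[i] == a[j]) match else mismatch
      # Same three moves as in the global case, but never below zero.
      h[i + 1, j + 1] <- max(0,
                             h[i, j] + s,
                             h[i, j + 1] + gap,
                             h[i + 1, j] + gap)
    }
  }
  h
}

h <- smith_waterman_matrix("WHAT", "WHY")
best <- max(h)                                   # highest score in the matrix
best_cell <- which(h == best, arr.ind = TRUE)    # where the traceback starts
h
```

The traceback then starts at `best_cell` (here the H/H cell) and stops at the first zero, which yields WH / WH.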

Local alignment illustration (1 of 3)

• Determine the best local alignment and the maximum alignment score for
  – Sequence A: ACCTAAGG
  – Sequence B: GGCTCAATCA
• Scoring conditions:
  – $s(a_i, b_j) = +2$ if $a_i = b_j$,
  – $s(a_i, b_j) = -1$ if $a_i \neq b_j$ and $s(a_i, -) = s(-, b_j) = -2$

Local alignment illustration (2 of 3)

The matrix is initialized with zeros in the first row and first column and then filled row by row:

         G   G   C   T   C   A   A   T   C   A
     0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   0   0   0   2   2   0   0   2
C    0   0   0   2   0   2   0   1   1   2   0
C    0   0   0   2   1   2   1   0   0   3   1
T    0   0   0   0   4   2   1   0   2   1   2
A    0   0   0   0   2   3   4   3   1   1   3
A    0   0   0   0   0   1   5   6   4   2   3
G    0   2   2   0   0   0   3   4   5   3   1
G    0   2   4   2   0   0   1   2   3   4   2

Local alignment illustration (3 of 3)

Best local alignment (traced back from the highest-scoring cell, 6, to the first zero):

   CTCAA
   CT-AA

Best score: 6

The same alignment in the whole-sequence (global) context:

   GGCTCAATCA
   ACCT-AAGG
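
The illustration can be reproduced with a short base R sketch that fills the matrix and traces back from the highest cell. It assumes the scores match +2, mismatch -1, gap -2; the function name and the tie-breaking order in the traceback are assumptions made for this example.

```r
smith_waterman <- function(x, y, match = 2, mismatch = -1, gap = -2) {
  a <- strsplit(x, "")[[1]]            # sequence along the columns
  b <- strsplit(y, "")[[1]]            # sequence along the rows
  h <- matrix(0, length(b) + 1, length(a) + 1)
  for (i in seq_along(b)) {
    for (j in seq_along(a)) {
      s <- if (b[i] == a[j]) match else mismatch
      h[i + 1, j + 1] <- max(0, h[i, j] + s, h[i, j + 1] + gap, h[i + 1, j] + gap)
    }
  }

  # Start the traceback at the (first) highest-scoring cell.
  start <- which(h == max(h), arr.ind = TRUE)[1, ]
  i <- start[["row"]] - 1
  j <- start[["col"]] - 1
  top <- character(0); bottom <- character(0)

  # Walk back until a cell with score zero is reached.
  while (i > 0 && j > 0 && h[i + 1, j + 1] > 0) {
    s <- if (b[i] == a[j]) match else mismatch
    if (h[i + 1, j + 1] == h[i, j] + s) {
      top <- c(a[j], top); bottom <- c(b[i], bottom)   # diagonal move
      i <- i - 1; j <- j - 1
    } else if (h[i + 1, j + 1] == h[i + 1, j] + gap) {
      top <- c(a[j], top); bottom <- c("-", bottom)    # horizontal move
      j <- j - 1
    } else {
      top <- c("-", top); bottom <- c(b[i], bottom)    # vertical move
      i <- i - 1
    }
  }

  list(score = max(h),
       alignment = c(paste(top, collapse = ""), paste(bottom, collapse = "")))
}

res <- smith_waterman("GGCTCAATCA", "ACCTAAGG")
res$score       # 6
res$alignment   # "CTCAA" "CT-AA"
```

Sequence B is passed first so that it runs along the columns, matching the layout of the matrix on the slide.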

Aligning proteins globally and locally

Protein Alignment

• Protein local and global alignment
  – follows the same rules as we saw with DNA/RNA
• Differences (∆)
  – the protein alphabet has 20 standard residues (aa); 22 if selenocysteine and pyrrolysine are counted
  – scoring / substitution matrices are used (e.g. BLOSUM)
    • protein properties are taken into account
    • residues that differ strongly in their properties, such as the charged, polar lysine and the apolar glycine, are given a low score

Substitution matrices

• Protein sequences are more complex
  – matrices = collection of scoring rules
• Matrices cover events such as
  – mismatch and perfect match
• The gap penalty needs to be defined separately
• A popular choice is the BLOcks SUbstitution Matrix (BLOSUM)

BLOSUM-x matrices

• Constructed from aligned sequences with a specific x% similarity
  – a matrix built using sequences with no more than 50% similarity is called BLOSUM-50
• For highly mutating / dissimilar sequences use
  – BLOSUM-45 and lower
• For highly conserved / similar sequences use
  – BLOSUM-62 and higher

BLOSUM 62

• What does the diagonal represent?
  – A perfect match between a.a.
• What is the score for the substitution E→D (both acidic a.a.)?
  – Score = 2
• What about a more drastic substitution, K→I (basic to non-polar)?
  – Score = -3

Practical problem:
Align the following sequences both globally and locally using the BLOSUM 62 matrix with a gap penalty of -8

Sequence A: AAEEKKLAAA
Sequence B: AARRIA

Aligning globally using BLOSUM 62

   AAEEKKLAAA
   AA--RRIA--

Score: -14
Other alignment options? Yes

           A    A    E    E    K    K    L    A    A    A
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
A    -8    4   -4  -12  -20  -28  -36  -44  -52  -60  -68
A   -16   -4    8    0   -8  -16  -24  -32  -40  -48  -56
R   -24  -12    0    8    0   -6  -14  -22  -30  -38  -46
R   -32  -20   -8    0    8    2   -4  -12  -20  -28  -36
I   -40  -28  -16   -8    0    5   -1   -2  -10  -18  -26
A   -48  -36  -24  -16   -8   -1    4   -2    2   -6  -14

Aligning locally using BLOSUM 62

   KKLA
   RRIA

Score: 10

         A   A   E   E   K   K   L   A   A   A
     0   0   0   0   0   0   0   0   0   0   0
A    0   4   4   0   0   0   0   0   4   4   4
A    0   4   8   3   0   0   0   0   4   8   8
R    0   0   3   8   3   2   2   0   0   3   7
R    0   0   0   3   8   5   4   0   0   0   2
I    0   0   0   0   0   5   2   6   0   0   0
A    0   4   4   0   0   0   4   1  10   4   4
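
Both protein scores can be checked with a small base R sketch. To stay self-contained it hard-codes only the BLOSUM62 entries for the six residues that occur in the two sequences, rather than loading the full matrix from a package; the names `blosum62_subset` and `align_score` are hypothetical, and the gap penalty is the linear -8 from the practical problem.

```r
# BLOSUM62 scores restricted to the residues A, R, E, I, L, K.
aa <- c("A", "R", "E", "I", "L", "K")
blosum62_subset <- matrix(c(
   4, -1, -1, -1, -1, -1,
  -1,  5,  0, -3, -2,  2,
  -1,  0,  5, -3, -3,  1,
  -1, -3, -3,  4,  2, -3,
  -1, -2, -3,  2,  4, -2,
  -1,  2,  1, -3, -2,  5),
  nrow = 6, byrow = TRUE, dimnames = list(aa, aa))

# Score of the best global (default) or local alignment with a linear gap penalty.
align_score <- function(x, y, subst, gap = -8, local = FALSE) {
  a <- strsplit(x, "")[[1]]            # sequence along the columns
  b <- strsplit(y, "")[[1]]            # sequence along the rows
  floor_at <- if (local) 0 else -Inf   # local scores never drop below zero
  h <- matrix(0, length(b) + 1, length(a) + 1)
  if (!local) {
    h[1, ] <- gap * (0:length(a))
    h[, 1] <- gap * (0:length(b))
  }
  for (i in seq_along(b)) {
    for (j in seq_along(a)) {
      h[i + 1, j + 1] <- max(floor_at,
                             h[i, j] + subst[b[i], a[j]],
                             h[i, j + 1] + gap,
                             h[i + 1, j] + gap)
    }
  }
  if (local) max(h) else h[length(b) + 1, length(a) + 1]
}

global_score <- align_score("AAEEKKLAAA", "AARRIA", blosum62_subset)
local_score  <- align_score("AAEEKKLAAA", "AARRIA", blosum62_subset, local = TRUE)
c(global = global_score, local = local_score)   # -14 and 10
```

The only difference between the two modes is the initialization of the borders and the floor at zero, which is what turns the score of -14 into 10.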

Using R for:
• Sequence Retrieval and Analysis

Protein database

• The UniProt database (http://www.uniprot.org/) holds high-quality protein data
• It is manually curated
• Each protein is assigned a UniProt ID

Retrieving data from UniProt

• In the search field one can enter either a UniProt ID or a common protein name
  – example: myelin basic protein
• We will retrieve data for P02686 (the UniProt ID)

Understanding fields

• Information is divided into categories

• Click on ‘Sequences’ category and then FASTA

FASTA format

• The FASTA format is widely used and has the following properties
  – The sequence name starts with the > sign
  – The first line corresponds to the protein name
  – The actual protein sequence starts from the 2nd line
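
These rules are enough to read a single FASTA record by hand. A minimal base R sketch; the record below is a made-up toy example, not the real P02686 entry, and the variable names are illustrative:

```r
# A hypothetical FASTA record held as lines of text.
fasta_lines <- c(">toy_protein hypothetical example record",
                 "MASQKRPS",
                 "QRHGSKYL")

# Line 1: the name, after dropping the leading ">".
is_header <- startsWith(fasta_lines, ">")
seq_name <- sub("^>", "", fasta_lines[is_header][1])

# Lines 2 and onwards: the sequence, possibly wrapped over several lines.
seq_string <- paste(fasta_lines[!is_header], collapse = "")
seq_letters <- strsplit(seq_string, "")[[1]]

seq_name
seq_string
length(seq_letters)
```

For real files, seqinR provides `read.fasta()`, which handles multiple records.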

Retrieving protein data with R and SeqinR

• One can “talk” programmatically to the UniProt database using R and the seqinR library
  – the seqinR library is suitable for “Biological Sequences Retrieval and Analysis”
  – a detailed manual is available (see Resources)
  – Install this library in your R environment
      install.packages("seqinr")
      library("seqinr")
  – Choose the database to retrieve data from
      choosebank("swissprot")
  – Download the data object for the target protein (P02686)
      # recent seqinR versions return the result instead of creating MBP_HUMAN in the workspace
      MBP_HUMAN <- query("MBP_HUMAN", "AC=P02686")
  – See the sequence of the object MBP_HUMAN
      MBP_HUMAN_seq <- getSequence(MBP_HUMAN)
      MBP_HUMAN_seq

Dot Plot (comparison of 2 sequences) (1 of 2)

• A 2D way to find regions of similarity between two sequences
  – Each sequence is plotted on either the vertical or the horizontal dimension
  – If two a.a. from the two sequences at given positions are identical, a dot is plotted
  – Matching sequence segments appear as diagonal lines (which may run parallel to the main diagonal if an insertion or gap is present)

Dot Plot (comparison of 2 sequences) (2 of 2)

• Let’s compare two protein sequences
  – Human MBP (UniProt ID: P02686)
  – Mouse MBP (UniProt ID: P04370)
• Download the 2nd (mouse) sequence
    MBP_MOUSE <- query("MBP_MOUSE", "AC=P04370")
    MBP_MOUSE_seq <- getSequence(MBP_MOUSE)
• Visualize the dot plot
    dotPlot(MBP_HUMAN_seq[[1]], MBP_MOUSE_seq[[1]], xlab = "MBP - Human", ylab = "MBP - Mouse")
  – Is there similarity between the human and mouse forms of the MBP protein?
  – Where do the two sequences differ?

[Figure: dot plot of human vs. mouse MBP. Annotations: shift in the diagonal line (identical regions) = INSERTION in MBP-Human or GAP in MBP-Mouse; breaks in the diagonal line = regions of dissimilarity]
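
What a dot plot computes can be shown without any package: it is a logical matrix of position-by-position identity. A minimal base R sketch with two made-up toy sequences (not the MBP sequences); `dot_matrix` is a hypothetical helper, while seqinR's `dotPlot()` remains the tool used in the practical.

```r
# TRUE wherever the residue at position i of x equals the residue at position j of y.
dot_matrix <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  outer(a, b, "==")
}

# Hypothetical toy sequences: seq2 is seq1 with the Q at position 4 removed.
seq1 <- "MASQKRPS"
seq2 <- "MASKRPS"
dots <- dot_matrix(seq1, seq2)

# Coordinates of all dots, ready for plotting.
hits <- which(dots, arr.ind = TRUE)
if (interactive()) {
  plot(hits[, "row"], hits[, "col"], pch = 16,
       xlab = "seq1 position", ylab = "seq2 position")
}
```

The first three dots lie on the main diagonal; after the missing Q the matches continue on a diagonal shifted by one position, which is the pattern an insertion or gap produces in the MBP plot.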

Using R and Biostrings library for:
• Pairwise global and local alignments

Installing Biostrings library

• Install the library from Bioconductor
    install.packages("BiocManager")       # current installer; replaces the older biocLite.R script
    BiocManager::install("Biostrings")
    library(Biostrings)
• Define the substitution matrix (e.g. for DNA)
    DNA_subst_matrix <- nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE)
    DNA_subst_matrix
• The scoring rules
  – Match: $s(a_i, b_j) = 2$ if $a_i = b_j$
  – Mismatch: $s(a_i, b_j) = -1$ if $a_i \neq b_j$
  – Gap: $s(a_i, -) = -2$ or $s(-, b_j) = -2$

Global alignment using R and Biostrings

• Create two string vectors (i.e. sequences)
    seqA <- "GATTA"
    seqB <- "GTTA"
• Use pairwiseAlignment() and the defined rules
    globalAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix,
                                       gapOpening = -2, scoreOnly = FALSE, type = "global")
• Visualize the best path (i.e. alignment)
    globalAlignAB

    Global PairwiseAlignedFixedSubject (1 of 1)
    pattern: [1] GATTA
    subject: [1] G-TTA
    score: 2
• gapExtension is not set here, so its default applies on top of gapOpening; with a default extension penalty of 4 the single gap costs 2 + 4 = 6, and the score is 4 × 2 - 6 = 2
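
The reported score of 2 can be reproduced by scoring the printed alignment by hand. The sketch below is base R and does not call Biostrings; it assumes an affine gap model in which each run of gaps costs the opening penalty once plus the extension penalty per gap position, with an extension penalty of 4 taken as the package default. The function name `score_gapped` is hypothetical.

```r
# Score an already-gapped pairwise alignment under an affine gap model.
score_gapped <- function(x, y, match = 2, mismatch = -1, gap_open = 2, gap_ext = 4) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  stopifnot(length(a) == length(b))
  total <- 0
  prev_gap <- ""                       # which sequence carried a gap in the previous column
  for (k in seq_along(a)) {
    if (a[k] == "-" || b[k] == "-") {
      cur <- if (a[k] == "-") "x" else "y"
      if (cur != prev_gap) total <- total - gap_open   # a new gap run starts
      total <- total - gap_ext                         # every gap position pays the extension
      prev_gap <- cur
    } else {
      s <- if (a[k] == b[k]) match else mismatch
      total <- total + s
      prev_gap <- ""
    }
  }
  total
}

score_default <- score_gapped("GATTA", "G-TTA")                            # 8 - (2 + 4) = 2
score_linear  <- score_gapped("GATTA", "G-TTA", gap_open = 0, gap_ext = 2)  # 8 - 2 = 6
c(default = score_default, linear = score_linear)
```

Under this model, a flat gap penalty of -2 as in the hand-worked slides corresponds to an opening penalty of 0 and an extension penalty of 2.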

Local alignment using R and Biostrings

• Input two sequences
    seqA <- "AGGATTTTAAAA"
    seqB <- "TTTT"
• The scoring rules are the same as for the global alignment
    localAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix,
                                      gapOpening = -2, scoreOnly = FALSE, type = "local")
• Visualize the alignment
    localAlignAB

    Local PairwiseAlignedFixedSubject (1 of 1)
    pattern: [5] TTTT
    subject: [1] TTTT
    score: 8

Aligning protein sequences

• Protein sequence alignments work the same way, except that a protein substitution matrix is specified
    data(BLOSUM62)
    BLOSUM62
• We will align the sequences
    seqA <- "PAWHEAE"
    seqB <- "HEAGAWGHEE"
• Execute the global alignment
    globalAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = "BLOSUM62",
                                       gapOpening = -2, gapExtension = -8, scoreOnly = FALSE)

Summary

• We touched on practical aspects of
  – global and local alignments
• Thoroughly understood both algorithms
• Applied them to both DNA and protein sequences
• Learned how to retrieve sequences both with R and using UniProt
• Learned how to align sequences using R

Resources

• Online tutorial on sequence alignment
  – http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
• Graphical alignment of proteins
  – http://www.itu.dk/~sestoft/bsa/graphalign.html
• Pairwise alignment of DNA and proteins using your rules
  – http://www.bioinformatics.org/sms2/pairwise_align_dna.html
• Documentation on libraries
  – Biostrings: http://www.bioconductor.org/packages/2.10/bioc/manuals/Biostrings/man/Biostrings.pdf
  – SeqinR: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf

Homework – HW2

Homework 2 – literature style (type 1)

You are asked to critically analyze one of the following papers in a group, by writing a report and presenting it:

1. Day-Williams AG, Zeggini E. The effect of next-generation sequencing technology on complex trait research. Eur J Clin Invest. 2011 May;41(5):561-7
  • A review paper on popular NGS technologies in the context of the genetics of complex diseases
2. Do R, Exome sequencing and complex disease: practical aspects of rare variant association studies. Hum Mol Genet. 2012 Oct 15;21(R1):R1-9
  • A more technical paper on how deep sequencing can help in association studies of rare variants with disease phenotypes, in the context of statistical genetics
3. Hurd PJ, Nelson CJ. Advantages of next-generation sequencing versus the microarray in epigenetic research. Brief Funct Genomic Proteomic. 2009 May;8(3):174-83
  • An overview paper describing how NGS technology can be used in the context of epigenetic research. NGS technology is described in detail
4. Goldstein DB. Sequencing studies in human genetics: design and interpretation. Nat Rev Genet. 2013 Jul;14(7):460-70 (password protected)
  • This paper describes how NGS data can be interpreted and contrasts it with GWAS. The paper focuses on the functional interpretation of genetic variants found in the data

Homework 2 – computer style (type 2)

• You will implement the Needleman–Wunsch global alignment algorithm in R
  – Follow the pseudo-code provided
  – Translate it into R
  – Understand alignment in depth
  – Provide a copy of your code and write a short report
    • The report should contain information on the scoring matrix and rules used
    • Include the example sequences used for alignment
    • Use comments in the code (# comment)

Homework 2 – Q&A style (type 3)

• Here you need to answer questions
  – Complete the local and global alignment of DNA and protein sequences graphically
  – Use the seqinR library to retrieve protein sequences
  – Use the Biostrings library to align sequences
  – Complete the missing R code
  – Copy the output from R as proof
  – Calculate alignment scores

Feedback on HW1

HW 1a feedback

• Some confused the abbreviated disease name with the disease-associated genes (e.g. HDL syndromes have no HDL1 gene, but the PRNP gene is associated with HDL1)
• Some printed the whole genome sequence around the disease gene, but you were asked to print only the protein-coding region (CDS)
• It would be nice to get more screenshots and to see the search query used to find articles
  – From HW1a: “Provide below the search key words used to obtain the results”

HW 2b feedback

• Computer style (type 2):
  – Good analysis on the gene level with literature searches
  – Could have addressed the variation in results before and after cleaning the data. What is the overlap in results before and after QC?
  – It would be nice to have the top 10 SNPs and corresponding p-values before and after cleaning
  – Overall, well done
• Q&A style (type 2)
  – The issue of loading *.phe and *.raw files
    • Set the working directory in R to where these files are located via setwd()
    • Check the current location with getwd()
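
A short sketch of that workflow in R. The directory used here is only a stand-in (R's session temporary directory); in practice it would be the folder holding the *.phe and *.raw files, and the variable names are illustrative.

```r
# Remember where we started so the session can be restored afterwards.
old_dir <- getwd()

# Stand-in for the folder that contains the data files (an assumption for this example).
data_dir <- tempdir()

# Point R at that folder and confirm the move.
setwd(data_dir)
in_data_dir <- normalizePath(getwd()) == normalizePath(data_dir)
in_data_dir

# List the files R can now see by bare file name, e.g. the *.phe and *.raw files.
visible_files <- list.files(pattern = "\\.(phe|raw)$")

# Go back to the original location.
setwd(old_dir)
```

If `visible_files` comes back empty in the real data folder, the files are somewhere else and the path given to `setwd()` needs correcting.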