Ismail M. El-Badawy, Ashraf M. Aziz, Senior Member, IEEE, Safa Gasser and Mohamed E. Khedr
Department of Electronics & Communications Engineering
Arab Academy for Science, Technology and Maritime Transport, Egypt
Presented by
Ismail M. El-Badawy
A New Multiple Classifiers Soft Decisions Fusion Approach for Exons Prediction in DNA Sequences
2013 IEEE International Conference on Signal and Image Processing Applications (ICSIPA 2013)
Outline
Introduction
DNA Structure
Predicting Exons Locations
Exons Prediction using DFT
Proposed Soft Decisions Fusion Approach
Performance Evaluation
Conclusion
Introduction
Digital signal processing has proved its success in different fields, and bioinformatics is one of them.
Identification of protein-coding regions in DNA sequences is one of the important topics in biosignal processing and bioinformatics.
With the significant growth of sequenced genomic data, it has become important to develop computerized methods for predicting these protein-coding regions (exons) in DNA sequences.
DNA Structure
DNA, or deoxyribonucleic acid, is the hereditary material in humans
and almost all other organisms.
DNA Structure
Organisms can be categorized into prokaryotes (e.g., bacteria) and eukaryotes (e.g., humans).
In both categories, DNA consists of genes separated by intergenic regions.
In eukaryotes, genes are further divided into protein-coding regions (exons) and non-coding regions (introns).
DNA Structure
DNA is made up of nucleotides.
Nucleotides are identified by the four nitrogen bases.
Nitrogen bases pair up with each other, forming a double helix:
Adenine (A) pairs with Thymine (T)
Cytosine (C) pairs with Guanine (G)
The two DNA strands are complementary to each other.
DNA Structure
DNA = Chain of nucleotides {A, C, G and T}.
This DNA chain (Exons and introns) can symbolically be
represented by a character string of four alphabet letters.
………TCCGATCGATCGATCTCTCTAGCGTCTACGCTAT
CATCGCTCTCTATTATCGCGCGATCGTCGATCGCGCG
AGAGTATGCTACGTCGATCGAATTG …………………………
DNA Structure
Protein-Coding regions (Exons) are the portions in DNA that
contain the information for producing proteins.
Predicting Exons Locations
Accurate prediction of exon locations in DNA sequences is important for biologists, since exons are the information-bearing parts of the sequence.
TATTCCGATCGATCGATCTCTCTAGCGTCTACGCTATCATCGCTCTCTATTATCGCGCG …… -> Exons finder -> Exons Locations
Predicting Exons Locations
The order of the nucleotides stored in the exons spells out a code for protein synthesis.
Triplets of nucleotides (codons) in the exonic segments of DNA specify each type of amino acid based on a genetic code.
Each amino acid is encoded by one or more codons (many-to-one mapping).
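The many-to-one codon mapping can be illustrated with a small Python sketch. The table below is deliberately partial (a handful of entries from the standard genetic code), and the helper names `split_into_codons` and `translate` are hypothetical, chosen only for this illustration.

```python
# Partial codon table: a few entries of the standard genetic code,
# enough to show that several codons map to the same amino acid.
CODON_TO_AMINO_ACID = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",  # four codons, one amino acid
    "AAA": "Lys", "AAG": "Lys",                              # two codons, one amino acid
    "ATG": "Met",                                            # single codon
    "TGG": "Trp",                                            # single codon
}

def split_into_codons(exon):
    """Cut an exonic string into consecutive triplets, dropping any leftover bases."""
    usable = len(exon) - len(exon) % 3
    return [exon[i:i + 3] for i in range(0, usable, 3)]

def translate(exon):
    """Look up each codon; codons missing from the partial table become '?'."""
    return [CODON_TO_AMINO_ACID.get(codon, "?") for codon in split_into_codons(exon)]

print(translate("ATGGCTGCGAAATGG"))
```

Here `GCT` and `GCG` both yield `Ala`, which is the many-to-one behaviour described on the slide.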
Predicting Exons Locations
It was shown in previous publications that exonic parts exhibit a period-3 property due to the codon structure and the non-uniform usage of codons in exonic regions.
This periodicity is absent outside the exonic segments.
……… ACGTATTCCGATCGA …………… GACTCTAGCGTCTAC ………
Predicting Exons Locations
Exon locations are predicted with digital signal processing (DSP) tools in three main steps.
…TATTCCGATCGATCGATCTCTCTAGCGTCTACGCTATCATCGCTCTCTATTATCGCGCG ……
-> Symbolic to Numeric Mapping
-> Track the strength of the period-3 component using DSP tool
-> Decision Making
-> Exons Locations
Exons Prediction using DFT
Sliding window DFT is one of various DSP methods previously proposed in the field of exons prediction based on DNA spectral analysis.
…TATTCCGATCGATCGATCTCTCTAGCGTCTACGCTATCATCGCTCTCTATTATCGCGCG ……
-> Numerical Mapping -> X[n] -> Sliding Window DFT -> S[L/3] -> Exons Locations
Exons Prediction using DFT
Calculating the power spectrum of a windowed DNA numerical sequence at k = L/3 is sufficient, as it is expected to be large in exonic regions and small outside them.
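A minimal Python sketch of this step is given below, assuming a rectangular window slid one sample at a time and a window length L that is a multiple of 3. The function name `period3_power` is hypothetical, and the slides do not state how S[L/3] is scaled, so no normalization is applied here.

```python
import cmath

def period3_power(x, L=351):
    """Power of the single DFT bin k = L/3 for every length-L window of x.

    x may be real (e.g. EIIP values) or complex (e.g. CIS values).
    A rectangular window is assumed; other window shapes would multiply
    each windowed sample by a weight before the sum.
    """
    if L % 3 != 0:
        raise ValueError("L must be a multiple of 3 so that k = L/3 is an integer")
    k = L // 3
    # DFT kernel for bin k only; the other L-1 bins are never computed.
    kernel = [cmath.exp(-2j * cmath.pi * k * n / L) for n in range(L)]
    spectrum = []
    for start in range(len(x) - L + 1):
        X_k = sum(x[start + n] * kernel[n] for n in range(L))
        spectrum.append(abs(X_k) ** 2)
    return spectrum

# A perfectly period-3 toy signal gives a strong bin; a constant signal gives none.
periodic = [1, 0, 0] * 20
flat = [1] * 60
print(period3_power(periodic, L=9)[0], period3_power(flat, L=9)[0])
```

Evaluating one bin per window costs on the order of L operations per position, which is why computing only k = L/3 is attractive compared with a full DFT.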
…TATTCCGATCGATCGATCTCTCTAGCGTCTACGCTATCATCGCTCTCTATTATCGCGCG ……
-> Numerical Mapping -> X[n] -> Sliding Window DFT -> S[L/3]
Exons Prediction using DFT
A hard decision for each nucleotide (exonic or intronic
nucleotide) is made according to the corresponding S[L/3] value,
whether it is above or below a decision threshold.
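As a rough sketch, the hard decision amounts to a per-position comparison. The function name `hard_decisions` is hypothetical, and treating a value exactly equal to the threshold as intronic is an assumption, since the slide only says "above or below".

```python
def hard_decisions(spectrum, threshold):
    """Return 1 (exonic) where S[L/3] is above the threshold, else 0 (intronic)."""
    # Assumption: a value exactly equal to the threshold counts as intronic.
    return [1 if s > threshold else 0 for s in spectrum]

print(hard_decisions([0.1, 0.5, 0.9], threshold=0.4))
```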
S[L/3] -> Exons Locations
Exons Prediction using DFT
In our work, we selected two symbolic-to-numeric mapping schemes from among those that previously showed reasonable performance.
…TATTCCGATCGATCGATCTCTCTAGCGTCTACGCTATCATCGCTCTCTATTATCGCGCG ……
-> Numerical Mapping -> X[n]
Nucleotide     EIIP     CIS
Adenine (A)    0.1260   1
Cytosine (C)   0.1340   -j
Guanine (G)    0.0806   -1
Thymine (T)    0.1335   j
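The two mappings in the table can be written as lookup tables. The sketch below uses the values from the slide; the function name `map_sequence` is hypothetical, and Python's `1j` stands for the imaginary unit j.

```python
# Values copied from the slide's table.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
CIS = {"A": 1, "C": -1j, "G": -1, "T": 1j}

def map_sequence(dna, table):
    """Turn a DNA string into the numerical sequence X[n] using the given table."""
    return [table[base] for base in dna.upper()]

print(map_sequence("TATTCC", EIIP))
print(map_sequence("TATTCC", CIS))
```

The EIIP sequence is real-valued while the CIS sequence is complex, so any later DSP step has to accept both.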
Exons Prediction using DFT
…TATTCCGATCGATCGAT…CTCTC…TAGCGTCTACGCTATCATCGCTCTCT…ATTATCGCGCG ……
-> EIIP Mapping -> X[n] -> Sliding Window DFT -> S[L/3]
-> CIS Mapping -> X[n] -> Sliding Window DFT -> S[L/3]
[Plots: S[L/3] versus Nucleotide Positions (0 to 8000), one panel per mapping scheme; vertical axis from 0 to 1.]
Gene F56F11.4 contains five exons
Exons Prediction using DFT
Each mapping scheme pronounces the peaks in some exonic segments better than the other scheme.
The peaks in the exonic segments are not always consistently large, while those in the intronic segments are not always consistently low.
[Plots: S[L/3] versus Nucleotide Positions (0 to 8000) for the two mapping schemes; vertical axis from 0 to 1.]
Gene F56F11.4 contains five exons
Proposed Soft Decisions Fusion Approach
…TATTCCGATCGATCGAT…CTCTC…TAGCGTCTACGCTATCATCGCTCTCT…ATTATCGCGCG ……
-> EIIP Mapping -> X[n] -> Sliding Window DFT -> S[L/3] -> Soft Decisions
-> CIS Mapping -> X[n] -> Sliding Window DFT -> S[L/3] -> Soft Decisions
Proposed Soft Decisions Fusion Approach
Hard Decision (0 or 1)
Soft Decision (0 to 1)
Proposed Soft Decisions Fusion Approach
Each nucleotide belongs to the exonic regions with a partial membership value (i.e., the possibility of being an exonic nucleotide).
Proposed Soft Decisions Fusion Approach
…TATTCCGATCGATCGAT…CTCTC…TAGCGTCTACGCTATCATCGCTCTCT…ATTATCGCGCG ……
-> EIIP Mapping -> X[n] -> Sliding Window DFT -> S[L/3] -> DFC
-> CIS Mapping -> X[n] -> Sliding Window DFT -> S[L/3] -> DFC
DFC -> Exons Locations
Proposed Soft Decisions Fusion Approach
The DFC averages the two local soft decisions.
If the average exceeds 0.5 (i.e., the average possibility of being an exonic nucleotide exceeds 50%), the final decision is ‘1’, otherwise ‘0’.
The combined decision is more reliable than a hard decision that depends on only one classifier.
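The fusion rule can be sketched in a few lines of Python. The function name `fuse_soft_decisions` is hypothetical, and the soft decisions are taken as given inputs in the range 0 to 1, since the slides do not spell out how each classifier derives them from S[L/3].

```python
def fuse_soft_decisions(soft_eiip, soft_cis):
    """Average the two local soft decisions per nucleotide and threshold at 0.5."""
    if len(soft_eiip) != len(soft_cis):
        raise ValueError("both classifiers must cover the same nucleotide positions")
    final = []
    for a, b in zip(soft_eiip, soft_cis):
        average = (a + b) / 2
        # Final decision is 1 only when the average strictly exceeds 0.5.
        final.append(1 if average > 0.5 else 0)
    return final

print(fuse_soft_decisions([0.9, 0.2, 0.6, 0.5], [0.3, 0.4, 0.6, 0.5]))
```

In the first position one classifier alone would say "exonic" and the other "intronic" under a 0.5 hard threshold; the average of 0.6 resolves the disagreement in favour of '1'.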
Soft Decisions -> DFC -> Exons Locations
Performance Evaluation Metrics
Prediction Decision
Positive: True or False
Negative: True or False
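The metric formulas on these slides did not survive in the transcript, so the Python sketch below assumes the standard definitions of true positive rate, false positive rate, precision and F_measure built from the four outcomes. The function names are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, TN, FN for per-nucleotide labels (1 = exonic, 0 = intronic)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, fp, tn, fn

def evaluation_metrics(tp, fp, tn, fn):
    """Standard definitions (an assumption; the slide formulas are not in the text)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0        # sensitivity / recall
    fpr = fp / (fp + tn) if fp + tn else 0.0        # 1 - specificity
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    return {"TPR": tpr, "FPR": fpr, "precision": precision, "F_measure": f_measure}

counts = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(counts, evaluation_metrics(*counts))
```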
Performance Evaluation Metrics
Area under the ROC curve (AUC) is a good indicator.
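One common way to obtain the AUC is to sweep the decision threshold over the scores, collect (FPR, TPR) points, and integrate with the trapezoidal rule. The sketch below assumes that procedure; the slides do not say how the AUC was computed, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by thresholding at every distinct score.

    Assumes labels contain at least one 1 (exonic) and one 0 (intronic).
    A position is predicted exonic when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal integration of TPR over FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

print(area_under_curve(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])))
```

An AUC of 1 means the scores separate exonic from intronic positions perfectly, while 0.5 corresponds to chance.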
Performance Evaluation Metrics
F_measure vs. decision threshold is also a good indicator.
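A curve of this kind can be produced by evaluating the F_measure at each candidate threshold and keeping the maximum. The Python sketch below assumes the standard F_measure definition and a strict "above threshold" rule; the function names and the toy data are hypothetical.

```python
def f_measure_at(scores, labels, threshold):
    """F_measure when positions with score > threshold are predicted exonic."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, thresholds):
    """Return (maximum F_measure, threshold achieving it) over the candidates."""
    return max((f_measure_at(scores, labels, t), t) for t in thresholds)

# Toy data, not taken from the HMR195 experiments.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
print(best_threshold(toy_scores, toy_labels, [0.1, 0.25, 0.5, 0.85]))
```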
Performance Evaluation
MATLAB Simulation is conducted on real data (HMR195
dataset) which is available online.
It contains 195 mammalian sequences consisting of 43 single-exon and 152 multi-exon genes.
Traditional and proposed approaches are simulated using
different window shapes with a constant length (L=351) as
reported in previous publications.
Performance Evaluation
AUC values for the HMR195 dataset, and ROC curves plotted for the Bartlett window.
Window Shape   Single Classifier (EIIP)   Single Classifier (CIS)   Multiple Classifier
Rectangular    0.7280                     0.7398                    0.7862
Nuttall        0.7264                     0.7439                    0.7972
Parzen         0.7281                     0.7457                    0.7989
Bohman         0.7314                     0.7490                    0.8021
Blackman       0.7331                     0.7504                    0.8035
Hanning        0.7387                     0.7553                    0.8079
Hamming        0.7425                     0.7580                    0.8106
Bartlett       0.7438                     0.7589                    0.8115
Performance Evaluation
% of exonic nucleotides detected as true positives
Numerical Scheme used by the classifier   Number of Classifiers   at 10% FPR   at 20% FPR   at 30% FPR
EIIP                                      1                       43.5         56.9         66.4
CIS                                       1                       46.8         59.9         68.7
Both                                      2                       54.1         67.3         76.0
At 10% FPR, the percentage of true positives improves by 24.4% over the single classifier using EIIP and by 15.6% over the single classifier using CIS.
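The quoted gains are consistent with a relative improvement, (new - baseline) / baseline, applied to the 10% FPR column. The short Python check below assumes that reading; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# True positive percentages at 10% FPR, copied from the table.
tpr_at_10_fpr = {"EIIP": 43.5, "CIS": 46.8, "Both": 54.1}

gain_over_eiip = relative_improvement(tpr_at_10_fpr["Both"], tpr_at_10_fpr["EIIP"])
gain_over_cis = relative_improvement(tpr_at_10_fpr["Both"], tpr_at_10_fpr["CIS"])
print(round(gain_over_eiip, 1), round(gain_over_cis, 1))
```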
Performance Evaluation
Maximum F_measures achieved and corresponding decision thresholds for the HMR195 dataset.
                     Single Classifier (EIIP)   Single Classifier (CIS)   Multiple Classifier
Maximum F_measure    0.4287                     0.4562                    0.5086
Decision Threshold   0.029                      0.048                     0.037
The maximum F_measure improves by 18.6% over the single classifier using EIIP and by 11.5% over the single classifier using CIS.
Conclusion
In our work, a new approach based on multiple DFT-based classifiers has been proposed for exons prediction.
Making soft decisions instead of hard decisions and depending on two classifiers instead of one helps in making more reliable decisions.
The prediction accuracy is enhanced at the expense of increased computational time and complexity.
Although the proposed approach has been investigated for only two classifiers for simplicity, it can easily be extended to more than two classifiers.
Thank You