
A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction

Li Lihong (Anna Lee)

Computer Science, 22nd April

Contents

1 Abstract

2 Introduction

3 Materials and Methods

4 Dealing with Class Imbalance: A New Supervised Over-Sampling Method

5 Experimental Results and Analysis

6 Conclusion

Abstract

Identifying protein-nucleotide interaction residues solely from protein sequences

• Protein-nucleotide interactions are ubiquitous
• Useful for both protein function annotation and drug design
• It is a typical imbalanced learning problem
• Little attention has been paid to the negative impact of class imbalance
• We propose a supervised over-sampling (SOS) algorithm and a predictor, called TargetSOS
• Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS

Introduction

• Why?
• Effort in early stage
• Great success
• Imbalanced learning problem
• Solutions
• Over-sampling technique in this study

Why?
1. Nucleotides play critical roles in various metabolic processes
2. Identifying binding residues is of significant importance for protein function analysis and drug design

Effort in early stage
1. Motif-based methods dominated this field
2. Challenges: they characterize protein-nucleotide interaction within a relatively narrow range (usually only for a single nucleotide type) and require tertiary protein structure as input

Machine-learning-based methods (great success)
1. ATPint demonstrated the feasibility of predicting … solely from protein sequence information
2. NsitePred was used for multiple nucleotides, based on larger training datasets

Imbalanced learning problem
1. The number of negative samples is significantly larger than that of positive samples
2. No existing methods have considered this serious class imbalance phenomenon

Solutions
Sample rescaling-based methods, learning-based methods, active learning, kernel learning, hybrid methods.

Supervised over-sampling technique in this study
1. The sample rescaling strategy is a basic technique that balances the sizes of the different classes by changing the number and distribution of samples within them
2. It differs from the under-sampling technique
3. Over-sampling methods: ROS, SMOTE, ADASYN, SOS
4. New predictor: TargetSOS

Materials and Methods

Benchmark Datasets

Feature Representation and Classifier

A. Extract Feature Vector from the Position-Specific Scoring Matrix

B. Extract Feature Vector from the Predicted Protein Secondary Structure

C. Support Vector Machine

Benchmark Datasets

Two benchmark datasets were chosen to evaluate the efficacy of the proposed SOS algorithm and of the implemented predictor.

ATP168
• 168 non-redundant, ATP-interacting protein sequences
• 3104 ATP-binding residues and 59226 ATP non-binding residues

NUC5
• Multiple nucleotide-interacting dataset (5 nucleotide types)
• NUC5 consists of 227, 321, 140, 56, and 105 protein sequences that interact with five types of nucleotides, i.e., ATP, ADP, AMP, GTP, and GDP, respectively.

Common point: in both datasets, the maximal pairwise sequence identity is less than 40%.

Table 1 summarizes the detailed compositions of the two benchmark datasets:

Feature Representation and Classifier

The position-specific scoring matrix (PSSM) and the predicted protein secondary structure (PSS), both of which have been demonstrated to be especially useful for protein-nucleotide binding residue prediction, are used to extract discriminative feature vectors.

Support vector machine (SVM) is used as a classifier for constructing a prediction model.

A. Extract Feature Vector from the Position-Specific Scoring Matrix.

PSSM is widely used in bioinformatics.

In this study, we obtain the PSSM of a query protein sequence by performing PSI-BLAST to search the Swiss-Prot database through three iterations and with 0.001 as the E-value cutoff against the query sequence.

We then normalize each score, denoted as x, that is contained in the PSSM using the logistic function:

$f(x) = \frac{1}{1 + e^{-x}}$

Based on the normalized PSSM, the feature vector, denoted Logistic PSSM, for each residue in the protein sequence can be extracted by applying a sliding-window technique.

With a sliding window of size 17 and 20 PSSM columns per residue, the dimensionality of the Logistic PSSM feature vector of a residue is 17 × 20 = 340.
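The normalization and windowing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy matrix, and in particular the zero padding used for window positions that fall outside the sequence are assumptions, since the slides do not say how sequence ends are handled.

```python
import math

def logistic(x):
    """Map a raw PSSM score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def window_features(matrix, window=17, pad_value=0.0):
    """Return one flattened feature vector per row (residue) of `matrix`.

    Window positions outside the sequence are filled with `pad_value`
    (an assumption; the slides do not specify the boundary handling).
    """
    half = window // 2
    pad_row = [pad_value] * len(matrix[0])
    features = []
    for center in range(len(matrix)):
        vec = []
        for pos in range(center - half, center + half + 1):
            row = matrix[pos] if 0 <= pos < len(matrix) else pad_row
            vec.extend(row)
        features.append(vec)
    return features

# Toy PSSM: 5 residues x 20 amino-acid columns of made-up integer scores.
toy_pssm = [[(r * 3 + c) % 7 - 3 for c in range(20)] for r in range(5)]
normalized = [[logistic(s) for s in row] for row in toy_pssm]
feats = window_features(normalized)
print(len(feats), len(feats[0]))  # one 17 * 20 = 340-dimensional vector per residue
```

The same `window_features` helper would apply unchanged to an L × 3 secondary-structure probability matrix, giving 17 × 3 = 51 values per residue.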

B. Extract Feature Vector from the Predicted Protein Secondary Structure

PSIPRED can predict the probabilities of each residue in a query protein sequence belonging to three secondary structure classes, i.e., coil, helix, and strand.

We obtained the predicted protein secondary structure by performing PSIPRED against the query sequence. The obtained predicted secondary structure is an L × 3 probability matrix, where L is the length of the protein sequence.

Similar to the Logistic PSSM feature extraction, we can extract a 17 × 3 = 51-D feature vector, denoted as PSS, for each residue in the protein by applying a sliding window of size 17.

C. Support Vector Machine.

We use SVM as the base-learning model to evaluate the efficacy of the proposed SOS algorithm

Let $\{(x_i, y_i)\}_{i=1}^{N}$, with $x_i \in R^d$ and $y_i \in \{+1, -1\}$, be the set of samples, where +1 and -1 are the labels of the positive class and the negative class, respectively. In linearly separable cases, SVM constructs a hyperplane that separates the samples of the two classes with a maximum margin. The optimal separating hyperplane (OSH) is constructed by finding a vector, w, and a parameter, b, that minimize $\frac{1}{2}\|w\|^2$ and satisfy the following conditions:

$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N$

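To allow for mislabeled examples, we use a soft-margin technique. For each training sample, a corresponding slack variable $\xi_i \ge 0$ is introduced, i = 1, 2, 3, …, N. Accordingly, the relaxed separation constraint is given as:

$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, N$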

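Then, the OSH can be found by minimizing:

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$

where C is a regularization parameter that controls the trade-off between the margin and the classification error.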

Furthermore, to address non-linearly separable cases, the "kernel substitution" technique is introduced as follows: first, the input vector $x_i \in R^d$ is mapped into a higher dimensional Hilbert space, H, by a non-linear kernel function, $K(x_i, x_j)$; then, the OSH in the mapped space, H, is solved using a procedure similar to that for the linear case, and the decision function is given by:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b \right)$

where the $\alpha_i$ are the coefficients obtained when solving for the OSH.
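As a rough illustration of how a kernel decision function of this form is evaluated once training is done, the sketch below sums kernel responses over a handful of training samples. The RBF kernel, the value of `gamma`, and the hand-picked coefficients are assumptions chosen for the example; the slides do not state which kernel or parameters were used, and no training is performed here.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-gamma * sq_dist)

def decision(x, samples, labels, alphas, b, kernel=rbf_kernel):
    """Sign of the kernel expansion sum_i y_i * alpha_i * K(x_i, x) + b."""
    score = sum(y * a * kernel(s, x) for s, y, a in zip(samples, labels, alphas)) + b
    return 1 if score >= 0 else -1

# Hand-picked toy model: one positive and one negative training sample.
samples = [[0.0, 0.0], [2.0, 2.0]]
labels = [+1, -1]
alphas = [1.0, 1.0]   # illustrative values, not the result of training
bias = 0.0

near_positive = decision([0.1, 0.1], samples, labels, alphas, bias)
near_negative = decision([1.9, 2.0], samples, labels, alphas, bias)
print(near_positive, near_negative)
```

A query point is assigned to whichever class has the stronger weighted kernel response, which is why points close to the positive sample come out as +1.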

Dealing with Class Imbalance: A New Supervised Over-Sampling Method

A. Random Over-sampling.
B. Synthetic Minority Over-sampling Technique.
C. Adaptive Synthetic Sampling.
D. Proposed Supervised Over-sampling.

A. Random Over-sampling.

In the ROS technique, the minority set Smin is augmented by replicating randomly selected samples within the set.
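A minimal sketch of this replication step is shown below. The function name, the `n_new` parameter, and the fixed seed are illustrative choices rather than anything specified in the slides.

```python
import random

def random_over_sample(minority, n_new, seed=0):
    """Augment the minority set with n_new randomly chosen replicas."""
    rng = random.Random(seed)
    replicas = [rng.choice(minority) for _ in range(n_new)]
    return list(minority) + replicas

# Toy minority set of three 2-D samples, enlarged by four replicas.
s_min = [[0.2, 0.7], [0.4, 0.1], [0.9, 0.5]]
augmented = random_over_sample(s_min, n_new=4)
print(len(augmented))
```

Because every added sample is an exact copy of an existing one, no new information enters the training set.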

ROS is easy to perform, but it tends to cause over-fitting.

To address this problem, we introduce SMOTE and ADASYN.

B. Synthetic Minority Over-sampling Technique.

For each sample $x_i$ in $S_{min}$, let $S_i^K$ be the set of the K-nearest neighbors of $x_i$ in $S_{min}$ under the Euclidean distance metric. To synthesize a new sample, an element in $S_i^K$, denoted as $\hat{x}_i$, is selected; the feature vector difference between $\hat{x}_i$ and $x_i$ is then multiplied by a random number in [0, 1]. Finally, this vector is added to $x_i$:

$x_{new} = x_i + (\hat{x}_i - x_i) \times \delta$

where $\delta \in [0, 1]$ is a random number.
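The interpolation step can be sketched as below, assuming a small in-memory minority set. The helper names, the default `k`, and the number of synthetic samples per original sample are illustrative assumptions, and the brute-force neighbor search is only suitable for toy data.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (same ordering as Euclidean distance)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def smote(minority, n_per_sample=1, k=3, seed=0):
    """Create n_per_sample synthetic points for every minority sample."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # K nearest minority neighbors of x, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        for _ in range(n_per_sample):
            x_hat = rng.choice(neighbors)
            delta = rng.random()  # random number in [0, 1)
            synthetic.append([xi + (hi - xi) * delta for xi, hi in zip(x, x_hat)])
    return synthetic

# Toy minority set: the four corners of the unit square.
s_min = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(s_min, n_per_sample=2)
print(len(new_points))
```

Each synthetic point lies on the line segment between a minority sample and one of its minority neighbors, so it stays inside the region already occupied by the minority class.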

C. Adaptive Synthetic Sampling

SMOTE creates the same number of synthetic samples for each original minority sample without considering the neighboring majority samples, which increases the occurrence of overlapping between classes.

In view of this limitation, ADASYN is introduced.
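The slides do not spell out how ADASYN works, so the sketch below is based on the commonly described form of the method rather than on this presentation: each minority sample receives a share of the synthetic samples proportional to the fraction of majority samples among its K nearest neighbors in the whole dataset. The function names, the rounding, and the even-split fallback are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def adasyn_allocation(minority, majority, n_total, k=3):
    """Number of synthetic samples to create around each minority sample."""
    labeled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for x in minority:
        others = [(p, lab) for p, lab in labeled if p is not x]
        others.sort(key=lambda item: sq_dist(x, item[0]))
        # Fraction of majority samples among the k nearest neighbors.
        ratios.append(sum(lab for _, lab in others[:k]) / k)
    total = sum(ratios)
    if total == 0:
        # No minority sample has majority neighbors: split evenly (assumption).
        return [n_total // len(minority)] * len(minority)
    return [round(n_total * r / total) for r in ratios]

# Two minority samples far from the majority class, one surrounded by it.
s_min = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]]
s_maj = [[2.1, 2.0], [2.0, 2.1], [1.9, 2.0], [5.0, 5.0]]
allocation = adasyn_allocation(s_min, s_maj, n_total=10)
print(allocation)
```

In this toy example the minority sample surrounded by majority samples receives the largest share, in contrast to SMOTE's uniform allocation.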

D. Proposed Supervised Over-sampling.

Experimental Results and Analysis

Evaluation indexes

Supervised Over-Sampling Helps to Enhance Prediction Performance

Comparisons with Other Over-Sampling Methods

Comparisons with Existing Predictors
A. Cross-Validation Test.
B. Independent Validation Test.

Evaluation Indexes

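Let TP, FP, TN, and FN be the abbreviations for true positive, false positive, true negative, and false negative, respectively. Then, Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), and the Matthews correlation coefficient (MCC) can be defined as follows:

$Sen = \frac{TP}{TP + FN}$

$Spe = \frac{TN}{TN + FP}$

$Acc = \frac{TP + TN}{TP + FP + TN + FN}$

$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$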
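Using the standard definitions of these four indexes, they can be computed directly from the confusion-matrix counts. The sketch below uses made-up counts for illustration only (they are not results from the study), and returning 0 for MCC when the denominator vanishes is a common convention assumed here.

```python
import math

def evaluation_indexes(tp, fp, tn, fn):
    """Return (Sen, Spe, Acc, MCC) from confusion-matrix counts."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): report MCC as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, spe, acc, mcc

# Made-up, imbalanced toy counts: 90 positives versus 910 negatives.
sen, spe, acc, mcc = evaluation_indexes(tp=50, fp=10, tn=900, fn=40)
print(f"Sen={sen:.3f} Spe={spe:.3f} Acc={acc:.3f} MCC={mcc:.3f}")
```

With imbalanced counts like these, accuracy stays high even though almost half of the positives are missed, which is why Sen and MCC are reported alongside it.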

Supervised Over-Sampling Helps to Enhance Prediction Performance

Figure 1. ROC curves of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation. (a) ROC curves for ATP168; (b) ROC curves for ATP227.

Comparisons with Other Over-Sampling Methods

Comparisons with Existing Predictors
A. Cross-Validation Test.
B. Independent Validation Test.

A. Cross-Validation Test.

B. Independent Validation Test.

Conclusion

In this study, a new SOS algorithm is proposed to address imbalanced learning problems; it balances the samples of different classes by synthesizing additional samples for the minority class with a supervised process. We apply the proposed SOS algorithm to protein-nucleotide binding residue prediction, and a web server, called TargetSOS, is implemented. Cross-validation tests and independent validation tests on two benchmark datasets demonstrate that the proposed SOS algorithm helps to improve the performance of protein-nucleotide binding residue prediction. The findings of this study enrich the understanding of class imbalance learning, and the approach is sufficiently flexible to be applied to other bioinformatics problems in which class imbalance exists, such as protein functional residue prediction and disulfide bond prediction.

Thank you for your attention!
