pittsburgh learning classifier systems for protein structure prediction: scalability and...
DESCRIPTION
Jaume Bacardit explores the usage of GBML for protein structure predictionTRANSCRIPT
![Page 1: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/1.jpg)
Pittsburgh Learning Classifier Systems for ProteinStructure Prediction: Scalability and ExplanatoryPower
Jaume Bacardit & Natalio KrasnogorASAP research groupSchool of Computer Science & ITUniversity of Nottingham
![Page 2: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/2.jpg)
Outline
Introduction The GAssist Pittsburgh LCS Problem definition Experimentation design Results and discussion Conclusions and further work
![Page 3: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/3.jpg)
Introduction
Proteins are biological molecules ofprimary importance to the functioningof living organisms
Proteins are constructed as a chains ofamino acid residues
This chain folds to create a 3Dstructure
![Page 4: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/4.jpg)
Introduction
It is relatively easy to know the primarysequence of a protein, but much moredifficult to know its 3D structure
Can we predict the 3D structure of a proteinfrom its primary sequence? ProteinStructure Prediction (PSP)
PSP problem is divided in several subproblems. We focus on CoordinationNumber (CN) prediction
![Page 5: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/5.jpg)
Introduction
We will use the GAssist [Bacardit,2004] Pittsburgh LCS to solve thisproblem
We analyze two sides of the problem: Performance Interpretability
We also study the scalability issue
![Page 6: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/6.jpg)
The GAssist Pittsburgh LCS
GAssist uses a generational GA where eachindividual is a complete solution to theclassification problem
A solution is encoded as an ordered set ofrules
GABIL [De Jong & Spears, 93]representation and crossover operator areused
Representation is extended with an explicitdefault rule
Covering style initialization operator
![Page 7: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/7.jpg)
The GAssist Pittsburgh LCS
Representation of GABIL for nominalattributes Predicate → Class Predicate: Conjunctive Normal Form (CNF)
(A1=V11∨.. ∨ A1=V1
n) ∧.....∧ (An=Vn2∨.. ∨ An=Vn
m) Ai : ith attribute Vi
j : jth value of the ith attribute “If
(attribute ‘number’ takes values ‘1’ or ‘2’) and (attribute ‘letter’ takes values ‘a’ or ‘b’) and (attribute ‘color’ takes values ‘blue’ or ‘red’)
Then predict class ‘class1’
![Page 8: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/8.jpg)
The GAssist Pittsburgh LCS
Fitness function based on the MDLprinciple [Rissanen,78] to balanceaccuracy and complexity Our aim is to create compact individuals
for two reasons Avoiding over-learning Increasing the explicability of the rules
Complexity metric will promote Individuals with few rules Individuals with few expressed attributes
![Page 9: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/9.jpg)
The GAssist Pittsburgh LCS
Costly evaluation process if dataset is big Computational cost is alleviated by using a
windowing mechanism called ILAS
This mechanism also introduces somegeneralization pressure
Training set
0 Ex/n 2·Ex/n Ex3·Ex/n
Iterations
0 Iter
![Page 10: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/10.jpg)
Problem definition
MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQREKIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKKHLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYLIKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE
Primary Structure = Sequence
Secondary Structure Tertiary
Local Interactions Global Interactions
![Page 11: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/11.jpg)
Problem definition
Coordination Number (CN) prediction Two residues of a chain are said to be in
contact if their distance is less than a certainthreshold
CN of a residue : number of contacts that acertain residue has
![Page 12: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/12.jpg)
Problem definition
Definition of CN [Kinjo et al., 05] Distance between two residues is defined as the
distance between their Cβ atoms (Cα for Glycine) Uses a smooth definition of contact based on a
sigmoid function instead of the usual crispdefinition
Discards local contacts
Unsupervised discretization is needed totransform the CN domain into a classificationproblem
![Page 13: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/13.jpg)
Experimentation design
Protein dataset Used the same set used by Kinjo et al. 1050 protein chains 259768 residues Ten lists of the chains are available, first 950
chains in each list are for training, the rest fortests (10xbootstrap)
Data was extracted from PDB format andcleaned of inconsistencies and exceptions
![Page 14: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/14.jpg)
Experimentation design
We have to transform the data into a regularstructure so that it can be processed by standardmachine learning techniques
Each residue is characterized by several features.We use one (i.e., the AA type) or more of them asinput information and one of them as target (CN)
RiCNi
Ri+1CNi+1
Ri-1CNi-1
Ri+2CNi+2
Ri-2CNi-2
Ri+3CNi+3
Ri+4CNi+4
Ri-3CNi-3
Ri-4CNi-4
Ri-5CNi-5
Ri+5CNi+5
Ri-1,Ri,Ri+1 CNiRi,Ri+1,Ri+2 CNi+1Ri+1,Ri+2,Ri+3 CNi+2
![Page 15: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/15.jpg)
Experimentation design
GAssist adjustements 150 strata for the ILAS windowing
system 20000 iterations Majority class will be used by the
default rule
![Page 16: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/16.jpg)
Experimentation design
Distance cut-off of 10Å Window size of 9 (4 residues at each side of
the target) 3 number of states (2, 3, 5) and 2 definitions
of state partitions 3 sources of input data that generate 6
combination of attributes GAssist compared against SVM,
NaiveBayes and C4.5
![Page 17: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/17.jpg)
Results and discussion
Summary of results Global ranking of performance:LIBSVM GAssist Naive Bayes C4.5 CN prediction accuracy
2 states: ~80% 3 states: ~67% 5 states: ~47%
![Page 18: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/18.jpg)
Results and discussion
LIBSVM consistently achieved betterperformance than GAssist. However: Learning time for these experiments was
up to 4.5 days Test time could take hours Interpretability is impossible. 70-90% of
instances were selected as supportvectors
![Page 19: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/19.jpg)
Results and discussion
Interpretability analysis of GAssist Example of a rule set for the CN1-UL-2
classes dataset
![Page 20: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/20.jpg)
Results and discussion
![Page 21: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/21.jpg)
Results and discussion
Interpretability analysis of GAssist All AA types associated to the central
residue are hydrophobic (core of aprotein)
D, E consistently do not appear in thepredicates. They are negatively chargesresidues (surface of a protein)
![Page 22: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/22.jpg)
Results and discussion
Interpretability analysis. Relevance andspecificity of the attributes
![Page 23: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/23.jpg)
Results and discussion
Is GAssist learning? Evolution of the training accuracy of the best solution
The knowledgelearnt by thedifferent strata doesnot integratesuccessfully into asingle solution
![Page 24: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/24.jpg)
Conclusions and further work
Initial experiments of the GAssist PittsburghLCS applied to Bioinformatics
GAssist can provide compact andinterpretable solutions
Its accuracy, however, needs to beimproved but is not far from state-of-the-art
These datasets are a challenge for LCS inseveral aspects such as scalability andrepresentations, but LCS can give moreexplanatory power than other methods
![Page 25: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/25.jpg)
Conclusions and further work
Future directions Domain solutions
Subdividing the problem into several subproblems andaggregate the solutions
Smooth predictions of neighbouring residues
Improving GAssist Finding a more stable (slower) tuning and parallelize With purely ML techniques Feeding back information from the interpretability
analysis to bias/restrict the search
![Page 26: Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power](https://reader034.vdocuments.site/reader034/viewer/2022042813/549fddedac795938768b4af4/html5/thumbnails/26.jpg)
Coordination number prediction usingLearning Classifier Systems
Questions?