24/12/2017 DeepLoc Multiclass predictions - Part II

file:///Volumes/share/Dokumente/PhD/Teaching/WS1718/PP2_CB/17-12-21_deeploc_multiclass_2.html 1/6

DeepLoc Multiclass predictions - Part II

Previously, we showed that if we know how many locations a protein has, we can predict these locations from the DeepLoc scores at ~50% accuracy.

In this notebook, we investigate whether we can also predict at how many locations a protein occurs.

Load and prepare data

We use the Human Protein Atlas (HPA) as a reference dataset. We map the DeepLoc data to the HPA data.

library(readr)  # provides read_tsv()

deeploc = read_tsv("../results/deeploc_all.tsv")
location_mapping = read_tsv("../results/location_mapping.tsv")
hpa = read_tsv("../results/hpa_filtered.tsv")

After preparation, we have a table like this:

deeploc_scores

## # A tibble: 79,800 x 5
##    Prediction hgnc  n_locations compartment            score
##    <chr>      <chr>       <int> <chr>                  <dbl>
##  1 Cytoplasm  A1CF            1 Nucleus               0.0985
##  2 Cytoplasm  A1CF            1 Cytoplasm             0.4778
##  3 Cytoplasm  A1CF            1 Extracellular         0.0059
##  4 Cytoplasm  A1CF            1 Mitochondrion         0.0125
##  5 Cytoplasm  A1CF            1 Cell_membrane         0.0037
##  6 Cytoplasm  A1CF            1 Endoplasmic_reticulum 0.0008
##  7 Cytoplasm  A1CF            1 Plastid               0.0106
##  8 Cytoplasm  A1CF            1 Golgi_apparatus       0.0001
##  9 Cytoplasm  A1CF            1 Lysosome/Vacuole      0.0020
## 10 Cytoplasm  A1CF            1 Peroxisome            0.3881
## # ... with 79,790 more rows
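The preparation code itself is not shown in this notebook. As a rough sketch of how a long table with these five columns could be built, the following base R example joins a toy DeepLoc table to a toy HPA table and stacks the per-compartment score columns. The input layout (one score column per compartment), the gene names and all values are assumptions for illustration, and the `location_mapping` step is left out.

```r
# Hypothetical wide DeepLoc output: one score column per compartment.
deeploc_toy <- data.frame(
  hgnc = c("GENE1", "GENE2", "GENE3"),
  Prediction = c("Cytoplasm", "Nucleus", "Nucleus"),
  Nucleus = c(0.2, 0.7, 0.6),
  Cytoplasm = c(0.8, 0.3, 0.4),
  stringsAsFactors = FALSE
)

# Hypothetical HPA reference: number of annotated locations per gene.
hpa_toy <- data.frame(
  hgnc = c("GENE1", "GENE2"),
  n_locations = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Inner join: keep only genes present in both tables (GENE3 is dropped).
merged <- merge(deeploc_toy, hpa_toy, by = "hgnc")

# Stack the score columns into one row per (gene, compartment).
compartments <- c("Nucleus", "Cytoplasm")
scores_long <- do.call(rbind, lapply(compartments, function(comp) {
  data.frame(
    Prediction = merged$Prediction,
    hgnc = merged$hgnc,
    n_locations = merged$n_locations,
    compartment = comp,
    score = merged[[comp]],
    stringsAsFactors = FALSE
  )
}))
scores_long <- scores_long[order(scores_long$hgnc), ]
print(scores_long)
```

Each gene that survives the join contributes one row per compartment, which matches the shape of the table above (ten rows per protein in the real data).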

Results

Distribution of Scores

Here, we show the score distribution for the highest, second highest, … prediction for proteins that have one or two locations according to HPA, respectively.

While there are (statistically significant) differences, we already see that it will be hard to distinguish between the two categories using the score only.

library(dplyr)
library(ggplot2)

# Rank the scores within each protein (rank 1 = highest score).
deeploc_score_distrib = deeploc_scores %>%
  filter(n_locations <= 2) %>%
  mutate(n_locations=as.factor(n_locations)) %>%
  group_by(hgnc) %>%
  mutate(rank=rank(-score, ties.method="max")) %>%
  arrange(hgnc)

deeploc_score_distrib %>%
  ggplot(aes(x=as.factor(rank), y=score)) +
  geom_boxplot(aes(colour=n_locations))
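To make the ranking step concrete, the following base R sketch applies the same `rank(-score, ties.method="max")` call to the A1CF scores from the table above, and to a small made-up vector with a tie. The tied vector is an assumption for illustration only.

```r
# Scores of A1CF, copied from the table above.
a1cf <- c(Nucleus = 0.0985, Cytoplasm = 0.4778, Extracellular = 0.0059,
          Mitochondrion = 0.0125, Cell_membrane = 0.0037,
          Endoplasmic_reticulum = 0.0008, Plastid = 0.0106,
          Golgi_apparatus = 0.0001, `Lysosome/Vacuole` = 0.0020,
          Peroxisome = 0.3881)

# Negating the scores makes the highest score receive rank 1.
a1cf_rank <- rank(-a1cf, ties.method = "max")
print(names(a1cf_rank)[a1cf_rank <= 2])  # the two top-ranked compartments

# Hypothetical scores with a tie for first place.
tied <- c(A = 0.4, B = 0.4, C = 0.2)
tied_rank <- rank(-tied, ties.method = "max")
print(tied_rank)  # with "max", both tied entries get the larger rank
```

One consequence of `ties.method="max"` is that a protein whose two best scores are exactly equal has no entry with rank 1, so a later `filter(rank == 1)` would drop it. With continuous scores this should be rare, but it is worth keeping in mind.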

Simple classification by Threshold

The basic idea is to use a threshold for the highest score. Above a certain threshold, we consider the protein as “having only one subcellular location”. Below the threshold, we consider it as “having two subcellular locations”.

The figure below shows an example for a cutoff of 0.7:

thres = .7

# Label each prediction; "positive" means two locations.
# Vectorised with case_when() so that it does not depend on
# max_score still being grouped into one row per protein.
get_state = function(n_pred, n_actual) {
  case_when(
    n_pred == 1 & n_actual == 1 ~ "TN",
    n_pred == 1                 ~ "FN",
    n_actual == 2               ~ "TP",
    TRUE                        ~ "FP"
  )
}

max_score = deeploc_score_distrib %>%
  filter(rank == 1) %>%
  mutate(pred_locations=ifelse(score>thres, 1, 2)) %>%
  mutate(state=get_state(pred_locations, n_locations))

max_score %>%
  ggplot(aes(x=n_locations, y=score)) +
  geom_violin(col='black') +
  geom_boxplot(width=.1) +
  geom_jitter(alpha=.3, shape=1, aes(colour=state)) +
  geom_hline(yintercept=thres, col='red')
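The threshold rule can also be summarised numerically. The base R sketch below applies the same rule to six made-up proteins and counts the four outcomes, with "two locations" treated as the positive class as in the code above. The scores and labels are assumptions for illustration, not values from the dataset.

```r
thres <- 0.7

# Hypothetical top scores and HPA location counts for six proteins.
toy <- data.frame(
  score = c(0.95, 0.80, 0.65, 0.40, 0.75, 0.55),
  n_locations = c(1, 1, 1, 2, 2, 2)
)

# Top score above the threshold -> predict one location, otherwise two.
toy$pred_locations <- ifelse(toy$score > thres, 1, 2)

# Same labelling as get_state(): "positive" means two locations.
toy$state <- ifelse(
  toy$pred_locations == 1,
  ifelse(toy$n_locations == 1, "TN", "FN"),
  ifelse(toy$n_locations == 2, "TP", "FP")
)

# Count each outcome and derive sensitivity and specificity.
counts <- sapply(c("TP", "FP", "TN", "FN"), function(s) sum(toy$state == s))
sensitivity <- counts[["TP"]] / (counts[["TP"]] + counts[["FN"]])
specificity <- counts[["TN"]] / (counts[["TN"]] + counts[["FP"]])

print(counts)
print(c(sensitivity = sensitivity, specificity = specificity))
```

Moving `thres` up or down trades one kind of error for the other, which is why a single cutoff gives only a partial picture of the classifier.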

ROC Analysis

To quantify the performance for all thresholds, we perform a ROC analysis. As expected, the overall performance is quite poor, with an AUC of 0.54.

library(pROC)  # provides roc()

roc_curve = roc(response=max_score$n_locations, predictor=max_score$score)
plot(roc_curve)
title(paste("AUC =", round(roc_curve$auc, 2)))
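As a reminder of what the AUC measures, the base R sketch below computes it by hand for six made-up proteins: it is the fraction of (one-location, two-location) pairs in which the one-location protein has the higher top score, with ties counted as one half. The data are assumptions for illustration, and this is a sketch of the definition rather than a replacement for the `roc()` call above.

```r
# Hypothetical top scores, split by the number of HPA locations.
single <- c(0.95, 0.80, 0.65)  # proteins with one location
double <- c(0.40, 0.75, 0.55)  # proteins with two locations

# Compare every single-location score with every two-location score.
wins <- outer(single, double, ">")
ties <- outer(single, double, "==")

# AUC = probability that a random single-location protein scores higher
# than a random two-location protein (ties count as 0.5).
auc_by_hand <- mean(wins + 0.5 * ties)
print(auc_by_hand)
```

An AUC of 0.5 corresponds to random guessing, so the value of 0.54 reported above means that the top score orders a random pair correctly only slightly more often than a coin flip.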

Conclusion

We conclude that multiclass prediction from DeepLoc scores alone is not feasible. For further investigation, we suggest either

- obtaining unnormalized scores from the neural network, or
- adding the number of locations as an output to the neural network directly.
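To illustrate why unnormalized scores might carry extra information, the base R sketch below assumes that the reported scores are a softmax over the compartments (the scores in the table above sum to 1, which is consistent with this, but it is an assumption). The logit values and the names `loc_a`, `loc_b`, `loc_c` are made up for illustration.

```r
# Softmax: normalises raw network outputs (logits) so that they sum to 1.
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the maximum for numerical stability
  e / sum(e)
}

# Per-class sigmoid: scores each class independently of the others.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Two hypothetical proteins: strong vs. weak evidence for two locations.
strong <- c(loc_a = 5, loc_b = 5, loc_c = 0)
weak   <- c(loc_a = 1, loc_b = 1, loc_c = -4)

# The logits differ only by a constant shift, so softmax cannot tell
# the two proteins apart.
same_after_softmax <- isTRUE(all.equal(softmax(strong), softmax(weak)))
print(same_after_softmax)

# The unnormalised view keeps the difference in evidence strength.
print(round(sigmoid(strong), 3))
print(round(sigmoid(weak), 3))
```

Under this assumption, two equally likely locations always share the probability mass after normalisation, whatever the strength of the underlying evidence, which may be one reason the top score separates the two groups so poorly.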