
24/12/2017 DeepLoc Multiclass predictions - Part II


DeepLoc Multiclass predictions - Part II

Previously, we showed that if we know how many locations a protein has, we can predict these locations from the DeepLoc scores with ~50% accuracy.

In this notebook, we investigate whether we can also predict how many locations a protein occurs at.

Load and prepare data

We use the Human Protein Atlas (HPA) as a reference dataset and map the DeepLoc data to it.

    library(tidyverse)  # provides read_tsv(), the dplyr verbs and ggplot2

    deeploc = read_tsv("../results/deeploc_all.tsv")
    location_mapping = read_tsv("../results/location_mapping.tsv")
    hpa = read_tsv("../results/hpa_filtered.tsv")

After preparation we have a table like this:

    deeploc_scores

    ## # A tibble: 79,800 x 5
    ##    Prediction hgnc  n_locations compartment            score
    ##    <chr>      <chr>       <int> <chr>                  <dbl>
    ##  1 Cytoplasm  A1CF            1 Nucleus               0.0985
    ##  2 Cytoplasm  A1CF            1 Cytoplasm             0.4778
    ##  3 Cytoplasm  A1CF            1 Extracellular         0.0059
    ##  4 Cytoplasm  A1CF            1 Mitochondrion         0.0125
    ##  5 Cytoplasm  A1CF            1 Cell_membrane         0.0037
    ##  6 Cytoplasm  A1CF            1 Endoplasmic_reticulum 0.0008
    ##  7 Cytoplasm  A1CF            1 Plastid               0.0106
    ##  8 Cytoplasm  A1CF            1 Golgi_apparatus       0.0001
    ##  9 Cytoplasm  A1CF            1 Lysosome/Vacuole      0.0020
    ## 10 Cytoplasm  A1CF            1 Peroxisome            0.3881
    ## # ... with 79,790 more rows
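The preparation step itself is not shown in this notebook. As a rough sketch of how a long table of this shape could be built, assume a wide DeepLoc table with one score column per compartment and an HPA table that already carries the number of locations per gene. The toy tables and all their column layouts are assumptions for illustration; the real files may be organised differently, and the `location_mapping` table is not used here.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the real files; values and column layout are assumptions.
deeploc_toy = tibble(
  hgnc = c("GENE1", "GENE2"),
  Prediction = c("Cytoplasm", "Nucleus"),
  Nucleus = c(0.10, 0.70),
  Cytoplasm = c(0.90, 0.30)
)
hpa_toy = tibble(
  hgnc = c("GENE1", "GENE2", "GENE3"),
  n_locations = c(1L, 2L, 1L)
)

# One row per (protein, compartment), restricted to proteins found in both tables.
deeploc_scores_toy = deeploc_toy %>%
  pivot_longer(c(Nucleus, Cytoplasm), names_to = "compartment", values_to = "score") %>%
  inner_join(hpa_toy, by = "hgnc") %>%
  select(Prediction, hgnc, n_locations, compartment, score)

deeploc_scores_toy
```

The inner join drops proteins that lack either a DeepLoc prediction or an HPA annotation, which is one plausible reading of "mapping" the two datasets.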



Results

Distribution of Scores

Here, we show the score distribution of the highest-scoring, second-highest-scoring, … compartment for proteins that have one or two locations according to HPA.

While there are (statistically significant) differences, we already see that it will be hard to distinguish between the two categories using the score alone.

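    # Keep proteins with one or two HPA locations and rank the
    # compartment scores within each protein (rank 1 = highest score).
    deeploc_score_distrib = deeploc_scores %>%
      filter(n_locations <= 2) %>%
      mutate(n_locations=as.factor(n_locations)) %>%
      group_by(hgnc) %>%
      mutate(rank=rank(-score, ties.method="max")) %>%
      arrange(hgnc)

    # One pair of boxes per rank, coloured by the number of HPA locations.
    deeploc_score_distrib %>%
      ggplot(aes(x=as.factor(rank), y=score)) +
      geom_boxplot(aes(colour=n_locations))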
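A difference can be statistically significant and still be of little use for classification. The sketch below illustrates this with simulated values, not with the notebook's data. The notebook does not say which test was used; the Wilcoxon rank-sum test here is our assumption, chosen because its statistic W, divided by the number of pairs, estimates the probability that a one-location score exceeds a two-location score.

```r
set.seed(42)
n = 2000

# Simulated "top scores" on (0, 1); arbitrary values, not the notebook's data.
top_one_loc = plogis(rnorm(n, mean = 0.2))  # proteins with one location
top_two_loc = plogis(rnorm(n, mean = 0.0))  # proteins with two locations

test = wilcox.test(top_one_loc, top_two_loc)

# W counts the pairs in which the one-location score is the larger one,
# so W / (n * n) estimates P(one-location score > two-location score).
separation = unname(test$statistic) / (n * n)

test$p.value  # small: the shift is detectable with this many proteins
separation    # yet not far from 0.5, i.e. close to a coin flip per pair
```

With thousands of proteins per group, even a small shift produces a small p-value, while the per-pair separation stays close to chance.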



Simple classification by Threshold

The basic idea is to apply a threshold to the highest score. If the highest score is above the threshold, we consider the protein as “having only one subcellular location”; otherwise, we consider it as “having two subcellular locations”.

The figure below shows an example for a cutoff of 0.7:



    thres = .7

    # Compare predicted and actual location counts; "positive" means
    # two locations. Vectorised, so it works on whole columns.
    get_state = function(n_pred, n_actual) {
      case_when(
        n_pred == 1 & n_actual == 1 ~ "TN",
        n_pred == 1                 ~ "FN",
        n_actual == 2               ~ "TP",
        TRUE                        ~ "FP"
      )
    }

    # Keep the top-ranked score per protein and classify it by the threshold.
    max_score = deeploc_score_distrib %>%
      filter(rank == 1) %>%
      mutate(pred_locations=ifelse(score>thres,1,2)) %>%
      mutate(state=get_state(pred_locations, n_locations))

    max_score %>%
      ggplot(aes(x=n_locations, y=score)) +
      geom_violin(col='black') +
      geom_boxplot(width=.1) +
      geom_jitter(alpha=.3, shape=1, aes(colour=state)) +
      geom_hline(yintercept=thres, col='red')
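To make the TP/FP/TN/FN labels concrete, here is a small worked example in base R. The six scores and location counts are made up for illustration. As in `get_state`, "two locations" is treated as the positive class.

```r
# Hypothetical top scores and HPA location counts for six proteins.
top_score = c(0.95, 0.80, 0.65, 0.60, 0.75, 0.40)
n_actual  = c(1, 1, 1, 2, 2, 2)
thres = 0.7

# Same rule as above: a top score above the threshold means "one location".
n_pred = ifelse(top_score > thres, 1, 2)

confusion = table(predicted = n_pred, actual = n_actual)

# Fraction of two-location proteins that are recognised as such.
sensitivity = sum(n_pred == 2 & n_actual == 2) / sum(n_actual == 2)
# Fraction of one-location proteins that are recognised as such.
specificity = sum(n_pred == 1 & n_actual == 1) / sum(n_actual == 1)

confusion
c(sensitivity = sensitivity, specificity = specificity)
```

In this toy case one protein of each class falls on the wrong side of the cutoff, so both sensitivity and specificity are 2/3.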



ROC Analysis

To quantify the performance across all thresholds, we perform a ROC analysis. As expected, the overall performance is quite poor, with an AUC of 0.54.

    library(pROC)  # provides roc()

    roc_curve = roc(response=max_score$n_locations, predictor=max_score$score)
    plot(roc_curve)
    title(paste("AUC =", round(roc_curve$auc, 2)))
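For readers who want to see what the ROC analysis computes, the following base R sketch sweeps the threshold by hand and derives the AUC from ranks. It uses made-up scores, not the notebook's data, and is not a description of how `pROC` works internally. "Two locations" is the positive class, predicted whenever the top score is at or below the threshold.

```r
# Hypothetical top scores; is_two marks proteins with two HPA locations.
top_score = c(0.95, 0.80, 0.65, 0.60, 0.75, 0.40)
is_two    = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# One ROC point per candidate threshold.
thresholds = sort(unique(top_score))
roc_points = data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t) mean(top_score[is_two] <= t)),
  fpr = sapply(thresholds, function(t) mean(top_score[!is_two] <= t))
)

# AUC via the rank-sum identity: the fraction of (positive, negative)
# pairs in which the positive has the lower top score.
r = rank(-top_score)  # a low score gets a high rank
n_pos = sum(is_two)
n_neg = sum(!is_two)
auc_manual = (sum(r[is_two]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

roc_points
auc_manual
```

Here 8 of the 9 positive-negative pairs are ordered correctly, giving an AUC of 8/9. An AUC of 0.54, as observed on the real data, means that barely more than half of such pairs are ordered correctly.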



Conclusion

We conclude that multiclass prediction from DeepLoc scores alone is not feasible. For further investigation, we suggest either

- obtaining unnormalized scores from the Neural Network, or
- adding the number of locations as an output to the Neural Network directly.
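The first suggestion can be motivated with a short sketch. The ten scores in the example table above sum to 1, which is consistent with a softmax-style normalization. That the network uses a softmax is our assumption; the notebook does not state it. Under that assumption, the normalized scores no longer carry the overall magnitude of the network outputs:

```r
# Numerically stable softmax.
softmax = function(z) {
  e = exp(z - max(z))
  e / sum(e)
}

# Two hypothetical sets of unnormalized outputs with the same pattern,
# the second shifted upwards by 10.
logits_weak   = c(2, 1, 0)
logits_strong = c(12, 11, 10)

softmax(logits_weak)
all.equal(softmax(logits_weak), softmax(logits_strong))
```

Both inputs yield the same normalized scores, so any information carried by the absolute level of the outputs is lost after normalization. Whether that level helps to predict the number of locations would have to be tested on the unnormalized scores.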
