
III International Symposium on “Generation and Transfer of Knowledge for Digital Transformation”

SITIC 2025


Machine and deep learning classification of chitinase enzymes including explanations.

Abstract

Enzyme classification is a recognised challenge in bioinformatics. In this study, we present a classifier based on vector representations of sequences, autoencoders for anomaly detection and convolutional networks to characterise chitinases belonging to the glycoside hydrolases, implemented in Python with the TensorFlow framework. The performance of the deep learning model is compared with that of a model built using XGBoost. The classifier consists of three levels that determine, in turn, whether a protein is an enzyme, whether it is a hydrolase, and what its enzymatic activity is, taking into account the low representation of these enzymes in the CAZy database (cazy.org). For the first two levels, the neural networks and the XGBoost model gave similar results, with an accuracy of around 90%. The proposed solutions are compared with ProteInfer. In addition, the proteome of Bacillus spp. is explored in search of potential enzymes of these classes, and the results are again compared with those of ProteInfer. To interpret the predictions and evaluate the importance of the features, the SHAP (SHapley Additive exPlanations) framework was applied to the predictions generated by the XGBoost model.
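The three-level structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which vector representation is used, so normalised k-mer frequencies are assumed here, and the names `kmer_vector`, `classify`, `is_enzyme`, `is_hydrolase` and `activity_of` are hypothetical. The three callables stand in for trained models (for example the TensorFlow networks or the XGBoost model), which are not reproduced.

```python
from collections import Counter
from itertools import product
from typing import Callable, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_vector(sequence: str, k: int = 2) -> List[float]:
    """Encode a protein sequence as normalised k-mer frequencies.

    Assumption: the abstract only says "vector representation of
    sequences"; k-mer frequencies are one simple choice among many.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Ignore k-mers containing non-standard residues; avoid division by zero.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


# Hypothetical model interfaces: each level receives the feature vector.
BinaryLevel = Callable[[Sequence[float]], bool]
LabelLevel = Callable[[Sequence[float]], str]


def classify(
    sequence: str,
    is_enzyme: BinaryLevel,
    is_hydrolase: BinaryLevel,
    activity_of: LabelLevel,
    k: int = 2,
) -> str:
    """Run the three levels in order, stopping at the first negative answer."""
    features = kmer_vector(sequence, k)
    if not is_enzyme(features):        # level 1: enzyme or not
        return "non-enzyme"
    if not is_hydrolase(features):     # level 2: hydrolase or not
        return "enzyme, not a hydrolase"
    return activity_of(features)       # level 3: enzymatic activity label
```

In use, each callable would wrap the `predict` method of a trained model. A cascade of this kind lets the later, more specific levels be trained only on the sequences that pass the earlier ones, which is one way to cope with classes that have few examples.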
