Identifying critical protein-protein interactions with ML methods

Project summary

The cloud module proposed here focuses on applying machine learning (ML) methods to the analysis of large datasets generated from molecular dynamics (MD) simulations of biomolecules. ML has become a common set of tools in many areas of scientific research, but barriers to its implementation remain, due in part to a relative dearth of training materials. The proposed module is therefore especially timely. The dataset used in the module is derived from long-timescale MD simulations of the SARS-CoV-2 or SARS-CoV spike protein receptor binding domain (RBD) bound to ACE2, the human receptor on the cell surface. The ML approaches covered are logistic regression, random forest, and multilayer perceptron (a type of neural network). These methods will be used to help identify the key residues responsible for the higher binding affinity of SARS-CoV-2 relative to SARS-CoV. The module will guide scientists and researchers through the steps of analyzing a large amount of data with ML approaches and gleaning meaningful insights from the results. The aim is to lower the barrier for students, scientists, and researchers with a nascent interest in applying ML to problems in quantitative biology. The skills and concepts learned through the module will facilitate the further implementation of ML approaches in users' own research in a cloud environment. Users can also extend these approaches to the analysis of large datasets produced in other areas of research, including experimental ones. The design of the module is based on tutorials developed for a recent workshop whose participants spanned the full gamut of education levels and coding experience, illustrating its adaptability to the needs of all users.
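
As a rough illustration of how a classifier such as logistic regression can point to important features, the sketch below trains one on synthetic data and ranks the features by the magnitude of their learned weights. Everything in it is an assumption made for illustration and is not taken from the module: the data are randomly generated rather than drawn from MD simulations, the idea of labeling each frame as SARS-CoV-2 (1) or SARS-CoV (0) is only one plausible way to set up the problem, and the function names (`make_synthetic_frames`, `fit_logistic_regression`, `rank_features`) are hypothetical.

```python
import math
import random


def make_synthetic_frames(n_frames=200, n_features=5, informative=2, seed=0):
    """Build toy data standing in for per-frame MD features (NOT real data).

    Each row mimics a set of residue-residue distances for one frame.
    Label 1 stands for a SARS-CoV-2 frame, label 0 for a SARS-CoV frame.
    Only the feature at index `informative` differs between the classes.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_frames):
        label = i % 2  # balanced classes
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[informative] += 2.0 if label == 1 else -2.0
        X.append(row)
        y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, row):
    """Probability that a frame belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)


def fit_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights by full-batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = predict_proba(w, b, row) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_features(w):
    """Return feature indices sorted by decreasing absolute weight."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)


X, y = make_synthetic_frames()
weights, bias = fit_logistic_regression(X, y)
ranking = rank_features(weights)
accuracy = sum(
    (predict_proba(weights, bias, row) >= 0.5) == (label == 1)
    for row, label in zip(X, y)
) / len(y)
print("feature ranking (most to least important):", ranking)
print("training accuracy:", accuracy)
```

In this toy setup the one feature that was constructed to differ between the two classes should come out at the top of the ranking. With real simulation data the features would typically need to be standardized first, and the model should be evaluated on held-out frames rather than on the training set.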

Abstract