ProCovar

ProCovar is an ERC-funded project that aims to investigate novel applications of amino acid residue covariation in proteins. These range from the prediction of protein structure (secondary, tertiary and quaternary) to the prediction of other biologically important features such as disorder and interactions with nucleic acids. Key outputs of this project are computational tools which we make freely available.

Prediction of inter-residue contacts (DeepCov and DeepMetaPSICOV)

The first applications of residue covariation analysis were aimed at predicting which residues are in contact in protein structures. Methods such as Direct-Coupling Analysis and our own PSICOV, though effective in many cases, are limited in that they require large numbers of diverse, homologous protein sequences to achieve satisfactory performance. We have recently developed methods that significantly extend the range of cases in which good contact predictions can be obtained. Our methods employ deep convolutional neural network models to learn patterns in residue covariation data across protein families. We find that, when properly trained, these models can significantly outperform more traditional methods. Our latest tool, DeepMetaPSICOV (DMP), was ranked highly in the recent CASP13 experiment.
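As a rough illustration of what a residue covariation signal is, the sketch below scores every pair of columns in a small multiple sequence alignment by their mutual information. This is deliberately the simplest possible covariation measure and is not the method used by Direct-Coupling Analysis, PSICOV, DeepCov or DeepMetaPSICOV. The function names and the toy alignment are hypothetical, gaps are treated as ordinary symbols, and no sequence weighting or background correction is applied.

```python
import math
from collections import Counter
from itertools import combinations


def column_mutual_information(msa):
    """Return {(i, j): mutual information in bits} for all column pairs i < j.

    `msa` is a list of equal-length aligned sequences (strings).
    """
    if not msa:
        raise ValueError("alignment is empty")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    n = len(msa)
    columns = [[seq[i] for seq in msa] for i in range(length)]
    scores = {}
    for i, j in combinations(range(length), 2):
        count_i = Counter(columns[i])                    # residue counts in column i
        count_j = Counter(columns[j])                    # residue counts in column j
        count_ij = Counter(zip(columns[i], columns[j]))  # joint residue-pair counts
        mi = 0.0
        for (a, b), n_ab in count_ij.items():
            p_ab = n_ab / n
            # p_ab / (p_a * p_b) written in terms of raw counts
            ratio = n_ab * n / (count_i[a] * count_j[b])
            mi += p_ab * math.log2(ratio)
        scores[(i, j)] = mi
    return scores


def top_pairs(scores, k):
    """Return the k column pairs with the highest covariation score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy alignment (invented for illustration): columns 1 and 2 change together.
toy_msa = ["AKLE", "AKLD", "ARIE", "ARID"]
toy_scores = column_mutual_information(toy_msa)
best = top_pairs(toy_scores, 1)[0]
print(best, toy_scores[best])
```

In the toy alignment, columns 1 and 2 (zero-based) always change together and so receive a score of 1 bit, while every other pair scores zero. On real protein families, raw scores of this kind are confounded by phylogeny and indirect couplings, and they become unreliable when few homologous sequences are available, which is the limitation described above.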

Large-scale de novo structure modelling (DMPfold)

The ultimate goal of predicting inter-residue contacts is to use that information to predict whole protein structures. Building on our experience in contact prediction, we developed new deep learning-based predictors of inter-residue distances, backbone torsion angles and hydrogen bonds. Using these predictions as constraints in an off-the-shelf method originally used for X-ray and NMR structure determination, we are able to accurately predict structures for a vast array of proteins. Interestingly, the method can be used without modification on transmembrane proteins and achieves good results. Predictions are accompanied by calibrated estimates of their likely correctness. Perhaps most crucially, the method is fast enough to be run on a genomic scale, meaning that we are able to extend the structural coverage of proteomes of great importance to biological research.

Protein design (Protein-VAE)

New developments in machine learning allow us to train methods that generate new examples of images, objects or other data, given large training sets of similar entities. Recent work in the lab has demonstrated that it is possible to use such technology to modify existing protein sequences in order to introduce novel functionality or features, such as metal binding sites, into proteins that do not have them. This opens the door to a wide variety of protein design tasks, and laboratory work on these designs is now underway.

Protein-protein interactions and modelling of protein multimers

Residue covariation signals have also been observed in protein interfaces. Ongoing work in the lab aims to predict the presence or absence of protein-protein interactions from residue covariation data. Work is also underway to develop methods to predict structures of homomultimeric complexes of proteins, using extensions to our successful DMPfold approach (described above). Eventually, we hope to extend these ideas to the prediction of heteromeric complexes as well.

Protein disorder and interactions with nucleic acids

Large fractions of eukaryotic proteomes are known or predicted to be disordered, and disorder is known to be associated with specific biological functions, such as DNA/RNA binding, transcriptional and translational regulation, and cell cycle regulation. Flexible regions of protein structures often contain clusters of covarying residues which appear to be under selection to maintain the molecule's ability to undergo specific conformational changes. Residue covariation analysis of such regions has revealed that signals corresponding to multiple, alternate conformational states can be detected. Using this information, we aim to develop tools to predict these alternative conformations, including, where possible, conformations in complex with DNA or RNA.

Publications and Preprints

  • DMPfold
    Joe G. Greener, Shaun M. Kandathil, David T. Jones (2019) Extending genome-scale de novo protein modelling coverage using iterative deep learning-based prediction of structural constraints. arXiv preprint. https://arxiv.org/abs/1811.12355
  • DeepMetaPSICOV
    Shaun M. Kandathil, Joe G. Greener, David T. Jones (2019) Prediction of inter-residue contacts with DeepMetaPSICOV in CASP13. bioRxiv preprint. https://doi.org/10.1101/586800
  • Protein-VAE
    Joe G. Greener, Lewis Moffat, David T. Jones (2018) Design of metalloproteins and novel protein folds using variational autoencoders. Scientific Reports, Volume 8, Article number 16189. https://www.nature.com/articles/s41598-018-34533-1
  • DeepCov
    David T. Jones, Shaun M. Kandathil (2018) High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics, Volume 34, Issue 19, Pages 3308–3315, https://doi.org/10.1093/bioinformatics/bty341

People

    Principal Investigator
  • Professor David Jones
    Postdoctoral Research Associates
  • Dr Joe Greener
  • Dr Shaun Kandathil
  • Dr Cen Wan
    PhD Students
  • Nikita Desai
  • Michael Jones
  • Lewis Moffat

Software

Software developed as part of our research can be found at the PSIPRED GitHub page at https://github.com/psipred/.