Dr Lucy Colwell

University Associate Professor

Making sense of data: how can we gain new insight and understanding from large bodies of data?

Across the natural sciences, ongoing technological advances enable ever-increasing numbers of variables to be measured in a single experiment. It is crucial that we develop analytical techniques to extract useful information from these datasets, which are typically large and highly under-sampled, placing them outside the realm of standard statistical analysis. My research combines theory and computation to elicit structural information about relationships between variables from data.

For a given large dataset we ask about the geometry of the data: are points constrained to lie on particular manifolds or subspaces? What metric best describes the distance between data points? Relationships between variables constrain data, resulting in correlations; statistics of the observed data can be used to infer these constraints. Visualizing these structural constraints can help to interpret their meaning, allowing us to better understand the data.
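As a rough illustration of inferring a constraint from observed statistics, the sketch below compares points that lie on a plane in three dimensions with unconstrained points. The function names, the synthetic data and the choice of score (the determinant of the correlation matrix, which is 0 for an exact linear dependency and 1 for uncorrelated variables) are assumptions made for this example, not a description of the methods used in this research.

```python
import random

def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length points."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    return [
        [
            sum((p[a] - means[a]) * (p[b] - means[b]) for p in points) / (n - 1)
            for b in range(dim)
        ]
        for a in range(dim)
    ]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )

def linear_constraint_score(points):
    """Determinant of the correlation matrix for 3D points.

    Values near 0 indicate that the points lie close to a plane
    (a linear constraint); values near 1 indicate uncorrelated variables.
    """
    cov = covariance_matrix(points)
    return det3(cov) / (cov[0][0] * cov[1][1] * cov[2][2])

rng = random.Random(0)  # fixed seed so the example is reproducible

# Hypothetical data: z is fully determined by x and y, so points lie on a plane.
constrained = []
for _ in range(500):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    constrained.append((x, y, x + y))

# Hypothetical data: three independent variables, no constraint.
unconstrained = [
    (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)
]

constrained_score = linear_constraint_score(constrained)
unconstrained_score = linear_constraint_score(unconstrained)
```

Here `constrained_score` is close to zero while `unconstrained_score` is close to one. Real datasets are rarely this clean, and nonlinear constraints (manifolds rather than subspaces) would need other tools.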

An example dataset is the set of sequences corresponding to a particular protein. The amino acid sequence contains all the information necessary to specify both the 3D structure of the protein and the function it carries out. My work asks how this information is encoded in the sequence, and how we can exploit the large numbers of protein sequences now available to crack this code.

Protein structure and function are maintained by groups of sequence residues that mutate in a correlated fashion. Our statistical analysis of large sequence alignments exploits these correlations to make predictions of protein 3D structure and function. In this example, the dependency structure of variables is closely related to the folded protein conformation. For other questions, the physical interpretation is less straightforward, but no less important.
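To make the idea of correlated mutation concrete, the sketch below scores every pair of columns in a toy alignment by mutual information and ranks the pairs. Mutual information is used here only as a simple stand-in statistic; the alignment, the function names and the scoring choice are assumptions for illustration and are not presented as the analysis used in this research.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a, b) / (p(a) * p(b)) simplifies to c * n / (count_a * count_b).
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi

def rank_column_pairs(alignment):
    """Return all column pairs sorted from most to least co-varying."""
    length = len(alignment[0])
    scores = {
        (i, j): column_mutual_information(alignment, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy alignment: columns 1 and 3 always change together
# (K pairs with E, R pairs with D); column 0 is fully conserved.
toy_alignment = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKVE",
    "ARVD",
]

ranked_pairs = rank_column_pairs(toy_alignment)
top_pair, top_score = ranked_pairs[0]
```

In this toy case the top-ranked pair is columns 1 and 3, the only two positions that mutate together. On real alignments, raw pairwise scores like this are confounded by indirect correlations and by uneven sampling of sequences, so they should be read as a starting point rather than a structure prediction.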

Projects are theoretical or computational in nature and require an interest in working across disciplines, often with experimental collaborators. Research is driven by scientific questions, so new analysis tools and methods are constantly being invented, developed and adopted.

Publications

InterPro: The protein sequence classification resource in 2025

M Blum, A Andreeva, LC Florentino, SR Chuguransky, T Grego, E Hobbs, BL Pinto, A Orr, T Paysan-Lafosse, I Ponamareva, GA Salazar, N Bordin, P Bork, A Bridge, L Colwell, J Gough, DH Haft, I Letunic, F Llinares-López, A Marchler-Bauer, L Meng-Papaxanthos, H Mi, DA Natale, CA Orengo, AP Pandurangan, D Piovesan, C Rivoire, CJA Sigrist, N Thanki, F Thibaud-Nissen, PD Thomas, SCE Tosatto, CH Wu, A Bateman

Nucleic Acids Research

(2024)

D444

(doi: 10.1093/nar/gkae1082)

Investigation of protein family relationships with deep learning

I Ponamareva, A Andreeva, ML Bileschi, L Colwell, A Bateman

Bioinform Adv

(2024)

vbae132

(doi: 10.1093/bioadv/vbae132)

Machine learning designs new GCGR/GLP-1R dual agonists with enhanced biological potency

AM Puszkarska, B Taddese, J Revell, G Davies, J Field, DC Hornigold, A Buchanan, TJ Vaughan, LJ Colwell

Nature Chemistry

(2024)

1436

(doi: 10.1038/s41557-024-01532-x)

Predicting multiple conformations via sequence clustering and AlphaFold2

HK Wayment-Steele, A Ojoawo, R Otten, JM Apitz, W Pitsawong, M Hömberger, S Ovchinnikov, L Colwell, D Kern

Nature

(2023)

625

832

(doi: 10.1038/s41586-023-06832-9)

ProteInfer, deep neural networks for protein functional inference

T Sanderson, ML Bileschi, D Belanger, LJ Colwell

eLife

(2023)

e80942

(doi: 10.7554/elife.80942)

Hallucinating functional protein sequences

D Belanger, LJ Colwell

Nature Biotechnology

(2023)

1073

(doi: 10.1038/s41587-022-01634-2)

InterPro in 2022

T Paysan-Lafosse, M Blum, S Chuguransky, T Grego, BL Pinto, GA Salazar, ML Bileschi, P Bork, A Bridge, L Colwell, J Gough, DH Haft, I Letunić, A Marchler-Bauer, H Mi, DA Natale, CA Orengo, AP Pandurangan, C Rivoire, CJA Sigrist, I Sillitoe, N Thanki, PD Thomas, SCE Tosatto, CH Wu, A Bateman

Nucleic Acids Res

(2022)

D418

(doi: 10.1093/nar/gkac993)

Prediction of multiple conformational states by combining sequence clustering with AlphaFold2

HK Wayment-Steele, S Ovchinnikov, L Colwell, D Kern

(2022)

(doi: 10.1101/2022.10.17.512570)

Using deep learning to annotate the protein universe

ML Bileschi, D Belanger, DH Bryant, T Sanderson, B Carter, D Sculley, A Bateman, MA DePristo, LJ Colwell

Nat Biotechnol

(2022)

932

(doi: 10.1038/s41587-021-01179-w)

Minding the gaps: The importance of navigating holes in protein fitness landscapes

N Thomas, LJ Colwell

Cell Syst

(2021)

1019

(doi: 10.1016/j.cels.2021.10.004)
