HDS_2026

Title: Predicting Health Outcomes at the Population Level

Course: The Oxford EPSRC CDT in Healthcare Data Science

Year: 2025 for 2026 intake.

Application Code: LCDS202526

Supervisors:

Charles Rahal (Associate Professor in Data Science and Informatics)
Melinda Mills MBE, FBA (Professor of Demography and Population Health)

Unit: Demographic Science Unit

Centre: Leverhulme Centre for Demographic Science

Project Description:

Inequalities in health are pervasive and persistent. But what if we could predict adverse health outcomes before they occur, and intervene to prevent them? The possibility of highly accurate and personalised prediction is emerging [1, 2, 3], not least through advances in large population-level registers that are increasingly linked to genomic data. Drawing on theoretical work from the determinants of health and life course literature [4], we aim to pioneer data representations that surpass conventional tabular formats, enabling fusion-based models to realise their full predictive potential. Specifically, we plan workflows that encode data as textual sequences of events, optimised for analysis by Large Language Models [5, 6], and as “images” of health trajectories, allowing computer vision methods to detect patterns unrecoverable from tabular form.

These representations will facilitate richer, more accurate predictions of health and social outcomes and open new avenues for targeted early intervention, as other emerging work has already shown to be possible [7]. Such representations are becoming increasingly feasible thanks to advances in the provision of high-performance computing. We aim to demonstrate how these methods can ethically and responsibly achieve what has only become feasible in an era of big data and advanced computational methods. By uncovering patterns in human existence that were previously invisible, we can better understand critical life events, their latent effects, and how these trajectories compare to the benchmarks traditionally used in life course research [8].

Our goal, as a research team of population data scientists, is to translate methodological innovation into tangible improvements in equity, ensuring that predictive tools serve not only to forecast health outcomes, but to transform them for the benefit of all. The prospective DPhil candidate, who will be fully funded for four years on the Healthcare Data Science CDT, will be a pivotal part of this. They will be expected to lead on the methodological development, and especially on proofs of concept for the algorithms we are developing before they are deployed at scale. They will also be expected to take part in the full life of the Leverhulme Centre for Demographic Science and all its associated initiatives (e.g., the Metrics and Models lab).

Methods and Training

The successful candidate will undertake the mandatory first-year sequence in the Healthcare Data Science Centre for Doctoral Training (HDS CDT). They should already have a strong familiarity with deep learning before enrolling, but will be encouraged to undertake additional training in this area to an advanced level during the course of the programme. Auditing modules from auxiliary departments (e.g., Statistics, Computer Science) in the second through fourth years of the DPhil will be optional, but encouraged.

Candidate Background

We are looking for a candidate with a background in a computational (or at least heavily mathematical) subject such as engineering, computer science, economics, mathematics, statistics, data science, or informatics. They should be a highly competent Python programmer and be familiar with GPU computing on high-performance infrastructure.

Application and Interview

Applicants should consider the eight references cited below and contact one of the prospective supervisors, Charles Rahal, before making an application. He will hold brief, informal 15-minute meetings with all interested and qualified applicants to discuss their proposals. As part of all applications to the HDS CDT (and relevant to this project), applicants should cite the LCDS202526 code in their statement of purpose. Formal online interviews will be held for admission into this position; the date of the interviews is to be confirmed, but is likely to be in late January or early February 2026.

References

[1] Kraljevic, Z., Bean, D., Shek, A., Bendayan, R., Hemingway, H., Yeung, J. A., ... & Dobson, R. J. (2024). Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. The Lancet Digital Health, 6(4), e281-e290.

[2] Hansen, N. U., Ergemen, Y. E., & Kallestrup-Lamb, M. (2025). Individual health indices via register-based health records and machine learning. European Actuarial Journal, 1-26.

[3] Moen, H., Raj, V., Vabalas, A., Perola, M., Kaski, S., Ganna, A., & Marttinen, P. (2024). Towards modeling evolving longitudinal health trajectories with a transformer-based deep learning model. arXiv preprint arXiv:2412.08873.

[4] Phelan, J. C., Link, B. G., & Tehranifar, P. (2010). Social conditions as fundamental causes of health inequalities: theory, evidence, and policy implications. Journal of Health and Social Behavior, 51(1_suppl), S28-S40.

[5] Yan, J., & Rahal, C. (2025). On the unknowable limits to prediction. Nature Computational Science, 1-3.

[6] AlSaad, R., Abd-Alrazaq, A., Boughorbel, S., Ahmed, A., Renault, M. A., Damseh, R., & Sheikh, J. (2024). Multimodal large language models in health care: applications, challenges, and future outlook. Journal of Medical Internet Research, 26, e59505.

[7] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., ... & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.

[8] Vabalas, A., Hartonen, T., Vartiainen, P., Jukarainen, S., Viippola, E., Rodosthenous, R. S., ... & Ganna, A. (2024). Deep learning-based prediction of one-year mortality in Finland is an accurate but unfair aging marker. Nature Aging, 4(7), 1014-1027.