Doctoral Thesis Defense of María Martínez García

Title: Machine Learning in Personalized Medicine and Genomics.

Author: María Martínez García.

Abstract:
Personalized medicine, also known as precision medicine, aims to use an individual’s genetic profile to guide decisions on disease prevention, diagnosis, and treatment. Unlike traditional medicine’s one-size-fits-all approach, personalized medicine accounts for individual variations that can influence disease risk, severity, treatment response, and susceptibility to side effects. The rise of high-throughput sequencing technologies has greatly advanced the study of the genome and its clinical applications, increasing sequencing capacity while reducing time and cost. These technologies can extract thousands of features from a single sample, even at the cellular level, resulting in high-dimensional, complex datasets. Traditional clinical study methods, such as univariate hypothesis testing, are inadequate for analyzing such datasets. Consequently, significant efforts have been directed toward efficiently extracting knowledge from these vast datasets.

Machine Learning (ML) techniques can automatically extract valuable insights from raw data to solve specific tasks, such as predicting disease onset, making them highly useful for analyzing complex genetic datasets. However, working with high-dimensional data is challenging due to the curse of dimensionality, particularly when the sample size is small relative to the number of features, as is often the case with genetic data. This issue is well known in ML: it can lead to overfitting and poor generalization, or even prevent model training altogether. Therefore, the initial step in analyzing these datasets typically involves obtaining a low-dimensional representation of the data through dimensionality reduction methods.
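To make the small-sample, high-dimensional regime concrete, here is a minimal Python sketch (illustrative only, not taken from the thesis): it builds a synthetic dataset with far more features than samples and applies PCA as a simple baseline dimensionality reduction. The dimensions and the choice of PCA are assumptions made for illustration.

    # Illustrative only: a "wide" synthetic dataset (samples << features),
    # as is typical of genomic data, reduced with PCA as a simple baseline.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_samples, n_features = 100, 20_000           # few samples, many features
    X = rng.normal(size=(n_samples, n_features))

    pca = PCA(n_components=10)                    # target low dimension
    Z = pca.fit_transform(X)                      # Z has shape (100, 10)
    print(Z.shape, pca.explained_variance_ratio_.sum())

Downstream models are then trained on Z rather than on X, which mitigates the curse of dimensionality at the cost of discarding some information.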

An important consideration when applying ML models to real-world problems is their ability to estimate the uncertainty of their predictions. Uncertainty estimation contributes to building trustworthy methods by providing transparency about the model’s confidence in its predictions, making outputs more interpretable, and aiding in outlier detection. Among probabilistic models, we focus on Variational Autoencoders (VAEs), which offer a framework for approximating complex distributions while learning a low-dimensional representation of the observed data. This latent representation can be interpretable and captures the underlying factors of variation in the observed data, providing a probabilistic and informative dimensionality reduction. VAEs rely on Amortized Variational Inference (VI), leveraging deep architectures to capture intricate nonlinear relationships and to define flexible stochastic functions that map observations to the parameters of the variational posterior over the latent variables. Because they aim to approximate the true data distribution, these models can, once trained, generate new synthetic data that resembles the observed data by sampling from the inferred distributions.
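To ground these ideas, the following PyTorch sketch (illustrative, not the thesis code) shows the standard VAE ingredients: an encoder that amortizes inference by mapping each observation to the mean and log-variance of its Gaussian variational posterior, the reparameterization trick, and a negative-ELBO loss. The dimensions, architecture, and Bernoulli likelihood are assumptions for illustration.

    # A minimal VAE sketch (illustrative). The encoder amortizes inference:
    # a single network maps any observation x to the parameters of its
    # Gaussian variational posterior q(z|x).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        def __init__(self, x_dim=784, h_dim=256, z_dim=10):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)
            self.logvar = nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            eps = torch.randn_like(mu)
            z = mu + torch.exp(0.5 * logvar) * eps   # reparameterization trick
            return self.dec(z), mu, logvar

    def neg_elbo(x_logits, x, mu, logvar):
        # Reconstruction term (Bernoulli likelihood) + KL(q(z|x) || N(0, I)).
        rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kl

Once trained, sampling z from the standard normal prior and decoding it, e.g. torch.sigmoid(model.dec(torch.randn(1, 10))), yields new synthetic data, mirroring the generative use described above.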

Although VAEs are recognized for their flexibility and versatility as generative models, they also present challenges that can impact their practical application, such as over-regularization and the holes problem. Several strategies have been proposed in the literature to overcome these limitations and improve the tightness of the variational approximation; this tightness depends on the complexity of the true posterior, the choice of the variational family, the architectures used to parameterize the distributions, and the sampling method employed to approximate the loss function.
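For reference, the variational gap mentioned here has a standard one-line decomposition, written below in the usual VAE notation (observed data x, latent variables z, generative parameters \theta, variational parameters \phi):

    \log p_\theta(x)
      = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\text{ELBO}}
      + \underbrace{\mathrm{KL}\!\left(q_\phi(z \mid x) \,\middle\|\, p_\theta(z \mid x)\right)}_{\text{variational gap} \;\ge\; 0}

Since the KL term is nonnegative, the ELBO lower-bounds \log p_\theta(x), and the bound is tight exactly when the variational posterior matches the true posterior; this is why the gap depends on the variational family and on how the distributions are parameterized and sampled.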

This thesis focuses on using VAEs as flexible dimensionality reduction methods, overcoming the challenges of analyzing high-dimensional data while improving on state-of-the-art approaches for dimensionality reduction in genomic data. In the first contribution of this thesis, we identify genetic markers that could help predict post-transplant complications following allogeneic stem-cell transplantation, such as graft-versus-host disease, using an L1-regularized Bayesian Logistic Regression model trained with Amortized VI. The result is a highly interpretable and flexible model capable of capturing complex relationships within the data. In the second contribution, we present an extended model that includes global and local latent variables, enabling joint dimensionality reduction and classification for high-dimensional omics-like data. The third and final contribution introduces a novel method for improving inference in VAEs with discrete latent spaces: we propose using error-correcting codes to introduce redundancy into the low-dimensional representations, thereby reducing inference errors and narrowing the variational gap.
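As a rough illustration of the ingredients in the first contribution (a sketch under stated assumptions, not the thesis implementation), the snippet below fits a Bayesian logistic regression with a Laplace, i.e. L1-inducing, prior on the weights by stochastic variational inference over a mean-field Gaussian posterior. For brevity it is a non-amortized variant, whereas the thesis uses Amortized VI; all names and hyperparameters are hypothetical.

    # Illustrative sketch: Bayesian logistic regression with a Laplace (L1)
    # prior, trained by maximizing a one-sample Monte Carlo estimate of the
    # ELBO over a mean-field Gaussian variational posterior on the weights.
    import torch
    import torch.nn.functional as F

    def fit_bayes_logreg(X, y, n_steps=2000, prior_scale=1.0, lr=1e-2):
        # X: (n, d) float tensor; y: (n,) float tensor of 0/1 labels.
        n, d = X.shape
        mu = torch.zeros(d, requires_grad=True)           # variational means
        rho = torch.full((d,), -3.0, requires_grad=True)  # softplus(rho) = std
        opt = torch.optim.Adam([mu, rho], lr=lr)
        prior = torch.distributions.Laplace(0.0, prior_scale)
        for _ in range(n_steps):
            std = F.softplus(rho)
            w = mu + std * torch.randn(d)                 # reparameterized sample
            log_lik = -F.binary_cross_entropy_with_logits(X @ w, y, reduction='sum')
            q = torch.distributions.Normal(mu, std)
            # ELBO = E_q[log p(y|X,w)] + E_q[log p(w)] - E_q[log q(w)]
            elbo = log_lik + prior.log_prob(w).sum() - q.log_prob(w).sum()
            opt.zero_grad()
            (-elbo).backward()
            opt.step()
        return mu.detach(), F.softplus(rho).detach()

Under this sketch, features whose posterior mean weight is large relative to its posterior standard deviation would be the candidate markers; the Laplace prior plays the role of the L1 regularizer, concentrating most weights near zero.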