Doctoral Thesis Defense of Aurora Cobo Aguilera

  • Title: “Probabilistic Models and Natural Language Processing in Health”
  • Advisor: Antonio Artés Rodríguez.
ABSTRACT
We are living in the Artificial Intelligence (AI) era. Wherever we look, we are surrounded by technology: computers, intelligent devices, and incredible machines able to perform increasingly complex activities from our daily routine, from predictive text to face-recognition unlocking. The scope of AI is unlimited, and its key ingredient is known as ‘data’. Data can be anything: a number, a category, a sentence, an image, a voice message… We produce data constantly, often without even being aware of it. AI takes all of this data and builds models to learn from it. This is called ‘Machine Learning’. Machine Learning is growing so fast that it is no longer possible to keep up with every paper published in the field. And as computing power advances, the models become bigger and more powerful: huge architectures with millions of parameters and impressive results. So, if we have such a strong tool, why not use it to address the concerns that affect people globally? Over 800,000 people take their own lives every year worldwide, and approximately 20 times more attempt suicide. Sleep disturbances, depression, stress… it is hard not to be familiar with at least one of these states. Stress in particular is considered the illness of the 21st century, and all of these symptoms are related to, and provoked by, the frenetic lifestyle we lead today. Mental health has become one of the cornerstones of today’s world, and AI can be used as a practical tool to collect data, extract information, and find solutions that help improve this landscape.
This thesis offers the chance to study some of these solutions in more detail. Returning to the concept of data, in medicine the Electronic Health Record (EHR) is the main database. The massive use of smartphones and e-health questionnaires has become the perfect complement to EHRs for improving the availability of medical information. In this thesis we take advantage of these resources to address problems such as diagnosis and the study of behavioral profiles in psychiatric patients. In this field, disorders are often misdiagnosed, require long periods of observation, or the diagnostic information is simply missing. We present a model based on Transformers, a Machine Learning architecture from Natural Language Processing (NLP), capable of imputing the missing information in an EHR as well as detecting potential candidates for delusional disorder, a pathology with very low prevalence that is difficult to diagnose. At the same time, we offer a direct application of a probabilistic model based on a non-parametric matrix factorization (SPFM) to extract information about behavioral patterns from e-health questionnaires and to find connections between disorders.
We want to emphasize the application of Machine Learning, and more precisely of probabilistic models and NLP, to mental health. To that end, we first performed a detailed study of Transformer models and propose a way of regularizing these architectures through the inclusion of a Gaussian Mixture Variational Auto-Encoder (GMVAE). We present NoRBERT, a model capable of imputing missing words in a text corpus with the advantage of generating information from a more general latent space, which allows it to impute words from more diverse topics than other baselines. We differentiate between Top and Deep NoRBERT depending on the layer of the Transformer that is regularized with the GMVAE. We show improved accuracy and BLEU scores for the reconstruction of the original noisy sentences by Deep NoRBERT on several datasets. Regarding Top NoRBERT, we present example sentences reconstructed by it and by the non-regularized baseline and compare the diversity of the outputs. After showing the advantages of these experiments, we also include an additional application to data augmentation, in which an external text classification task is solved using augmented samples generated by NoRBERT. At the beginning of this work, we also demonstrate the effectiveness of our method on basic language models, such as sequence-to-sequence architectures with and without attention. Throughout the chapter, we show the performance of our idea using different versions of the GMVAE, conditioning on contextual embeddings or contextual information, and using different language models.
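The general mechanism can be illustrated with a short, simplified sketch. The PyTorch code below is a toy under stated assumptions, not the thesis implementation: a tiny BERT-style masked language model whose top-layer hidden states are regularized by a GMVAE term (hidden-state reconstruction plus mixture-prior KL), which is the essence of the Top NoRBERT idea. All names (ToyNoRBERT, GMVAERegularizer) and hyperparameters are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMVAERegularizer(nn.Module):
    """Gaussian Mixture VAE over a layer's hidden states (illustrative)."""
    def __init__(self, d_model=128, d_latent=16, n_components=10):
        super().__init__()
        self.enc = nn.Linear(d_model, 2 * d_latent)   # q(z|h): mean and log-variance
        self.cls = nn.Linear(d_model, n_components)   # q(y|h): mixture responsibilities
        self.dec = nn.Linear(d_latent, d_model)        # p(h|z): reconstruct the hidden state
        self.mu_y = nn.Parameter(torch.randn(n_components, d_latent))  # p(z|y) = N(mu_y, I)

    def forward(self, h):                              # h: (batch, seq, d_model)
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        recon = F.mse_loss(self.dec(z), h)
        qy = F.softmax(self.cls(h), dim=-1)            # (batch, seq, K)
        diff = mu.unsqueeze(-2) - self.mu_y            # (batch, seq, K, d_latent)
        # KL(q(z|h) || N(mu_y, I)) for every component, weighted by q(y|h)
        kl_z = 0.5 * (logvar.exp().unsqueeze(-2) + diff ** 2 - 1.0 - logvar.unsqueeze(-2)).sum(-1)
        kl_z = (qy * kl_z).sum(-1).mean()
        # KL(q(y|h) || uniform prior over the K components)
        kl_y = (qy * (qy.clamp_min(1e-8).log() + math.log(qy.size(-1)))).sum(-1).mean()
        return recon + kl_z + kl_y

class ToyNoRBERT(nn.Module):
    """Tiny BERT-style masked language model whose top layer is GMVAE-regularized."""
    def __init__(self, vocab=1000, d_model=128, n_layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)
        self.gmvae = GMVAERegularizer(d_model)

    def forward(self, ids, labels, beta=0.1):
        h = self.encoder(self.emb(ids))                # top-layer hidden states
        mlm = F.cross_entropy(self.lm_head(h).transpose(1, 2), labels, ignore_index=-100)
        return mlm + beta * self.gmvae(h)              # masked-LM loss plus GMVAE regularizer
```

In the same spirit, a Deep variant would attach the regularizer to the hidden states of an intermediate encoder layer rather than the top one.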
After that, we present PsyBERT, another Transformer adapted to work with heterogeneous data from EHRs and capable of imputing missing diagnosis information. We highlight the advantages of these models in terms of faster diagnosis, correction of wrong diagnoses in the EHRs, and imputation of missing information. All of these mechanisms serve as additional tools that can help clinicians perform their work more effectively.

Then, we study e-health questionnaires through SPFM modeling. In this line of research, we present three different applications of the method to extract information from questionnaires answered by psychiatric patients via smartphones. In the first work, we relate behavioral patterns to patients with particular diagnoses. In the second, we show correlations between suicidal thoughts and sleep disturbances. The third compares the profiles of suicidal patients before and during the coronavirus lockdown, finding a decrease in suicidal risk during the lockdown.

Finally, we present another regularization technique, applied to image classification, based on a theoretical study of the minimization of the empirical risk as a way of selecting samples from the minibatch during training. We study different mechanisms for applying this idea and obtain improvements in accuracy and convergence rate in an image classification scenario with different models and databases.
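As a rough illustration of this last contribution, the hedged sketch below selects a subset of each minibatch by its per-sample loss before the gradient step. The thesis derives its selection mechanisms from a theoretical study of empirical risk minimization, so the simple top-k rule here is a stand-in for illustration, not the exact criterion used in that work; the function and model names are hypothetical.

```python
import torch
import torch.nn.functional as F

def selective_training_step(model, optimizer, images, targets, keep_ratio=0.5):
    """Backpropagate only through a selected subset of the minibatch."""
    optimizer.zero_grad()
    logits = model(images)
    per_sample_loss = F.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(keep_ratio * images.size(0)))
    _, idx = torch.topk(per_sample_loss, k)   # keep the k highest-loss samples (one possible rule)
    loss = per_sample_loss[idx].mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with a toy classifier on random data:
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
selective_training_step(model, opt, x, y)
```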