Regularizing transformers with deep probabilistic layers

Aurora Cobo Aguilera, Pablo M. Olmos, Antonio Artés-Rodríguez, Fernando Pérez-Cruz: Regularizing transformers with deep probabilistic layers. In: Neural Networks, 2023, ISSN: 0893-6080.

Abstract

Language models (LM) have grown non-stop in the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been deeply studied in those structures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer. We study its advantages with respect to the depth at which it is placed and prove its effectiveness in several scenarios. Experimental results demonstrate that the inclusion of deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R can bring more versatile models, able to generalize better and achieve an improved imputation score in tasks such as SST-2 and TREC, or even impute missing/noisy words with richer text.
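
To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of a GMVAE-style bottleneck used as a regularizer on the hidden states of one intermediate Transformer layer. It is not the authors' implementation: the layer choice (layer 6), dimensions (hidden=768, latent=64, 10 mixture components), the diagonal-Gaussian mixture prior, and the 0.1 weighting of the regularization term are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GMVAERegularizer(nn.Module):
    """Toy GMVAE bottleneck: reconstructs intermediate hidden states and
    returns a reconstruction + KL penalty to be added to the task loss."""
    def __init__(self, hidden=768, latent=64, components=10):
        super().__init__()
        self.encoder = nn.Linear(hidden, 2 * latent)          # -> (mu, logvar)
        self.mixture_logits = nn.Linear(hidden, components)   # soft component assignment q(c|h)
        self.component_means = nn.Parameter(torch.randn(components, latent))
        self.decoder = nn.Linear(latent, hidden)

    def forward(self, h):                                      # h: (batch, seq, hidden)
        mu, logvar = self.encoder(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.decoder(z)

        # Responsibilities over mixture components
        resp = F.softmax(self.mixture_logits(h), dim=-1)           # (batch, seq, K)

        # KL of q(z|h) against each component prior N(mean_k, I), weighted by responsibilities
        diff = mu.unsqueeze(-2) - self.component_means              # (batch, seq, K, latent)
        kl_per_comp = 0.5 * (torch.exp(logvar).unsqueeze(-2) + diff.pow(2)
                             - 1.0 - logvar.unsqueeze(-2)).sum(-1)  # (batch, seq, K)
        kl = (resp * kl_per_comp).sum(-1).mean()

        recon_loss = F.mse_loss(recon, h)
        return recon, recon_loss + kl

A fine-tuning step under these assumptions might then combine the task loss with the GMVAE penalty computed on one intermediate layer:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
reg = GMVAERegularizer()

batch = tok(["a fine movie", "a dull movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
out = model(**batch, labels=labels, output_hidden_states=True)
_, reg_loss = reg(out.hidden_states[6])      # regularize one intermediate layer (arbitrary choice)
loss = out.loss + 0.1 * reg_loss             # 0.1 is an illustrative weighting, not from the paper
loss.backward()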

BibTeX

@article{AGUILERA2023,
title = {Regularizing transformers with deep probabilistic layers},
author = {Aurora Cobo Aguilera and Pablo M Olmos and Antonio Art\'{e}s-Rodr\'{i}guez and Fernando P\'{e}rez-Cruz},
url = {https://www.sciencedirect.com/science/article/pii/S0893608023000448},
doi = {10.1016/j.neunet.2023.01.032},
issn = {0893-6080},
year  = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {Neural Networks},
abstract = {Language models (LM) have grown non-stop in the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been deeply studied in those structures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer. We study its advantages with respect to the depth at which it is placed and prove its effectiveness in several scenarios. Experimental results demonstrate that the inclusion of deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R can bring more versatile models, able to generalize better and achieve an improved imputation score in tasks such as SST-2 and TREC, or even impute missing/noisy words with richer text.},
keywords = {Deep learning, Missing data, Natural language processing, Regularization, Transformers, Variational auto-encoder},
pubstate = {published},
tppubtype = {article}
}