Doctoral Thesis Defense of Pablo García Moreno

Pablo García Moreno, a PhD student in the Signal Processing Group of the Universidad Carlos III de Madrid, will defend his doctoral thesis titled “Bayesian Nonparametrics for Crowdsourcing” on November 13, 2015.

  • Title: “Bayesian Nonparametrics for Crowdsourcing”
  • Advisors: Fernando Pérez Cruz and Antonio Artés Rodríguez.
  • Event Date: Friday, November 13, 2015, 11:30 am.
  • Location: Adoración de Miguel (1.2.C16); Agustín de Betancourt Building; Leganés Campus; Universidad Carlos III de Madrid.


Supervised machine learning relies on a labeled training set, whose size is closely related to the achievable performance of any learning algorithm.

Thanks to progress in ubiquitous computing, networks, and data acquisition and storage technologies, the availability of data is no longer a problem. Nowadays, we can easily gather massive unlabeled datasets in a short period of time. Traditionally, the labeling was performed by a small set of experts so as to control the quality and consistency of the annotations.

When dealing with large datasets, this approach is no longer feasible and the labeling process becomes the bottleneck.

Crowdsourcing has proven to be an effective and efficient tool for annotating large datasets. By distributing the labeling process across a potentially unlimited pool of annotators, it allows building large labeled datasets in a short period of time at a low cost. However, this comes at the expense of variable annotation quality, i.e., we need to deal with a large set of annotators of possibly unknown and variable expertise. In this new setting, we need methods that combine the annotations to produce reliable estimates of the ground truth.

In this thesis, we tackle the problem of aggregating the information coming from a set of different annotators in a multi-class classification setting. We assume that no information about the expertise of the annotators or the ground truth of the instances is available. In particular, we focus on the potential advantages of using Bayesian Nonparametric models to build interpretable solutions for crowdsourcing applications.
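To make the aggregation problem concrete, the following minimal sketch shows majority voting, a simple baseline that treats every annotator as equally reliable. It is not one of the models proposed in the thesis; the function name `majority_vote` and the toy labels are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(annotations):
    """Estimate one label per instance by majority voting.

    annotations: dict mapping an instance id to the list of labels
    given by different annotators (hypothetical toy format).
    Ties are resolved in favor of the label that was seen first.
    """
    estimates = {}
    for instance, labels in annotations.items():
        if not labels:
            continue  # no annotations: leave the instance without an estimate
        estimates[instance] = Counter(labels).most_common(1)[0][0]
    return estimates


# Hypothetical multi-class annotations from a small pool of annotators.
toy_annotations = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "bird", "dog", "dog"],
    "img_3": ["bird"],
}
consensus = majority_vote(toy_annotations)
```

Because every vote counts the same, this baseline cannot account for annotators of different expertise, which is the limitation that motivates models of annotator behavior.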

Bayesian Nonparametric models are Bayesian models that place a prior distribution over an infinite-dimensional parameter space. After seeing a finite training sample, the posterior uses only a finite number of parameters. Therefore, the complexity of the model depends on the training set and we can infer it from the data, avoiding the use of expensive model selection algorithms.

We focus our efforts on two specific problems. Firstly, we claim that considering the existence of clusters of annotators in this aggregation step can improve the overall performance of the system. This is especially important in the early stages of crowdsourcing implementations, when the number of annotations is low. At this stage there is not enough information to accurately estimate the bias introduced by each annotator separately, so we have to resort to models that consider the statistical links among them. In addition, finding these clusters is interesting in itself, as knowing the behavior of the pool of annotators allows implementing efficient active learning strategies. Building on this, we propose two new fully unsupervised models, based on a Chinese Restaurant Process prior and a hierarchical structure, that allow these groups to be inferred jointly with the ground truth and the properties of the annotators.
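As an illustration of the Chinese Restaurant Process prior mentioned above, the sketch below draws a random partition of annotators into clusters. Each new annotator joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter. This only samples from the prior; it is not the thesis' inference procedure, and the names `sample_crp` and `alpha` are assumptions chosen for the example.

```python
import random


def sample_crp(n_customers, alpha, seed=None):
    """Draw cluster assignments from a Chinese Restaurant Process.

    n_customers: number of items (here, annotators) to partition.
    alpha: concentration parameter (> 0); larger values favor more clusters.
    Returns (assignments, cluster_sizes).
    """
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(n_customers):
        # Existing clusters are weighted by their size, a new one by alpha.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[choice] += 1  # join an existing cluster
        assignments.append(choice)
    return assignments, cluster_sizes


# Partition 20 hypothetical annotators; the number of clusters is not fixed in advance.
assignments, cluster_sizes = sample_crp(n_customers=20, alpha=1.0, seed=0)
```

The number of clusters is an outcome of the draw rather than an input, which is the property that lets the model complexity adapt to the data.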

The second problem is modeling inconsistent annotators. The performance of an annotator can be inhomogeneous across the instance space due to several factors, such as their past experience with similar cases. To capture this behavior, we propose an algorithm that uses a Dirichlet Process Mixture model to divide the instance space into different areas, in each of which the annotators behave consistently. The algorithm allows us to infer the characteristics of each annotator in each of the identified areas and the ground truth of the training set, and to build a classifier for test examples. In addition, it offers an interpretable solution that allows us to better understand the decision process followed by the annotators and to implement schemes that improve the overall performance of the system.

We propose efficient approximate inference algorithms based on Markov Chain Monte Carlo sampling and variational inference, using auxiliary variables to deal with non-conjugacies when needed. Finally, we perform experiments on both synthetic and real databases to show the advantages of our models over state-of-the-art algorithms.