Crowdsourcing¶
Adaptive task assignment for crowdsourced classification (2013)¶
(Ho, Chien-Ju and Jabbari, Shahin and Vaughan, Jennifer Wortman)
Summary:
Looks at how to optimally assign workers to binary classification crowdsourcing tasks as they come online
Aims to infer true labels and true worker reliability levels (i.e. truth discovery)
With noisy workers, each task must be assigned to multiple workers to estimate the true label
Authors aim to minimise the number of workers assigned per task (since in real crowdsourcing settings, more workers means higher cost)
Model:
\(n\) tasks \(\{1,\ldots,n\}\)
The true label for task \(i\) is \(\ell_i \in \{-1,+1\}\)
\(m\) workers \(\{1,\ldots,m\}\)
Tasks are of \(T\) distinct types
Each worker \(j\) has a known capacity \(M_j\), and unknown skill level \(p_{i,j} \in [0,1]\) for each task \(i\)
Skill levels only depend on the task type: \(p_{i,j} = p_{i',j}\) whenever \(i\) and \(i'\) are of the same type
If assigned task \(i\), worker \(j\) produces a label \(\ell_{i,j}\) such that \(\ell_{i,j} = \ell_i\) w.p. \(p_{i,j}\), and \(\ell_{i,j} = -\ell_i\) w.p. \(1 - p_{i,j}\)
Algorithm assigns tasks to workers and produces an estimate \(\hat{\ell}_i\) for each task \(i\)
Assumptions and constraints:
Worker \(j\) cannot be assigned more than \(M_j\) tasks
The algorithm can observe worker output on each task before deciding whether to assign more tasks or move to the next worker
The algorithm may not return to previous workers
The algorithm has access to at least one gold standard task of each type (a task whose true label is already known)
Offline, full information algorithm:
Authors first look at the case where the skill levels \(p_{i,j}\) are already known
Result (paraphrased): write \(w_{i,j} = 2p_{i,j} - 1\), and let \(J_i\) be the set of workers assigned to task \(i\). Then the weighted majority estimate
\[\hat{\ell}_i = \text{sign}\left(\sum_{j \in J_i}{w_{i,j}\ell_{i,j}}\right)\]satisfies
\[\text{Pr}(\hat{\ell}_i \ne \ell_i) \le \exp\left(-\frac{1}{2}\sum_{j \in J_i}{q_{i,j}}\right)\]where \(q_{i,j} = (2p_{i,j} - 1)^2 = w_{i,j}^2\) is a measure of the informativeness of worker \(j\) on task \(i\) (also see Ghosh et al.)
Corollary: to guarantee probability of error at most \(\epsilon\), we only need to ensure \(\sum_{j \in J_i}{q_{i,j}} \ge 2\log{(1 / \epsilon)} =: C_\epsilon\)
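The weighted-majority rule and the corollary can be sketched numerically. This is a minimal illustration with made-up skill levels; the variable names are mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# One task with true label +1, five workers with known skills p_j
# (illustrative values, not from the paper).
true_label = 1
p = np.array([0.9, 0.7, 0.6, 0.55, 0.8])   # P(worker reports the true label)
w = 2 * p - 1                               # weighted-majority weights
q = w ** 2                                  # informativeness q_j = (2p_j - 1)^2

# Each worker reports the true label w.p. p_j, and its negation otherwise.
reports = np.where(rng.random(p.size) < p, true_label, -true_label)

# Weighted majority estimate: sign of the weighted sum of reports.
estimate = int(np.sign(w @ reports))

# Corollary check: total informativeness needed for error probability <= eps.
eps = 0.1
C_eps = 2 * np.log(1 / eps)
enough_informativeness = q.sum() >= C_eps
```

With these skill levels the total informativeness is \(1.21 < C_{0.1} \approx 4.6\), so more workers would be needed to certify a 10% error bound.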
Who moderates the moderators? (2011)¶
(Arpita Ghosh and Satyen Kale and Preston McAfee)
Summary:
Motivation: a lot of user-generated content online is spam or abuse. In large platforms we must rely on user ratings to moderate content (e.g. up/down votes), but users are not always trustworthy or reliable themselves.
Identifies a method for estimating ground truth labels via crowdsourced judgements. Only requirement: we know the identity of one user with reliability greater than 1/2.
Authors give probabilistic bounds on the fraction of errors the algorithm makes
Also look at robustness to adversarial users in a few settings
(From people at Yahoo research)
Model:
\(T\) items \(1,\ldots,T\). Each item has a true label (‘quality’) \(q_t \in \{1,-1\}\) (intuitively, \(q_t=1\) means item \(t\) is good, and \(q_t = -1\) means bad)
\(n\) users \(1,\ldots,n\). Each user has (unknown) accuracy \(\psi_i\) – the probability of rating an item correctly
\(u_{ti}\) is the rating given to item \(t\) by user \(i\):
\[u_{ti} = \begin{cases} q_t&, \text{with probability } \psi_i \\ -q_t&, \text{with probability } 1 - \psi_i \end{cases} \]\(U = [u_{ti}]\) is the \(T \times n\) matrix of ratings
We assume agent 1 has \(\psi_1 > \frac{1}{2}\) and write \(\gamma = \psi_1 - \frac{1}{2} > 0\).
The competence of \(i\) is \(\kappa_i = (2\psi_i - 1)^2\) (this is a measure of informativeness). \(\kappa = \sum_{i}{\kappa_i}\) is the total competence; \(\bar{\kappa} = \frac{1}{n}\kappa\) is average competence.
Algorithm:
Authors show that the vector of true labels \(q\) is the top eigenvector (eigenvector with largest eigenvalue in magnitude) of the matrix \(\mathbb{E}[UU^\top]\)
In practice we only have the realisation \(U\), and without knowing the true accuracy levels \(\psi\) and labels \(q\) we cannot compute the expectation directly
Algorithm:
Compute the top eigenvector \(v\) of \(UU^\top\) with \(\|v\| = 1\), and let \(s = \text{sign}(v)\) (the entrywise sign vector).
At this point \(-v\) is equally plausible. Use \(u_1\) (the vector of ratings from trusted user 1) to disambiguate: set \(\sigma = \text{sign}(u_1 \cdot v)\) and output \(q' = \sigma s\) as the estimate for \(q\)
(Note: calculating \(\sigma\) is essentially seeing which of \(v\) and \(-v\) agrees with \(u_1\) the most. I wonder what happens if there is a tie?)
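The eigenvector procedure is easy to try on simulated data from the model above. A sketch (my own, with illustrative parameters; the estimate is taken entrywise as a sign vector):

```python
import numpy as np

rng = np.random.default_rng(1)

# T items with true labels q, n users with accuracies psi (all made up).
T, n = 200, 30
q = rng.choice([-1, 1], size=T)
psi = rng.uniform(0.55, 0.9, size=n)        # everyone better than chance here
psi[0] = 0.9                                # user 1 is the trusted user

# u_{ti} = q_t w.p. psi_i, -q_t otherwise.
correct = rng.random((T, n)) < psi          # True where a user rates correctly
U = np.where(correct, q[:, None], -q[:, None])

# Top eigenvector of U U^T (largest eigenvalue in magnitude).
vals, vecs = np.linalg.eigh(U @ U.T)
v = vecs[:, np.argmax(np.abs(vals))]

# Resolve the +/-v ambiguity using the trusted user's ratings u_1.
sigma = np.sign(U[:, 0] @ v)
q_hat = np.sign(sigma * v)

error_fraction = np.mean(q_hat != q)
```

With reasonably competent users the recovered labels should agree with \(q\) on the vast majority of items.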
Analysis:
Uses Chernoff-like bounds for random matrices to judge how far away \(q'\) is likely to be from the real \(q\)
Main result: Let \(\eta \in (0, 1)\). There is a constant \(c\) such that if \(T > \frac{2}{\gamma^2}\log\frac{4}{\eta}\) and \(\frac{n}{\log{n}} > \frac{128}{c\bar{\kappa}^2}\), then with probability at least \(1 - \eta\):
\[\frac{1}{T}|\{t \mid q'_t \ne q_t\}| \le \frac{8}{\bar{\kappa}}\sqrt{ \frac{\log{n}}{cnT} \log{\frac{4}{\eta}} }\]Important points:
LHS is the fraction of errors made by the algorithm
RHS goes to 0 as \(n, T \to \infty\)
The lower bounds on \(T\) and \(\frac{n}{\log{n}}\) improve with \(\gamma\) (the excess competence of user 1 above \(\frac{1}{2}\)) and \(\bar{\kappa}\) (the average informativeness) respectively
Similar analysis can be carried out when users do not rate all items, but provide a rating on each item \(t\) only with some probability \(p_i\)
Online version of the algorithm:
To deal with repeatedly applying the algorithm, authors propose an online version
Take a sample of \(T\) items to get estimated labels \(q'\)
Compute estimated accuracy of each user by \(\psi'_i = \frac{1}{2}\left(\frac{1}{T}(q' \cdot u_i) + 1\right)\). This is just the fraction of agreement between \(u_i\) and \(q'\), rescaled to lie in \([0, 1]\)
Compute \(\psi''_i\) by clipping \(\psi'_i\) to \([\alpha,1-\alpha]\) for some \(\alpha = O(1 / \sqrt{T})\) (this helps with numerical stability, I think)
Compute user weights \(w_i = \frac{1}{2}\log\frac{\psi''_i}{1 - \psi''_i}\). These are the log-odds weights that are known to be optimal if the true accuracy levels are known (e.g. Nitzan and Paroush 1982)
For a new item \(q_{T+1}\), estimate true label by weighted majority voting.
Authors show that the probability of a mistake on \(q_{T+1}\) is small given \(\alpha\) is small enough, and drops exponentially with increasing \(n\)
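The online reweighting step can be sketched as below. This is my own illustration: I reuse the true labels as a stand-in for the spectral estimate \(q'\), and all parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated ratings under the model; q_prime stands in for the spectral
# estimate (here simply the true labels, for illustration).
T, n = 200, 15
q = rng.choice([-1, 1], size=T)
psi = rng.uniform(0.6, 0.95, size=n)
U = np.where(rng.random((T, n)) < psi, q[:, None], -q[:, None])
q_prime = q.copy()                          # stand-in for the spectral estimate

# Estimated accuracy: agreement between each user and q', rescaled to [0, 1].
psi_prime = 0.5 * ((U.T @ q_prime) / T + 1)

# Clip to [alpha, 1 - alpha] with alpha = O(1/sqrt(T)), which keeps the
# log-odds weights finite.
alpha = 1 / np.sqrt(T)
psi_clipped = np.clip(psi_prime, alpha, 1 - alpha)

# Log-odds weights (optimal when true accuracies are known).
w = 0.5 * np.log(psi_clipped / (1 - psi_clipped))

# Weighted majority vote on a fresh item with true label +1.
new_ratings = np.where(rng.random(n) < psi, 1, -1)
label_estimate = int(np.sign(w @ new_ratings))
```

The clipping step is what prevents a user whose empirical agreement with \(q'\) is exactly 0 or 1 from receiving an infinite weight.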
Manipulation:
Authors consider 3 kinds of adversarial agents.
Stochastic manipulation: users rate items according to the probabilistic model, but choose \(\psi_i\) in an attempt to reduce accuracy. Since the performance results are parametrised by \(\bar{\kappa}\), the most damage a user can do is to set \(\psi_i = 1/2\), i.e. rate items uniformly at random. A user with \(\psi_i < 1/2\) has \(\kappa_i > 0\), so still provides useful information (the algorithm can invert their judgements)
Adversarially choosing \(u_j\): a set of \(b\) ‘bad’ users do not produce ratings according to the probabilistic model, but choose them arbitrarily to degrade performance. Authors give error bounds in this case.
The final attack is the same as above, but the adversarial users are only interested in degrading performance on a specific item \(t\).
Aggregating Crowdsourced Binary Ratings (2013)¶
(Dalvi, Nilesh and Dasgupta, Anirban and Kumar, Ravi and Rastogi, Vibhor)
Summary:
Binary aggregation problem: users provide binary labels for a number of items; we want to aggregate to find the true labels
Novel aspect: the user-item graph (which describes when a user provides a label for an item) can be arbitrary
Uses spectral methods similar to Who moderates the moderators?
Obtain probabilistic bounds on error
Model:
\(m\) items and \(n\) users
\(q_i \in \{1,-1\}\) is the true label for item \(i\)
\(p_j\) is accuracy for user \(j\). The reliability of \(j\) is \(w_j = 2p_j - 1 \in [-1, 1]\)
A binary \(m \times n\) matrix \(G\) (which one could alternatively view as a bipartite graph) describes the answering pattern of users: \(G_{ij} = 1\) if user \(j\) provides a label for item \(i\), and \(G_{ij} = 0\) otherwise
\(U\) is the response matrix:
\[U_{ij} = \begin{cases} q_i,& G_{ij} = 1, \text{w.p. } p_j \\ -q_i,& G_{ij} = 1, \text{w.p. } 1 - p_j \\ 0,& G_{ij} = 0 \end{cases}\]Aim is to produce, given a realisation of \(U\), estimates \(\hat{w}\) and \(\hat{q}\) of \(w\) and \(q\) respectively
The error on \(\hat{w}\) is the expected value of \(\|\hat{w} - w\|^2\) (and similar for \(\hat{q}\))
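Generating a response matrix under this model is straightforward; a sketch with invented parameters (the paper's estimation algorithms themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small instance: m items, n users (illustrative sizes).
m, n = 6, 4
q = rng.choice([-1, 1], size=m)             # true item labels
p = rng.uniform(0.5, 1.0, size=n)           # user accuracies
w = 2 * p - 1                               # reliabilities in [-1, 1]

# Arbitrary answering pattern: each user labels each item w.p. 0.7.
G = (rng.random((m, n)) < 0.7).astype(int)

# U_{ij} = +/- q_i where G_{ij} = 1 (correct w.p. p_j), and 0 otherwise.
correct = rng.random((m, n)) < p
U = G * np.where(correct, q[:, None], -q[:, None])
```

The zeros of \(U\) coincide exactly with the zeros of \(G\), which is what makes the arbitrary answering pattern part of the model rather than a nuisance.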
Algorithms:
(Note: I have not taken the time to go through these methods in great detail)
Similar idea to the spectral methods of Ghosh et al. (?), but need to worry about the matrix \(G\) too
Authors provide probabilistic bounds on errors (for \(\hat{w}\) and \(\hat{q}\)) for \(G\) with sufficiently large eigengap. Uses matrix version of McDiarmid’s inequality amongst other things
Authors use a heuristic (which apparently works well in practice) when the eigengap is small
Spectral methods meet EM: A provably optimal algorithm for crowdsourcing (2014)¶
(Zhang, Yuchen and Chen, Xi and Zhou, Dengyong and Jordan, Michael I)
(Note: the technical parts of the paper are a bit beyond my comprehension for now, but I roughly understand the idea at a high level…)
Summary:
Crowdsourcing classification problem, with any finite number of possible labels (not just binary classification)
Two step process:
Use the spectral method to estimate confusion matrices of workers
The confusion matrix for worker \(i\) gives, for each pair of labels \((l, j)\), the probability that the worker reports label \(l\) when \(j\) is the true label
Not really sure what the spectral method is, but I gather it’s a general term for methods using the eigendecomposition (or singular value decomposition?) of relevant matrices or tensors to estimate the parameters of a statistical model. I think it is also related to the method of moments. Perhaps see Tensor Decompositions for Learning Latent Variable Models, which is a highly cited paper.
Uses EM algorithm, with results from the spectral method as initial estimates, to get maximum likelihood estimates
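To make the EM step concrete, here is a minimal Dawid–Skene-style EM sketch of my own (not the paper's implementation): the spectral initialisation is replaced by a simple vote-count initialisation for brevity, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated multi-class crowdsourcing data (k possible labels 0..k-1).
n_items, n_workers, k = 150, 5, 3
true_labels = rng.integers(k, size=n_items)

# Each worker answers correctly w.p. acc[w], otherwise uniformly at random.
acc = rng.uniform(0.6, 0.9, size=n_workers)
labels = np.empty((n_items, n_workers), dtype=int)
for w in range(n_workers):
    correct = rng.random(n_items) < acc[w]
    noise = rng.integers(k, size=n_items)
    labels[:, w] = np.where(correct, true_labels, noise)

# Initialise posteriors from per-item vote counts (standing in for the
# spectral initialisation).
post = np.zeros((n_items, k))
for l in range(k):
    post[:, l] = (labels == l).sum(axis=1)
post /= post.sum(axis=1, keepdims=True)

for _ in range(20):
    # M-step: class priors and worker confusion matrices from soft counts.
    prior = post.mean(axis=0)
    conf = np.zeros((n_workers, k, k))      # conf[w, j, l] = P(report l | true j)
    for w in range(n_workers):
        for l in range(k):
            conf[w, :, l] = post[labels[:, w] == l].sum(axis=0)
        conf[w] += 1e-6                     # smoothing to avoid log(0)
        conf[w] /= conf[w].sum(axis=1, keepdims=True)
    # E-step: posterior over each item's true label.
    log_post = np.log(prior) + sum(
        np.log(conf[w][:, labels[:, w]].T) for w in range(n_workers)
    )
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

estimates = post.argmax(axis=1)
accuracy = float(np.mean(estimates == true_labels))
```

The point of the paper is that seeding this EM loop with the spectral estimates (rather than vote counts) yields provable optimality guarantees.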
Analysis:
Probabilistic bound on distance between estimated and true confusion matrices from step 1
Result about accuracy of label estimates from step 2
Experimental analysis
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking (2012)¶
(Demartini, Gianluca and Difallah, Djellel Eddine and Cudré-Mauroux, Philippe)
Summary:
System to identify entities from natural language text and create structured linked data
Entity linking: map entities in natural language text to URIs which uniquely identify the entity
Related to semantic web, RDF, etc
Crowdsourcing with (potentially unreliable) human workers helps to improve the linked data
Probabilistic framework to handle inconsistent results from human workers
Truth discovery is not the primary goal of the paper, but is used to get the ‘best of both worlds’ between automatic and manual entity extraction and linking