Crowdsourcing¶
Adaptive task assignment for crowdsourced classification (2013)¶
(Ho, Chien-Ju and Jabbari, Shahin and Vaughan, Jennifer Wortman)
Summary:
Looks at how to optimally assign workers to binary classification crowdsourcing tasks as they come online
Aims to infer true labels and true worker reliability levels (i.e. truth discovery)
With noisy workers, each task must be assigned to multiple workers to estimate the true label
Authors aim to minimise the number of workers assigned per task (since in real crowdsourcing settings, more workers means higher cost)
Model:
\(n\) tasks \(\{1,\ldots,n\}\)
The true label for task \(i\) is \(\ell_i \in \{-1,+1\}\)
\(m\) workers \(\{1,\ldots,m\}\)
Tasks are of \(T\) distinct types
Each worker \(j\) has a known capacity \(M_j\), and unknown skill level \(p_{i,j} \in [0,1]\) for each task \(i\)
Skill levels only depend on the task type: \(p_{i,j} = p_{i',j}\) whenever \(i\) and \(i'\) are of the same type
If assigned task \(i\), worker \(j\) produces a label \(\ell_{i,j}\) such that \(\ell_{i,j} = \ell_i\) w.p. \(p_{i,j}\), and \(\ell_{i,j} = -\ell_i\) w.p. \(1 - p_{i,j}\)
Algorithm assigns tasks to workers and produces an estimate \(\hat{\ell}_i\) for each task \(i\)
Assumptions and constraints:
Worker \(j\) cannot be assigned more than \(M_j\) tasks
The algorithm can observe worker output on each task before deciding whether to assign more tasks or move to the next worker
The algorithm may not return to previous workers
The algorithm has access to at least one gold standard task of each type (a task whose true label is already known)
Offline, full information algorithm:
Authors first look at the case where the skill levels \(p_{i,j}\) are already known
Result (paraphrased): write \(w_{i,j} = 2p_{i,j} - 1\), and let \(J_i\) be the set of workers assigned to task \(i\). Then the weighted majority estimate
\[\hat{\ell}_i = \text{sign}\left(\sum_{j \in J_i}{w_{i,j}\ell_{i,j}}\right)\]satisfies
\[\text{Pr}(\hat{\ell}_i \ne \ell_i) \le \exp\left(-\frac{1}{2}\sum_{j \in J_i}{q_{i,j}}\right)\]where \(q_{i,j} = (2p_{i,j} - 1)^2 = w_{i,j}^2\) is a measure of the informativeness of worker \(j\) on task \(i\) (also see Ghosh et al.)
Corollary: to guarantee probability of error at most \(\epsilon\), we only need to ensure \(\sum_{j \in J_i}{q_{i,j}} \ge 2\log{(1 / \epsilon)} =: C_\epsilon\)
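The weighted-majority rule and the corollary can be sketched numerically. This is a minimal illustration with made-up skill levels; the variable names are mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# One task with true label +1, five workers with known skills p_j
# (illustrative values, not from the paper).
true_label = 1
p = np.array([0.9, 0.7, 0.6, 0.55, 0.8])   # P(worker reports the true label)
w = 2 * p - 1                               # weighted-majority weights
q = w ** 2                                  # informativeness q_j = (2p_j - 1)^2

# Each worker reports the true label w.p. p_j, and its negation otherwise.
reports = np.where(rng.random(p.size) < p, true_label, -true_label)

# Weighted majority estimate: sign of the weighted sum of reports.
estimate = int(np.sign(w @ reports))

# Corollary check: total informativeness needed for error probability <= eps.
eps = 0.1
C_eps = 2 * np.log(1 / eps)
enough_informativeness = q.sum() >= C_eps
```

With these skill levels the total informativeness is \(1.21 < C_{0.1} \approx 4.6\), so more workers would be needed to certify a 10% error bound.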
Who moderates the moderators? (2011)¶
(Arpita Ghosh and Satyen Kale and Preston McAfee)
Summary:
Motivation: a lot of user-generated content online is spam or abuse. In large platforms we must rely on user ratings to moderate content (e.g. up/down votes), but users are not always trustworthy or reliable themselves.
Identifies a method for estimating ground truth labels via crowdsourced judgements. Only requirement: we know the identity of one user with reliability greater than 1/2.
Authors give probabilistic bounds on the fraction of errors the algorithm makes
Also look at robustness to adversarial users in a few settings
(From people at Yahoo research)
Model:
\(T\) items \(1,\ldots,T\). Each item has a true label (‘quality’) \(q_t \in \{1,-1\}\) (intuitively, \(q_t=1\) means item \(t\) is good, and \(q_t = -1\) means bad)
\(n\) users \(1,\ldots,n\). Each user has (unknown) accuracy \(\psi_i\) – the probability of rating an item correctly
\(u_{ti}\) is the rating given to item \(t\) by user \(i\):
\[u_{ti} = \begin{cases} q_t&, \text{with probability } \psi_i \\ -q_t&, \text{with probability } 1 - \psi_i \end{cases} \]\(U = [u_{ti}]\) is the \(T \times n\) matrix of ratings
We assume agent 1 has \(\psi_1 > \frac{1}{2}\) and write \(\gamma = \psi_1 - \frac{1}{2} > 0\).
The competence of \(i\) is \(\kappa_i = (2\psi_i - 1)^2\) (this is a measure of informativeness). \(\kappa = \sum_{i}{\kappa_i}\) is the total competence; \(\bar{\kappa} = \frac{1}{n}\kappa\) is average competence.
Algorithm:
Authors show that the vector of true labels \(q\) is the top eigenvector (eigenvector with largest eigenvalue in magnitude) of the matrix \(\mathbb{E}[UU^\top]\)
In practice we only have the realisation \(U\), and without knowing the true accuracy levels \(\psi\) and labels \(q\) we cannot compute the expectation directly
Algorithm:
Compute the top eigenvector \(v\) of \(UU^\top\) with \(\|v\| = 1\), and let \(s = \text{sign}(v)\) (the entrywise sign vector).
At this point \(-v\) is equally plausible. Use \(u_1\) (the vector of ratings from trusted user 1) to disambiguate: set \(\sigma = \text{sign}(u_1 \cdot v)\) and output \(q' = \sigma s\) as the estimate for \(q\)
(Note: calculating \(\sigma\) is essentially seeing which of \(v\) and \(-v\) agrees with \(u_1\) the most. I wonder what happens if there is a tie?)
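The eigenvector procedure is easy to try on simulated data from the model above. A sketch (my own, with illustrative parameters; the estimate is taken entrywise as a sign vector):

```python
import numpy as np

rng = np.random.default_rng(1)

# T items with true labels q, n users with accuracies psi (all made up).
T, n = 200, 30
q = rng.choice([-1, 1], size=T)
psi = rng.uniform(0.55, 0.9, size=n)        # everyone better than chance here
psi[0] = 0.9                                # user 1 is the trusted user

# u_{ti} = q_t w.p. psi_i, -q_t otherwise.
correct = rng.random((T, n)) < psi          # True where a user rates correctly
U = np.where(correct, q[:, None], -q[:, None])

# Top eigenvector of U U^T (largest eigenvalue in magnitude).
vals, vecs = np.linalg.eigh(U @ U.T)
v = vecs[:, np.argmax(np.abs(vals))]

# Resolve the +/-v ambiguity using the trusted user's ratings u_1.
sigma = np.sign(U[:, 0] @ v)
q_hat = np.sign(sigma * v)

error_fraction = np.mean(q_hat != q)
```

With reasonably competent users the recovered labels should agree with \(q\) on the vast majority of items.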
Analysis:
Uses Chernoff-like bounds for random matrices to judge how far away \(q'\) is likely to be from the real \(q\)
Main result: Let \(\eta \in (0, 1)\). There is a constant \(c\) such that if \(T > \frac{2}{\gamma^2}\log\frac{4}{\eta}\) and \(\frac{n}{\log{n}} > \frac{128}{c\bar{\kappa}^2}\), then with probability at least \(1 - \eta\):
\[\frac{1}{T}|\{t \mid q'_t \ne q_t\}| \le \frac{8}{\bar{\kappa}}\sqrt{ \frac{\log{n}}{cnT} \log{\frac{4}{\eta}} }\]Important points:
LHS is the fraction of errors made by the algorithm
RHS goes to 0 as \(n, T \to \infty\)
The lower bounds on \(T\) and \(\frac{n}{\log{n}}\) improve with \(\gamma\) (the excess competence of user 1 above \(\frac{1}{2}\)) and \(\bar{\kappa}\) (the average informativeness) respectively
Similar analysis can be carried out when users do not rate all items, but provide a rating on each item \(t\) only with some probability \(p_i\)
Online version of the algorithm:
To deal with repeatedly applying the algorithm, authors propose an online version
Take a sample of \(T\) items to get estimated labels \(q'\)
Compute estimated accuracy of each user by \(\psi'_i = \frac{1}{2}\left(\frac{1}{T}(q' \cdot u_i) + 1\right)\). This is just the fraction of agreement between \(u_i\) and \(q'\), rescaled to lie in \([0, 1]\)
Compute \(\psi''_i\) by clipping \(\psi'_i\) to \([\alpha,1-\alpha]\) for some \(\alpha = O(1 / \sqrt{T})\) (this helps with numerical stability, I think)
Compute user weights \(w_i = \frac{1}{2}\log\frac{\psi''_i}{1 - \psi''_i}\). These are the log-odds weights that are known to be optimal if the true accuracy levels are known (e.g. Nitzan and Paroush 1982)
For a new item \(q_{T+1}\), estimate true label by weighted majority voting.
Authors show that the probability of a mistake on \(q_{T+1}\) is small given \(\alpha\) is small enough, and drops exponentially with increasing \(n\)
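The online reweighting step can be sketched as below. This is my own illustration: I reuse the true labels as a stand-in for the spectral estimate \(q'\), and all parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated ratings under the model; q_prime stands in for the spectral
# estimate (here simply the true labels, for illustration).
T, n = 200, 15
q = rng.choice([-1, 1], size=T)
psi = rng.uniform(0.6, 0.95, size=n)
U = np.where(rng.random((T, n)) < psi, q[:, None], -q[:, None])
q_prime = q.copy()                          # stand-in for the spectral estimate

# Estimated accuracy: agreement between each user and q', rescaled to [0, 1].
psi_prime = 0.5 * ((U.T @ q_prime) / T + 1)

# Clip to [alpha, 1 - alpha] with alpha = O(1/sqrt(T)), which keeps the
# log-odds weights finite.
alpha = 1 / np.sqrt(T)
psi_clipped = np.clip(psi_prime, alpha, 1 - alpha)

# Log-odds weights (optimal when true accuracies are known).
w = 0.5 * np.log(psi_clipped / (1 - psi_clipped))

# Weighted majority vote on a fresh item with true label +1.
new_ratings = np.where(rng.random(n) < psi, 1, -1)
label_estimate = int(np.sign(w @ new_ratings))
```

The clipping step is what prevents a user whose empirical agreement with \(q'\) is exactly 0 or 1 from receiving an infinite weight.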
Manipulation:
Authors consider 3 kinds of adversarial agents.
Stochastic manipulation: users rate items according to the probabilistic model, but choose \(\psi_i\) in an attempt to reduce accuracy. Since the performance results are parametrised by \(\bar{\kappa}\), the most damage a user can do is to set \(\psi_i = 1/2\), i.e. rate items uniformly at random. A user with \(\psi_i < 1/2\) has \(\kappa_i > 0\), so still provides useful information (the algorithm can invert their judgements)
Adversarially choosing \(u_j\): a set of \(b\) ‘bad’ users do not produce ratings according to the probabilistic model, but choose them arbitrarily to degrade performance. Authors give error bounds in this case.
The final attack is the same as above, but the adversarial users are only interested in degrading performance on a specific item \(t\).
Aggregating Crowdsourced Binary Ratings (2013)¶
(Dalvi, Nilesh and Dasgupta, Anirban and Kumar, Ravi and Rastogi, Vibhor)
Summary:
Binary aggregation problem: users provide binary labels for a number of items; we want to aggregate to find the true labels
Novel aspect: the user-item graph (which describes when a user provides a label for an item) can be arbitrary
Uses spectral methods similar to Who moderates the moderators?
Obtain probabilistic bounds on error
Model:
\(m\) items and \(n\) users
\(q_i \in \{1,-1\}\) is the true label for item \(i\)
\(p_j\) is accuracy for user \(j\). The reliability of \(j\) is \(w_j = 2p_j - 1 \in [-1, 1]\)
A binary \(m \times n\) matrix \(G\) (which one could alternatively view as a bipartite graph) describes the answering pattern of users: \(G_{ij} = 1\) if user \(j\) provides a label for item \(i\), and \(G_{ij} = 0\) otherwise
\(U\) is the response matrix:
\[U_{ij} = \begin{cases} q_i,& G_{ij} = 1, \text{w.p. } p_j \\ -q_i,& G_{ij} = 1, \text{w.p. } 1 - p_j \\ 0,& G_{ij} = 0 \end{cases}\]Aim is to produce, given a realisation of \(U\), estimates \(\hat{w}\) and \(\hat{q}\) of \(w\) and \(q\) respectively
The error on \(\hat{w}\) is the expected value of \(\|\hat{w} - w\|^2\) (and similar for \(\hat{q}\))
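Generating a response matrix under this model is straightforward; a sketch with invented parameters (the paper's estimation algorithms themselves are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small instance: m items, n users (illustrative sizes).
m, n = 6, 4
q = rng.choice([-1, 1], size=m)             # true item labels
p = rng.uniform(0.5, 1.0, size=n)           # user accuracies
w = 2 * p - 1                               # reliabilities in [-1, 1]

# Arbitrary answering pattern: each user labels each item w.p. 0.7.
G = (rng.random((m, n)) < 0.7).astype(int)

# U_{ij} = +/- q_i where G_{ij} = 1 (correct w.p. p_j), and 0 otherwise.
correct = rng.random((m, n)) < p
U = G * np.where(correct, q[:, None], -q[:, None])
```

The zeros of \(U\) coincide exactly with the zeros of \(G\), which is what makes the arbitrary answering pattern part of the model rather than a nuisance.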
Algorithms:
(Note: I have not taken the time to go through these methods in great detail)
Similar idea to the spectral methods of Ghosh et al. (?), but need to worry about the matrix \(G\) too
Authors provide probabilistic bounds on errors (for \(\hat{w}\) and \(\hat{q}\)) for \(G\) with sufficiently large eigengap. Uses matrix version of McDiarmid’s inequality amongst other things
Authors use a heuristic (which apparently works well in practice) when the eigengap is small
Spectral methods meet EM: A provably optimal algorithm for crowdsourcing (2014)¶
(Zhang, Yuchen and Chen, Xi and Zhou, Dengyong and Jordan, Michael I)
(Note: the technical parts of the paper are a bit beyond my comprehension for now, but I roughly understand the idea at a high level…)
Summary:
Crowdsourcing classification problem, with any finite number of possible labels (not just binary classification)
Two step process:
Use the spectral method to estimate confusion matrices of workers
The confusion matrix for worker \(i\) gives, for each pair of labels \((l, j)\), the probability that the worker reports label \(l\) when \(j\) is the true label
Not really sure what the spectral method is, but I gather it’s a general term for methods using the eigendecomposition (or singular value decomposition?) of relevant matrices or tensors to estimate the parameters of a statistical model. I think it is also related to the method of moments. Perhaps see Tensor Decompositions for Learning Latent Variable Models, which is a highly cited paper.
Uses EM algorithm, with results from the spectral method as initial estimates, to get maximum likelihood estimates
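To make the EM step concrete, here is a minimal Dawid–Skene-style EM sketch of my own (not the paper's implementation): the spectral initialisation is replaced by a simple vote-count initialisation for brevity, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated multi-class crowdsourcing data (k possible labels 0..k-1).
n_items, n_workers, k = 150, 5, 3
true_labels = rng.integers(k, size=n_items)

# Each worker answers correctly w.p. acc[w], otherwise uniformly at random.
acc = rng.uniform(0.6, 0.9, size=n_workers)
labels = np.empty((n_items, n_workers), dtype=int)
for w in range(n_workers):
    correct = rng.random(n_items) < acc[w]
    noise = rng.integers(k, size=n_items)
    labels[:, w] = np.where(correct, true_labels, noise)

# Initialise posteriors from per-item vote counts (standing in for the
# spectral initialisation).
post = np.zeros((n_items, k))
for l in range(k):
    post[:, l] = (labels == l).sum(axis=1)
post /= post.sum(axis=1, keepdims=True)

for _ in range(20):
    # M-step: class priors and worker confusion matrices from soft counts.
    prior = post.mean(axis=0)
    conf = np.zeros((n_workers, k, k))      # conf[w, j, l] = P(report l | true j)
    for w in range(n_workers):
        for l in range(k):
            conf[w, :, l] = post[labels[:, w] == l].sum(axis=0)
        conf[w] += 1e-6                     # smoothing to avoid log(0)
        conf[w] /= conf[w].sum(axis=1, keepdims=True)
    # E-step: posterior over each item's true label.
    log_post = np.log(prior) + sum(
        np.log(conf[w][:, labels[:, w]].T) for w in range(n_workers)
    )
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

estimates = post.argmax(axis=1)
accuracy = float(np.mean(estimates == true_labels))
```

The point of the paper is that seeding this EM loop with the spectral estimates (rather than vote counts) yields provable optimality guarantees.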
Analysis:
Probabilistic bound on distance between estimated and true confusion matrices from step 1
Result about accuracy of label estimates from step 2
Experimental analysis
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking (2012)¶
(Demartini, Gianluca and Difallah, Djellel Eddine and Cudré-Mauroux, Philippe)
Summary:
System to identify entities from natural language text and create structured linked data
Entity linking: map entities in natural language text to URIs which uniquely identify the entity
Related to semantic web, RDF, etc
Crowdsourcing with (potentially unreliable) human workers helps to improve the linked data
Probabilistic framework to handle inconsistent results from human workers
Truth discovery is not the primary goal of the paper, but is used to get the ‘best of both worlds’ between automatic and manual entity extraction and linking