Crowdsourcing

Adaptive task assignment for crowdsourced classification (2013)

(Ho, Chien-Ju and Jabbari, Shahin and Vaughan, Jennifer Wortman)

  • Summary:

    • Looks at how to optimally assign binary classification crowdsourcing tasks to workers who arrive online, one at a time

    • Aims to infer true labels and true worker reliability levels (i.e. truth discovery)

    • With noisy workers, each task must be assigned to multiple workers to estimate the true label

    • Authors aim to minimise the number of workers assigned per task (since in real crowdsourcing markets, more workers means higher cost)

  • Model:

    • \(n\) tasks \(\{1,\ldots,n\}\)

    • The true label for task \(i\) is \(\ell_i \in \{-1,+1\}\)

    • \(m\) workers \(\{1,\ldots,m\}\)

    • Tasks are of \(T\) distinct types

    • Each worker \(j\) has a known capacity \(M_j\), and unknown skill level \(p_{i,j} \in [0,1]\) for each task \(i\)

    • Skill levels only depend on the task type: \(p_{i,j} = p_{i',j}\) whenever \(i\) and \(i'\) are of the same type

    • If assigned task \(i\), worker \(j\) produces a label \(\ell_{i,j}\) such that \(\ell_{i,j} = \ell_i\) w.p. \(p_{i,j}\), and \(\ell_{i,j} = -\ell_i\) w.p. \(1 - p_{i,j}\)

    • Algorithm assigns tasks to workers and produces an estimate \(\hat{\ell}_i\) for each task \(i\)

    • Assumptions and constraints:

      • Worker \(j\) cannot be assigned more than \(M_j\) tasks

      • The algorithm can observe worker output on each task before deciding whether to assign more tasks or move to the next worker

      • The algorithm may not return to previous workers

      • The algorithm has access to at least one gold standard task of each type (tasks for which the true label is already known)

  • Offline, full information algorithm:

    • Authors first look at the case where the skill levels \(p_{i,j}\) are already known

    • Result (paraphrased): write \(w_{i,j} = 2p_{i,j} - 1\), and let \(J_i\) be the set of workers assigned to task \(i\). Then the weighted majority estimate

      \[\hat{\ell}_i = \text{sign}\left(\sum_{j \in J_i}{w_{i,j}\ell_{i,j}}\right)\]

      satisfies

      \[\text{Pr}(\hat{\ell}_i \ne \ell_i) \le \exp\left(-\frac{1}{2}\sum_{j \in J_i}{q_{i,j}}\right)\]

      where \(q_{i,j} = (2p_{i,j} - 1)^2 = w_{i,j}^2\) is a measure of the informativeness of worker \(j\) on task \(i\) (also see Ghosh et al.)

    • Corollary: to guarantee probability of error at most \(\epsilon\), we only need to ensure \(\sum_{j \in J_i}{q_{i,j}} \ge 2\log{(1 / \epsilon)} =: C_\epsilon\) (see the sketch below)
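
    • A minimal sketch of the weighted majority rule and the corollary's stopping condition in Python (assuming numpy; all names are illustrative):

      ```python
      import numpy as np

      def weighted_majority(labels, skills):
          """Weighted majority estimate for one task: `labels` are the
          worker votes in {-1, +1}, `skills` the matching known p_{i,j}."""
          w = 2.0 * np.asarray(skills) - 1.0      # w_{i,j} = 2 p_{i,j} - 1
          return np.sign(w @ np.asarray(labels))  # sign of the weighted vote

      def workers_needed(skills, eps):
          """Smallest prefix of workers whose total informativeness
          (sum of q_{i,j}) reaches C_eps = 2 log(1/eps)."""
          q = (2.0 * np.asarray(skills) - 1.0) ** 2
          total = np.cumsum(q)
          C_eps = 2.0 * np.log(1.0 / eps)
          k = int(np.searchsorted(total, C_eps)) + 1
          return k if total[-1] >= C_eps else None  # None: bound unreachable
      ```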

Who moderates the moderators? (2011)

(Arpita Ghosh and Satyen Kale and Preston McAfee)

  • Summary:

    • Motivation: a lot of user-generated content online is spam or abuse. In large platforms we must rely on user ratings to moderate content (e.g. up/down votes), but users are not always trustworthy or reliable themselves.

    • Identifies a method for estimating ground truth labels via crowdsourced judgements. Only requirement: we know the identity of one user with reliability greater than 1/2.

    • Authors give probabilistic bounds on the fraction of errors the algorithm makes

    • Also look at robustness to adversarial users in a few settings

    • (From people at Yahoo research)

  • Model:

    • \(T\) items \(1,\ldots,T\). Each item has a true label (‘quality’) \(q_t \in \{1,-1\}\) (intuitively, \(q_t=1\) means item \(t\) is good, and \(q_t = -1\) means bad)

    • \(n\) users \(1,\ldots,n\). Each user has (unknown) accuracy \(\psi_i\) – the probability of rating an item correctly

    • \(u_{ti}\) is the rating given to item \(t\) by user \(i\):

      \[u_{ti} = \begin{cases} q_t, & \text{with probability } \psi_i \\ -q_t, & \text{with probability } 1 - \psi_i \end{cases} \]
    • \(U = [u_{ti}]\) is the \(T \times n\) matrix of ratings

    • We assume agent 1 has \(\psi_1 > \frac{1}{2}\) and write \(\gamma = \psi_1 - \frac{1}{2} > 0\).

    • The competence of \(i\) is \(\kappa_i = (2\psi_i - 1)^2\) (this is a measure of informativeness). \(\kappa = \sum_{i}{\kappa_i}\) is the total competence; \(\bar{\kappa} = \frac{1}{n}\kappa\) is average competence.

  • Algorithm:

    • Authors show that the vector of true labels \(q\) is the top eigenvector (eigenvector with largest eigenvalue in magnitude) of the matrix \(\mathbb{E}[UU^\top]\)

    • In practice we only have the realisation \(U\), and without knowing the true accuracies \(\psi\) and labels \(q\) we cannot compute the expectation directly

    • Algorithm:

      • Compute top eigenvector \(v\) of \(UU^\top\) with \(\|v\| = 1\). Let \(s = \text{sign}(v)\).

      • At this point \(-v\) is also possible. Use \(u_1\) (the vector of ratings from trusted user 1) to disambiguate: set \(\sigma = \text{sign}(u_1 \cdot v)\) and output \(q' = \sigma s\) as the estimate for \(q\) (a sketch of this procedure follows this list)

      • (Note: calculating \(\sigma\) is essentially seeing which of \(v\) and \(-v\) agrees with \(u_1\) the most. I wonder what happens if there is a tie?)
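
      • A minimal sketch of this spectral step in Python (numpy only; the function name and toy data are illustrative):

        ```python
        import numpy as np

        def spectral_labels(U, trusted=0):
            """Estimate item labels from a T x n ratings matrix U in {+1, -1}."""
            eigvals, eigvecs = np.linalg.eigh(U @ U.T)  # ascending eigenvalues
            v = eigvecs[:, -1]                  # top eigenvector (UU^T is PSD)
            sigma = np.sign(U[:, trusted] @ v)  # resolve v vs -v (0 on a tie)
            return np.sign(sigma * v)           # q' = sigma * s

        # toy usage: 200 items, 30 users with accuracies in [0.5, 0.9]
        rng = np.random.default_rng(0)
        q = rng.choice([-1, 1], size=200)
        psi = rng.uniform(0.5, 0.9, size=30)
        correct = rng.random((200, 30)) < psi   # user i correct w.p. psi_i
        U = np.where(correct, q[:, None], -q[:, None])
        q_hat = spectral_labels(U, trusted=0)
        ```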

    • Analysis:

      • Uses Chernoff-like bounds for random matrices to judge how far away \(q'\) is likely to be from the real \(q\)

      • Main result: Let \(\eta \in (0, 1)\). There is a constant \(c\) such that if \(T > \frac{2}{\gamma^2}\log\frac{4}{\eta}\) and \(\frac{n}{\log{n}} > \frac{128}{c\bar{\kappa}^2}\), then with probability at least \(1 - \eta\):

        \[\frac{1}{T}|\{t \mid q'_t \ne q_t\}| \le \frac{8}{\bar{\kappa}}\sqrt{ \frac{\log{n}}{cnT} \log{\frac{4}{\eta}} }\]
      • Important points:

        • LHS is the fraction of errors made by the algorithm

        • RHS goes to 0 as \(n, T \to \infty\)

        • The lower bounds on \(T\) and \(\frac{n}{\log{n}}\) improve with \(\gamma\) (the excess competence of user 1 above \(\frac{1}{2}\)) and \(\bar{\kappa}\) (the average informativeness) respectively

      • Similar analysis can be carried out when users do not rate all items, but provide a rating on each item \(t\) only with some probability \(p_i\)

  • Online version of the algorithm:

    • Rather than re-running the batch algorithm for every new item, the authors propose an online version

    • Take a sample of \(T\) items to get estimated labels \(q'\)

    • Compute estimated accuracy of each user by \(\psi'_i = \frac{1}{2}\left(\frac{1}{T}(q' \cdot u_i) + 1\right)\). This is just the fraction of agreement between \(u_i\) and \(q'\), rescaled to lie in \([0, 1]\)

    • Compute \(\psi''_i\) by clipping \(\psi'_i\) to \([\alpha,1-\alpha]\) for some \(\alpha = O(1 / \sqrt{T})\) (this keeps the log-odds weights in the next step finite when \(\psi'_i\) hits 0 or 1)

    • Compute user weights \(w_i = \frac{1}{2}\log\frac{\psi''_i}{1 - \psi''_i}\). These are the log-odds weights known to be optimal when the true accuracy levels are known (e.g. Nitzan and Paroush 1982)

    • For a new item \(T+1\), estimate its true label \(q_{T+1}\) by weighted majority voting (sketched below).

    • Authors show that the probability of a mistake on item \(T+1\) is small provided \(\alpha\) is small enough, and drops exponentially with increasing \(n\)
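
    • A minimal sketch of this online step in Python (assuming numpy; `q_hat` comes from the batch spectral algorithm above, names are illustrative):

      ```python
      import numpy as np

      def log_odds_weights(U, q_hat, alpha):
          """Per-user weights from estimated labels q_hat; U is the T x n
          ratings matrix and alpha = O(1/sqrt(T)) is the clipping level."""
          T = U.shape[0]
          psi = 0.5 * ((q_hat @ U) / T + 1.0)     # psi'_i: agreement with q_hat
          psi = np.clip(psi, alpha, 1.0 - alpha)  # psi''_i: keep log-odds finite
          return 0.5 * np.log(psi / (1.0 - psi))  # Nitzan-Paroush weights

      def label_new_item(ratings, w):
          """Weighted majority vote on a new item's ratings (in {+1, -1})."""
          return np.sign(w @ ratings)
      ```

      (A user whose clipped accuracy \(\psi''_i\) falls below \(1/2\) gets a negative weight, so the vote effectively inverts their judgements; cf. the manipulation notes below.)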

  • Manipulation:

    • Authors consider 3 kinds of adversarial agents.

    • Stochastic manipulation: users rate items according to the probabilistic model, but choose \(\psi_i\) in an attempt to reduce accuracy. Since the performance results are parametrised by \(\bar{\kappa}\), the most damage a user can do is have \(\psi_i = 1 / 2\), i.e. decide on items randomly. A user with \(\psi_i < 1/2\) has \(\kappa_i > 0\), so still provides useful information (the algorithm can invert their judgements)

    • Adversarially choosing \(u_j\): a set of \(b\) ‘bad’ users do not produce ratings according to the probabilistic model, but choose them arbitrarily to degrade performance. Authors give error bounds in this case.

    • The final attack is the same as above, but the adversarial users are only interested in degrading performance on a specific item \(t\).

Aggregating Crowdsourced Binary Ratings (2013)

(Dalvi, Nilesh and Dasgupta, Anirban and Kumar, Ravi and Rastogi, Vibhor)

  • Summary:

    • Binary aggregation problem: users provide binary labels for a number of items; we want to aggregate to find the true labels

    • Novel aspect: the user-item graph (which describes when a user provides a label for an item) can be arbitrary

    • Uses spectral methods similar to Who moderates the moderators?

    • Obtain probabilistic bounds on error

  • Model:

    • \(m\) items and \(n\) users

    • \(q_i \in \{1,-1\}\) is the true label for item \(i\)

    • \(p_j\) is accuracy for user \(j\). The reliability of \(j\) is \(w_j = 2p_j - 1 \in [-1, 1]\)

    • A binary \(m \times n\) matrix \(G\) (which one could alternatively view as a bipartite graph) describes the answering pattern of users: \(G_{ij} = 1\) if user \(j\) provides a label for item \(i\), and \(G_{ij} = 0\) otherwise

    • \(U\) is the response matrix (simulated in the sketch after this list):

      \[U_{ij} = \begin{cases} q_i,& G_{ij} = 1, \text{w.p. } p_j \\ -q_i,& G_{ij} = 1, \text{w.p. } 1 - p_j \\ 0,& G_{ij} = 0 \end{cases}\]
    • Aim is to produce, given a realisation of \(U\), estimates \(\hat{w}\) and \(\hat{q}\) of \(w\) and \(q\) respectively

    • The error on \(\hat{w}\) is the expected value of \(\|\hat{w} - w\|^2\) (and similarly for \(\hat{q}\))
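
    • A quick sketch of this response model in Python (assuming numpy; names are illustrative):

      ```python
      import numpy as np

      def sample_responses(G, q, p, rng):
          """Draw a response matrix U: where G_ij = 1, user j reports q_i
          with probability p_j and -q_i otherwise; where G_ij = 0, U_ij = 0."""
          m, n = G.shape
          correct = rng.random((m, n)) < p[None, :]       # Bernoulli(p_j) per entry
          U = np.where(correct, q[:, None], -q[:, None])  # +/- q_i
          return U * G                                    # mask unanswered pairs
      ```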

  • Algorithms:

    • (Note: I have not taken the time to go through these methods in great detail)

    • Similar idea to the spectral methods of Ghosh et al. (above), but here one also needs to worry about the matrix \(G\)

    • Authors provide probabilistic bounds on errors (for \(\hat{w}\) and \(\hat{q}\)) for \(G\) with sufficiently large eigengap. Uses matrix version of McDiarmid’s inequality amongst other things

    • Authors use a heuristic (which apparently works well in practice) when the eigengap is small

Spectral methods meet EM: A provably optimal algorithm for crowdsourcing (2014)

(Zhang, Yuchen and Chen, Xi and Zhou, Dengyong and Jordan, Michael I)

  • (Note: the technical parts of the paper are a bit beyond my comprehension for now, but I roughly understand the idea at a high level…)

  • Summary:

    • Crowdsourcing classification problem, with any finite number of possible labels (not just binary classification)

    • Two step process:

      1. Use the spectral method to estimate confusion matrices of workers

        • The confusion matrix for worker \(i\) gives, for each pair \((l, j)\), the probability that the worker reports label \(l\) when \(j\) is the true label

        • Not really sure what the spectral method is, but I gather it’s a general term for methods using the eigendecomposition (or singular value decomposition?) of relevant matrices or tensors to estimate the parameters of a statistical model. I think it is also related to the method of moments. Perhaps see Tensor Decompositions for Learning Latent Variable Models, which is a highly cited paper.

      2. Use the EM algorithm, with the spectral estimates as initial values, to obtain maximum likelihood estimates (one EM iteration is sketched below)
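
    • A minimal sketch of one EM iteration under the Dawid–Skene model the paper builds on (assuming numpy, every worker labels every item, and labels coded as \(0,\ldots,K-1\); not the paper's exact procedure):

      ```python
      import numpy as np

      def em_step(labels, confusion, prior):
          """One EM iteration for the Dawid-Skene model.
          labels:    (n_items, n_workers) reported labels in {0, ..., K-1}
          confusion: (n_workers, K, K); confusion[w, l, j] = P(report l | true j)
          prior:     (K,) class prior
          """
          n_items, n_workers = labels.shape
          # E-step: posterior over each item's true label (log-space for stability)
          log_post = np.tile(np.log(prior), (n_items, 1))
          for w in range(n_workers):
              log_post += np.log(confusion[w, labels[:, w], :])
          post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
          post /= post.sum(axis=1, keepdims=True)
          # M-step: re-estimate each confusion matrix and the class prior
          K = prior.shape[0]
          new_confusion = np.empty_like(confusion)
          for w in range(n_workers):
              for l in range(K):
                  new_confusion[w, l] = post[labels[:, w] == l].sum(axis=0)
              new_confusion[w] /= new_confusion[w].sum(axis=0, keepdims=True)
          return new_confusion, post.mean(axis=0), post
      ```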

  • Analysis:

    • Probabilistic bound on distance between estimated and true confusion matrices from step 1

    • Result about accuracy of label estimates from step 2

    • Experimental analysis

ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking (2012)

(Demartini, Gianluca and Difallah, Djellel Eddine and Cudré-Mauroux, Philippe)

  • Summary:

    • System to identify entities from natural language text and create structured linked data

      • Entity linking: map entities in natural language text to URIs which uniquely identify the entity

      • Related to semantic web, RDF, etc

    • Crowdsourcing with (potentially unreliable) human workers helps to improve the linked data

    • Probabilistic framework to handle inconsistent results from human workers

  • Truth discovery is not the primary goal of the paper, but is used to get the ‘best of both worlds’ between automatic and manual entity extraction and linking