Truth Discovery

Notes on the mainstream TD literature, which is essentially data mining.

Problem Formulation

The following lists the different truth discovery formulations in the literature.

Truth Discovery with Multiple Conflicting Information Providers on the Web (2008)

Input:

  • Websites provide facts for objects

  • \(F(w)\) is the set of facts for website \(w\)

  • \(W(f)\) is the set of websites for fact \(f\)

  • Each fact \(f\) relates to object \(o(f)\)

  • Implication level \(\text{imp}(f_j \to f_k) \in [-1, 1]\)

Output:

  • Trustworthiness \(t(w) \in \mathbb{R}\) for each website \(w\)

  • Confidence \(s(f) \in \mathbb{R}\) for each fact \(f\)

Algorithm:

  • Pseudo-probabilistic: starts with a probabilistic model but incorporates various heuristics

  • Iterative procedure, alternating between updating website trustworthiness and fact confidence until convergence
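
The iterative skeleton can be sketched as below. This is a minimal sketch of the alternating loop only: it omits the paper's implication term \(\text{imp}(f_j \to f_k)\) and its dampening heuristics, and the initial trustworthiness value and iteration count are arbitrary choices.

```python
import math

def truthfinder_sketch(facts_by_site, sites_by_fact, t0=0.9, iters=5):
    """Alternate between website trustworthiness t(w) and fact
    confidence s(f). Sketch only: ignores implication between facts
    and the dampening factor used in the actual paper."""
    t = {w: t0 for w in facts_by_site}  # initial trustworthiness t(w)
    s = {}
    for _ in range(iters):
        # fact confidence: sigma(f) = -sum_w ln(1 - t(w)), mapped into (0, 1]
        for f, sites in sites_by_fact.items():
            sigma = -sum(math.log(max(1e-9, 1.0 - t[w])) for w in sites)
            s[f] = 1.0 - math.exp(-sigma)
        # trustworthiness: average confidence of the site's facts
        for w, fs in facts_by_site.items():
            t[w] = sum(s[f] for f in fs) / len(fs)
    return t, s
```

A fact backed by several trustworthy sites accumulates a larger \(\sigma\) and hence a higher confidence, which in turn raises the trustworthiness of the sites providing it.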

A Survey on Truth Discovery (2016)

(Li, Yaliang et al.)

Surveys many different TD methods; their formulation should match (or at least be compatible with) quite a few of them.

Input:

  • Sources provide values for objects

  • \(v^s_o\) is the value for object \(o\) provided by source \(s\)

  • Total input is \(\{v^s_o\}_{o \in \mathcal{O},\, s \in \mathcal{S}}\)

Output:

  • Source weight \(w_s\) for each source

  • Identified truth \(v^*_o\) for each object

  • Total output is \(\langle \{w_s\}_{s \in \mathcal{S}}, \{v^*_o\}_{o \in \mathcal{O}} \rangle\)
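
Most methods covered by this formulation share the same alternating skeleton: estimate truths from weighted source claims, then re-weight sources by agreement with those truths. A minimal sketch (weighted majority voting with accuracy-based weights; the specific update rules here are illustrative, not from the survey):

```python
from collections import defaultdict

def weighted_vote_td(claims, iters=10):
    """Generic TD skeleton for the survey's formulation.
    claims maps (source, object) -> claimed value v^s_o.
    Returns source weights {w_s} and identified truths {v*_o}."""
    sources = {s for s, _ in claims}
    objects = {o for _, o in claims}
    w = {s: 1.0 for s in sources}  # uniform initial weights
    truth = {}
    for _ in range(iters):
        # step 1: identify truths by weighted majority vote per object
        for o in objects:
            votes = defaultdict(float)
            for (s, o2), v in claims.items():
                if o2 == o:
                    votes[v] += w[s]
            truth[o] = max(votes, key=votes.get)
        # step 2: weight each source by the fraction of its claims
        # that match the current truth estimates
        for s in sources:
            mine = [(o, v) for (s2, o), v in claims.items() if s2 == s]
            correct = sum(truth[o] == v for o, v in mine)
            w[s] = correct / len(mine)
    return w, truth
```

The output matches the survey's \(\langle \{w_s\}, \{v^*_o\} \rangle\) form; individual methods differ mainly in how the two update steps are defined.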

Semi-supervised Truth Discovery (2011)

(Yin, Xiaoxin and Tan, Wenzhao)

Input:

  • Sources provide facts, where each fact relates to a subject

  • A subset of facts are ground truths, which are known to be true

  • Similarity function \(\text{sim}: \mathcal{F} \times \mathcal{F} \to [-1, 1]\)

    • The model allows facts on the same subject to support each other rather than always conflict, e.g. two ‘close’ numeric values for a subject whose true value is a large number

Output:

  • Confidence score \(c_i \in [-1, 1]\) for each fact \(f_i\)

  • Interestingly, source trustworthiness is not directly considered

Algorithm:

  • Form a weighted graph from the input:

    • Nodes are facts

    • (Symmetric) weight from \(f_i\) to \(f_j\) is

      \[w_{ij} = \begin{cases} \alpha & f_i \text{ and } f_j \text{ are provided by a common source} \\ \text{sim}(f_i, f_j) & f_i \text{ and } f_j \text{ are for the same subject} \\ 0 & \text{ otherwise} \end{cases}\]

      where \(\alpha > 0\) is some parameter.

  • Algorithm is optimisation based:

    • Assign a confidence score to each node (fact)

    • Constraint: ground truth facts must have score 1

    • Objective function is a sum over edges which involves the \(c_i, c_j, w_{ij}\)

  • Iterative procedure converges to the optimal solution
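
The iterative procedure behaves like label propagation on the fact graph. The sketch below is an illustration of that idea, not the paper's exact update rule: ground-truth facts are clamped to score 1 and each remaining fact moves toward the weighted average of its neighbours' scores, so negative similarity weights push conflicting facts toward negative confidence.

```python
def propagate_scores(nodes, w, ground, iters=50):
    """Propagation sketch for the semi-supervised setting.
    w maps frozenset({i, j}) -> symmetric edge weight w_ij in [-1, 1];
    ground is the set of facts constrained to score 1."""
    c = {n: (1.0 if n in ground else 0.0) for n in nodes}
    for _ in range(iters):
        new = {}
        for i in nodes:
            if i in ground:
                new[i] = 1.0  # constraint: ground truths stay at 1
                continue
            num = den = 0.0
            for j in nodes:
                if j == i:
                    continue
                wij = w.get(frozenset({i, j}), 0.0)
                num += wij * c[j]
                den += abs(wij)
            # weighted average of neighbours; unchanged if isolated
            new[i] = num / den if den else c[i]
        c = new
    return c
```

A fact similar to a ground truth ends with positive confidence, while a fact connected to it by a negative similarity edge is pushed below zero.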

On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach (2012)

(Wang, Dong and Kaplan, Lance and Le, Hieu and Abdelzaher, Tarek)

Input:

  • Participants make (binary) observations about variables

  • Represented as participant-claims matrix: row for each participant, column for each variable. We have \(SC_{ij}=1\) if participant \(i\) claims variable \(j\) is true, and \(SC_{ij}=0\) otherwise

Output:

  • True/False for each variable

  • Reliability score \(e_i \in \mathbb{R}\) for each participant \(i\)

Algorithm:

  • Iterative expectation maximisation algorithm
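
A sketch of that EM loop, treating each variable's truth \(z_j\) as the latent state. The parameterisation below (per-participant claim rates \(a_i, b_i\) and a shared prior \(d\)) is in the spirit of the paper but simplified; the variable names are mine.

```python
def em_social_sensing(SC, iters=30):
    """EM sketch for binary social sensing. SC is an M x N 0/1 list of
    lists (participants x variables). Latent z_j: variable j is true.
    a_i = P(i claims j | z_j = 1), b_i = P(i claims j | z_j = 0),
    d = prior P(z_j = 1)."""
    M, N = len(SC), len(SC[0])
    a, b, d = [0.7] * M, [0.3] * M, 0.5  # arbitrary initial guesses
    Z = [0.5] * N
    for _ in range(iters):
        # E-step: posterior probability that each variable is true
        for j in range(N):
            pt, pf = d, 1.0 - d
            for i in range(M):
                pt *= a[i] if SC[i][j] else (1.0 - a[i])
                pf *= b[i] if SC[i][j] else (1.0 - b[i])
            Z[j] = pt / (pt + pf)
        # M-step: re-estimate per-participant rates and the prior
        total = sum(Z)
        for i in range(M):
            claimed_true = sum(Z[j] for j in range(N) if SC[i][j])
            claimed_false = sum(1.0 - Z[j] for j in range(N) if SC[i][j])
            a[i] = claimed_true / (total + 1e-9)
            b[i] = claimed_false / (N - total + 1e-9)
        d = total / N
    labels = [z > 0.5 for z in Z]
    return labels, a, b
```

The binary labels answer the True/False output, and \(a_i\) (relative to \(b_i\)) plays the role of the participant reliability score \(e_i\).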

Truth Discovery and Copying Detection in a Dynamic World (2009)

(Dong, Xin Luna and Berti-Equille, Laure and Srivastava, Divesh)

Input:

  • Differs in a big way from other approaches: truth changes over time

  • Set of sources \(\mathcal{S}\) provide values for objects \(\mathcal{O}\) over time

  • \(\bar{U}(S, t)\) denotes the updates provided by source \(S\) at time \(t\)

    • I presume \(\bar{U}(S, t) \subseteq (\mathcal{O} \times \mathcal{V})\) or something like that, where \(\mathcal{V}\) is the domain of values for objects

Output:

  • True values for each object over time (i.e. recovers the whole history, not just final true values): a tuple \(\langle (t_1,v_1),\ldots,(t_l, v_l)\rangle\) for each object, where the \(t_i\) are the times at which the true value changed, and \(v_i\) are the true values

Algorithm (very complicated):

  • Hidden Markov Model for copying relationship state between each pair of sources

  • Bayesian probabilistic model for estimating object truths from source copying relationship etc (?)

  • Final algorithm is iterative application of the above

Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery (2016)

(Li, Yaliang and Li, Qi and Gao, Jing and Su, Lu and Zhao, Bo and Fan, Wei and Han, Jiawei)

Input:
  • \(N\) objects each have \(M\) properties

  • \(K\) sources make observations, which give a value to a property of an object. The value for property \(m\) of object \(i\) provided by source \(k\) is \(v_{im}^{(k)}\).

    • Note: sources are assumed to provide values for all properties of all objects

    • Source input can be seen as an \(N \times M\) matrix \(\mathcal{X}^{(k)}\) whose \((i, m)\)-entry is \(v_{im}^{(k)}\)

Output:
  • Source weights \(\mathcal{W}=\{w_1,\ldots,w_K\}\) are reliability degrees of sources

  • Identified truth \(v_{im}^{(*)}\) for each object and property

Algorithm:
  • Optimisation framework: find \(\mathcal{X}^{(*)}, \mathcal{W}\) which minimises

    \[\sum_{k=1}^{K}{ w_k \sum_{i=1}^{N}{ \sum_{m=1}^{M}{ d_m(v_{im}^{(k)}, v_{im}^{(*)}) } } }\]

    such that \(\delta(\mathcal{W}) = 1\).

  • Here \(\delta\) is a regularisation function which adds constraints on the possible combinations of source weights

  • \(d_m\) is a distance function specific to the data type for property \(m\)

  • Algorithm is a two-step iterative procedure
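
The two steps can be sketched for a single continuous property, using squared distance for \(d_m\) and the weight update \(w_k = -\log(\text{loss}_k / \sum_{k'} \text{loss}_{k'})\) (one common instantiation of the constraint \(\delta(\mathcal{W}) = 1\); a real implementation would dispatch on \(d_m\) per data type):

```python
import math

def crh_sketch(X, iters=10):
    """Two-step CRH-style iteration for one continuous property.
    X[k][i] is source k's value for object i. Alternates:
    (1) truths as weighted means, (2) weights from relative losses."""
    K, N = len(X), len(X[0])
    w = [1.0] * K  # uniform initial source weights
    truth = [0.0] * N
    for _ in range(iters):
        # step 1: fix weights, estimate truths as weighted means
        truth = [sum(w[k] * X[k][i] for k in range(K)) / sum(w)
                 for i in range(N)]
        # step 2: fix truths, update weights from squared losses
        loss = [sum((X[k][i] - truth[i]) ** 2 for i in range(N)) + 1e-9
                for k in range(K)]
        total = sum(loss)
        w = [-math.log(l / total) for l in loss]
    return w, truth
```

Sources far from the consensus accumulate most of the total loss, so their \(-\log\) weight shrinks toward zero and their influence on the weighted mean fades.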

FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation (2015)

(Ma, Fenglong and Li, Yaliang and Li, Qi and Qiu, Minghui and Gao, Jing and Zhi, Shi and Su, Lu and Zhao, Bo and Ji, Heng and Han, Jiawei)

  • Input:

    • Set of questions \(\mathcal{Q}\)

    • Sources \(\mathcal{U}\)

    • Answers \(\mathcal{A} = \{a_{qu} : q \in \mathcal{Q}, u \in \mathcal{U}\}\)

    • Each question \(q\) has \(M_q\) words

    • The number of topics \(K\) (note: questions are not assigned a topic in the input)

  • Output:

    • Estimated truth \(t_q\) for each question \(q\) (note: they use the term trustworthy answers)

    • Topic \(z_q\) for each question \(q\)

    • Topical expertise matrix \(e \in \mathbb{R}^{K \times |\mathcal{U}|}\), i.e. a reliability score for each source on each topic

  • Algorithm:

    • Complex statistical model which I do not understand

    • Iterative algorithm

On the Discovery of Continuous Truth: A Semi-supervised Approach with Partial Ground Truths (2018)

(Yi Yang and Quan Bai and Qing Liu)

  • Input:

    • A source \(s\) makes an observation \(v_o^s \in \mathbb{R}\) for an object \(o\)

    • Set of objects \(\mathcal{O} = \mathcal{O}_g \cup \mathcal{O}_u\) is comprised of ground-truth objects (\(\mathcal{O}_g\)) and unknown-truth objects (\(\mathcal{O}_u\))

    • For \(o \in \mathcal{O}_g\), \(\bar{v}_o^*\) is the ground truth for object \(o\)

  • Output:

    • A set of estimated truths \(V_u^* = \{v_o^*\}_{o \in \mathcal{O}_u}\)

    • Weights \(W=\{w^s\}_{s \in \mathcal{S}} \subseteq \mathbb{R}\) are used in the algorithm (and could be considered part of the output, although the authors do not include this in their problem formulation)

  • Algorithm: