Truth Discovery¶
Notes on the mainstream TD literature, which is essentially data mining.
Contents:
Problem Formulation¶
The following lists the different truth discovery formulations in the literature.
Truth Discovery with Multiple Conflicting Information Providers on the Web (2008)¶
Input:
Websites provide facts for objects
\(F(w)\) is the set of facts for website \(w\)
\(W(f)\) is the set of websites for fact \(f\)
Each fact \(f\) relates to object \(o(f)\)
Implication level \(\text{imp}(f_j \to f_k) \in [-1, 1]\)
Output:
Trustworthiness \(t(w) \in \mathbb{R}\) for each website \(w\)
Confidence \(s(f) \in \mathbb{R}\) for each fact \(f\)
Algorithm:
Pseudo-probabilistic: starts with a probabilistic model but incorporates various heuristics
Iterative procedure
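A minimal sketch of the core heuristic iteration, ignoring the implication levels between facts (function and variable names are my own, and the paper's actual algorithm adds several adjustments on top of this):

```python
def truthfinder(facts_by_site, n_iter=20, t0=0.9):
    # facts_by_site: {website: set of facts it provides}
    # t0: initial trustworthiness guess for every website (assumption)
    sites = list(facts_by_site)
    all_facts = set().union(*facts_by_site.values())
    t = {w: t0 for w in sites}  # trustworthiness t(w)
    s = {}                      # fact confidence s(f)
    for _ in range(n_iter):
        # fact confidence: probability that at least one providing
        # website is trustworthy, i.e. 1 - prod(1 - t(w)) over W(f)
        for f in all_facts:
            prod = 1.0
            for w in sites:
                if f in facts_by_site[w]:
                    prod *= (1.0 - t[w])
            s[f] = 1.0 - prod
        # website trustworthiness: average confidence of its facts F(w)
        for w in sites:
            fs = facts_by_site[w]
            t[w] = sum(s[f] for f in fs) / len(fs)
    return t, s
```

Facts backed by several (currently trusted) websites gain confidence, which in turn raises the trustworthiness of the websites providing them.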
A Survey on Truth Discovery (2016)¶
(Li, Yaliang et al.)
Surveys many different TD methods; their formulation matches (or is at least compatible with) quite a few methods.
Input:
Sources provide values for objects
\(v^s_o\) is the value for object \(o\) provided by source \(s\)
Total input is \(\{v^s_o\}_{o \in \mathcal{O},\, s \in \mathcal{S}}\)
Output:
Source weight \(w_s\) for each source
Identified truth \(v^*_o\) for each object
Total output is \(\langle \{w_s\}_{s \in \mathcal{S}}, \{v^*_o\}_{o \in \mathcal{O}} \rangle\)
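This general formulation suggests the standard two-step loop common to the surveyed methods: estimate truths from weighted claims, then re-weight sources by agreement with the current truths. A toy sketch for categorical values (the specific update rules here are illustrative choices, not any single surveyed method's):

```python
from collections import Counter

def truth_discovery(claims, n_iter=10):
    # claims: {source: {object: value}}
    sources = list(claims)
    objects = {o for c in claims.values() for o in c}
    w = {s: 1.0 for s in sources}   # source weights w_s
    truths = {}                     # identified truths v*_o
    for _ in range(n_iter):
        # truth update: weighted vote per object
        for o in objects:
            votes = Counter()
            for s in sources:
                if o in claims[s]:
                    votes[claims[s][o]] += w[s]
            truths[o] = votes.most_common(1)[0][0]
        # weight update: fraction of a source's claims matching current truths
        for s in sources:
            agree = sum(truths[o] == v for o, v in claims[s].items())
            w[s] = agree / len(claims[s])
    return truths, w
```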
Semi-supervised Truth Discovery (2011)¶
(Yin, Xiaoxin and Tan, Wenzhao)
Input:
Sources provide facts, where each fact relates to a subject
A subset of facts are ground truths, which are known to be true
Similarity function \(\text{sim}: \mathcal{F} \times \mathcal{F} \to [-1, 1]\)
The model allows facts on the same subject to support each other rather than always conflict, e.g. two numerically 'close' values claimed for a subject whose true value is a large number.
Output:
Confidence score \(c_i \in [-1, 1]\) for each fact \(f_i\)
Interestingly, source trustworthiness is not directly considered
Algorithm:
Form a weighted graph from the input:
Nodes are facts
(Symmetric) weight from \(f_i\) to \(f_j\) is
\[w_{ij} = \begin{cases} \alpha & f_i \text{ and } f_j \text{ are both provided by some source} \\ \text{sim}(f_i, f_j) & f_i \text{ and } f_j \text{ are for the same subject} \\ 0 & \text{otherwise} \end{cases}\]
where \(\alpha > 0\) is some parameter.
Algorithm is optimisation based:
Assign a confidence score to each node (fact)
Constraint: ground truth facts must have score 1
Objective function is a sum over edges which involves the \(c_i, c_j, w_{ij}\)
Iterative procedure converges to the optimal solution
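A rough sketch of the idea as plain confidence propagation over the fact graph, with ground-truth facts clamped to 1 (this is a label-propagation analogue of my own, not the paper's exact objective or update):

```python
def propagate(facts, edges, ground_truths, n_iter=50):
    # facts: list of fact ids
    # edges: {(i, j): weight}, symmetric weights as defined above
    # ground_truths: subset of facts known to be true
    c = {f: (1.0 if f in ground_truths else 0.0) for f in facts}
    nbrs = {f: [] for f in facts}
    for (i, j), w in edges.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    for _ in range(n_iter):
        new_c = {}
        for f in facts:
            if f in ground_truths:
                new_c[f] = 1.0  # constraint: ground truths keep score 1
            elif nbrs[f]:
                # weighted average of neighbour confidences; negative
                # (conflicting) edges push the score below zero
                total = sum(abs(w) for _, w in nbrs[f])
                new_c[f] = sum(w * c[g] for g, w in nbrs[f]) / total
            else:
                new_c[f] = c[f]
        c = new_c
    return c
```

Facts connected to ground truths by positive (support) edges drift towards 1; facts connected by negative (conflict) edges drift towards -1.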
On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach (2012)¶
(Wang, Dong and Kaplan, Lance and Le, Hieu and Abdelzaher, Tarek)
Input:
Participants make (binary) observations about variables
Represented as participant-claims matrix: row for each participant, column for each variable. We have \(SC_{ij}=1\) if participant \(i\) claims variable \(j\) is true, and \(SC_{ij}=0\) otherwise
Output:
True/False for each variable
Reliability score \(e_i \in \mathbb{R}\) for each participant \(i\)
Algorithm:
Iterative expectation maximisation algorithm
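A sketch of the EM loop for the binary matrix above, with latent truth \(z_j\) per variable and per-participant parameters \(a_i = P(SC_{ij}=1 \mid z_j=1)\) and \(b_i = P(SC_{ij}=1 \mid z_j=0)\). This follows the standard EM pattern; the paper's exact parameterisation and M-step differ in details:

```python
def em_truth(SC, n_iter=30):
    # SC: list of rows; SC[i][j] = 1 if participant i claims variable j is true
    n, m = len(SC), len(SC[0])
    a = [0.8] * n  # P(claim | variable true), initial guess (assumption)
    b = [0.2] * n  # P(claim | variable false), initial guess (assumption)
    d = 0.5        # prior P(z_j = 1)
    Z = [0.5] * m  # posterior P(z_j = 1 | SC)
    for _ in range(n_iter):
        # E-step: posterior probability that each variable is true
        for j in range(m):
            p1, p0 = d, 1 - d
            for i in range(n):
                if SC[i][j]:
                    p1 *= a[i]
                    p0 *= b[i]
                else:
                    p1 *= 1 - a[i]
                    p0 *= 1 - b[i]
            Z[j] = p1 / (p1 + p0)
        # M-step: re-estimate participant parameters and the prior
        for i in range(n):
            claimed = [j for j in range(m) if SC[i][j]]
            a[i] = sum(Z[j] for j in claimed) / sum(Z)
            b[i] = sum(1 - Z[j] for j in claimed) / (m - sum(Z))
        d = sum(Z) / m
    return Z, a, b
```

The participant reliability \(e_i\) can then be derived from the converged \(a_i, b_i\).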
Truth Discovery and Copying Detection in a Dynamic World (2009)¶
(Dong, Xin Luna and Berti-Equille, Laure and Srivastava, Divesh)
Input:
Differs substantially from the other approaches: the truth changes over time
Set of sources \(\mathcal{S}\) provide values for objects \(\mathcal{O}\) over time
\(\bar{U}(S, t)\) denotes the updates provided by source \(S\) at time \(t\)
I presume \(\bar{U}(S, t) \subseteq (\mathcal{O} \times \mathcal{V})\) or something like that, where \(\mathcal{V}\) is the domain of values for objects
Output:
True values for each object over time (i.e. recovers the whole history, not just final true values): a tuple \(\langle (t_1,v_1),\ldots,(t_l, v_l)\rangle\) for each object, where the \(t_i\) are the times at which the true value changed, and \(v_i\) are the true values
Algorithm (very complicated):
Hidden Markov Model for copying relationship state between each pair of sources
Bayesian probabilistic model for estimating object truths from source copying relationship etc (?)
Final algorithm is iterative application of the above
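The full algorithm is beyond a quick sketch, but the output shape (a per-object history of value changes) can be illustrated with naive per-timestep majority voting over each source's latest updates; this ignores copy detection and the Bayesian model entirely, so it is purely for illustrating the input/output structure:

```python
from collections import Counter

def naive_history(updates):
    # updates: {source: [(t, obj, value), ...]}, each list sorted by time
    times = sorted({t for us in updates.values() for (t, _, _) in us})
    objects = {o for us in updates.values() for (_, o, _) in us}
    latest = {s: {} for s in updates}  # each source's current claims
    history = {o: [] for o in objects} # output: (t_i, v_i) tuples per object
    idx = {s: 0 for s in updates}
    for t in times:
        # apply every update \bar{U}(S, t') with t' <= t
        for s, us in updates.items():
            while idx[s] < len(us) and us[idx[s]][0] <= t:
                _, o, v = us[idx[s]]
                latest[s][o] = v
                idx[s] += 1
        # majority vote per object; record only when the winner changes
        for o in objects:
            votes = Counter(c[o] for c in latest.values() if o in c)
            if votes:
                v = votes.most_common(1)[0][0]
                if not history[o] or history[o][-1][1] != v:
                    history[o].append((t, v))
    return history
```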
Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery (2016)¶
(Li, Yaliang and Li, Qi and Gao, Jing and Su, Lu and Zhao, Bo and Fan, Wei and Han, Jiawei)
- Input:
\(N\) objects each have \(M\) properties
\(K\) sources make observations, which give a value to a property of an object. The value for property \(m\) of object \(i\) provided by source \(k\) is \(v_{im}^{(k)}\).
Note: sources are assumed to provide values for all properties of all objects
Source input can be seen as an \(N \times M\) matrix \(\mathcal{X}^{(k)}\) whose \((i, m)\)-entry is \(v_{im}^{(k)}\)
- Output:
Source weights \(\mathcal{W}=\{w_1,\ldots,w_K\}\) are reliability degrees of sources
Identified truth \(v_{im}^{(*)}\) for each object and property
- Algorithm:
Optimisation framework: find \(\mathcal{X}^{(*)}, \mathcal{W}\) which minimise
\[\sum_{k=1}^{K}{ w_k \sum_{i=1}^{N}{ \sum_{m=1}^{M}{ d_m(v_{im}^{(k)}, v_{im}^{(*)}) } } }\]
subject to \(\delta(\mathcal{W}) = 1\).
Here \(\delta\) is a regularisation function which adds constraints on the possible combinations of source weights
\(d_m\) is a distance function specific to the data type for property \(m\)
Algorithm is a two-step iterative procedure
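A sketch of the two steps for purely continuous data with \(d_m\) taken as squared distance, using the closed-form updates under the exponential weight regularisation (weighted-mean truths, \(-\log\) loss-share weights). Mixed data types would need a per-property \(d_m\); this single choice is an assumption for illustration:

```python
import math

def crh(X, n_iter=10):
    # X: K source matrices; X[k][i][m] = value for property m of object i
    #    claimed by source k (all sources claim all entries, as assumed above)
    K, N, M = len(X), len(X[0]), len(X[0][0])
    w = [1.0] * K
    truth = [[0.0] * M for _ in range(N)]
    for _ in range(n_iter):
        # step 1 (truth update): weighted mean of source claims
        for i in range(N):
            for m in range(M):
                truth[i][m] = sum(w[k] * X[k][i][m] for k in range(K)) / sum(w)
        # step 2 (weight update): -log of each source's share of total loss
        loss = [sum((X[k][i][m] - truth[i][m]) ** 2
                    for i in range(N) for m in range(M)) for k in range(K)]
        total = sum(loss)
        w = [-math.log(max(l, 1e-12) / total) for l in loss]
    return truth, w
```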
FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation (2015)¶
(Ma, Fenglong and Li, Yaliang and Li, Qi and Qiu, Minghui and Gao, Jing and Zhi, Shi and Su, Lu and Zhao, Bo and Ji, Heng and Han, Jiawei)
Input:
Set of questions \(\mathcal{Q}\)
Sources \(\mathcal{U}\)
Answers \(\mathcal{A} = \{a_{qu} : q \in \mathcal{Q}, u \in \mathcal{U}\}\)
Each question \(q\) has \(M_q\) words
The number of topics \(K\) (note: questions are not assigned a topic in the input)
Output:
Estimated truth \(t_q\) for each question \(q\) (note: they use the term trustworthy answers)
Topic \(z_q\) for each question \(q\)
Topical expertise matrix \(e \in \mathbb{R}^{K \times U}\) (where \(U = |\mathcal{U}|\)), i.e. a reliability score for each source for each topic
Algorithm:
Complex statistical model which I do not understand
Iterative algorithm
On the Discovery of Continuous Truth: A Semi-supervised Approach with Partial Ground Truths (2018)¶
(Yi Yang and Quan Bai and Qing Liu)
Input:
A source \(s\) makes an observation \(v_o^s \in \mathbb{R}\) for an object \(o\)
Set of objects \(\mathcal{O} = \mathcal{O}_g \cup \mathcal{O}_u\) is comprised of ground-truth objects (\(\mathcal{O}_g\)) and unknown-truth objects (\(\mathcal{O}_u\))
For \(o \in \mathcal{O}_g\), \(\bar{v}_o^*\) is the ground truth for object \(o\)
Output:
A set of estimated truths \(V_u^* = \{v_o^*\}_{o \in \mathcal{O}_u}\)
Weights \(W=\{w^s\}_{s \in \mathcal{S}} \subseteq \mathbb{R}\) are used in the algorithm (and could be considered part of the output, although the authors do not include this in their problem formulation)
Algorithm:
Optimisation problem similar to others, e.g. CRH
Problem: find \(V_u^*, W\) to minimise the objective function, which is a sum of two components (the relative weight controlled by a hyperparameter \(\theta\))
Weighted sum of squared differences between source claims \(v_o^s\) and estimated truths \(v_o^*\) for each \(o \in \mathcal{O}_u\)
Same thing but for ground truths \(\bar{v}_o^*\) over \(o \in \mathcal{O}_g\)
Exponential source weight constraint (cf. Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery (2016) and Finding global optimum for truth discovery: entropy based geometric variance)
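A sketch of the resulting alternating updates, in my own simplified form: squared loss throughout, and a CRH-style \(-\log\) loss-share weight update standing in for the exponential constraint. Not the paper's exact derivation:

```python
import math

def semi_supervised_td(claims, ground, theta=1.0, n_iter=10):
    # claims: {source: {object: float value v_o^s}}
    # ground: {object: known true value} for objects in O_g
    # theta: relative weight of the ground-truth loss term
    sources = list(claims)
    unknown = {o for c in claims.values() for o in c} - set(ground)
    w = {s: 1.0 for s in sources}
    est = {o: 0.0 for o in unknown}  # estimated truths v_o* over O_u
    for _ in range(n_iter):
        # truth update: weighted mean over sources claiming each unknown object
        for o in unknown:
            num = sum(w[s] * claims[s][o] for s in sources if o in claims[s])
            den = sum(w[s] for s in sources if o in claims[s])
            est[o] = num / den
        # weight update: loss on unknowns plus theta-scaled loss on ground truths
        loss = {}
        for s in sources:
            l = sum((claims[s][o] - est[o]) ** 2 for o in unknown if o in claims[s])
            l += theta * sum((claims[s][o] - ground[o]) ** 2
                             for o in ground if o in claims[s])
            loss[s] = max(l, 1e-12)
        total = sum(loss.values())
        w = {s: -math.log(loss[s] / total) for s in sources}
    return est, w
```

The ground-truth term lets sources that agree with the known truths pull more weight when estimating the unknown ones.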