Mainstream Truth Discovery Papers¶
Towards Student Learning Ability Estimation and Truth Discovery in Japanese Online Course (2020)
Federated Truth Inference over Distributed Crowdsourcing Platforms (2020)
FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation (2015)
On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach (2012)
Finding global optimum for truth discovery: entropy based geometric variance
The Wisdom of Minority: Discovering and Targeting the Right Group of Workers for Crowdsourcing
A confidence-aware approach for truth discovery on long-tail data (2014)
Truth Discovery from Multi-Sourced Text Data Based on Ant Colony Optimization (2019)
Truth Discovery with Multiple Conflicting Information Providers on the Web (2008)
On the Discovery of Continuous Truth: A Semi-supervised Approach with Partial Ground Truths (2018)
Truth Discovery and Copying Detection in a Dynamic World (2009)
SLiMFast: Guaranteed Results for Data Fusion and Source Reliability (2017)
Towards Student Learning Ability Estimation and Truth Discovery in Japanese Online Course (2020)¶
(Han, Lanling and Liu, Yanwei and Shi, Chunhui and Bi, Yang and Yu, Liang and Shen, Hua)
Summary:
Another optimisation-based truth discovery algorithm. The authors motivate the method with a student/question setting: students of varying ability levels answer questions.
As usual, the goal is to jointly estimate the ability of students and the true answers to questions. I do not fully understand this setting: presumably whoever set the questions already knows the true answers…
Sources are split into two groups: authoritative and ordinary. The idea is that authoritative sources are better than ordinary ones, so it’s leaning closer to semi-supervised truth discovery (where claims from authoritative sources act as ground truths).
Framework:
Terminology: objects (questions), sources (students) and observations (answers given by students)
Notation: \(N\) objects, \(P\) ‘ordinary’ sources and \(Q\) ‘authoritative’ sources.
The observation from the \(p\)-th ordinary source on the \(n\)-th object is \(o_n^p\); the observation from the \(q\)-th authoritative source is \(a_n^q\). Source weights are denoted \(w_p\) and \(r_q\). Estimated truths are denoted \(t_n\).
Optimisation model: basically the same as [LLG+16] and many other papers… One difference is that a factor \(\lambda > 0\) multiplies the sum of weighted distances for observations from ordinary sources, which controls their importance relative to the authoritative sources (see the sketch below).
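To make the objective concrete, here is a minimal sketch of how I read it, assuming continuous observations and squared distance; the variable names and shapes are mine, and the paper’s regularisation constraint on the weights is omitted:

```python
import numpy as np

def weighted_objective(t, O, A, w, r, lam):
    """Sketch of the objective as I understand it: lambda scales the
    ordinary-source term relative to the authoritative-source term.
    (Squared loss assumed; the weight-regularisation constraint from the
    paper is not included here.)

    t   : (N,)   current estimated truths
    O   : (P, N) observations from ordinary sources
    A   : (Q, N) observations from authoritative sources
    w   : (P,)   ordinary source weights
    r   : (Q,)   authoritative source weights
    lam : float  the factor lambda > 0
    """
    ordinary = lam * np.sum(w[:, None] * (O - t) ** 2)
    authoritative = np.sum(r[:, None] * (A - t) ** 2)
    return ordinary + authoritative
```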
Experimental evaluation:
An education dataset (an online multiple-choice Japanese exam from the authors’ university) and a crowdsourced dataset from a “Who Wants to be a Millionaire” Android app. Neither dataset appears to be publicly available (no mention of availability in the paper, at least).
The authors show that their algorithm achieves the lowest error rate on these datasets compared to a number of baselines.
Criticisms:
Very similar to other work I have seen…
How did the authors split the sources into ‘authoritative’ and ‘ordinary’ in their experiments? Seems to rely on some prior knowledge of the dataset.
More generally, the splitting of sources seems only to amount to scaling the weights of one group by \(\lambda\). It’s not clear to me how this differs at all from CRH from Li et al…
Federated Truth Inference over Distributed Crowdsourcing Platforms (2020)¶
(M. Yang and G. Liu and Y.-P. Hong)
YAEMTDA (yet another expectation maximisation truth discovery algorithm)…
Summary:
Crowdsourcing where tasks are sent to different platforms. E.g. some users are on Mechanical Turk, some complete in-person surveys etc.
Truth inference is done in a distributed way (reduced communication costs and better privacy for participants)
Approach:
Workers are split up across \(K\) crowdsourcing platforms
The response of worker \(n\) to task \(m\) is
\[x_{n,m} = \sum_{l=1}^{L}{s^{(l)}_{n,m}(a_{m} + w^{(l)}_{n,m})}\]where \(s^{(l)}_{n,m} \in \{0,1\}\) is an indicator variable giving the reliability level of worker \(n\) on task \(m\), and \(w^{(l)}_{n,m}\) is the noise in the response when the reliability level is \(l\)
Model assumes that the probability of worker \(n\) having reliability level \(l\) is independent of the task \(m\), so that the worker is described by a vector of probabilities \(\mathbf{\alpha}_n \in \R^L\)
Also assumes that the noise factors \(w_{n,m}^{(l)}\) are i.i.d Gaussians for all workers and tasks: \(w_{n,m}^{(l)} \sim \mathcal{N}(0, \sigma_l^2)\) where \(\sigma_l^2\) is the noise variance for workers of level \(l\)
Note: I don’t quite understand why the model is introduced in such a general way (e.g. it permits a worker to have a different reliability level on each task) only to then restrict it with the independence assumptions. It would be clearer to build those assumptions into the model from the start. (A tiny simulation of the assumed generative model is sketched below.)
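For my own understanding, here is a tiny simulation of the assumed generative model under the independence and Gaussian-noise assumptions above; all dimensions, noise levels and variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, L = 5, 20, 3                          # workers, tasks, reliability levels
a = rng.normal(size=M)                      # latent true answers a_m
sigma = np.array([0.1, 0.5, 2.0])           # noise std deviation per level
alpha = rng.dirichlet(np.ones(L), size=N)   # per-worker level probabilities

# Each response draws a reliability level l ~ alpha_n (independently of the
# task) and adds Gaussian noise with that level's variance to the truth.
levels = np.array([rng.choice(L, size=M, p=alpha[n]) for n in range(N)])
noise = rng.normal(0.0, sigma[levels])
X = a[None, :] + noise                      # responses x_{n,m}
```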
Expectation maximisation applied to estimate
True responses to tasks
Probability distributions of reliability levels for each worker (the vectors \(\mathbf{\alpha}_n\))
Noise variance levels \(\sigma_l^2\)
It turns out that the parameter estimate \(\theta_{t+1}\) at iteration \(t+1\) can be calculated using just aggregated statistics from each of the \(K\) platforms, without the raw worker responses \(x_{n,m}\) ever leaving their platform
This supposedly preserves worker privacy, since their information is not collected on a central server. There is no discussion of the privacy characteristics of the local statistics though (e.g. whether they satisfy anything like differential privacy…)
Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media (2017)¶
(Popat, Kashyap and Mukherjee, Subhabrata and Strotgen, Jannik and Weikum, Gerhard)
(Doesn’t describe itself as Truth Discovery, but in terms of the categorisation on this page I consider it close enough…)
Summary:
The system does the following: given a textual claim, determine its credibility and give an explanation for the verdict
Credibility is affected by
Language used
Reliability of sources which support/refute the claim
Temporal aspects of how claims spread across the web
Preliminaries
Set \(C\) of claims
Set \(WS\) of (web) sources
Set of articles \(A^t\) for each timestamp \(t\)
\(a_{ij}^t\) is the article from \(ws_j\) at time \(t\) which supports/refutes claim \(c_i\)
Binary random variables \(y_i^t, y_{ij}^t \in \{T,F\}\) represent the credibility of claim \(c_i\) and the credibility of \(a_{ij}\) at time \(t\), respectively. \(y_i^t\) is supposed to be an aggregation of the \(y_{ij}^t\)
Problem statement:
Given the values of a subset of the \(y_{i}^t\), predict \(y_1^t\) for each timestep \(t\) (wlog I’m assuming that \(c_1\) is the claim being assessed)
Credibility assessment
Lists some linguistic features and how they affect credibility assessment (e.g. “may” softens the degree of commitment to a proposition)
Authors introduce an algorithm to calculate support and refutation scores for each article and claim:
Generate ‘snippets’ of the article: all subsequences of up to four consecutive sentences
For each snippet \(s\), compute the ‘unigram and bigram overlap’ score \(o_s\) (this is not explained in any detail); discard snippets whose overlap score falls below a fixed threshold
Calculate the stance of each snippet using logistic regression, trained on data from popular fact-checking sites (I don’t have the background to really understand this step). This gives support and refute probabilities \(p_s^+, p_s^-\)
Order the snippets according to \(\max\{p_s^+ \cdot o_s, p_s^- \cdot o_s\}\)
Select the top \(k\) snippets (these are later used as evidence for the verdict) and return the averages of \(p_s^+\) and \(p_s^-\) over them (a toy version of this ranking-and-aggregation step is sketched below)
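A toy version of the ranking-and-aggregation step, as I read it (whether the paper weights the averages by the overlap scores is unclear to me, so this sketch uses plain averages; all names are mine):

```python
def aggregate_snippets(snippets, k, overlap_threshold):
    """snippets: list of (overlap, p_support, p_refute) triples, assumed to
    come from the n-gram overlap step and the stance classifier above."""
    kept = [s for s in snippets if s[0] >= overlap_threshold]
    if not kept:
        return [], 0.0, 0.0
    # order by the larger of the overlap-weighted stance probabilities
    kept.sort(key=lambda s: max(s[1] * s[0], s[2] * s[0]), reverse=True)
    top = kept[:k]
    p_support = sum(s[1] for s in top) / len(top)
    p_refute = sum(s[2] for s in top) / len(top)
    return top, p_support, p_refute
```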
Source reliability is computed wrt a labelling of the credibility of claims:
Take number of articles from \(ws_j\) which supports a true claim
Add the number of articles which refute a false claim
Divide by total number of articles from \(ws_j\)
C.f. ‘proportion’ operator in question-answering with abstentions…
I don’t quite understand what the authors do here: in the previous section an article has support and refutation scores, not an absolute status
FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation (2015)¶
(Ma, Fenglong and Li, Yaliang and Li, Qi and Qiu, Minghui and Gao, Jing and Zhi, Shi and Su, Lu and Zhao, Bo and Ji, Heng and Han, Jiawei)
Summary:
FaitCrowd: fine-grained truth discovery for crowdsourced data aggregation
Challenges the widespread TD assumption that source reliability is uniform across objects (‘tasks’ in the crowdsourcing parlance)
A source might be an expert (high reliability) in one area, but useless in another (low reliability)
(I seem to recall this issue also being discussed in Pasternack’s thesis)
Instead, need topic-aware reliability measures
Automatically learns topics, reliability levels and true answers
Formulation:
Input:
Set of questions \(\mathcal{Q}\)
Sources \(\mathcal{U}\)
Answers \(\mathcal{A} = \{a_{qu} : q \in \mathcal{Q}, u \in \mathcal{U}\}\)
Each question \(q\) has \(M_q\) words
The number of topics \(K\) (note: questions are not assigned a topic in the input)
Output:
Estimated truth \(t_q\) for each question \(q\) (note: they use the term trustworthy answers)
Topic \(z_q\) for each question \(q\)
Topical expertise matrix \(e \in \R^{K \times U}\) (where \(U = |\mathcal{U}|\)), i.e. a reliability score for each source on each topic
Model:
Complex statistical model which I do not understand
Iterative algorithm
On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach (2012)¶
(Wang, Dong and Kaplan, Lance and Le, Hieu and Abdelzaher, Tarek)
Summary:
Widely cited MLE-based TD method. I think it is actually the first MLE TD paper
Reliability of a source is the probability that they report correct observations
Just looks at binary measurements
Provides optimal procedure (in the MLE sense) for truth discovery
Preliminaries:
\(M\) sources \(S_1,\ldots,S_M\) and \(N\) variables \(C_1,\ldots,C_N\) each with two possible values: true or false
\(S_iC_j\) denotes that source \(i\) claims \(C_j\) is true. Note that sources never claim variables are false: this is the default position in the absence of \(S_iC_j\)
\(P(C^t_j)\) and \(P(C^f_j)\) denote the probability that \(C_j\) is true or false respectively [isn’t it the case that \(P(C^f_j)=1-P(C^t_j)\)?]
\(s_i\) is the probability that \(S_i\) will claim a variable is true. The reliability \(t_i\) is the probability that \(S_i\) makes the correct observation: \(t_i = P(C^t_j \mid S_iC_j)\)
Some other notions defined…
Expectation maximisation
Rough outline is as follows…
We have an observed dataset \(X\), latent variables \(Z\) and parameters \(\theta\)
The complete-data likelihood is \(L(\theta; X, Z) = P(X, Z \mid \theta)\); the likelihood of the observed data alone is \(L(\theta; X) = P(X \mid \theta) = \sum_{Z}{P(X, Z \mid \theta)}\)
That is, the probability that we would observe \(X\) (together with latent values \(Z\)) if the parameters of the model were \(\theta\)
In the TD case \(X\) is the TD dataset, represented as a binary source-claims matrix. The latent variable \(Z\) consists of the truth status of each variable. The parameters \(\theta\) include reliability of sources and background bias \(d\) (i.e. \(d\) is the probability that a variable is true generally).
EM algorithm is iterative:
Initialise an estimate for \(\theta^{(1)}\)
While convergence criteria not satisfied:
(E step): Define \(Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}[\log{L(\theta; X, Z)}]\). That is, we look at the expected value of the log-likelihood, taken with respect to the probability distribution over \(Z\) conditioned on the observed data \(X\) and the current estimated parameters \(\theta^{(t)}\).
(M step): Set \(\theta^{(t+1)} = \text{argmax}_{\theta}{Q(\theta \mid \theta^{(t)})}\).
Applied to this TD model, the authors obtain an explicit form for \(Q\) and its argmax, leading to a simple iterative algorithm (a generic sketch of EM for a binary model of this kind is given below)
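For reference, here is a generic EM loop for a binary truth-discovery model of this kind. It is not the paper’s exact parameterisation (which works with \(t_i\), \(s_i\) and the background bias \(d\)); instead each source gets a ‘claim rate given true’ and ‘claim rate given false’, which is a closely related formulation:

```python
import numpy as np

def em_binary_td(X, n_iters=50, eps=1e-9):
    """Generic EM for binary truth discovery (a simplified stand-in for the
    paper's model, not its exact update equations).

    X[i, j] = 1 if source i claims variable j is true, else 0.
    Latent Z_j = truth of variable j; parameters per source are
    a_i = P(claim | true), b_i = P(claim | false), plus the prior d.
    """
    M, N = X.shape
    a = np.full(M, 0.7)     # initial P(claim | variable true)
    b = np.full(M, 0.3)     # initial P(claim | variable false)
    d = 0.5                 # initial prior P(variable true)
    for _ in range(n_iters):
        # E step: posterior probability that each variable is true
        log_pt = np.log(d + eps) + X.T @ np.log(a + eps) + (1 - X).T @ np.log(1 - a + eps)
        log_pf = np.log(1 - d + eps) + X.T @ np.log(b + eps) + (1 - X).T @ np.log(1 - b + eps)
        z = 1.0 / (1.0 + np.exp(log_pf - log_pt))   # P(Z_j = true | X, theta)
        # M step: re-estimate parameters from expected counts
        a = (X @ z + eps) / (z.sum() + eps)
        b = (X @ (1 - z) + eps) / ((1 - z).sum() + eps)
        d = z.mean()
    return z, a, b, d
```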
Evaluation:
The algorithm is compared against others in the literature on simulated data with respect to various metrics
Also runs the algorithm on a real world dataset from Twitter
Finding global optimum for truth discovery: entropy based geometric variance¶
(Hu Ding, Jing Gao, and Jinhui Xu)
Source claims are given as vectors \(\{p_1,\ldots,p_n\}\subseteq \R^d\) (i.e. there are \(n\) sources, \(d\) objects and facts are real-valued).
This appears to require that a source makes a claim for every object, which seems odd…
Formulates truth discovery as an optimisation problem: find source reliability weights \(w_1,\ldots,w_n\) and the truth vector \(p^*\) which minimise \(\sum_{i=1}^n{w_i \|p_i - p^*\|^2}\) subject to the normalisation constraint \(\sum_{i=1}^n{e^{-w_i}} = 1\).
Some justification is given for the exponential normalisation function. Firstly, it prevents one from taking \(w_1=1\), say, and \(w_i=0\) for \(i > 1\). I do not understand the other reasons, but it is linked with entropy…
Finds an equivalent formulation of the optimisation problem, which is non-convex in general
Provides (pretty complicated) approximate algorithms for the optimisation problem
Very opinionated view of truth discovery. They define truth discovery as their particular optimisation problem, and for this problem there is a fixed link between the source weights and truth vector, so that it is enough to find just one of them (I think)..
Seems to assume the optimal vector \(p^*\) exists without justification, unless I’m reading it wrong… Furthermore there are two algorithms presented for two different cases, but knowing which case to apply seems to rely on already knowing \(p^*\).
A Faster Algorithm for Truth Discovery via Range Cover¶
(Ziyun Huang, Hu Ding, Jinhui Xu)
Seems to be very much an extension of the above paper
TODO: finish reading
The Wisdom of Minority: Discovering and Targeting the Right Group of Workers for Crowdsourcing¶
(Hongwei Li, Bo Zhao, Ariel Fuxman)
Addresses a crowdsourcing quality problem
Closely related to TD, but the references are in crowdsourcing and the work seems disjoint from the TD literature
Goal is to identify characteristics of crowdsourcing workers (e.g. education level), and, for a given task, identify groups of workers whose characteristics mean they are likely to provide high-quality data. These groups are then targeted for future tasks.
Three stages…
Probing stage:
Worker characteristics are gathered from the workers’ profiles (e.g. age, location – the available data is platform-specific) or by a questionnaire
Part of the task is sent to all workers
Discovery stage:
From the probing stage, work out if there are groups of workers whose data is of high ‘quality’
This is the step related to TD
Targeting stage:
Send the rest of the task to the high-quality workers
Talks about EM for the ‘TD’ part, but I don’t think TD plays a role in their algorithms
Uses statistics beyond my level, so I have not read the technical parts in detail
A confidence-aware approach for truth discovery on long-tail data (2014)¶
(Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han)
A note on terminology: speaks of reliability of sources and trustworthiness of information
“Trust” is referring to the data, rather than sources! This is the other way around to our framework
Deals with the “long-tail phenomenon”:
Most sources only provide a few claims (“small sources”) and only a few sources make plenty of claims
Number of claims per source roughly follows a power law distribution in example datasets: something like \(y(x)=ax^{-b}\) where \(y(x)\) is the number of sources making \(x\) claims. That is, a small number of sources make a large number of claims
Aims to discount the effect of small sources while still taking them into consideration (ignoring them is not possible since the claims from small sources can form a significant portion of the dataset)
Gives a confidence interval for the reliability estimate, not just a point estimate
Formalism:
We have a set of entities \(\mathcal{N}\) (i.e. objects), sources \(\mathcal{S}\), and claims \(\mathcal{C}\)
Each claim is of the form \((n, s, x_n^s)\) for \(n \in \mathcal{N}\), \(s \in \mathcal{S}\). That is, source \(s\) claims that \(x_n^s\) is the true value for entity \(n\)
Output of TD is an estimated truth for each entity: \(\mathcal{X}=\{(n, x_n^*) : n \in \mathcal{N}\}\)
Note that source weights are not considered to be part of the output, and are only used ‘internally’ to calculate the \(x_n^*\)
TD model:
Each source has an error distribution \(\epsilon_s\). The idea is that when making a claim, a source draws an error level \(e\) from this distribution and makes a claim \(e\) away from the truth
Error distributions are assumed to be normal with mean 0 (this means that sources do not have a tendency to be wrong on purpose, an assumption which may not always hold!)
Source reliability is inversely proportional to the variance of the error distribution \(\sigma_s^2\)
Combined error is \(\epsilon_{\text{combined}} = \sum_{s \in \mathcal{S}}{w_s \epsilon_s}\) for some weights \(w_s \in [0, 1]\) with all weights summing to 1
Aim is to choose weights to minimise the variance of the combined error. This can be expressed in terms of the \(\sigma_s^2\) (which are unknown)
Problem is formulated as an optimisation problem: find weights, estimate variances to minimise (estimated) variance of \(\epsilon_\text{combined}\)
(The objective function is modified a bit as they go into details, but above is the gist of it…)
Algorithm:
A closed-form solution for the optimisation problem is found
“Initial truths” are calculated through averaging or voting
Initial source weights are calculated (according to the optimisation problem solution) from initial truths and some probabilistic component relating to the number of claims made
Truths are updated as a weighted sum of claims
Repeat until a stopping criterion is met (a sketch of this alternating procedure is given below)
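A sketch of the alternating procedure as described above, using plain inverse-variance weights; the paper’s actual weight comes from a chi-squared confidence bound on the variance estimate (that is the ‘confidence-aware’ part, omitted here), and all names below are mine:

```python
import numpy as np

def catd_like(claims, n_iters=20, eps=1e-9):
    """claims: dict entity -> {source: claimed value} (continuous data).
    Returns estimated truths and source weights."""
    sources = {s for obs in claims.values() for s in obs}
    # initial truths by simple averaging
    truths = {n: float(np.mean(list(obs.values()))) for n, obs in claims.items()}
    weights = {s: 1.0 for s in sources}
    for _ in range(n_iters):
        # weight update: estimate each source's error variance from residuals,
        # weight ~ 1 / variance (the paper uses an upper confidence bound here)
        sq_err = {s: [] for s in sources}
        for n, obs in claims.items():
            for s, v in obs.items():
                sq_err[s].append((v - truths[n]) ** 2)
        weights = {s: 1.0 / (float(np.mean(e)) + eps) for s, e in sq_err.items()}
        # truth update: weighted average of the claims for each entity
        truths = {
            n: sum(weights[s] * v for s, v in obs.items())
               / sum(weights[s] for s in obs)
            for n, obs in claims.items()
        }
    return truths, weights
```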
Experiments:
Authors obtained datasets with ground truths
For continuous data, they confirm experimentally that source errors follow a normal distribution
They show improvement over other TD methods
Truth Discovery from Multi-Sourced Text Data Based on Ant Colony Optimization (2019)¶
(C. Chang and J. Cao and G. Lv and N. Weng)
Aims to develop a TD algorithm for unstructured raw text data
Model (note: I have modified some notation to make it a bit more sane):
Set of questions \(\mathcal{Q}\)
Set of users \(\mathcal{U}\)
Set of answers \(\mathcal{A} = \{a_q^u : q \in \mathcal{Q}, u \in \mathcal{U}\}\)
Answers take the form of unstructured text
I guess users can abstain from a question by providing an empty string…
Approach:
Preprocessing: split answers into keywords, remove stopwords, and canonicalise synonyms (a minimal sketch of this step is given after this list)
For each question \(q\), let \(\mathscr{V}_q\) be the set of all such keywords across all users
Finding the truth amounts to finding a subset \(V_q^* \subseteq \mathscr{V}_q\) of ‘true’ keywords
An optimisation problem is formed, which seems similar to CRH.
They aim to solve the optimisation problem using ant colony optimisation.
An ant network is formed for each question. The keywords in \(\mathscr{V}_q\) form the edges (?). What the nodes represent is not explained very clearly.
This section is not laid out very well, considering it’s the main contribution of the approach
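Since the text-handling contribution boils down to the preprocessing step, here is a minimal sketch of it; the stopword list and synonym map are placeholders, and the paper does not say which resources it actually uses:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}   # toy list
SYNONYMS = {"usa": "united_states", "us": "united_states"}             # toy map

def answer_to_keywords(answer):
    """Tokenise an answer, drop stopwords, canonicalise synonyms."""
    tokens = re.findall(r"[a-z0-9]+", answer.lower())
    return {SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS}

def candidate_keywords(answers):
    """V_q: the keywords across all users' answers to one question, with
    counts of how many users mention each keyword."""
    counts = Counter()
    for text in answers.values():
        counts.update(answer_to_keywords(text))
    return counts
```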
Summary:
The paper is very poorly written
The ‘unstructured text’ contribution is just a preprocessing step which converts text into a bag of keywords… from then on it’s a standard optimisation-based TD problem, solved using ant colony optimisation
I only read it because it mentioned ant colony, expecting something cool…
Truth Discovery with Multiple Conflicting Information Providers on the Web (2008)¶
(Yin, Xiaoxin and Han, Jiawei and Yu, Philip S.)
The original TD paper
See notes on TD formulation page: Truth Discovery with Multiple Conflicting Information Providers on the Web (2008).
Semi-supervised Truth Discovery (2011)¶
(Yin, Xiaoxin and Tan, Wenzhao)
Truth discovery where a subset of facts are ground truths which are known to be true.
See notes on TD formulation page: Semi-supervised Truth Discovery (2011).
On the Discovery of Continuous Truth: A Semi-supervised Approach with Partial Ground Truths (2018)¶
(Yi Yang and Quan Bai and Qing Liu)
Summary:
Optimisation-based semi-supervised truth discovery algorithm for continuous data
Theoretical analysis of time complexity and convergence to an optimal solution
§2 Related Work is really nice – gives a good overview of the whole truth discovery field
In particular, notes the differences in outputs of TD algorithms:
Scoring method: assigns a score to each fact. A post-hoc process selects a ‘true’ fact for each object, based on the scores
Labelling: assigns a true fact to each object without giving all possible facts a score
Labelling is perhaps more suitable for continuous data, since it allows the estimated truth to be a fact not given by any source (e.g. could do weighted averaging of real numbers)
This mirrors my remarks in my undergrad dissertation, §2.2.
Input:
A source \(s\) makes an observation \(v_o^s \in \R\) for an object \(o\)
Set of objects \(\mathcal{O} = \mathcal{O}_g \cup \mathcal{O}_u\) is comprised of ground-truth objects (\(\mathcal{O}_g\)) and unknown-truth objects (\(\mathcal{O}_u\))
For \(o \in \mathcal{O}_g\), \(\bar{v}_o^*\) is the ground truth for object \(o\)
Output:
A set of estimated truths \(V_u^* = \{v_o^*\}_{o \in \mathcal{O}_u}\)
Weights \(W=\{w^s\}_{s \in \mathcal{S}} \subseteq \R\) are used in the algorithm (and could be considered part of the output, although the authors do not include this in their problem formulation)
Algorithm:
Optimisation problem similar to others e.g. CRH
Problem: find \(V_u^*, W\) to minimise the objective function, which is a sum of two components (the relative weight controlled by a hyperparameter \(\theta\))
Weighted sum of squared differences between source claims \(v_o^s\) and estimated truths \(v_o^*\) for each \(o \in \mathcal{O}_u\)
Same thing but for the ground truths \(\bar{v}_o^*\) over \(o \in \mathcal{O}_g\)
Exponential source weight constraint (c.f. Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery (2016) and Finding global optimum for truth discovery: entropy based geometric variance)
Standard iterative procedure: fix weights and update estimated truths to minimise objective, then vice versa
Truth update step: \(v_o^*\) is updated to a weighted sum of the \(v_o^s\):
\[v_o^* = \frac{ \sum_{s \in S_o}{w^s v_o^s} }{ \sum_{s \in S_o}{w^s} }\]where \(S_o\) is the set of sources providing an observation for object \(o \in \mathcal{O}_u\).
Source weight update step: see the paper for a derivation of the update step using Lagrange multipliers. It amounts to the following, for a source \(s\): i) look at the squared differences between claims and estimated truths (and ground truths, weighted by \(\theta\) as in the objective function), ii) take this quantity as a proportion of the summed quantities across all sources, iii) take the negative logarithm. (Both update steps are sketched below.)
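A transcription of the two update steps into code, assuming the \(\theta\)-weighting sits on the ground-truth term as in the objective; names and data layout are mine:

```python
import numpy as np

def update_truths(claims_u, weights):
    """Truth update: weighted average of claims for each unknown-truth object.
    claims_u: dict object -> {source: value} for objects in O_u."""
    return {
        o: sum(weights[s] * v for s, v in obs.items()) / sum(weights[s] for s in obs)
        for o, obs in claims_u.items()
    }

def update_weights(claims_u, claims_g, ground_truths, truths_u, theta, eps=1e-12):
    """Weight update: per-source squared error (ground-truth part scaled by
    theta), taken as a proportion of the total over all sources, then the
    negative logarithm. claims_g: dict object -> {source: value} for O_g."""
    sources = {s for obs in list(claims_u.values()) + list(claims_g.values()) for s in obs}
    loss = {s: 0.0 for s in sources}
    for o, obs in claims_u.items():
        for s, v in obs.items():
            loss[s] += (v - truths_u[o]) ** 2
    for o, obs in claims_g.items():
        for s, v in obs.items():
            loss[s] += theta * (v - ground_truths[o]) ** 2
    total = sum(loss.values()) + eps
    return {s: -np.log((loss[s] + eps) / total) for s in sources}
```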
Analysis:
Proof of convergence to optimal solution (invokes a result about coordinate descent and fills in the technicalities)
Time complexity analysis
Parameter sensitivity analysis:
Experimental analysis of the effect of \(\theta\) and the proportion of ground truths on error rate
Error decreases as the proportion of ground truths increases. It is hard to say much about the shape of this decrease, since only five data points are given
As \(\theta\) increases from 0 the error decreases suddenly, but increases again after a while. High error for high \(\theta\) may be ‘overfitting’, since the algorithm practically ignores non-ground truth objects
Truth Discovery and Copying Detection in a Dynamic World (2009)¶
(Dong, Xin Luna and Berti-Equille, Laure and Srivastava, Divesh)
Summary:
True information often changes over time (contact details and addresses change, restaurants open and close…)
Data sources may copy from one another
Copying from unreliable sources results in the propagation of misinformation
To copy or not to copy is not a dichotomy: sources may copy at times and act independently at others
Altogether, this paper aims to look at source claims over time and determine:
The evolving truths
The evolving copying relationship between sources
Problem formulation
Set of objects \(\mathcal{O}\), sources \(\mathcal{S}\)
An object \(O\) has an associated life span \(\langle (t_1,v_1),\ldots,(t_l, v_l)\rangle\) which describes the true value \(v_i\) at transition time \(t_i\) as this true value evolves
\(\bar{U}(S, t_i)\) denotes the updates source \(S\) provides in the time period \((t_{i-1}, t_i]\)
Slices:
Consider an object \(O\) and source \(S\). Look at all the time points where either the true value for \(O\) changes, or where \(S\) provides an update on \(O\); the disjoint time intervals formed from these points are called slices
A slice is captured if it ends with \(S\) updating to the true value; it is capturable if \(S\) provided an incorrect value at the start of the slice
Similarly, a slice is miscaptured if it ends with \(S\) updating to the wrong value; it is miscapturable if it is possible for \(S\) to update to the wrong value (always possible when there are more than two possible values for \(O\))
Source quality measures:
Coverage: The coverage of \(S\) is the proportion of capturable slices which \(S\) captures (across all objects)
Exactness: \(E(S)\) is 1 minus the proportion of miscapturable slices which are miscaptured.
Freshness: For \(\Delta \ge 0\), \(F(S, \Delta)\) is the proportion of captured slices which have length no greater than \(\Delta\) (i.e. instances where \(S\) updated to the correct value less than \(\Delta\) time after the true value changed). A toy computation of the three measures is sketched below
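A toy computation of the three measures, assuming the slices (and their capturable/captured/miscapturable/miscaptured flags) have already been extracted from the update streams; the data structure is mine and edge cases (e.g. no capturable slices) are not handled:

```python
from dataclasses import dataclass

@dataclass
class Slice:
    """One slice for a (source, object) pair, flagged as defined above."""
    length: float
    capturable: bool
    captured: bool
    miscapturable: bool
    miscaptured: bool

def coverage(slices):
    capturable = [s for s in slices if s.capturable]
    return sum(s.captured for s in capturable) / len(capturable)

def exactness(slices):
    miscapturable = [s for s in slices if s.miscapturable]
    return 1.0 - sum(s.miscaptured for s in miscapturable) / len(miscapturable)

def freshness(slices, delta):
    captured = [s for s in slices if s.captured]
    return sum(s.length <= delta for s in captured) / len(captured)
```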
Copying model:
Uses a Hidden Markov Model
Fix two sources \(S_1\), \(S_2\). The Markov states are
\(I\): represents that the sources are independent
\(C1_c\): represents that \(S_1\) is a copier of \(S_2\), and is actively copying at the current time
\(C1_{\neg c}\): represents that \(S_1\) is a copier of \(S_2\) but is not actively copying
\(C2_c, C2_{\neg c}\) defined similarly
Transition probabilities are detailed; they depend on three parameters
\(f\): probability that a copier actively copies at any particular time
\(t_c\): probability that a copier remains a copier
\(t_i\): probability that independent sources remain independent
Later on this model is extended to consider copying relations over time by adding states for each timestamp…
Summary of the rest (skimmed):
The whole model is very large and complex
Final algorithm is iterative:
Compute the CEF (coverage, exactness, freshness) measures of sources based on estimated object lifespans
Use HMM somehow to work out copying relationships (?)
Use this to work out estimated object lifespans
Repeat until lifespans converge
Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery (2016)¶
(Li, Yaliang and Li, Qi and Gao, Jing and Su, Lu and Zhao, Bo and Fan, Wei and Han, Jiawei)
Looks at TD with heterogeneous data types (e.g. continuous, categorical etc. all in the same dataset)
Formulation:
\(N\) objects each have \(M\) properties
\(K\) sources make observations, which give a value to a property of an object. The value for property \(m\) of object \(i\) provided by source \(k\) is \(v_{im}^{(k)}\).
Note: sources are assumed to provide values for all properties of all objects. The authors state this is done for simplicity, and their framework can easily be adapted to situations where sources comment on only a subset of object properties.
Source input can be seen as an \(N \times M\) matrix \(\mathcal{X}^{(k)}\) whose \((i, m)\)-entry is \(v_{im}^{(k)}\)
I am not really sure why the authors have a ‘2D’ formulation, where objects have properties. Seems it would be just as well to flatten it out to \(N \times M\) variables
Source weights \(\mathcal{W}=\{w_1,\ldots,w_K\}\) are reliability degrees of sources
Identified truths are given in a matrix \(\mathcal{X}^{(*)}\)
Optimisation framework: find \(\mathcal{X}^{(*)}, \mathcal{W}\) which minimise
\[\sum_{k=1}^{K}{ w_k \sum_{i=1}^{N}{ \sum_{m=1}^{M}{ d_m(v_{im}^{(k)}, v_{im}^{(*)}) } } }\]such that \(\delta(\mathcal{W}) = 1\).
Idea is to minimise the weighted distances between source claims and truths. When this sum is low, sources who deviate from the ‘truth’ will necessarily have low weight, whereas sources close to the truth (for whom the \(d_m\) term is small) can have high weight
In other words, it aims to find groups of sources with strong consensus, assigning high weight to those strongly in the group and low weight to others
Missing ingredients:
\(d_m(x, y)\) gives the distance between values \(x\) and \(y\) for the \(m\)-th property of objects. It is specific to the datatype of the property in question
\(\delta\) is a regularisation function. It excludes trivial weight assignments, e.g. \(w_k = 0\) for all \(k\).
Algorithm:
Two-step iterative procedure: fix \(\mathcal{X}^{(*)}\) and update \(\mathcal{W}\), then the other way around. The values to update to are chosen so that they minimise the relevant parts of the objective function (a minimal sketch is given after the example instances below)
Example instances:
For \(\delta\):
Exponential regularisation function: \(\delta(w_1,\ldots,w_K)=\sum_{k=1}^{K}{\exp(-w_k)}\). There is an explicit form for the optimal weights here (see the paper).
\(\ell_p\) norm regularisation. Here the optimal solution is to take a single source as the most reliable (\(w_k=1\)) and discard all others – this is not desirable.
For \(d_m\):
Categorical values: 0-1 loss function, or probability-based idea (see paper)
Continuous values: ‘normalised square loss’, and others…
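A minimal sketch of the resulting alternating updates for the special case of a single continuous property with squared loss and the exponential regularisation (so the heterogeneous distance functions \(d_m\) and the paper’s normalisation details are omitted); it also assumes every source claims every object:

```python
import numpy as np

def crh_continuous(X, n_iters=30, eps=1e-12):
    """X: (K, N) array; X[k, i] is source k's claim about object i.
    Returns estimated truths (N,) and source weights (K,)."""
    truths = X.mean(axis=0)                   # initialise truths by averaging
    for _ in range(n_iters):
        # weight step: w_k = -log(loss_k / total loss), the closed form
        # under the exponential regularisation constraint
        loss = ((X - truths) ** 2).sum(axis=1) + eps
        w = -np.log(loss / loss.sum())
        # truth step: weighted average of the claims for each object
        truths = (w[:, None] * X).sum(axis=0) / w.sum()
    return truths, w
```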
Empirical performance measures:
Error rate: for categorical data, look at percentage of predicted truths which are false
Mean normalised absolute distance: for continuous data, look at distance between predicted and true values for each object property, divide each by the variance (across different objects for the same property), and finally take the mean across properties
SLiMFast: Guaranteed Results for Data Fusion and Source Reliability (2017)¶
(Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, Christopher Ré)
Summary:
Statistical method
Can make use of ground truth
Can make use of “domain-specific features”: additional properties of sources (unrelated to their reports) which may provide a signal of their reliability
“True” source accuracy is modelled as the log odds of the probability that the source provides a correct value (it looks like this is constant across all objects)
Parameters to optimise over are a vector of source weights \({\langle}w_s{\rangle}_{s \in S}\) and feature weights \({\langle}w_k{\rangle}_{k \in K}\); these are combined to give an estimate of the source accuracy (a toy guess at the combination is sketched at the end of this section).
If enough ground truth is available, parameters are learned by empirical risk minimisation: maximise likelihood by looking just at ground truth objects, then take most likely value for non-ground-truth objects
Otherwise, EM is used
The authors give theoretical guarantees on the algorithm’s accuracy wrt finding true values and finding true source accuracies
Theoretical results require quite strong assumptions on source accuracies, but authors (claim to) show empirically that the method performs well without these assumptions
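My guess at the shape of the accuracy model (a logistic combination of a per-source weight and the domain-specific features); the paper’s exact parameterisation may differ, and all names here are mine:

```python
import numpy as np

def estimated_accuracy(w_source, w_feature, features):
    """Log odds of a source being correct modelled as a linear function of a
    per-source weight plus its features, squashed through a sigmoid.

    w_source : (S,)   per-source weights
    w_feature: (F,)   feature weights shared across sources
    features : (S, F) domain-specific feature matrix
    """
    log_odds = w_source + features @ w_feature
    return 1.0 / (1.0 + np.exp(-log_odds))
```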