Mainstream Truth Discovery Papers

Towards Student Learning Ability Estimation and Truth Discovery in Japanese Online Course (2020)

(Han, Lanling and Liu, Yanwei and Shi, Chunhui and Bi, Yang and Yu, Liang and Shen, Hua)

  • Summary:

    • Another optimisation-based truth discovery algorithm. The authors motivate it with a student/question setting: students of varying ability levels answer questions.

    • As usual, the goal is to jointly estimate the ability of students and the true answers to questions. I do not fully understand this setting: presumably whoever set the questions already knows the true answers…

    • Sources are split into two groups: authoritative and ordinary. The idea is that authoritative sources are better than ordinary ones, so it leans closer to semi-supervised truth discovery (where claims from authoritative sources act as ground truths).

  • Framework:

    • Terminology: objects (questions), sources (students) and observations (answers given by students)

    • Notation: \(N\) objects, \(P\) ‘ordinary’ sources and \(Q\) ‘authoritative’ sources.

      • The observation from the \(p\)-th ordinary source on the \(n\)-th object is \(o_n^p\); the observation from the \(q\)-th authoritative source is \(a_n^q\). Source weights are denoted \(w_p\) and \(r_q\). Estimated truths are denoted \(t_n\).

    • Optimisation model: basically the same as [LLG+16] and many other papers… A difference is that a factor \(\lambda > 0\) is applied to the sum of weighted distances of observations from ordinary sources. This controls the importance of the authoritative sources.

  • Experimental evaluation:

    • An education dataset (an online multiple-choice Japanese exam from the authors’ university), and a crowdsourced dataset from a “Who Wants to be a Millionaire” Android app. Neither dataset appears to be publicly available (no mention in the paper, at least).

    • Authors show that their algorithm has lowest error rate on these datasets compared to a number of baselines.

  • Criticisms:

    • Very similar to other work I have seen…

    • How did the authors split the sources into ‘authoritative’ and ‘ordinary’ in their experiments? Seems to rely on some prior knowledge of the dataset.

    • More generally, the splitting of sources seems to only amount to multiplying the weights of authoritative sources by \(\lambda\). It’s not clear to me how this differs at all from CRH by Li et al… (a plausible form of the objective is sketched below)
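
  • Objective sketch:

    • Based on the description above, the objective presumably looks something like the following (my reconstruction, not copied from the paper; the regularisation constraint \(\delta\) on the weights is an assumption, by analogy with CRH):

      \[\min_{\{w_p\}, \{r_q\}, \{t_n\}} \; \lambda \sum_{p=1}^{P}{\sum_{n=1}^{N}{w_p \, d(o_n^p, t_n)}} + \sum_{q=1}^{Q}{\sum_{n=1}^{N}{r_q \, d(a_n^q, t_n)}} \quad \text{s.t. } \delta(w, r) = 1\]

    • Written this way the criticism above is clearer: for fixed \(\lambda\), the ordinary-source terms are just CRH terms with rescaled weights.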

Federated Truth Inference over Distributed Crowdsourcing Platforms (2020)

(M. Yang and G. Liu and Y. -. P. Hong)

  • YAEMTDA (yet another expectation maximisation truth discovery algorithm)…

  • Summary:

    • Crowdsourcing where tasks are sent to different platforms. E.g. some users are on Mechanical Turk, some complete in-person surveys etc.

    • Truth inference is done in a distributed way (reduced communication costs and better privacy for participants)

  • Approach:

    • Workers are split up across \(K\) crowdsourcing platforms

    • The response of worker \(n\) to task \(m\) is

      \[x_{n,m} = \sum_{l=1}^{L}{s^{(l)}_{n,m}(a_{m} + w^{(l)}_{n,m})}\]

      where \(s^{(l)}_{n,m} \in \{0,1\}\) is an indicator variable giving the reliability level of worker \(n\) on task \(m\), and \(w^{(l)}_{n,m}\) is the noise in the response when the reliability level is \(l\)

    • Model assumes that the probability of worker \(n\) having reliability level \(l\) is independent of the task \(m\), so that the worker is described by a vector of probabilities \(\mathbf{\alpha}_n \in \R^L\)

    • Also assumes that the noise factors \(w_{n,m}^{(l)}\) are i.i.d Gaussians for all workers and tasks: \(w_{n,m}^{(l)} \sim \mathcal{N}(0, \sigma_l^2)\) where \(\sigma_l^2\) is the noise variance for workers of level \(l\)

    • Note: I don’t quite understand why the model is introduced in quite a general way (e.g. permits a worker to have different reliability levels for different tasks) but then restricts things with the independence assumptions. Would be clearer to just model things that way in the first place without the need for additional assumptions

    • Expectation maximisation applied to estimate

      • True responses to tasks

      • Probability distributions of reliability levels for each worker (the vectors \(\mathbf{\alpha}_n\))

      • Noise variance levels \(\sigma_l^2\)

    • It turns out that the updated parameter estimate \(\theta^{(t+1)}\) at each EM iteration can be calculated using just aggregated statistics from each of the \(K\) platforms, without needing the raw worker responses \(x_{n,m}\) to leave the platform (a toy illustration of this aggregation pattern is given at the end of this entry)

    • This supposedly preserves worker privacy, since their information is not collected on a central server. There is no mention of the privacy characteristics of the local statistics, though (e.g. things like differential privacy…)
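
  • Sketch:

    • A toy illustration of the aggregation pattern (not the paper’s actual update equations): it shows how a weighted-average truth estimate can be computed from per-platform sufficient statistics, so the raw responses \(x_{n,m}\) never leave a platform. The single reliability level and fixed weights are my simplifications.

      ```python
      import numpy as np

      def platform_statistics(responses, weights):
          """Local step on one platform.

          responses: (num_workers, num_tasks) array of responses x_{n,m}
          weights:   (num_workers,) current reliability weights of the local workers
          Returns per-task sufficient statistics (sum of w_n * x_{n,m}, sum of w_n);
          no raw responses are sent to the server.
          """
          weighted_sum = weights @ responses                       # shape (num_tasks,)
          weight_total = np.full(responses.shape[1], weights.sum())
          return weighted_sum, weight_total

      def aggregate(platform_stats):
          """Server step: combine statistics from all K platforms into truth estimates."""
          weighted_sum = sum(ws for ws, _ in platform_stats)
          weight_total = sum(wt for _, wt in platform_stats)
          return weighted_sum / weight_total                       # estimate per task

      # Example: two hypothetical platforms, three shared tasks
      rng = np.random.default_rng(0)
      truth = np.array([1.0, 2.0, 3.0])
      stats = []
      for num_workers in (5, 8):
          responses = truth + rng.normal(0, 0.5, size=(num_workers, 3))
          stats.append(platform_statistics(responses, np.ones(num_workers)))
      print(aggregate(stats))                                      # close to truth
      ```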

Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media (2017)

(Popat, Kashyap and Mukherjee, Subhabrata and Strotgen, Jannik and Weikum, Gerhard)

  • (Doesn’t describe itself as Truth Discovery, but in terms of the categorisation on this page I consider it close enough…)

  • Summary:

    • System does the following: given a textual claim, determine its credibility and give explanation for this verdict

    • Credibility is affected by

      • Language used

      • Reliability of sources which support/refute the claim

      • Temporal aspects of how claims spread across the web

  • Preliminaries

    • Set \(C\) of claims

    • Set \(WS\) of (web) sources

    • Set of articles \(A^t\) for each timestamp \(t\)

    • \(a_{ij}^t\) is the article from \(ws_j\) at time \(t\) which supports/refutes claim \(c_i\)

    • Binary random variables \(y_i^t, y_{ij}^t \in \{T,F\}\) represent the credibility of claim \(c_i\) and of article \(a_{ij}\) at time \(t\), respectively. \(y_i^t\) is supposed to be an aggregation of the \(y_{ij}^t\)

    • Problem statement:

      • Given the values of a subset of the \(y_{i}^t\), predict \(y_1^t\) for each timestep \(t\) (wlog I’m assuming that \(c_1\) is the claim being assessed)

  • Credibility assessment

    • Lists some linguistic features and how they affect credibility assessment (e.g. “may” softens the degree of commitment to a proposition)

    • Authors introduce an algorithm to calculate support and refutation scores for each article and claim (a rough code sketch follows this entry):

      • Generate ‘snippets’ of the article: all subsequences of up to four consecutive sentences

      • For each snippet \(s\), compute the ‘unigram and bigram overlap’ score \(o_s\) (this is not explained in any detail); discard snippets whose overlap score falls below a fixed threshold

      • Calculate the stance of each snippet using logistic regression, trained on data from popular fact-checking sites (I don’t have the background to really understand this step). This gives support and refute probabilities \(p_s^+, p_s^-\)

      • Order the snippets according to \(\max\{p_s^+ \cdot o_s, p_s^- \cdot o_s\}\)

      • Select the top \(k\) snippets (these will later be used for the evidence) and return the averages of \(p_s^+\) and \(p_s^-\)

    • Source reliability is computed wrt a labelling of the credibility of claims:

      • Take the number of articles from \(ws_j\) which support a true claim

      • Add the number of articles which refute a false claim

      • Divide by total number of articles from \(ws_j\)

      • C.f. ‘proportion’ operator in question-answering with abstentions…

      • I don’t quite understand what the authors do here: in the previous section an article has support and refutation scores, not an absolute status
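
  • Sketch of the snippet scoring:

    • A rough sketch in code of the support/refutation scoring described above. The overlap scorer and the stance classifier are assumed to be given (the paper trains the latter on fact-checking data); the names and thresholds here are placeholders, not the paper’s.

      ```python
      def score_article(sentences, overlap_score, stance_probs, k=5, overlap_threshold=0.1):
          """sentences:     the article, split into sentences
          overlap_score: function snippet -> unigram/bigram overlap score o_s with the claim
          stance_probs:  function snippet -> (p_plus, p_minus), e.g. a trained stance classifier
          Returns averaged support/refute probabilities over the top-k snippets, plus the
          snippets themselves (the evidence used for explanations)."""
          # 1. Generate snippets: all runs of up to four consecutive sentences
          snippets = [
              " ".join(sentences[i:i + length])
              for length in range(1, 5)
              for i in range(len(sentences) - length + 1)
          ]
          # 2. Keep only snippets with sufficient overlap with the claim
          scored = []
          for s in snippets:
              o = overlap_score(s)
              if o < overlap_threshold:
                  continue
              p_plus, p_minus = stance_probs(s)
              # 3. Rank by max(p+ * o, p- * o)
              scored.append((max(p_plus * o, p_minus * o), p_plus, p_minus, s))
          top = sorted(scored, reverse=True)[:k]
          if not top:
              return 0.0, 0.0, []
          support = sum(p for _, p, _, _ in top) / len(top)
          refute = sum(p for _, _, p, _ in top) / len(top)
          return support, refute, [s for _, _, _, s in top]
      ```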

FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation (2015)

(Ma, Fenglong and Li, Yaliang and Li, Qi and Qiu, Minghui and Gao, Jing and Zhi, Shi and Su, Lu and Zhao, Bo and Ji, Heng and Han, Jiawei)

  • Summary:

    • FaitCrowd: Fine Grained Crowdsourced data aggregation

    • Challenges the widespread TD assumption that source reliability is uniform across objects (‘tasks’ in the crowdsourcing parlance)

      • A source might be an expert (high reliability) in one area, but useless in another (low reliability)

      • (I seem to recall this issue also being discussed in Pasternack’s thesis)

      • Instead, need topic-aware reliability measures

    • Automatically learns topics, reliability levels and true answers

  • Formulation:

    • Input:

      • Set of questions \(\mathcal{Q}\)

      • Sources \(\mathcal{U}\)

      • Answers \(\mathcal{A} = \{a_{qu} : q \in \mathcal{Q}, u \in \mathcal{U}\}\)

      • Each question \(q\) has \(M_q\) words

      • The number of topics \(K\) (note: questions are not assigned a topic in the input)

    • Output:

      • Estimated truth \(t_q\) for each question \(q\) (note: they use the term trustworthy answers)

      • Topic \(z_q\) for each question \(q\)

      • Topical expertise matrix \(e \in \R^{K \times |\mathcal{U}|}\), i.e. a reliability score for each source on each topic

    • Model:

      • Complex statistical model which I do not understand

      • Iterative algorithm

On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach (2012)

(Wang, Dong and Kaplan, Lance and Le, Hieu and Abdelzaher, Tarek)

  • Summary:

    • Widely cited MLE-based TD method. I think it is actually the first MLE TD paper

    • Reliability of a source is the probability that they report correct observations

    • Just looks at binary measurements

    • Provides optimal procedure (in the MLE sense) for truth discovery

  • Preliminaries:

    • \(M\) sources \(S_1,\ldots,S_M\) and \(N\) variables \(C_1,\ldots,C_N\) each with two possible values: true or false

    • \(S_iC_j\) denotes that source \(i\) claims \(C_j\) is true. Note that sources never claim variables are false: this is the default position in the absence of \(S_iC_j\)

    • \(P(C^t_j)\) and \(P(C^f_j)\) denote the probability that \(C_j\) is true or false respectively [isn’t it the case that \(P(C^f_j)=1-P(C^t_j)\)?]

    • \(s_i\) is the probability that \(S_i\) will claim a variable is true. The reliability \(t_i\) is the probability that \(S_i\) makes the correct observation: \(t_i = P(C^t_j \mid S_iC_j)\)

    • Some other notions defined…

  • Expectation maximisation

    • Rough outline is as follows…

    • We have an observed dataset \(X\), latent variables \(Z\) and parameters \(\theta\)

    • The likelihood function is \(L(\theta; X, Z) = P(X, Z \mid \theta)\); marginalising out the latent variables gives \(L(\theta; X) = P(X \mid \theta) = \sum_{Z}{P(X, Z \mid \theta)}\)

    • That is, the probability that we would observe \(X\) and latent variables \(Z\) if the parameters of the model were \(\theta\)

    • In the TD case \(X\) is the TD dataset, represented as a binary source-claims matrix. The latent variable \(Z\) consists of the truth status of each variable. The parameters \(\theta\) include reliability of sources and background bias \(d\) (i.e. \(d\) is the probability that a variable is true generally).

    • EM algorithm is iterative:

      • Initialise an estimate for \(\theta^{(1)}\)

      • While convergence criteria not satisfied:

        • (E step): Define \(Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}[\log{L(\theta; X, Z)}]\). That is, we look at the expected value of the log-likelihood, taken with respect to the probability distribution over \(Z\) conditioned on the observed data \(X\) and the current estimated parameters \(\theta^{(t)}\).

        • (M step): Set \(\theta^{(t+1)} = \text{argmax}_{\theta}{Q(\theta \mid \theta^{(t)})}\).

    • Applied to this TD model, the authors obtain an explicit form for \(Q\) and the argmax, leading to a simple iterative algorithm (a generic version is sketched at the end of this entry)

  • Evaluation:

    • The algorithm is compared against others in the literature on simulated data with respect to various metrics

    • The algorithm is also run on a real-world dataset from Twitter
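
  • Sketch:

    • A generic EM loop for a binary source/claim model in the spirit of the above (my own sketch: the paper parameterises things via \(t_i\), \(s_i\) and \(d\) and derives the updates from the explicit \(Q\) function, so the details differ, but the structure is the same):

      ```python
      import numpy as np

      def em_truth_discovery(SC, num_iters=50):
          """SC: (M, N) binary matrix; SC[i, j] = 1 iff source i claims variable C_j is true.

          Latent Z_j = P(C_j true | data). Parameters: per-source probabilities of
          claiming 'true' given the variable is true (a) or false (b), and the
          background bias d = P(variable is true)."""
          SC = np.asarray(SC, dtype=float)
          M, N = SC.shape
          a = np.full(M, 0.7)     # initial guess: P(claims true | variable true)
          b = np.full(M, 0.3)     # initial guess: P(claims true | variable false)
          d = 0.5                 # background bias
          eps = 1e-9
          for _ in range(num_iters):
              # E step: posterior probability that each variable is true
              log_t = np.log(d + eps) + SC.T @ np.log(a + eps) + (1 - SC.T) @ np.log(1 - a + eps)
              log_f = np.log(1 - d + eps) + SC.T @ np.log(b + eps) + (1 - SC.T) @ np.log(1 - b + eps)
              Z = 1.0 / (1.0 + np.exp(log_f - log_t))
              # M step: re-estimate parameters from the expected truth assignments
              a = (SC @ Z) / (Z.sum() + eps)
              b = (SC @ (1 - Z)) / ((1 - Z).sum() + eps)
              d = Z.mean()
          return Z, a, b, d
      ```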

Finding global optimum for truth discovery: entropy based geometric variance

(Hu Ding, Jing Gao, and Jinhui Xu)

  • Source claims are given as vectors \(\{p_1,\ldots,p_n\}\subseteq \R^d\) (i.e. there are \(n\) sources, \(d\) objects and facts are real-valued).

    • This seems to require that a source makes a claim for every object, which seems weird…

  • Formulates truth discovery as an optimisation problem: find source reliability weights \(w_1,\ldots,w_n\) and the truth vector \(p^*\) which minimise \(\sum_{i=1}^n{w_i \|p_i - p^*\|^2}\) subject to the normalisation constraint \(\sum_{i=1}^n{e^{-w_i}} = 1\).

    • Some justification is given for the exponential normalisation function. Firstly, it prevents degenerate solutions that put all the weight on a single source (e.g. \(w_1=1\) and \(w_i=0\) for \(i > 1\)). I do not understand the other reasons, but it is linked with entropy

  • Finds an equivalent formulation of the optimisation problem, which is non-convex in general

  • Provides (pretty complicated) approximate algorithms for the optimisation problem

  • Very opinionated view of truth discovery. They define truth discovery as their particular optimisation problem, and for this problem there is a fixed link between the source weights and the truth vector, so that it is enough to find just one of them, I think (see the derivation sketched below).

  • Seems to assume the optimal vector \(p^*\) exists without justification, unless I’m reading it wrong… Furthermore there are two algorithms presented for two different cases, but knowing which case to apply seems to rely on already knowing \(p^*\).
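
  • Regarding the ‘fixed link’ above: for fixed \(p^*\), minimising \(\sum_{i=1}^n{w_i \|p_i - p^*\|^2}\) over the weights subject to \(\sum_{i=1}^n{e^{-w_i}} = 1\) gives, by a quick Lagrange-multiplier computation (my derivation, not quoted from the paper),

    \[w_i = -\log{\left( \frac{\|p_i - p^*\|^2}{\sum_{j=1}^{n}{\|p_j - p^*\|^2}} \right)}\]

    so the weights are determined by \(p^*\); conversely, for fixed weights the optimal \(p^*\) is the weighted mean of the \(p_i\).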

A Faster Algorithm for Truth Discovery via Range Cover

(Ziyun Huang, Hu Ding, Jinhui Xu)

  • Seems to be very much an extension of the above paper

  • TODO: finish reading

The Wisdom of Minority: Discovering and Targeting the Right Group of Workers for Crowdsourcing

(Hongwei Li, Bo Zhao, Ariel Fuxman)

  • Addresses a crowdsourcing quality problem

    • Closely related to TD, but the references are in crowdsourcing and seem disjoint from the TD literature

  • Goal is to identify characteristics of crowdsourcing workers (e.g. education level), and, for a given task, identify groups of workers whose characteristics mean they are likely to provide high-quality data. These groups are then targeted for future tasks.

  • Three stages…

  • Probing stage:

    • Worker characteristics are gathered from the workers’ profiles (e.g. age, location – the available data is platform-specific) or by a questionnaire

    • Part of the task is sent to all workers

  • Discovery stage:

    • From the probing stage, work out if there are groups of workers whose data is of high ‘quality’

    • This is the step related to TD

  • Targeting stage:

    • Send the rest of the task to the high-quality workers

  • Talks about EM for the ‘TD’ part, but I don’t think TD plays a role in their algorithms

  • Uses statistics beyond my level, so I have not read the technical parts in detail

A confidence-aware approach for truth discovery on long-tail data (2014)

(Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han)

  • A note on terminology: speaks of reliability of sources and trustworthiness of information

    • “Trust” is referring to the data, rather than sources! This is the other way around to our framework

  • Deals with the “long-tail phenomenon”:

    • Most sources only provide a few claims (“small sources”) and only a few sources make plenty of claims

    • Number of claims per source roughly follows a power law distribution in example datasets: something like \(y(x)=ax^{-b}\) where \(y(x)\) is the number of sources making \(x\) claims. That is, a small number of sources make a large number of claims

    • Aims to discount the effect of small sources while still taking them into consideration (ignoring them is not possible since the claims from small sources can form a significant portion of the dataset)

  • Gives a confidence interval for each reliability estimate, not just a point estimate

  • Formalism:

    • We have a set of entities \(\mathcal{N}\) (i.e. objects), sources \(\mathcal{S}\), and claims \(\mathcal{C}\)

    • Each claim is of the form \((n, s, x_n^s)\) for \(n \in \mathcal{N}\), \(s \in \mathcal{S}\). That is, source \(s\) claims that \(x_n^s\) is the true value for entity \(n\)

    • Output of TD is an estimated truth for each entity: \(\mathcal{X}=\{(n, x_n^*) : n \in \mathcal{N}\}\)

    • Note that sources weights are not considered to be part of the output, and are only used ‘internally’ to calculate the \(x_n^*\)

  • TD model:

    • Each source has an error distribution \(\epsilon_s\). The idea is that when making a claim, a source draws an error level \(e\) from this distribution and makes a claim \(e\) away from the truth

    • Error distributions are assumed to be normal with mean 0 (this means that sources do not have a tendency to be wrong on purpose, an assumption which may not always hold!)

    • Source reliability is inversely proportional to the variance of the error distribution \(\sigma_s^2\)

    • Combined error is \(\epsilon_{\text{combined}} = \sum_{s \in \mathcal{S}}{w_s \epsilon_s}\) for some weights \(w_s \in [0, 1]\) with all weights summing to 1

    • Aim is to choose weights to minimise the variance of the combined error. This can be expressed in terms of the \(\sigma_s^2\) (which are unknown)

    • Problem is formulated as an optimisation problem: find weights, estimate variances to minimise (estimated) variance of \(\epsilon_\text{combined}\)

    • (The objective function is modified a bit as they go into details, but above is the gist of it…)

  • Algorithm:

    • A closed-form solution for the optimisation problem is found

    • “Initial truths” are calculated through averaging or voting

    • Initial source weights are calculated (according to the optimisation problem solution) from initial truths and some probabilistic component relating to the number of claims made

    • Truths are updated as a weighted sum of claims

    • Repeat until a stopping criterion is met (a rough sketch of this loop follows this entry)

  • Experiments:

    • Authors obtained datasets with ground truths

    • For continuous data, they confirm experimentally that source errors follow a normal distribution

    • They show improvement over other TD methods
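
  • Sketch:

    • A rough sketch of the iterative loop described above, minus the confidence-interval correction for small sources that is the paper’s actual contribution. The weight update here is plain inverse-variance weighting, which is what minimising the variance of \(\epsilon_\text{combined}\) gives when source errors are assumed independent:

      ```python
      import numpy as np

      def naive_variance_td(claims, num_iters=20):
          """claims: dict source -> dict {entity: claimed value} (continuous data).
          Returns estimated truths per entity and (unnormalised) weights per source."""
          entities = sorted({n for c in claims.values() for n in c})
          # Initial truths: plain averaging of claims
          truths = {n: np.mean([c[n] for c in claims.values() if n in c]) for n in entities}
          weights = {s: 1.0 for s in claims}
          for _ in range(num_iters):
              # Estimate each source's error variance from residuals against current truths
              for s, c in claims.items():
                  var = np.mean([(c[n] - truths[n]) ** 2 for n in c]) + 1e-9
                  weights[s] = 1.0 / var      # inverse-variance weighting
              # Update truths as weighted averages of the claims
              for n in entities:
                  ws = [(weights[s], c[n]) for s, c in claims.items() if n in c]
                  truths[n] = sum(w * v for w, v in ws) / sum(w for w, _ in ws)
          return truths, weights
      ```

    • The long-tail problem shows up directly here: a source with only one or two claims can get a tiny residual variance by chance and hence a huge weight, which is exactly what the paper’s confidence-aware correction is for.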

Truth Discovery from Multi-Sourced Text Data Based on Ant Colony Optimization (2019)

(C. Chang and J. Cao and G. Lv and N. Weng)

  • Aims to develop a TD algorithm for unstructured raw text data

  • Model (note: I have modified some notation to make it a bit more sane):

    • Set of questions \(\mathcal{Q}\)

    • Set of users \(\mathcal{U}\)

    • Set of answers \(\mathcal{A} = \{a_q^u : q \in \mathcal{Q}, u \in \mathcal{U}\}\)

    • Answers take the form of unstructured text

      • I guess users can abstain from a question by providing an empty string…

  • Approach:

    • Preprocessing: split answers into keywords, remove stopwords, and canonicalise synonyms

    • For each question \(q\), let \(\mathscr{V}_q\) be the set of all such keywords across all users

    • Finding the truth amounts to finding a subset \(V_q^* \subseteq \mathscr{V}_q\) of ‘true’ keywords

    • An optimisation problem is formed, which seems similar to CRH.

    • They aim to solve the optimisation problem using ant colony optimisation.

      • An ant network is formed for each question. The keywords in \(\mathscr{V}_q\) form the edges (?). What the nodes represent is not explained very clearly.

      • This section is not laid out very well, considering it’s the main contribution of the approach

  • Summary:

    • The paper is very poorly written

    • The ‘unstructured text’ contribution is just a preprocessing step which converts text into a bag of keywords… from then on it’s a standard optimisation-based TD problem solved with ant colony optimisation

    • I only read it because it mentioned ant colony optimisation, expecting something cool…

Truth Discovery with Multiple Conflicting Information Providers on the Web (2008)

(Yin, Xiaoxin and Han, Jiawei and Yu, Philip S.)

Semi-supervised Truth Discovery (2011)

(Yin, Xiaoxin and Tan, Wenzhao)

On the Discovery of Continuous Truth: A Semi-supervised Approach with Partial Ground Truths (2018)

(Yi Yang and Quan Bai and Qing Liu)

  • Summary:

    • Optimisation-based semi-supervised truth discovery algorithm for continuous data

    • Theoretical analysis of time complexity and convergence to an optimal solution

  • §2 Related Work is really nice – gives a good overview of the whole truth discovery field

    • In particular, notes the differences in outputs of TD algorithms:

      • Scoring method: assigns a score to each fact. A post-hoc process selects a ‘true’ fact for each object, based on the scores

      • Labelling: assigns a true fact to each object without giving all possible facts a score

      • Labelling is perhaps more suitable for continuous data, since it allows the estimated truth to be a fact not given by any source (e.g. could do weighted averaging of real numbers)

      • This mirrors my remarks in my undergrad dissertation, §2.2.

  • Input:

    • A source \(s\) makes an observation \(v_o^s \in \R\) for an object \(o\)

    • Set of objects \(\mathcal{O} = \mathcal{O}_g \cup \mathcal{O}_u\) is comprised of ground-truth objects (\(\mathcal{O}_g\)) and unknown-truth objects (\(\mathcal{O}_u\))

    • For \(o \in \mathcal{O}_g\), \(\bar{v}_o^*\) is the ground truth for object \(o\)

  • Output:

    • A set of estimated truths \(V_u^* = \{v_o^*\}_{o \in \mathcal{O}_u}\)

    • Weights \(W=\{w^s\}_{s \in \mathcal{S}} \subseteq \R\) are used in the algorithm (and could be considered part of the output, although the authors do not include this in their problem formulation)

  • Algorithm:

    • Optimisation problem similar to others e.g. CRH

    • Problem: find \(V_u^*, W\) to minimise the objective function, which is a sum of two components (the relative weight controlled by a hyperparameter \(\theta\))

    • Standard iterative procedure: fix weights and update estimated truths to minimise the objective, then vice versa (sketched in code at the end of this entry)

      • Truth update step: \(v_o^*\) is updated to a weighted sum of the \(v_o^s\):

        \[v_o^* = \frac{ \sum_{s \in S_o}{w^s v_o^s} }{ \sum_{s \in S_o}{w^s} }\]

        where \(S_o\) is the set of sources providing an observation for object \(o \in \mathcal{O}_u\).

      • Source weight update step: see the paper for a derivation of the update step using Lagrange multipliers. Amounts to the following, for a source \(s\): i) look at the squared differences between claims and estimated truths (and ground truths, weighted by \(\theta\) as in the objective function), ii) look at this quantity as a proportion of the summed quantities across all sources, iii) take the negative logarithm

  • Analysis:

    • Proof of convergence to optimal solution (invokes a result about coordinate descent and fills in the technicalities)

    • Time complexity analysis

    • Parameter sensitivity analysis:

      • Experimental analysis of the effect of \(\theta\) and the proportion of ground truths on error rate

      • Error decreases as the proportion of ground truths increases. Hard to extract anything about the nature of the decrease, since only five data points are given

      • As \(\theta\) increases from 0 the error decreases suddenly, but increases again after a while. High error for high \(\theta\) may be ‘overfitting’, since the algorithm practically ignores non-ground truth objects
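
  • Sketch:

    • My reading of the two-step iteration above, in code (not the authors’ implementation; in particular I am guessing at exactly how \(\theta\) enters the weight update):

      ```python
      import numpy as np

      def semi_supervised_td(obs, ground_truths, theta=1.0, num_iters=20):
          """obs:           dict source -> dict {object: observed value v_o^s}
          ground_truths: dict {object: known true value} for objects in O_g
          theta:         relative weight of the ground-truth term in the objective"""
          objects = sorted({o for c in obs.values() for o in c})
          unknown = [o for o in objects if o not in ground_truths]
          # Initialise estimated truths for unknown-truth objects by plain averaging
          truths = {o: np.mean([c[o] for c in obs.values() if o in c]) for o in unknown}
          truths.update(ground_truths)        # ground truths stay fixed throughout
          weights = {s: 1.0 for s in obs}
          for _ in range(num_iters):
              # Weight update: negative log of each source's share of the total loss,
              # with losses on ground-truth objects weighted by theta
              losses = {
                  s: sum((theta if o in ground_truths else 1.0) * (v - truths[o]) ** 2
                         for o, v in c.items())
                  for s, c in obs.items()
              }
              total = sum(losses.values()) + 1e-12
              weights = {s: -np.log(losses[s] / total + 1e-12) for s in obs}
              # Truth update: weighted average of observations, unknown-truth objects only
              for o in unknown:
                  ws = [(weights[s], c[o]) for s, c in obs.items() if o in c]
                  truths[o] = sum(w * v for w, v in ws) / sum(w for w, _ in ws)
          return truths, weights
      ```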

Truth Discovery and Copying Detection in a Dynamic World (2009)

(Dong, Xin Luna and Berti-Equille, Laure and Srivastava, Divesh)

  • Summary:

    • True information often changes over time (contact details and addresses change, restaurants open and close, etc.)

    • Data sources may copy from one another

    • Copying from unreliable sources results in the propagation of misinformation

    • To copy or not to copy is not a dichotomy: sources may copy at times and act independently at others

    • Altogether, this paper aims to look at source claims over time and determine:

      • The evolving truths

      • The evolving copying relationship between sources

  • Problem formulation

    • Set of objects \(\mathcal{O}\), sources \(\mathcal{S}\)

    • An object \(O\) has an associated life span \(\langle (t_1,v_1),\ldots,(t_l, v_l)\rangle\) which describes the true value \(v_i\) at transition time \(t_i\) as this true value evolves

    • \(\bar{U}(S, t_i)\) denotes the updates source \(S\) provides in the time period \((t_{i-1}, t_i]\)

  • Slices:

    • Consider an object \(O\) and source \(S\). Look at all the time points where either the true value for \(O\) changes, or where \(S\) provides an update on \(O\); the disjoint time intervals formed from these points are called slices

    • A slice is captured if it ends with \(S\) updating to the true value; it is capturable if \(S\) provided an incorrect value at the start of the slice

    • Similarly, a slice is miscaptured if it ends with \(S\) updating to the wrong value; it is miscapturable if it is possible for \(S\) to update to the wrong value (always possible when there are more than two possible values for \(O\))

  • Source quality measures:

    • Coverage: The coverage of \(S\) is the proportion of capturable slices which \(S\) captures (across all objects)

    • Exactness: \(E(S)\) is 1 minus the proportion of miscapturable slices which are miscaptured.

    • Freshness: For \(\Delta \ge 0\), \(F(S, \Delta)\) is the proportion of captured slices which have length no greater than \(\Delta\) (i.e. instances where \(S\) updated to the correct value less than \(\Delta\) time after the true value changed). A toy computation of these three measures is sketched at the end of this entry

  • Copying model:

    • Uses a Hidden Markov Model

    • Fix two sources \(S_1\), \(S_2\). The Markov states are

      • \(I\): represents that the sources are independent

      • \(C1_c\): represents that \(S_1\) is a copier of \(S_2\), and is actively copying at the current time

      • \(C1_{\neg c}\): represents that \(S_1\) is a copier of \(S_2\) but is not actively copying

      • \(C2_c, C2_{\neg c}\) defined similarly

    • Transition probabilities are detailed; they depend on three parameters

      • \(f\): probability that a copier actively copies at any particular time

      • \(t_c\): probability that a copier remains a copier

      • \(t_i\): probability that independent sources remain independent

    • Later on this model is extended to consider copying relations over time by adding states for each timestamp…

  • Summary of the rest (skimmed):

    • The whole model is very large and complex

    • Final algorithm is iterative:

      • Compute the CEF (coverage, exactness, freshness) measures of sources based on estimated object lifespans

      • Use HMM somehow to work out copying relationships (?)

      • Use this to work out estimated object lifespans

      • Repeat until lifespans converge
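
  • Sketch of the CEF measures:

    • To make the slice-based definitions above concrete, a toy computation of coverage, exactness and freshness from already-extracted slices (extracting the slices from update histories is the fiddly part and is skipped here; this follows the definitions as written in these notes, not the paper’s exact formulas):

      ```python
      from dataclasses import dataclass

      @dataclass
      class Slice:
          """One slice for a (source, object) pair, with the flags defined above."""
          length: float          # duration of the slice
          capturable: bool       # source held a wrong value at the start of the slice
          captured: bool         # slice ends with the source updating to the true value
          miscapturable: bool    # the source could have updated to a wrong value
          miscaptured: bool      # slice ends with the source updating to a wrong value

      def coverage(slices):
          capturable = [s for s in slices if s.capturable]
          return sum(s.captured for s in capturable) / len(capturable)

      def exactness(slices):
          miscapturable = [s for s in slices if s.miscapturable]
          return 1 - sum(s.miscaptured for s in miscapturable) / len(miscapturable)

      def freshness(slices, delta):
          captured = [s for s in slices if s.captured]
          return sum(s.length <= delta for s in captured) / len(captured)
      ```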

Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery (2016)

(Li, Yaliang and Li, Qi and Gao, Jing and Su, Lu and Zhao, Bo and Fan, Wei and Han, Jiawei)

  • Looks at TD with heterogeneous data types (e.g. continuous, categorical etc. all in the same dataset)

  • Formulation:

    • \(N\) objects each have \(M\) properties

    • \(K\) sources make observations, which give a value to a property of an object. The value for property \(m\) of object \(i\) provided by source \(k\) is \(v_{im}^{(k)}\).

      • Note: sources are assumed to provide values for all properties of all objects. The authors state this is done for simplicity, and their framework can easily be adapted to situations where sources comment on only a subset of object properties.

      • Source input can be seen as an \(N \times M\) matrix \(\mathcal{X}^{(k)}\) whose \((i, m)\)-entry is \(v_{im}^{(k)}\)

      • I am not really sure why the authors have a ‘2D’ formulation, where objects have properties. Seems it would be just as well to flatten it out to \(N \times M\) variables

    • Source weights \(\mathcal{W}=\{w_1,\ldots,w_K\}\) are reliability degrees of sources

    • Identified truths are given in a matrix \(\mathcal{X}^{(*)}\)

    • Optimisation framework: find \(\mathcal{X}^{(*)}, \mathcal{W}\) which minimises

      \[\sum_{k=1}^{K}{ w_k \sum_{i=1}^{N}{ \sum_{m=1}^{M}{ d_m(v_{im}^{(k)}, v_{im}^{(*)}) } } }\]

      such that \(\delta(\mathcal{W}) = 1\).

      • Idea is to minimise the weighted distances between source claims and truths. When this sum is low, sources who deviate from the ‘truth’ will necessarily have low weight, whereas sources close to the truth (for whom the \(d_m\) term is small) can have high weight

        • In other words, it aims to find groups of sources with strong consensus, assigning high weight to those strongly in the group and low weight to others

      • Missing ingredients:

        • \(d_m(x, y)\) gives the distance between values \(x\) and \(y\) for the \(m\)-th property of objects. It is specific to the datatype of the property in question

        • \(\delta\) is a regularisation function. It excludes trivial weight assignments, e.g. \(w_k = 0\) for all \(k\).

  • Algorithm:

    • Two-step iterative procedure: fix \(\mathcal{X}^{(*)}\) and update \(\mathcal{W}\), then the other way around. The values to update to are chosen so that they minimise the relevant parts of the objective function (sketched in code at the end of this entry)

  • Example instances:

    • For \(\delta\):

      • Exponential regularisation function: \(\delta(w_1,\ldots,w_K)=\sum_{k=1}^{K}{\exp(-w_k)}\). There is an explicit form for the optimal weights here (see the paper).

      • \(\ell_p\) norm regularisation. Here the optimal solution is to take a single source as the most reliable (\(w_k=1\)) and discard all others – this is not desirable.

    • For \(d_m\):

      • Categorical values: 0-1 loss function, or probability-based idea (see paper)

      • Continuous values: ‘normalised square loss’, and others…

  • Empirical performance measures:

    • Error rate: for categorical data, look at percentage of predicted truths which are false

    • Mean normalised absolute distance: for continuous data, look at distance between predicted and true values for each object property, divide each by the variance (across different objects for the same property), and finally take the mean across properties
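
  • Sketch:

    • A minimal sketch of the two-step procedure with the exponential regularisation weight update (this is how I understand the ‘explicit form’ referred to above; the normalisation used in the continuous loss is my guess):

      ```python
      import numpy as np

      def crh(data, types, num_iters=20):
          """data:  dict source -> dict {(object, property): value}; as in the paper's
                 simplified formulation, every source reports every (object, property).
          types: dict {(object, property): 'continuous' or 'categorical'}"""
          keys = sorted(types)
          sources = sorted(data)

          def init_truth(k):
              vals = [data[s][k] for s in sources]
              if types[k] == 'continuous':
                  return float(np.mean(vals))
              return max(set(vals), key=vals.count)        # majority vote

          def loss(value, truth, k):
              if types[k] == 'continuous':
                  spread = np.var([data[s][k] for s in sources]) + 1e-9
                  return (value - truth) ** 2 / spread     # 'normalised square loss'
              return float(value != truth)                 # 0-1 loss

          truths = {k: init_truth(k) for k in keys}
          for _ in range(num_iters):
              # Weight update (exponential regularisation): w_k = -log(source loss / total loss)
              losses = {s: sum(loss(data[s][k], truths[k], k) for k in keys) for s in sources}
              total = sum(losses.values()) + 1e-12
              weights = {s: -np.log(losses[s] / total + 1e-12) for s in sources}
              # Truth update: weighted mean for continuous, weighted vote for categorical
              for k in keys:
                  if types[k] == 'continuous':
                      truths[k] = sum(weights[s] * data[s][k] for s in sources) / sum(weights.values())
                  else:
                      truths[k] = max({data[s][k] for s in sources},
                                      key=lambda v: sum(weights[s] for s in sources if data[s][k] == v))
          return truths, weights
      ```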

SLiMFast: Guaranteed Results for Data Fusion and Source Reliability (2017)

(Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, Christopher Ré)

  • Summary:

    • Statistical method

    • Can make use of ground truth

    • Can make use of “domain-specific features”: additional properties of sources (unrelated to their reports) which may provide a signal of their reliability

    • “True” source accuracy is modelled as the log odds of the probability the source provides a correct value (looks like this is constant across objects)

    • Parameters to optimise over are a vector of source weights \({\langle}w_s{\rangle}_{s \in S}\) and feature weights \({\langle}w_k{\rangle}_{k \in K}\); these are combined to give an estimate of the source accuracy.

    • If enough ground truth is available, parameters are learned by empirical risk minimisation: maximise the likelihood over the ground-truth objects only, then take the most likely value for each non-ground-truth object (roughly sketched at the end of this entry)

    • Otherwise, EM is used

    • Authors give theoretical results giving guarantees on the algorithm’s accuracy wrt finding true values and finding true source accuracies

    • Theoretical results require quite strong assumptions on source accuracies, but authors (claim to) show empirically that the method performs well without these assumptions
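
  • Sketch:

    • A rough sketch of the ERM path as I understand it. The exact model form (a logistic model over a per-source indicator concatenated with the domain-specific features) and the final aggregation step are my assumptions, not the paper’s precise formulation:

      ```python
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def slimfast_erm_sketch(reports, source_features, ground_truth):
          """reports:         list of (source, object, value) triples
          source_features: dict source -> vector of domain-specific features
          ground_truth:    dict {object: true value} for the labelled objects
          Returns a predicted value for each unlabelled object."""
          sources = sorted({s for s, _, _ in reports})
          src_index = {s: i for i, s in enumerate(sources)}
          dim = len(next(iter(source_features.values())))

          def featurise(s):
              x = np.zeros(len(sources) + dim)
              x[src_index[s]] = 1.0                   # per-source indicator -> source weight w_s
              x[len(sources):] = source_features[s]   # domain-specific features -> weights w_k
              return x

          # ERM on the ground-truth objects: learn to predict whether a report is correct
          X = np.array([featurise(s) for s, o, v in reports if o in ground_truth])
          y = np.array([int(v == ground_truth[o]) for s, o, v in reports if o in ground_truth])
          model = LogisticRegression(max_iter=1000).fit(X, y)

          # For unlabelled objects, pick the value whose supporting reports have the
          # highest total estimated log odds of being correct (a simplification of
          # 'take the most likely value')
          scores = {}
          for s, o, v in reports:
              if o in ground_truth:
                  continue
              log_odds = model.decision_function(featurise(s).reshape(1, -1))[0]
              scores.setdefault(o, {})
              scores[o][v] = scores[o].get(v, 0.0) + log_odds
          return {o: max(vals, key=vals.get) for o, vals in scores.items()}
      ```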