Stochastic record linkage
Given the features of two records, stochastic record linkage produces a weight that measures how close the two records are. The ultimate goal is to decide whether the two records refer to the same entity, which can be accomplished by building a threshold-based classifier on the weights.
We will show how to leverage two methods, emWeights and epiWeights, implemented in the RecordLinkage package.
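To make the idea of a threshold-based classifier concrete, here is a minimal base R sketch. It does not use the RecordLinkage package; the weights, the two cut-offs, the label names, and the helper classify_pairs are all hypothetical and chosen only for illustration.

```r
# Hypothetical weights for five record pairs (illustrative values only).
weights <- c(12.4, -3.1, 7.8, 0.5, -9.6)

# Assumed cut-offs; in practice they would be chosen by inspecting
# the distribution of the weights.
threshold_upper <- 7
threshold_lower <- 0

# Hypothetical helper: label each pair by comparing its weight to the cut-offs.
classify_pairs <- function(w, upper, lower = upper) {
  stopifnot(lower <= upper)
  ifelse(w >= upper, "link",
         ifelse(w >= lower, "possible", "non-link"))
}

labels <- classify_pairs(weights, threshold_upper, threshold_lower)
print(data.frame(weight = weights, label = labels))
```

Using two cut-offs leaves a middle band of pairs that can be sent for manual review; passing a single threshold collapses the classifier to a plain link/non-link decision.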
Expectation maximization method
The emWeights method uses the expectation maximization algorithm to derive weights that measure the closeness of two entities. It requires two conditional probabilities to be estimated: one for the match case and another for the non-match case.
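P(features | match = 0) and P(features | match = 1) are estimated using the expectation maximization algorithm. The weight of a record pair is calculated from the ratio P(features | match = 1) / P(features | match = 0), usually taken on a logarithmic scale, so that a larger weight indicates a more likely match. This approach is called the Fellegi-Sunter model.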
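The weight calculation can be sketched in base R as follows. This is only an illustration under stated assumptions: the per-feature probabilities m and u are made up rather than estimated by expectation maximization, the features are assumed to be conditionally independent, the ratio is taken on a log2 scale, and the helper fs_weight is hypothetical.

```r
# Assumed per-feature probabilities (made up, not estimated by EM).
# m[k] = P(feature k agrees | match = 1)
# u[k] = P(feature k agrees | match = 0)
m <- c(fname = 0.95, lname = 0.90, byear = 0.85)
u <- c(fname = 0.05, lname = 0.02, byear = 0.10)

# Hypothetical helper: weight of one binary agreement pattern (1 = agree, 0 = disagree).
fs_weight <- function(pattern, m, u) {
  stopifnot(length(pattern) == length(m), length(m) == length(u))
  # Under conditional independence, P(features | class) is a product over features.
  p_match    <- prod(ifelse(pattern == 1, m, 1 - m))
  p_nonmatch <- prod(ifelse(pattern == 1, u, 1 - u))
  log2(p_match / p_nonmatch)
}

w_all   <- fs_weight(c(1, 1, 1), m, u)  # every feature agrees
w_mixed <- fs_weight(c(1, 0, 1), m, u)  # last name disagrees
w_none  <- fs_weight(c(0, 0, 0), m, u)  # no feature agrees
print(c(all = w_all, mixed = w_mixed, none = w_none))
```

A pattern in which every feature agrees receives the largest weight and a pattern with no agreement the smallest, which is what lets a threshold on the weights separate likely matches from likely non-matches.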
> library(RecordLinkage)
> data("RLdata500...