Title: | Create Scale Linkage Scores |
---|---|
Description: | Perform a 'probabilistic' linkage of two data files using the scaling procedure described in Goldstein, H., Harron, K. and Cortina-Borja, M. (2017) <doi:10.1002/sim.7287>. |
Authors: | Chris Charlton [aut, cre], Harvey Goldstein [aut] |
Maintainer: | Chris Charlton <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0 |
Built: | 2024-11-06 03:56:01 UTC |
Source: | https://github.com/cran/Scalelink |
Builds the A* matrix
buildAstar(foinew, ldfnew, grainsize, debug)
foinew |
numeric |
ldfnew |
numeric |
grainsize |
integer determining minimum grain size for parallelisation |
debug |
Boolean indicating whether to output additional debugging information |
buildAstar takes a matrix representing the file of interest and a matrix representing the linking data file and creates a matrix that can then be used to generate linking scores. Reporting frequency as this occurs can be specified via the nreport option. This is implemented in C++ to provide a speed increase over an equivalent implementation written directly in R.
This function calculates a score from two files, the file of interest (FOI) and linkage data file (LDF).
calcScores(FOI, LDF, missing.value = NA, min.parallelblocksize = 1, output.varnames = NULL, debug = FALSE)
FOI |
A |
LDF |
A |
missing.value |
Value used to represent missing data; Defaults to NA |
min.parallelblocksize |
The minimum block size when splitting up the data across processors. You may wish to change this to optimise the allocation of processors. See https://rcppcore.github.io/RcppParallel/#tuning. |
output.varnames |
Labels to apply to function output; defaults to the column names of the FOI |
debug |
Boolean indicating whether to output additional debugging information |
A list containing a numeric vector of scores, one for each of the identifiers of interest, and a matrix containing A*.
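The `missing.value` argument tells `calcScores` which value marks missing data. As a minimal base-R sketch on invented data, the following counts how often a sentinel value occurs in each column before deciding what to pass. The data frame `toy` is hypothetical, and the sentinel -9.999e+29 is borrowed from the package example rather than taken from any real file.

```r
# Invented data: -9.999e+29 stands in for "missing", as in the package example.
sentinel <- -9.999e+29
toy <- data.frame(
  Day   = c(3, sentinel, 7, 1),
  Month = c(12, 4, sentinel, sentinel),
  Sex   = c(1, 2, 2, 1)
)

# Count how many times the sentinel occurs in each column.
sentinel_counts <- colSums(toy == sentinel)

# Optional recoding to NA, the documented default for missing.value.
toy_na <- toy
toy_na[toy_na == sentinel] <- NA

print(sentinel_counts)
print(colSums(is.na(toy_na)))
```

Whether recoding to NA gives the same scores as passing the sentinel through `missing.value` is not stated here, so the two routes are best treated as alternatives to verify rather than as equivalent.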
Goldstein H., and Charlton, C.M.J., (2017) Centre for Multilevel Modelling, University of Bristol.
File of interest data with 7742 records and 5 variables.
A data frame with 7742 observations on the following 5 variables:
id
Record Identifier (not used for linking).
Day
Day of Week.
Month
Month of Year.
Year
Year.
Sex
Gender, coded 1 = Male and 2 = Female.
The FOI dataset is one of the sample datasets provided with this package for demonstration purposes.
Synthetic data created by Harvey Goldstein
data(FOI, package = "Scalelink")
summary(FOI)
Linking data file data with 10000 records and 5 variables.
A data frame with 10000 observations on the following 5 variables:
id
Record Identifier (not used for linking).
Day
Day of Week.
Month
Month of Year.
Year
Year.
Sex
Gender, coded 1 = Male and 2 = Female.
The LDF dataset is one of the sample datasets provided with this package for demonstration purposes. This version includes records with missing data.
Synthetic data created by Harvey Goldstein
data(LDF, package = "Scalelink")
summary(LDF)
Linking data file with 8142 records and 5 variables.
A data frame with 8142 observations on the following 5 variables:
id
Record Identifier (not used for linking).
Day
Day of Week.
Month
Month of Year.
Year
Year.
Sex
Gender, coded 1 = Male and 2 = Female.
The LDFCOMP dataset is one of the sample datasets provided with this package for demonstration purposes. This version has the records containing missing data removed.
Synthetic data created by Harvey Goldstein
data(LDFCOMP, package = "Scalelink")
summary(LDFCOMP)
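The relationship between a linking data file with missing values and its complete-case counterpart can be illustrated in base R. This is only a sketch on an invented data frame (`toy_ldf` and its values are hypothetical), and it assumes missing values are coded as NA. It does not claim to reproduce how LDFCOMP was actually derived.

```r
# Toy stand-in for a linking data file; values are invented for illustration.
toy_ldf <- data.frame(
  id    = 1:5,
  Day   = c(3, NA, 7, 1, 5),
  Month = c(12, 4, NA, 6, 9),
  Year  = c(1990, 1985, 1979, 2001, 1996),
  Sex   = c(1, 2, 2, 1, NA)
)

idcols <- c("Day", "Month", "Year", "Sex")

# Count missing values per identifier column.
missing_per_column <- colSums(is.na(toy_ldf[, idcols]))

# Keep only records where every identifier is observed.
toy_ldfcomp <- toy_ldf[complete.cases(toy_ldf[, idcols]), ]

print(missing_per_column)
print(nrow(toy_ldfcomp))
```

The `id` column is left out of the completeness check because the datasets describe it as not used for linking.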
Scalelink is an R package to perform 'probabilistic' linkage of two data files using a scaling procedure.
With increasing availability of large data sets derived from administrative and other sources, there is an increasing demand for the successful linking of these to provide rich sources of data for further analysis. Variation in the quality of identifiers used to carry out linkage means that existing approaches are often based upon 'probabilistic' models, which rest on a number of assumptions and can make heavy computational demands. This package implements the method proposed in Goldstein, H., Harron, K. and Cortina-Borja, M. (2017), which suggests a new approach to classifying record pairs in linkage, based upon weights (scores) derived using a scaling algorithm. The proposed method does not rely on training data, is computationally fast, requires only moderate amounts of storage and has intuitive appeal.
Goldstein, H., Charlton, C.M.J. (2017) Scalelink: A Package to link data via scaling.
Goldstein, H., Harron, K. and Cortina-Borja, M. (2017). A scaling approach to record linkage. Statistics in Medicine. DOI: 10.1002/sim.7287
Chris Charlton [email protected]
Charlton, C.M.J., Goldstein H (2017) Centre for Multilevel Modelling, University of Bristol.
library(Scalelink)

## Set the number of CPU cores to use (omit to use all available)
RcppParallel::setThreadOptions(numThreads = 2)

data(FOI, package = "Scalelink")
data(LDFCOMP, package = "Scalelink")
idcols <- c("Day", "Month", "Year", "Sex")
result <- calcScores(FOI[, idcols], LDFCOMP[, idcols])
print(result$scores)

## Scalelink package provides two examples using synthetic data
## one with complete data and one containing missing values

## Not run:
## For a list of demo titles
demo(package = 'Scalelink')

## To run a demo
demo(Example1)

## Using your own data

## If you had the following files in your working directory:
## FOI:
## A space-delimited file of interest (NFOI x PFOI). NFOI is number of records
## IDENTIFIERS_FOI:
## A space-delimited file containing a row vector length PFOI with a 1 where it is an identifier
## LDF:
## A space-delimited linking data file (NLDF x PLDF). NLDF is number of records
## IDENTIFIERS_LDF:
## A space-delimited file containing a row vector length PLDF with a 1 where it is an identifier

## Then you can calculate scores as follows:
FOI <- read.table("FOI")
LDF <- read.table("LDF")
IDENTIFIERS_FOI <- read.table('IDENTIFIERS_FOI')
IDENTIFIERS_LDF <- read.table('IDENTIFIERS_LDF')
result <- calcScores(FOI[, which(IDENTIFIERS_FOI == 1)],
                     LDF[, which(IDENTIFIERS_LDF == 1)],
                     missing.value = -9.999e+029)

## To view the scores:
print(round(result$scores, 2))

## To view the A* matrix:
print(result$astar)
## End(Not run)
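The file layout described in the "Using your own data" comments can be tried out without real data. The sketch below writes two invented files to a temporary directory, reads them back with `read.table`, and uses the 0/1 indicator row to pick the identifier columns. The file contents and the names `toy_foi`, `toy_flags` and `FOI_in` are hypothetical; only base R is used, so the scoring step itself is not run.

```r
# Write toy input files to a temporary directory (contents are invented).
tmp <- tempfile("scalelink_demo_")
dir.create(tmp)

toy_foi <- data.frame(id = 1:3, Day = c(1, 15, 28), Month = c(2, 7, 11),
                      Year = c(1980, 1992, 2003), Sex = c(1, 2, 1))
# Indicator row: 0 for the id column, 1 for each identifier column.
toy_flags <- matrix(c(0, 1, 1, 1, 1), nrow = 1)

write.table(toy_foi, file.path(tmp, "FOI"),
            row.names = FALSE, col.names = FALSE)
write.table(toy_flags, file.path(tmp, "IDENTIFIERS_FOI"),
            row.names = FALSE, col.names = FALSE)

# Read the files back the same way as in the package example.
FOI_in <- read.table(file.path(tmp, "FOI"))
IDENTIFIERS_FOI <- read.table(file.path(tmp, "IDENTIFIERS_FOI"))

# For a one-row table of 0/1 flags, which() gives the flagged column positions.
id_positions <- which(IDENTIFIERS_FOI == 1)
foi_identifiers <- FOI_in[, id_positions]

print(id_positions)
print(dim(foi_identifiers))
```

The same selection would apply to the linking data file and its indicator row before both subsets are passed to `calcScores`.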