Record linkage raises many privacy concerns, especially when it involves personal data. Personal data contains attributes such as SSN, DOB, and ZIP code that can uniquely identify a person during the linkage process. At the same time, these same attributes make identity disclosure or identity inference possible.
Because real public data sets omit DOB and SSN information, they are of limited use for research on record linkage. We therefore saw the need for synthetic data generation, which lets us add realistic attributes to the available public data sets. This will allow graduate students to continue working on record linkage and data visualization problems without compromising privacy, while preserving the accuracy and complexity of the problem. However, synthetic data generation can be as complex as the record linkage problem itself, because real data contains missing, erroneous, and changing values.
To simulate real-world data, we started with the voter registration data of one of the large counties in the US. This public data does not include the DOB or SSN of voters, but it does include their age. Our goal is to generate realistic DOBs for the voters in this data set. We did this in two steps:
1. Populate a DOB column:
1.1 Because the public data set gives the age of each voter, we used that age to derive the year of birth:
Year of Birth (YYYY) = Current year (2012) - Age
This is an approximation: if the age was recorded in 2012 before the voter's birthday, the true year of birth is one year earlier.
1.2 We extracted only the DD and MM values from the DOB column of an existing real data set (one unrelated to the voter registration data set) and randomly attached these DD and MM values to the YYYY values generated in step 1.1.
2. Perturb the DOB column to make it realistic:
2.1 Make some DOB values missing.
2.2 Introduce substitution errors by replacing some digits in the DOB with other digits.
2.3 Introduce transposition errors.
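The two steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the DOB is assumed to be stored as an 8-digit MMDDYYYY string, a transposition is assumed to mean swapping two adjacent digits, and all function names and error rates are hypothetical placeholders.

```python
import random

REFERENCE_YEAR = 2012  # the "current year" from step 1.1


def populate_dob(age, day_month_pool, rng):
    """Steps 1.1-1.2: derive YYYY from the age and attach a random (DD, MM).

    day_month_pool is a list of (day, month) pairs taken from another data
    set. The MMDDYYYY string layout is an assumption of this sketch.
    """
    year = REFERENCE_YEAR - age
    day, month = rng.choice(day_month_pool)
    return f"{month:02d}{day:02d}{year:04d}"


def substitute_digit(dob, rng):
    """Step 2.2: replace one randomly chosen digit with a different digit."""
    pos = rng.randrange(len(dob))
    new_digit = rng.choice([d for d in "0123456789" if d != dob[pos]])
    return dob[:pos] + new_digit + dob[pos + 1:]


def transpose_digits(dob, rng):
    """Step 2.3: swap two adjacent digits (assumed meaning of transposition)."""
    # Only consider pairs of unequal digits so the swap changes the value.
    candidates = [i for i in range(len(dob) - 1) if dob[i] != dob[i + 1]]
    if not candidates:
        return dob
    i = rng.choice(candidates)
    return dob[:i] + dob[i + 1] + dob[i] + dob[i + 2:]


def perturb(dob, rng, p_missing=0.05, p_substitute=0.03, p_transpose=0.02):
    """Step 2: apply at most one error type to a DOB value.

    The default rates are placeholders, not measured values.
    Returns None for a missing value (step 2.1).
    """
    r = rng.random()
    if r < p_missing:
        return None
    if r < p_missing + p_substitute:
        return substitute_digit(dob, rng)
    if r < p_missing + p_substitute + p_transpose:
        return transpose_digits(dob, rng)
    return dob


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run is reproducible
    pool = [(15, 7), (3, 11), (28, 2)]  # made-up (DD, MM) pairs
    ages = [30, 45, 67, 18]  # made-up voter ages
    clean = [populate_dob(age, pool, rng) for age in ages]
    noisy = [perturb(dob, rng) for dob in clean]
    print(clean)
    print(noisy)
```

Note that this sketch does not check calendar validity: a 29 February drawn from the pool can be attached to a non-leap year, and a perturbed value may not be a valid date at all.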
Further investigation is needed on the following:
1. The data error patterns that occur for DOB in real-world data.
2. The approximate average percentage of each type of data error, such as missing data and transposed data.
3. A way to measure how close the synthetic data is to the real data.
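For the third item, one possible starting point is to compare the distribution of a single field (for example, birth month, or whether the DOB is missing) between a real sample and the synthetic data. The sketch below uses total variation distance for that comparison. This choice of measure is an assumption made here for illustration, not a method established in this work, and the function names are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each distinct value (None counts as a category)."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between the two share tables.

    0.0 means the two samples have identical category shares;
    1.0 means they share no categories at all.
    """
    p = category_shares(real_values)
    q = category_shares(synthetic_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A value near 0 would suggest that the synthetic field resembles the real one on that single attribute. It says nothing about joint patterns across fields, which would need a separate check.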