Record Linkage using Probabilistic Methods and Data Mining Techniques
Corporations and organizations acquire large amounts of information every day and store it in many large databases (DBs). These databases are mostly heterogeneous, and the same data are often represented differently across them. The data may also simply be inaccurate, so the databases need to be cleaned. When working with large-scale surveys, record linkage is considered part of the data cleaning phase, which is in turn treated as a data mining step. Record linkage is an important process in data integration; it consists of finding duplicate records and finding records that match across sources. The process can be divided into two main steps: Exact Record Linkage, which finds all pairs of records that match exactly, and Probabilistic Record Linkage, which matches records that are not exactly equal but have a high probability of referring to the same entity. In recent years, record linkage has become an important part of data mining tasks. As databases become more and more complex, finding matching records is a crucial task. Comparing every possible pair of records in large databases is infeasible, whether the comparison is manual or automatic. Special algorithms (blocking methods) therefore have to be used to reduce the computational complexity of the comparison space. This paper discusses the deterministic and probabilistic methods used for record linkage, as well as different supervised and unsupervised techniques. Results from linking two real-world datasets (the Albanian Population and Housing Census 2011 and the list of farmers registered by the Food Safety and Veterinary Institute) are presented.
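The two-step process and the role of blocking described above can be illustrated with a minimal Python sketch. It is not the method used in the paper: the records, the field names, the blocking key, the threshold, and the agreement/disagreement probabilities (m and u) are all hypothetical, and the log-ratio weighting is a common Fellegi-Sunter-style scoring assumed here for illustration only.

```python
from collections import defaultdict
from math import log2

# Hypothetical toy records; field names and values are illustrative only.
census = [
    {"id": "c1", "name": "arben hoxha", "birth_year": 1970, "district": "tirane"},
    {"id": "c2", "name": "mira dervishi", "birth_year": 1982, "district": "durres"},
    {"id": "c3", "name": "ilir basha", "birth_year": 1965, "district": "tirane"},
]
farmers = [
    {"id": "f1", "name": "arben hoxha", "birth_year": 1970, "district": "tirane"},
    {"id": "f2", "name": "mira dervishi", "birth_year": 1983, "district": "durres"},
    {"id": "f3", "name": "genc leka", "birth_year": 1990, "district": "tirane"},
]

FIELDS = ("name", "birth_year", "district")

# Assumed probabilities per field: m = P(agree | true match),
# u = P(agree | non-match). In practice these would be estimated from data.
M_U = {"name": (0.95, 0.01), "birth_year": (0.90, 0.05), "district": (0.98, 0.30)}


def candidate_pairs(left, right, key):
    """Blocking: only compare records that share the same value of `key`."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[rec[key]].append(rec)
    for a in left:
        for b in blocks.get(a[key], []):
            yield a, b


def is_exact_match(a, b):
    """Deterministic step: every compared field must be identical."""
    return all(a[f] == b[f] for f in FIELDS)


def match_weight(a, b):
    """Probabilistic step: sum of log2 agreement/disagreement weights."""
    total = 0.0
    for f in FIELDS:
        m, u = M_U[f]
        total += log2(m / u) if a[f] == b[f] else log2((1 - m) / (1 - u))
    return total


def link(left, right, block_key="district", threshold=3.0):
    """Return (exact matches, probable matches) as lists of id pairs."""
    exact, probable = [], []
    for a, b in candidate_pairs(left, right, block_key):
        if is_exact_match(a, b):
            exact.append((a["id"], b["id"]))
        elif match_weight(a, b) >= threshold:
            probable.append((a["id"], b["id"]))
    return exact, probable


exact, probable = link(census, farmers)
n_compared = sum(1 for _ in candidate_pairs(census, farmers, "district"))
```

With these toy inputs, blocking on the district field reduces the comparison space from 9 pairs (3 x 3) to 5. The choice of blocking key is a trade-off: a coarser key compares more pairs, while a finer key risks missing true matches whose key value was recorded differently.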
This work is licensed under Creative Commons Attribution 3.0 License.
Mediterranean Journal of Social Sciences ISSN 2039-9340(Print) ISSN 2039-2117(Online)
Copyright © MCSER-Mediterranean Center of Social and Educational Research