MS, University of Cincinnati, 2001, Engineering : Computer Engineering
Organizations increasingly need, and benefit from, integrated access to multiple data sources. Database integration has two aspects: schema integration and data integration. Schema integration arrives at a common schema representing the elements of the source schemas. Data integration involves detecting and merging multiple instances of the same real-world entities from different databases. Entity identification is necessary when there is no common means of identification such as primary keys, and it is usually solved manually. This thesis focuses on solving the entity identification problem in an automated way using data mining techniques. We use automated learning techniques to identify characteristics or patterns found in entities and apply this knowledge to detect multiple instances of the same entity. The data mining techniques that we use are decision trees and k-nearest neighbors (k-NN). Our approach preprocesses the data before employing the data mining techniques: the preprocessing step clusters the data, and entity identification is then performed within each cluster. To study the performance of the proposed algorithms, we use a small database of 2,500 records and vary parameters such as training set size and number of unique entities in our experiments. Our experiments study the impact of our preprocessing algorithm when either a decision tree implementation or a k-NN implementation is used as the classification technique. We examine whether accuracy and processing speed are improved, unaffected, or adversely affected. For our testbed, decision trees run on the clustered data sets show significant savings in processing time compared to decision trees run on the unclustered data sets, for both small and large training set sizes.
On the other hand, the accuracy when using clustering is always less than that obtained without clustering, but the clustering accuracy approaches the accuracy of the non-clustered approach as the number ...
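The general idea described in this abstract (cluster the records first, then classify candidate record pairs only within each cluster) can be illustrated with a minimal sketch. This is not the thesis implementation: the blocking key (first letter of the name), the `difflib` string similarity, the field names, the sample records, and the hand-made training vectors are all assumptions chosen for illustration, and only the k-NN variant is shown.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import math


def similarity(a, b):
    """String similarity in [0, 1] (assumed measure, from the standard library)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(r1, r2, fields):
    """One similarity score per field for a pair of records."""
    return tuple(similarity(r1[f], r2[f]) for f in fields)


def knn_predict(train, x, k=3):
    """Majority vote among the k labeled feature vectors nearest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cluster_by_key(records, key_fn):
    """Group record ids by a cheap key (a stand-in for the clustering step)."""
    clusters = defaultdict(list)
    for rid, rec in records.items():
        clusters[key_fn(rec)].append(rid)
    return clusters


def find_matches(records, train, fields, key_fn, k=3):
    """Classify record pairs inside each cluster; pairs across clusters are skipped."""
    matches = []
    for ids in cluster_by_key(records, key_fn).values():
        for a, b in combinations(sorted(ids), 2):
            x = pair_features(records[a], records[b], fields)
            if knn_predict(train, x, k) == "match":
                matches.append((a, b))
    return sorted(matches)


# Hypothetical records; ids 1 and 2 are meant to be the same person.
records = {
    1: {"name": "Jon Smith", "city": "Cincinnati"},
    2: {"name": "John Smith", "city": "Cincinnati"},
    3: {"name": "Jane Doe", "city": "Columbus"},
    4: {"name": "Mary Jones", "city": "Dayton"},
}

# Hypothetical labeled training pairs: (name similarity, city similarity) -> label.
train = [
    ((1.00, 1.00), "match"),
    ((0.90, 0.80), "match"),
    ((0.85, 1.00), "match"),
    ((0.20, 0.10), "nonmatch"),
    ((0.30, 0.90), "nonmatch"),
    ((0.10, 0.20), "nonmatch"),
]

matches = find_matches(
    records, train, fields=("name", "city"),
    key_fn=lambda rec: rec["name"][0].lower(),
)
print(matches)
```

Restricting comparisons to pairs within a cluster is what reduces processing time, since far fewer pairs are classified. It is also one plausible source of lost accuracy: two records of the same entity that land in different clusters are never compared.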
Committee: Dr. Karen C. Davis (Advisor)
Subjects: Computer Science