Pre-trained models for linking process in data washing machine
Abstract
Entity Resolution (ER) has been investigated for decades in various domains as a fundamental task in data integration and data quality. The growing volume of heterogeneously structured and even unstructured data challenges traditional ER methods. This research focuses on the Data Washing Machine (DWM), which was developed under the NSF DART Data Life Cycle and Curation research theme to detect and correct certain types of data quality errors automatically. The DWM also performs unsupervised entity resolution to identify duplicate records. However, it relies on traditional methods driven by algorithmic pattern rules, such as Levenshtein edit distance and matrix comparators. The goal of this research is to assess whether replacing these rule-based methods with machine learning and deep learning methods improves the effectiveness of the DWM's processes, using 18 sample datasets. The DWM comprises several processes for improving data quality; the current work focuses on the scoring and linking processes. To integrate a machine learning model into the DWM, several pre-trained models were tested to find one that produces vectors accurate enough to calculate the similarity between records. DistilRoBERTa was chosen to generate the embeddings, and cosine similarity was then used to compute similarity scores between them. This allowed us to assess the machine learning model within the DWM, and its results were close to those produced by the existing scoring matrix. The model performed well overall, possibly because the embeddings capture the features that matter most for entity matching.
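The scoring and linking step described above can be sketched roughly as follows. This is a minimal illustration, not the DWM implementation: the function names `toy_embed`, `cosine_similarity`, and `link_pairs`, the threshold of 0.8, and the sample records are all assumptions made for the example. The embedder here is a hashed character-trigram stand-in so the sketch runs without downloading a model; in practice the `embed` argument would wrap a pre-trained encoder such as DistilRoBERTa (for instance through the sentence-transformers `SentenceTransformer(...).encode` call, with a checkpoint of your choosing, since the abstract does not name one).

```python
import math
from itertools import combinations


def toy_embed(text, dim=64):
    """Stand-in embedder (assumption): hashed character-trigram counts.

    This is NOT a language model. Swap in a function that returns a
    pre-trained model's embedding for each record.
    """
    padded = f"  {text.lower()}  "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        # Deterministic hash so results are reproducible across runs.
        idx = sum(ord(c) * (31 ** k) for k, c in enumerate(gram)) % dim
        vec[idx] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def link_pairs(records, embed=toy_embed, threshold=0.8):
    """Score every record pair and keep those at or above the threshold.

    The threshold value is a hypothetical default, not one taken from the DWM.
    """
    vectors = [embed(r) for r in records]
    links = []
    for i, j in combinations(range(len(records)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score >= threshold:
            links.append((i, j, score))
    return links


if __name__ == "__main__":
    # Hypothetical sample records for illustration only.
    sample = [
        "JOHN SMITH 12 OAK ST LITTLE ROCK",
        "Jon Smith 12 Oak Street Little Rock",
        "MARY JONES 400 PINE AVE CONWAY",
    ]
    for i, j, score in link_pairs(sample):
        print(f"records {i} and {j} linked, score={score:.3f}")
```

Comparing all pairs is quadratic in the number of records, so a real pipeline would normally restrict scoring to candidate pairs produced by a blocking step before linking.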
Copyright (c) 2024 Bushra Sajid, Ahmed Abu-Halimeh, Nuh Jakoet
This work is licensed under a Creative Commons Attribution 4.0 International License.