An automated data collection process for constructing graph data relying on LLMs
Abstract
This paper introduces a process designed to harvest data automatically from a variety of online sources. The core of the process is its data-handling pipeline, which collects, cleans, deduplicates, extracts, and categorizes raw data, converting unstructured data into a structured format that is represented in and imported into a graph database. The data extraction step uses Large Language Models (LLMs) for Named Entity Recognition (NER). A case study on course data collection illustrates the gains from this automation, showing improvements in the accuracy, completeness, and timeliness of updates to the course data. An evaluation of the extraction and matching methods shows high F1-scores and precision. Overall, this study contributes a methodology for automating the collection and processing of data from online sources and for improving the quality of the collected data.
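The cleaning, deduplication, graph-structuring, and evaluation steps named in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the `TEACHES` relation, and the 0.9 similarity threshold are all hypothetical, and Python's standard `difflib.SequenceMatcher` ratio stands in for whatever matching method the authors actually use. The LLM-based NER step and the graph database import are omitted because the abstract gives no details about them.

```python
import re
from difflib import SequenceMatcher


def clean(text: str) -> str:
    """Strip HTML-like tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def deduplicate(records: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a record only if it is not near-identical to one already kept.

    The threshold is an assumed value, and SequenceMatcher.ratio() is a
    stand-in similarity measure, not the paper's matching method.
    """
    kept: list[str] = []
    for record in records:
        is_duplicate = any(
            SequenceMatcher(None, record.lower(), seen.lower()).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


def to_triples(course: str, skills: list[str]) -> list[tuple[str, str, str]]:
    """Turn extracted entities into (subject, relation, object) triples.

    The relation name "TEACHES" is hypothetical.
    """
    return [(course, "TEACHES", skill) for skill in skills]


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for a set of extracted entities."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

In this sketch, a raw scraped title would pass through `clean`, the cleaned titles through `deduplicate`, and the entities extracted for each course through `to_triples` before being loaded into a graph store. `precision_recall_f1` shows how extraction output could be scored against a hand-labeled set.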
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.