Ho Ngoc Ton, Nguyen Hoang Son, Nguyen Ngoc Minh Chau and Pham-Nguyen Cuong*

* Corresponding author (pncuong@fit.hcmus.edu.vn)

Abstract

This paper introduces a process for automatically harvesting data from a variety of online sources. The core of the process is its data-handling pipeline, which collects, cleans, deduplicates, extracts, and categorizes raw data, converting unstructured data into a structured format that is represented in and imported into a graph database. The data extraction step uses Large Language Models (LLMs) for Named Entity Recognition (NER). A case study on course data collection illustrates the gains from this automation, showing improvements in the accuracy, completeness, and timeliness of course data updates. An evaluation of the extraction and matching methods shows high F1-scores and precision. Overall, the study contributes to the advancement of the field by providing a methodology for automating the collection and processing of data from online sources, significantly improving the quality of the data collected.

Keywords: Data collection process, graph data, large language model
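
The stages named in the abstract (cleaning, deduplicating, extracting, and producing graph-ready output) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name, the `TEACHES` relationship label, and the 0.9 similarity threshold are hypothetical. The similarity check uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matching method the paper evaluates, and a plain keyword lookup stands in for the LLM-based NER step.

```python
import re
from difflib import SequenceMatcher


def clean(text: str) -> str:
    """Remove HTML-like tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two records as duplicates if their similarity ratio is high.

    Assumption: SequenceMatcher replaces the paper's actual matching method.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each group of near-identical records."""
    kept: list[str] = []
    for record in records:
        if not any(is_duplicate(record, k) for k in kept):
            kept.append(record)
    return kept


def extract_entities(text: str, vocabulary: list[str]) -> list[str]:
    """Placeholder for LLM-based NER: return vocabulary terms found in text."""
    lowered = text.lower()
    return [term for term in vocabulary if term.lower() in lowered]


def to_triples(course: str, entities: list[str]) -> list[tuple[str, str, str]]:
    """Turn extracted entities into (node, relationship, node) triples."""
    return [(course, "TEACHES", entity) for entity in entities]


raw = ["<p>Intro to  Python</p>", "Intro to Python", "Graph Databases with Neo4j"]
courses = deduplicate([clean(r) for r in raw])
triples = [t for c in courses for t in to_triples(c, extract_entities(c, ["Python", "Neo4j"]))]
```

Each triple could then be loaded into a graph database as two nodes and one relationship. In a real deployment the keyword lookup would be replaced by a call to an LLM, and the threshold would need tuning against labeled duplicates.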
