Nguyen Tran Diem Hanh *

* Corresponding author (diemhanh_tvu@tvu.edu.vn)

Abstract

Information filtering (IF), which has been widely studied in recent years, applies document retrieval techniques to cope with very large volumes of information. An IF system has two major components: modelling the user’s interest and filtering relevant documents. Various approaches have been proposed for the first component. In this study, we used a topic-modelling technique, Latent Dirichlet Topic Modelling, to model the user’s interest in IF systems. In particular, we proposed an extended model for representing the user’s interest, Latent Dirichlet Topic Modelling with high Frequency Occurrences (abbreviated as LDA_HF), with the aim of improving the retrieval performance of IF systems. The new model was compared with existing methods for modelling user interest, such as BM25, pLSA, and LDA_IF, on the large benchmark datasets RCV1 and R8. Extensive experiments showed that the proposed model outperformed all of these state-of-the-art baseline models on four major evaluation metrics: Top20, B/P, MAP, and F1. Hence, LDA_HF is a promising and reliable method for enhancing the performance of IF systems.

Keywords: Information filtering, information retrieval, topic models, topic modelling
