Ngo-Ho Anh-Khoa, Vo Khuong-Duy, and Ngo-Ho Anh-Khoi*

* Corresponding author (ngohoanhkhoi@gmail.com)

Abstract

The application of generative Artificial Intelligence to building specialized Vietnamese-language chatbots is now an inevitable trend. One of the most challenging aspects of assessing the quality of a Vietnamese chatbot, however, is creating a specialized benchmark in question-and-answer format. Such a benchmark is typically crafted manually by domain experts, which can be extremely costly. For English, by contrast, bag-of-words toolkits and grammatical-structure models can automatically generate appropriate questions from pre-existing answers in the source data; for Vietnamese, almost no complete model exists for this task. Quality assessment itself is likewise usually performed manually by experts using Human Evaluation (HE) indicators, which is also costly. This study therefore proposes an algorithmic architecture designed specifically for the Vietnamese language. The architecture automatically generates question-and-answer pairs to build a benchmark, and it supports an automatic, straightforward, cost-effective, and accurate quality-assessment mechanism for Vietnamese chatbots. We call this system the Vietnamese Question/Answers Benchmark Generator (VQABG) and propose a novel evaluation indicator, the Exact Match with Numeric Information (EMINI).
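The EMINI indicator is only named in this abstract, not defined. As a rough illustration of what an exact-match score restricted to numeric information might look like, the following Python sketch extracts the numbers from a reference answer and scores a chatbot answer by the fraction of those numbers it reproduces exactly. The function names and the scoring rule here are our assumptions for illustration, not the paper's specification.

```python
import re

def extract_numbers(text: str) -> list[str]:
    """Pull numeric tokens (integers, decimals, comma-grouped figures)
    out of a text span and normalize the grouping separators."""
    tokens = re.findall(r"\d[\d.,]*", text)
    return [t.strip(".,").replace(",", "") for t in tokens]

def emini_like_score(candidate: str, reference: str) -> float:
    """Hypothetical EMINI-style score: 1.0 when the candidate answer
    reproduces every number in the reference, otherwise the fraction
    of reference numbers that appear in the candidate."""
    ref_nums = extract_numbers(reference)
    if not ref_nums:
        # No numeric content to check; fall back to plain exact match.
        return float(candidate.strip() == reference.strip())
    cand_nums = extract_numbers(candidate)
    matched = sum(1 for n in ref_nums if n in cand_nums)
    return matched / len(ref_nums)

# A paraphrased Vietnamese answer that preserves both figures scores 1.0.
print(emini_like_score(
    "Doanh thu đạt 1200 tỷ đồng trong năm 2023.",
    "Năm 2023, doanh thu đạt 1,200 tỷ đồng.",
))  # -> 1.0
```

The appeal of keying the comparison to numeric tokens, as the EMINI name suggests, is that a plain exact match penalizes any rewording, whereas this variant lets paraphrased answers pass as long as the factual figures are preserved.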

Keywords: Automatic question answering generator, Chatbot, Generative Artificial Intelligence, Vietnamese language
