Minh-Quan Tran * , Duy Truong , Duy-Tan Pham , Minh-Anh Nguyen , Duc-Tung Le , Di-Hao Le and Quang-Huy Duong

* Corresponding author: Minh-Quan Tran (email: 22521191@gm.uit.edu.vn)


Abstract

People with visual impairments often face significant challenges in identifying and accessing product information in their daily lives, particularly when visual cues such as packaging details, labels, or expiration dates are inaccessible. In this paper, we present NaviBlind, a multimodal AI-powered assistive system designed to help visually impaired individuals understand key product details through natural interactions. Our system combines image understanding based on Gemini Flash vision models with Vietnamese speech recognition powered by PhoWhisper, which extracts information needs directly from user voice commands. The user uploads an image of the product and speaks the kind of information needed, such as name, color, type, or expiry date; the system then analyzes the image and returns a concise, structured textual description, which is converted into Vietnamese speech. To ensure reliability, we incorporate mechanisms to detect uncertain or hallucinated outputs from the vision model, especially for low-quality images. The system is deployed as a user-friendly web application, enabling real-time accessibility for users with limited visual capabilities. Experimental evaluation demonstrates the potential of NaviBlind in promoting autonomy and independence for the visually impaired in everyday shopping and product recognition tasks.
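For illustration, the following is a minimal Python sketch of the pipeline described above. The checkpoint name (vinai/PhoWhisper-small), the Gemini model identifier (gemini-1.5-flash), the prompt, and the file names are assumptions for this sketch, not the authors' exact configuration.

```python
# Illustrative pipeline sketch: speech command -> image understanding -> Vietnamese speech.
import google.generativeai as genai          # pip install google-generativeai
from transformers import pipeline            # pip install transformers
from gtts import gTTS                        # pip install gTTS
import PIL.Image

# 1. Vietnamese speech recognition with PhoWhisper (Hugging Face checkpoint assumed).
asr = pipeline("automatic-speech-recognition", model="vinai/PhoWhisper-small")
user_request = asr("voice_command.wav")["text"]

# 2. Image understanding with a Gemini Flash vision model (model name assumed).
genai.configure(api_key="YOUR_API_KEY")
vision_model = genai.GenerativeModel("gemini-1.5-flash")
prompt = (
    "Describe only the requested product details concisely in Vietnamese. "
    f"User request: {user_request}"
)
response = vision_model.generate_content([prompt, PIL.Image.open("product.jpg")])
description = response.text
# (The paper's checks for uncertain or hallucinated outputs on low-quality
#  images are not reproduced in this sketch.)

# 3. Vietnamese text-to-speech with gTTS.
gTTS(description, lang="vi").save("answer.mp3")
```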

Keywords: Human-centered design, multimodal assistive AI, product accessibility, text-to-speech, Vietnamese speech recognition, vision-language models


References

Banerjee, S., & Lavie, A. (2005, June). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72).

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 4171–4186).

Doan, K. T., Huynh, B. G., Hoang, D. T., Pham, T. D., Pham, N. H., Nguyen, Q. T. M., Vo, B. Q., & Hoang, S. N. (2024). Vintern-1B: An efficient multimodal large language model for Vietnamese. arXiv preprint arXiv:2408.12480.

Google. (2018). Use Lookout to explore your surroundings. Android Accessibility Help. Google. https://support.google.com/accessibility/android/answer/9031274?hl=en

Google Cloud. (2018). Speech-to-Text AI: Speech recognition and transcription. Google. https://cloud.google.com/speech-to-text

Le, T. T., Nguyen, L. T., & Nguyen, D. Q. (2024). PhoWhisper: Automatic speech recognition for Vietnamese. arXiv preprint arXiv:2406.02555.

Li, J., Li, D., Savarese, S., & Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (pp. 19730–19742). PMLR.

Microsoft Garage. (2024). Seeing AI. Microsoft. https://www.microsoft.com/en-us/garage/wall-of-fame/seeing-ai/

OpenAI. (2022, September 21). Introducing Whisper. https://openai.com/index/whisper/

OpenCompass. (2024). Open VLM Leaderboard. https://huggingface.co/spaces/opencompass/open_vlm_leaderboard/

Pndurette. (2025, January 15). gTTS: Python library and CLI tool to interface with Google Translate’s text-to-speech API. GitHub. https://github.com/pndurette/gTTS

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492–28518). PMLR.

Team, G., Anil, R., Borgeaud, S., Alayrac, J. B., Yu, J., Soricut, R., ... & Blanco, L. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., ... & Batsaikhan, B. O. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Tran, C., & Thanh, H. L. (2024). LaVy: Vietnamese multimodal large language model. arXiv preprint arXiv:2404.07922.

Be My Eyes. (n.d.). In Wikipedia. https://en.wikipedia.org/wiki/Be_My_Eyes

Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). OpenReview. https://openreview.net/forum?id=1tZbq88f27