Minh-Quan Tran * , Duy Truong , Duy-Tan Pham , Minh-Anh Nguyen , Duc-Tung Le , Di-Hao Le and Quang-Huy Duong

* Corresponding author: Minh-Quan Tran (email: 22521191@gm.uit.edu.vn)


Abstract

People with visual impairments often face significant challenges in identifying and accessing product information in their daily lives, particularly when visual cues such as packaging details, labels, or expiration dates are inaccessible. In this paper, we present NaviBlind, a multimodal AI-powered assistive system designed to help visually impaired individuals understand key product details through natural interactions. Our system combines image understanding based on Gemini Flash vision models with Vietnamese speech recognition powered by PhoWhisper, which extracts information needs directly from user voice commands. The user uploads an image of the product and speaks the kind of information needed, such as name, color, type, or expiry date; the system then analyzes the image and returns a concise, structured textual description, which is converted into Vietnamese speech. To ensure reliability, we incorporate mechanisms to detect uncertain or hallucinated outputs from the vision model, especially for low-quality images. The system is deployed as a user-friendly web application, enabling real-time accessibility for users with limited visual capabilities. Experimental evaluation demonstrates the potential of NaviBlind in promoting autonomy and independence for the visually impaired in everyday shopping and product recognition tasks.
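For illustration, the following is a minimal Python sketch of the pipeline described above. The checkpoint name (vinai/PhoWhisper-small), the Gemini model identifier (gemini-1.5-flash), the prompt, and the file names are assumptions for this sketch, not the authors' exact configuration.

```python
# Illustrative pipeline sketch: speech command -> image understanding -> Vietnamese speech.
import google.generativeai as genai          # pip install google-generativeai
from transformers import pipeline            # pip install transformers
from gtts import gTTS                        # pip install gTTS
import PIL.Image

# 1. Vietnamese speech recognition with PhoWhisper (Hugging Face checkpoint assumed).
asr = pipeline("automatic-speech-recognition", model="vinai/PhoWhisper-small")
user_request = asr("voice_command.wav")["text"]

# 2. Image understanding with a Gemini Flash vision model (model name assumed).
genai.configure(api_key="YOUR_API_KEY")
vision_model = genai.GenerativeModel("gemini-1.5-flash")
prompt = (
    "Describe only the requested product details concisely in Vietnamese. "
    f"User request: {user_request}"
)
response = vision_model.generate_content([prompt, PIL.Image.open("product.jpg")])
description = response.text
# (The paper's checks for uncertain or hallucinated outputs on low-quality
#  images are not reproduced in this sketch.)

# 3. Vietnamese text-to-speech with gTTS.
gTTS(description, lang="vi").save("answer.mp3")
```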

Keywords: Human-centered design, multimodal assistive AI, product accessibility, text-to-speech, Vietnamese speech recognition, vision-language models


References

Banerjee, S., & Lavie, A. (2005, June). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72).

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 4171–4186).

Doan, K. T., Huynh, B. G., Hoang, D. T., Pham, T. D., Pham, N. H., Nguyen, Q. T. M., Vo, B. Q., & Hoang, S. N. (2024). Vintern-1B: An efficient multimodal large language model for Vietnamese. arXiv preprint arXiv:2408.12480.

Google. (2018). Use Lookout to explore your surroundings. Android Accessibility Help. Google. https://support.google.com/accessibility/android/answer/9031274?hl=en

Google Cloud. (2018). Speech-to-Text AI: Speech recognition and transcription. Google. https://cloud.google.com/speech-to-text

Le, T. T., Nguyen, L. T., & Nguyen, D. Q. (2024). PhoWhisper: Automatic speech recognition for Vietnamese. arXiv preprint arXiv:2406.02555.

Li, J., Li, D., Savarese, S., & Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (pp. 19730–19742). PMLR.

Microsoft Garage. (2024). Seeing AI. Microsoft. https://www.microsoft.com/en-us/garage/wall-of-fame/seeing-ai/

OpenAI. (2022, September 21). Introducing Whisper. https://openai.com/index/whisper/

OpenCompass. (2024). Open VLM Leaderboard. https://huggingface.co/spaces/opencompass/open_vlm_leaderboard/

Pndurette. (2025, January 15). gTTS: Python library and CLI tool to interface with Google Translate’s text-to-speech API. GitHub. https://github.com/pndurette/gTTS

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492–28518). PMLR.

Team, G., Anil, R., Borgeaud, S., Alayrac, J. B., Yu, J., Soricut, R., ... & Blanco, L. (2023). Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., ... & Batsaikhan, B. O. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Tran, C., & Thanh, H. L. (2024). LaVy: Vietnamese multimodal large language model. arXiv preprint arXiv:2404.07922.

Be My Eyes. (n.d.). In Wikipedia. https://en.wikipedia.org/wiki/Be_My_Eyes

Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). OpenReview. https://openreview.net/forum?id=1tZbq88f27