Nguyen Phuoc Thanh*, Nguyen Thanh Hoang, Hoang Ngoc Xuan Nguyen, Phan Huynh Thanh Binh, Vu Hoang Son Hai and Huynh Hieu Nhan

* Corresponding author (ngpthanh15@gmail.com)

Abstract

This study investigates optimization strategies for real-time sign language recognition (SLR) using the MediaPipe framework. We introduce a multi-modal approach that combines four Long Short-Term Memory (LSTM) models, each processing skeletal coordinates extracted by MediaPipe. Evaluations on established sign language datasets show that the multi-modal approach substantially improves recognition accuracy while preserving real-time performance. In comparisons with other MediaPipe-based models, it consistently achieved better results. A further strength of the approach is its flexibility: the LSTM layers can be adapted to different tasks and data types. Integrating the MediaPipe framework with real-time SLR markedly improves recognition accuracy, representing a meaningful advance in the field.

Keywords: LSTM, MediaPipe, How2Sign, Indian Sign Language, ISL
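
To illustrate the multi-modal design described in the abstract, the sketch below shows what a four-branch LSTM classifier over MediaPipe landmark streams might look like. It is a minimal illustration, not the paper's exact architecture: the stream split (pose, face, left hand, right hand from MediaPipe Holistic), the 30-frame window, the layer widths, the dropout rate, and the 50-sign vocabulary are all assumptions made for the example.

```python
# Illustrative sketch only (hypothetical sizes and names): a late-fusion
# classifier with one LSTM branch per MediaPipe Holistic landmark stream.
from tensorflow.keras import layers, Model

SEQ_LEN = 30       # assumed number of frames per sign clip
NUM_CLASSES = 50   # assumed sign vocabulary size

# Per-frame feature sizes from MediaPipe Holistic, flattened per landmark:
# pose has 33 points with (x, y, z, visibility); the face has 468 points
# and each hand 21 points with (x, y, z).
STREAMS = {
    "pose": 33 * 4,
    "face": 468 * 3,
    "left_hand": 21 * 3,
    "right_hand": 21 * 3,
}

def lstm_branch(name: str, feat_dim: int):
    """One LSTM tower over a single landmark stream."""
    inp = layers.Input(shape=(SEQ_LEN, feat_dim), name=name)
    x = layers.LSTM(64, return_sequences=True)(inp)
    x = layers.LSTM(64)(x)       # final hidden state summarizes the clip
    x = layers.Dropout(0.5)(x)   # regularization against overfitting
    return inp, x

inputs, towers = zip(*(lstm_branch(n, d) for n, d in STREAMS.items()))
fused = layers.Concatenate()(list(towers))  # late fusion of the four streams
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model(inputs=list(inputs), outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

At inference time, each video frame would be passed through MediaPipe Holistic, its landmark coordinates flattened into the four per-frame vectors above, and a sliding 30-frame buffer fed to the model, which keeps the pipeline compatible with real-time use.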

Article Details

References

Shi, L., Zhang, Y., Cheng, J., & Lu, H. (2020). Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing, 29, 9532-9545.

Dardas, N. H., & Georganas, N. D. (2011). Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques. IEEE Transactions on Instrumentation and Measurement, 60(11), 3592-3607.

Velmathi, G., & Goyal, K. (2023). Indian Sign Language Recognition Using Mediapipe Holistic. arXiv preprint.
https://arxiv.org/abs/2304.10256

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.-L., Yong, M. G., Lee, J., Chang, W.-T., Hua, W., Georg, M., & Grundmann, M. (2019). MediaPipe: A Framework for Building Perception Pipelines. arXiv preprint.
https://arxiv.org/abs/1906.08172

Staudemeyer, R. C., & Morris, E. R. (2019). Understanding LSTM -- a tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv preprint.
https://arxiv.org/abs/1909.09586

Emmorey, K. (2001). Language, cognition, and the brain: Insights from sign language research. Psychology Press.

Huang, J., Zhou, W., Li, H., & Li, W. (2018). Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9), 2822-2832.

Sofianos, T., Sampieri, A., Franco, L., & Galasso, F. (2021). Space-Time-Separable Graph Convolutional Network for Pose Forecasting. CoRR, abs/2110.04573.
https://arxiv.org/abs/2110.04573

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), 1929-1958. http://jmlr.org/papers/v15/srivastava14a.html