Le Gia Kiet, Le Quoc Khanh, Nguyen Minh Nhut and Nguyen Dinh Thuan*

* Corresponding author: Nguyen Dinh Thuan (email: thuannd@uit.edu.vn)

Abstract

Translating natural language into SQL is essential for intuitive database access, yet open-source small language models (SLMs) still lag behind larger systems when faced with complex schemas and tight context windows. This paper introduces a two-phase workflow designed to enhance the Text-to-SQL capabilities of SLMs. Phase 1 (offline) transforms the database schema into a graph, partitions it with Louvain community detection, and enriches each component in a cluster with metadata, relationships, and sample rows. Phase 2 (at runtime) selects the relevant tables, generates a SQL query, and iteratively refines it through an execution-driven feedback loop until the query executes successfully. Evaluated on the Spider test set, our pipeline raises Qwen-2.5-Coder-14B to 86.2% Execution Accuracy (EX), surpassing its zero-shot baseline, outperforming all contemporary SLM + ICL approaches, and narrowing the gap to GPT-4-based systems, all while running on consumer-grade hardware. Ablation studies confirm that both schema enrichment and self-correction contribute significantly to the improvement. The study concludes that this workflow provides a practical methodology for deploying resource-efficient open-source SLMs in Text-to-SQL applications, mitigating the challenges posed by complex schemas and limited context windows. An open-source implementation is released to support further research.
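The execution-driven feedback loop of Phase 2 can be illustrated with a minimal sketch. This is not the released implementation: the function `refine_until_executable`, the `max_rounds` limit, the stub `fake_model` (a hypothetical stand-in for the SLM), and the toy `singer` table are all assumptions made for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Signature assumed for illustration: (question, previous_sql, error) -> new SQL string.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def refine_until_executable(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Generator,
    max_rounds: int = 3,  # hypothetical cap; the abstract does not state one
) -> Tuple[str, List[tuple]]:
    """Generate SQL, execute it, and feed any database error back to the generator."""
    previous_sql: Optional[str] = None
    error: Optional[str] = None
    for _ in range(max_rounds):
        sql = generate_sql(question, previous_sql, error)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows  # the query executed, so the loop stops
        except sqlite3.Error as exc:
            # Keep the failing query and its error message for the next attempt.
            previous_sql, error = sql, str(exc)
    raise RuntimeError(f"no executable query after {max_rounds} rounds: {error}")


def fake_model(question: str, previous_sql: Optional[str], error: Optional[str]) -> str:
    # Hypothetical stand-in for the SLM: the first attempt misspells a column,
    # and the second attempt repairs it once an error message is supplied.
    if error is None:
        return "SELECT nme FROM singer WHERE age > 30"
    return "SELECT name FROM singer WHERE age > 30"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer (name, age) VALUES (?, ?)", [("Ana", 41), ("Bo", 25)])

final_sql, rows = refine_until_executable(conn, "Which singers are older than 30?", fake_model)
print(final_sql)
print(rows)
```

In this sketch the loop stops as soon as the query executes, which matches the stopping condition described in the abstract; it says nothing about whether the result is semantically correct, which is what Execution Accuracy measures separately.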

Keywords: Database schema context enrichment, graph clustering, natural language processing, open-source models, small language models (SLMs), text-to-SQL
