Evaluating the Reliability of ChatGPT-Generated Code in Software Engineering Tasks

Authors

  • Faris Sattar Hadi, Information Technology Research and Development Center, University of Kufa, Iraq

Keywords:

Software Engineering, ChatGPT, Code Reliability, Large Language Models, Empirical Study

Abstract

A new generation of AI-driven coding assistants, including GitHub Copilot and other tools built on large language models (LLMs), is reshaping established software development workflows. One such model, ChatGPT, has attracted particular attention for its ability to generate executable code and to assist with debugging and other software engineering tasks. As industry reliance on these tools grows, a central question is whether code written by ChatGPT is trustworthy and of sufficient quality for production-grade software systems. Previous work has focused mainly on functional correctness and therefore offers little insight into broader software quality. This paper provides a thorough empirical characterization of the reliability of ChatGPT-generated code from a software engineering perspective. We propose a multidimensional reliability framework covering correctness, efficiency, maintainability, security, readability, testability, and consistency, and apply it in controlled empirical studies of 500 code artifacts drawn from five categories of software engineering tasks and two programming languages. The generated code was analyzed with security scanners, automated static analysis tools, standard software quality metrics, and statistical significance tests. The results suggest that, in terms of readability and basic functional correctness, ChatGPT is a suitable assistant for rapid coding support. However, they also reveal significant reliability problems: inconsistent results across repeated generations, degraded quality on non-trivial programs, and security vulnerabilities that are easy to exploit. Reliability also varies considerably across task types and quality attributes.
The paper concludes that although ChatGPT brings productivity benefits, it is not yet a dependable standalone software development assistant; its output should instead be validated by a quality assurance process. We believe that the proposed evaluation framework, together with the empirical evidence presented in this paper, offers a guide for researchers and practitioners who want to adopt large language models into existing software engineering practices.
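The abstract names the framework's quality dimensions but, as an abstract, does not show how they are measured. As a minimal illustration of how two of those dimensions could be operationalized, the sketch below (all function names and scoring choices are assumptions for illustration, not the authors' implementation) scores functional correctness as the pass rate over a small test suite and treats run-to-run variance across repeated generations as a crude consistency proxy:

```python
import statistics


def functional_correctness(source: str, func_name: str, tests) -> float:
    """Fraction of test cases a generated artifact passes (correctness dimension).

    `tests` is a list of ((args...), expected_result) pairs. Any exception
    during loading or execution counts against the artifact.
    """
    namespace = {}
    try:
        exec(source, namespace)  # load the generated code into an isolated namespace
    except Exception:
        return 0.0  # code that does not even parse/load scores zero
    fn = namespace.get(func_name)
    if not callable(fn):
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count as passed
    return passed / len(tests)


def consistency_score(per_run_scores) -> float:
    """1 minus the population std. dev. of per-run correctness scores:
    identical results over repeated generations score 1.0."""
    return 1.0 - statistics.pstdev(per_run_scores)


if __name__ == "__main__":
    suite = [((1, 2), 3), ((0, 0), 0)]
    good = "def add(a, b):\n    return a + b\n"   # hypothetical generation 1
    buggy = "def add(a, b):\n    return a - b\n"  # hypothetical generation 2
    runs = [functional_correctness(src, "add", suite) for src in (good, buggy)]
    print(runs)                     # per-run correctness: [1.0, 0.5]
    print(consistency_score(runs))  # low spread -> score near 1.0
```

A real harness would of course sandbox the `exec` call and add the remaining dimensions (e.g., static-analysis findings for security and maintainability), but the same pattern of one score per dimension per artifact is what makes statistical comparison across task types possible.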

References

[1] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey of machine learning for big code and naturalness,” ACM Computing Surveys, 2018.

[2] M. Chen et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.

[3] T. Nijkamp et al., “Code generation with large language models,” Communications of the ACM, 2023.

[4] Y. Fan, T. Yu, Y. Zhang, and S. Wang, “Large language models for software engineering: A systematic literature review,” Empirical Software Engineering, 2023.

[5] M. Fu, C. Tantithamthavorn, and A. E. Hassan, “The impact of automated program repair on software maintainability,” IEEE Transactions on Software Engineering, 2022.

[6] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010.

[7] Y. Li, H. Zhang, and Y. Zhou, “An empirical study of ChatGPT for software engineering tasks,” Information and Software Technology, 2023.

[8] NIST, Software Assurance Metrics and Models, National Institute of Standards and Technology, 2018.

[9] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? Assessing the security of GitHub Copilot’s code contributions,” IEEE Symposium on Security and Privacy, 2022.

[10] D. Zhang, S. Wang, and D. Lo, “Trustworthy AI for software engineering: Foundations and future directions,” IEEE Transactions on Software Engineering, 2024.

[11] S. Wang et al., “Learning semantic representations for source code,” IEEE Transactions on Software Engineering, 2018.

[12] A. Vaswani et al., “Attention is all you need,” NeurIPS, 2017.

[13] D. Sobania et al., “An empirical study of ChatGPT in software testing,” ICSE, 2023.

[14] H. Tian et al., “Evaluating large language models for code intelligence tasks,” ACM Transactions on Software Engineering and Methodology, 2023.

[15] H. Kang, J. Lou, and T. Zhang, “Maintainability of AI-generated code,” Journal of Systems and Software, 2024.

[16] S. Feng et al., “Documentation quality of AI-generated code,” Information and Software Technology, 2024.

[17] H. Pearce et al., “Security implications of large language models for code,” IEEE Security & Privacy, 2023.

[18] A. Arcuri and L. Briand, “A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering,” Software Testing, Verification and Reliability, 2014.


Published

2026-04-20

How to Cite

Hadi, F. S. (2026). The Reliability Evaluating of ChatGPT-Generated Code in Software Engineering Tasks. Vital Annex: International Journal of Novel Research in Advanced Sciences (2751-756X), 5(2), 116–128. Retrieved from https://journals.innoscie.com/index.php/ijnras/article/view/186

Section

Articles
