We Analyzed AI-Generated Code: Here's What You Should Know


Artificial intelligence (AI) coding tools offer a huge productivity boost to developers. The ability to write the logic you need in natural language and have hundreds of lines of code generated in seconds in any language is an exceptional opportunity. According to Codacy's 2024 State of Software Quality report, 64.3% of respondents have integrated AI into their code production workflows, either extensively or for certain aspects like smart autocomplete and boilerplate code generation.

But is that code any good?

This is the fundamental question currently being asked as we weigh the pros and cons of AI coding. AI coding assistants can create code quickly. But how good is the code that's created? Does it actually work well? Is it secure? 

The easiest way to answer these questions is to test it out yourself. And that's exactly what I did. 

My AI Coding Experiment

The plan was simple. Ask ChatGPT to generate code from a simple prompt, then run static code analysis using Codacy on the generated code. Codacy uses an assortment of different linters (52+), giving us a comprehensive aggregate view of quality.

This is the prompt I wrote:

“Write 5 python functions that receive plain text, hashes it, encrypts it, opens a file, and saves the resulting output to the file. 

         Just output the code, nothing else. 

It's meant to be used as an output to a file to be executed.  

Please don't have a conversation. It should only be code output. please output in this format: '#!/usr/bin/env python \n' followed by the code.

Don't include any ```python because this code is meant to be executed, not displayed.”

I performed this entire experiment via the OpenAI API. You can see the entire code I used to generate our test code here.
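For reference, here's a minimal sketch of the kind of generation loop involved (the model name, output file naming, and client usage here are illustrative, not the exact script linked above):

# Minimal sketch of the generation loop (illustrative; see the linked script for the real thing).
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

PROMPT = "Write 5 python functions that receive plain text, hashes it, encrypts it, ..."  # full prompt shown above

for i in range(1, 11):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.7,        # the default kept for this experiment
    )
    with open(f"code_{i}.py", "w") as f:
        f.write(response.choices[0].message.content)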

Here are a couple of examples of the generated code:

#!/usr/bin/env python

import hashlib
from cryptography.fernet import Fernet
import os

# Function to hash the input plain text
def hash_input(plain_text):
    hashed_text = hashlib.sha256(plain_text.encode()).hexdigest()
    return hashed_text

# Function to generate key for encryption
def generate_key():
    key = Fernet.generate_key()
    return key

# Function to encrypt the hashed text
def encrypt_text(hashed_text, key):
    cipher_suite = Fernet(key)
    cipher_text = cipher_suite.encrypt(hashed_text.encode())
    return cipher_text

# Function to open a file
def open_file(filename):
    if os.path.exists(filename):
        with open(filename, 'rb') as f:
            file_data = f.read()
    else:
        file_data = None
    return file_data

# Function to write the encrypted hash into a file
def write_to_file(cipher_text, filename):
    with open(filename, 'wb') as f:
        f.write(cipher_text)

The script is designed to handle encryption of hashed text and file operations associated with storing the encrypted content. This code imports the necessary libraries for hashing and encryption. It then defines five functions:

  1.  hash_input that takes plain text as an argument and returns its SHA-256 hash
  2. generate_key which generates an encryption key using the Fernet symmetric encryption method
  3. encrypt_text that encrypts the provided hashed text using the provided encryption key
  4. open_file which attempts to read and return the content of a specified file, returning None if the file doesn't exist
  5. write_to_file which writes the provided encrypted text to a specified file. 

Another example that has a different approach:

#!/usr/bin/env python

import hashlib

from Crypto.Cipher import AES
import os
from base64 import b64encode, b64decode

def hash_text(text):
    hashed = hashlib.sha256(text.encode()).hexdigest()
    return hashed

def generate_key():
    return os.urandom(16)

def encrypt_text(text, key):
    cipher = AES.new(key, AES.MODE_EAX)
    nonce = cipher.nonce
    ciphered_data, _ = cipher.encrypt_and_digest(text.encode())
    return b64encode(ciphered_data).decode(), b64encode(nonce).decode()

def decrypt_text(ciphered_data, nonce, key):
    ciphered_data = b64decode(ciphered_data)
    nonce = b64decode(nonce)
    cipher = AES.new(key, AES.MODE_EAX, nonce=nonce)
    decrypted_data = cipher.decrypt(ciphered_data)
    return decrypted_data.decode()

def save_to_file(filename, data):
    with open(filename, 'w') as file:
        file.write(data)

The given Python script provides utilities for hashing, encrypting, and decrypting text using the SHA-256 hashing algorithm and the AES encryption algorithm in EAX mode. The code makes use of the `Crypto.Cipher` library for encryption and decryption operations and the `base64` module for encoding and decoding processes. It defines five functions:

  1. hash_text which hashes a provided text using SHA-256
  2. generate_key which produces a random 16-byte key for AES encryption
  3. encrypt_text which encrypts text using the AES algorithm and returns the base64-encoded encrypted data and its associated nonce
  4. decrypt_text which decrypts the base64-encoded encrypted data using the provided nonce and key, returning the original text
  5. save_to_file which saves the given data to a specified file. 

ChatGPT generated ten code files, each with a corresponding issue_output file from the analysis. Let's go through what we found in these results.

Design Decisions Are Implicit Until You Make Them Explicit

Have a look again at the two generated files above. Notice anything missing?

Error handling. Only one out of the ten generated code files had any error handling:

 

File Name       Error Handling
code_1.py       ⛔️
code_2.py       ⛔️
code_3.py       ⛔️
code_4.py       ⛔️
code_5.py       ⛔️
code_6.py       ⛔️
code_7.py       ✅
code_8.py       ⛔️
code_9.py       ⛔️
code_10.py      ⛔️

Even within code_7.py, the error handling wasn't robust:

#!/usr/bin/env python 

import hashlib
from cryptography.fernet import Fernet
import os

def hash_text(text):
    return hashlib.sha256(text.encode()).hexdigest()

def generate_key():
    return Fernet.generate_key()

def encrypt_text(key, text):
    cipher_suite = Fernet(key)
    cipher_text = cipher_suite.encrypt(text.encode())
    return cipher_text

def open_file(path, mode):
    try:
        with open(path, mode) as file:
            return file.read()
    except IOError:
        print("File not accessible")

def save_to_file(path, data):
    with open(path, 'w') as file:
        file.write(data)

It has just added a try…except IOError for the file opening. But what happens if we try to hash a non-string input? What happens if we have an invalid key? What happens if there is an error saving the file?
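To make that concrete, here's a quick sketch of what happens when code_7.py's functions get slightly wrong inputs (the exceptions shown are what Python raises; nothing in the generated code catches them):

# Unhandled cases in code_7.py, with the resulting exceptions as comments.

hash_text(12345)
# AttributeError: 'int' object has no attribute 'encode' (non-string input is never validated)

encrypt_text(b"not-a-valid-key", "hello")
# ValueError from the Fernet constructor about the key not being 32 url-safe
# base64-encoded bytes (the exact exception depends on the cryptography version);
# either way, nothing catches it

save_to_file("/nonexistent/dir/output.txt", "data")
# FileNotFoundError: the write isn't wrapped in try...except, unlike the read in open_file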

None of these cases are dealt with. Why? Because I didn't explicitly state I wanted error handling. AI code/text generation tools like ChatGPT are based on large language models (LLMs). The whole concept of LLMs is that they are trying to predict the next most likely word in a string, given the context. 

 

They are not trying to generate the best code. They are just trying to produce code that best matches what you asked for. In this case, I asked for hashing, encryption, opening, and saving functions. I didn't ask for error handling. So I didn't get error handling.

ChatGPT did add error handling once, suggesting that sometimes the best "next word" in the code is a try…except statement. But this is rare. (There's an interesting corollary here: if ChatGPT is just generating this code based on what it has seen in millions of lines of other code, does that mean that nine times out of ten that other code also doesn't include error handling?)

With AI tools, you must explicitly state what you want in the code. One good thing with ChatGPT is that you can ask for improvements post-hoc. If we feed it the code in code_7.py with the question: “what error handling might you add to this code?”, we get this:

#!/usr/bin/env python 

import hashlib
from cryptography.fernet import Fernet
import os

def hash_text(text):
    if not isinstance(text, str):
        raise ValueError("Text must be a string.")
    return hashlib.sha256(text.encode()).hexdigest()

def encrypt_text(key, text):
    if not isinstance(key, bytes) or len(key) != 44:  # Fernet keys are base64url encoded and have a length of 44
        raise ValueError("Invalid key.")
    if not isinstance(text, str):
        raise ValueError("Text must be a string.")
    try:
        cipher_suite = Fernet(key)
        cipher_text = cipher_suite.encrypt(text.encode())
        return cipher_text
    except Exception as e:
        print(f"Error encrypting text: {e}")
        return None

def open_file(path, mode):
    try:
        with open(path, mode) as file:
            return file.read()
    except (IOError, PermissionError) as e:
        print(f"Error opening file: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

def save_to_file(path, data):
    if not isinstance(data, str):
        raise ValueError("Data must be a string.")
    try:
        with open(path, 'w') as file:
            file.write(data)
    except (IOError, PermissionError) as e:
        print(f"Error saving to file: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

Much more robust.

Prompt Engineer vs. Software Engineer

Here's the problem. Could we write a prompt that gives us well-designed code? Probably. But then we come to two issues:

  1. We have to already know what well-designed code looks like.
  2. We have to extend our prompts to contain all of the information about our design decisions.

So, for a senior engineer who knows what well-structured code is, is writing the prompt (and iterating on it) going to be faster than just writing the code?

No.

A good security engineer could probably put the code above together faster themselves than by asking ChatGPT to do it properly. You might be able to produce the bare-bones code faster and then improve it yourself, but without explicitly stating your design decisions to the AI every time, the code output by an AI code generation tool is never going to be high-quality enough to use in production.

Quality and Security Issues Are Abundant 

On average, the Codacy CLI found 27.7 issues per file. Most issues were related to formatting, but a few flagged critical security vulnerabilities.

[Screenshot: Codacy issues list showing the security vulnerabilities found in the generated code]

Let's start with quality. Quality code is clean, consistent, and easy to understand. Here, we have missing docstrings, inconsistent whitespace, unused imports, and poor variable names. This code is just a few lines long, so it should be extremely easy to grok, but ChatGPT is making elementary quality mistakes that would frustrate any developer having to review this code.

Again, prompt engineering could come to the rescue. Here's a histogram of the number of issues per file. code_6.py and code_8.py seem to be the biggest culprits:

[Chart: histogram of Codacy issues per generated file]

Let's ask ChatGPT to fix code_8.py by asking: "How would you improve the code quality of this code?"

Along with some recommendations, this is what it outputs:

#!/usr/bin/env python

import hashlib
from Crypto.Cipher import AES
import os
from base64 import b64encode, b64decode

IV_LENGTH = 16
KEY_LENGTHS = [16, 24, 32]  # AES supports 128, 192 and 256 bit keys.

def hash_text(text):
    """Generate a SHA256 hash of the given text."""
    return hashlib.sha256(text.encode()).hexdigest()

def validate_key_length(key):
    """Ensure the key is of a valid length for AES."""
    if len(key) not in KEY_LENGTHS:
        raise ValueError(f"Key must be one of the following lengths: {KEY_LENGTHS}")

def encrypt_text(text, key):
    """Encrypt the text using AES encryption with the given key."""
    validate_key_length(key)

    IV = os.urandom(IV_LENGTH)
    cipher = AES.new(key.encode(), AES.MODE_CFB, IV)
    encrypted_text = cipher.encrypt(text.encode())
    return b64encode(IV + encrypted_text).decode('utf-8')

def decrypt_text(encrypted_text, key):
    """Decrypt the AES encrypted text using the given key."""
    validate_key_length(key)

    decoded_text = b64decode(encrypted_text)
    IV = decoded_text[:IV_LENGTH]
    cipher = AES.new(key.encode(), AES.MODE_CFB, IV)
    return cipher.decrypt(decoded_text[IV_LENGTH:]).decode('utf-8')

def open_file(file_path):
    """Open and read the content of the file at the given path."""
    try:
        with open(file_path, 'r') as f:
            return f.read()
    except IOError:
        print(f"Error reading file at {file_path}")
        return None

def save_to_file(text, file_path):
    """Save the text to the file at the given path."""
    try:
        with open(file_path, 'w') as f:
            f.write(text)
    except IOError:
        print(f"Error writing to file at {file_path}")

def main():
    # Please replace 'your_secret_key' with your actual key. It should be 16, 24, or 32 characters long.
    key = 'your_secret_key'
    text = "Your text here"
    hashed_text = hash_text(text)
    encrypted_text = encrypt_text(text, key)
    decrypted_text = decrypt_text(encrypted_text, key)

    file_path_read = '/path/to/your/file.txt'
    file_content = open_file(file_path_read)

    file_path_write = '/path/to/save/your/file.txt'
    save_to_file(encrypted_text, file_path_write)

if __name__ == "__main__":
    main()

It is still by no means great, but it has some docstrings now and the proper imports. It could still use some better variable names!

We come back to the previous problem, though. If we want good quality code, we have to ask for it. Otherwise, we're at the mercy of the training set, with its billions of lines of poor-quality code.

The training set brings us to the critical flaw in using ChatGPT to write code: security.

Here are some of the security issues found by the Bandit linter running in Codacy:

  • The pyCrypto library and its module AES are no longer actively maintained and have been deprecated. Consider using pyca/cryptography library: 6 occurrences
  • The pyCrypto library and its module Random are no longer actively maintained and have been deprecated. Consider using pyca/cryptography library: 1 occurrence
  • The pyCrypto library and its module pad are no longer actively maintained and have been deprecated. Consider using pyca/cryptography library: 1 occurrence

The code you get is anchored to the model's training cutoff, which in the case of ChatGPT was 2021. If vulnerabilities are found in the libraries after that date, the model won't know about them, and you can be exposed.
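For comparison, here's roughly what the encrypt/decrypt pair looks like with the maintained pyca/cryptography library that Bandit recommends (a sketch using AES-GCM, not something ChatGPT produced in this experiment):

# Sketch of the Bandit-recommended replacement: pyca/cryptography instead of pyCrypto.
# Not ChatGPT output - included only to illustrate the suggested library.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_text(text, key):
    """Encrypt text with AES-GCM; the 12-byte nonce is prepended to the ciphertext."""
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, text.encode(), None)

def decrypt_text(blob, key):
    """Split off the nonce and decrypt (raises InvalidTag if the data was tampered with)."""
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()

key = AESGCM.generate_key(bit_length=128)
assert decrypt_text(encrypt_text("hello", key), key) == "hello"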

I could form an opinion here, but why not ask ChatGPT what it thinks of these issues found in the code that it generated? Just for giggles. Here's what the AI model said:

When a library or a module is no longer actively maintained, it means that no one is patching potential vulnerabilities, improving its functionality, or ensuring its compatibility with newer versions of related software. This can lead to a variety of issues, including:

  • Security Vulnerabilities: Since no one is addressing potential vulnerabilities, the library can become a target for malicious actors. Over time, as more vulnerabilities are discovered and left unpatched, the risk increases.
  • Compatibility Issues: The library might not work properly with newer versions of Python or other dependencies, leading to software malfunctions or crashes.
  • Lack of Features: As the field of cryptography evolves, new and more secure algorithms and methods are developed. Using an outdated library means missing out on these advancements.

Danger Level: On a scale of 1 to 10, I'd rate the danger of using a deprecated and unmaintained cryptographic library at 8. Cryptography is a critical component of security, and using an outdated library can expose applications to severe vulnerabilities. While the actual risk might vary based on the specific use-case and how the library is employed, it's generally a best practice to use well-maintained and up-to-date cryptographic libraries.

That sounds about right. (Fun fact: You'll see comments like this one using Quality AI, which not only suggests fixes but gives color on problems we find).

“You Are a Paranoid Security Engineer”

If you are not paying attention, you will actively introduce poor code and vulnerabilities into your codebase via this method. One "trick" here is that you can ask ChatGPT to adopt a specific persona and answer the question as it thinks that person would. So if we pass it code_8.py and say, "you are a paranoid security engineer. how might you improve this code?", you get:

#!/usr/bin/env python

import hashlib
from Crypto.Cipher import AES
import os
from base64 import b64encode, b64decode

IV_LENGTH = 16
KEY_LENGTHS = [16, 24, 32]  # AES supports 128, 192 and 256 bit keys.

def hash_text(text):
    """Generate a SHA256 hash of the given text."""
    return hashlib.sha256(text.encode()).hexdigest()

def validate_key_length(key):
    """Ensure the key is of a valid length for AES."""
    if len(key) not in KEY_LENGTHS:
        raise ValueError(f"Key must be one of the following lengths: {KEY_LENGTHS}")

def encrypt_text(text, key):
    """Encrypt the text using AES encryption with the given key."""
    validate_key_length(key)
    IV = os.urandom(IV_LENGTH)
    cipher = AES.new(key.encode(), AES.MODE_CFB, IV)
    encrypted_text = cipher.encrypt(text.encode())
    return b64encode(IV + encrypted_text).decode('utf-8')

def decrypt_text(encrypted_text, key):
    """Decrypt the AES encrypted text using the given key."""
    validate_key_length(key)
    decoded_text = b64decode(encrypted_text)
    IV = decoded_text[:IV_LENGTH]
    cipher = AES.new(key.encode(), AES.MODE_CFB, IV)
    return cipher.decrypt(decoded_text[IV_LENGTH:]).decode('utf-8')

def open_file(file_path):
    """Open and read the content of the file at the given path."""
    try:
        with open(file_path, 'r') as f:
            return f.read()
    except IOError:
        print(f"Error reading file at {file_path}")
        return None

def save_to_file(text, file_path):
    """Save the text to the file at the given path."""
    try:
        with open(file_path, 'w') as f:
            f.write(text)
    except IOError:
        print(f"Error writing to file at {file_path}")

def main():
    # Please replace 'your_secret_key' with your actual key. It should be 16, 24, or 32 characters long.
    key = 'your_secret_key'
    text = "Your text here"
    hashed_text = hash_text(text)
    encrypted_text = encrypt_text(text, key)
    decrypted_text = decrypt_text(encrypted_text, key)

    file_path_read = '/path/to/your/file.txt'
    file_content = open_file(file_path_read)

    file_path_write = '/path/to/save/your/file.txt'
    save_to_file(encrypted_text, file_path_write)

if __name__ == "__main__":
    main()

This is better. BUT, it is still using the deprecated library! Why? Because it has no idea it shouldn't. You can tell it to not use this library, but what if you didn't know? There is no easy “prompt” answer to this problem.

By the way: you can also add this kind of role and context prefix to your prompts by using Custom Instructions.
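Via the API, the rough equivalent of that prefix is a system message (a sketch; the model name is illustrative and the persona text is the one used above):

# Sketch: a Custom Instructions-style persona, expressed as a system message via the API.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "system", "content": "You are a paranoid security engineer."},
        {"role": "user", "content": "How might you improve this code?\n\n" + open("code_8.py").read()},
    ],
)
print(response.choices[0].message.content)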

Code Standardization Is Almost Nonexistent 

Another important aspect of code quality is code standardization. If you asked ten senior engineers to perform this task, you'd probably get ten very similar files. There is a canonical, Pythonic way to do this, and they would all conform to those standards.

The code created by ChatGPT was all over the place. Every file handled the task differently.

A fundamental concept to think about here is temperature. Temperature is a setting you can control when using ChatGPT via the OpenAI API (but not when using the ChatGPT UI) that basically controls the “creativity” of the generated text.  

It ranges from 0 to 1. If you set it to 0, the output will be entirely deterministic. This means you should get basically the same answer each time you call the API. If you set it to 1, well, you'll definitely get some eclectic code. Here, I kept the default (0.7) as that is what most people would use.
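Setting it is a single parameter on the API call. Here's a minimal sketch (model name and prompt are illustrative):

# Sketch: temperature is just a parameter on the chat completion request.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": "Write a Python function that SHA-256 hashes a string."}],
    temperature=0.0,        # near-deterministic; higher values produce more varied output
)
print(response.choices[0].message.content)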

If we set it to 0, we'd get the same code each time, but, importantly, not standardized code. It wouldn't conform to PEP 8 standards. As I pointed out above, the AI is just trying to guess the best next word based on its training set, and unfortunately most Python code out in the wild is not up to PEP 8 standards.

Finally, I asked ChatGPT's Code Interpreter feature to statically and dynamically analyze the code.

All files (code_1.py to code_10.py) contain patterns associated with: 

  • Hashing: They have constructs related to hashing.
  • Encryption: They have constructs related to encryption.
  • File Operations: They have constructs related to opening, reading, writing, or closing files.

The dynamic analysis performed (which was based on the output of ChatGPT, so it might not be accurate anyway) showed the following: 

  • code_1.py: Successfully hashed and encrypted the sample text.
  • code_2.py through code_5.py and code_7.py through code_10.py: No visible hashing or encryption results.
  • code_6.py: Encountered an error indicating that the type of object passed to the C code was invalid.
  • code_8.py: Encountered an error related to incorrect AES key length.

This analysis shows that only code_1.py successfully performed hashing and encryption on the sample text without errors. The other files either did not produce any results or encountered errors during execution.

Empirical findings back this up. A recent study found that only 65% of ChatGPT code was correct (better than our results).

Use with Well-Defined, Low-Risk Problems on Mature Stacks

Python is probably one of the best languages to use with AI tools. It is a mature language with a lot of training data. But ChatGPT still got a lot wrong. It isn't made to write great code. It is going to write code the way it has seen code written most often, which is probably not that good.

Truthfully, this is perfectly fine if what you need is a quick way to perform some mundane computing task. Need to rename 10,000 files? Great, here's the code to do that. Need to upload those files to Google Drive? No problem, here's the code you need (let's hope the API hasn't changed in the last two years).
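For illustration, that renaming job is the kind of low-stakes, throwaway script this works well for (a sketch of the sort of thing you'd expect back; the directory and naming scheme are made up):

# Sketch of a low-stakes, throwaway task: bulk-renaming files.
# The directory and naming scheme are made up for illustration.
from pathlib import Path

for i, path in enumerate(sorted(Path("./photos").glob("*.jpg")), start=1):
    path.rename(path.with_name(f"vacation_{i:05d}.jpg"))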

But think about what bad code standards and correctness mean in an organizational environment. What if you have 20 junior developers all using this every day, pushing code and opening PRs? The codebase would be a mess within hours, and production would grind to a halt.

Git Blame ChatGPT

The increase in productivity when coding with AI is undeniable. In just a few seconds, I was given a mountain of functional code that I could then start to analyze. Arguably, very few human beings can write this amount of code this quickly. 


The problem is that you can't trust the code:

  • There's a good chance that ChatGPT gives you different answers every time you ask it for help.
  • When you ask AI to modify code, it often takes shortcuts and is limited by token context windows, which means that you risk losing functionality. 
  • It can also give you outdated answers that could turn out insecure. 

This risk increases significantly if you work with dozens of developers, each writing dozens of prompts a day. 

Overall, using a tool like ChatGPT to code is a crapshoot. It could choose to give you amazing, secure code or awful code. The only control you have over the code output is how you design your prompt. 

It is your name in the PR. You have the responsibility to push good, clean, quality code to your codebase, both for your teammates to use and for the product to actually work. You cannot use ChatGPT in a production environment without a good understanding of the serious limitations.

Sure, the code quality in the example could have been improved by enhancing the prompt, giving it more detail and context. But if you have to do that in depth, at a certain point you might as well be writing the code directly. If you look at it this way, then even AI's clearest benefit, productivity, might be somewhat overstated.

Still, despite some of these shortcomings, the productivity gains of using AI to code are so significant that mass adoption is all but unavoidable. The use of AI-generated code will increase, and while some companies might be reluctant to adopt it or let their developers adopt it, there is no point in trying to stop it. 

We need to embrace AI coding assistants, but we also need the right tools to ensure that we manage the associated risks. That's where Codacy comes in.

AI coding provides the perfect use case for a platform like Codacy. Static code analysis provides a safety net that allows developers to use AI coding tools to work faster. 

We welcome you to see for yourself. If your team is using AI coding assistants, sign up for a free 14-day Codacy trial and analyze the code your team is producing with these tools. 

With Codacy, you can truly have your cake and eat it, too—increasing efficiency while continuing to ship code of the utmost quality and security.
