
How to Evaluate AI-Generated Code for Security

PUBLISHED: April 21, 2025 | BY: Abhay Bhargav
Ideal for: AI Engineers, Security Engineers

On November 30, 2022, ChatGPT burst onto the scene and captivated the world in an instant. Fast forward two years, and the AI universe is evolving at lightning speed—with fresh, groundbreaking innovations popping up almost daily. Among all its incredible uses, AI-driven code generation has consistently taken center stage. In fact, Google recently revealed that over 25% of its code is now crafted by AI, signaling a transformative leap in the future of automated programming.

Today, anyone can generate hundreds of lines of code, but the real challenge lies in its security. With vulnerabilities on the rise—especially in AI-written code—the need for secure coding practices has never been greater. But here's where it gets interesting: AI comes to its own rescue. Today's language models don't just generate code—they can also act as vigilant security reviewers, scanning for and correcting vulnerabilities as they arise.

In this article, we'll dive into a few strategies for crafting secure code, walk through the actual code snippets that bring these methods to life, and even shed some light on the math behind the scenes (without getting too deep into the details!).

Table of Contents

  1. Coder Reviewer Reranking
  2. Modular Code Generation Through Chain Of Revisions
  3. Dual-Critic Agents for Helpfulness and Security of Code
  4. Key Players – Two Autonomous Critic agents
  5. Conclusion

Coder Reviewer Reranking

Problem

Traditional methods for code generation primarily depend on coder models that translate natural language instructions into code. These models typically:

  • Generate Multiple Candidates: They sample several potential programs.
  • Rank by Likelihood: They rank these candidates by an internal likelihood estimate, p(y∣x), the probability the model assigns to a program (y) given the prompt (x).

However, this approach suffers from a significant drawback—it often favors degenerate solutions. For example:

  • Overly Short Code: Functions that contain only a simple return statement are frequently preferred due to inherent biases in likelihood calculations.
  • Repetitive Patterns: Code that includes redundant loops (like printing numbers 1–50) may score highly, even though it often fails to address the task effectively.

Solution

The solution is a dual-model framework that mimics the human code review process:

  1. Coder Model: Generates code candidates from instructions.
  2. Reviewer Model: Evaluates how well the generated code aligns with the original instruction by estimating p(x∣y), the likelihood of the instruction given the code. In layman's terms, p(x∣y) asks: "how probable is the prompt (x) given a particular code candidate (y)?"

By combining the scores of both models—p(y∣x)⋅p(x∣y), the product of the probability of the code given the prompt and the probability of the prompt given the code—the method identifies solutions that satisfy both code correctness and instruction fidelity. This approach is a specific implementation of the Maximum Mutual Information (MMI) objective, a criterion that favors solutions with high mutual agreement between the instruction and the generated code.

Code Snippet
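The original snippet is not reproduced here, so below is a minimal sketch of the reranking idea. It assumes a single Hugging Face causal LM is used for both the coder and the reviewer scores (the general framework allows separate models), and the checkpoint name is just a placeholder.

```python
# Minimal sketch of Coder-Reviewer reranking.
# Assumption: one causal LM scores both p(code | prompt) and p(prompt | code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigcode/starcoderbase-1b"  # placeholder checkpoint; swap in your own

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def log_likelihood(condition: str, target: str) -> float:
    """Sum of token log-probs of `target` when the model is conditioned on `condition`."""
    cond_len = tokenizer(condition, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(condition + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n-1
    next_tokens = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(next_tokens.shape[0]), next_tokens]
    return token_lp[cond_len - 1:].sum().item()  # keep only the target's tokens

def rerank(prompt: str, candidates: list[str]) -> str:
    """Pick the candidate maximising log p(y|x) + log p(x|y), the MMI-style objective."""
    def score(code: str) -> float:
        coder_score = log_likelihood(prompt, code)     # p(y|x): code given prompt
        reviewer_score = log_likelihood(code, prompt)  # p(x|y): prompt given code
        return coder_score + reviewer_score
    return max(candidates, key=score)
```

In practice the candidates come from sampling the coder model several times; the reranking step only decides which sample to keep.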

Modular Code Generation Through Chain Of Revisions

Problem

Monolithic Code Generation

LLMs typically output solutions as single-block implementations rather than decomposed logical units. This approach:

  • Increases error propagation: A flaw in one component corrupts the entire solution.
  • Reduces interpretability: Debugging becomes challenging without clear functional boundaries.
  • Limits reuse: Models cannot leverage proven sub-components across iterations.

Example: For palindrome detection, GPT-3.5 might write a brute-force checker rather than modular functions for string reversal and symmetry analysis.
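For a concrete sense of what "modular" means here, a decomposed palindrome checker might look like the following (an illustrative hand-written example, not a model's actual output):

```python
def reverse_string(s: str) -> str:
    """Reusable sub-module: return the characters of s in reverse order."""
    return s[::-1]

def is_symmetric(s: str) -> bool:
    """Reusable sub-module: check that s reads the same forwards and backwards."""
    return s == reverse_string(s)

def is_palindrome(text: str) -> bool:
    """Top-level function composed from the sub-modules above."""
    normalized = "".join(c.lower() for c in text if c.isalnum())
    return is_symmetric(normalized)

print(is_palindrome("A man, a plan, a canal: Panama"))  # True
```

Each sub-module can be tested, fixed, or reused on its own, which is exactly what a monolithic version gives up.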

Ineffective Error Correction

Traditional self-repair methods:

  • Rely on per-program feedback (compiler errors/test results) without cross-solution insights.
  • Treat each solution as independent, ignoring shared patterns across attempts.

Lack of Human-Like Abstraction

Experienced developers instinctively:

  1. Decompose problems into logical sub-tasks
  2. Create reusable functions/classes
  3. Iteratively refine components

Solution

The Intuition

Planning the Process (CoT):
The model begins by thinking through the problem in steps, just as you might outline a plan before starting a project. It breaks down the overall task into smaller parts, making it easier to manage.

Building the Pieces (Sub-Modules):
Each part of the task is handled separately. Like assembling a toy, each sub-module (or piece) is developed independently. This modular approach means if one piece doesn’t work, you can fix or replace it without redoing everything.

Translating to a Common Language (Encoder):
Once the pieces are built, they’re translated into a numerical form. This conversion lets us compare different versions of the same sub-module to find similarities and differences.

Grouping Similar Pieces (K-Means Clustering):
The model then groups these similar pieces together. Just as you might sort similar items into piles, this grouping helps identify which versions of a sub-module are most alike and likely the most reliable.

Choosing the Best Example (Centroid Selection):
From each group, the version that best represents the group is chosen—the centroid. This is considered the most generic and reusable version, much like selecting the best model of a gadget for future reference.

Improving Future Work (Prompt Augmentation):
The successful pieces are then used to refine the instructions for creating new code. By telling the model, "Use this great version as a base," we ensure that future code builds on past successes, making each iteration better than the last.
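To make the encoder, clustering, and centroid-selection steps concrete, here is a minimal sketch. It assumes you already have several generated versions of the same sub-module; a real pipeline would embed the code with a dedicated encoder model, while TF-IDF is used here only to keep the example self-contained.

```python
# Sketch of the cluster-and-select step for one sub-module.
# Assumption: TF-IDF stands in for a proper code-embedding encoder.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def select_representative(submodule_versions: list[str], n_clusters: int = 2) -> str:
    """Group similar implementations and return the one closest to the largest cluster's centroid."""
    embeddings = TfidfVectorizer().fit_transform(submodule_versions).toarray()

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    largest = np.bincount(labels).argmax()  # the biggest pile of similar pieces
    members = np.where(labels == largest)[0]
    centroid = kmeans.cluster_centers_[largest]
    distances = np.linalg.norm(embeddings[members] - centroid, axis=1)
    return submodule_versions[members[distances.argmin()]]

versions = [
    "def reverse_string(s):\n    return s[::-1]",
    "def reverse_string(s):\n    return ''.join(reversed(s))",
    "def reverse_string(s):\n    out = ''\n    for c in s:\n        out = c + out\n    return out",
]
print(select_representative(versions))
```

The version returned here is what gets fed back into the prompt for the next revision round in the prompt-augmentation step.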

Dual-Critic Agents for Helpfulness and Security of Code

Problem

Large language models (LLMs) for code generation struggle to balance being secure with being helpful, especially when given complicated or potentially harmful instructions. Traditional methods often fall short in this regard—studies have shown that as many as 40% of GitHub Copilot's outputs contain security vulnerabilities.

This issue arises from two main challenges:

  1. Ambiguity in Code: Code often has a dual purpose. For example, an encryption algorithm can be used to protect data but can also be misused to create ransomware.

  2. Limitations of Single-Focused Approaches:
    • Safety-First Methods: Techniques like fine-tuning for safety are not flexible enough to handle new or creative attack methods.
    • Overemphasis on Helpfulness: Trying too hard to align with what the user wants can inadvertently open up security loopholes.

Solution

The Goal

Improve code generation by balancing security and helpfulness. Traditional methods tend to focus on one at the expense of the other, which can lead to weak or inefficient code.

Key Players – Two Autonomous Critic agents

  1. Safety-Driven Critic
    • What It Does: Checks the code for security risks (e.g., spotting potential injection attacks).
    • How It Works:
      • Uses external resources (like secure coding guides found via web search) to stay updated on best practices.
      • For example, it might flag a command string built from user input (like subprocess.Popen with shell=True) and suggest a hardened alternative that passes an argument list and uses shlex for quoting (see the sketch after this list).
  2. Helpfulness-Driven Critic
    • What It Does: Ensures the code is correct, efficient, and meets what the user wants.
    • How It Works:
      • Reviews the code for performance issues, logical errors, and overall alignment with the intended functionality.
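As an example of the kind of rewrite the safety-driven critic might push for, compare a shell-interpolated command with a hardened equivalent (a generic illustration, not output from the framework itself):

```python
# Before/after example of the subprocess hardening mentioned above.
import shlex
import subprocess

def run_grep_insecure(user_pattern: str, path: str) -> str:
    # Vulnerable: user input is interpolated into a shell string (command injection risk)
    return subprocess.run(
        f"grep {user_pattern} {path}", shell=True, capture_output=True, text=True
    ).stdout

def run_grep_hardened(user_pattern: str, path: str) -> str:
    # Safer: no shell, arguments passed as a list so metacharacters are not interpreted;
    # "--" stops grep from treating a pattern that starts with "-" as an option
    cmd = ["grep", "--", user_pattern, path]
    print("running:", shlex.join(cmd))  # shlex.join used only for safe logging/display
    return subprocess.run(cmd, capture_output=True, text=True).stdout
```

The helpfulness-driven critic would then check that the hardened version still does what the user asked for (same output, acceptable performance) before the change is accepted.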

Collaboration Through Multi-Agent Synergy:

  • The two critics engage in an internal dialogue—essentially a debate—where they exchange insights about the code.
  • This teamwork helps refine the code iteratively by combining their unique perspectives on security and functionality.

Two-Stage Operation of the Framework

  1. Preemptive Feedback:
    • During Code Drafting:
      • The critics perform proactive checks, catching security risks and logical errors early on, much like a human proofreads their work as they write.
  2. Post-Hoc Feedback:
    • After Code Execution:
      • The critics analyze how the code runs (using runtime results) to further fine-tune and improve it.
      • This stage ensures any issues that weren’t caught during drafting are addressed based on real-world behavior.

Representative code-snippet
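The snippet below is a representative sketch of the two-stage loop described above rather than the framework's actual implementation; `call_llm` is a placeholder you would wire up to your own model provider, and the prompts are illustrative.

```python
# Sketch of a dual-critic refinement loop (preemptive stage shown).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Connect this to your LLM provider of choice")

def generate_with_dual_critics(task: str, max_rounds: int = 3) -> str:
    code = call_llm(f"Write Python code for the following task:\n{task}")
    for _ in range(max_rounds):
        # Preemptive feedback: both critics review the draft before it ever runs
        safety_notes = call_llm(
            f"Act as a security reviewer. List vulnerabilities in this code:\n{code}"
        )
        helpfulness_notes = call_llm(
            f"Act as a code reviewer. List correctness and efficiency issues:\n{code}"
        )
        if "no issues" in (safety_notes + helpfulness_notes).lower():
            break
        # The critics' findings are merged into a single revision prompt
        code = call_llm(
            "Revise the code to address the following feedback.\n"
            f"Security feedback:\n{safety_notes}\n"
            f"Helpfulness feedback:\n{helpfulness_notes}\n"
            f"Code:\n{code}"
        )
    # A post-hoc stage would additionally execute the code (e.g. against tests)
    # and feed the runtime results back through the same two critics.
    return code
```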

Conclusion

We have outlined three innovative methods to generate and improve code using AI.

  1. Coder-Reviewer Reranking uses a dual-model approach that mimics human code reviews, cutting poor solutions by 62%.
  2. Chain-of-Revisions Architecture breaks code into parts and refines them like an expert developer, reducing errors by 41% and boosting code reuse.
  3. Dual-Critic Agents involve two collaborating models that balance security and functionality, lowering vulnerabilities by 89% without harming quality.

Looking ahead, combining these techniques into hybrid systems could allow AI models to self-check, cross-verify, and fine-tune their code more effectively. 

Abhay Bhargav

Blog Author
Abhay builds AI-native infrastructure for security teams operating at modern scale. His work blends offensive security, applied machine learning, and cloud-native systems focused on solving the real-world gaps that legacy tools ignore. With over a decade of experience across red teaming, threat modeling, detection engineering, and ML deployment, Abhay has helped high-growth startups and engineering teams build security that actually works in production, not just on paper.
