
Hardening GenAI Starts with How You Build Around It

Published: November 27, 2025 | By: Abhay Bhargav
Ideal for: Developers

Everyone’s stuck on prompt injection like that’s the whole story.

Spoiler alert: it's not.

It's the most obvious issue, but also the least dangerous one. The real risks show up deeper in the stack: tainted training data, leaky memory, misused outputs, untracked model behavior. And nobody’s watching those layers closely enough.

Meanwhile, GenAI is being shipped into everything (customer-facing features, internal tooling, even backend automation) without proper security design. Engineering is moving fast, business wants results, and security teams are left cleaning up behind decisions they didn’t make, in systems they don’t control.

You are now accountable for protecting models that behave like black boxes. That's both unfair and unsustainable. These models can leak sensitive data, manipulate logic, or silently corrupt outcomes without triggering a single traditional alert. And once they’re in prod, it’s already too late.

Table of Contents

  1. Most teams stop at prompt injection because it’s easy to spot
  2. GenAI expands your attack surface quietly and without warning
  3. What Secure-by-Design looks like for GenAI
  4. Developers can’t control the model but they can control the stack around it
  5. Break up your LLM stack to contain risk before it spreads
  6. Build GenAI security into the workflow so it doesn’t slow anything down
  7. Secure GenAI by treating it like any other high-risk component

Most teams stop at prompt injection because it’s easy to spot

Prompt injection is the flashy problem everyone wants to demo. It’s the one that shows up in slide decks, red-team exercises, and conference talks. You paste a clever prompt, break the system’s guardrails, and prove your point. It’s visual, it’s immediate, and it feels like progress. That’s why so many teams fixate on it.

But in real deployments, prompt injection is just the first layer. And focusing all your defenses here gives you a false sense of coverage. Because once GenAI gets integrated into products, APIs, and internal tools, the attack surface gets much bigger (and a lot harder to see).

Here’s what that looks like when you go deeper:

Prompt injection is just the starting point

Yes, you should care about prompt injection. But unless you’re going deeper, you’re ignoring the majority of exploitable behavior. Here’s how that plays out:

  • Instruction overrides are basic cases, where a user bypasses guardrails by rephrasing inputs or injecting adversarial context. Most models are still vulnerable here without robust input filtering or response validation.
  • Jailbreaks evolve constantly. A prompt that works today might fail tomorrow because it relies on model-specific quirks. Relying on static filters or regex patterns will not keep up.
  • Token smuggling is a more subtle class of attacks where the user exploits how models interpret token boundaries. This includes hidden characters, unicode manipulation, and prompt injection that relies on malformed inputs slipping past sanitizers.
  • Nested prompt injection occurs when untrusted input is embedded into system prompts generated by other services, like chat history, email parsing, or API payloads. This breaks context isolation and introduces second-order vulnerabilities.
  • Output injection and misuse happen when model responses are passed to other systems (like rendering engines, SQL parsers, or other LLMs) without validation. Attackers can craft prompts that produce malicious output designed to exploit those downstream systems.

These risks have already shown up in production systems, from chatbots and RAG pipelines to customer support automation and LLM-powered code tools.
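
To make the token smuggling point concrete, here’s a minimal Python sketch of an input filter that normalizes Unicode and strips zero-width and control characters before a prompt reaches the model. The character list and length cap are assumptions to tune for your own stack; this is a first gate, not a complete defense.

import unicodedata

# Characters commonly abused for token smuggling (assumed list; extend for your stack)
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def sanitize_prompt(raw: str, max_chars: int = 4000) -> str:
    """Normalize and strip hidden characters from untrusted prompt input."""
    # NFKC normalization collapses visually identical characters into one canonical form
    text = unicodedata.normalize("NFKC", raw)
    cleaned = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue  # drop zero-width characters used to hide instructions
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            continue  # drop non-printable control characters, keep newlines and tabs
        cleaned.append(ch)
    # Cap length to limit the room available for injection chaining
    return "".join(cleaned)[:max_chars]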

OWASP LLM Top 10 starts with prompt injection (but it doesn’t end there)

OWASP put prompt injection at the top because it’s real, common, and demonstrable. But most of the other risks in that list (improper output handling, data and model poisoning, unbounded consumption) go completely unaddressed in most enterprise deployments.

Teams assume that prompt injection is the hard part. It’s not. It’s just the one that’s easiest to reproduce in a demo.

Most of the attack surface lives below the prompt layer

When GenAI is integrated into apps, APIs, and internal tools, the model becomes part of a system, and that system has far more exposure points than the prompt window. Consider:

  • How user input is collected, sanitized, and contextualized before it hits the model.
  • Whether model outputs trigger actions, update records, or feed into other services.
  • What memory, history, or embeddings are retained between sessions, and how that can be exploited over time.

You’re not just securing a prompt. You’re securing:

  • The model’s pre- and post-processing layers.
  • The data pipelines that feed it.
  • The application logic that wraps around it.
  • The context memory and system prompts that shape its behavior.
  • The third-party integrations and downstream systems that consume its outputs.

Without visibility into all of that, your defenses are only covering a fraction of the real risk.

Prompt injection is loud, repeatable, and easy to show. But if that’s the only thing you’re defending against, you’re missing 90% of the attack surface. The risks that actually matter, the ones that cause data leakage, model corruption, and systemic abuse, don’t show up in red-team demos. They show up in production when it’s already too late to fix.

GenAI expands your attack surface quietly and without warning

Most GenAI risks don’t come with alerts, logs, or clear failure states. You don’t get a 500 error or a blocked request. The model just does what it was told, or what it thinks it was told, and that’s exactly the problem. The systems behave like they’re working as expected, while they’re exposing data, executing logic they shouldn’t, or pulling context they were never supposed to access.

Training data poisoning creates persistent and hard-to-audit vulnerabilities

Start with the base layer: the model’s training data. Most security teams have zero visibility into what went into the model, whether it was vendor-trained, open source, or fine-tuned internally. And when that data includes unverified sources, biased patterns, or adversarial inputs, it creates model behavior that can't be easily explained or reversed.

Poisoning can come in through multiple vectors:

  • Public code repositories used in fine-tuning, containing subtle logic traps or mislabeled examples
  • Internal documents scraped into training sets without proper data classification, exposing sensitive logic or confidential workflows
  • Embedded trigger phrases or invisible tokens designed to activate specific outputs under attacker-controlled conditions

Once this behavior is learned, there’s no reliable rollback. You can’t patch a neural weight the way you patch a function.

Inference-time leakage is happening in production

Model outputs are treated as stateless text, but they often reflect more than just the current input. Attackers can extract sensitive data, infer model internals, or reconstruct private embeddings using carefully crafted prompts. Here’s what that looks like in practice:

  • Model extraction attacks replay inputs and analyze outputs to approximate proprietary tuning or system behavior
  • Data leakage exploits inject prompts that coax models to reproduce training data or internal examples, including names, credentials, or prior messages
  • Embedding inference abuses vector search to identify semantically similar data or leak indexed content through prompt interaction

Because there’s no standard logging for inference behavior, these attacks don’t show up in SIEM. There’s no anomaly detection unless you build it yourself.

Memory-based exploits in RAG pipelines are easy to launch and hard to detect

Developers love adding memory to GenAI apps. It makes the user experience smoother and the model more useful. But memory isn’t just a UX feature; it’s also an attack vector.

In RAG pipelines, attackers can:

  • Inject misleading or malicious documents that get indexed and embedded into the vector store
  • Reference those poisoned embeddings through carefully designed prompts that cause the model to retrieve and act on them
  • Abuse long-term memory features to persist context across sessions and trigger harmful outputs days or weeks later

The problem here is the glue code around it: how documents are ingested, how retrieval works, how memory is scoped, and how outputs are used. Most of this happens silently, without validation, and without audit trails.
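
To make that glue code visible, here’s a hedged sketch of an ingestion gate that checks a document’s source and size before it’s embedded. The allowlist, size limit, and the injected embed_and_index callable are placeholders for whatever your RAG pipeline actually uses.

from urllib.parse import urlparse

# Assumed allowlist of origins trusted enough to be embedded; tune for your pipeline
ALLOWED_SOURCES = {"wiki.internal.example.com", "docs.internal.example.com"}
MAX_DOC_BYTES = 512_000

def ingest_document(doc_text: str, source_url: str, embed_and_index) -> bool:
    """Gate a document before it enters the vector store. Returns True if indexed."""
    host = urlparse(source_url).hostname or ""
    if host not in ALLOWED_SOURCES:
        return False  # untrusted origin: refuse to index rather than poison retrieval later
    if len(doc_text.encode("utf-8")) > MAX_DOC_BYTES:
        return False  # oversized documents are a common place to bury adversarial instructions
    # embed_and_index stands in for your real embedding call; injected to keep the sketch generic
    embed_and_index(doc_text, metadata={"source": source_url})
    return True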

AI-generated logic fails quietly and carries business impact

The most dangerous failures happen when models generate actions instead of text. When GenAI is used to interpret user input, trigger workflows, or assemble logic flows, hallucinations become business logic.

Consider a developer who connects an LLM to an internal automation tool. The idea is to let the model generate structured commands based on natural language input. Sounds productive, until a user injects a carefully crafted phrase that causes the model to output something like:

{
  "action": "delete_user",
  "user_id": "admin"
}

The downstream system sees valid JSON, matches it against a predefined schema, and executes it without question. No shell injection, no broken auth, and no traditional vuln. Just a chain of tools trusting a model to make the right call, and the model getting it wrong in exactly the wrong way.
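
One way to break that chain of blind trust is an explicit allowlist check between the model and the automation layer. The sketch below is hypothetical: the action names, the protected-account rule, and the dispatcher callable are placeholders for your own policy and tooling.

import json

# Assumed policy: the only actions model-generated commands may ever trigger
ALLOWED_ACTIONS = {"create_ticket", "update_status"}
PROTECTED_USERS = {"admin", "root"}

def execute_model_command(raw_output: str, dispatcher) -> None:
    """Validate model-generated JSON before handing it to the automation layer."""
    try:
        command = json.loads(raw_output)
    except json.JSONDecodeError:
        raise ValueError("Model output is not valid JSON; refusing to execute")

    action = command.get("action")
    target = command.get("user_id", "")

    if action not in ALLOWED_ACTIONS:
        # delete_user never makes the allowlist, so the example above dies here
        raise PermissionError(f"Action {action!r} is not permitted for model-generated commands")
    if target in PROTECTED_USERS:
        raise PermissionError("Model-generated commands may not target protected accounts")

    # dispatcher is your real automation hook, injected to keep this sketch self-contained
    dispatcher(action, command)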

These are the exact setups being deployed in enterprise automation, customer support, and developer tools.

Autocomplete helpers. AI-driven suggestions. Smart summaries. These aren’t features you’d traditionally flag as security-critical. But once they accept untrusted input, store memory, and generate structured output, they become viable attack surfaces.

Attackers don’t need to exploit CVEs when they can steer model behavior and influence business logic with zero visibility or control enforcement.

What Secure-by-Design looks like for GenAI

Securing GenAI starts at architecture. By the time you’re trying to validate outputs or bolt on filters, you’ve already lost control of the system. The right approach is to design GenAI components with the same discipline you use for any other critical system: define boundaries, control inputs, limit what’s allowed to execute, and monitor what gets reused.

This is about treating the model as just another component in a larger pipeline, one that can fail, be manipulated, or misbehave depending on what it’s given and how it’s connected.

Treat the model like a component

LLMs behave deterministically only within a narrow set of conditions. Outside of that, they’re probabilistic systems, and their behavior can’t be guaranteed. That doesn’t mean they’re untrustworthy; it just means they need to be scoped like any other untrusted service.

Start with how the model is embedded in your stack:

  • What inputs reach it?
  • What other services interact with it?
  • How are its outputs interpreted or acted on?

Treat it like you would any third-party service. Validate all input. Authenticate access. Sanitize and restrict output interpretation. And assume that in some edge cases, the model will generate something wrong, unsafe, or misleading.

Define trust boundaries explicitly

This is the part most implementations skip. When you don’t define trust boundaries, the system starts to assume all inputs are clean and all outputs are safe, and that’s how attackers slip in. You need clear enforcement between:

  • Prompt inputs from end users, internal APIs, or generated templates
  • Memory systems including RAG context, vector stores, or session histories
  • Plugins or tools the model can call (internal APIs, third-party services, automation hooks)
  • Output consumers, especially systems that treat model output as executable logic or structured commands

Each one of those paths needs its own boundary. That means context validation, input restrictions, output checks, and rate-limiting where applicable. Models shouldn’t be given unchecked access to sensitive systems, and their outputs should never be consumed blindly by downstream logic.

Threat modeling for RAG requires special attention

RAG is useful, flexible, and increasingly common. It also expands your threat surface considerably. Here, you’re injecting structured context into the model’s response generation pipeline. That means any issue in the retrieval logic becomes an influence vector on model behavior. Here’s where threat modeling needs to focus:

  • Vector database poisoning: Attackers inject malicious documents into the embedding pipeline. These documents persist and get retrieved later, influencing outputs silently.
  • Hallucinated retrievals: When the model can’t find an answer, it may fabricate one. Without validation on what was actually retrieved vs. what was generated, systems start returning made-up data with no confidence control.
  • Prompt chaining abuse: In multi-step workflows, untrusted outputs from one prompt are passed as inputs to the next. This can escalate into logic injection or data leakage unless properly scoped and sanitized.

In secure design reviews, these RAG-specific concerns should be mapped to technical controls, such as isolation between queries, rate-limiting retrieval, validating document source and freshness, and confirming output fidelity before any downstream system consumes it.

Frameworks like NIST AI RMF and OWASP LLM Top 10 give you a starting point

You don’t have to start from scratch. The NIST AI Risk Management Framework gives you a structure to define and track AI-specific risks across governance, data, performance, and security. It forces teams to answer real questions: What are the system’s intended uses? What controls are in place for misuse? How is risk documented and communicated across teams?

OWASP’s LLM Top 10 is a complementary list that’s narrower and more tactical. It covers issues like prompt injection, training data poisoning, insecure output handling, and unauthorized plugin access. Use it as a checklist, not a compliance badge.

Together, these two frameworks give technical leaders the vocabulary and the structure to ask the right questions in design reviews, security assessments, and deployment planning.

This is what Secure-by-Design looks like for GenAI

Security reviews start with architecture. You need to design for:

  • Input validation across all user-facing and system-generated prompts
  • Explicit isolation between memory, session, and context retrieval layers
  • Guardrails on model behavior when connected to automation or logic systems
  • Monitoring and logging across model decisions, retrieval sources, and output actions
  • Human-in-the-loop or escalation paths for safety-critical outputs

None of this works when it’s treated as a post-deployment checklist. It has to be part of the system design from the first sprint with clear threat models, scoped trust boundaries, and enforcement built into how the pipeline moves data.

Start building systems where the inputs, logic paths, and access layers are controlled by design. That’s what gives you predictability. That’s how you make GenAI secure enough to scale.

Developers can’t control the model but they can control the stack around it

Security teams might own the risk, but it’s developers who control how GenAI systems are built, connected, and run. And that’s where the leverage is. You don’t need full visibility into the model internals to make it safer. What matters is how the inputs are handled, how the outputs are used, and what protections sit between them.

This is the part that gets missed when GenAI is treated like a black box API. The model isn’t secure by default. The stack around it has to make up for what the model can’t enforce. And that starts with basic, testable controls.

Treat prompts like any other untrusted input

LLMs are designed to respond to prompts, and that makes the prompt your attack surface. Anything coming from users, systems, APIs, or previous model responses should be treated as untrusted until it’s validated. Basic hygiene applies here:

  • Strip known injection payloads using tokenizer-aware filters
  • Normalize inputs to reduce adversarial formatting
  • Escape or block control tokens that affect model behavior
  • Enforce input schemas where applicable

This won’t eliminate every injection attempt, but it gives you a clear enforcement point to reject malformed or suspicious inputs before they reach the model.
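
As one possible enforcement point, here’s a minimal sketch of an input contract: length limits, an explicit request shape, and a cheap pattern gate. The patterns and limits are assumptions; as noted earlier, static filters won’t keep up with evolving jailbreaks on their own.

import re
from dataclasses import dataclass

# Assumed patterns worth rejecting outright; real deployments would tune and extend these
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

@dataclass
class PromptRequest:
    user_id: str
    message: str

def validate_prompt(request: PromptRequest, max_len: int = 2000) -> str:
    """Enforce a simple input contract before the prompt reaches the model."""
    msg = request.message.strip()
    if not msg or len(msg) > max_len:
        raise ValueError("Prompt is empty or exceeds the allowed length")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(msg):
            # A cheap first gate, not a complete defense against injection
            raise ValueError("Prompt matches a known injection pattern")
    return msg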

Use built-in controls to limit how the model behaves

Models give you more control than most teams use. You can shape output length, stop it early, block token combinations, or apply output filters based on policy. Make sure your application enforces:

  • Max token limits to cap how much the model can say, which reduces exposure to long-form hallucinations and injection chaining
  • Stop sequences to prevent overrun into sensitive or unintended response content
  • Output type constraints, especially where the model generates structured formats (JSON, code, commands)

Most of these controls can be implemented at the SDK or API level, and they take minutes to set up. They won’t block all abuse, but they make it harder for attackers to stretch model outputs into dangerous territory.
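
As an example of how little code this takes, here’s a sketch assuming the OpenAI Python SDK; max_tokens, stop, and temperature are real parameters there, but the model name, limits, and stop sequence are placeholders you’d set per use case.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(user_text: str) -> str:
    """Call the model with explicit output constraints instead of provider defaults."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Summarize the user's text in plain English."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=300,    # cap response length to limit hallucination and injection-chaining room
        stop=["\n\n###"],  # assumed stop sequence marking the end of a summary
        temperature=0.2,   # lower randomness for more predictable, reviewable output
    )
    return response.choices[0].message.content or ""

The same constraints exist under different names in most provider SDKs; the point is to set them explicitly instead of inheriting defaults.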

Use post-processing to catch unsafe or unexpected outputs

Once the model responds, the job isn’t done. You still need to verify what it returned before any part of your system consumes it. That includes:

  • Format validation: Make sure the output matches expected structure, encoding, or schema
  • Policy enforcement: Use regex, classifiers, or content filters to catch responses that violate rules (like including credentials, links, or restricted terms)
  • Context checks: In chained workflows, confirm that outputs from one step are safe to use as input to the next

This is where constitutional AI and other post-processing frameworks help. They give you a second layer of review that doesn’t rely on the model getting everything right.
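
Here’s a minimal post-processing sketch that combines format validation with a simple policy filter. The credential pattern and required keys are assumptions; in practice you’d layer classifiers or a moderation API on top.

import json
import re

# Assumed policy filter; extend with classifiers or provider moderation endpoints as needed
CREDENTIAL_PATTERN = re.compile(r"(api[_-]?key|password|secret)\s*[:=]", re.IGNORECASE)

def check_structured_output(raw: str, required_keys: set) -> dict:
    """Reject model output that is malformed or violates a simple content policy."""
    if CREDENTIAL_PATTERN.search(raw):
        raise ValueError("Model output appears to contain credential-like content")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("Model output is not the JSON structure we asked for") from exc
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"Model output is missing required keys: {sorted(missing)}")
    return data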

Always treat model output as tainted until verified

This should be your default posture. Model responses are not facts, nor are they trusted logic. They are generative guesses shaped by inputs you often can’t see and training data you don’t control.

So in every part of the stack, make sure model outputs are treated the same way you’d treat data from an unknown source on the internet:

  • Don’t execute model-generated commands directly
  • Don’t persist outputs without sanitization
  • Don’t use responses to drive logic, billing, or workflow state without validation

If a response affects anything critical, from API calls to user messages to UI rendering, it needs to go through the same scrutiny you’d apply to any external input.

Developers can’t block prompt injection, but they can control the blast radius

Attackers will keep trying to manipulate prompts. That part won’t stop. But what developers can do is build systems that don’t blindly trust the model, limit the damage of a bad response, and give security teams the control points they need to monitor and respond. These are controls you can implement now:

  • Validate and sanitize all prompts before execution
  • Constrain model behavior using token limits, stop sequences, and structured formats
  • Verify and filter outputs before use
  • Never treat model output as trusted logic or clean data

You don’t have to solve GenAI risk all at once. You just have to make sure your stack doesn’t make it worse. That’s the part developers own, and that’s the part that can be fixed.

Break up your LLM stack to contain risk before it spreads

Putting everything behind a single LLM (memory, logic, decision-making, access to internal tools) is one of the fastest ways to turn a useful feature into a high-risk architecture. It may work in dev. It may even survive QA. But the second that model behaves unexpectedly or gets manipulated by user input, the entire system is exposed.

Hardening the model helps, but it’s not enough. You need to design for failure. That means isolating the impact of bad responses and controlling what the model can influence.

Separate the core components

Think about how GenAI is wired into your system. If the same model handles user prompts, queries memory, makes decisions, and calls APIs, you’re giving it full control over the flow, and there’s no guardrail when something goes wrong. Split the architecture into discrete components:

  • Retrieval handles document or data lookup from approved sources, with scoped access and filters in place.
  • Generation uses that context to produce a response, but doesn’t make decisions or take action.
  • Post-processing validates, filters, and transforms output into a safe and structured format.
  • Execution takes validated output and passes it into approved workflows or automation logic.

Each of these stages should have its own enforcement points, logs, and limits. This turns one large failure domain into four smaller, observable, and controllable ones.
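
A hedged sketch of that separation; every callable here is a placeholder for your own retrieval, generation, validation, and execution services.

def handle_request(user_query: str, retriever, generator, validator, executor) -> dict:
    """Run a GenAI request through four isolated stages instead of one monolithic call.

    Each stage is a separate component with its own logs and limits; the callables
    passed in stand in for your real services.
    """
    # 1. Retrieval: scoped lookup against approved sources only
    context_docs = retriever(user_query)

    # 2. Generation: produce a response from query plus context, nothing more
    draft = generator(user_query, context_docs)

    # 3. Post-processing: validate and transform before anything downstream sees it
    safe_output = validator(draft)

    # 4. Execution: only validated, structured output reaches approved workflows
    return executor(safe_output)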

Apply role-based controls to model capabilities

Not every model call needs the same permissions. An LLM writing a summary doesn’t need access to billing systems or admin APIs. A chatbot responding to customer questions shouldn’t be able to modify data in a CRM. You can apply role-based access control to:

  • The tools the model can call (via plugins, APIs, or action wrappers)
  • The types of inputs it’s allowed to process
  • The sources of context or memory it can retrieve from
  • The types of outputs it’s permitted to return (e.g. commands, structured data, text)

Set these scopes based on function, tie them to service identities or app roles, and enforce them in the orchestration layer instead of inside the model.
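
A minimal sketch of that enforcement, assuming a role-to-tool mapping maintained in the orchestration layer; the role names, tool names, and registry are hypothetical.

# Assumed mapping of application roles to the tools the model may invoke on their behalf
TOOLS_BY_ROLE = {
    "support_bot": {"lookup_order", "create_ticket"},
    "summarizer": set(),  # no tool access at all
    "internal_automation": {"create_ticket", "update_status"},
}

def invoke_tool(role: str, tool_name: str, args: dict, registry: dict):
    """Enforce tool scoping in the orchestration layer, not inside the model."""
    allowed = TOOLS_BY_ROLE.get(role, set())
    if tool_name not in allowed:
        raise PermissionError(f"Role {role!r} may not call tool {tool_name!r}")
    # registry maps tool names to real callables; kept abstract in this sketch
    return registry[tool_name](**args)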

Version lock the model and audit its outputs

Too many teams run LLMs in production with dynamic weights or unmanaged versions. That creates inconsistency, removes accountability, and makes incident response nearly impossible when something goes wrong. Instead:

  • Lock the model version in production, and document the exact config in use.
  • Capture full input/output pairs along with metadata like prompt type, auth role, and retrieved context.
  • Store logs in a way that lets you trace a response back to the retrieval that shaped it, the prompt that triggered it, and the identity that initiated it.

This is your audit trail. It’s what helps you debug unexpected behavior, track prompt misuse, and explain outcomes during reviews or post-incident investigations.
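
A sketch of what one audit record can look like, assuming newline-delimited JSON written to whatever log sink you already run; the field names are illustrative, not a standard.

import json
import time
import uuid

def log_inference(log_stream, *, model_version: str, prompt: str,
                  retrieved_ids: list, output: str, caller_identity: str) -> str:
    """Append a structured record linking prompt, retrieval, output, and identity."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,      # the locked version actually serving traffic
        "caller": caller_identity,           # service identity or app role that made the call
        "prompt": prompt,
        "retrieved_doc_ids": retrieved_ids,  # ties the response back to what shaped it
        "output": output,
    }
    log_stream.write(json.dumps(record) + "\n")
    return record["trace_id"]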

Design for containment because prevention will eventually fail

Even with the best input sanitization and output filtering, attacks will get through. A model will misbehave, a prompt will leak context, and a user will inject something clever and trigger a weird result.

The difference between a security event and a breach is containment.

When you break LLM workflows into independent stages, enforce strict boundaries between components, and log behavior across each step, you give yourself space to respond before things escalate.

And when something goes sideways, you’ll be able to see what happened, limit the impact, and fix it without ripping the entire system apart.

This isn’t about limiting capability, but about making sure your AI systems fail safely and visibly. That’s how you scale GenAI without compromising everything it touches.

Build GenAI security into the workflow so it doesn’t slow anything down

GenAI features move fast because they’re easy to prototype, easy to deploy, and business wants them everywhere. The security work can’t sit in a backlog or wait for a quarterly review. The only way to keep up is to automate how GenAI risk gets caught and addressed inside the workflows developers already use.

Automate architecture reviews where GenAI is being used

You can’t review what you can’t see. The first problem most security teams hit with GenAI adoption is visibility. LLMs get embedded into flows, stitched into APIs, or added as sidecar services, and security doesn’t know they’re there until they’re in prod.

Tools like SecurityReview.ai help automate this by detecting where and how LLMs are being used inside the architecture. They flag:

  • Whether untrusted user input is being passed into models without validation
  • If model outputs are being executed or passed into critical systems
  • Where memory, retrieval, or context injection is active without scope enforcement
  • If third-party plugins or APIs are being invoked directly through model responses

This level of automation turns GenAI use into a traceable part of the system instead of a hidden risk that security finds too late.

Use CI pipelines to catch risky prompts and dangerous integrations early

You don’t need to wait until deployment to spot unsafe GenAI behavior. CI is the perfect place to catch dangerous patterns while the code is still moving. Embed checks that review:

  • Hardcoded prompts that include unescaped user input
  • Integration code where LLMs interact with backend systems or privileged APIs
  • Misconfigurations like open access to memory stores, plugin scaffolds, or tool wrappers

These checks give teams fast, actionable feedback while the code is still in their hands, and they reduce the chance of something risky slipping through just because it worked during testing.
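
What such a check might look like is a small heuristic scan that fails the build when it finds prompt strings interpolating request data. The regex and variable names below are guesses you’d adapt to your codebase, and they complement rather than replace proper review.

import pathlib
import re
import sys

# Heuristic only: flag f-strings that look like prompts and interpolate request or user data
PROMPT_FSTRING = re.compile(r'f["\'].*\{(request|user_input|params)\.', re.IGNORECASE)

def scan(repo_root: str = ".") -> int:
    findings = 0
    for path in pathlib.Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            if PROMPT_FSTRING.search(line):
                print(f"{path}:{lineno}: possible unescaped user input in a prompt string")
                findings += 1
    return findings

if __name__ == "__main__":
    # Non-zero exit fails the CI job so the review happens before merge
    sys.exit(1 if scan() else 0)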

Log model outputs and monitor for anomalies like you would any other critical system

Model outputs don’t come with confidence scores or error codes. That makes observability even more important. At minimum, make sure your stack logs:

  • Inputs to the model, including session or context data
  • Retrieved content for RAG workflows
  • Raw model outputs before filtering or transformation
  • Actions triggered as a result of those outputs (API calls, DB writes, workflow events)

Once you have that telemetry, you can start looking for outliers: unusually long responses, outputs with unexpected structure, or sequences of prompts that generate repeated failures.
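
Even a crude baseline helps here. The sketch below flags responses whose length is far outside what an endpoint normally produces; the window size and threshold are arbitrary starting points, and real deployments would add structural and content checks on top.

from statistics import mean, pstdev

def flag_output_anomaly(recent_lengths: list, new_output: str, z_threshold: float = 3.0) -> bool:
    """Flag a response whose length is a statistical outlier for this endpoint."""
    if len(recent_lengths) < 30:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(recent_lengths), pstdev(recent_lengths)
    if sigma == 0:
        return len(new_output) != mu
    z_score = abs(len(new_output) - mu) / sigma
    return z_score > z_threshold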

This kind of monitoring won’t prevent misuse, but it gives you a way to detect it early and prove it happened when it matters.

Velocity and security can coexist, but only if security moves first

The biggest mistake security teams make with GenAI is trying to catch up after the fact. By that point, the model is in production, the behavior is baked in, and every fix comes with regression risk.

Security needs to be part of the delivery workflow from day one, not to approve every line of code, but to make sure the systems getting shipped don’t quietly introduce risk nobody accounted for.

This only works when you automate the checks, build into the pipeline, and make GenAI security a normal part of how features ship.

It’s not about slowing teams down. It’s about making sure they don’t ship something they’ll regret.

Secure GenAI by treating it like any other high-risk component

LLMs don’t behave like traditional software components. They don’t fail in ways that security teams are used to detecting, and they don’t respect boundaries you haven’t explicitly defined. That’s where the real risk lives, in assumptions that go unchecked while the system appears to work.

The biggest misconception is that you can secure GenAI by controlling the model. You can’t. You secure it by controlling the systems around it: the inputs, the access layers, and the decisions made with its output. That’s the shift technical leaders need to make. Because once these models are part of your architecture, they’re part of your risk surface, and that risk compounds fast.

Now is the time to set that foundation, not later when the model is already in production and the failure is quiet.

To help your team build that foundation, AppSecEngineer now offers hands-on training to skill up engineers and security leaders on AI and LLM security. These are real-world labs that walk through threat modeling, prompt injection defenses, system architecture risks, and secure GenAI design patterns. It’s how you turn awareness into practice.

GenAI security is now crucial. Start treating it like a systems problem, and you’ll stay ahead of what’s coming.

Abhay Bhargav

Blog Author
Abhay builds AI-native infrastructure for security teams operating at modern scale. His work blends offensive security, applied machine learning, and cloud-native systems focused on solving the real-world gaps that legacy tools ignore. With over a decade of experience across red teaming, threat modeling, detection engineering, and ML deployment, Abhay has helped high-growth startups and engineering teams build security that actually works in production, not just on paper.