solutions
Dec 4, 2024

LLMs are People in the Context of Data Security

LLMs are People in the Context of Data Security

Posted by

Matthew Rogers, Product Marketing Manager

In light of recent developments, it has become increasingly clear that the basic guardrails, sanity checks, and prompt-based security measures currently in place are porous and inadequate. As we brainstorm strategies to improve data security in AI workloads, it’s crucial to shift our perspective and begin thinking of AI as akin to a person, susceptible to social engineering attacks. This analogy can help us better understand the vulnerabilities and threats that AI systems face and allow us to craft more robust security measures.

images

Data in Modern Generative AI Workloads

Modern generative AI workflows fundamentally transform how data is stored and accessed. Unlike traditional data storage where metadata and access controls remain tightly coupled with the data, the AI pipeline fragments and redistributes data across multiple components: the language model itself, prompts, embedding databases, vector stores, and generated outputs. This architectural shift means conventional security controls like Active Directory groups and Access Control Lists (ACLs) become ineffective, as these controls don’t propagate through the AI transformation process. A new security paradigm is needed to protect data across this distributed AI ecosystem.

images

Source: Retrieval-Augmented Generation for AI-Generated Content: A Survey

The Generative AI Security Challenge

Chatbots powered by Large Language Models (LLMs) are designed to be helpful, making them susceptible to tricks and lies that can lead to the disclosure of sensitive information or the bypassing of security controls.

Social engineering AI has emerged as a novel and concerning attack vector. Unlike traditional threats, AI systems can be exploited through carefully crafted prompts to access or reveal protected data.

The security challenges posed by Generative AI are multifaceted and complex. At their core, these systems are not inherently secure - the models themselves may inadvertently retain and expose sensitive training data. This risk is compounded by the fact that traditional security measures and access controls that organizations rely on don’t cleanly map to AI interactions.

Particularly concerning is the emergence of prompt injection attacks, where carefully crafted inputs can manipulate AI systems into revealing protected information. Our existing security tools and frameworks weren’t built with AI-specific vulnerabilities in mind, leaving dangerous gaps in our defenses. As AI adoption accelerates, organizations urgently need new approaches and frameworks for evaluating and managing these unique security risks.

While security professionals have attempted to implement prompt-based security measures, these solutions have proven inadequate. Common approaches include: 

  • Adding security instructions to system prompts 

  • Implementing keyword filters and output scanning 

  • Using prompt templates and validation 

  • Monitoring for suspicious interaction patterns 

  • Rate limiting and access controls 

However, these measures can often be circumvented through creative prompt engineering, context manipulation, or by exploiting the AI’s tendency to be helpful. More robust, systematic approaches to AI security are needed that treat AI systems with the same security rigor applied to human users.

Attacks on Generative AI Workloads like ChatGPT 

Here are some example of how people are jailbreaking the prompt-based security measures on ChatGPT. We’re not picking on ChatGPT, as this is a problem that impacts all Generative AI workloads. ChatGPT is just the most popular one (and the biggest target).

images

Source: https://x.com/mixedenn/status/1845939748235628564 

images

Source: https://x.com/mixedenn/status/1847035711985274947

images

Source: https://x.com/mixedenn/status/1853540570246885746

Solving for People and LLMs 

When it comes to securing data in Generative AI workloads, we must return to a fundamental truth: the only guaranteed way to protect data from AI systems is the same approach we use to secure data from people.  

Just as we carefully control human access to sensitive information through robust authentication and authorization mechanisms, we must apply equivalent protections to AI systems that interact with our data.  You can’t reveal what you don’t know and don’t have access to. 

Ensure that individuals have the appropriate access to data in line with Zero Trust principles. Implement security controls at the LLM, embedding, vector store, and database levels. Additionally, log and audit all data access. 

Example Scenarios 

While the following examples are contrived, they highlight the need for security controls at each level of the AI pipeline. Even if you use a general purpose LLM, the data that it is interacting with may be sensitive and require the same protections. 

A Security Engineer may have access to an LLM tuned with Security data, and the generative AI RAG pipeline will have access to further data. The security controls must be applied at each level. 

A Marketing Manager may have access to an LLM tuned with Marketing data, and the generative AI RAG pipeline will have access to further data. The security controls must be applied at each level. 

These individuals might have simultaneous access to supplementary information like HR policies, procedures, and other corporate data. Addressing this issue is not straightforward. 

Solving, the VAST Data Way 

VAST Data is the only solution that can provide a comprehensive security framework for Generative AI workloads at ExaScale solving the problem of data sprawl, and the inherent security risks of data being stored in multiple locations. 

The VAST InsightEngine (available 2025) transforms the VAST Data Platform into a centralized authority for managing complex, multi-layered file authentication. By combining secure data pipelines with robust auditing capabilities, it ensures real-time visibility into access and operational events. This dual focus on security and accountability makes the InsightEngine a cornerstone for AI-driven operations, meeting the demands of modern AI ecosystems with unmatched precision, scalability, and compliance.

images

💣 Prompt based attacks like above will continue to exist, but their value is lost if the LLM or the Person has no “unauthorized data” to divulge.   

VAST Data turns securing your AI science project into a simple engineering exercise.  

Enterprises and Security experts are building with VAST Data, and you should too.

See the VAST InsightEngine:

More from this topic

Learn what VAST can do for you
Sign up for our newsletter and learn more about VAST or request a demo and see for yourself.

By proceeding you agree to the VAST Data Privacy Policy, and you consent to receive marketing communications. *Required field.