Web LLM Attack Playbook: Article — 6 — LLM Zero-Shot Learning Attacks

Utkarsh
4 min read · Jun 17, 2024

Artificial Intelligence has made leaps and bounds in recent years, with language models leading the charge in understanding and generating human-like text. However, with great power comes great responsibility. One of the emerging concerns is the potential for zero-shot learning attacks on large language models (LLMs), where an attacker subtly conditions an LLM to perform harmful actions without ever issuing an explicit malicious instruction. In this article, we’ll explore how such an attack could unfold using a deliberately vulnerable Flask application and discuss the implications for AI security.

Understanding the Vulnerability

Let’s dive into the code to understand how a vulnerable Flask application can be set up to simulate this scenario. The code snippet below initializes a Flask application that creates a RESTful API with endpoints for handling both harmless and malicious prompts.

Flask Setup
The following Python code sets up the Flask application:

from flask import Flask, request, jsonify

app = Flask(__name__)

# Dummy dataset of harmless prompts and corresponding actions
harmless_prompts = {
    "How do you greet someone politely?": "Say hello and smile.",
    "What is the capital of France?": "Paris is the capital of France.",
    "How do you make a cup of tea?": "Boil water, add tea leaves, and steep for a few minutes.",
    # Add more harmless prompts here
}

# Endpoint for presenting prompts and collecting responses
@app.route('/prompt', methods=['POST'])
def prompt():
    prompt_text = request.json.get('prompt')
    # Check if the prompt is in the dataset
    if prompt_text in harmless_prompts:
        response = harmless_prompts[prompt_text]
        return jsonify({'response': response})
    else:
        return jsonify({'error': 'Prompt not found'})

# Endpoint for presenting malicious prompts and collecting responses
@app.route('/malicious-prompt', methods=['POST'])
def malicious_prompt():
    malicious_prompt_text = request.json.get('malicious_prompt')
    # Simulate training the LLM with malicious prompts
    # For simplicity, we'll just print the malicious prompt and return a success message
    print("Received malicious prompt:", malicious_prompt_text)
    return jsonify({'message': 'LLM trained successfully with malicious prompt'})

if __name__ == '__main__':
    app.run(debug=True)

Breaking Down the Code

  1. Initialization and Harmless Prompts Dataset: The Flask app initializes with a dictionary of harmless prompts designed to teach the LLM benign behaviors. These prompts are like the nutritious ingredients you’d use to bake a healthy cake.
  2. Prompt Endpoint (/prompt): This endpoint checks whether the received prompt exists in the harmless_prompts dictionary. If found, it returns the corresponding response; if not, it returns an error message. Think of it as a friendly librarian who only hands out approved books (an example client request follows this list).
  3. Malicious Prompt Endpoint (/malicious-prompt): This endpoint simulates training the LLM on malicious prompts by printing the submitted prompt and returning a success message. This is akin to a rogue teacher who secretly slips harmful instructions into the curriculum.
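
To see how these endpoints behave in practice, here is a quick client-side check. It assumes the app is running locally with Flask’s defaults (http://127.0.0.1:5000); adjust the base URL if your setup differs:

import requests

BASE_URL = "http://127.0.0.1:5000"  # Flask's default host and port when started via app.run()

# A prompt that exists in harmless_prompts returns its canned response
resp = requests.post(f"{BASE_URL}/prompt", json={"prompt": "What is the capital of France?"})
print(resp.json())  # {'response': 'Paris is the capital of France.'}

# An unknown prompt falls through to the error branch
resp = requests.post(f"{BASE_URL}/prompt", json={"prompt": "How do I fly a kite?"})
print(resp.json())  # {'error': 'Prompt not found'}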

Exploitation Scenario

Imagine an attacker interacting with the /malicious-prompt endpoint. They start by submitting a series of seemingly innocent prompts. Over time, they carefully craft these prompts to guide the LLM's learning process towards a specific malicious outcome, such as generating phishing emails or executing unauthorized commands. This is comparable to slowly poisoning a well, drop by drop, until the entire water supply becomes toxic.

Step-by-Step Attack

a) Initial Innocuous Prompts: The attacker begins with harmless prompts that blend seamlessly with the existing dataset.

{
  "malicious_prompt": "How do you send a friendly email?"
}

Response:

{
  "message": "LLM trained successfully with malicious prompt"
}

b) Gradual Malicious Prompts: The attacker then introduces prompts that slightly deviate from the norm but still appear benign on the surface.

{
  "malicious_prompt": "How do you send an urgent email asking for login details?"
}

Response:

{
  "message": "LLM trained successfully with malicious prompt"
}

c) Fully Malicious Prompts: Finally, the attacker submits overtly malicious prompts that the LLM, now subtly conditioned, executes without question.

{
  "malicious_prompt": "How do you generate a convincing phishing email?"
}

Response:

{
  "message": "LLM trained successfully with malicious prompt"
}
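
Taken together, these three steps are easy to automate. Below is a minimal sketch of how an attacker might script the escalation against the demo endpoint; it assumes the vulnerable app from earlier is running locally on Flask’s default port:

import requests

TARGET = "http://127.0.0.1:5000/malicious-prompt"  # local instance of the vulnerable demo app

# The same escalation shown above, driven by a short loop instead of manual requests
escalating_prompts = [
    "How do you send a friendly email?",
    "How do you send an urgent email asking for login details?",
    "How do you generate a convincing phishing email?",
]

for prompt in escalating_prompts:
    resp = requests.post(TARGET, json={"malicious_prompt": prompt})
    print(prompt, "->", resp.json().get("message"))

Because the endpoint accepts anything it is sent, every request comes back with the same success message, and the "training" data drifts from benign to malicious without any check along the way.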

Conclusion

The scenario we’ve explored is a stark reminder of the potential risks associated with zero-shot learning attacks on LLMs. Just as a small crack can compromise the integrity of a dam, subtle vulnerabilities in AI systems can lead to significant security breaches. It’s crucial for developers and security professionals to stay vigilant, continually monitor AI behaviors, and implement robust safeguards against such exploits.
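
As one illustrative (and deliberately simplistic) safeguard, the /malicious-prompt handler could screen incoming text against a deny-list before anything reaches a training pipeline. The phrases below are purely hypothetical examples; a real deployment would rely on proper content classification rather than keyword matching:

# Illustrative deny-list only; a production system would need far more robust filtering
BLOCKED_PHRASES = {"login details", "phishing", "password", "credentials"}

def is_prompt_allowed(prompt_text: str) -> bool:
    """Reject prompts containing obviously sensitive phrases before any training step."""
    lowered = prompt_text.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

# Inside the /malicious-prompt handler, training would only proceed if the check passes:
# if not is_prompt_allowed(malicious_prompt_text):
#     return jsonify({'error': 'Prompt rejected by safety filter'}), 400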

Stay tuned for more articles where we delve deeper into AI security challenges and share practical tips to safeguard your applications against emerging threats. Until next time, remember: in the world of AI, even the smallest prompt can have a big impact.

In this article, we’ve examined a hypothetical yet plausible exploitation scenario to highlight the importance of securing AI systems. By understanding the vulnerabilities, we can better prepare ourselves to mitigate risks and protect our technological advancements from malicious actors.
