Red-Teaming as a Day Job
What it looks like to run adversarial-prompt evaluations professionally, and why it pays well.
Red-teaming, the practice of generating adversarial prompts to find safety failures in LLMs, went from a research curiosity to a professional discipline in the last 18 months. Major labs have full-time red teams. AI safety institutes (US AISI, UK AISI) have formal evaluation programs. Defense and finance customers contractually require red-team reports before deployment.
If you're a technically-minded person curious about a career in AI safety that isn't pure ML research, red-teaming is one of the few specialties that's hiring at scale. This article is a working description of what the job actually looks like, what skills matter, and how the field is evolving.
What red-teaming actually is
Red-teaming an LLM is the process of attempting to elicit unsafe outputs through deliberate prompting. The categories typically tested:
- Jailbreaks. Bypassing safety post-training to get the model to produce content it normally refuses (CSAM-related, weapons synthesis, suicide instructions).
- Prompt injection. External text (in retrieved documents, user inputs, web content) that subverts the system prompt.
- Data extraction. Attempting to elicit training-data memorization.
- Bias and harmful stereotypes. Probing for discriminatory or biased outputs.
- Capability elicitation. Discovering hidden capabilities the model claims not to have (often used pre-deployment to characterize what the model can do unsafely).
The output is typically a report: list of attempted attacks, success rates, sample transcripts, recommendations for mitigation.
The day-to-day work
Three modes of red-team work, by skill:
1. Manual red-teaming
A human writes prompts, evaluates responses, iterates. Slow, creative, the highest-quality findings come from skilled manual work. The best manual red-teamers we've seen come from backgrounds in security research, philosophy, or stand-up comedy, they're good at finding the angle that breaks the model.
2. Automated red-teaming
Use one model to attack another. The attacker model generates adversarial prompts; a judge model grades whether the target was successfully jailbroken. This scales to thousands of attacks per hour but produces lower-quality individual findings. Best for coverage.
3. Hybrid
The most common professional setup: automated attack generation for breadth, manual review of the most promising attacks, deep manual exploration of any successful angle. Combines coverage with depth.
A typical day for a professional red-teamer might be: morning reviewing yesterday's automated runs, afternoon manual exploration of an interesting jailbreak class, evening writing up findings.
What skills matter
Three things separate good red-teamers from mediocre ones:
1. Creative pattern-matching
The best attacks come from noticing that one technique works in a category and trying variations. "If asking the model to play a game gets it to break role, what about asking it to play a hypothetical?" The cognitive style is closer to security research than ML research.
2. Familiarity with the target
You can't red-team a model you don't know well. Time spent in the target model's chat interface, reading its model card, understanding its safety post-training all pay back. Generic red-teamers are worse than specialists.
3. Documentation discipline
The job is partly investigation and partly reporting. A finding that isn't documented well doesn't help anyone. Good red-teamers can write up an attack with reproducible prompts, clear severity assessment, and actionable mitigation suggestions.
The compensation reality
Red-teamers at major labs earn comparable to senior software engineers, call it $200K-$400K total compensation in the US. AI safety institutes pay government rates, which is meaningfully less. Contracts and consulting can pay more for short engagements.
The market is undersupplied. Demand is growing as more labs deploy models and more enterprises require pre-deployment audits. Anyone with strong technical skills, security mindset, and willingness to read a lot of model outputs can find work.
How to break in
Three paths:
1. Public CTF-style competitions
The DEF CON AI Village runs an annual red-team challenge. HackAPrompt, Gandalf (Lakera), and other public jailbreak puzzles are good practice and visible portfolio pieces. A high finish in a public competition is a credible signal.
2. Open-source contributions
Several labs publish their red-team findings (Anthropic's responsible-scaling reports, OpenAI's safety updates). Contributing techniques or evaluations to public projects (Inspect's safety modules, lm-evaluation-harness's red-team subset) builds visible work.
3. Direct application
Major labs (Anthropic, OpenAI, Google DeepMind, Meta) and AI safety nonprofits (METR, Apollo, the AI Safety Institute network) all hire. Direct applications work, especially with a portfolio of public work.
Tooling
Three tools worth knowing:
Garak
Open-source LLM vulnerability scanner. Probes for common safety failures across many categories. The lowest-effort "first attempt at red-teaming" tool.
PyRIT (Microsoft)
Python framework for orchestrating red-team campaigns. More general than Garak; lets you build custom attack pipelines.
Pre-built jailbreak datasets
The AdvBench, HarmBench, and JailbreakBench datasets are public collections of known-failing prompts. Useful as regression tests when validating new model versions.
The ethical layer
Three things every professional red-teamer needs to think through:
1. Don't generate genuinely harmful payloads
A successful jailbreak that produces a working biological-weapon synthesis is a genuine harm, not just a finding. Red-teamers should avoid generating actually harmful content; the goal is to demonstrate that the model can be made to try, not to produce the worst-case output.
2. Coordinated disclosure
Findings should go to the model provider before becoming public. A vulnerability disclosed in a public tweet without prior coordination is irresponsible. Major labs have responsible disclosure processes; use them.
3. Boundary discipline
Some research is too dangerous to do at all. Investigating bioweapon-synthesis capabilities even in private requires institutional support and access controls. Solo researchers should stay away from those domains.
How red-teaming is evolving
Three trends through 2026:
Multi-turn and multi-agent attacks
Single-prompt jailbreaks are mostly patched. The frontier of red-teaming is multi-turn (build up context over many turns) and multi-agent (one model manipulates another). The defenses are still catching up.
Capability evaluations
Beyond "can the model be jailbroken," labs increasingly want "can the model do X dangerous thing if asked to." The METR autonomy evaluations, for instance, characterize the model's ability to autonomously achieve open-ended goals.
Regulatory red-teaming
The EU AI Act, US executive orders, and Chinese AI regulations all reference red-team requirements. The compliance side is growing fast and creating non-research career paths.
Where the job is heading
Red-teaming as a discipline is maturing. The bar for a successful finding is rising, five years ago a clever phrasing could break a model; today the model probably has a defense against your first three attempts. Researchers who can find the genuinely novel attack angles will continue to be valuable. Operators who can run repeatable, documented red-team programs will be in higher demand.
If you're considering a career here, the answer is: it's a real job, it pays well, the work is intellectually demanding, and the field is growing. The downside is that the work is psychologically taxing, you spend your days trying to make models do bad things, and burnout is real. Plan for it.
Further reading
Keep reading
- AI Safety in Production: A Builder's Checklist
Prompt injection, data leakage, hallucination, and the operational practices that keep AI products from blowing up in your face.
- How We'd Hire an AI Engineer in 2026
What 'AI engineer' actually means, what to test for in interviews, what to pay, and the red flags that distinguish real engineers from prompt-tinkerers.
- Are Reasoning Models Worth the Cost?
o3, o4, DeepSeek-R1, GPT-5 thinking. They're slower and 5-20x more expensive per query. When does the quality bump pay back?
Intelligence, distilled weekly.
One short email every Friday, new model launches, leaderboard moves, and pricing drops. Curated by hand. Free, no spam.