[tl;dr sec] #285 – AI Red Teaming, Detection Engineering Field Manual, Building AppSec Partnerships
Can LLMs red team AI, intro to detection engineering, how to scale security impact via cross-team partnerships
I hope you’ve been doing well!
At BSidesSF 2025, a group of great Application Security leaders (and friends of the newsletter) had a panel discussing how to scale security impact by building essential partnerships across teams such as platform engineering, compliance, threat detection, and audit.
They kindly shared this blog answering questions the panel didn’t have time to discuss live, including:
-
What’s the biggest gap or risk in AppSec that remains today?
-
If AI gets to human-level penetration testing performance, where do you think it will be most valuable?
-
How do I infuse more ‘rizz into an AppSec program?
-
What do AppSec vendors most often get wrong?
How AI Agents Become a Security Liability, and What to Do About It
AI agents are gaining autonomy and access. As they interact with internal tools, APIs, and systems, over-permissioned agents can leak data, perform unintended actions, or expose your platform to risk. Without scoped access, auditability, and strong defaults, they’re a liability waiting to happen.
WorkOS gives engineering teams the foundation to secure agent workflows by default. With tools like WorkOS Connect, Fine-Grained Authorization, Audit Logs, WorkOS Vault, and WorkOS Radar, you can lock down agent workflows, enforce least privilege, and build secure, scalable AI systems.
Agent authorization is a wild west right now, and we’re already seeing high impact bugs (e.g. with GitHub and GitLab recently). Happy to see work here
AppSec
zomasec/client-side-bugs-resources
By Hazem El-Sayed: Resources for people who want to get deep into client-side bugs, including JavaScript analysis for bug bounty hunters, bug bounty write-ups, challenges, CSP resources, prototype pollution, and more.
Make Self-XSS Great Again
Vsevolod Kokorin describes how to transform stored self-XSS into regular stored XSS using credentialless iframes and modern browser capabilities. The post walks through examples of CSRF on login forms (with and without CAPTCHA), with clickjacking, and leveraging the new fetchLater
API to bypass X-Frame-Options: Deny
.
This is some serious web chicanery
Novel SSRF Technique Involving HTTP Redirect Loops
Assetnote’s Shubham Shah describes a novel technique for exploiting blind SSRF vulnerabilities by using a chain of incrementing 3xx redirects to trigger an error state that leaks the full HTTP response. He describes creating a Flask server that performs a redirect loop with incrementing status codes (301-310) that triggered the error state. The technique successfully extracted AWS metadata credentials from the targeted enterprise software. My response to this behavior: “u wot m8?” Software is weird.
Amplify Security is now GA, fresh off being named a top performer in Latio Tech’s 2025 AI Auto-Fixing Guide.
Unlike noisy scanners or vague LLM tools, Amplify delivers:
Production-ready fixes straight to your PRs
Seamless GitHub/GitLab integration
High accuracy, minimal developer disruption
Amplify Security is offering a 60-day free trial to tl;dr sec readers. See what real AI-powered AppSec looks like.
Reducing the time and effort to fix security issues is key, love to see work in this space
Cloud Security
Cybr-Inc/reinforce-2025-summaries
Summaries, transcripts, key points, and other insights from 163 AWS re:inforce 2025 talks, across AI/ML security and GenAI, IaC and DevSecOps, IAM, multi-account and enterprise security, threat detection and response, and more.
Looks like the transcript of each talk was passed to an LLM summary prompt. See also the bottom for a NotebookLM with all the videos and summaries.
Getting Started with CloudTrail Security Queries
Rich Mogull continues his excellent Cloud Security Lab a Week posts, this time discussing how to effectively query CloudTrail logs in Athena for security analysis and incident response. Rich walks through queries to find recently terminated EC2 instances, all actions on a specific instance, and activity across an organization by a particular IAM Identity Center user. Tip: use LIKE
with wildcards (‘%’) to search both requestParameters
and responseElements
for a given instance ID.
Enumerate services via AWS Backup
Hacking The Cloud article by @biggie_linz describing how attacker with backup:List*
or backup:Describe*
permissions can enumerate the AWS Backup service to potentially find critical resources in an AWS account (e.g. DynamoDB, EBS, EC2, …) without needing to use traditional, well-monitored and heavily scrutinized reconnaissance commands for individual services.
How: using AWS CLI commands like aws backup list-protected-resources
and list-backup-plans
to reveal important information about resources the target account really cares about, naming strategies, the breadth of services the org is using, etc.
-
9% of the publicly accessible storage analyzed contains sensitive data.
-
54% of orgs using AWS ECS have at least one task definition with an embedded secret.
-
TIL what a “toxic cloud trilogy” is Personally, I prefer “Return of the AMI.”
-
54% have exposed VMs and serverless instances containing sensitive information like PII or payment data.
-
72% have publicly exposed PaaS databases lacking access controls.
-
12% still have containers that are both publicly exposed and exploitable via known vulnerabilities.
Blue Team
Why is no one talking about maintenance in detection engineering?
Agapios Tsolakis discusses the often overlooked importance of maintenance in detection engineering. He breaks down maintenance into four categories: corrective (e.g. fixing a detection that only partially covers the intended scenarios), adaptive (e.g. Microsoft changes the API that many of your tools are using), perfective (e.g. improving a detection query or documentation, adding new features), and preventive (e.g. future-focused changes to prevent errors or improve quality).
The Detection Opportunity Cost
Alex Teixeira introduces the concept of “detection opportunity cost” – the value lost by working on one detection idea instead of potentially better ones. He proposes a “Detection Engineering Bullseye” that prioritizes real incidents > external pentests > internal pentests > breach and attack simulation (BAS) products, arguing that real incidents expose actual weaknesses in defenses, processes, or visibility, not hypothetical ones.
Detection Engineering Field Manual #1 – What is a Detection Engineer?
Friend of the newsletter Zack Allen gives a great overview of what is a detection engineer (and why) and how they fit into a cybersecurity function. Within the NIST Cybersecurity Framework (CSF), detection engineers focus on detecting threats when controls fail. He emphasizes that detection requires telemetry from identified assets, and describes three “loops” of blue team functions: implementing controls, detecting failures, and learning from incidents.
After reviewing 1000s of resumes applying to Detection roles in his org, Zack advises aspiring Detection Engineers to be comfortable with coding, have deep security expertise in at least one area (e.g. OS, cloud infra, networking, red teaming/pentesting, incident response, SOC analysts, threat intel), and maintain a customer-focused mindset for solving security problems.
Red Team
SaadAhla/VSCode-Backdoor
Saad Ahla describes how you can put commands in a VS Code Project’s .vscode/tasks.json
that will execute arbitrary code when the folder is opened.
Spying On Screen Activity Using Chromium Browsers
When you share your screen (e.g. using Google Hangout or Zoom), you’re prompted to confirm the screen, window, or tab to share. mr.d0x describes how the Chromium --auto-select-desktop-capture-source
flag lets you automatically choose a specific screen or window to capture without requiring any user interaction. In a post exploitation scenario, you can run a headless browser or position the browser off screen, and navigate it to a page you control that’s running JavaScript to continuously upload screenshots of the user’s screen.
MalDev Myths
Dobin Rutishauser discusses common misconceptions and obsolete techniques in malware development, including: payload storage location doesn’t matter much, entropy analysis isn’t reliable for automated detection, encryption method choice is largely irrelevant, and import address table (IAT) heuristics are unreliable. Dobin recommends a shellcode loader architecture that includes execution guardrails (ensure you’re running on the target), anti-emulation techniques, and EDR deconditioning before payload execution.
AI + Security
Agentic AI Red Teaming Guide
Cloud Security Alliance (CSA) whitepaper by Ken Huang et al presenting a framework for red teaming Agentic AI, describing how to test for vulnerabilities like permission escalation, hallucination, orchestration flaws, memory manipulation, and supply chain risks. It also includes test requirements, actionable steps, and example prompts.
Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro at 39, GPT-4.5-Preview at 34, DeepSeek R1 at 29. Their evals show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model inversion challenges (below 26%, even for the best performers).
The road to Top 1: How XBOW did it
XBOW’s Nico Waisman describes how their autonomous AI penetration testing tool reached the top spot on HackerOne’s US leaderboard by discovering over 1,000 real-world vulnerabilities across various bug bounty programs. The post describes XBOW’s infrastructure for target prioritization using domain scoring (what’s interesting to attack) and deduplication (“used SimHash to detect content-level similarity and leveraged a headless browser to capture website screenshots and then applied imagehash techniques to assess visual similarity analysis”), as well as automated vulnerability validation (e.g. ensure an XSS payload actually fires) to ensure accuracy.
-
Results: To date, bug bounty programs have resolved 130 vulnerabilities, 303 classified as Triaged, 33 reports are currently marked as new, 125 remain pending review. 208 duplicates, 209 marked as informative, 36 not applicable.
-
Vulnerability classes found: RCE, SQLi, XXE, Path Traversal, SSRF, XSS, Information Disclosures, Cache Poisoning, secret exposure, and more.
-
Over the past 90 days, submissions were classified as 54 critical, 242 high, 524 medium, and 65 low severity issues by program owners.
This is really interesting work, as finding real bugs on bug bounty platforms is not just a benchmark, but a “this works in the real world on real apps” result.
But there are a lot more details I’d like to know, like:
-
Cost – Yes, this found things, but at what cost? Like did each bug cost $1M?
-
Reproducibility / Consistency – If you run XBOW on the same target, what percent of the time does it find the same bugs? (LLMs are non-deterministic)
-
Is XBOW better or worse at specific bug classes? Why?
-
Breakdown of reports validated by program owner x bug class x severity. For example, what was the bug class breakdown for the Critical and High accepted reports?
Misc
Wrapping Up
Have questions, comments, or feedback? Just reply directly, I’d love to hear from you.
If you find this newsletter useful and know other people who would too, I’d really appreciate if you’d forward it to them
Source link