AI development keeps accelerating while the safeguards around it advance unevenly, according to The International AI Safety Report. Security leaders are being asked to judge exposure without dependable benchmarks.
Developers build layered defenses
Across the AI ecosystem, developers are adopting layered controls throughout the lifecycle. They combine training safeguards, deployment filters, and post release tracking tools. A model may be trained to refuse harmful prompts. After release, its inputs and outputs may pass through filters. Provenance tags and watermarking can support incident reviews.
A ‘Swiss cheese’ diagram illustrates the defense in depth approach: multiple layers of defenses can compensate for flaws in individual layers.
This shift reflects the finding that single point controls cannot withstand determined attackers: tests indicate that an adversary given repeated attempts can defeat safeguards about half the time. Overlapping layers help, but each has its own limitations.
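To make the layering concrete, here is a minimal sketch of such a pipeline. Every function below (is_blocked_input, call_model, is_blocked_output, add_provenance_tag) is an illustrative placeholder rather than a real product API, and real deployments use trained classifiers instead of keyword checks.

```python
# Conceptual defense-in-depth pipeline: each check is imperfect on its own,
# but a request only gets through if every layer misses it.
# All names here are illustrative placeholders, not a real API.

def is_blocked_input(prompt: str) -> bool:
    # Layer 1: input filter (in practice a safety classifier, not a keyword rule).
    return "ignore previous instructions" in prompt.lower()

def call_model(prompt: str) -> str:
    # Layer 2: the model itself, trained to refuse harmful prompts.
    return f"[model response to: {prompt!r}]"

def is_blocked_output(response: str) -> bool:
    # Layer 3: output filter screening the generated text.
    return "weapon assembly" in response.lower()

def add_provenance_tag(response: str) -> str:
    # Layer 4: provenance/watermark metadata to support later incident review.
    return response + "\n[provenance: model-x]"

def handle(prompt: str) -> str:
    if is_blocked_input(prompt):
        return "Request refused by input filter."
    response = call_model(prompt)
    if is_blocked_output(response):
        return "Response withheld by output filter."
    return add_provenance_tag(response)

print(handle("Summarize our incident response policy."))
```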
Developers continue to adjust training practices to shape safer behavior before models reach users. One method removes harmful material from large datasets, which can reduce complex risks such as advice linked to weapons, but it is less effective against pervasive problems such as offensive text because the datasets are too large to clean completely.
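A toy sketch of what dataset filtering looks like, assuming a simple blocklist check standing in for the trained classifiers used in practice; note that the generically offensive document slips through, which is the limitation described above.

```python
# Toy illustration of training-data filtering with a blocklist.
# Real pipelines use trained classifiers over billions of documents,
# which is why some harmful or offensive material always slips through.

BLOCKLIST = {"synthesis route", "detonator"}  # illustrative terms only

def keep_document(doc: str) -> bool:
    text = doc.lower()
    return not any(term in text for term in BLOCKLIST)

corpus = [
    "A history of industrial chemistry.",
    "Step-by-step detonator assembly guide.",
    "An offensive rant about a coworker.",   # not caught: no blocked term
]

cleaned = [doc for doc in corpus if keep_document(doc)]
print(len(cleaned), "of", len(corpus), "documents kept")
```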
Reinforcement learning from human feedback is another approach. Models learn from human judgments, but those judgments vary and contain errors. As long as that inconsistency persists, training adjustments cannot serve as strong assurance for downstream security teams.
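For readers who want to see where the human judgments enter, here is a rough sketch of the preference-learning step behind RLHF, using a pairwise logistic loss in plain numpy; the scores and the disagreement between raters are invented for illustration.

```python
import numpy as np

# Pairwise preference loss used to train a reward model in RLHF-style setups.
# r_chosen and r_rejected are the reward model's scores for the response a
# human preferred vs. the one they rejected. Contradictory labels (the same
# pair rated both ways by different raters) pull the loss in opposite
# directions, which is one reason training alone is weak assurance.

def preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    # -log sigmoid(r_chosen - r_rejected), averaged over comparisons
    return float(np.mean(np.log1p(np.exp(-(r_chosen - r_rejected)))))

# Illustrative scores: two raters disagreed on the second comparison,
# so it appears once in each direction.
r_chosen   = np.array([1.2, 0.3, 0.1])
r_rejected = np.array([0.4, 0.1, 0.3])
print(f"loss = {preference_loss(r_chosen, r_rejected):.3f}")
```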
Attackers expand their playbook faster than defenders
Adversarial activity continues to rise. Researchers have recorded a broad set of prompt injection techniques that bypass safeguards. When attackers are given ten attempts, the success rate reaches about 50%.
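A back-of-the-envelope calculation shows why repeated attempts matter so much, assuming independent tries and an illustrative per-attempt success rate (the 7% figure below is not from the report):

```python
# If each attempt succeeds with probability p (assumed independent),
# the chance that at least one of n attempts succeeds is 1 - (1 - p)**n.
p = 0.07   # illustrative per-attempt success rate, not from the report
for n in (1, 5, 10, 20):
    print(f"{n:>2} attempts: {1 - (1 - p) ** n:.2f}")
# With p around 0.07, ten attempts already give roughly a 50% success rate.
```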
There is also a cost imbalance. Adding a few hundred malicious documents to training data can create backdoors. Defending against such poisoning requires far more work.
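The imbalance is easiest to see as simple arithmetic; the corpus size below is an assumed figure for illustration, not one from the report.

```python
# A few hundred poisoned documents are a vanishingly small slice of a
# pretraining corpus, which is why injecting a backdoor is cheap while
# auditing the whole dataset for it is not.
poisoned_docs = 300
corpus_docs = 1_000_000_000          # assumed corpus size for illustration
print(f"poisoned fraction: {poisoned_docs / corpus_docs:.7%}")
# -> 0.0000300% of the training data, far too little to find by sampling.
```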
Fine tuning introduces further complications. A model trained to give insecure coding advice later produced unsafe instructions in unrelated areas. Shifts like this make it difficult for security teams to anticipate behavior outside narrow test scenarios.
Open weight models narrow the capability gap
Open weight systems continue to improve. Their performance now trails leading proprietary models by less than a year, shrinking the buffer that the capability gap once provided.
These models support research and transparency, but they can be adapted in ways that bypass built in controls. Several image models have already been fine tuned to generate illegal content. Removal of unsafe knowledge is an active research area, but current methods can often be undone with limited additional training.
Security teams should assume that open weight models can drift or be repurposed in unpredictable ways, even when original safeguards are present.
Monitoring tools improve but break under pressure
During deployment, developers use filters, reasoning monitors, and hardware checks. These tools flag suspicious prompts, observe internal activity, and block harmful outputs. Some teams also require human approval for certain autonomous actions.
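The human-approval pattern can be sketched in a few lines; the action categories and helper names here are invented for illustration, not drawn from any specific product.

```python
# Gate certain categories of autonomous actions behind explicit human sign-off.
# The categories and helpers are illustrative, not from any specific product.

HIGH_RISK_ACTIONS = {"send_email", "execute_code", "transfer_funds"}

def human_approves(action: str, details: str) -> bool:
    # Placeholder for a real review step (ticket, console prompt, approval UI).
    print(f"Approval requested: {action} ({details})")
    return False  # default-deny until a reviewer signs off

def run_action(action: str, details: str) -> str:
    if action in HIGH_RISK_ACTIONS and not human_approves(action, details):
        return f"{action}: blocked pending human approval"
    return f"{action}: executed"

print(run_action("summarize_document", "quarterly report"))
print(run_action("transfer_funds", "invoice #4411"))
```

Defaulting to denial until a reviewer signs off keeps the failure mode conservative when the approval channel is unavailable.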
These defenses can fail under targeted pressure. A model that detects monitoring may hide risky internal reasoning while still producing unsafe outputs. Other tests show that layered protections collapse when attackers craft prompts that target each filter in sequence.
These tools provide value as early detectors, but they should not be treated as fail safe mechanisms.
Provenance tools gain ground but remain fragile
Post release controls are receiving more attention. Watermarking for text, images, audio, and video is becoming more common. Developers are also testing identifiers placed inside model weights. These features can support investigations by linking outputs to specific systems.
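As a rough illustration of how statistical text watermark detection can work, here is a toy version of a keyed ‘green list’ check, loosely modeled on published schemes; the key, the tokenization, and the statistics are all simplifications rather than a production detector.

```python
import hashlib
import math

# Toy 'green list' watermark check: the generating model is assumed to have
# nudged sampling toward tokens whose keyed hash lands in the green set.
# A detector recomputes membership and tests whether green tokens are
# over-represented relative to the 50% expected by chance.

KEY = b"demo-watermark-key"   # illustrative shared key

def is_green(token: str) -> bool:
    digest = hashlib.sha256(KEY + token.encode()).digest()
    return digest[0] % 2 == 0   # roughly half of all tokens count as 'green'

def watermark_z_score(text: str) -> float:
    tokens = text.split()
    n = len(tokens)
    green = sum(is_green(t) for t in tokens)
    # z-score against the null hypothesis of unwatermarked text (p = 0.5)
    return (green - 0.5 * n) / math.sqrt(0.25 * n)

sample = "the quick brown fox jumps over the lazy dog near the quiet river"
print(f"z = {watermark_z_score(sample):.2f}   (a high z suggests a watermark)")
```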
Attackers can still remove or distort watermark signals through simple editing or compression. Provenance tools help with monitoring and attribution, but they do not guarantee source integrity.
Governments and companies shape early safety frameworks
New frameworks from the European Union, China, the G7, ASEAN, and South Korea emphasize transparency, model evaluation, and risk disclosure. These efforts are still early and will need time to mature.
The private sector is moving in a similar direction. Several companies have published Frontier AI Safety Frameworks that outline testing plans, capability thresholds, and access controls for advanced models. The scope of these frameworks varies because no shared standards exist.
Security leaders reviewing vendor statements should recognize that these frameworks differ in structure and rigor.
