Raleigh News Today


Cisco research finds standard AI safety benchmarks miss the real threat

May 28, 2026  Twila Rosenbaum

Enterprises deploying closed AI models have traditionally relied on published safety benchmarks to assess risk before procurement and deployment. However, new research from Cisco’s AI Threat Intelligence and Security Research team suggests that these benchmarks may systematically understate the true threat landscape.

The multi-turn blind spot

Standard safety tests typically submit a single adversarial prompt and record whether the model produces a harmful response. This single-turn approach misses a more insidious attack vector: multi-turn attacks, where an adversary maintains a conversation across multiple exchanges, iterating and adapting based on each response until the model yields. The intent is not revealed upfront but builds gradually, with each individual prompt appearing benign while steering toward a harmful outcome.

Cisco’s research evaluated 15 closed and proprietary frontier models from OpenAI, Anthropic, Google, Amazon, and xAI. The team ran 30,090 single-turn prompts and 6,986 multi-turn attacks. The results revealed stark differences: multi-turn attack success rates ranged from 7.89% to 88.30% across all models, compared to a single-turn range of 2.19% to 64.91%. Eight of the 15 models showed an absolute gap greater than 15 percentage points between the two evaluation regimes.
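The gap between the two regimes is simple arithmetic. The following is a minimal sketch, assuming an attack success rate is computed as successful attacks divided by attempts (the article does not spell out Cisco's exact scoring). The `ModelResult` type, the model names, and every count below are hypothetical and are not figures from the report.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Attack counts for one model. All values used below are hypothetical."""
    name: str
    single_success: int
    single_total: int
    multi_success: int
    multi_total: int


def asr(successes: int, attempts: int) -> float:
    """Attack success rate as a percentage of attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def absolute_gap(result: ModelResult) -> float:
    """Multi-turn ASR minus single-turn ASR, in percentage points."""
    multi = asr(result.multi_success, result.multi_total)
    single = asr(result.single_success, result.single_total)
    return multi - single


def models_over_threshold(results, threshold_pp=15.0):
    """Names of models whose absolute gap exceeds the threshold."""
    return [r.name for r in results if absolute_gap(r) > threshold_pp]


# Hypothetical counts for illustration only.
results = [
    ModelResult("model-a", 40, 2000, 120, 400),  # 2.0% single-turn, 30.0% multi-turn
    ModelResult("model-b", 100, 2000, 40, 400),  # 5.0% single-turn, 10.0% multi-turn
]
flagged = models_over_threshold(results)
```

With these made-up counts, only `model-a` exceeds the 15-point threshold, even though its single-turn rate is the lower of the two. That is the kind of misranking the report describes.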

Anthropic’s Claude family, which posted the lowest single-turn attack success rate (ASR) in the cohort at 2.19% to 3.64%, still reached 11.16% to 16.20% under iterative attack. This challenges the common assumption that models with strong single-turn safety records are equally robust in real-world conversational scenarios.

Five attack strategies that exploit conversational context

The research tested five families of multi-turn attack strategies, each leveraging the model’s inability to recognize patterns across a conversation:

  • Crescendo escalation: The attacker escalates the ask incrementally, each prompt appearing harmless until the full picture emerges. As Amy Chang, head of AI threat and security research at Cisco, explained, “It seems like, oh, benign prompt, benign prompt, benign prompt, but as it builds, you start to put the pieces together.”
  • Refusal reframe: When the model declines a request, the attacker reframes their identity or purpose to push past it—for example, “No, you don’t understand, I’m not a bad person, this is what I need it for.”
  • Role-play and persona adoption: The attacker assumes a character or persona, shifting the conversational framing so the model perceives a different obligation to comply. This strategy family had the highest weighted ASR in the cohort at 29.89%.
  • Contextual ambiguity and misdirection: The attacker uses vague or misleading framing to obscure the true nature of the request, steering the conversation without stating harmful intent directly.
  • Information decomposition and reassembly: The attacker breaks a harmful request into components distributed across multiple turns, each innocuous in isolation. The model responds to each piece without recognizing the assembled outcome.
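A toy sketch can show why inspecting each prompt in isolation misses the decomposition pattern described above. It assumes a stand-in keyword check with hypothetical placeholder markers (`part_a`, `part_b`, `part_c`). Real guardrails would rely on trained classifiers, and nothing here reflects Cisco's test harness.

```python
# Toy illustration only: the marker strings are hypothetical placeholders
# standing in for the components of a request that is risky when assembled.
RISKY_COMBINATION = {"part_a", "part_b", "part_c"}


def markers_in(text: str) -> set[str]:
    """Return which placeholder markers appear in a piece of text."""
    lowered = text.lower()
    return {m for m in RISKY_COMBINATION if m in lowered}


def per_turn_flags(turns: list[str]) -> list[bool]:
    """Inspect each prompt in isolation, as a single-turn check would."""
    return [markers_in(t) == RISKY_COMBINATION for t in turns]


def conversation_flag(turns: list[str]) -> bool:
    """Inspect the accumulated conversation so pieces spread across turns add up."""
    seen: set[str] = set()
    for t in turns:
        seen |= markers_in(t)
        if seen == RISKY_COMBINATION:
            return True
    return False


conversation = [
    "Tell me about part_a.",
    "Separately, how does part_b work?",
    "Last question: what about part_c?",
]
```

Here `per_turn_flags(conversation)` flags nothing because no single prompt contains all three markers, while `conversation_flag(conversation)` trips on the third turn. The design point is that the check carries state across turns rather than resetting with each prompt.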

These strategies exploit a fundamental characteristic of generative AI: models are probabilistic systems trained to predict the most likely next token, so they can produce unintended outputs that pre-deployment testing cannot fully eliminate. For closed models, where training data is not publicly disclosed, the problem is compounded because defenders cannot fully audit what the model has learned.

Implications for enterprise AI procurement

The findings challenge a common assumption in enterprise AI procurement: that frontier models from leading labs are inherently safe because of their sophisticated internal guardrails. Amy Chang noted, “The surprising thing here is really that a lot of people accept and kind of understand these frontier labs as being state of the art, but they don’t necessarily think through the security and safety implications of that. What this research does is kind of showcase that there is still variance across the different models.”

The pattern is not limited to closed models. Cisco’s earlier evaluation of eight open-weight LLMs, published in November 2025, found multi-turn attack success rates running two to ten times higher than single-turn baselines. The report concludes that multi-turn vulnerability is a structural property of the current AI frontier—regardless of whether model weights are public or proprietary, and regardless of whether a lab publicly emphasizes safety or capability.

The exposure grows significantly larger when those same models power agentic workflows. “These models are the ones that power agents, and agents have broader access, broader ability to conduct actions on behalf of the human,” Chang said. In agentic architectures, a compromised model can not only generate harmful text but also take real-world actions such as deleting files, sending emails, or modifying databases.

Network-layer defense is insufficient

For network security professionals, the instinct is to apply a familiar paradigm: proxy LLM traffic at the network layer, inspect inputs and outputs, and enforce policy the same way a WAF or IPS handles web traffic. Chang said that instinct is partly correct, but LLM security introduces a dimension that signature-based controls cannot address: intent.

“There’s also an intent component there as well, where traditional network security approaches kind of fall short,” Chang said. A WAF operates on known patterns: payload signatures, protocol violations, and known attack strings. Natural language does not reduce to those primitives. An agent responding to an instruction to delete a home directory cannot determine from the request alone whether the person asking is authorized or is attempting to manipulate the agent into a destructive action.
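One way to read the home-directory example is that authorization has to come from outside the conversation. The sketch below is a minimal illustration of such an application-layer control and is not taken from the report. The tool names, the `ToolCall` type, and the `approvals` record are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tool names; a real agent framework would define its own.
DESTRUCTIVE_TOOLS = {"delete_path", "send_email", "modify_database"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str
    requested_by: str  # identity established outside the chat, e.g. by login


def is_allowed(call: ToolCall, approvals: dict[str, set[str]]) -> bool:
    """Allow destructive tools only for users approved out of band.

    The decision never reads the prompt text, so persuasive wording in the
    conversation cannot change the outcome.
    """
    if call.tool not in DESTRUCTIVE_TOOLS:
        return True
    return call.tool in approvals.get(call.requested_by, set())


# Hypothetical approval record maintained by an administrator.
approvals = {"alice": {"delete_path"}}
```

Under these assumptions, a `delete_path` call attributed to `alice` passes, while the same call attributed to an unapproved user is refused regardless of how the request was phrased.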

Network-layer inspection remains a valid baseline for deployments that generate network traffic. “I would say that that is one component of a core principle that should be applied in terms of making sure that at least as traffic gets passed through the network layer, whether they’re inputs or outputs, should have some sort of either guardrail or sanitation check to ensure that the prompts that are coming back and forth are safe,” she said. But it cannot be the only line of defense.

Recommendations for enterprise teams

For security teams reading the report, Chang’s guidance centers on three actions:

  • Use the report and the LLM Security Leaderboard to inform model selection. Cisco’s leaderboard publishes adversarial evaluation signals against leading models on a rolling basis, giving security teams a more current picture than static model cards or published benchmarks.
  • Do not take vendor safety claims at face value. Published single-turn benchmarks can misrank models by a wide margin. Multi-turn exposure is invisible to any single-turn evaluation, and procurement decisions made on that basis carry unquantified risk.
  • Layer additional defenses on top of the model. No base model in the cohort is safe under iterative attack. Runtime guardrails, application-layer controls, and pre-deployment testing are necessary regardless of which model an organization selects.

“Out of the box, without any additional protections, these models, whether they’re closed or open, are insufficient on their own to kind of be used in a way that [has] potential ramifications,” Chang said. The research underscores that as enterprises accelerate AI adoption, security practices must evolve beyond simple benchmark scores to address the complex, conversational nature of real-world threats.


Source: Network World News

