Scoring AI hackers when there is no answer key

Jun 25, 2026 Twila Rosenbaum

AI models are increasingly dominating the offensive-cyber tests built to measure them. Once a model solves most of the challenges in a benchmark, the benchmark can no longer discriminate between top performers. Many existing tests also rely on bugs with public write-ups, so a strong score may partly reflect memorized training data rather than skill. To address this gap, the AI security lab Irregular has introduced FrontierCyber, a benchmark that places models on real systems and tracks how far they progress toward a defined security goal.

The targets in FrontierCyber are everyday digital infrastructure: smartphones, hosted software services, databases, and live networks. Each retains its real defenses—sandboxing, authentication, network segmentation—without any artificial vulnerabilities planted by the lab. No hints are provided about where to look. The model receives a goal and a starting point; everything else is up to its own reasoning. Irregular spent six months building the benchmark and released the v1.0 design this week.

Predicting Difficulty Before the Run

One major challenge with open-ended benchmarks is that difficulty cannot be known in advance. A planted bug carries an inherent difficulty rating, so you know what solving it proves. Unsolved real-world targets carry no such label. FrontierCyber tackles this with a two-pass scoring system. The first pass occurs before the model interacts with the system: each challenge receives a difficulty score and is placed in one of four bands (Easy, Medium, Hard, or Elite). The score is derived from factors a security engineer would weigh instinctively: programming language, code visibility, history of prior bugs, number of steps needed for a successful attack, and the strength of the defenses in the way. For devices, scoring starts from the nearest software stand-in (for example, a browser for the web surface or an app for app-level code) and is then adjusted for the specific surface, goal, and device setting.
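Irregular has not published its formula, but the idea of turning those factors into a score and a band can be sketched in a few lines. Everything in this example is an assumption made for illustration: the field names, the weights, the band thresholds, and the guess that memory-safe languages and closed source make a target harder.

```python
from dataclasses import dataclass

# All weights and thresholds below are hypothetical, chosen only for illustration.
BAND_THRESHOLDS = [(3.0, "Easy"), (6.0, "Medium"), (9.0, "Hard")]


@dataclass(frozen=True)
class ChallengeFactors:
    memory_safe_language: bool  # assumed to make exploitation harder
    source_available: bool      # visible code is assumed to make the target easier
    prior_bug_count: int        # publicly known bugs in the target's history
    attack_steps: int           # estimated steps in a successful attack
    defense_layers: int         # sandboxing, authentication, segmentation, ...


def difficulty_score(f: ChallengeFactors) -> float:
    """Combine the factors into one number; higher means harder."""
    score = 0.0
    score += 1.0 if f.memory_safe_language else 0.0
    score += 0.0 if f.source_available else 1.5
    score += 1.0 / (1 + f.prior_bug_count)  # a long bug history lowers the score
    score += 0.75 * f.attack_steps
    score += 1.0 * f.defense_layers
    return score


def difficulty_band(score: float) -> str:
    """Map a score to one of the four bands named in the article."""
    for upper, name in BAND_THRESHOLDS:
        if score < upper:
            return name
    return "Elite"


open_library = ChallengeFactors(False, True, 9, 1, 0)
hardened_device = ChallengeFactors(True, False, 0, 5, 4)
print(difficulty_band(difficulty_score(open_library)))
print(difficulty_band(difficulty_score(hardened_device)))
```

The point of assigning the band before the run is that the label cannot be influenced by how the model happened to perform.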

Crediting Partial Progress

Many runs end without reaching the final objective. A model may find a real weakness and gain partial access but fail to complete the goal. To capture this, Irregular has wired each target system with subtle tripwires: canary strings, planted files, oddly named apps, and database markers. One example shows how this works. A browser on a phone has no business knowing which apps are installed, so Irregular places an app with a unique name on the device. If that name appears in the model’s reasoning, it is strong evidence that the model slipped past the browser’s sandbox, which points to a previously unknown bug, even if the model never finishes the mission. Partial wins earn their own credit: finding a usable entry point, reaching a helpful midpoint, spotting a genuine bug, or building a piece of an exploit.
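The tripwire check itself can be very simple. The sketch below is not Irregular's tooling. It assumes a run is available as a plain-text transcript and that each canary is a unique string paired with a note on what its appearance would imply. The marker names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    marker: str    # unique string planted on the target (hypothetical values below)
    evidence: str  # what seeing the marker in a transcript would suggest


def tripped_canaries(transcript: str, canaries: list[Canary]) -> list[Canary]:
    """Return every canary whose marker shows up anywhere in the transcript."""
    return [c for c in canaries if c.marker in transcript]


canaries = [
    Canary("zq-lantern-7741", "installed-app list visible from the browser"),
    Canary("CANARY-db-91f3", "read access to a protected database table"),
]

transcript = "Enumerated packages: com.example.mail, zq-lantern-7741, ..."
for hit in tripped_canaries(transcript, canaries):
    print(f"tripped: {hit.marker} -> {hit.evidence}")
```

A marker only works as evidence if it is unique enough that the model could not have guessed it or seen it anywhere else, which is why the planted names are deliberately odd.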

Reading Capability Across the Suite

When a run concludes, graders examine both the model’s actions and the evidence the system gave up. A complete win is straightforward to confirm: the model recovers a hidden flag or forces the system into a target state. Automated checks handle mechanical verification, while human experts judge nuanced partial progress. A scoring agent reads transcripts against standards anchored to expert-graded examples. No single challenge is definitive, because difficulty estimates can be off and a single run may hinge on a lucky path, so capability is read across the entire set of challenges.
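One way to read capability across a suite rather than from a single run is to average credit within each difficulty band. The sketch below assumes each run has already been reduced to a band and a credit between 0 and 1, where 1 is a complete win. That representation is an assumption for illustration, not Irregular's published scheme.

```python
from collections import defaultdict
from statistics import mean


def capability_by_band(results: list[tuple[str, float]]) -> dict[str, float]:
    """Average the credit earned in each band across all graded runs."""
    by_band: dict[str, list[float]] = defaultdict(list)
    for band, credit in results:
        by_band[band].append(credit)
    return {band: mean(credits) for band, credits in by_band.items()}


# Hypothetical graded runs: (difficulty band, credit earned).
results = [("Easy", 1.0), ("Easy", 0.5), ("Hard", 0.0), ("Hard", 0.5)]
print(capability_by_band(results))
```

Averaging over many challenges dampens the effect of one lucky path or one misjudged difficulty estimate.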

Keeping Comparisons Valid Over Time

Real systems are dynamic. Updates arrive, settings drift, defenses harden, and a bug that was secret one week may become public the next, transforming a discovery challenge into an exercise in known exploitation. To maintain fairness, every evaluation is pinned to a snapshot: exact challenges, system versions, goals, setups, checks, scoring rules, and a timestamp. Scores are only comparable within the same snapshot at the same moment. A model tested in June and one tested in September may differ simply because the targets became easier in between, through public disclosures or system changes.
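Pinning can be illustrated with a small record type and a fingerprint. This is a sketch under stated assumptions, not the benchmark's actual format: the field names are a subset of the items the article lists, the values are placeholders, and hashing the record with SHA-256 is one common way to tell whether two evaluations used the same snapshot.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Snapshot:
    challenges: tuple[str, ...]
    system_versions: tuple[tuple[str, str], ...]  # (system name, version) pairs
    scoring_rules_version: str
    timestamp: str  # when the snapshot was taken

    def fingerprint(self) -> str:
        """Hash every pinned field so any change yields a different ID."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(a: Snapshot, b: Snapshot) -> bool:
    """Scores line up only when both runs were pinned to the same snapshot."""
    return a.fingerprint() == b.fingerprint()


# Placeholder values, invented for the example.
june = Snapshot(("phone-browser-escape",), (("example-db", "1.0"),),
                "rules-v1", "2026-06-25T00:00:00Z")
september = Snapshot(("phone-browser-escape",), (("example-db", "1.0"),),
                     "rules-v1", "2026-09-25T00:00:00Z")
print(comparable(june, june), comparable(june, september))
```

Including the timestamp in the fingerprint reflects the article's point that the same challenge list at a later date is not the same test.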

Early Results

The first runs against a fixed snapshot have already yielded significant signal. Models solved some challenges outright, made real progress on others, and discovered brand-new bugs in several live systems now undergoing responsible disclosure. In one phone challenge, a model stitched together a chain of separate vulnerabilities and accessed private information it should not have been able to reach. Across different model families, each newer generation showed a measurable jump in capability: some built complete exploit chains and hit the goal, while others at least identified a usable bug. The software lineup includes Pillow, lxml, FFmpeg, ImageMagick, PostgreSQL, MongoDB, and Redis, along with pinned vulnerable versions that test the skill of turning a known bug into a working exploit. A detailed report on challenges, scoring, results, and disclosures is forthcoming.

This approach marks a shift from traditional cybersecurity benchmarks that rely on static, already-solved problems. By using live systems with real defenses, FrontierCyber offers a more realistic measure of an AI agent’s ability to operate in the wild. It also pushes models to demonstrate reasoning and adaptability rather than retrieval from training data. As AI-driven offensive security grows, such open-ended evaluations will be critical for understanding where models excel and where they still fall short. On targets with no answer key, a success is a fresh discovery and a failure points to a real limitation. By scoring partial progress and predicting difficulty before the run, FrontierCyber gives a view of capability that goes beyond simple win/loss tallies.

Irregular’s methodology also highlights the importance of real-world defensive complexity. Many existing benchmarks strip away defenses to create a controlled environment, but real networks have authentication, sandboxing, and monitoring that an attacker must circumvent. FrontierCyber preserves these barriers, forcing models to chain multiple steps and adapt to unexpected obstacles. The presence of canary tripwires ensures that even incomplete attempts provide valuable data about a model’s reasoning and exploitation skills. This granular feedback helps researchers pinpoint specific weaknesses in model design, such as an inability to pivot between different attack surfaces or a tendency to get stuck on a single defensive layer.

As more security labs adopt similar benchmarks, the entire field stands to benefit from a more rigorous evaluation culture. Vendors of AI security tools will need to demonstrate not just high scores on static tests but also real-world efficacy against live, unmodified targets. The discovery of previously unknown vulnerabilities by AI models also raises interesting questions about the role of automation in bug hunting. Responsible disclosure workflows will need to accommodate cases where the finder is an algorithm rather than a human researcher. Irregular’s early results suggest that AI can already contribute meaningfully to security research, but the benchmark also reveals where human expertise remains essential—particularly in interpreting partial results and validating exploit chains.

In summary, FrontierCyber provides a much-needed answer to the problem of benchmarking AI when there is no answer key. By focusing on real systems, predicting difficulty, and crediting partial progress, it offers a transparent and dynamic evaluation framework. The early evidence shows that AI models are making genuine strides in offensive security, but also that the gap between automated and human-led hacking remains significant in many scenarios. Ongoing updates to the benchmark snapshots will ensure that scores remain relevant as both models and targets evolve. The security community will be watching closely to see how quickly models can close that gap—and what new capabilities emerge as they do.

Source: Help Net Security News
