Where exactly do the capabilities of large language models (LLMs) run out? Cybersecurity has become a proving ground for their reasoning and complex logic. Security researcher Kasra Rahjerdi recently released a test report that has drawn wide attention in the industry. He built an e-book review Android app (APK) with deliberately embedded vulnerabilities and challenged major LLMs to attack it under realistic conditions, revealing how well each model handles security reasoning and vulnerability exploitation.

Each run had a 2-hour time limit and a $10 budget. The researcher intentionally left credentials for Firebase, Google's mobile backend service, exposed inside the application package (APK). The model had to act like a professional white-hat hacker: unpack the app, identify those credentials, and then bypass the hardened application programming interface (API) to gain unauthorized access to the underlying database. The full evaluation cost $1,500, and the results for the top models were sharply polarized.
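The credential-identification step can be illustrated with a minimal Python sketch. It is not the researcher's tooling: it assumes the APK has already been decoded into a directory by a separate tool, and it relies on two commonly seen patterns (Google API keys starting with `AIza`, and Firebase Realtime Database hostnames). The function names `find_firebase_indicators` and `scan_decoded_apk` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: Google API keys commonly appear as "AIza" plus 35 URL-safe characters.
API_KEY_RE = re.compile(r"AIza[0-9A-Za-z_\-]{35}")
# Assumption: Realtime Database hosts end in firebaseio.com or firebasedatabase.app.
DB_URL_RE = re.compile(
    r"https://[a-z0-9.\-]+\.(?:firebaseio\.com|firebasedatabase\.app)"
)


def find_firebase_indicators(text):
    """Return sorted, de-duplicated key-like and database-URL-like strings."""
    return {
        "api_keys": sorted(set(API_KEY_RE.findall(text))),
        "db_urls": sorted(set(DB_URL_RE.findall(text))),
    }


def scan_decoded_apk(root):
    """Scan every file under an already-decoded APK directory (hypothetical layout)."""
    hits = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Binary files are read leniently; undecodable bytes are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = find_firebase_indicators(text)
        if found["api_keys"] or found["db_urls"]:
            hits[str(path)] = found
    return hits
```

A scan like this only flags candidate strings. Whether a flagged key actually grants database access depends on the project's backend rules, which is the part of the challenge that required reasoning from the models.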


In terms of success rate, the unreleased GPT-5.5 showed the strongest security reasoning. In 10 independent runs it completed the exploit 7 times, a 70% success rate and first place overall. The evaluation noted that after unpacking the APK, GPT-5.5 immediately identified Firebase as the key point of entry and was not distracted by the complex application interface or conventional APIs. That performance was expensive, however: the average cost was $9.46 per successful exploit, close to the budget limit.

By contrast, the Chinese model DeepSeek V4Pro impressed the open-source community with its cost-effectiveness. It succeeded only 3 times in 10 runs, but its average token cost per successful exploit was just $0.62, about one fifteenth of GPT-5.5's. In 5 of its failed runs, DeepSeek V4Pro still reached the core Firebase service but made occasional mistakes when using the credentials to configure the backend interface. The researcher emphasized that for engineering teams running large-scale, high-frequency automated security audits, DeepSeek's cost advantage has significant practical value.
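The "one fifteenth" comparison can be checked directly from the reported figures. The short Python sketch below uses only numbers quoted in this article; the dictionary layout and function names are illustrative choices, not part of the original report.

```python
# Figures as reported in the article; structure and names are illustrative.
RESULTS = {
    "GPT-5.5": {"successes": 7, "runs": 10, "cost_per_success": 9.46},
    "DeepSeek V4Pro": {"successes": 3, "runs": 10, "cost_per_success": 0.62},
}


def success_rate(entry):
    """Fraction of runs that ended in a successful exploit."""
    return entry["successes"] / entry["runs"]


def cost_ratio(expensive, cheap):
    """How many times more one model costs per success than another."""
    return expensive["cost_per_success"] / cheap["cost_per_success"]


if __name__ == "__main__":
    for name, entry in RESULTS.items():
        print(f"{name}: {success_rate(entry):.0%} success rate")
    ratio = cost_ratio(RESULTS["GPT-5.5"], RESULTS["DeepSeek V4Pro"])
    print(f"GPT-5.5 costs about {ratio:.1f}x more per success")
```

The ratio works out to a little over 15, which matches the article's rounding.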

Other models fell short because they were too conservative. In the middle tier, Claude Sonnet4.6 and Claude Opus4.8 each achieved 2 successes. Opus is a capable model and came close to the final answer several times, but its overly strict safety guardrails frequently interrupted the session. Google's Gemini3.1Pro Preview went to the other extreme: it triggered its safety mechanisms at the very start and refused to continue in every run. Its median token consumption was only about 9,000, far below the tens of thousands used by other models, and it finished with zero successes.

This contest is not only a test of the reasoning limits of large models; it also points to the future direction of automated security auditing. As large models are adapted to specialized domains, security defense and vulnerability discovery may evolve into a confrontation between "digital AI armies" that compete on computing power and model strategy.