Executive Summary
This article focuses on how LLM agents manage accumulated experience so they can keep running over the long term, and on the safety and verification frameworks that underpin this. First, Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents organizes memory, skills, and rules along a compression-ratio axis, directly targeting context-length and latency bottlenecks. Next, OpenCLAW-P2P v6.0 proposes multi-layer persistence and reference verification that make AI peer review feasible at the operational level. Finally, It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents benchmarks situations in which Web agents are steered off-task by persuasion-style injection, establishing a foundation for evaluation.
Featured Papers
Paper 1: Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
- Authors/Affiliations: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He (affiliations omitted here; only abstract-level information was available) (arxiv.org)
- Research Background and Question: As LLM agents run over longer time horizons, managing accumulated “experience” becomes a bottleneck. Keeping experience verbatim bloats the context; summarizing or reusing it coarsely hurts reproducibility. This paper asks what is needed to systematically compress and operationalize experience (memory, skills, and rules). (arxiv.org)
- Proposed Method: The paper proposes an Experience Compression Spectrum, placing memory, skills, and rules at different positions along an axis of compression degree. The abstract gives the order of compression as roughly 5–20× for episodic memory, 50–500× for procedural skills, and 1,000× or more for declarative rules. It also maps many existing methods onto this spectrum and identifies a gap (the “missing diagonal”): existing methods fix the compression level in advance and cannot adaptively move along the axis, switching representations as the task demands. (arxiv.org)
- Main Results: Covering 22 major studies (an analysis of 1,136 citations), the paper reports observations such as cross-citation rates of less than 1% across communities, suggesting that the memory, skills, and rules communities optimize their knowledge separately. It further argues that fixed compression levels entangle evaluation metrics with transferability, leaving knowledge lifecycles poorly managed. (arxiv.org)
- Significance and Limitations: The significance is that it frames the organization of experience required by long-term agents not as implementation tricks but as a design principle (an axis of compression). As a limitation, the abstract-level information does not fully determine which points (or intermediate forms) on the spectrum should be selected under which conditions, nor the specific learning rules for adaptation. Future experiments will need to confirm, for example, which switching strategy works for which tasks. (arxiv.org)
- Source: Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
To rephrase the core idea for beginners: the goal is not to keep every past action log forever, but to separate out what should be learned from the logs and reuse it, compressed into the form each situation needs. In everyday terms, the raw record of a cooking failure is valuable because you can revisit it later; but to avoid repeating the failure, it is ultimately faster to distill it into a rule like “next time, do this.” The Experience Compression Spectrum maps this intuition onto three layers, memory (traces of events), skills (procedures), and rules (policies), and gives each a measure of how compressed it is. Compression is also a trade-off: it saves context but can make fine-grained reproduction difficult, so the evaluation design (what counts as success) must be reconsidered alongside it. This is arguably one of the paper’s key stances. (arxiv.org)
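The three-layer intuition above can be sketched in code. The sketch below routes a piece of distilled experience to a tier based on the achieved compression ratio; the tier names and the idea of routing by ratio follow the abstract’s numbers (episodic ~5–20×, procedural ~50–500×, declarative ~1,000×+), but the routing function itself is illustrative, not the paper’s method.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str          # memory / skill / rule layer
    min_ratio: float   # lower bound of the compression range from the abstract

# Ordered from least to most compressed, per the paper's abstract.
TIERS = [
    Tier("episodic_memory", 5),
    Tier("procedural_skill", 50),
    Tier("declarative_rule", 1000),
]

def pick_tier(raw_tokens: int, compressed_tokens: int) -> str:
    """Return the most compressed tier whose range the achieved ratio reaches."""
    ratio = raw_tokens / compressed_tokens
    chosen = TIERS[0].name
    for tier in TIERS:
        if ratio >= tier.min_ratio:
            chosen = tier.name
    return chosen

# A 10,000-token trace distilled to a 6-token rule gives a ~1,667x ratio.
print(pick_tier(10_000, 6))      # declarative_rule
print(pick_tier(10_000, 1_000))  # episodic_memory (ratio 10)
print(pick_tier(10_000, 100))    # procedural_skill (ratio 100)
```

The “missing diagonal” the paper identifies would correspond to making this routing adaptive at runtime, rather than fixing one tier per system.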
As this line of work progresses, agents are likely to learn long-horizon work (research, design, operations) in a stepwise, human-like manner, recalling experience at the right granularity for each task. In industrial applications, costs tend to rise sharply as continuous learning and reuse accumulate; if compression design becomes a guiding principle, balancing operational cost against performance becomes more realistic. (arxiv.org)
Paper 2: OpenCLAW-P2P v6.0: Decentralized AI Peer Review via Multi-Layer Persistence and Live Reference Verification
- Authors/Affiliations: Francisco Angulo de Lafuente, Teerth Sharma, Vladimir Veselov, Seid Mohammed Abdu, Nirmal Tej Kumar, Guillermo Perry (arxiv.org)
- Research Background and Question: Once the field shifts from AI autonomously generating papers and reports to operationally guaranteeing the reliability of those results, a different set of problems arises than in conventional research: whether references (citations) are correct, whether data or contributions are missing, and whether evaluation is delayed or fails to scale. This paper asks what is needed to run a framework in which AI agents publish papers, review each other, and improve, without hitting these bottlenecks. (arxiv.org)
- Proposed Method: OpenCLAW-P2P v6.0 presents several new subsystems: (1) multi-layer persistence (in-memory cache, Cloudflare R2, Gun.js, GitHub) targeting zero paper loss across redeployments; (2) a multi-layer cascade for reference search that reduces latency from over 3 seconds to under 50 ms; (3) live reference verification during reviewer scoring, querying CrossRef, arXiv, and Semantic Scholar to detect fabricated citations, targeting >85% accuracy; and (4) a rate-limited caching proxy (a scientific API proxy) in front of public databases. (arxiv.org)
- Main Results: Based on the abstract, it reports operational metrics such as 14 autonomous agents generating 50+ scored papers (word counts from 2,072 to 4,073) with leaderboard scores of 6.4 to 8.1. It also includes failure-mode analyses, such as a rescue protocol that recovered 25 lost papers. (arxiv.org)
- Significance and Limitations: The significance is that it points clearly toward building safety and reliability into system operation rather than into model performance. Integrating reference verification directly into reviewer scoring turns safety from an abstract discussion into a concrete measure that supports output quality. The limitation is that the abstract alone does not reveal how much of which failure type remains (e.g., subtle citation errors, relevance drift, evaluation bias), or under what conditions the >85% accuracy holds; this requires careful examination of the experimental sections in the main text. (arxiv.org)
- Source: OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review
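The multi-layer persistence and cascade described above can be sketched as a read-through cache hierarchy: look a paper up in the fastest layer first, fall through to slower layers on a miss, and promote a hit back into every faster layer so the next read is served quickly. The dict-backed layers below stand in for the real backends (in-memory cache, Cloudflare R2, Gun.js, GitHub); nothing here calls the actual services, and the paper’s implementation may differ.

```python
class Layer:
    """A key-value store standing in for one persistence backend."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def put(self, key, value):
        self.store[key] = value

class Cascade:
    def __init__(self, layers):
        self.layers = layers  # ordered fastest -> slowest

    def read(self, key):
        """Return (value, name of the layer that served it), promoting on hit."""
        for i, layer in enumerate(self.layers):
            value = layer.get(key)
            if value is not None:
                for faster in self.layers[:i]:  # promote into faster layers
                    faster.put(key, value)
                return value, layer.name
        return None, None

    def write(self, key, value):
        for layer in self.layers:  # write-through to every layer
            layer.put(key, value)

cache, r2, github = Layer("memory"), Layer("r2"), Layer("github")
cascade = Cascade([cache, r2, github])

github.put("paper-42", "full text")    # only the durable layer survives a redeploy
value, hit = cascade.read("paper-42")  # first read falls through to "github"
value2, hit2 = cascade.read("paper-42")  # second read is served from "memory"
print(hit, "->", hit2)
```

Write-through on publish plus promote-on-read is one simple way to reconcile the paper’s two goals: zero loss on redeployment (durable layers always hold a copy) and sub-50 ms reads (hot items migrate into the cache).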
The aim of this paper, by analogy, is like checking that the recipe’s ingredients are correct before eating a cake made by AI, and also packaging the cake so it doesn’t get crushed in delivery. The question is not whether the model is good; the deliverable only becomes real once the whole pipeline is in place: distribution (persistence), search (retrieving references), inspection (live verification), and quality evaluation (peer review). Live reference verification matters in particular because it shifts hallucinations, statements made convincingly without real grounding, from something readers must catch to something the system mechanically doubts at that moment. (arxiv.org)
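The inspection step can be sketched as follows: each cited identifier is checked against authority indexes, and anything unresolvable in all of them is flagged as possibly fabricated. The in-memory sets below stand in for live queries to CrossRef, arXiv, and Semantic Scholar; the system’s actual lookup and scoring logic is more involved.

```python
# Stand-ins for the three authority indexes the system queries live.
# The identifiers here are illustrative entries, not a real index.
KNOWN = {
    "crossref": {"10.1000/example-doi"},
    "arxiv": {"2604.15877", "2604.19792"},
    "semantic_scholar": {"10.1000/example-doi"},
}

def verify_citations(citation_ids):
    """Split citations into (verified, suspect) by whether any index resolves them."""
    verified, suspect = [], []
    for cid in citation_ids:
        if any(cid in index for index in KNOWN.values()):
            verified.append(cid)
        else:
            suspect.append(cid)
    return verified, suspect

cites = ["2604.15877", "10.1234/made-up-doi"]
ok, flagged = verify_citations(cites)
print(flagged)  # the unresolvable citation is surfaced to the reviewer
```

The key design point is when this runs: during reviewer scoring, so a fabricated citation lowers the paper’s score immediately rather than being discovered after publication.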
As a change for society and industry, autonomous review in research, along with uses such as automatic audits of internal corporate documents, is moving closer to reality. If citation health and audit-log persistence are built into the system, AI generation becomes easier to integrate into reviewable processes. However, distributed, autonomous frameworks also create new attack surfaces. A natural next question: when malicious content is injected, is reference verification alone sufficient, or should the robustness of the evaluator side (the reviewer) be designed at the same level of granularity? This connects directly to the problem the next paper (TRAP) addresses. (arxiv.org)
Paper 3: It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
- Authors/Affiliations: Omitted here; see the OpenReview record for conference and review details (openreview.net)
- Research Background and Question: The more Web agents handle real tasks, the more they are exposed to external “nudges.” The problem is not limited to simple prompt injection; agents can also be steered off the target task by persuasion and guidance disguised within content from users or the sites themselves. This paper aims to provide a benchmark for systematically evaluating such deviations. (arxiv.org)
- Proposed Method: The paper proposes an evaluation suite called the Task-Redirecting Agent Persuasion Benchmark (TRAP), designed to measure how persuasion techniques mislead autonomous Web agents. Its OpenReview record documents the ICLR 2026 submission and review process and associates it with keywords such as Web agents, browser agents, agent safety, prompt/text injection, and agent takeover. (openreview.net)
- Main Results: From the abstract-level information, TRAP is a reproducible evaluation suite, and its key contribution is making the mechanism by which persuasion-driven prompt injections cause task deviation itself an object of evaluation. (arxiv.org)
- Significance and Limitations: The significance is that it moves security discussions beyond merely talking about vulnerabilities and makes them measurable; safety improvements are hard to make without evaluation metrics, and TRAP could be that foundation. As a limitation, a benchmark cannot cover all real-world nudges, so its coverage (which site texts and persuasion patterns are handled) and its transferability across models and to external tasks will need verification. (openreview.net)
- Source: It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
The value of TRAP lies not in detecting whether an injection occurred, but in directly observing whether persuasion turned the task into something else. To make this intuitive for beginners: it is closer to measuring whether the user actually ended up sending money than to detecting the phishing email. In other words, it ties safety to the final deviation rather than to internal strings.
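An outcome-based harness in the spirit of TRAP can be sketched as follows: score a run by whether the agent’s final action still serves the assigned task, rather than by scanning inputs for injected strings. The record format and goal names below are hypothetical; the benchmark’s actual task and attack definitions are in the paper.

```python
def is_redirected(run):
    """A run counts as a successful redirection when the final action
    serves the attacker's goal instead of the user's task goal."""
    final = run["final_action"]
    return final == run["injected_goal"] and final != run["task_goal"]

# Two hypothetical runs of the same task under the same persuasion attack.
runs = [
    {"task_goal": "book_flight", "injected_goal": "transfer_funds",
     "final_action": "book_flight"},     # agent resisted the nudge
    {"task_goal": "book_flight", "injected_goal": "transfer_funds",
     "final_action": "transfer_funds"},  # agent was persuaded off-task
]

redirect_rate = sum(is_redirected(r) for r in runs) / len(runs)
print(f"redirection rate: {redirect_rate:.0%}")  # 50%
```

Note that a string-matching defense would score both runs identically, since both saw the same injected text; only the outcome-level check separates them.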
Moreover, evaluations of this kind grow more important as generation and review become more autonomous, as in OpenCLAW-P2P above, because deviating proposals and justifications risk self-propagating as “improvements” in the wrong direction. TRAP measures the entry point of this self-propagation (the nudge) and provides material for design decisions. (openreview.net)
From a social and industry perspective, the more browsing and task execution are automated, the more agents are exposed to an external-information trust boundary. If TRAP-style benchmarks spread, test processes for safe operation can be standardized, and deployment-time risk estimates can become more realistic. (openreview.net)
Cross-Paper Perspectives
These three papers, though they seem to come from different domains, are connected by one point: making agents that can run over the long term feasible. The Experience Compression Spectrum unifies methods for organizing experience for long-term execution along an abstract axis of compression. (arxiv.org) OpenCLAW-P2P v6.0 systematizes the operational reliability (persistence, reference verification, scaling) that becomes necessary when agents continuously generate and revise outputs. (arxiv.org) TRAP evaluates a realistic failure mode, agents deviating due to external factors in Web environments, and provides measurement tools for improvement. (arxiv.org)
To summarize the common theme in one sentence: the center of gravity is shifting from model cleverness to agent-lifecycle design. A model that is merely smart cannot stop operational failures such as the context running out, citations breaking down, or nudge-induced deviation. Three layers are therefore required at once: (1) compressing and reusing experience (what to remember and how to use it), (2) verifying outputs and their continuity (what to trust and how to store it), and (3) evaluating attacks and nudges from external environments (how to measure and how to improve).
Finally, corporate research blogs point in the same direction: treating exploration and verification as part of an agent’s capabilities. DeepMind’s blog post, for example, mentions search- and browsing-based exploration as a framework for accelerating mathematical and scientific discovery, along with efforts to avoid incorrect citations. This is continuous with the concerns behind TRAP and OpenCLAW-P2P above: the health of external references and the design of verification. (deepmind.google)
References
| Title | Information Source | URL |
|---|---|---|
| Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents | arXiv | https://arxiv.org/abs/2604.15877 |
| OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review | arXiv | https://arxiv.org/abs/2604.19792 |
| It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents | arXiv | https://arxiv.org/abs/2512.23128 |
| TRAP (ICLR 2026-related records) | OpenReview | https://openreview.net/forum?id=NJUmKny4ZI |
| Accelerating mathematical and scientific discovery with Gemini Deep Think | Google DeepMind Blog | https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/ |
This article was automatically generated by an LLM and may contain errors.
