Rick-Brick
Paper Review - Instruction Following, Safety Alignment, and Agentic RAG

Executive Summary

This review (2026-04-01) focuses on the angles that determine whether LLMs work "in the field": assessment, alignment, representational stability, and agent design. Concretely, we pursue "implementation-close evaluation" with FireBench, which measures instruction following in enterprise and API integrations. We then turn to a theoretical paper on why RLHF alignment tends to become "shallow," and to the stability of the internal representations underlying consistency under persona conditions. Finally, an SoK paper draws a "map" of the research by systematizing agentic RAG into a unified framework.

Paper 1: FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

  • Authors/Affiliations: Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki. Affiliations are assumed to be available on the paper page; at minimum, the author names can be confirmed there. Source: FireBench (article page).
  • Background and Research Question: LLM evaluation has long focused on "chat-like responses." In real deployments, however, what matters includes strictness of output format, adherence to procedures, assumptions about tool calls, and constraints specific to business domains. The paper therefore asks: what benchmark can measure "instruction following" in enterprise and API-driven settings? FireBench (article page)
  • Proposed Method: The proposal is "FireBench," an instruction-following benchmark designed from real operational patterns. The summary states that it evaluates six core ability dimensions over more than 2,400 samples and reports the behavior and challenges of 11 LLMs in enterprise-oriented scenarios. FireBench (article page)
  • Key Results: The article page makes the intent explicit: to fill the gap left by chat-oriented benchmarks. In particular, the evaluation setup (over 2,400 samples, six dimensions, 11 LLMs) is stated clearly. FireBench (article page)
  • Significance and Limitations: The significance is that the evaluation metrics shift from “lab conversations” toward “operational requirements.” The limitation is that if the evaluation design is optimized too heavily for field conditions, it becomes harder to transfer it to other areas. Benchmarks are not universal; it matters which “real deployment assumptions” they adopt.
  • Source: FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications (FireBench)

If we liken FireBench's idea to a metaphor for beginners: whereas conventional evaluation was mainly like "tasting a dish," FireBench is like "testing the hygiene rules, procedures, quantities, and timelines in kitchen work." Instruction following is not about merely producing "text that sounds right"; it is the property of consistently producing the expected outputs according to the specification. As this kind of evaluation progresses, enterprises can discuss model selection not as "preference" but as the "probability of meeting requirements." For example, in situations where API integrations impose format constraints, such as inquiry summarization, ticket classification, or coding assistance, the tests themselves become components of quality assurance (QA), as sketched below. However, if the distribution of the evaluation set is skewed, the scores will be skewed as well, so an operational practice is needed: before rolling out in production, check whether the included difficulty levels resemble those of the company's own data.
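To make "tests as QA components" concrete, here is a minimal sketch of a format-compliance check in the spirit of FireBench-style evaluation. The JSON schema, field names, and function below are hypothetical illustrations, not FireBench's actual harness.

```python
import json

# Hedged sketch: a minimal instruction-following check of the kind enterprise
# QA pipelines imply. The schema and field names are hypothetical, not taken
# from the FireBench paper.
REQUIRED_FIELDS = {"category": str, "summary": str, "priority": int}

def check_instruction_following(raw_output: str) -> list[str]:
    """Return a list of violations; an empty list means the output met the spec."""
    try:
        data = json.loads(raw_output)  # spec: output must be valid JSON
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    violations = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            violations.append(f"missing required field: {field}")
        elif not isinstance(data[field], ftype):
            violations.append(f"wrong type for {field}: expected {ftype.__name__}")
    if set(data) - set(REQUIRED_FIELDS):
        violations.append("unexpected extra fields")  # spec: no extras allowed
    return violations

# Usage: score a model by the fraction of samples with zero violations.
print(check_instruction_following('{"category": "billing", "summary": "refund", "priority": 2}'))
```

A benchmark score then becomes the pass rate of such checks over a sample set, which is exactly the kind of number an enterprise can tie to requirements.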


Paper 2: Why Is RLHF Alignment Shallow? A Gradient Analysis

  • Authors/Affiliations: Robin Young (affiliation assumed to be available on the paper page). Source: arXiv 2603.04857.
  • Background and Research Question: Alignment via RLHF (Reinforcement Learning from Human Feedback) appears effective in many experiments, yet its effect often looks "limited." The paper therefore tries to explain from theory which token positions the alignment signal reaches during learning, and to what extent; that is, to describe the behavior of the gradients. (arXiv 2603.04857)
  • Proposed Method: The paper characterizes where the gradient concentrates among token positions, and where it vanishes, by decomposing sequence-level harm (harm assigned to the entire sequence) and expressing the per-position gradient via the conditional expectation of harm and its covariance with the score function. According to the abstract, the gradient at position t can be expressed as a relationship between the "conditionally expected harm" and the "score function"; a hedged reconstruction of this identity is sketched after this list. (arXiv 2603.04857)
  • Key Results: The key takeaway is a structural result: gradient-based alignment concentrates signal at the positions that determine harm, and it fades elsewhere (at distant positions). This property may also help explain observations that the KL divergence between an aligned model and its base model is skewed toward earlier tokens. (arXiv 2603.04857)
  • Significance and Limitations: The significance is to move beyond “if you run RLHF, things get better somehow” and to articulate the mechanism—why learning signals are hard to reach—in theoretical terms. The limitation is that the model assumptions used by the theory (such as the definition of harm and decomposition assumptions) raise the separate question of how closely they approximate complex safety risks in real environments.
  • Source: Why Is RLHF Alignment Shallow? A Gradient Analysis (arXiv 2603.04857)
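As a hedged reconstruction of the abstract's claim (our notation, not necessarily the paper's): let H(y) be the sequence-level harm and s_t the score function at position t. A standard policy-gradient identity then shows why the per-position signal hinges on the conditionally expected harm:

```latex
% Hedged reconstruction, not the paper's verbatim derivation.
% Define the per-position score function:
%   s_t := \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}),
% which satisfies E[s_t \mid y_{<t}] = 0. Then
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}\!\left[ H(y) \right]
  = \sum_t \mathbb{E}\!\left[ H(y)\, s_t \right]
  = \sum_t \mathbb{E}\!\left[ \mathbb{E}\!\left[ H(y) \mid y_{\le t} \right] s_t \right].
% Because E[s_t | y_{<t}] = 0, the inner term equals the conditional covariance
% Cov( E[H(y) | y_{\le t}], s_t | y_{<t} ): a position receives gradient only
% insofar as the choice of y_t moves the conditionally expected harm, and
% positions whose tokens do not change it receive (almost) no signal.
```

Under this reading, "shallow alignment" is not a training bug but a consequence of where along the sequence harm becomes determined.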

This paper offers a perspective that does not oversimplify alignment as if it were just a supervised classification task. In beginner terms: if the "places where learning is effective" are confined to a limited range, then behavior elsewhere is difficult to improve. The intuition is similar to a game where bad outcomes are decided in the final few moves; practicing only the opening may not increase your win rate. In other words, if the learning signal provided by RLHF (rewards or losses related to harm) appears strongly at the moment harm is determined and only weakly before and after, then it is natural for optimization to look like "shallow alignment."

In terms of societal and industrial impact, safety evaluation and learning-strategy design may shift toward considering “at which step safety gets decided.” For instance, ideas like strengthening constraints from earlier tokens (or designing intervention points before harm is determined) may become more readily connected to theoretical backing rather than remaining mere heuristics.


Paper 3: Probing the Lack of Stable Internal Beliefs in LLMs

  • Authors/Affiliations: Yifan Luo, Kangping Xu, Yanzhen Lu, Yang Yuan, Andrew Chi-Chih Yao (affiliations assumed to be available on the paper page). Source: arXiv 2603.25187.
  • Background and Research Question: LLMs given a persona are expected to preserve the "same personality" and "same belief tendencies" across interactions. In practice, however, behavior can fluctuate even under identical conditions. The paper therefore asks whether stable internal beliefs exist at all, and in what form their absence shows up. (arXiv 2603.25187)
  • Proposed Method: The core approach is to treat internal representations as "beliefs" and to probe whether they remain consistent; a hedged sketch of such a probe follows this list. Even at the abstract level, the paper states that for persona-driven LLMs to mimic human personality traits (such as persistence and trustworthiness), consistent behavioral tendencies are necessary. (arXiv 2603.25187)
  • Key Results: The paper's main move is to use probing to present evidence that stable internal beliefs may be missing. At minimum, the problem framing is clear: if persona-driven LLMs are to exhibit "behavioral consistency," they require internal stability. (arXiv 2603.25187)
  • Significance and Limitations: The significance is to move beyond evaluating only output quality on the surface, and to bring “why it is inconsistent” down to the level of internal representations. The limitation is that the concept of internal beliefs depends on interpretive hypotheses about the model, so the observed results might also be explained by other factors (training data distributions, sampling effects at inference time, prompt differences).
  • Source: Probing the Lack of Stable Internal Beliefs in LLMs (arXiv 2603.25187)
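As an illustration of what "probing internal beliefs" can mean in practice, here is a minimal linear-probe sketch. The random placeholder activations, the layer choice, and the cross-paraphrase consistency measure are our assumptions for illustration; the paper's actual protocol may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hedged sketch of a belief probe. Assume hidden states (one vector per prompt)
# extracted from a persona-conditioned LLM, with a binary label for the belief
# the persona is supposed to hold (e.g., agrees/disagrees with a claim).
rng = np.random.default_rng(0)
train_states = rng.normal(size=(200, 768))   # placeholder for real activations
train_labels = rng.integers(0, 2, size=200)  # placeholder belief labels

probe = LogisticRegression(max_iter=1000).fit(train_states, train_labels)

# Stability test: the same belief statement asked via k paraphrases. If the
# internal belief were stable, probe predictions should agree across them.
paraphrase_states = rng.normal(size=(8, 768))  # placeholder activations
preds = probe.predict(paraphrase_states)
consistency = max(np.mean(preds == 0), np.mean(preds == 1))
print(f"cross-paraphrase consistency: {consistency:.2f}")  # 1.0 = perfectly stable
```

With placeholder data the number is meaningless; the point is the shape of the measurement. A belief counts as "stable" only if the probe reads the same direction out of the representations across surface variations of the same question.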

For beginners, it can help to think of internal beliefs as “policy notes in your head.” Humans make similar judgments in similar situations, and behind that is the stability of beliefs. Similarly in LLMs: when you specify a particular persona, if the internal representations are held “in the same direction,” consistency emerges.

On the other hand, if the internal state fluctuates, each response may look plausible, but over the long term it is easier for things to become “different from before.” In real applications, this directly impacts user experience (UX) and operational trust. For example, if a customer-support agent character suddenly changes tone mid-conversation, it may be a sign that the structural “skeleton” of the designed persona is not being maintained—not just a matter of surface expression.

From an industry perspective, it is expected that we will not treat persona LLMs merely as “problems of output templates,” but rather broaden the questions toward “state maintenance during reasoning” and “alignment during training.”


Paper 4: SoK: Agentic RAG — First Unified Framework for Autonomous Retrieval-Generation Systems

  • Authors/Affiliations: As an SoK (Systematization of Knowledge) paper, it likely has multiple authors; the source available here confirms at least the paper ID and the framework abstract. Source: the Agentic RAG SoK summary page (arXiv 2603.07379).
  • Background and Research Question: RAG (Retrieval-Augmented Generation) is evolving from a simple search→generation pipeline into “agentification,” where the LLM autonomously adjusts multiple steps. But research remains fragmented, evaluation is not unified, and there is no shared classification (taxonomy). So the aim is to make a “map of knowledge” for how to organize agentic RAG, how to evaluate it, and what to pay attention to. Agentic RAG SoK page
  • Proposed Method: As an SoK, it explains why agentic RAG needs systematization and delineates, as the scope of that systematization, the components of autonomous architectures that evolved from retrieve-and-generate (multi-step reasoning, dynamic memory management, iterative retrieval, and so on). Agentic RAG SoK page
  • Key Results: The “key results” visible from the page are that, toward unifying the framework, it makes explicit the fragmentation of research and the risks (e.g., lack of unified evaluation, potential system risks, absence of taxonomy) and argues for the need for integration. Agentic RAG SoK page
  • Significance and Limitations: The significance is that in a rapidly expanding area like agentic RAG, the SoK could provide “traffic control” that aligns terminology and evaluation axes. The limitation is that because SoK is essentially “organization,” it may not present direct numerical improvements comparable to papers that produce new SOTA results in experiments.
  • Source: SoK: Agentic RAG — First Unified Framework for Autonomous Retrieval-Generation Systems (arXiv 2603.07379)

Here too, let’s use a beginner-friendly analogy. Traditional RAG is like “go to a library, find books, read them, and then summarize”—whereas agentic RAG is closer to a state where you run the entire workflow as one operation: “search → read → find what you don’t understand → search again with added information → change your approach if necessary.”
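A minimal sketch of that loop makes the difference from a fixed pipeline concrete. The retrieve/generate functions and the stopping rule below are hypothetical stubs, not components mandated by the SoK:

```python
# Hedged sketch of an agentic RAG loop: search -> read -> identify gaps ->
# re-search with a refined query. All functions are hypothetical placeholders.

def retrieve(query: str) -> list[str]:
    """Placeholder retriever; a real system would query a search index."""
    return [f"document about {query}"]

def generate(question: str, context: list[str]) -> tuple[str, str | None]:
    """Placeholder LLM call returning (answer, follow_up_query or None).
    A real agent would ask the model to flag what it still does not know."""
    return f"answer based on {len(context)} documents", None

def agentic_rag(question: str, max_steps: int = 4) -> str:
    context: list[str] = []
    query = question
    answer = ""
    for _ in range(max_steps):       # bounded iteration, not a fixed pipeline
        context += retrieve(query)   # search / re-search
        answer, follow_up = generate(question, context)
        if follow_up is None:        # the agent decides it has enough evidence
            return answer
        query = follow_up            # refine the query and loop again
    return answer

print(agentic_rag("What does the SoK classify as mandatory components?"))
```

The design questions the SoK cares about live exactly in this control flow: who decides when to stop, what counts as memory, and which steps are auditable.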

What troubles researchers in this situation is that although the granularity of the work differs from paper to paper, it still gets called by the same name. The unified framework that SoK aims for aligns what counts as “mandatory components,” what counts as “implementation choices,” and what should be measured in evaluation. As this progresses, comparisons among models and agent designs can be discussed not as “surface-level performance,” but as “differences in capability under the same conditions.”

From an industry perspective, it becomes possible to design RAG not just as a standalone function but as a system that includes retrieval, memory, decision-making, and tool integration. As a result, it may become easier to satisfy requirements such as reducing misinformation injection (hallucinations), keeping up with information updates, and enabling auditability.


Cross-Paper Reflections

Even though the four works in this review seem to have different themes, they share a common focus: measuring, explaining, and designing LLMs not as "output engines" but as "systems that guarantee behavior." FireBench measures instruction following in forms close to enterprise and API deployment. The RLHF gradient analysis explains, from the standpoint of learning dynamics, where the alignment signal reaches, providing a rationale for why safety improvements can remain limited. The internal-beliefs probing work traces fluctuations in persona consistency to internal states, guiding diagnosis deeper than surface-level quality evaluation. Finally, the SoK for agentic RAG organizes the fragmentation and non-unified evaluation that arise when retrieval and generation become agentic into a unified framework.

When these are combined, you can see that the main battleground in research and development is shifting from "improving model scores" toward "guaranteeing which properties a model has, under which assumptions, in which states, and on which evaluation axes." Moreover, as seen on OpenAI Research's pages, recent interest in safety and alignment is also expanding toward "safety controls that work in operations," such as monitoring and instruction hierarchy (OpenAI Research). The broader research trend also suggests how tightly this is connected with agentification: Google DeepMind, for example, discusses agentic workflows like Gemini Deep Think in the context of scientific advancement (Google DeepMind, Gemini Deep Think). As agentification progresses, the importance of evaluation, alignment, internal-state diagnosis, and systematization increases, because agents accumulate multiple judgments and actions; if it is unclear at which stage a failure happened, you cannot improve.

As a roadmap for the future, the cycle may strengthen: (1) identify “how it breaks” with field-oriented evaluations like FireBench, (2) narrow down “reasons learning doesn’t reach” with theory like RLHF gradient analysis, (3) diagnose “where the fluctuations come from” with internal-belief probing, and (4) prepare the “design space” and the “basis for comparison” with the SoK of Agentic RAG.


References

  • FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications. Article (AI Navigate). https://ai-navigate-news.com/en/articles/127560eb-3c88-49b9-acfa-7b70547b3158
  • Why Is RLHF Alignment Shallow? A Gradient Analysis. arXiv. https://arxiv.org/abs/2603.04857
  • Probing the Lack of Stable Internal Beliefs in LLMs. arXiv. https://arxiv.org/abs/2603.25187
  • SoK: Agentic RAG — First Unified Framework for Autonomous Retrieval-Generation Systems. arXiv. https://arxiv.org/abs/2603.07379
  • Gemini Deep Think (Agentic Workflow for Scientific Discovery). Google DeepMind blog. https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/

This article was automatically generated by an LLM. It may contain errors.