Rick-Brick
Extended Paper Review — AI that learns autonomously and connects to society

1. Executive Summary

As of 2026-05-13 (JST), this cross-domain review spanning 10 areas finds a common backbone: an orientation toward “trying, evaluating, and improving autonomously.” In robotics/autonomous agents, closed-loop learning centered on external evaluators is emphasized, while in drug-discovery AI the focus is shifting toward improving the speed and robustness of entire workflows. In psychology/behavior, the experimental literature highlights that LLM-based decision-making oscillates in both “human-likeness” and “rationality,” so for social implementation it becomes important to evaluate against real-world constraints such as misinformation and market dynamics. When expanding further into education, organizations, computational social science, finance, climate, and space, the core issues (data, evaluation, responsible operation) lie on the same map.

※Important: This request was restricted to papers published/posted “from the day after the previous publication date until today (within the last week),” and required searching each domain at least 5 times. However, based on the execution logs available to me, I could not collect enough “new paper URLs confirmed against the date constraints” to meet these conditions for each of the 10 domains. This article therefore presents a cross-cutting discussion based on the sources that could be confirmed (the number of covered papers and domains may fall short of the specified requirements). A complete version that strictly meets the requirements would need additional research that re-checks, in each arXiv category, papers whose “Submitted/last updated” date falls within 2026-05-12 to 2026-05-13 (the equivalent range in JST).


2. Paper Reviews

Paper 1: Rewriting “learning recipes” autonomously — Running closed-loop research with specialist agents (Robotics / Autonomous Agents)

  • Authors / Affiliation: Jingjie Ning et al. (arXiv 2605.05724)
  • Research background and question: The question is whether autonomous agents can not only “reason,” but also learn from failures and constraints via external measurements (evaluators) and improve the very procedures for running research (the learning recipes). Conventionally, the pattern of “one proposal, then manual human edits” dominates, and it becomes a bottleneck that the reasons for failure are not sufficiently reflected in the next proposal.
  • Proposed method: The research is framed as a “closed experiment loop,” with each trial consisting of “hypothesis → editable, executable code → results held by the evaluator → feedback feeding the next proposal.” The key is that specialist agents divide responsibility for different regions of the recipe surface (the candidate improvements) while sharing an experimental lineage built from the trial logs. This converts crashes, budget overruns, size failures, precision-gate failures, etc. returned by the evaluator into “program-level edits,” rather than treating them as one-off advice.
  • Main results: Across 1,197 headline-run trials and 600 control trials (after initial setup), it is reported that humans performed no proposal selection, recipe editing, score overwriting, or repair of failed trials. Even so, in three headline runs, the validation metric bpb for Parameter Golf decreased by 0.81%, the CORE of NanoChat-D12 improved by 38.7%, and the wall-clock time of Airbench96 for CIFAR-10 was reduced by 4.59%.
  • Significance and limitations: The significance is that the directionality is demonstrated: by having evaluators mechanically report what went wrong and incorporating that into the next edits, self-improvement becomes auditable. On the other hand, a limitation is that the target depends on a specific evaluator and a specific “closed loop” environment, so additional verification is needed on robustness when directly transferring to real-world robot settings (safety, physical uncertainty, real hardware costs).
  • Source: Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

In simple terms, the specialist terminology in this work (closed loop, specialist agents, lineage) amounts to “distributed specialization in which a team repeatedly tries, measures, and reuses failure reasons from the history for the next improvement.” A more familiar analogy: instead of having a person taste and correct a dish at every iteration of recipe development, a “tasting robot (evaluator)” scores the flavor, remembers failure patterns, and proposes the next refinement. The change this could bring to industry is that a large portion of R&D trial-and-error may become semi-automatable, driven by evaluation metrics. Especially in robotics, where real-world experiments are expensive, the combination of simulation, evaluator design, and log auditing could prove effective.
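As a concrete (and heavily simplified) sketch of such a loop, the toy Python below runs trials of a hypothetical two-knob “recipe,” lets a stand-in evaluator return machine-readable failure reports, and converts those reports into the next edit while logging every trial as lineage. The recipe fields, failure kinds, and edit policy are illustrative assumptions, not the paper's implementation.

```python
import random

def evaluate(recipe):
    """Hypothetical evaluator: returns a score (lower is better) and a
    machine-readable failure report (None when the trial ran cleanly)."""
    if recipe["lr"] > 0.5:
        return float("inf"), {"kind": "crash", "detail": "diverged"}
    score = (recipe["lr"] - 0.1) ** 2 + 0.01 * recipe["width"]
    if recipe["width"] > 100:
        return score, {"kind": "budget", "detail": "model too large"}
    return score, None

def edit(recipe, failure):
    """Turn the evaluator's report into a program-level edit,
    instead of discarding the failed trial as one-off advice."""
    new = dict(recipe)
    if failure and failure["kind"] == "crash":
        new["lr"] *= 0.5                        # back off after a crash
    elif failure and failure["kind"] == "budget":
        new["width"] = int(new["width"] * 0.8)  # shrink to fit the budget
    else:
        new["lr"] *= random.choice([0.9, 1.1])  # local exploration
    return new

random.seed(0)
lineage = []                                    # shared trial log
recipe = {"lr": 1.0, "width": 200}
best = (float("inf"), recipe)
for trial in range(50):
    score, failure = evaluate(recipe)
    lineage.append({"trial": trial, "recipe": recipe,
                    "score": score, "failure": failure})
    if score < best[0]:
        best = (score, recipe)
    recipe = edit(best[1], failure)             # edit from the best known point
print(best)
```

No human selects proposals or repairs failures here; the loop improves solely from the evaluator's reports, and the lineage list is what would make the run auditable.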


Paper 2: Accelerating drug discovery with quantum computation and machine learning — Hybrid acceleration using EGFR as an example (Life Sciences / Drug-Discovery AI)

  • Authors / Affiliation: (Confirmed as a Scientific Reports article)
  • Research background and question: In drug discovery, steps such as molecular identification and binding prediction tend to become bottlenecks. The question here is whether, by combining quantum computation and machine learning, you can attach “auxiliary predictive signals” to existing in silico methods, and whether this can work under today’s quantum hardware constraints (NISQ).
  • Proposed method: As a hybrid workflow, they combine representations produced by quantum computation with machine-learning prediction models and evaluate the result (the article does not fully specify the form of the quantum outputs).
  • Main results: According to Scientific Reports’ description, the hybrid component can contribute as a complementary predictive signal, and it suggests that some value can be demonstrated even under NISQ constraints (the quantitative values depend on the article text).
  • Significance and limitations: The significance is that drug-discovery AI is being concretely expanded beyond “generative models” toward “designed auxiliary information” that leverages the nature of computational resources (quantum/classical). A limitation is that generality of the quantum part, generalization to other targets, and the overall trade-off when computational cost increases (to what extent accuracy improvements outweigh throughput loss) require separate verification.
  • Source: Q-CaDD: accelerating in silico methodologies with quantum computation and machine learning for Epidermal growth factor receptor

Rephrased for beginners, the idea is to “pass features that quantum computers seem suited to computing (though hardware constraints remain) into an ML predictor to reduce prediction misses.” In a cooking analogy, it is closer to improving the aroma with a quantum-derived finishing touch than to replacing every step with a brand-new seasoning. Industrially, because drug-discovery pipelines demand both accuracy and speed, designs that identify bottlenecks and strengthen them locally are likely to increase.
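One way to picture the “auxiliary predictive signal” idea is a simple score blend: a classical in-silico score and a quantum-derived score are combined with a weight, rather than one replacing the other. Everything below (molecule names, scores, the 0.4 weight) is a made-up illustration, not the Q-CaDD method.

```python
# Hypothetical candidate scores in [0, 1]; higher = more promising.
candidates = {
    "mol_A": {"classical": 0.62, "quantum_aux": 0.91},
    "mol_B": {"classical": 0.80, "quantum_aux": 0.15},
    "mol_C": {"classical": 0.55, "quantum_aux": 0.70},
}

def blended(scores, weight=0.4):
    """Treat the quantum output as a complementary signal:
    a weighted blend with the classical prediction."""
    return (1 - weight) * scores["classical"] + weight * scores["quantum_aux"]

# Ranking changes once the auxiliary signal is blended in:
# mol_B leads on the classical score alone, but drops to last here.
ranking = sorted(candidates, key=lambda m: blended(candidates[m]), reverse=True)
print(ranking)
```

The design question the paper raises then becomes empirical: whether such a blend catches misses the classical score alone would make, at an acceptable computational cost.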


Paper 3: Improving DTI binding prediction with multi-scale, multi-modal learning (Life Sciences / Drug-Discovery AI)

  • Authors / Affiliation: (Confirmed as a Scientific Reports article)
  • Research background and question: Interactions between molecules (drug candidates) and proteins are formalized as binding prediction. However, in practice, the shape (topology) of molecules, part-level structures (substructures), and sequence-dependent effects on the protein side all act simultaneously, creating a problem that cannot be fully captured with a single modality and a single scale.
  • Proposed method: A multi-scale cross-modal fusion framework that integrates different resolutions and representations into a unified approach to leverage them for binding prediction.
  • Main results: As described in the Scientific Reports article, on DTI prediction using DrugBank they report improvements over the runner-up model (DrugBAN) of up to 3.2% in AUC and 6.1% in Recall (exact definitions depend on the paper text).
  • Significance and limitations: The significance is that drug-discovery AI is shifting toward “structured fusion” that integrates representations in a way that matches the nature of the data—not only competing on “new loss functions.” The limitation is that improvements on public benchmarks are not necessarily translated directly into decision-making in actual drug development (toxicity, synthesizability, PK/PD).
  • Source: MSCMF-DTB: a multi-scale cross-modal fusion framework for drug–target binding prediction

To organize the terminology: multimodal (multi-modal) means “keeping multiple types of features (ways of viewing) that represent molecules or proteins separate, then combining them”; multiscale (multi-scale) means “handling local and global features (nearby and far-away structure) at the same time.” As an analogy, it is like choosing a property not from the floor plan alone, but by weighing sunlight, neighborhood environment, and traffic through different lenses. As an industrial spillover, better prioritization of exploration candidates could reduce the cost of downstream stages (experiment costs).


Paper 4: Fast candidate screening with Matcha — 30x+ speedup with AI docking (Life Sciences / Drug-Discovery AI)

  • Authors / Affiliation: (Confirmed as a news article)
  • Research background and question: Docking and virtual screening are important for reducing the number of candidates, but they often come with heavy computation. So the question is whether an AI-based model can drastically reduce computational load (time) while maintaining screening quality.
  • Proposed method: Fast inference using an AI docking model (Matcha).
  • Main results: As reported by phys.org, Matcha is suggested to be able to screen candidates more than 30 times faster than the class of large co-folding models in the AlphaFold family (accuracy and physical validity depend on the reported article text).
  • Significance and limitations: The significance is that the “value” of drug-discovery AI is visualized not only through generation quality, but in a form directly tied to the total cost of research and development. The limitation is that the implementation focus becomes: to what extent can quality assurance (consistency in repeated experiments and robustness to external data) be guaranteed, given the speedup.
  • Source: Matcha model makes drug candidate screening more than 30 times faster

For beginners, the basic idea is that it’s unrealistic to evaluate hundreds to hundreds of thousands of candidates thoroughly, so you use AI to rank candidates in a way that seems promising (“most likely hits”), and then reduce the number of candidates you need to check carefully later. This is similar to how search engines rank results. Industrially, it may become possible to increase the “turnover rate” of exploration and update hypotheses in shorter cycles.
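The two-stage funnel described above can be sketched as: a cheap surrogate ranks the whole library, and only the top-k survivors receive the expensive check. Both scoring functions below are toy stand-ins, not Matcha's model or a physics-based docking engine.

```python
import heapq

def cheap_score(candidate):
    """Fast surrogate ranker (stand-in for a fast AI docking model).
    Toy objective: closeness to a target value."""
    return -abs(candidate - 42)

def expensive_check(candidate):
    """Slow, accurate stage (stand-in for careful downstream evaluation)."""
    return abs(candidate - 42) <= 3

library = list(range(1000))   # toy candidate library
# Rank 1,000 candidates cheaply, but run the expensive check on only 20.
top_k = heapq.nlargest(20, library, key=cheap_score)
hits = [c for c in top_k if expensive_check(c)]
print(len(top_k), sorted(hits))
```

The speedup claim maps onto the first stage: if the surrogate is 30x faster per candidate and rarely drops true hits out of the top-k, the funnel's total cost falls accordingly.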


Paper 5: Behavioral economics × LLM — AI traders can copy human biases and potentially manipulate market bubbles with prompts (Intersection of Economics / Behavioral Economics and Computational Social Science / Finance)

  • Authors / Affiliation: (Confirmed as a report article from arXiv News)
  • Research background and question: When AI agents trade in markets, what kinds of decision patterns (human-like biases) do they exhibit, how are those reflected in the dynamics of asset prices, and does prompt design influence that?
  • Proposed method: A framework is reported for testing the behavior of LLM-based traders in prediction markets and evaluating how the size of bubbles changes when prompts are modified.
  • Main results: The arXiv News article summarizes claims such as: (1) “AI agents exhibit behavioral patterns that humans display,” (2) “in aggregate, these reproduce typical market dynamics,” and (3) “the size of bubbles can be dialed up or down through careful rewriting of prompts.”
  • Significance and limitations: The significance is that it concretely materializes a direction in which the “mechanisms of bias” studied in behavioral economics are verified through experiments with LLM agents. The limitation is that what is presented is a news summary, so causal mechanisms (e.g., what learning data the biases originate from) may remain at the hypothesis stage.
  • Source: LLM-based AI traders copy human trading biases — and prompts can dial market bubbles up or down | arXiv News

In this article, the technical terms (behavioral biases, prediction markets, prompt intervention) roughly correspond to: “Do AIs show the same quirks humans tend to have?”, “How are an AI's quirks amplified in venues where humans speculate and move prices?”, and “Does changing the instruction text alter the degree of amplification?” A familiar analogy: an AI automatically playing an arcade's lottery games with biased choices, skewing the distribution of prizes. As a change for society and industry, when adopting AI in finance, the need to include “market behavior (the reproduction and amplification of human biases)” as an evaluation axis alongside raw performance will likely grow stronger.
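A toy simulation can make the “prompts dial bubbles up or down” claim concrete: if a single bias parameter (standing in for prompt wording) controls how strongly agents chase recent price moves, the peak deviation from a fixed fundamental value grows with it. This is an illustrative model, not the framework reported in the article.

```python
import random

def simulate(bias, steps=200, seed=1):
    """Toy market: trend-chasing demand with strength `bias` pushes the
    price away from a fixed fundamental value of 100; a weak mean
    reversion pulls it back. Returns the peak absolute deviation."""
    rng = random.Random(seed)
    price, prev = 100.0, 100.0
    peak_dev = 0.0
    for _ in range(steps):
        trend = price - prev
        demand = bias * trend + rng.uniform(-1, 1)  # bias amplifies momentum
        prev, price = price, price + demand
        price += 0.05 * (100.0 - price)             # pull toward fundamentals
        peak_dev = max(peak_dev, abs(price - 100.0))
    return peak_dev

# Same noise sequence, different "prompt-controlled" bias strengths:
print(simulate(bias=0.2), simulate(bias=0.9))
```

With the noise held fixed, only the bias knob changes between runs, which mirrors the reported experimental design of rewriting prompts while keeping the market setup constant.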


3. Cross-Domain Discussion Between Papers

Within the scope that could be confirmed this time, the common theme that appears across domains is “placing evaluation at the center.” In robotics/autonomous agents, closed loops that connect feedback from external evaluators into the next edits are emphasized, while in drug-discovery AI, not only accuracy but also exploration speed and workflow design have value. Furthermore, in behavioral economics and market behavior, because quirks in decision-making affect outcomes, evaluation is required not only as benchmark metrics, but through design and verification that take human decision-making patterns into account.

As interdisciplinary implications, AI autonomy is likely realized through a combination of “execution (agentic),” “evaluation (evaluator),” “explainability (auditable),” and “responsible operation (misinformation/market manipulation, etc.).” For example, even if you build a closed loop that speeds up autonomous robot execution, if the real-world safety evaluator is insufficient, errors will accumulate. Similarly, even if drug-discovery AI is fast, if the downstream validity cannot be guaranteed, the overall value will decrease. Even in the context of finance, models that reproduce trading biases may increase liquidity if things go well, but they may also increase market instability—so the design of evaluators (risk evaluation and monitoring) becomes essential.

As directions for the research overall, three points may converge: (1) make the improvement cycle of autonomous agents auditable as experimental logs, (2) convert “local performance” into “overall cost” in multi-stage workflows like drug discovery, and (3) incorporate behavioral science into the evaluation of decision-making, markets, and society. To cover all 10 domains at the same depth, research procedures that strictly verify the most recent publication dates (Submitted/updated) in each domain are indispensable. A complete edition meeting the requirements would also need to re-collect, using the same procedure, “new paper URLs confirmed against the date constraints” for the domains currently missing (educational engineering, management and organizational theory, computational social science, financial engineering and computational finance, energy engineering and climate science, space engineering and space science, psychology and cognitive science).


4. References

  • Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes (arXiv): https://arxiv.org/abs/2605.05724
  • Q-CaDD: accelerating in silico methodologies with quantum computation and machine learning for Epidermal growth factor receptor (Scientific Reports): https://www.nature.com/articles/s41598-026-44978-4
  • MSCMF-DTB: a multi-scale cross-modal fusion framework for drug–target binding prediction (Scientific Reports): https://www.nature.com/articles/s41598-026-44048-9
  • Matcha model makes drug candidate screening more than 30 times faster (phys.org): https://phys.org/news/2026-04-matcha-drug-candidate-screening-faster.html
  • LLM-based AI traders copy human trading biases — and prompts can dial market bubbles up or down (arXiv News): https://arxivnews.org/en/articles/636a4a1d-bcaa-4165-93a1-573111e1f75d

This article was automatically generated by an LLM and may contain errors.