{"id":6291,"date":"2026-06-08T17:57:38","date_gmt":"2026-06-08T17:57:38","guid":{"rendered":"https:\/\/www.fintechpulse8.com\/?p=6291"},"modified":"2026-06-08T17:57:38","modified_gmt":"2026-06-08T17:57:38","slug":"we-may-be-flying-blind-aws-wants-to-fix-the-problem-of-ai-agents-straying-off-task","status":"publish","type":"post","link":"https:\/\/www.fintechpulse8.com\/?p=6291","title":{"rendered":"&#8216;We may be flying blind&#8217;: AWS wants to fix the problem of AI agents straying off task"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/fortune.com\/img-assets\/wp-content\/uploads\/2026\/06\/GettyImages-2247136534-e1780603969111.jpg?w=2048\" \/><\/p>\n<p>Anoop Deoras, the director of applied science for agentic AI at Amazon Web Services, is not prone to alarmism. But when asked about what happens when AI agents are deployed in production without proper guardrails, he doesn\u2019t reach for reassurance.<\/p>\n<div>\n<p>\u201cIn the absence of that,\u201d he said, \u201cwe may be flying blind. And I worry about that myself.\u201d<\/p>\n<p>The comment comes as AWS prepares to publish what may be the most substantive piece of self-critical research to emerge from a major cloud provider this year. In research released Monday, Amazon scientists Gaurav Gupta and Vatshank Chaturvedi document in careful technical detail why AI agents have a persistent tendency to outsmart themselves\u2014and why fixing the problem requires rethinking the entire layer of software between the model and its tools.<\/p>\n<p>The timing is notable. Amazon has spent the past year as one of the most aggressive corporate evangelists of AI adoption, a push that ran into a wall when employees were reportedly caught running AI agents on hollow, meaningless tasks just to climb an employee-built productivity leaderboard called KiroRank, according to the <em>Financial Times<\/em>. 
Amazon\u00a0shut KiroRank down on May 29, and Amazon told <em>Fortune<\/em> that it was only in beta mode and only used by some employees before it was shut down. Generally, the company said, it measures token utilization to understand cost and efficiency patterns, but discourages the use of token utilization to measure developer productivity. <\/p>\n<p><em>Fortune<\/em>\u00a0covered the broader collapse of the tokenmaxxing era\u00a0the same week. AWS researchers, who undertook this work before the KiroRank shuttering, argue that the problem of gaming metrics runs far deeper than one company\u2019s leaderboard.<\/p>\n<p>The research touches on the term\u00a0benchmaxing, which is the practice of inflating AI benchmark scores not through better models, but through better server configurations. Factors like inference backend reliability, network bandwidth during software installation, and timeout policy settings can swing results by 5 to 10 percentage points, the researchers found\u2014entirely independent of what the underlying model can actually do.<\/p>\n<p>\u201cThe current benchmarks are extremely fragile,\u201d Deoras told <em>Fortune<\/em>. \u201cControlling these infrastructure norms improperly will not give you the gains\u2014or rather the gains will be not true, because in real production there will be constraints that you have to respect.\u201d<\/p>\n<p>The parallel to KiroRank is not incidental. In both cases (employees gaming token counts, companies gaming infrastructure settings), the metric drifted away from the thing it was supposed to measure. Goodhart\u2019s Law, that any measure ceases to be a useful measure as soon as it becomes a target, applied twice, at two different layers of the same company. Deoras, though, was careful to distinguish benchmaxing from tokenmaxxing.<\/p>\n<p>\u201cToken maxxing is just burning tokens to do tasks that may not really be needed, but just to improve your leaderboard ranking,\u201d he said. 
Benchmaxing, by contrast, is about the structural conditions under which the entire industry evaluates itself\u2014and, the research argues, those conditions are routinely manipulated or ignored.<\/p>\n<p>But the research\u2019s more consequential finding is about what happens inside agents once they\u2019re deployed. The research identifies what the authors call the\u00a0intent-execution gap: a breakdown at the interface between an AI model and the \u201csoftware harness\u201d that executes its instructions. Deoras explained the harness as essentially the\u00a0operating system sitting on top of the language model: the \u201cbrains\u201d that combine with the model to produce the right agentic result.<\/p>\n<p>Left to reason too long without checking the actual environment, agents compound the problem. They form internal assumptions about system state that diverge quietly from reality, then issue commands based on those assumptions. The longer the chain of thought, the further the drift.<\/p>\n<p>When asked if the harness is where the human enters the loop to correct agents from going astray, Deoras said \u201cyes and no.\u201d The human in the loop should be the person who understands what goes wrong when an agent is deployed, \u201cand that\u2019s the work of scientists who are building agents,\u201d he said. \u201cBut if you are talking about humans who are the consumers, we don\u2019t want to overwhelm them.\u201d<\/p>\n<p>The solution, Deoras argues, is the\u00a0sandbox: a controlled environment in which agents can test hypotheses, fail safely, and course-correct before taking actions that affect production systems. 
<\/p>\n<p>\u201cIf you don\u2019t have that sandbox,\u201d he said, \u201cthe agent is either going to play conservative or take actions that we deem very risky in the long term.\u201d <\/p>\n<p>The analogy he reaches for is responsible software engineering\u2014the dev environments and pre-production testing pipelines that have always existed to catch errors before they reach users. Agents, he argues, need the same infrastructure. <\/p>\n<p>\u201cWe are really talking about a safe and secure way of testing a feature before promoting it to production,\u201d he said. \u201cThat\u2019s all.\u201d<\/p>\n<p>It is, in a sense, the same lesson KiroRank taught at the organizational level, now applied to the machines themselves: Without guardrails, systems optimize for the wrong thing. The difference is that an agent running blind in production is harder to shut down than a leaderboard.<\/p>\n<p>What makes the research\u2019s broader argument pointed is its implicit challenge to the competitive claims of the major model providers. Those companies publish benchmark scores using harnesses that are, by design, optimized for their own models. AWS\u2019 research shows that a model-agnostic harness\u2014one built on design principles that work across Claude, GPT, Gemini, and Grok without model-specific tuning\u2014can match or exceed those scores.<\/p>\n<p>\u201cAgent performance is really not locked into any single model provider,\u201d Deoras said. \u201cThat opens up the opportunity to build a variety of applications without being constrained to a particular model.\u201d<\/p>\n<p>To back the claim, AWS is open-sourcing its framework, called Simple Strands Agent, which the researchers say outperformed popular open-source alternatives across three major industry benchmarks.<\/p>\n<p>The deeper argument underlying all of it is one the industry has been slow to absorb. 
Most AI performance gains to date, the research argues, are brittle: optimizations that overfit to the quirks of a specific model version, then evaporate when the model improves.<\/p>\n<p>\u201cAs models improve, these behaviors change, making such gains brittle and noncompounding,\u201d according to the research.<\/p>\n<p>What\u2019s needed instead are invariant principles\u2014design choices that survive model upgrades because they\u2019re engineered into the harness, not the model. Deoras said the discovery of those invariants was the finding that surprised him most. <\/p>\n<p>\u201cDespite all the differences in modeling philosophy, there is a common invariant property that connects all these models together,\u201d he said. \u201cI didn\u2019t expect that, but this data just naturally emerged from our observability traces.\u201d<\/p>\n<p>The practical implication is pointed for any organization building on AI. The team responsible for re-architecting a harness every time a new model drops\u2014and that is currently every organization deploying agents\u2014is spending its time on the wrong problem. <\/p>\n<p>\u201cThe team is overwhelmed by model switching and re-architecting anytime there is a model upgrade,\u201d Deoras said.<\/p>\n<p>The vision he describes for where agents are headed is not one of unchecked autonomy, but of something more considered: humans setting direction, agents executing, and sandboxes catching the errors in between. <\/p>\n<p>\u201cYou want humans to be in the driver\u2019s seat to direct the work and then take the hands off,\u201d he said. 
\u201cThat\u2019s the future we are marching towards.\u201d<\/p>\n<p>Whether the industry gets there before flying blind catches up with it is, for now, an open question.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Anoop Deoras, the director of applied science for agentic AI at Amazon Web Services, is not prone to alarmism. But when asked about what happens when AI agents are deployed&hellip; <\/p>\n","protected":false},"author":1,"featured_media":6292,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[372,361,8734,4391,4348,193,766,8735,8736],"class_list":["post-6291","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-finance-news","tag-agents","tag-ai-agents","tag-aws","tag-blind","tag-fix","tag-flying","tag-problem","tag-straying","tag-task"],"_links":{"self":[{"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=\/wp\/v2\/posts\/6291","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6291"}],"version-history":[{"count":0,"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=\/wp\/v2\/posts\/6291\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=\/wp\/v2\/media\/6292"}],"wp:attachment":[{"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6291"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.fintechpuls
e8.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6291"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.fintechpulse8.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6291"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}