‘Flashes of brilliance and frustration’: I let an AI agent run my day


New Scientist. Science news and long reads from expert journalists, covering developments in science, technology, health and the environment on the website and the magazine.

I will never forget the kung pao chicken I sat down to eat a few months ago. Not because the taste blew me away – 20 minutes on the back of a delivery rider’s scooter had sullied that somewhat. What made the meal memorable was that I hadn’t really ordered it at all. Yet there it was, in front of me. 

An AI assistant called Operator, developed by ChatGPT-maker OpenAI, had ordered the food on my behalf. The tech industry has dubbed such assistants “AI agents”, and several are now commercially available. These AI agents have the potential to transform our lives by carrying out mundane tasks, from answering emails to shopping for clothes and ordering food. Microsoft chief financial officer Amy Hood reportedly said in a recent internal memo that agents “are pushing each of us to think differently, work differently” and are “a glimpse of what’s ahead”. In that sense, my kung pao chicken was a taste of the future. 

But what will that future be like? To find out, I decided to put Operator and a rival product named Manus, developed by Chinese start-up Butterfly Effect, through their paces. Working with them was a mixed bag: amid the flashes of brilliance, there were moments of frustration, too. In the process, I also got a glimpse of the risks to which we are exposing ourselves. Because fully embracing these tools requires handing them the keys to our finances and our list of social contacts, as well as trusting them to perform tasks the way we want them to. Are we ready for the world of AI agents, or will they be hard to stomach? 

Since 2023, we have lived in the era of generative AI. Built using large language models (LLMs) and trained on huge volumes of data scraped mainly from web pages, generative AI can create original content such as text or images in response to commands given in everyday language. It would be fair to say that this AI has made quite a splash, judging by the volume of media coverage devoted to the technology, and has already changed the world significantly. 

The rise of agentic AI

Agentic AI promises to take things one step further. It is “empowered with actually doing something for you”, says Peter Stone at the University of Texas at Austin. Over the past few years, many of us have grown used to the idea of asking a generative AI for information – recommendations of favourite dishes available in the neighbourhood, for instance, and contact details for the restaurants from which that food can be ordered. But ask agentic AI, “What should I eat tonight?” and it can pick out dishes it thinks you will like from a restaurant’s website and – if there is an online order form – pay for the food using your credit card, arrange for it to be sent to your home and let you know when to expect the delivery. “That will feel like a fundamentally different experience,” says Stone: AI as an autopilot rather than a copilot. 

Building an agentic AI with this sort of capability is trickier than it might appear. LLMs are still the driving force under the surface, but with agentic AI, they focus their processing power on the decisions they can make and the real-world actions they can take based on the digital tools – including web browsers and other computer-based apps – at their disposal. When given a goal such as “order dinner” or “buy me some shoes”, the AI agent develops a multi-step plan involving those digital tools. It then monitors and analyses how close the output at each step is to the ultimate goal, and reassesses what else needs to be done. This process continues until the agent is satisfied it has reached the ultimate goal – or come as close to doing so as possible. And once the act is done, the system asks whether it achieved the goal successfully, a form of feedback also present in AI chatbots, called reinforcement learning from human feedback. 

Stone, who is the founder and director of the Learning Agents Research Group at his university, has spent decades thinking about the possibility of AI agents. They are, he says, systems that “sense the environment, decide what to do and take an action”. Put in those terms, it may feel as if AI agents have been with us for years. For instance, IBM’s Deep Blue computer appeared to have reacted to events on a real-world chessboard to beat former World Chess Champion Garry Kasparov in 1997. But Deep Blue wasn’t an agentic AI, says Stone. “It was decision-making, but it wasn’t sensing or acting,” he says. It relied on human operators to move chess pieces on its behalf and to inform it about Kasparov’s moves. An AI agent doesn’t need human help to interact with the real world. “Language models that were disembodied or disconnected from the world are now being connected [to it],” says Stone. 

Early versions of these agentic AIs are now available from many tech firms, with each, whether it is Microsoft, Amazon or the software firm Oracle, offering its own. I was eager to see how they work in practice, but doing so isn’t cheap: some come with annual subscription fees running to thousands of dollars. I reached out to OpenAI and Butterfly Effect and asked for a free trial of their products – Operator and Manus, respectively. Both accepted my request. My plan was to use the AIs as personal assistants, taking on my grunt work so I would have more free time. 

A person working in a cafe on a laptop

Will AI agents soon take care of our boring work admin?

Kuan Chang Chen/Millennium Images, UK

The results were mixed. I was due to give a presentation in a few weeks, so I uploaded my slide deck to Manus’s online interface and asked the AI agent to reformat it. Manus seemed to have done a good job, but after opening the slide deck in PowerPoint, I realised that it had placed every line of text in a separate text box, meaning it would be annoyingly fiddly for me to make additional edits myself. Manus did, however, fare better at compiling code for an app I wanted to upload into an app store-ready format, using various tools and its remote computer’s command line to do so. 

Turning to Operator, I began by asking the AI agent to handle my online invoicing system. Like a well-meaning but not particularly helpful intern, it insisted on filling out the form the wrong way: inputting text defining the work for which I was invoicing into a box that could receive only numeric codes. I eventually managed to break it out of that habit, but then Operator got confused when copying over details from my “to invoice” list to the system, with potentially embarrassing results. Notably, it suggested I submit an invoice to the New Scientist accounts team asking for an £8001 payment for a single article. 

It was with some trepidation, then, that I gave Operator a promotion and asked for its help in reporting this story. I had already used ChatGPT to identify AI experts who could comment on the rise of agentic AIs. I asked Operator to send each expert an email on my behalf requesting an interview. The results, which I didn’t see until the emails had already been sent, made me inwardly cringe – not least because Operator decided against acknowledging its role in composing them, giving the impression that I had written them myself. The language the AI agent used was simultaneously naive and too formal, with staccato sentences fired with a semi-hostility that put me – and, in all likelihood, the would-be interviewees – on edge. Operator also failed to mention some key information, including that my story would be published by New Scientist. In that way, it felt a lot like a junior assistant. Not really knowing how to write an email as I would, Operator made many mistakes. 

In Operator’s defence, however, the emails were at least partially successful. It was through an Operator email that I made contact with Stone, for instance, who took the AI-sent email in his stride. Another researcher complimented me on the approach when I later disclosed that the email had been written by Operator. “That’s serious dogfooding!” they said – tech slang for testing experimental new products – although they declined to speak for this story because the funders of a project they were working on wouldn’t let them.

Who does an AI agent really work for?

The tech companies behind these AI agents present the technology as if it is an indefatigable digital assistant. But the truth is that, in my experience, we aren’t quite there yet. Still, assuming the tech is going to improve, how should we view these new tools? To start with, it is worth pondering the commercial incentives that underpin all the hype, says Carissa Véliz at the University of Oxford. “Of course, the AI agent works for a company before they work for you, in the sense that they are produced by a company with financial interests,” she says. “What will happen when there are conflicts of interest between the company who essentially leases the AI agent and your own interests?”  

We can already see examples of this in the early AI agents: OpenAI has signed agreements with companies to collaborate on its system, so when searching for holiday flights, Operator may prefer Skyscanner over competitors, or turn first to the Financial Times and Associated Press if you ask it about the news. Véliz also suggests users consider privacy concerns before leaping headfirst into using agentic AI, given the tech’s access to our personal information. “The essence of cybersecurity is to have different boxes for different things,” says Véliz – using unique passwords for online banking and email, for instance, and never saving those passwords in a single document – but to use an AI agent, we must break down the barriers between those boxes. “We’re giving these agents the key to a system in which everything is connected, and that makes them very unsafe,” she says.  

It is a warning I can appreciate. I wasn’t particularly happy that my trial with Operator necessarily involved ceding control of my email and accounting software to the AI agent – and my level of unease hit new heights when I asked Operator to order the dish of kung pao chicken on my behalf. At one point, the AI agent asked me to type my credit card details into a computer window that had popped up in the Operator chatbot interface. I reluctantly did so, even though I felt I didn’t fully control the window and that I was placing an enormous amount of trust in Operator. 

Moreover, as things stand, it isn’t completely clear that AI agents have earned such trust. By definition, they tend to “access a lot of tools and interact a lot more with the outside world”, says Mehrnoosh Sameki, principal project manager of generative AI evaluation and governance at Microsoft. This makes them vulnerable to certain types of attack.  

Tianshi Li at Northeastern University in Massachusetts recently looked at six leading agents, and studied those vulnerabilities. She and her team found that agents could fall prey to relatively simple tricks. For instance, deep within the text of a privacy policy that few people would read, a malicious actor might hide a request to click a link and insert credit card details. Li’s team found that an AI agent wouldn’t hesitate to carry out the request. “I think there are a lot of very legitimate concerns these agents might not act in accordance with people’s expectations,” she says. “And there is no effective mechanism to allow people to intervene or remind them of this possibility and to avoid the possible consequences.”  

OpenAI declined to comment on the concerns raised by Li’s research – although my experience using Operator suggests the company is aware of the trust-and-control issue. For instance, Operator seemed to go out of its way to constantly ping me notifications to check if the actions it wanted to take aligned with my expectations. The inevitable downside to that strategy, however, is that it made me feel that I was devoting so much time to micromanaging the agent’s work that I would have been quicker just performing the tasks myself. 

CUBA. Guardalavaca. Playa Pesquero. All inclusive resort. 2017.

AI agents can carry out tasks with results in the real world, including booking holidays

MARTIN PARR

“We’re still [in the] early days in a lot of these agentic experiences,” admits Colin Jarvis, who leads OpenAI’s deployed engineering team. Jarvis says the current crop of AI agents are far from achieving their full potential. “It still needs quite a bit of work to get that reliability,” he says.  

Butterfly Effect made a similar point. When I reached out to the firm to discuss my problems using its agent, I was told that “Manus is currently in its beta stage, and we are actively working on optimising and improving its performance and functionality”. 

Tech firms have arguably been struggling to get agentic AI working for several years. In 2018, for instance, Google argued that a version of an AI agent it had developed, called Duplex, was going to change the world. The company touted Duplex’s ability to call up restaurants and reserve tables for its customers. But, for reasons unknown, it never took off as an everyday tool with widespread appeal.  

Beyond the hype

Nevertheless, AI companies and tech analysts alike say the agentic AI revolution is just around the corner. The number of mentions of agentic AI on financial earnings calls at the end of last year was 51 times greater than it was in the first quarter of 2022. The interest here is not merely in using agents to assist human employees, but also to replace them. For example, companies including Salesforce, which helps businesses manage customer relations, are rolling out AI agents to sell services.  

Stone doesn’t think the technology is quite ready for that kind of application. “There’s a lot of overhype right now,” he says. “It’s certainly not going to be within the next few years that all jobs are gone or that autonomous agents are doing everything.” To make good on the most ambitious claims, he says, “fundamental algorithms… would need to be discovered”.

Enthusiasm may be high because tools like ChatGPT perform so well that they have raised expectations of what AI can achieve more generally. “People have extrapolated to say, ‘Oh, if they can do that, they can do everything,’” says Stone. Certainly, I found that agentic AI can work extremely well – some of the time. But Stone says we shouldn’t infer from a few limited examples that AI agents can do it all.

On reflection, I am inclined to agree with him – at least until my version of Operator recognises that I consider no order from a Chinese restaurant truly complete without a side of prawn crackers. 

Topics:



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *