Introduction

A few days ago, yet another AI agent struck. It is magnificently named “Intelligent Internet,” or II-agent, claiming to be the smartest in the world while being fully open source! So what is it? How did it score so high on autonomous AI agent benchmarks? How does it stack up against other AI agents such as OpenAI Deep Research, Manus, and GenSpark?

In this article, we will not only answer the above questions, but also carry out a few real-world test cases to see if such AI agents truly deliver.

According to II-agent’s official blog post, it beats other closed-source AI agents on the GAIA benchmark.

Author’s Notes: GAIA is a benchmark for General AI Assistants that poses real-world questions requiring fundamental abilities such as reasoning, multi-modality handling, web browsing, and tool-use proficiency. The questions are conceptually simple for humans yet challenging for the most advanced AIs: human respondents achieve 92% accuracy versus only 15% for GPT-4 equipped with plugins.

What is II-agent

Unique features and achievements

II-Agent is an open-source intelligent assistant framework that tops benchmarks for available agents while maintaining full transparency and auditability. Its core capabilities span research and fact-checking with multi-step web search, content generation including blog drafts and technical manuals, data analysis with automated reporting, software development across multiple languages, workflow automation, and advanced problem-solving.

The system features sophisticated architecture with Anthropic’s Claude 3.7 Sonnet at its core, intelligent context management that can handle up to 120,000 tokens, advanced browser automation with vision capabilities, and real-time WebSocket communication for interactive use cases.

Competitive Position Against Leading Frameworks:

II-Agent positions itself as a superior open-source alternative to leading closed-source agents like Manus (which achieved 65% on the GAIA benchmark) and GenSpark AI. While previous open-source attempts like OpenManus by Foundation Agent and Suna by Kortix-AI “have attempted to replicate the capabilities of Manus and GenSpark,” none “have come close to matching their performance or achieving comparable benchmark results”.

II-Agent addresses this gap by not only matching but surpassing benchmark performance while maintaining the critical advantages of being open-source, transparent, and extensible — qualities deemed essential for regulated industries and maximum adoption.


Technical architecture

You can find II-agent’s GitHub repository here.

High Level Architecture

II-Agent provides two primary interfaces: a command-line tool (cli.py) for direct terminal-based interactions and a WebSocket server (ws_server.py) that feeds a React-based frontend living in the frontend/ directory (GitHub). At its heart lies a modular agent core in src/ii_agent/, which orchestrates LLM calls (via the llm/ clients), handles planning and reflection logic, and dispatches execution tasks (defined in tools/) while leveraging shared utilities from utils/ (GitHub). Real-time communication is achieved through WebSockets, with isolated agent instances streaming events to the browser UI for a responsive user experience (GitHub).
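To make the flow concrete, here is a minimal sketch of what a client of the WebSocket server could look like. The endpoint path and message schema below are my assumptions for illustration (the port matches the backend’s default of 8000 mentioned later in this article); the real protocol is defined in ws_server.py and the frontend code.

```python
# A minimal sketch of talking to II-Agent's WebSocket server.
# The path "/ws" and the message fields are assumptions for
# illustration; check ws_server.py for the actual protocol.
import asyncio
import json

import websockets  # pip install websockets


async def run_query(prompt: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(json.dumps({"type": "query", "content": prompt}))
        # The server streams events (plan steps, tool calls, results)
        # to the client until the task completes.
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"), event.get("content"))


asyncio.run(run_query("Recreate the Nokia snake game as a html5 web game"))
```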

Detailed Internal Architecture

Let me examine the specific modules within the II-Agent repository structure more closely.

Agents Module (src/ii_agent/agents/): This module contains the core agent implementations that serve as the central orchestration component, managing the agent’s operational cycle through iterative reasoning and capability selection. The agents handle system prompting with dynamically tailored context, maintain comprehensive interaction history, and coordinate with the LLM to interpret user instructions and decide how to proceed with tasks. Each agent instance operates within isolated workspaces and manages the flow from user input through tool execution to final response delivery.
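Conceptually, this operational cycle is the familiar reason-act loop. The sketch below is illustrative Python, not II-Agent’s actual classes; `llm`, `tools`, and `decision` are hypothetical stand-ins.

```python
# Illustrative agent loop: the LLM either answers or picks a tool,
# and tool results are fed back into the history each iteration.
def agent_loop(llm, tools, user_input, max_iterations=20):
    history = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        # Ask the LLM what to do next, given the full interaction history.
        decision = llm.generate(history)
        if decision.is_final_answer:
            return decision.content
        # The LLM chose a tool; execute it and record the result.
        result = tools[decision.tool_name].run(**decision.tool_args)
        history.append({"role": "assistant", "content": decision.content})
        history.append({"role": "tool", "content": result})
    raise RuntimeError("Agent did not finish within the iteration budget")
```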

LLM Interface Module (src/ii_agent/llm/): This module provides the interface layer for interacting with large language models, specifically optimized for Anthropic’s Claude 3.7 Sonnet. It handles both direct Anthropic API access and Google Cloud Vertex AI routing, managing authentication, request formatting, and response processing. The module implements intelligent context management with token usage estimation and strategic truncation techniques to handle conversations up to 120,000 tokens while maintaining conversation quality and relevance.
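Here is a rough sketch of what strategic truncation under a token budget can look like. The 120,000-token figure comes from the project’s description; the four-characters-per-token estimate and the drop-oldest strategy are my assumptions, not II-Agent’s actual algorithm.

```python
# Sketch of token-budgeted history truncation (assumptions noted above).
MAX_TOKENS = 120_000


def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, not a real tokenizer


def truncate_history(messages: list[dict]) -> list[dict]:
    total = sum(estimate_tokens(m["content"]) for m in messages)
    kept = list(messages)
    # Keep the system prompt (index 0) and the most recent turns;
    # drop the oldest middle messages until we fit the budget.
    while total > MAX_TOKENS and len(kept) > 2:
        dropped = kept.pop(1)
        total -= estimate_tokens(dropped["content"])
    return kept
```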

Tools Module (src/ii_agent/tools/): This module encompasses the rich set of execution capabilities that allow the agent to interact with its environment, including file system operations with intelligent code editing, command line execution in secure environments, advanced web interaction and browser automation, and specialized capabilities for various modalities. The tools support PDF text extraction, audio transcription, speech synthesis, image generation, video generation, and deep research integration. The browser automation capabilities are built on Playwright with vision integration, enabling the agent to navigate webpages, capture screenshots, and analyze visual content.
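For a flavor of that browser automation layer, here is a minimal Playwright snippet showing the navigate-and-screenshot pattern the tools build on; the vision-analysis step that would consume the screenshot is elided.

```python
# Minimal Playwright example: navigate, then capture a full-page
# screenshot for later visual analysis.
# Setup: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", timeout=30_000)  # timeout in ms
    page.screenshot(path="page.png", full_page=True)
    browser.close()
```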

Utilities Module (src/ii_agent/utils/): This module provides supporting infrastructure including the context management system that handles token estimation and optimization, file-based archival mechanisms for large outputs, WebSocket communication protocols for real-time interaction, and various helper functions for data processing and system operations. The utilities also include the planning and reflection capabilities that enable structured reasoning, problem decomposition, and transparent decision-making processes that create an audit trail of the agent’s problem-solving approach.
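To illustrate the idea of an audit trail, a structured plan can be as simple as a list of timestamped steps. The schema below is hypothetical, not II-Agent’s actual format.

```python
# Hypothetical plan-step schema illustrating an auditable trail of
# the agent's reasoning; field names are my own.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PlanStep:
    description: str
    status: str = "pending"  # pending | running | done | failed
    notes: list[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.notes.append(f"{stamp} {message}")


steps = [PlanStep("Research Nokia phone models"), PlanStep("Build the website")]
steps[0].status = "running"
steps[0].log("Searching for launch years and units sold")
```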

Dependency on Other Open-Source Frameworks

II-Agent integrates tightly with multiple open-source services and libraries. For its LLM backends, it supports Anthropic Claude via the Anthropic Python SDK and Google’s Vertex AI through the google-cloud-aiplatform client, while also offering plug-and-play search providers such as Tavily, Jina, Firecrawl, and SerpAPI (GitHub). On the frontend and tooling side, it builds atop a React/Node.js stack (Next.js conventions) and leverages community-driven projects — most notably AugmentCode for software-engineering task orchestration, Manus’s prompt-architecture patterns, and the Index Browser Use toolkit for advanced web interactions (GitHub).
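As an example of those plug-and-play search providers, a Tavily call through its Python SDK looks roughly like this; the exact client API and response fields may differ between SDK versions.

```python
# Illustrative Tavily search call; requires `pip install tavily-python`
# and an API key. Response fields shown are typical but may vary.
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")  # your API key here
response = client.search("Nokia 1100 launch year units sold")
for result in response.get("results", []):
    print(result["title"], result["url"])
```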

Easy test case

First I wanted to try a simple test case to establish a baseline. And there is no better simple test case than the Nokia Snake game. Even though it was mentioned in II-agent’s example cases, I added a small twist: it should be an HTML5 mobile game.

[Prompt]: Recreate the Nokia snake game as a html5 web game

And it was a success on the first attempt! Not only was the game ready to play, it also added Nokia feature-phone-specific design elements that made the game authentically retro.

Author’s Notes: As a reference, the same prompt in Claude Sonnet 3.7 produced a similar-looking app, but it immediately shows “Game Over” when a user tries to play. Most likely its collision detection algorithm was not properly implemented, and the game was not tested.

Now let’s have a more detailed look at how II-agent achieved this:

First, II-agent created a plan (always important for autonomous agents):

  1. Set up the basic HTML structure with a canvas element
  2. Create CSS styles for the game layout
  3. Implement the core Snake game logic in JavaScript:
    • Snake movement and controls
    • Food generation
    • Collision detection
    • Score tracking
    • Game over handling
  4. Add Nokia-style visual elements (pixel art, color scheme)
  5. Implement mobile-friendly controls
  6. Test the game thoroughly
  7. Deploy the game so you can play it

Next it created the relevant files and implemented the code. Finally, it deployed the game as a web app so I could test it out. The II-agent web UI is similar to Claude’s, with three main tabs on the right-hand side: Browser, Code, Terminal.
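As an aside, the collision logic that apparently tripped up Claude Sonnet 3.7 (see the Author’s Notes above) is conceptually simple. A grid-based check looks something like the following; this is illustrative Python, while the generated game implements the same idea in JavaScript.

```python
# Grid-based snake collision checks (illustrative Python; the actual
# game logic lives in JavaScript).
GRID_WIDTH, GRID_HEIGHT = 20, 20


def hits_wall(head: tuple[int, int]) -> bool:
    x, y = head
    return not (0 <= x < GRID_WIDTH and 0 <= y < GRID_HEIGHT)


def hits_self(head: tuple[int, int], body: list[tuple[int, int]]) -> bool:
    # Compare the head against the rest of the body only; comparing it
    # against itself is a classic instant-game-over bug.
    return head in body[1:]
```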

Difficult test case

Since the Nokia Snake game is well documented and tested across many coding LLMs and AI agents, it would be good to run a test case that goes beyond simple coding.

My plan was to let II-agent build a website about Nokia’s historical feature phones, a task that inherently includes research (Nokia feature phone models), design (website layout), coding, and browser use (browsing the Internet for product images and specs, and testing the final website).

Well, the results show that we still have a long way to go.

First try

[Prompt] Make a website for Nokia feature phone museum, showcasing all the Nokia feature phones in a product gallery format. For each phone model, display the product image, launch year, units sold, and one paragraph short description of key features.

It created a decent plan to tackle this challenge:

  1. Research Nokia feature phone models to collect information on:
    • Product images
    • Launch years
    • Units sold
    • Key features for descriptions
  2. Design the website layout:
    • Create a responsive main page with an introduction to Nokia feature phones
    • Implement a product gallery section to display all phone models
    • Include individual phone detail sections
    • Add navigation and search functionality
  3. Develop the website using:
    • HTML for structure
    • Tailwind CSS for styling
    • JavaScript for any interactive elements (if needed)
  4. Test the website for:
    • Responsiveness
    • Proper display of information
    • Navigation functionality
  5. Deploy the website and provide you with the final URL

Then it went on to do research, a lot of research. I could see from the “Browser” tab that it was browsing the Internet for a very long time, but eventually it failed and stopped. The reason? One of the browsing tasks timed out on a particularly slow website.

The AI agent’s plan did not include any exception handling or error recovery mechanism.
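Such a mechanism is not hard to imagine. The sketch below retries a slow page a few times and then skips it rather than aborting the whole run; `fetch_page` is a hypothetical stand-in for the agent’s real browsing tool.

```python
# Defensive wrapper around a browsing task: retry with backoff, then
# skip the page instead of failing the entire run.
import time


def browse_with_recovery(fetch_page, url, retries=3, timeout=30):
    for attempt in range(1, retries + 1):
        try:
            return fetch_page(url, timeout=timeout)
        except TimeoutError:
            print(f"{url} timed out (attempt {attempt}/{retries})")
            time.sleep(2 * attempt)  # brief backoff before retrying
    # Give up on this one source rather than aborting the whole task.
    return None
```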

Second try

Since the agent tried too hard in the first attempt to get information on ALL the Nokia feature phones, my second attempt aimed to reduce the scope to the top 5 feature phones.

[Prompt]: Make a website for Nokia feature phone museum, showcasing Nokia’s top 5 feature phones in a product gallery format. For each phone model, display the product image, launch year, and one paragraph short description of key features.

This time the agent failed and stopped again, due to a different reason: Anthropic’s rate limit kicked in and threw an error. Again, the AI agent had no exception handling or recovery mechanism.

In addition, there was another glaring error. The AI agent did NOT follow the user instruction to limit the research to only the top 5 phones, and instead went out researching all the feature phones.
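As for the rate limit, the standard remedy is exponential backoff around the API call. Here is a sketch using the Anthropic Python SDK, which raises RateLimitError on HTTP 429; the model name and wrapper shape are illustrative.

```python
# Exponential backoff around an Anthropic API call (sketch).
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env


def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-3-7-sonnet-latest",  # illustrative model name
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited; sleeping {wait}s before retrying")
            time.sleep(wait)
    raise RuntimeError("Still rate limited after all retries")
```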

Third try

Thinking the third time might be a charm, I thought: why not let the agent create the mock-up website first, and then research the phone models? This way we would have at least a proof of concept in case the phone-model research got stuck again. Basically, we swap the order of the two main tasks needed to achieve the goal.

[Prompt]: Make a website for Nokia feature phone museum, showcasing Nokia’s top five feature phones in a product gallery format. For each phone model, display the product image, launch year, and one paragraph short description of key features. Remember:

1) First build a mock up website without real data

2) When research Nokia phone models, only research top five models, only 5

So the first part of creating the mock-up website seemed to be going well.

But when it was time to test the website, there was a problem: the agent used port 8000 for its HTTP server, which conflicts with II-agent’s own backend server. Since there is no way to interrupt II-agent and change the port number on the fly, I had to kick off a new session with the changed instructions.
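For what it’s worth, avoiding this class of failure is nearly a one-liner: ask the OS for a free port instead of hard-coding one. A sketch:

```python
# Ask the OS for an unused port instead of hard-coding 8000.
import socket


def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


port = find_free_port()
print(f"Serve the site with: python -m http.server {port}")
```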

Fourth try

[Prompt]: Make a website for Nokia feature phone museum, showcasing Nokia’s top five feature phones in a product gallery format. For each phone model, display the product image, launch year, and one paragraph short description of key features. Remember:

  1. First build a mock up website without real data, use port 8888 for the web server
  2. When research Nokia phone models, only research top five models, only 5

The mock-up website actually works, and looks decent.

The automated website testing did not work, probably due to some process management issue in the Terminal tab. Fortunately, after several failed attempts, it went on with life and said:

“I attempted to start a local web server on port 8888, but I’m experiencing some issues with it. This doesn’t affect the actual website development, as I’ll now proceed to research Nokia’s top 5 feature phones to add the real data.”

I felt a glimmer of hope for AI agents at this point.

Eventually the test case still failed, again due to Anthropic’s rate limit. And the AI agent was not able to recover from that.

Fantasy vs Reality — my learnings

We have always wanted very intelligent AI agents that can be truly autonomous and do wonderful things on our behalf. But the reality is that we are quite far from it. This is not a specific criticism of II-agent, but of AI agents in general.

There are certain things that humans do extremely well and almost effortlessly, while all the state-of-the-art AI agents struggle. For example: recovering gracefully from an unexpected error (a slow website, an API rate limit), respecting a scope constraint like “only the top 5 models,” and adapting a plan mid-task when something goes wrong.

So with today’s SOTA models and agents, we have a few options right now:

  1. Co-Pilot: this has proven very successful both conceptually and commercially. For example, coding with Cursor and GitHub Copilot, or Microsoft Copilot for Office 365.
  2. Human in the loop: one design pattern is the Agent Inbox by LangChain, where the agent can interrupt a human when it is stuck or needs assistance (see the sketch after this list). The challenge remains whether AI agents can be smart enough to decide when to issue those interrupts.
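Stripped of any particular framework, the escalation pattern can be sketched in a few lines; `run_step` and `ask_human` are hypothetical stand-ins, not LangChain’s actual API.

```python
# Generic human-in-the-loop escalation, in the spirit of the Agent
# Inbox pattern: the agent interrupts a human instead of dying.
def run_with_escalation(steps, run_step, ask_human):
    for step in steps:
        try:
            run_step(step)
        except Exception as err:
            # Park the task and interrupt the human with enough
            # context to help, rather than aborting the whole run.
            decision = ask_human(f"Step {step!r} failed with: {err}. Retry or skip?")
            if decision == "retry":
                run_step(step)  # one more attempt under human supervision
            # otherwise skip this step and move on
```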

So there you have it: the second wave of fully autonomous AI agents, from OpenAI’s Deep Research to Manus, GenSpark, and now II-agent. Baby steps for sure, but we are making progress.

Postscript

One day after I did the first tests and wrote the article, I decided to give the difficult test one more chance. This time it worked (99%): it created the mock-up website, successfully followed instructions to research the top 5 phones, gathered all the info and images, and eventually updated the website!

Why 99% instead of 100%? The very last step after modifying the website, launching it for testing, failed. After some manual debugging, it seems II-agent created a new session and could no longer find the original website files. But running the website manually with Python works, and the screenshots were taken.

This sounds like a small bug that can be fixed.