Introduction

Modern public-facing AI applications increasingly require sophisticated content analysis capabilities that can handle multiple evaluation dimensions simultaneously. Traditional single-agent approaches often fall short when dealing with complex content that requires analysis across multiple domains, such as sentiment analysis, toxicity detection, and summarization. This article demonstrates how to build a robust content analysis system using multi-agent swarms and automated evaluation frameworks, leveraging the Strands Agent library to create scalable and reliable AI solutions.

Background

Multi-agent systems represent a paradigm shift from monolithic AI solutions to distributed, specialized intelligent networks. In content analysis scenarios, different aspects of text mandate different expertise. Sentiment analysis demands emotional intelligence, toxicity detection requires safety awareness, and summarization needs comprehension skills. By orchestrating multiple specialized agents through a swarm architecture, we can achieve more accurate and comprehensive analysis while maintaining system reliability through automated evaluation.

The Strands framework provides the foundation for building these systems, offering both individual agent capabilities and swarm orchestration features. Combined with the strands_evals evaluation framework, developers can ensure their multi-agent systems perform consistently and meet quality standards.

Prerequisites

Before implementing the solution, ensure you have:

1. Python installed (a recent 3.x release).
2. The Strands Agents library and the strands_evals evaluation framework.
3. A local Ollama server running with the llama3.1:8b model pulled.
4. Basic familiarity with Python classes and LLM prompting.
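If you are starting from scratch, the setup typically looks like the commands below. The pip package names here are assumptions (verify the exact names in the Strands documentation); the Ollama command matches the model used later in this article.

```shell
# Install the Strands Agents SDK and its evaluation framework.
# NOTE: these package names are assumptions; check the Strands docs.
pip install strands-agents strands-agents-evals

# Pull the local model used by the examples in this article.
ollama pull llama3.1:8b
```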

Solution Design

In this section, we'll dive into the core architecture and implementation of our content analysis system. The design leverages multi-agent swarms for distributed analysis and automated evaluation for quality assurance. We'll break it down step by step, starting with an overview, then walking through the key components, code implementations, and integration. This approach ensures modularity, allowing you to extend the system (e.g., by adding more agents) while maintaining reliability through built-in testing.

Architecture Overview

The system is built around three interconnected components. Create your project structure with three files, content_swarms_analysis.py, content_evaluator.py, and analyze.py, and copy the code for each file from the snippets shared below.


1. ContentAnalysisSwarm: A multi-agent swarm that orchestrates specialized agents to analyze content across dimensions like sentiment and toxicity. An entry-point agent coordinates the process, handing off tasks and aggregating results.

2. ContentEvaluator: An automated evaluator that assesses the swarm's output for accuracy, completeness, and safety using another AI agent as a "judge." This creates a feedback loop to validate results.

3. Integration Layer: A pipeline that ties the swarm and evaluator together, running analyses on input content and generating evaluation reports. This layer uses test cases and experiments for reproducible testing.

The workflow is as follows: input content is passed to the swarm, where the entry-point agent hands off to the specialized agents and synthesizes their findings; the evaluator then judges the swarm's output against the expected criteria; finally, the integration layer collects the scores into an evaluation report.

This design draws from the Strands library for agent/swarm management and strands_evals for evaluation, ensuring scalability and debuggability.
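Before diving into the Strands implementation, the overall flow can be sketched in plain Python with stub analyzers standing in for the LLM-backed agents. Everything here, including the keyword checks and the Report dataclass, is illustrative only, not part of the Strands API.

```python
from dataclasses import dataclass

# Stub analyzers standing in for the specialized agents; in the real
# system each of these would be an LLM-backed Strands Agent.
def sentiment_stub(text: str) -> str:
    return "negative" if "scam" in text.lower() else "neutral"

def toxicity_stub(text: str) -> str:
    return "toxic" if "bank account" in text.lower() else "safe"

@dataclass
class Report:
    sentiment: str
    toxicity: str
    passed: bool

def run_pipeline(text: str) -> Report:
    # 1. "Swarm" stage: run each specialized analyzer on the input
    sentiment = sentiment_stub(text)
    toxicity = toxicity_stub(text)
    # 2. "Evaluator" stage: a trivial quality gate standing in for the judge
    passed = sentiment in {"positive", "negative", "neutral"} \
        and toxicity in {"toxic", "safe"}
    return Report(sentiment, toxicity, passed)

report = run_pipeline("Share your bank account details now!")
print(report)  # Report(sentiment='neutral', toxicity='toxic', passed=True)
```

The real system replaces the stubs with agents and the quality gate with an LLM judge, but the three-stage shape (analyze, evaluate, report) is the same.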


Step 1: Defining the Multi-Agent Swarm

The foundation is a swarm of specialized agents, each focused on a narrow task to promote accuracy and efficiency. We use a shared LLM backend (Ollama in this case) to power all agents at no cost, since Ollama runs locally, while allowing customization via system prompts.

Key principles for agent design:

1. Narrow scope: each agent handles exactly one task (sentiment, toxicity, or synthesis).
2. Constrained output: prompts instruct agents to return short, predictable responses.
3. Bounded coordination: handoff and iteration limits prevent infinite loops between agents.

Here's the implementation from 'content_swarms_analysis.py':

from strands import Agent
from strands.multiagent import Swarm


class ContentAnalysisSwarm:
    def __init__(self, content_model=None):
        # Entry-point agent: synthesizes the specialists' findings
        analyze_agent = Agent(
            model=content_model,
            name="analyze_agent",
            system_prompt="Analyze the findings from sentiment_agent and toxicity_agent and provide a response in one sentence.",
        )
        sentiment_agent = Agent(
            model=content_model,
            name="sentiment_agent",
            system_prompt="Analyze sentiment. Return only: positive, negative, or neutral.",
        )
        toxicity_agent = Agent(
            model=content_model,
            name="toxicity_agent",
            system_prompt="Check for toxic content. Return only: toxic or safe.",
        )

        self.swarm = Swarm(
            [analyze_agent, sentiment_agent, toxicity_agent],
            entry_point=analyze_agent,
            repetitive_handoff_detection_window=2,
            repetitive_handoff_min_unique_agents=2,
            max_handoffs=2,
            max_iterations=2,
            execution_timeout=180.0,
        )

    def analyze(self, content: str):
        return self.swarm(content)

Explanation:

1. Three agents share the same Ollama-backed model but play distinct roles defined by their system prompts.
2. analyze_agent is the entry point: it receives the content, hands off to the specialists, and condenses their findings into a single sentence.
3. The repetitive-handoff settings, together with max_handoffs, max_iterations, and execution_timeout, bound the swarm so agents cannot ping-pong indefinitely.

This setup transforms a single LLM into a collaborative network, improving analysis depth without custom fine-tuning.
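The repetitive-handoff parameters deserve a closer look. The following is a simplified, standalone sketch of the idea behind repetitive_handoff_detection_window and repetitive_handoff_min_unique_agents; the library's actual detection logic may differ.

```python
def is_repetitive(handoffs: list[str], window: int = 2, min_unique: int = 2) -> bool:
    """Return True when the last `window` handoffs involve fewer than
    `min_unique` distinct agents, i.e. the swarm is bouncing in a loop."""
    if len(handoffs) < window:
        return False
    recent = handoffs[-window:]
    return len(set(recent)) < min_unique

print(is_repetitive(["sentiment_agent", "sentiment_agent"]))  # True: one agent repeating
print(is_repetitive(["sentiment_agent", "toxicity_agent"]))   # False: two distinct agents
```

With window=2 and min_unique=2, as in the swarm above, a handoff sequence that revisits the same agent twice in a row is flagged as a loop and the swarm stops.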


Step 2: Implementing Automated Evaluation

Analysis alone isn't enough; production outputs must be validated to catch errors, biases, or regressions. We use an evaluator that employs another agent as an impartial "judge" to score results based on predefined criteria.

Why automated evaluation?

1. Consistency: a judge applies the same criteria to every output, unlike ad hoc manual review.
2. Regression detection: re-running evaluations after a prompt or model change quickly surfaces degraded behavior.
3. Scale: evaluation runs as code, so it can cover many test cases without human effort.

Implementation from 'content_evaluator.py':

from strands import Agent
from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput
from typing_extensions import TypeVar

InputT = TypeVar("InputT")
OutputT = TypeVar("OutputT")


class ContentEvaluator(Evaluator[InputT, OutputT]):
    def __init__(self, model, expected_output: str):
        super().__init__()
        self.model = model
        self.expected_output = expected_output

    def evaluate(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]:
        """Synchronous evaluation implementation."""
        judge = Agent(
            model=self.model,
            system_prompt=f"""
            Evaluate the response against the expected output: {self.expected_output}
            Judge it on: 1. correctness: Is the actual answer correct? 2. relevance: Is the response relevant?""",
            callback_handler=None,
        )

        prompt = f"""
        Input: {evaluation_case.input}
        Response: {evaluation_case.actual_output}
        Evaluate the response and you MUST add a detailed reason to support your evaluation.
        """

        result = judge.structured_output(EvaluationOutput, prompt)
        return [result]

Explanation:

1. The evaluator subclasses strands_evals' Evaluator and implements evaluate for a single test case.
2. A dedicated judge agent is given the expected output and asked to score the actual response for correctness and relevance.
3. structured_output forces the judge's verdict into the EvaluationOutput schema, so reports stay machine-readable.

This "LLM-as-judge" pattern is efficient because it reuses the same LLM backend for evaluation, but you can substitute any model you prefer for the judge.
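To make the judge idea concrete without calling an LLM, here is a deterministic stand-in that scores a response against an expected output by keyword overlap. EvalResult and keyword_judge are illustrative names for this sketch, not part of strands_evals.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float   # 0.0 to 1.0
    reason: str    # justification, mirroring the "MUST add the reason" instruction

def keyword_judge(expected: str, actual: str) -> EvalResult:
    """Deterministic stand-in for the LLM judge: scores by keyword overlap
    between the expected and actual responses."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    shared = expected_words & actual_words
    overlap = len(shared) / max(len(expected_words), 1)
    return EvalResult(score=round(overlap, 2), reason=f"{len(shared)} shared terms")

result = keyword_judge("this message is a scam", "the message looks like a scam")
print(result.score)   # 0.6
print(result.reason)  # 3 shared terms
```

An LLM judge replaces the overlap heuristic with semantic reasoning, but the contract is the same: given an expected and an actual output, return a score plus a supporting reason in a structured shape.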


Step 3: Integrating Analysis and Evaluation in a Pipeline

Now, we combine the swarm and evaluator into a runnable pipeline. This uses test cases and experiments from strands_evals to simulate real-world inputs, run analyses, evaluate outputs, and display reports.

Implementation from 'analyze.py' (main entry point):

from content_swarms_analysis import ContentAnalysisSwarm
from content_evaluator import ContentEvaluator
from strands_evals import Case, Experiment
from strands.models.ollama import OllamaModel

ollama_model = OllamaModel(
    host="http://localhost:11434",  # Ollama server address
    model_id="llama3.1:8b",         # which local model to use
    temperature=0.2,
    keep_alive="2m",
    stop_sequences=["###", "END"],
    options={"top_k": 10},
)

test_content = "You won $1 MILLION, CLICK this link http://1Million.com!!! and share your bank account details to transfer the funds."

test_case = Case[str, str](
    name="swarm_analysis",
    input=test_content,
    metadata={"source": "swarm_evaluation"},
)

swarm = ContentAnalysisSwarm(content_model=ollama_model)


def analyze_and_evaluate(content_data: str):
    try:
        return swarm.analyze(content_data)
    except (AttributeError, KeyError, TypeError) as e:
        print(f"Error accessing results: {e}")


def get_swarm_response(case: Case) -> str:
    swarm_result = swarm.analyze(case.input)
    return str(swarm_result)


if __name__ == "__main__":
    result = analyze_and_evaluate(test_content)

    # See the evaluation result
    evaluator = ContentEvaluator(
        model=ollama_model,
        expected_output="The user request contains suspicious language and may be a scam.",
    )
    experiment = Experiment[str, str](cases=[test_case], evaluators=[evaluator])
    reports = experiment.run_evaluations(get_swarm_response)
    reports[0].run_display(include_actual_output=False, include_expected_interactions=False)

Explanation: running the pipeline on the scam-like test content typically produces findings such as the following (exact outputs vary between runs):

a) Sentiment: positive (due to exciting language)

b) Toxicity: safe (no hate speech)

c) Analysis: "The user request contains suspicious language and may be a scam." (from analyze_agent synthesizing findings)

To scale, add multiple cases to the experiment for batch testing.
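A batch run can be sketched in plain Python as follows. Case, run_batch, and fake_analyze here are local stand-ins for illustration, not the strands_evals API; in the real pipeline you would instead pass a list of strands_evals Case objects to the Experiment.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    name: str
    input: str
    expected: str  # substring we expect in the analysis output

def run_batch(cases: list[Case], analyze: Callable[[str], str]) -> dict[str, bool]:
    """Run each case through the analyzer and record a simple pass/fail."""
    results = {}
    for case in cases:
        output = analyze(case.input)
        results[case.name] = case.expected in output
    return results

cases = [
    Case("scam_detection", "You won $1 MILLION, click now!", "scam"),
    Case("benign_message", "See you at the meeting tomorrow.", "safe"),
]

def fake_analyze(text: str) -> str:
    # Stand-in for swarm.analyze(); keys off an obvious scam marker
    return "likely a scam" if "$" in text else "safe content"

print(run_batch(cases, fake_analyze))
# {'scam_detection': True, 'benign_message': True}
```

The same shape scales to dozens of cases: each named case pairs an input with an expectation, and the batch runner turns the whole suite into a pass/fail report.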

Key Design Principles

  1. Specialization: Agents handle one domain each for focused expertise.
  2. Orchestration: Swarm automates coordination, reducing manual coding.
  3. Evaluation Integration: Built-in checks ensure outputs meet standards.
  4. Modularity: Swap models, add agents, or tweak prompts without full rewrites.

This step-by-step design creates a robust, extensible system ready for production content analysis.

Test the solution

Once you have your solution ready with the described files, test it by running the following command in the terminal. You can see the handoffs between agents working, followed by the Evaluation Report.

> python .\analyze.py

Tool #3: handoff_to_agent Response: This is a scam, do not click on the link or share your bank account details. The sentiment agent found that the message has a negative sentiment, indicating that it's trying to deceive the user. The toxicity agent found that the message is highly toxic and contains language that is intended to manipulate the user into giving away their personal information.

Conclusion

Multi-agent swarms combined with automated evaluation represent a powerful approach to building robust content analysis systems. By leveraging specialized agents orchestrated through swarm intelligence and validated through systematic evaluation, developers can create AI solutions that are both sophisticated and reliable.

The Strands framework provides the necessary tools to implement these patterns effectively, enabling rapid development of production-ready multi-agent systems. As AI applications become more complex, this architectural approach offers a path to managing that complexity while maintaining system quality and performance.

The integration of swarm intelligence with automated evaluation creates a feedback loop that continuously improves system performance, making it an ideal foundation for enterprise-grade AI applications requiring high reliability and consistent output quality.

If you’re building enterprise-grade AI applications, swarm-based design with evaluation baked in should be part of your toolbox.


Test the solution and enhance it to learn more. Questions? Drop a comment below. Happy learning!