Our cloud documentation is almost always out of date. It's not because we're lazy; it's because the cloud moves too fast. A diagram drawn in a sprint planning meeting is obsolete by the time the code hits production. This documentation crisis, which every engineering team faces, is a massive and invisible tax. Nobody talks about it, but we all know that manual updates are expensive, error-prone, and out of date exactly when you need them most. The "cost" isn't just the 2-3 days of senior engineer time every quarter; it's the production incidents that could have been prevented, the security vulnerabilities you didn't know existed, and the new hires who take weeks to understand the system.

I was tired of this cycle. So I built a solution that uses AI agents to automatically scan live AWS environments and generate accurate, multi-audience documentation in minutes—not days. Here's how it works, what I learned, and why this approach unlocks something bigger than just better diagrams.

The Problem

Why Everything We've Tried Has Failed

The Solution

AI Agents That Understand Infrastructure

What we actually needed was a system that can perceive infrastructure like a scanner, understand it like a senior architect, and explain it like a technical writer, automatically. To achieve this, I created a "crew" of specialized AI agents, each with a specific job, just like a real engineering team.

Think of it like this:

- A **Scanner** gathers the raw facts from the live AWS environment.
- An **Analyst** interprets those facts like a senior architect, mapping how resources relate.
- A **Draftsman** renders the architecture as a diagram.
- **Writers** turn the same findings into documents tailored to each audience.

All working in parallel, all generating outputs from the same live data, all in minutes.
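To make that concrete, here is a minimal sketch of how one such specialist could be defined with CrewAI, the framework this project is built on. The role, goal, and backstory strings are illustrative, not the exact ones from the repository:

```python
# Minimal sketch: one specialist agent defined with CrewAI.
# The role/goal/backstory text is illustrative, not the repo's exact prompts.
from crewai import Agent

analyst = Agent(
    role="Cloud Architect Analyst",
    goal="Explain how the discovered AWS resources relate to one another",
    backstory=(
        "A senior architect who turns a raw resource inventory into a "
        "coherent picture of the environment."
    ),
)
```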

The Transformation

Before vs. After

| Aspect | Before (Manual Process) | After (Automated with AI Agents) |
|---|---|---|
| ⏱️ Time | 2-3 days per environment | 5-10 minutes per environment |
| 👤 Who | Senior engineer (expensive) | Anyone with AWS access |
| 📄 Output | One diagram, maybe a doc | Diagram + 4 tailored documents |
| 🔄 Update Frequency | Quarterly if you're lucky | On-demand or automated (CI/CD) |
| 🎯 Accuracy | Outdated within weeks | Always reflects current state |
| 😰 Stress Level | High (always out of date) | Low (always accurate) |

Quick Start

The entire system is open source. You can have it running in 5 minutes:

```bash
# 1. Install the package
git clone https://github.com/kirPoNik/aws-architecture-diagrams-with-crewai.git
cd aws-architecture-diagrams-with-crewai
pip install -e .

# 2. Run it (that's it!)
aws-diagram-generator \
  --name "Production" \
  --region us-east-1 \
  --tags "Environment=prod" "App=myapp"

# 3. Check your output/ directory for complete documentation
```

Prerequisites:

- Python 3 with pip
- AWS credentials with read access to the target account
- AWS Config enabled in the target region (it drives resource discovery)
- Access to an LLM (e.g., Claude via Amazon Bedrock)

In under 10 minutes, you'll have:

- An architecture diagram of your live environment
- A technical infrastructure runbook
- An executive summary for leadership
- A developer onboarding guide

How It Actually Works

Three Key Innovations:

The Architecture

How It All Fits Together
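The flow can be sketched as a pipeline of CrewAI tasks: scan, analyze, draw, write. The wiring below is a simplified sketch with assumed agent and task definitions, not the repository's exact code:

```python
# Simplified sketch of the pipeline wiring with CrewAI.
# Agent and task definitions are assumptions, not the repo's exact code.
from crewai import Agent, Task, Crew, Process

scanner = Agent(
    role="Infrastructure Scanner",
    goal="Collect an accurate inventory of live AWS resources",
    backstory="Queries AWS APIs and returns raw resource data.",
)
analyst = Agent(
    role="Cloud Architect Analyst",
    goal="Map how the discovered resources relate to each other",
    backstory="Turns a raw inventory into an architecture model.",
)
draftsman = Agent(
    role="Diagram Draftsman",
    goal="Render the architecture model as a PlantUML diagram",
    backstory="Fluent in PlantUML and the AWS icon library.",
)
writer = Agent(
    role="Technical Writer",
    goal="Produce audience-specific documents from the same model",
    backstory="Writes runbooks, executive summaries, and onboarding guides.",
)

tasks = [
    Task(description="Scan the target AWS account and region.",
         expected_output="A JSON inventory of resources", agent=scanner),
    Task(description="Infer the architecture from the inventory.",
         expected_output="A structured architecture model", agent=analyst),
    Task(description="Generate a PlantUML diagram of the model.",
         expected_output="A valid @startuml ... @enduml block", agent=draftsman),
    Task(description="Write the tailored documents.",
         expected_output="Markdown documentation", agent=writer),
]

crew = Crew(
    agents=[scanner, analyst, draftsman, writer],
    tasks=tasks,
    process=Process.sequential,  # writer tasks could also fan out in parallel
)
result = crew.kickoff()
```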

What You Actually Get

Here's what the final markdown file can look like:

# AWS Architecture Documentation: Production Environment

## Table of Contents
1. Architecture Diagram
2. Technical Infrastructure Runbook
3. Executive Summary for Leadership
4. Developer Onboarding Guide

## Architecture Diagram
@startuml
!include <awslib/AWSCommon>
!include <awslib/Compute/EC2>
!include <awslib/Database/RDS>
!include <awslib/NetworkingContentDelivery/ElasticLoadBalancing>

rectangle "VPC: vpc-12345 (10.0.0.0/16)" {
  rectangle "Public Subnet: subnet-abc" {
    ElasticLoadBalancing(alb, "Application LB", "")
  }
  rectangle "Private Subnet: subnet-def" {
    EC2(web1, "Web Server 1", "t3.medium")
    EC2(web2, "Web Server 2", "t3.medium")
  }
  rectangle "DB Subnet: subnet-ghi" {
    RDS(db, "PostgreSQL", "db.t3.large")
  }
}

alb --> web1
alb --> web2
web1 --> db
web2 --> db
@enduml

## Technical Infrastructure Runbook

### Compute Resources
**EC2 Instance: i-0abc123** (Web Server 1)
- Instance Type: t3.medium
- Private IP: 10.0.1.10
- Security Groups: sg-web123 (allows 80/443 from ALB)
- IAM Role: web-server-role
- Tags: Environment=production, Tier=web

[... detailed configs for every resource ...]

## Executive Summary
This production environment hosts our customer-facing web application using a
highly available, three-tier architecture. The system consists of:

- **Web Tier:** Redundant web servers behind a load balancer for high availability
- **Database Tier:** Managed PostgreSQL database with automated backups
- **Security:** Private subnets, restricted security groups, encrypted data

The architecture supports approximately 10,000 daily users with 99.9% uptime...

## Developer Onboarding Guide
### Quick Start
**Application URL:** <https://my-app-prod-123.us-east-1.elb.amazonaws.com>

**Database Connection:**
```bash
Host: mydb.cluster-abc.us-east-1.rds.amazonaws.com
Port: 5432
Database: production_db
User: app_user
```

**Environment Variables:**
[... practical connection details ...]
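The diagram section is plain PlantUML text, so any PlantUML toolchain can render it. As one example, here is a sketch using the third-party python-plantuml package against the public PlantUML server; the package choice and the output path are assumptions, not part of the project:

```python
# Sketch: render generated PlantUML text to PNG with the third-party
# `plantuml` package (pip install plantuml) and the public PlantUML server.
# The input path is hypothetical.
from plantuml import PlantUML

server = PlantUML(url="http://www.plantuml.com/plantuml/img/")

with open("output/diagram.puml") as f:  # hypothetical output file
    diagram_text = f.read()

png_bytes = server.processes(diagram_text)  # returns raw PNG data
with open("diagram.png", "wb") as f:
    f.write(png_bytes)
```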

💭 Final Thoughts and Next Steps

This approach is powerful, but it's not magic. Here are the real-world considerations:

  1. Dependency: The AWS Config discovery method is robust, but it relies on AWS Config being enabled and configured to record all the resource types you care about (see the sketch after this list).
  2. Cost: The system makes heavy use of a powerful LLM (such as Claude 3.5 Sonnet or GPT-4). Running it on-demand is fine, but running it every 10 minutes against a massive environment could get expensive.
  3. API Rate Limits: Amazon Bedrock enforces strict rate limits, especially on Anthropic models (as low as 1-2 requests per minute). To work around this, the system invokes the models through an inference profile; note that AWS also requires a use-case submission before granting access to Anthropic models.
  4. Non-Determinism: LLMs are non-deterministic. The Analyst might occasionally misinterpret a relationship, or the Draftsman might make a syntax error. This requires ongoing prompt refinement and testing.
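To make points 1 and 3 concrete, here is a minimal boto3 sketch of both pieces: the kind of AWS Config advanced query the discovery step depends on, and a Bedrock call routed through an inference profile. The resource types, region, and model ID are illustrative assumptions:

```python
# Minimal boto3 sketch; resource types, region, and model ID are
# illustrative assumptions, not the project's exact configuration.
import boto3

# Point 1: discovery depends on AWS Config. An advanced query returns
# every *recorded* resource matching the expression, so unrecorded
# resource types are invisible to the scanner.
config = boto3.client("config", region_name="us-east-1")
resp = config.select_resource_config(
    Expression=(
        "SELECT resourceId, resourceType, configuration "
        "WHERE resourceType IN ('AWS::EC2::Instance', "
        "'AWS::RDS::DBInstance', "
        "'AWS::ElasticLoadBalancingV2::LoadBalancer')"
    )
)
for result in resp["Results"]:
    print(result)  # each result is a JSON string describing one resource

# Point 3: calling Claude through a cross-region inference profile
# (the "us." prefix) rather than the base model ID.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
reply = bedrock.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize this VPC."}]}],
)
print(reply["output"]["message"]["content"][0]["text"])
```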

Once you have AI agents that can perceive and understand your infrastructure, you unlock an entire category of use cases beyond documentation.

📚 Resources