The Problem: Low Actionable Feedback Rates
Here's a common pattern in e-commerce feedback (and these are realistic for most websites, not edge cases):
Launch product → 10,000 customers buy → 10–20% leave reviews → 20–30% of those contain actionable insights.
For example, apparel tends to sit toward the higher end of review volume, consumer electronics toward the lower end, and SaaS higher on volume but uneven on quality.
Product decisions often rely on feedback from roughly 2–6% of the shopper base. The silent 94–98% remain largely unknown. Writing reviews requires effort, which naturally filters out customers without strong opinions.
Why Traditional Feedback Struggles
What businesses see:
- Selection bias: The ecstatic (5-star) and frustrated (1-star) customers are most likely to write reviews. The nuanced middle ("Good, but the battery could be better") usually stays silent.
- Low-signal reviews: Many reviews lack specificity—"Great product!" or "Terrible quality"—without context that helps inform product decisions.
- Star ratings without context: A 3-star review could mean "Mediocre product" or "Great product with one fixable flaw." Understanding the difference requires manually reading each review.
- Delayed pattern detection: By the time a pattern becomes obvious in reviews, thousands—or tens of thousands—more units may already be in customers' hands with the same issue.
- Limited follow-up: "Battery life is poor" raises questions it can't answer. How long does it last? What were you doing? What did you expect? Written feedback rarely allows deeper probing unless the user chooses to elaborate.
What users experience:
- High cognitive load: Writing a coherent review takes 3–6 minutes (less for simple products, more for complex ones). Most customers won't invest this time unless they feel strongly about the product.
- Generic forms: The same questions are asked for wireless headphones and yoga mats, with no acknowledgment of purchase history or preferences.
- Uncertain impact: Submit feedback → Hear nothing → Product may not improve → Reduced motivation to provide future feedback.
The Solution: Intelligent MCQ Interviews
What if we could extract high-signal feedback effortlessly—driving product improvement, building platform trust, and making it easy for users to shape better products?
Instead of asking users to write reviews, we interview them with hyper-personalized MCQs. Not generic forms. Not hardcoded decision trees. Adaptive, context-aware conversations where each question is shaped by:
- Product intelligence: What are the known pain points for this specific product? (Analyzed from existing reviews)
- User intelligence: What's their purchase history? How do they typically review? (Behavioral profiling)
- Peer intelligence: What did similar buyers complain about? (Vector similarity across cohorts)
The core insight: MCQs collapse cognitive load while preserving signal. One-click responses can generate the same depth as written reviews—if the questions are smart enough.
The potential value:
- Vendors: More diverse feedback through guided interviews, faster pattern detection for product iteration
- Platforms: Increased engagement through lower-friction feedback collection, richer review data
- Users: 2-minute MCQ interactions replace lengthy text writing, direct input into product improvements
Introducing Survey Sensei—a multi-agent system that implements this approach with four specialized agents:
- ProductContextAgent → Analyzes reviews/metadata to understand product strengths and weaknesses
- CustomerContextAgent → Builds behavioral profile from purchase history and review patterns
- SurveyAgent (adaptive MCQ engine) → Generates tailored questions, reshapes follow-ups based on each answer
- ReviewGenAgent → Synthesizes MCQ selections into natural language reviews in the user's voice
The full workflow (end-to-end example):
Step 1 (Product Intelligence): Analyze 213 reviews of the purchased laptop
→ Battery: 67% complain "dies mid-afternoon"
→ Keyboard: 82% praise "excellent typing experience"
→ Performance: 45% mention "handles multitasking well"
Step 2 (User Intelligence): Pull purchase history for this customer
→ Bought 3 laptops in past 2 years (power user pattern)
→ Reviews 85% of purchases, critical but fair (3.6★ average)
→ Detail-oriented: past reviews averaged 120 words
Step 3 (Adaptive Question 1):
"You've purchased 3 laptops recently. What drove this upgrade?"
○ Better performance
● Longer battery life ← USER SELECTED
○ Lighter/more portable
○ Other: [text]
Step 4 (Adaptive Follow-up):
"How long does the battery last on a typical workday?"
○ All day (12+ hours)
● 4-8 hours ← USER SELECTED
○ Less than 4 hours
Step 5 (Probing Deeper):
"Does this meet your battery expectations?"
○ Exceeds expectations
● Falls slightly short ← USER SELECTED
○ Major disappointment
[Agent continues for 10-12 total questions, probing keyboard quality,
performance, portability based on this user's priorities...]
Step 6 (Review Synthesis): Convert MCQ selections → natural language
"Upgraded hoping for better battery. Lasts 4-8 hours—falls short
of all-day claims, but manageable for office work. Keyboard is
outstanding for typing. Performance handles multitasking well."
Contrast with generic tools:
- Generic survey: "Rate this product 1-5. Any comments?" (No context, high friction)
- Survey Sensei: Adaptive interview that knows you're a repeat buyer with battery concerns and generates 10-12 contextual MCQs to extract nuanced feedback in 2-3 minutes.
Why Now? What Changed to Make This Possible
The convergence of cheaper, more intelligent models with rapidly declining token costs has made AI-powered personalization economically viable at scale.
2020 (GPT-3 era):
- GPT-3: Inconsistent outputs, required extensive prompt engineering
- Embeddings: Research tools without production infrastructure
- Stateful agents: Custom state machines for each workflow
2025 (GPT-4/GPT-5 era and evolving):
- GPT-4o-mini: More reliable structured outputs at lower cost (used in this prototype)
- Vector databases: pgvector, Pinecone, Weaviate with mature ecosystems
- Agent frameworks: LangGraph handles state management and orchestration
What This Enables
1. Per-user personalization:
- Questions adapt to purchase history and behavioral patterns
- Depth and complexity match user engagement level
- Topic selection reflects individual product concerns
2. Adaptive vs. static workflows:
- Traditional: Fixed question sequences regardless of responses
- AI-powered: Follow-up questions probe deeper based on answers
3. Natural language synthesis:
- Traditional templates: Generic phrasing, obvious patterns
- AI synthesis: Contextual details, varied expression
4. Economic accessibility:
- Earlier LLM costs limited surveys to high-value scenarios
- Current pricing makes this solution viable
- Lower costs enable experimentation and iteration
Two-Part Architecture
Before diving into the details, it's critical to understand how the project is structured. The diagram below shows the complete system architecture—from the UI layer through the orchestrator to the multi-agent framework, along with data pipelines and database schema:
Part 1: Simulation Infrastructure (Testing Layer)
Purpose: Development scaffolding—test the core system without production data.
MockDataOrchestrator creates semi-realistic e-commerce ecosystems:
- Products: RapidAPI fetch (real Amazon data) + 5 similar products (LLM) + 3 diverse products (LLM)
- Users: Main user (you) + N mock personas
- Reviews: RapidAPI reviews (real) + LLM-generated reviews
- Transactions: 40-60% have reviews (matches reality)
- Embeddings: Batch parallel processing
In production: Skip this entirely and integrate with real e-commerce databases and pipelines.
Part 2: Agentic Survey Framework (Core USP)
Purpose: The actual product—adaptive survey generation + authentic review synthesis.
This is the heart of Survey Sensei. The system decomposes into four specialized agents, each with a focused responsibility:
Agent 1: ProductContextAgent
Build a mental model of the product before generating questions.
Three-path adaptive logic:
- Direct Reviews Path (Confidence: 70-95%)
  - Condition: Product already has reviews in database
  - Ranking heuristic: Recency (50% weight, exponential decay with 180-day half-life) + Quality (40%, review length) + Diversity (10%, bonus for 3-4 star reviews) (sketched in code below)
  - Confidence formula: 0.70 + (num_reviews / 100), capped at 0.95
  - Extracts: Key features, pain points, use cases, pros/cons, sentiment patterns
- Similar Products Path (Confidence: 55-80%)
  - Condition: No reviews for this product, but vector-similar products exist
  - Process: Cosine similarity search via pgvector (threshold: 0.7)
  - Ranking heuristic: Similarity (40%) + Recency (35%) + Quality (20%) + Diversity (5%)
  - Extracts: Inferred experience from analogous products
- Generic/Description Only Path (Confidence: 40-50%)
  - Condition: New product, zero reviews anywhere
  - Process: Parse title, description, category metadata
  - Extracts: Educated guesses (e.g., "Wireless device → ask about battery")
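To make the heuristic concrete, here's a minimal sketch of the Direct Reviews ranking and confidence logic, assuming plausible field names (created_at, review_text, review_stars) rather than the exact codebase internals:

from datetime import datetime, timezone

HALF_LIFE_DAYS = 180

def rank_reviews(reviews: list[dict], max_length: int = 1000) -> list[dict]:
    """Score each review: recency (50%) + quality/length (40%) + diversity (10%)."""
    now = datetime.now(timezone.utc)
    for r in reviews:
        age_days = (now - r["created_at"]).days
        recency = 0.5 ** (age_days / HALF_LIFE_DAYS)             # exponential decay, 180-day half-life
        quality = min(len(r["review_text"]) / max_length, 1.0)   # longer reviews carry more signal
        diversity = 1.0 if r["review_stars"] in (3, 4) else 0.0  # bonus for the nuanced middle
        r["rank_score"] = 0.50 * recency + 0.40 * quality + 0.10 * diversity
    return sorted(reviews, key=lambda r: r["rank_score"], reverse=True)

def direct_reviews_confidence(num_reviews: int) -> float:
    """Direct Reviews path: 0.70 + (num_reviews / 100), capped at 0.95."""
    return min(0.70 + num_reviews / 100, 0.95)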
Output schema:
class ProductContext:
    key_features: List[str]
    major_concerns: List[str]
    pros: List[str]
    cons: List[str]
    common_use_cases: List[str]
    context_type: str
    confidence_score: float
Agent 2: CustomerContextAgent
Build behavioral profiles to personalize question depth and tone.
Three-path adaptive logic:
- Exact Interaction Path (Confidence: 85-95%)
  - User bought THIS exact product before
  - Ground truth on what they thought
- Similar Products Path (Confidence: 55-80%)
  - Ranking heuristic: Similarity (45%) + Recency (30%) + Engagement (25%)
  - Infer preferences from purchase patterns (e.g., "Bought 3 noise-canceling headphones → cares about ANC quality")
- Demographics Path (Confidence: 35-45%)
  - Brand new user, zero purchase history
  - Generic baseline persona
Output schema:
class CustomerContext:
    purchase_patterns: List[str]
    review_behavior: List[str]
    product_preferences: List[str]
    primary_concerns: List[str]
    expectations: List[str]
    pain_points: List[str]
    engagement_level: str        # highly_engaged | moderately_engaged | passive_buyer | new_user
    sentiment_tendency: str      # positive | critical | balanced | polarized | neutral
    review_engagement_rate: float
    confidence_score: float
Personalization:
- Critical + highly engaged → Deep technical MCQs
- Passive buyer → Simple MCQs
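As a rough illustration of how these fields could steer question generation (the mapping below is a sketch, not the production prompt logic):

def question_style(engagement_level: str, sentiment_tendency: str) -> dict:
    """Map a CustomerContext profile to survey depth (illustrative thresholds)."""
    if engagement_level == "highly_engaged" and sentiment_tendency == "critical":
        return {"depth": "deep_technical", "target_questions": 15, "follow_up_bias": 0.8}
    if engagement_level == "passive_buyer":
        return {"depth": "simple", "target_questions": 10, "follow_up_bias": 0.3}
    return {"depth": "balanced", "target_questions": 12, "follow_up_bias": 0.5}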
Agent 3: SurveyAgent (Stateful)
Conducts adaptive surveys where questions evolve based on answers. Uses LangGraph StateGraph for conversation state.
Performance optimization: Survey state is cached in-memory during the survey (no database writes on every answer). State is only persisted to the database at two points:
- Survey start: Initial contexts frozen to the product_context and customer_context JSONB columns
- Survey completion: Final Q&A written to questions_and_answers JSONB, complete state to session_context JSONB
All intermediate answers are logged asynchronously to survey_details table for analytics (fire-and-forget, non-blocking).
The interview flow:
┌─ Survey Start ─────────────────────────────────────────────┐
│ │
│ 1. Fetch contexts in parallel: │
│ ├─ ProductContextAgent → What to ask about │
│ └─ CustomerContextAgent → How to ask it │
│ │
│ 2. Generate initial MCQs (3 questions baseline) │
│ │
│ 3. Stateful conversation loop: │
│ ┌───────────────────────────────────────────┐ │
│ │ Present MCQ │ │
│ │ ↓ │ │
│ │ Wait for user selection │ │
│ │ ↓ │ │
│ │ Process answer → Update internal state │ │
│ │ ↓ │ │
│ │ Route decision: │ │
│ │ ├─ Need follow-up? → Generate adaptive │ │
│ │ ├─ Move to next topic? → Next question │ │
│ │ └─ Survey complete? → Save & exit │ │
│ └───────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
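The loop above maps naturally onto a LangGraph StateGraph. A condensed, hypothetical sketch follows; node names, state fields, and the placeholder heuristics are illustrative, not the actual Survey Sensei graph:

from typing import List, TypedDict
from langgraph.graph import END, StateGraph

class SurveyState(TypedDict):
    answers: List[dict]      # accumulated MCQ selections
    follow_up_needed: bool   # set by the answer processor
    complete: bool

def present_question(state: SurveyState) -> dict:
    # In the real system this generates/serves an MCQ and waits for a selection;
    # here it just records a placeholder answer.
    return {"answers": state["answers"] + [{"q": "stub", "a": "stub"}]}

def process_answer(state: SurveyState) -> dict:
    # Decide whether the latest answer warrants an adaptive follow-up (placeholder heuristic).
    return {
        "follow_up_needed": len(state["answers"]) % 2 == 1,
        "complete": len(state["answers"]) >= 10,
    }

def route(state: SurveyState) -> str:
    if state["complete"]:
        return "finish"
    return "follow_up" if state["follow_up_needed"] else "next_topic"

graph = StateGraph(SurveyState)
graph.add_node("present", present_question)
graph.add_node("process", process_answer)
graph.set_entry_point("present")
graph.add_edge("present", "process")
graph.add_conditional_edges("process", route, {
    "follow_up": "present",   # regenerate an adaptive follow-up
    "next_topic": "present",  # advance to the next baseline topic
    "finish": END,            # persist final state and exit
})
survey_app = graph.compile()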
Survey completion rules:
initial_questions_count: 3 # Start with 3 baseline MCQs
min_answered_questions: 10 # User must answer ≥10
max_answered_questions: 15 # Hard stop at 15
max_survey_questions: 20 # Total questions asked
max_consecutive_skips: 3 # 3 consecutive skips → must answer to continue
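Read as code, these rules amount to a few guard checks (a sketch; the actual enforcement lives in the SurveyAgent's routing logic):

def can_complete(answered: int) -> bool:
    """The survey may finish once at least 10 questions are answered."""
    return answered >= 10

def must_stop(answered: int, asked: int) -> bool:
    """Hard stops: 15 answered questions or 20 questions asked in total."""
    return answered >= 15 or asked >= 20

def skip_allowed(consecutive_skips: int) -> bool:
    """After 3 consecutive skips, the next question must be answered to continue."""
    return consecutive_skips < 3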
Adaptive questioning example:
Question 5: "How long does the battery last on a typical workday?"
● Less than 4 hours ← USER SELECTED
[Agent's internal state update:
- Battery performance: Below average
- Action: Generate follow-up to quantify impact]
Follow-up Question 6: "When does the battery typically die?"
● Mid-afternoon (2-4pm) ← USER SELECTED
[Agent's internal state update:
- Specific pain point: Dies at 2-4pm (work hours)
- Severity: High (impacts productivity)
- Action: Probe importance for review weighting]
Follow-up Question 7: "How important is longer battery life to you?"
● Very important - major inconvenience (USER SELECTED)
Why adaptive matters: Without adaptive AI, you'd build a rigid decision tree: "Battery life: Excellent | Good | Fair | Poor". That tells you what they think—but misses the critical detail: "Dies at 2pm during work hours, and it's a major inconvenience." The AI doesn't just branch—it regenerates the next question based on evolving context.
Agent 4: ReviewGenAgent
Convert MCQ selections into natural language reviews matching user's writing style.
Three-stage synthesis:
- Sentiment Classification → Analyze MCQ answers → Classify as good | okay | bad
- Voice Matching → Fetch historical reviews → Extract tone, vocabulary, sentence structure
- Generate 3 variations with different star ratings within the sentiment band:
def _get_star_ratings(sentiment_band: str) -> List[int]:
    if sentiment_band == "good":
        return [5, 4]
    elif sentiment_band == "okay":
        return [4, 3, 2]
    else:  # bad
        return [2, 1]
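The voice-matching and generation stages then reduce to prompt assembly. A hypothetical sketch (prompt wording and field names are assumptions, not the production prompts):

def build_synthesis_prompt(qa_pairs: list[dict], past_reviews: list[str],
                           sentiment_band: str, stars: int) -> str:
    """Combine MCQ answers with the user's past reviews to match their voice."""
    answers = "\n".join(f"- {qa['question']}: {qa['answer']}" for qa in qa_pairs)
    voice_samples = "\n---\n".join(past_reviews[:3])  # a few samples are enough for tone
    return (
        f"Write a {stars}-star product review with an overall '{sentiment_band}' sentiment.\n"
        f"Base it only on these survey answers:\n{answers}\n\n"
        f"Match the tone, vocabulary, and sentence length of these past reviews "
        f"by the same user:\n{voice_samples}"
    )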
Example output (sentiment: "okay", user: concise + critical):
[4-star] "Solid build quality and excellent screen. Battery dies around 3pm—acceptable for office use where I have charging access. Keyboard is comfortable for long typing. Performance handles multitasking well. Worth it on sale."
[3-star] "Mixed feelings. Build quality and screen are great, but battery is the main letdown—dies at 3pm despite 'all-day' claims. Keyboard is excellent. If battery isn't a dealbreaker, it's decent."
[2-star] "Disappointed with battery life. Product page advertised all-day battery, but it dies by 3pm daily with moderate use. Screen and keyboard are good, but battery is a major problem for anyone working away from chargers."
User picks framing, edits if needed, submits. 2 minutes of MCQ clicks → rich, authentic review.
Technical Implementation
Tech Stack
Backend: FastAPI (Python 3.11), LangChain + LangGraph, OpenAI GPT-4o-mini ($0.15/1M input tokens), Pydantic, Supabase/PostgreSQL + pgvector
Frontend: Next.js 14, TypeScript, Tailwind CSS, Supabase Client
AI/ML: OpenAI embeddings (1536-dim), batch generation (100 texts in 2-3s), IVFFlat indexes (2-3% recall loss for 100x speed)
Database Schema
-- 1. PRODUCTS: Catalog with semantic embeddings
products (
item_id VARCHAR(20) PRIMARY KEY,
title, brand, description,
price, star_rating, num_ratings,
review_count INTEGER,
embeddings vector(1536), -- Semantic search
is_mock BOOLEAN
)
-- 2. USERS: Behavioral profiles
users (
user_id UUID PRIMARY KEY,
user_name, email_id, age, gender, base_location,
embeddings vector(1536),
total_purchases INTEGER,
total_reviews INTEGER,
review_engagement_rate DECIMAL(4,3),
avg_review_rating DECIMAL(3,2),
sentiment_tendency VARCHAR(20),
engagement_level VARCHAR(30),
is_main_user BOOLEAN
)
-- 3. TRANSACTIONS: Purchase history
transactions (
transaction_id UUID PRIMARY KEY,
item_id → products,
user_id → users,
order_date, delivery_date,
original_price, retail_price,
transaction_status
)
-- 4. REVIEWS: Multi-source feedback
reviews (
review_id UUID PRIMARY KEY,
item_id → products,
user_id → users,
transaction_id → transactions,
review_title, review_text, review_stars,
source VARCHAR(20), -- 'rapidapi' | 'agent_generated' | 'user_survey'
embeddings vector(1536)
)
-- 5. SURVEY_SESSIONS: Stateful survey orchestration
survey_sessions (
session_id UUID PRIMARY KEY,
user_id, item_id, transaction_id,
product_context JSONB, -- Agent 1 output
customer_context JSONB, -- Agent 2 output
session_context JSONB, -- LangGraph state
questions_and_answers JSONB,
review_options JSONB,
status VARCHAR(20)
)
-- 6. SURVEY_DETAILS: Event log
survey_details (
detail_id UUID PRIMARY KEY,
session_id → survey_sessions,
event_type VARCHAR(50),
event_detail JSONB,
created_at TIMESTAMP
)
Design decisions:
- JSONB for flexibility → Agent outputs evolve without migrations
- Vector indexes → IVFFlat gives 100x speed for 2-3% recall loss
- Source tracking → rapidapi (real) | agent_generated (mock) | user_survey (golden path)
- Event sourcing → survey_details logs every interaction for debugging
Vector Similarity
All text → 1536-dim embeddings via text-embedding-3-small.
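Batch generation is a single call to the OpenAI embeddings endpoint; a minimal sketch (the helper name is illustrative, not from the codebase):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed up to ~100 texts in one request; each result is a 1536-dim vector."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]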
Find similar products:
SELECT item_id, title,
1 - (embeddings <=> query_embedding) AS similarity
FROM products
WHERE 1 - (embeddings <=> query_embedding) > 0.7
ORDER BY similarity DESC LIMIT 5;
Why vectors beat traditional categories:
Traditional hierarchies (Electronics → Audio → Headphones → Wireless) are rigid. Vector embeddings cluster products by intent and use case:
- Category-based: "Wireless headphones" → Returns ALL wireless headphones
- Vector-based: "Premium noise-canceling headphones for travel" → Returns semantically similar products solving the same problem (Bose QC45, AirPods Max, Sennheiser Momentum 4)
Vector embeddings naturally cluster by intent rather than superficial attributes. "Noise-canceling Bluetooth headphones" is closer to "wireless earbuds with ANC" than to "studio monitor headphones"—even though all three are technically "headphones."
Performance benchmarks:
- Vector search (IVFFlat): 10,000 products in ~50ms
- Batch embeddings: 100 texts in 2-3 seconds
- Brute-force: 10,000 products in ~5 seconds (100× slower)
API Design
The API provides six endpoints that cover the end-to-end survey workflow.
Error handling and edge cases:
1. Session expiration:
- Survey sessions expire after 24 hours of inactivity
- Prevents abandoned surveys from cluttering the database
- User can't submit answers to expired sessions (returns HTTP 410 Gone)
2. Idempotency:
- Answering the same question twice → Updates the answer (no duplicates)
- Submitting the same review twice → Ignored (review already posted)
- Prevents accidental double-submissions from network retries
3. Pydantic validation:
- All API requests/responses validated with Pydantic schemas
- Fail fast: Invalid data rejected at API boundary (before hitting agents)
- Example: answer must be one of the provided options, not arbitrary text
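A hypothetical Pydantic model illustrating the fail-fast pattern (field names and the cross-check are assumptions, not the actual request schema):

from uuid import UUID
from pydantic import BaseModel, model_validator

class AnswerSubmission(BaseModel):
    session_id: UUID
    question_id: str
    answer: str
    options: list[str]   # the MCQ options that were presented

    @model_validator(mode="after")
    def answer_must_be_an_option(self):
        if self.answer not in self.options:
            raise ValueError("answer must be one of the provided options")
        return self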
Running It Locally
Source: github.com/arnavvj/survey-sensei
Prerequisites: Python 3.11+, Node.js 18+, Supabase account (free), OpenAI API key (~$5)
Backend Setup
1. Clone the repo and create a Python environment:
git clone https://github.com/arnavvj/survey-sensei.git
cd survey-sensei/backend
conda env create -f environment.yml # Installs all deps (FastAPI, LangChain, etc.)
conda activate survey-sensei
2. Configure environment variables:
cp .env.local.example .env.local
Edit .env.local with your credentials:
OPENAI_API_KEY=sk-proj-... # From platform.openai.com
SUPABASE_URL=https://xxxxx.supabase.co # From Supabase dashboard
SUPABASE_SERVICE_ROLE_KEY=eyJhbGciOiJIUzI1NiIs... # From Supabase Settings → API
RAPID_API_KEY=your_rapidapi_key # Optional: From rapidapi.com
3. Initialize database:
python database/init/apply_migrations.py # Applies migrations
# Execute SQL code from `backend/database/_combined_migrations.sql` in your Supabase project
4. Start the backend:
uvicorn main:app --reload --port 8000
Frontend Setup
1. Navigate to frontend:
cd survey-sensei/frontend
2. Configure environment variables:
cp .env.local.example .env.local
Edit .env.local:
NEXT_PUBLIC_SUPABASE_URL=https://xxxxx.supabase.co
NEXT_PUBLIC_SUPABASE_ANON_KEY=eyJhbGciOiJIUzI1NiIs...
OPENAI_API_KEY=sk-proj-...
3. Install dependencies and start dev server:
npm install
npm run dev # Open http://localhost:3000
Testing the Flow
Step 1: Submit product and user information
Enter an Amazon product URL (must include ASIN) and generate mock data (takes 3-4 minutes).
The MockDataOrchestrator builds a realistic e-commerce simulation in your Supabase project. For example:
| Entity | Count | Composition | Purpose |
|---|---|---|---|
| Products | 11 | 1 real (RapidAPI) + 6 similar (LLM) + 4 diverse (LLM) | Market context for ProductContextAgent |
| Users | 13-25 | 1 main user + 12-24 mock personas (varied ages, locations, purchase patterns) | Behavioral diversity for CustomerContextAgent |
| Reviews | 30-100+ | 10-15 real (RapidAPI) + 20-85 LLM-generated (70% positive, 20% neutral, 10% negative) | Signal for ProductContextAgent analysis |
| Transactions | 80-170+ | Each review → 1 transaction; additional no-review purchases (40% sparsity); 1 "current" delivery (triggers survey) | Realistic purchase patterns |
| Embeddings | 200-300 | All entities → 1536-dim vectors (batch parallel via text-embedding-3-small) | Semantic similarity search |
Step 2: Launch the survey
Click "Start Survey" and wait 3-5 seconds.
What's happening behind the scenes:
- ProductContextAgent → Analyzes reviews with weighted ranking (e.g., recency 50%, quality 40%, diversity 10%)
- CustomerContextAgent → Profiles purchase behavior and review patterns (e.g., repeat buyer, high review rate, critical tone)
- SurveyAgent → Combines both contexts to generate initial personalized MCQs (e.g., a first set of 3 questions tailored to the user+product pair)
Step 3: Answer questions (MCQ-based)
Answer 10-12 adaptive MCQ questions. Each response triggers follow-up questions that probe deeper into your concerns (e.g., "battery life" → "how long does it last?" → "does this meet expectations?").
Step 4: Generate and submit review
ReviewGenAgent synthesizes your MCQ responses into 3 natural language review variations (different star ratings, same sentiment). Pick one, optionally edit, and submit.
Real-World Impact Simulations
Note: These are projections based on industry benchmarks and reasonable assumptions. Actual results will vary significantly based on implementation, industry vertical, and user behavior. The scenarios below illustrate potential impact, not guaranteed outcomes.
Scenario 1: Mid-Market E-Commerce Business ($8M revenue)
Baseline (Traditional Reviews):
- 50,000 monthly orders → 1,500 reviews → 300 actionable (0.6% of customers)
- Iteration lag: 4-6 weeks to detect patterns
- Returns cost: $200k/year
With Survey Sensei (projected):
- 50,000 orders → 7,500 surveys (15% completion target) → 5,250 actionable (potential 17.5× gain)
- Iteration speed: Potential to detect issues in Week 1-2
- Returns: 2.5% → 1.8% (assumes early issue detection reduces returns)
Potential financial impact:
- Returns reduction: $56k/year potential savings
- Increased repeat purchases: +4% repeat rate → $3.84M/year potential additional revenue
- Better conversion: +0.5% → $40k/year potential additional revenue
- Total potential annual benefit: $3.94M/year
ROI (assuming full impact realization):
- Annual costs: OpenAI API ($180/year) + Infrastructure ($1,440/year) = $1,620/year
- ROI calculation: ($3.94M - $1.6k) / $1.6k = 243,000%
- Even at 10% of projected impact, ROI > 24,000%
Scenario 2: Survey-as-a-Service for Small Businesses
Potential business model:
- Target: 250 small e-commerce businesses
- Pricing: $199/month per business
- Projected ARR: $597k/year
Projected unit economics:
- Revenue: $597k/year
- Estimated costs: $307k/year (API, infrastructure, support, sales)
- Potential gross profit: $290k/year (48.6% margin)
Customer acquisition (estimated):
- Projected CAC: $500/business
- Estimated LTV: $199 × 24 months = $4,776
- Target LTV:CAC ratio: 9.5:1 (healthy SaaS benchmark: 3:1)
From Ideation to Market Adoption: A Potential 4-Month Journey
Month 0: Current MVP State
- Single FastAPI + Supabase + synchronous LLM calls
- 4-agent system functional with adaptive MCQ generation
- Cost: $0.002/survey, handles 100-200 surveys/day
- Works with mock data; needs real-world validation
Month 1: Production Hardening + Initial Testing
Infrastructure improvements:
- Redis session state, background job queues (Celery/BullMQ)
- Rate limiting, retry logic, circuit breakers
- Observability: Prometheus + Grafana + Sentry
- Deploy: Cloud Run + Vercel + Supabase (~$120/month for 10k surveys/day)
Batch data pipelines:
- Setup batch context generation: Pre-compute ProductContext and CustomerContext nightly for all active products/users
- Implementation: Celery/Airflow jobs running nightly, results cached in Redis/PostgreSQL JSONB
- Benefits: Survey start latency drops from 3-5s to <500ms (contexts already pre-generated)
- Cost optimization: Generate contexts once daily vs. on-demand per survey (~10× API cost reduction at scale)
Early validation:
- Target: 5-10 small Shopify stores (500-2k monthly orders)
- Deployment: Standalone SaaS with manual CSV upload
- Goal: 1,000 real surveys → Measure completion rates, agent quality, user satisfaction
- Key unknowns: Real completion rates vs. mock data, agent consistency across categories
Month 2: Platform Integrations (If Early Metrics Look Promising)
Service layer architecture:
# Embedded API integration example (Flask-style webhook sketch)
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/webhooks/order_delivered', methods=['POST'])
def handle_order_delivered():
    order_data = request.get_json()  # payload from the e-commerce platform
    response = requests.post('https://api.surveysensei.io/v1/surveys/generate', json={
        'transaction_id': order_data['id'],
        'user_id': order_data['customer_id'],
        'product_id': order_data['product_id'],
        'user_context': {...},      # placeholder: platform-specific user details
        'product_context': {...}    # placeholder: platform-specific product details
    })
    survey_url = response.json()['survey_url']
    send_email(to=order_data['customer']['email'], body=survey_url)  # platform's own mailer
    return jsonify({'status': 'survey_sent'})
Initial connectors:
- Shopify app: OAuth + webhooks for order.completed events
- WooCommerce plugin: WordPress plugin triggering post-delivery surveys
- Zapier connector: No-code integration for quick wins
Early data patterns (if scale permits):
- 5-10k surveys → Begin identifying product category patterns
- Embeddings infrastructure: Initial indexing of products and users
- Reality check: Integration complexity often exceeds estimates; OAuth flows require extensive testing
Month 3: Analytics Layer + Scale Testing
Basic intelligence features:
- Admin dashboard: Completion rates, question distribution, sentiment trends
- Simple alerts: Spike detection for complaint patterns
- Search interface: Semantic grouping of similar feedback
Scale validation:
- Test batch processing: Pre-generate contexts nightly
- Target: 30-50k surveys/day infrastructure capacity
- Key validation: Does agent quality hold up at scale? Do costs stay predictable?
Market positioning refinement:
- Focus on validated use cases (likely e-commerce to start)
- Document what works and what needs improvement
Month 4+: Iterative Improvement
Realistic expectations:
- Simulations suggested 10-17× improvement; real-world results will vary
- Pattern detection speed depends on survey volume and data quality
- Network effects require significant scale to materialize
Competitive considerations:
- Embeddings tuning: Weeks of optimization work
- Domain calibration: Requires hundreds of surveys per category
- Deep integrations: 1-2 months per major platform
- First-mover advantage exists but isn't insurmountable
What needs ongoing work:
- Agent prompt tuning based on real user feedback
- Cost optimization as volume scales
- Product-category-specific customization
- Handling edge cases and error modes
- Begin exploring adjacent verticals based on early feedback
Conclusion
Survey Sensei demonstrates a practical path toward better customer feedback by combining modern AI capabilities with structured data collection:
What we built:
- 4-agent architecture: ProductContext + CustomerContext + Survey + ReviewGen agents working together
- Adaptive MCQ generation: Questions that evolve based on answers, not rigid decision trees
- Vector similarity search: Semantic matching via pgvector for contextual personalization
- Economic viability: $0.002/survey makes AI-powered surveys accessible for mid-market e-commerce
Improvements over traditional reviews:
- Structured feedback through guided MCQ interviews (reduces cognitive load)
- Real-time pattern detection with agent-based analysis
- Voice-matched review synthesis (maintains authenticity while saving time)
- Simulations suggest 10-17× more actionable feedback from the same customer base
What needs validation:
- Real-world completion rates with live customers
- Agent consistency across product categories
- Long-term ROI at production scale
The system works today. Clone the repo, run the setup, and test a survey in 10 minutes. The architecture shows how multi-agent patterns handle complex, context-dependent workflows—not just for surveys, but for any system requiring personalization at scale.
If you're building customer feedback systems, recommendation engines, or personalization tools, this architecture offers a concrete reference implementation. The pattern (context gathering → adaptive decision-making → personalized output) generalizes well:
- Customer support triage: Ticket context + User history → Prioritized response
- Personalized onboarding: Product features + User goals → Custom walkthrough
- Content recommendations: User preferences + Content embeddings → Ranked suggestions
The shift toward specialized agents collaborating on tasks represents a practical middle ground between monolithic models and over-engineered microservices. It's early, but the economics and technical patterns are sound enough to build on.
References and Further Reading
Questions? Ideas? Feedback?
- GitHub: github.com/arnavvj/survey-sensei
- Open an issue on GitHub
Academic Papers
- Automated Survey Collection with LLM-based Conversational Agents (arXiv, 2024) - Framework for phone-based surveys using conversational LLMs with 98% extraction accuracy
- Embedding in Recommender Systems: A Survey (arXiv, 2023) - Comprehensive survey on embedding techniques for recommender systems at scale
- Unified embedding: Battle-tested feature representations for Web-scale ML systems (Google Research, NIPS '23) - Real-world deployment of embeddings in production systems
Industry Reports
- Global Study: How Consumers Share Feedback, 2025 (Qualtrics XM Institute) - Analysis of 23,000+ consumers showing direct feedback declining while indirect feedback increased 60%
- The State of Customer Experience Management, 2025 (Qualtrics XM Institute) - Annual CX management trends and priorities
- Power forward: Five make-or-break truths about next-gen e-commerce (McKinsey, 2024) - E-commerce trends with focus on AI and generative AI adoption
- The agentic commerce opportunity (McKinsey, 2024) - How AI agents are transforming consumer shopping experiences
Multi-Agent Systems Resources
- LangGraph: Multi-Agent Workflows (LangChain Blog, 2024) - Building multi-agent systems with LangGraph
- Multi-agent network tutorial (LangChain Documentation) - Step-by-step guide to multi-agent collaboration patterns
Technical Documentation
- LangChain: python.langchain.com
- LangGraph: langchain-ai.github.io/langgraph
- pgvector: github.com/pgvector/pgvector
- OpenAI Embeddings: platform.openai.com/docs/guides/embeddings
All code examples and architecture diagrams are from the Survey Sensei codebase.