Gen AI chatbots have changed how we analyze user intent. Before AI chatbots, we relied on structured interactions: clicks, impressions, page views. Now we're dealing with free-form conversations.
This shift in how intent is expressed creates several challenges, outlined below:

Previously, recommendation systems assumed structured inputs; with LLMs, they need actual conversation signals both to stay productive and to train the models.

System Needed for Ingesting Chatbot Data

  1. A real-time PII processor using both regex rules and contextual NLP in the ingest pipeline
  2. A privacy-aware data warehouse supporting analytics and legal compliance with data encryption
  3. Conversation metrics that improve models without requiring raw data access
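The regex half of the PII processor in point 1 might be sketched as follows. The patterns and placeholder labels here are illustrative, not production rules; a real pipeline pairs them with a contextual NLP model (e.g. a NER tagger) to catch names and addresses that regexes miss:

```python
import re

# Illustrative regex rules only; extend with contextual NLP for names/addresses.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each regex match with a typed placeholder like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Typed placeholders (rather than blanking the text) keep the redacted output useful for downstream intent and sentiment models.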

Building a Better Framework

Data Ingestion

Our system ingests chat data from applications through a high-throughput pipeline:

from datetime import datetime
from typing import Dict, List
from uuid import UUID

from pydantic import BaseModel

class SecureMessage(BaseModel):
    chat_id: UUID                  # Conversation session
    request_id: UUID               # User question identifier
    response_id: UUID              # LLM response identifier
    timestamp: datetime            # Event time
    encrypted_pii: bytes           # GPG-encrypted raw text  
    clean_text: str                # De-identified content
    metadata: Dict[str, float]     # Non-PII features (sentiment, intent)
    vector_embedding: List[float]  # Semantic representation (768-dim)
    session_context: Dict          # Device, region, user segment

Much of the pipeline's heavy lifting happens in the PII detection system during ingestion.

All detected PII is secured with envelope encryption using rotating AES-256 data keys, with master keys held in a cloud secret manager such as Google Secret Manager (GSM) under strict access controls.
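The envelope pattern can be sketched as below. To stay dependency-free, this toy uses an HMAC-SHA-256 keystream as a stand-in cipher purely to show the key hierarchy; production code would use AES-256-GCM data keys wrapped by a KMS-held master key, never this construction:

```python
import hashlib
import hmac
import secrets

def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher keyed via HMAC-SHA-256 block counters.

    Stand-in for AES-256-GCM so this example runs on the stdlib alone."""
    out = bytearray()
    for block in range(0, len(data), 32):
        pad = hmac.new(key, block.to_bytes(8, "big"), hashlib.sha256).digest()
        out.extend(b ^ p for b, p in zip(data[block:block + 32], pad))
    return bytes(out)

def envelope_encrypt(master_key: bytes, plaintext: bytes):
    data_key = secrets.token_bytes(32)                  # fresh key per message
    ciphertext = _keystream_xor(data_key, plaintext)
    wrapped_key = _keystream_xor(master_key, data_key)  # master key wraps data key
    return wrapped_key, ciphertext                      # store both; never the raw key

def envelope_decrypt(master_key: bytes, wrapped_key: bytes, ciphertext: bytes) -> bytes:
    data_key = _keystream_xor(master_key, wrapped_key)
    return _keystream_xor(data_key, ciphertext)
```

Because only the wrapped key is stored, rotating the master key means re-wrapping small data keys rather than re-encrypting every payload.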

Multi-Temperature Storage

Not all data needs the same treatment, so we take a tiered approach to storage:

Tier | Technology                   | Retention | Use Case               | Access Pattern
-----|------------------------------|-----------|------------------------|------------------------------
Hot  | Redis + Elasticsearch        | 7 days    | Real-time A/B testing  | High-throughput, low latency
Warm | Parquet on Cloud Storage     | 90 days   | Model fine-tuning      | Batch processing, ML pipelines
Cold | Compressed Parquet + Glacier | 5+ years  | Legal/regulatory audits| Infrequent, compliance-driven

Data should be partitioned by time, geography, and conversation topic—optimized for both analytical queries and targeted lookups. Access controls enforce least privilege principles with just-in-time access provisioning and full audit logging.
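As a sketch, that partitioning scheme might map onto Hive-style object-store paths like this (the bucket prefix and key names are illustrative assumptions, not our actual layout):

```python
from datetime import date

def partition_path(tier: str, day: date, region: str, topic: str) -> str:
    """Hive-style partition keys let query engines prune by time, geo, and topic."""
    return (
        f"chat-lake/{tier}/"
        f"dt={day.isoformat()}/region={region}/topic={topic}/"
    )
```

For example, `partition_path("warm", date(2024, 5, 1), "eu", "billing")` yields `chat-lake/warm/dt=2024-05-01/region=eu/topic=billing/`, so a query scoped to one region and topic touches only that prefix.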

Overcoming Technical Hurdles

Building this system has its challenges:

  1. Scaling Throughput: Scaling Kafka consumers to achieve 100 ms end-to-end latency and feed models with real-time data
  2. Accurate PII Detection: Combining regex rules with contextual NLP let us catch PII reliably while keeping false positives low
  3. Maintaining Data Utility: Semantic preservation techniques (replacing real addresses with similar fictional ones) retained 95% analytical utility with zero PII exposure
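One way to implement the address substitution in point 3 is deterministic pseudonymization: hash the real value into an index over a pool of fictional addresses, so the same input always maps to the same stand-in and join keys survive de-identification. The pool and keying below are a minimal sketch, not our production approach:

```python
import hashlib
import hmac

# Illustrative pool; a real system would draw from a large synthetic dataset.
FAKE_ADDRESSES = [
    "12 Alder Court, Springfield",
    "734 Birch Lane, Fairview",
    "58 Cedar Row, Riverton",
]

def pseudonymize_address(real_address: str, secret: bytes) -> str:
    """Deterministically map a real address to a fictional one.

    Keyed hashing (HMAC) keeps the mapping stable for analytics while
    blocking dictionary attacks by anyone without the secret."""
    digest = hmac.new(secret, real_address.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_ADDRESSES)
    return FAKE_ADDRESSES[index]
```

Determinism is the point: repeated mentions of one address land on the same fictional one, preserving co-occurrence statistics without exposing the original.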

Measuring What Matters

Hallucination Detection That Actually Works

We calculate a Hallucination Score (H) as:

H = 1 - sim(R, S) / max_{d ∈ D} sim(R, d)

Where:
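The definitions of R, S, and D are not spelled out in this excerpt; assuming R is the response embedding, S the cited source embedding, and D a set of reference document embeddings, the score can be computed with cosine similarity as a sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hallucination_score(r, s, docs):
    """H = 1 - sim(R, S) / max over d in D of sim(R, d); higher H = less grounded."""
    best = max(cosine(r, d) for d in docs)
    return 1.0 - cosine(r, s) / best
```

When the response aligns perfectly with its cited source and that source is among the reference documents, H is 0; a response closer to some other document than to its source pushes H toward 1.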

Conversation Quality Metrics

Our framework tracks:

Compliance on Autopilot

Privacy regulations shouldn't require manual processes. Our system automates:

Making AI/ML Better

The framework generates de-identified features:

Privacy You Can Count On

Our framework delivers both cryptographic and statistical privacy guarantees:

The Road Ahead

We're continuing to improve the framework with: