Digital Twin Agent Training Report
Achieving 73.90% Human Behavioral Similarity
Abstract
This report documents the methodology, implementation, and results of training AI agents to simulate human e-commerce behavior. Using a novel combination of personality-driven prompting, LLM-based evaluation, and behavioral clustering, we achieved 73.90% weighted optimized similarity (exceeding our 70% target). The system uses the REES46 e-commerce dataset (13,492 subjects) and Google's Gemini 3 Flash model via OpenRouter for cost-effective LLM evaluation.
Training Configuration
| Parameter | Quick Test | Full Training |
|---|---|---|
| Total Subjects | 13,492 | 13,492 |
| Training Subjects | 100 | 10,793 (80%) |
| Validation Subjects | 50 | 2,699 (20%) |
| Clusters | 3 | 3 |
| Max Iterations | 3 | 15 |
| LLM Model | google/gemini-3-flash-preview | google/gemini-3-flash-preview |
1. Introduction
Problem Statement
Creating AI agents that accurately simulate human behavior in e-commerce environments is crucial for:
- A/B Testing: Testing UI changes without real user traffic
- User Research: Understanding how different personality types interact with products
- Predictive Analytics: Forecasting conversion rates and user journeys
The challenge is ensuring these agents behave like real humans, not idealized or random actors.
2. Dataset: REES46 E-Commerce
We used the REES46 eCommerce Behavior Dataset from Kaggle, one of the largest publicly available e-commerce clickstream datasets.
Action Types Distribution
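As a minimal sketch of how the action-type distribution can be computed, the snippet below normalizes event counts into shares. The sample `events` list is illustrative, not real REES46 data; the dataset itself is a clickstream CSV whose `event_type` column holds values such as `view`, `cart`, and `purchase`.

```python
from collections import Counter

# Hypothetical sample of event_type values; the real REES46 CSV has
# columns including event_time, event_type, product_id, user_id, user_session.
events = ["view", "view", "cart", "view", "purchase", "view", "remove_from_cart"]

def action_distribution(event_types):
    """Return each action type's share of all events as a dict in [0, 1]."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}

dist = action_distribution(events)  # "view" accounts for 4/7 of events here
```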
3. Behavioral Identity Persona (BIP) Model
The BIP model consists of 6 layers that define an agent's complete behavioral profile:
The 6 Layers
OCEAN Personality Model
| Trait | Low Behavior | High Behavior | Shopping Impact |
|---|---|---|---|
| Openness | Conventional | Creative, curious | Exploration breadth |
| Conscientiousness | Spontaneous | Detail-oriented | Research depth |
| Extraversion | Reserved | Outgoing | Social proof weight |
| Agreeableness | Skeptical | Trusting | Price sensitivity |
| Neuroticism | Calm | Anxious | Friction tolerance |
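One way to turn the OCEAN table above into an agent persona is to map each trait score to a low- or high-end behavioral phrase and assemble a prompt fragment. The threshold, phrasing, and function below are illustrative assumptions, not the trained system's actual prompt template.

```python
# Low/high behavior phrases follow the OCEAN table; the exact wording
# and the 0.5 threshold are assumptions for illustration.
TRAIT_PHRASES = {
    "openness": ("conventional, sticks to familiar products",
                 "creative and curious, explores broadly"),
    "conscientiousness": ("spontaneous, decides quickly",
                          "detail-oriented, researches deeply before acting"),
    "extraversion": ("reserved, ignores social proof",
                     "outgoing, weighs reviews and popularity heavily"),
    "agreeableness": ("skeptical, highly price-sensitive",
                      "trusting, less price-sensitive"),
    "neuroticism": ("calm, tolerates checkout friction",
                    "anxious, abandons at the first sign of friction"),
}

def persona_prompt(traits, threshold=0.5):
    """Build a personality prompt fragment from trait scores in [0.0, 1.0]."""
    lines = []
    for trait, (low, high) in TRAIT_PHRASES.items():
        score = traits[trait]
        phrase = high if score >= threshold else low
        lines.append(f"- {trait} = {score:.2f}: {phrase}")
    return "You are a shopper with this personality:\n" + "\n".join(lines)

prompt = persona_prompt({"openness": 0.3, "conscientiousness": 0.65,
                         "extraversion": 0.5, "agreeableness": 0.4,
                         "neuroticism": 0.7})
```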
4. Similarity Metrics
The overall similarity score is a weighted combination of seven metrics:
Overall Score = 0.25 × Sequence Similarity (LCS)
+ 0.20 × Action Distribution Match
+ 0.15 × Outcome Match
+ 0.15 × Intent Sequence Similarity
+ 0.10 × Temporal Similarity
+ 0.10 × Step Count Penalty
+ 0.05 × N-gram Similarity
- Sequence Similarity (25% weight): LCS algorithm comparing action sequences
- Intent Sequence Similarity (15% weight): focuses on purchase-intent actions
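The formula above can be sketched as follows. The LCS-based sequence similarity is implemented in full; the other six metric scores are passed in as placeholder values standing in for their respective scoring functions, whose internals the report does not detail.

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def sequence_similarity(human, agent):
    """LCS length normalized by the longer sequence, giving a score in [0, 1]."""
    if not human and not agent:
        return 1.0
    return lcs_length(human, agent) / max(len(human), len(agent))

# Weights from the formula above.
WEIGHTS = {"sequence": 0.25, "action_dist": 0.20, "outcome": 0.15,
           "intent": 0.15, "temporal": 0.10, "step_count": 0.10, "ngram": 0.05}

def overall_score(metrics):
    """Weighted combination of the seven per-metric scores."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

human = ["view", "view", "cart", "purchase"]
agent = ["view", "cart", "view", "purchase"]
seq = sequence_similarity(human, agent)  # LCS = 3, so 3/4 = 0.75
```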
5. Training Methodology
Pipeline Overview
User Clustering Results
| Cluster | Name | Size | Description |
|---|---|---|---|
| 0 | High Converters | ~7-14% | High purchase rate, focused behavior |
| 1 | Deep Browsers | ~16-23% | Low conversion, extensive product viewing |
| 2 | Quick Abandoners | ~63-77% | Brief sessions, rapid exit |
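The three clusters above suggest per-user features along the lines of purchase rate, viewing breadth, and session length. The feature set below is an assumption, not the report's documented pipeline; a standard next step would be to fit `sklearn.cluster.KMeans(n_clusters=3)` on these vectors.

```python
def user_features(events):
    """(purchase_rate, n_views, n_events) for one user's event list.

    `events` is a list of dicts with an "event_type" key, mirroring the
    REES46 clickstream rows; the feature choice here is illustrative.
    """
    n_events = len(events)
    n_purchases = sum(e["event_type"] == "purchase" for e in events)
    n_views = sum(e["event_type"] == "view" for e in events)
    purchase_rate = n_purchases / n_events if n_events else 0.0
    return (purchase_rate, n_views, n_events)

# Toy users matching the cluster descriptions above.
quick_abandoner = [{"event_type": "view"}, {"event_type": "view"}]
deep_browser = [{"event_type": "view"}] * 12 + [{"event_type": "cart"}]
features = user_features(deep_browser)  # (0.0, 12, 13)
```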
6. Results
Per-Cluster Results
Key Discovery: Conscientiousness Drives Abandoner Behavior
Increasing conscientiousness to 0.65 improved Deep Browsers from 58.40% to 75.90% similarity — a gain of 17.50 percentage points. Users who browse extensively but don't purchase aren't "random clickers" — they're deliberate researchers.
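This per-cluster tuning can be sketched as a base persona plus cluster-specific trait overrides. The 0.65 conscientiousness value comes from the result above; the 0.5 baseline, the cluster keys, and the dict structure are illustrative assumptions.

```python
# Neutral baseline traits (assumed), overridden per cluster.
BASE_TRAITS = {"openness": 0.5, "conscientiousness": 0.5, "extraversion": 0.5,
               "agreeableness": 0.5, "neuroticism": 0.5}

# Deep Browsers get high conscientiousness: deliberate researchers,
# not random clickers (value from the training result above).
CLUSTER_OVERRIDES = {
    "deep_browsers": {"conscientiousness": 0.65},
}

def cluster_traits(cluster):
    """Merge the base persona with any overrides for the given cluster."""
    return {**BASE_TRAITS, **CLUSTER_OVERRIDES.get(cluster, {})}

traits = cluster_traits("deep_browsers")  # conscientiousness -> 0.65
```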
7. Conclusion
We successfully trained AI agents to achieve 73.90% behavioral similarity with real human e-commerce users, exceeding our 70% target. Key achievements:
- LLM-Based Evaluation: Using Gemini 3 Flash for cost-effective, accurate evaluation
- Per-Cluster Training: Recognizing that different user types need different personas
- Conscientiousness Discovery: High conscientiousness (0.65) is key for modeling "researcher" behavior
- Robust Metrics: Added intent sequence similarity for better outcome prediction