Digital Twin Agent Training Report
Achieving 73.90% Human Behavioral Similarity
Abstract
This report documents the methodology, implementation, and results of training AI agents to simulate human e-commerce behavior. Using a novel combination of personality-driven prompting, LLM-based evaluation, and behavioral clustering, we achieved 73.90% weighted optimized similarity (exceeding our 70% target). The system uses the REES46 e-commerce dataset (13,492 subjects) and Google's Gemini 3 Flash model via OpenRouter for cost-effective LLM evaluation.
Training Configuration
| Parameter | Quick Test | Full Training |
|---|---|---|
| Total Subjects | 13,492 | 13,492 |
| Training Subjects | 100 | 10,793 (80%) |
| Validation Subjects | 50 | 2,699 (20%) |
| Clusters | 3 | 3 |
| Max Iterations | 3 | 15 |
| LLM Model | google/gemini-3-flash-preview | google/gemini-3-flash-preview |
1. Introduction
Problem Statement
Creating AI agents that accurately simulate human behavior in e-commerce environments is crucial for:
- A/B Testing: Testing UI changes without real user traffic
- User Research: Understanding how different personality types interact with products
- Predictive Analytics: Forecasting conversion rates and user journeys
The challenge is ensuring these agents behave like real humans, not idealized or random actors.
2. Dataset: REES46 E-Commerce
We used the REES46 eCommerce Behavior Dataset from Kaggle, one of the largest publicly available e-commerce clickstream datasets.
Action Types Distribution
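As a minimal sketch of how the action-type distribution can be computed, the snippet below normalizes event counts into shares. The sample `events` list is illustrative, not real REES46 data; the dataset itself is a clickstream CSV whose `event_type` column holds values such as `view`, `cart`, and `purchase`.

```python
from collections import Counter

# Hypothetical sample of event_type values; the real REES46 CSV has
# columns including event_time, event_type, product_id, user_id, user_session.
events = ["view", "view", "cart", "view", "purchase", "view", "remove_from_cart"]

def action_distribution(event_types):
    """Return each action type's share of all events as a dict in [0, 1]."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}

dist = action_distribution(events)  # "view" accounts for 4/7 of events here
```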
3. Behavioral Identity Persona (BIP) Model
The BIP model consists of 6 layers that define an agent's complete behavioral profile:
The 6 Layers
OCEAN Personality Model
| Trait | Low Behavior | High Behavior | Shopping Impact |
|---|---|---|---|
| Openness | Conventional | Creative, curious | Exploration breadth |
| Conscientiousness | Spontaneous | Detail-oriented | Research depth |
| Extraversion | Reserved | Outgoing | Social proof weight |
| Agreeableness | Skeptical | Trusting | Price sensitivity |
| Neuroticism | Calm | Anxious | Friction tolerance |
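One way to turn the OCEAN table above into an agent persona is to map each trait score to a low- or high-end behavioral phrase and assemble a prompt fragment. The threshold, phrasing, and function below are illustrative assumptions, not the trained system's actual prompt template.

```python
# Low/high behavior phrases follow the OCEAN table; the exact wording
# and the 0.5 threshold are assumptions for illustration.
TRAIT_PHRASES = {
    "openness": ("conventional, sticks to familiar products",
                 "creative and curious, explores broadly"),
    "conscientiousness": ("spontaneous, decides quickly",
                          "detail-oriented, researches deeply before acting"),
    "extraversion": ("reserved, ignores social proof",
                     "outgoing, weighs reviews and popularity heavily"),
    "agreeableness": ("skeptical, highly price-sensitive",
                      "trusting, less price-sensitive"),
    "neuroticism": ("calm, tolerates checkout friction",
                    "anxious, abandons at the first sign of friction"),
}

def persona_prompt(traits, threshold=0.5):
    """Build a personality prompt fragment from trait scores in [0.0, 1.0]."""
    lines = []
    for trait, (low, high) in TRAIT_PHRASES.items():
        score = traits[trait]
        phrase = high if score >= threshold else low
        lines.append(f"- {trait} = {score:.2f}: {phrase}")
    return "You are a shopper with this personality:\n" + "\n".join(lines)

prompt = persona_prompt({"openness": 0.3, "conscientiousness": 0.65,
                         "extraversion": 0.5, "agreeableness": 0.4,
                         "neuroticism": 0.7})
```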
4. Similarity Metrics
The overall similarity score is a weighted combination of seven metrics:
Overall Score = 0.25 × Sequence Similarity (LCS)
+ 0.20 × Action Distribution Match
+ 0.15 × Outcome Match
+ 0.15 × Intent Sequence Similarity
+ 0.10 × Temporal Similarity
+ 0.10 × Step Count Penalty
+ 0.05 × N-gram Similarity
- Sequence Similarity (25% weight): LCS algorithm comparing action sequences
- Intent Sequence Similarity (15% weight): focuses on purchase-intent actions
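The formula above can be sketched as follows. The LCS-based sequence similarity is implemented in full; the other six metric scores are passed in as placeholder values standing in for their respective scoring functions, whose internals the report does not detail.

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def sequence_similarity(human, agent):
    """LCS length normalized by the longer sequence, giving a score in [0, 1]."""
    if not human and not agent:
        return 1.0
    return lcs_length(human, agent) / max(len(human), len(agent))

# Weights from the formula above.
WEIGHTS = {"sequence": 0.25, "action_dist": 0.20, "outcome": 0.15,
           "intent": 0.15, "temporal": 0.10, "step_count": 0.10, "ngram": 0.05}

def overall_score(metrics):
    """Weighted combination of the seven per-metric scores."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

human = ["view", "view", "cart", "purchase"]
agent = ["view", "cart", "view", "purchase"]
seq = sequence_similarity(human, agent)  # LCS = 3, so 3/4 = 0.75
```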
5. Training Methodology
Pipeline Overview
User Clustering Results
| Cluster | Name | Size | Description |
|---|---|---|---|
| 0 | High Converters | ~7-14% | High purchase rate, focused behavior |
| 1 | Deep Browsers | ~16-23% | Low conversion, extensive product viewing |
| 2 | Quick Abandoners | ~63-77% | Brief sessions, rapid exit |
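The three clusters above suggest per-user features along the lines of purchase rate, viewing breadth, and session length. The feature set below is an assumption, not the report's documented pipeline; a standard next step would be to fit `sklearn.cluster.KMeans(n_clusters=3)` on these vectors.

```python
def user_features(events):
    """(purchase_rate, n_views, n_events) for one user's event list.

    `events` is a list of dicts with an "event_type" key, mirroring the
    REES46 clickstream rows; the feature choice here is illustrative.
    """
    n_events = len(events)
    n_purchases = sum(e["event_type"] == "purchase" for e in events)
    n_views = sum(e["event_type"] == "view" for e in events)
    purchase_rate = n_purchases / n_events if n_events else 0.0
    return (purchase_rate, n_views, n_events)

# Toy users matching the cluster descriptions above.
quick_abandoner = [{"event_type": "view"}, {"event_type": "view"}]
deep_browser = [{"event_type": "view"}] * 12 + [{"event_type": "cart"}]
features = user_features(deep_browser)  # (0.0, 12, 13)
```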
6. Results
Per-Cluster Results
Key Discovery: Conscientiousness Drives Abandoner Behavior
Increasing conscientiousness to 0.65 improved Deep Browsers from 58.40% to 75.90% similarity — a gain of 17.50 percentage points. Users who browse extensively but don't purchase aren't "random clickers" — they're deliberate researchers.
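This per-cluster tuning can be sketched as a base persona plus cluster-specific trait overrides. The 0.65 conscientiousness value comes from the result above; the 0.5 baseline, the cluster keys, and the dict structure are illustrative assumptions.

```python
# Neutral baseline traits (assumed), overridden per cluster.
BASE_TRAITS = {"openness": 0.5, "conscientiousness": 0.5, "extraversion": 0.5,
               "agreeableness": 0.5, "neuroticism": 0.5}

# Deep Browsers get high conscientiousness: deliberate researchers,
# not random clickers (value from the training result above).
CLUSTER_OVERRIDES = {
    "deep_browsers": {"conscientiousness": 0.65},
}

def cluster_traits(cluster):
    """Merge the base persona with any overrides for the given cluster."""
    return {**BASE_TRAITS, **CLUSTER_OVERRIDES.get(cluster, {})}

traits = cluster_traits("deep_browsers")  # conscientiousness -> 0.65
```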
7. Conclusion
We successfully trained AI agents to achieve 73.90% behavioral similarity with real human e-commerce users, exceeding our 70% target. Key achievements:
- LLM-Based Evaluation: Using Gemini 3 Flash for cost-effective, accurate evaluation
- Per-Cluster Training: Recognizing that different user types need different personas
- Conscientiousness Discovery: High conscientiousness (0.65) is key for modeling "researcher" behavior
- Robust Metrics: Added intent sequence similarity for better outcome prediction