Structural analysis of behavior divergence between evaluation and deployment contexts.
AI models are evaluated in controlled contexts — benchmark datasets, standardized test suites, human evaluation sessions — and then deployed into production contexts that differ structurally from the evaluation environment. The structural problem is that evaluation contexts create a specific projection of the model's behavior that may not be representative of deployment behavior. The model appears capable and safe in evaluation because the evaluation context projects onto a favorable region of the model's behavioral space, while the deployment context projects onto a different, potentially problematic region.
This is not merely a coverage gap in testing. It is a structural property of how context affects behavior: the evaluation context itself changes what the model does, and this change is systematic rather than random.
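The claim that context systematically changes measured behavior can be illustrated with a toy simulation. Everything here is a hypothetical stand-in, not part of the SORT framework: a fake "model" whose error rate grows with one context feature (input length), an evaluation context that samples a narrow favorable region of that feature, and a deployment context that samples a broader, shifted region.

```python
import random

random.seed(0)

# Hypothetical toy model: performs well on short inputs, degrades on long ones.
# "length" stands in for any context feature that differs between contexts.
def model_is_correct(length: float) -> bool:
    # Error rate grows with input length (an assumed behavioral regression).
    return random.random() > min(0.9, 0.05 + 0.01 * length)

def accuracy(lengths):
    return sum(model_is_correct(x) for x in lengths) / len(lengths)

# Evaluation context: inputs cluster in a narrow, favorable region.
eval_inputs = [random.gauss(10, 2) for _ in range(2000)]
# Deployment context: inputs are longer and more varied.
deploy_inputs = [random.gauss(40, 15) for _ in range(2000)]

eval_acc = accuracy(eval_inputs)
deploy_acc = accuracy(deploy_inputs)

# The divergence is systematic, not sampling noise: re-drawing both
# contexts repeatedly, the eval-minus-deploy gap keeps the same sign,
# because the shift lives in the context distribution itself.
gaps = []
for _ in range(20):
    e = accuracy([random.gauss(10, 2) for _ in range(500)])
    d = accuracy([random.gauss(40, 15) for _ in range(500)])
    gaps.append(e - d)

print(f"eval acc={eval_acc:.2f}, deploy acc={deploy_acc:.2f}")
print("gap always positive:", all(g > 0 for g in gaps))
```

The point of the repeated re-sampling is the distinction drawn above: random error would flip the sign of the gap across trials, whereas a structural context shift keeps it one-sided.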
This application operates at the boundary between model evaluation and production deployment, addressing the structural validity of evaluation results as predictors of deployment behavior. The relevant system boundary includes evaluation methodologies, deployment environments, and the structural mapping between evaluation and deployment contexts.
The validity of evaluation as a predictor of deployment behavior is the foundation of responsible AI deployment. When this validity breaks down structurally, organizations deploy systems on the basis of misleading evidence. Structural analysis of eval-deployment divergence grounds deployment decisions in evidence that actually predicts production behavior.
The SORT framework addresses this application through four structural dimensions, each providing a distinct analytical layer:

- Observed divergence: the model performs differently in evaluation than in deployment.
- Structural cause: the evaluation context projects onto different behavior than the deployment context does.
- Analytical approach: structural analysis of eval-deployment divergence.
- Intervention points: evaluation design, deployment validation, and context matching.