ai.01 · Scenario S3

Heterogeneous Accelerator Fabric

Straggler cascade effects in mixed-generation accelerator deployments with asymmetric interconnect capabilities.

Scenario Definition

System Class

Mixed accelerator fleet with heterogeneous execution characteristics

Scale

Straggler-dominated regime with reactive overprovisioning

Operational Mode

Mixed training and inference across GPU, TPU, and NPU devices

Device Heterogeneity

High variance in progress rates across device populations

Recognition Pattern

Overprovisioning helps only temporarily, straggler discussions intensify, performance diagnosis becomes political, and fleet efficiency declines despite investment.

Structural Observations

Straggler cascades are not device failures. They are structural coupling effects: scheduling decisions that do not account for device heterogeneity create systematic slowdowns.

  • Device capability differences create systematic rather than random straggler patterns
  • Homogeneous scheduling policies applied to heterogeneous fleets amplify coupling
  • Overprovisioning increases device diversity and may worsen coupling effects
  • Straggler identification based on device identity misses the structural root cause
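
The coupling effect described above can be illustrated with a toy model of a single synchronous step. This is a minimal sketch under stated assumptions, not the analysis behind this scenario: the device rates, the helper names (`step_time`, `idle_ratio`, `equal_shards`, `proportional_shards`), and both sharding policies are hypothetical, and a step is assumed to finish only when the slowest device has finished its shard.

```python
# Toy model (hypothetical): one synchronous step over a mixed fleet.
# Rates are relative work units per second; all values are illustrative.

def step_time(rates, shards):
    """A synchronous step ends when the slowest device finishes its shard."""
    return max(shard / rate for rate, shard in zip(rates, shards))

def idle_ratio(rates, shards):
    """Fraction of total device-time spent waiting for the slowest device."""
    t = step_time(rates, shards)
    busy = sum(shard / rate for rate, shard in zip(rates, shards))
    return 1.0 - busy / (t * len(rates))

def equal_shards(rates, total_work):
    """Homogeneous policy: every device gets the same amount of work."""
    return [total_work / len(rates)] * len(rates)

def proportional_shards(rates, total_work):
    """Device-aware policy: work is split in proportion to device rate."""
    total_rate = sum(rates)
    return [total_work * r / total_rate for r in rates]

rates = [4.0, 4.0, 2.0, 1.0]   # assumed: two newer devices, two older ones
work = 88.0                    # assumed total work per step

for name, policy in [("equal", equal_shards),
                     ("proportional", proportional_shards)]:
    shards = policy(rates, work)
    print(name,
          "step_time:", round(step_time(rates, shards), 2),
          "idle_ratio:", round(idle_ratio(rates, shards), 2))
```

In this toy model, the equal split leaves half of the fleet's device-time idle, and the same slow device sets the step time on every step. That is the systematic pattern: the device is working to its own spec, and the slowdown comes from the split, not from the unit.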

Stability Projection

Baseline

Unstable
Reserve: Negative

With Structural Control

Marginal
Reserve: Limited

Transition type: Partial stabilization via device-aware coupling control

Aggregated Metrics

Normalized ratios without absolute units. Each metric lists the baseline value first, then the comparison value.

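Fleet Utilization Efficiency: 0.52 (baseline), 0.74 (comparison)
Device Idle Time Ratio: 0.41 (baseline), 0.19 (comparison)
Straggler Cascade Frequency: 0.38 (baseline), 0.12 (comparison)
Progress Rate Variance: 0.47 (baseline), 0.21 (comparison)
Overprovisioning Effectiveness: 0.34 (baseline), 0.67 (comparison)
Scheduling Coherence: 0.48 (baseline), 0.78 (comparison)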
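
To compare the metrics on a common footing, the relative change from baseline can be computed directly from the values listed above. This is a small arithmetic sketch; the dictionary keys and the `relative_change` helper are illustrative names, not part of the scenario's tooling.

```python
# Relative change from baseline for each aggregated metric.
# Values are the (baseline, comparison) pairs listed above.

metrics = {
    "fleet_utilization_efficiency": (0.52, 0.74),
    "device_idle_time_ratio": (0.41, 0.19),
    "straggler_cascade_frequency": (0.38, 0.12),
    "progress_rate_variance": (0.47, 0.21),
    "overprovisioning_effectiveness": (0.34, 0.67),
    "scheduling_coherence": (0.48, 0.78),
}

def relative_change(baseline, comparison):
    """Signed change relative to the baseline value."""
    return (comparison - baseline) / baseline

for name, (baseline, comparison) in metrics.items():
    print(f"{name}: {relative_change(baseline, comparison):+.1%}")
```

The sign needs to be read per metric. Assuming the names mean what they suggest, a higher value is better for utilization efficiency, overprovisioning effectiveness, and scheduling coherence, and a lower value is better for idle time, cascade frequency, and progress rate variance.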

Decision Implication

Primary insight: If your heterogeneous accelerator fleet shows declining efficiency despite investment, straggler discussions are becoming political, and overprovisioning provides only temporary relief, you have a structural coupling problem rooted in a mismatch between scheduling and devices.

Monitoring limitation: Device-level metrics show individual units performing to spec. The problem exists in the interaction between scheduling decisions and device population heterogeneity.

Scaling consideration: Additional capacity increases device diversity and may worsen coupling effects. The problem cannot be provisioned away.
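
The scaling consideration can be sketched with the same kind of toy model. The assumptions are hypothetical and deliberately simple: a synchronous step, an equal split of work across devices, and illustrative device rates. The helpers `step_time_equal` and `efficiency` are made-up names for this sketch.

```python
# Toy model (hypothetical): effect of adding capacity under an equal split.

def step_time_equal(rates, total_work):
    """Equal shards: the slowest device determines when the step ends."""
    shard = total_work / len(rates)
    return shard / min(rates)

def efficiency(rates, total_work):
    """Ideal step time (perfectly balanced) divided by actual step time."""
    ideal = total_work / sum(rates)
    return ideal / step_time_equal(rates, total_work)

fleet = [4.0, 4.0, 2.0]   # assumed relative rates of the current fleet
work = 24.0               # assumed total work per step

scenarios = [
    ("current fleet", []),
    ("+1 fast device", [4.0]),
    ("+1 slow device", [1.0]),
]

for label, added in scenarios:
    rates = fleet + added
    print(label,
          "step_time:", round(step_time_equal(rates, work), 2),
          "efficiency:", round(efficiency(rates, work), 3))
```

Under these assumptions, adding a fast device shortens the step but lowers fleet efficiency, because the new capacity also waits on the slowest device. Adding a slower device lengthens the step outright. Both outcomes are consistent with the point that added capacity does not remove the coupling while the scheduling policy stays the same.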

Evidence & Artefacts

Pre-computed analysis outputs for this scenario.

Such structural findings are typically contextualized through a scoped architecture risk assessment.