Structural separation of semantic content from control signals across modalities, analyzing cross-modal safety boundaries.
Multimodal AI systems process inputs from multiple modalities — text, images, audio, video — through shared representation spaces. The structural problem is that control signals (instructions that direct model behavior) and semantic content (information to be processed) are not structurally separated across modalities. An image can contain embedded text that the model interprets as an instruction. An audio input can carry encoded commands. These cross-modal injection paths exist because the model's multimodal fusion creates coupling between modalities that does not respect the content-control distinction.
This application addresses multimodal AI systems where inputs from different modalities are fused into a shared processing space. The relevant system boundary includes modality-specific encoders, fusion mechanisms, the shared representation space, and the model's instruction-following behavior that can be triggered through any modality.
Multimodal AI is expanding rapidly into production applications. Cross-modal injection represents a structural security vulnerability that single-modality defenses cannot address. Structural isolation diagnostics provide the foundation for securing multimodal applications against the expanding attack surface that multimodality creates.
The SORT framework addresses this application through four structural dimensions, each providing a distinct analytical layer.
Cross-modal inputs can take over control.
Insufficient separation between semantic content and control signals.
Structural analysis of cross-modal isolation.
Multimodal architecture, safety boundaries, input validation.