Prompt-Conditioned FiLM and Multi-Scale Fusion on MedSigLIP for Low-Dose CT Quality Assessment

Summary

This work studies how prompt-conditioned FiLM (feature-wise linear modulation) layers and multi-scale fusion can adapt the MedSigLIP multimodal backbone to low-dose CT image quality assessment (IQA). Language prompts modulate the visual encoder, so the model can respond to specific quality criteria rather than to a single monolithic notion of quality.

The paper is more targeted than a general multimodal evaluation study: prompting is used not only at the input level but also as a conditioning mechanism inside the model. As a result, the quality predictor is sensitive to which criteria a prompt emphasizes.

Method

The architecture builds on MedSigLIP and adds FiLM-style prompt conditioning together with multi-scale feature fusion. The goal is to combine local texture evidence and broader anatomical context while letting text prompts shape which quality attributes the network prioritizes.
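
The two ingredients can be illustrated with a small NumPy sketch. This is not the authors' implementation: it only shows the general FiLM idea (a prompt embedding predicts a per-channel scale gamma and shift beta, applied as gamma * x + beta) followed by one simple way to fuse several feature scales. The random arrays stand in for MedSigLIP image features and text embeddings, and the function names, the linear projections, the identity-centered scale, and the pool-and-concatenate fusion are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, prompt_emb, w_gamma, w_beta):
    """Apply FiLM: a per-channel scale and shift predicted from a prompt embedding.

    features:   (C, H, W) visual feature map
    prompt_emb: (D,) text embedding of the quality prompt
    w_gamma, w_beta: (C, D) hypothetical linear projection weights
    """
    gamma = 1.0 + w_gamma @ prompt_emb   # (C,), centered on identity scaling (assumption)
    beta = w_beta @ prompt_emb           # (C,)
    # Broadcast the per-channel parameters over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

def fuse_multiscale(feature_maps):
    """Global-average-pool each scale and concatenate the results (assumed fusion)."""
    pooled = [f.mean(axis=(1, 2)) for f in feature_maps]
    return np.concatenate(pooled)

# Toy stand-ins: three feature scales (fine to coarse) and one prompt embedding.
D = 8
scales = [(16, 32, 32), (32, 16, 16), (64, 8, 8)]
feature_maps = [rng.standard_normal(s) for s in scales]
prompt_emb = rng.standard_normal(D)

# Modulate every scale with the same prompt, using separate projections per scale.
modulated = []
for f in feature_maps:
    C = f.shape[0]
    w_gamma = 0.1 * rng.standard_normal((C, D))
    w_beta = 0.1 * rng.standard_normal((C, D))
    modulated.append(film(f, prompt_emb, w_gamma, w_beta))

fused = fuse_multiscale(modulated)       # shape (112,) = 16 + 32 + 64 channels
w_head = 0.1 * rng.standard_normal(fused.shape[0])
score = float(w_head @ fused)            # untrained linear head, illustrative only
print(fused.shape, score)
```

In this sketch a zero prompt embedding leaves the features unchanged, because the scale is written as 1 plus a learned offset. Changing the prompt changes gamma and beta, and therefore which channels dominate the fused vector, which is the sense in which a prompt can shift the attributes the predictor prioritizes.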

Why it matters

This work matters because it treats quality assessment as a criteria-aware task rather than as the prediction of a single global score. It pushes the lab’s IQA direction toward models that can respond to explicit language about which aspects of quality matter in a given setting.

Highlights

Uses language-conditioned feature modulation for medical IQA.
Combines local texture cues with broader anatomical context through multi-scale fusion.
Aims to align model predictions more closely with radiologist-style quality language.