Summary
This work studies how prompt-conditioned FiLM (feature-wise linear modulation) layers and multi-scale fusion can adapt a multimodal backbone to low-dose CT image quality assessment (IQA). Language prompts modulate the visual encoder so the model can respond to specific quality criteria rather than a single monolithic notion of quality.
The paper is more targeted than a general multimodal evaluation study: it uses prompting not just at the input level, but as a conditioning mechanism inside the model. As a result, the quality predictor is sensitive to which criteria the prompt emphasizes.
Method
The architecture builds on MedSigLIP and adds FiLM-style prompt conditioning together with multi-scale feature fusion. The goal is to combine local texture evidence and broader anatomical context while letting text prompts shape which quality attributes the network prioritizes.
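The conditioning and fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: random arrays stand in for MedSigLIP image features and text embeddings, the names `film`, `fuse_multiscale`, and `predict_quality` are hypothetical, and fusing by spatial pooling followed by concatenation is just one plausible choice, since the summary does not specify the fusion operator.

```python
import numpy as np

rng = np.random.default_rng(0)

C, D = 8, 4  # channels per feature map, prompt-embedding size (both hypothetical)


def film(features, prompt_emb, w_gamma, w_beta):
    """Apply FiLM: a per-channel scale and shift predicted from the prompt.

    features: (C, H, W) feature map; prompt_emb: (D,) text embedding.
    """
    gamma = 1.0 + prompt_emb @ w_gamma  # (C,), identity scale for a zero embedding
    beta = prompt_emb @ w_beta          # (C,)
    return gamma[:, None, None] * features + beta[:, None, None]


def fuse_multiscale(feature_maps):
    """Average-pool each scale over space and concatenate the pooled vectors."""
    return np.concatenate([f.mean(axis=(1, 2)) for f in feature_maps])


# Random stand-ins for encoder outputs at two scales (not real MedSigLIP features).
fine = rng.normal(size=(C, 16, 16))  # high resolution: local texture evidence
coarse = rng.normal(size=(C, 4, 4))  # low resolution: broader anatomical context

# Random stand-ins for learned projection weights and prompt embeddings.
w_gamma = rng.normal(size=(D, C)) * 0.1
w_beta = rng.normal(size=(D, C)) * 0.1
w_head = rng.normal(size=(2 * C,))
prompt_noise = rng.normal(size=(D,))    # e.g. a prompt about noise level
prompt_anatomy = rng.normal(size=(D,))  # e.g. a prompt about anatomical visibility


def predict_quality(prompt_emb):
    """Modulate both scales with the same prompt, fuse them, and score linearly."""
    modulated = [film(f, prompt_emb, w_gamma, w_beta) for f in (fine, coarse)]
    return float(fuse_multiscale(modulated) @ w_head)


score_noise = predict_quality(prompt_noise)
score_anatomy = predict_quality(prompt_anatomy)
print(score_noise, score_anatomy)  # same image features, different prompts
```

Because the prompt enters only through `gamma` and `beta`, the same image features can yield different scores for different criteria, which is the behavior the method is after. Parameterizing the scale as `1 + ...` is a common convenience so that a zero prompt embedding leaves the features unchanged; whether the paper does this is not stated.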
Why it matters
This work treats quality assessment as a criteria-aware task rather than the prediction of a single global score. It pushes the lab’s IQA direction toward models that can respond to explicit language about what kind of quality matters in a given setting.
Highlights
- Uses language-conditioned feature modulation for medical IQA.
- Combines local texture cues with broader anatomical context through multi-scale fusion.
- Aims to align model predictions more closely with radiologist-style quality language.