Hugging Face: Vision-Language Models: CLIP-style and PrefixLM approaches via Transformers for multimodal tasks | SignalBreak