In this talk, we will explore the forefront of multimodal foundation models: the latest systems that combine text, images, videos, and audio. We will analyze the new opportunities unlocked by these generative AI systems and how practitioners and industry experts can leverage open-source AI for their use cases. In particular, we will dive into the construction of Idefics2, a state-of-the-art multimodal model that punches above its weight: with 8B parameters, it is competitive with alternatives four times its size. With one of the authors, we will explore the lessons learned while building Idefics2 and investigate how to fine-tune it for your own use case.
Victor Sanh, a Lead Research Scientist at Hugging Face, focuses on pushing the boundaries of multimodal generative AI. He is dedicated to ensuring that his research is accessible and beneficial to the entire AI community. This commitment to open-source and open science is reflected in his efforts to share research artifacts such as datasets and models, and to document the recipes required to construct them. As a founding member of Hugging Face, Victor has played a pivotal role in developing widely used ML tools and models, including DistilBERT, the HF Transformers library, BLOOM, and T0.