Where: Main hall
When: Tuesday, January 27, from 10:45 AM to 12:00 PM
This session offers an in-depth look at how the Italian MMSP community is contributing to the emerging paradigm of Agentic AI, i.e., AI systems capable of autonomous perception, reasoning, and action across multimodal environments. By bridging foundational MMSP research with the requirements of next-generation intelligent agents, the session showcases a rich spectrum of perspectives and innovations from leading Italian groups.
The program spans key areas shaping Agentic AI: from audio and acoustic intelligence, with a critical discussion of advantages, limits, and risks; to AI-driven translation of compressed latent streams enabling universal media processing and reconstruction; to multimodal representations of touch and perception through visual and haptic latent spaces. Further contributions explore how Agentic AI can redefine multimedia forensics, moving beyond static deepfake detection, and how federated intelligence at the edge, from drones and vehicles to large-scale models, can empower autonomous, distributed agents.
Together, these talks provide a comprehensive overview of research directions, challenges, and expected outcomes for Agentic AI in Italy, highlighting the central role of Multimedia Signal Processing in enabling the agents of tomorrow.
| Start | End | Title | Presenter |
|---|---|---|---|
| 10:45 | 11:00 | Agentic AI in Audio and Acoustics: Advantages, Limits and Dangers | Augusto Sarti (PoliMi) |
| 11:00 | 11:15 | AI Translation of Compressed Latent Streams for Universal Processing/Reconstruction | Alessandro Gnutti (UniBS) |
| 11:15 | 11:30 | Binding Vision to Touch for Haptic-aware Digital Twins | Antonio Stefani (UniTN) |
| 11:30 | 11:45 | Agentic AI for Multimedia Forensics: Beyond Static Deepfake Detection | Paolo Bestagini (PoliMi) |
| 11:45 | 12:00 | Federated Intelligence at the Edge: From Drones and Cars to Foundation Models | Matteo Caligiuri (UniPD) |
Session chair: Prof. M. Barni (Università degli Studi di Siena)
Agentic AI in Audio and Acoustics: Advantages, Limits and Dangers
Prof. A. Sarti (Politecnico di Milano)
AI has brought ground-shifting changes to the research world, and transformations of this magnitude come with great advantages but also great dangers. Agentic AI offers a way forward, but it needs to be used responsibly. In my presentation, I discuss these aspects, showing where this approach can be advantageous in the areas of Audio and Acoustics, and where we should draw the line.
AI Translation of Compressed Latent Streams for Universal Processing/Reconstruction
Prof. A. Gnutti (Università degli Studi di Brescia)
This talk explores a "beyond visual reconstruction" paradigm, where visual understanding is performed directly in the compressed latent domain. We present face and object detection operating on codec latents (e.g., JPEG AI), avoiding explicit image reconstruction. We further present a bridge between AI codec latents and multimodal large language models, enabling efficient multimodal reasoning. Finally, we explore AI transcoding across heterogeneous codecs and latent-domain data fusion between visual and LiDAR signals, highlighting the latent space as a unified representation for efficient vision and multimodal processing.
Binding Vision to Touch for Haptic-aware Digital Twins
Dr. A. L. Stefani (Università degli Studi di Trento)
Extended Reality (XR) systems are increasingly incorporating multi-sensory stimuli to enhance realism and user immersion. Among these, the integration of tactile feedback plays a crucial role. Yet the pipeline for acquiring, processing, and rendering haptic information, especially in synchrony with visual stimuli, remains largely unstandardized. A common strategy for capturing tactile data is to encode it as haptic maps, essentially image-based representations of touch. However, the effectiveness of the visual and tactile modalities in modeling perceptual haptic properties is not yet fully understood.
Agentic AI for Multimedia Forensics: Beyond Static Deepfake Detection
Prof. P. Bestagini (Politecnico di Milano)
The increasing realism of AI-generated audio and video poses significant challenges to multimedia forensics and deepfake detection. Despite recent advances in learning-based detectors, current approaches are often limited by poor generalization, sensitivity to distribution shifts, and a largely static view of the forensic process.
This presentation explores how Agentic AI (i.e., autonomous and potentially multi-agent systems capable of reasoning, planning, adaptation, and coordination) can offer a complementary perspective for multimedia forensics. Rather than focusing on a single detection model, agentic systems enable adaptive analysis workflows that combine multiple tools, modalities, and sources of evidence, and that can evolve in response to uncertain or incomplete observations.
Federated Intelligence at the Edge: From Drones and Cars to Foundation Models
Dr. M. Caligiuri (Università degli Studi di Padova)
Federated Learning (FL) enables collaborative learning without sharing private data, but large Foundation Models (FMs) are too heavy for edge devices. Modern research trends focus on federated knowledge transfer, where lightweight client models learn new domains and efficiently pass knowledge to a central FM, balancing privacy, generalization, and efficiency. As FL moves toward real-world perception, new directions explore heterogeneous agents (e.g., cars and drones) and adaptation to adverse environments such as extreme weather, unlocking novel visual domains previously inaccessible to centralized training.
This talk highlights the evolution from traditional FL to scalable FM collaboration at the edge, and the emerging challenges of multi-agent and real-world adaptation, outlining a vision of federated intelligence capable of learning in the wild while staying efficient and privacy-first.