Prof. Yuxin Peng, Peking University
IEEE/CCF/CAAI/CIE/CSIG Fellow; National Science Fund for Distinguished Young Scholars

Yuxin Peng, IEEE/CCF/CAAI/CIE/CSIG Fellow, is a Boya Distinguished Professor at the Wangxuan Institute of Computer Technology, Peking University. He received the National Science Fund for Distinguished Young Scholars of China in 2019 and its continued funding in 2025. He received his Ph.D. in computer application technology from Peking University, Beijing, China, in 2003. His research interests include multimedia analysis, computer vision, and artificial intelligence. He has authored over 260 papers, including more than 170 in top-tier journals and conference proceedings, and has been granted 40 invention patents. His team has won first place in the TRECVID video semantic search evaluation ten times. As lead recipient, he won the First Prize of the Beijing Science and Technology Award in 2016 and the First Prize of the Scientific and Technological Progress Award of the Chinese Institute of Electronics in 2020. He received Best Paper awards at MMM 2019 and NCIG 2018, and serves as an associate editor of IEEE TMM, IEEE TCSVT, and other journals.

Title: Fine-Grained Understanding and Physically-Grounded Generation

Abstract: Multimodal large language models (MLLMs) and diffusion models, two representative types of foundation models, have demonstrated strong capabilities in visual content understanding and generation, respectively, yet both face important challenges. In visual content understanding, MLLMs struggle to recognize fine-grained categories of real-world objects; in visual content generation, diffusion models have difficulty producing content that conforms to real-world physical laws. To address these challenges, this talk first introduces our recent research progress in fine-grained recognition with MLLMs, hierarchical recognition based on fine-grained trees, and physics-driven video generation. We then present our latest advances in two application scenarios: aesthetic understanding and virtual try-on. Finally, we discuss the application of visual content understanding and generation technologies in the era of foundation models and offer an outlook on future research directions for MLLMs and diffusion models.
Prof. Min Liu, Hunan University
National Science Fund for Distinguished Young Scholars

Min Liu is a Second-Level Professor at Hunan University and Party Committee Secretary of the School of Artificial Intelligence and Robotics. He is a recipient of the National Science Fund for Distinguished Young Scholars, a Youth Changjiang Scholar of the Ministry of Education, and the lead scientist of a National Key R&D Program project. He holds a bachelor's degree from Peking University and a Ph.D. from the University of California, Riverside. He serves as Deputy Director of the Hunan Provincial Automation Society, Director of the Key Laboratory of Advanced Manufacturing Vision Inspection and Control Technology in the Machinery Industry, and Vice Director of the Youth Working Committee of the China Society of Image and Graphics.

Title: Preliminary Exploration of Embodied Surgical Robots

Abstract: Achieving breakthroughs in the core technologies of high-end medical equipment such as surgical robots, and driving their comprehensive intelligent transformation and upgrading, is a major national strategic task oriented toward the frontiers of world science and technology, major national needs, and people's lives and health. It provides decisive guarantees and strong support for breaking the European and American technological monopoly on high-end digital medical equipment. Existing surgical robots lack an effective system for collaborative multimodal perception of surgical targets and place high demands on surgeon operation, which severely restricts their deployment in responses to major national emergencies such as national defense security incidents and epidemic disasters. Embodied intelligence builds a closed-loop "perception-cognition-action" interaction mechanism, enabling surgical robots to understand the surgical environment, adapt to complex scenes, and make intelligent decisions like human doctors; this is the key path to a leap in their autonomous capabilities. In response to these challenges, this lecture provides an in-depth introduction to the basic principles and key methods of multimodal perception for surgical robots across the preoperative, intraoperative, and postoperative stages. It also presents our team's preliminary progress on embodied-intelligence-driven autonomous operation of surgical robots, providing an important safeguard for reducing medical accidents in China.