Kevin Lin

Scholar | Github | LinkedIn | Huggingface | X (Twitter)

Kevin is a postdoctoral researcher at the University of Oxford, advised by Prof. Philip Torr. He is fortunate to be a visiting researcher at Stanford University, advised by Prof. James Zou.

Kevin obtained his PhD from the National University of Singapore, advised by Prof. Mike Shou.

He has interned at Tencent, Meta AI, Meta Reality Labs, and Microsoft Research.

Research

His research goal is to develop multimodal intelligent agents that collaborate with humans.

Perception: can agents understand multimodal context? (EgoVLP, UniVTG, VideoMind, VideoLLM-online, Show-o)
Interaction: can agents automate tasks in embodied environments? (ShowUI, ShowUI-π, AssistGPT, GroundCUA, GameWorld)
Reasoning: can agents think adaptively and imagine like humans? (Think or Not, VCode, Code2World)

Bringing these together, the bigger vision is to let agents advance autoresearch (Paper2Poster, Paper2Video) and education (Code2Video, Violin).

Blogs

When Vision Meets Code
Code offers a new lens on computer vision. This blog shares insights on (i) Code as Visual Representation; (ii) Video Generation via Programming; (iii) Coder as World Model.

Violin: An open-source video translation skill
Multimodal AI breaks down language barriers, making educational videos more accessible to global audiences.
[demo] [github] [media]
130K twitter views. 800 github stars.

Selected Papers

† indicates equal contribution. Denotes student I mentored.

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
Kevin QH. Lin, Batu El, Yuhong Shi, Pan Lu, Philip Torr, James Zou.

Preprint
ACM CAIS 2026 workshop Spotlight
[project] [paper] [code]

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin QH. Lin†, Yuhao Zheng†, Hangyu Ran†, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex JP. Wang.

Preprint
CVPR 2026 visual concepts workshop Oral
[project] [paper] [code] [demo] [twitter]
#1 Huggingface daily paper.

Egocentric Video-Language Pretraining
Kevin QH. Lin, Alex JP. Wang, M. Soldan, M. Wray, R. Yan, Eric ZC. Xu, D. Gao, R. Tu, W. Zhao, W. Kong, C. Cai, H. Wang, D. Damen, B. Ghanem, W. Liu, Mike Z. Shou.

NeurIPS 2022 Spotlight (1.7%)
[project] [paper] [EgoVLPv2] [code] [poster] [twitter] [media]
CVPR EgoVis Distinguished Paper Award.
PREMIA Best Student Paper Award, Gold Award.
Double champions in Ego4D & Epic-Kitchens CVPR 2022 challenges.

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin QH. Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan WX. Lei, Lijuan Wang, Mike Z. Shou.

CVPR 2025
NeurIPS 2024 open-world agents workshop Oral
[paper] [code] [huggingface] [dataset] [demo] [twitter]
#1 Huggingface daily paper.
Outstanding Paper Award, NeurIPS 2024 OWA workshop.
The model has been downloaded over 280,000 times. 1.8K github stars.

VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Kevin QH. Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Z. Shou.

NeurIPS 2025 Spotlight
[project] [paper] [code] [twitter]

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Kevin QH. Lin, Mike Z. Shou.

CVPR 2025
[paper] [code] [twitter]
580 github stars.

UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin QH. Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex JP. Wang, Rui Yan, Mike Z. Shou.

ICCV 2023
[paper] [code] [demo] [twitter]
370 github stars. 300 citations.

Learning Video Context as Interleaved Multimodal Sequences
Kevin QH. Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Z. Shou.

ECCV 2024
[paper] [code]

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Wei Pang†, Kevin QH. Lin†✉, Xiangru Jian†, Xi He✉, Philip Torr.

NeurIPS 2025
ICML 2025 multi-agent systems workshop Oral
[project] [paper] [code] [datasets] [demo] [poster] [twitter]
3.7K github stars. 1.2K twitter likes.

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Ye Liu†, Kevin QH. Lin†, Chang Wen Chen, Mike Z. Shou.

ICLR 2026
NeurIPS 2025 LAW workshop Spotlight
[project] [paper] [code] [dataset] [demo] [twitter]
330 github stars.

ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
Siyuan Hu†, Kevin QH. Lin†, Mike Z. Shou.

CVPR 2026
[project] [paper] [code] [huggingface] [dataset]

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Jiaqi Wang†, Kevin QH. Lin†, James Cheng, Mike Z. Shou.

NeurIPS 2025
[paper] [code] [huggingface] [twitter]

Paper2Video: Automatic Video Generation from Scientific Papers
Zeyu Zhu†, Kevin QH. Lin†, Mike Z. Shou.

CVPR 2026 workshop
[project] [paper] [code] [dataset] [twitter]
#2 Huggingface daily paper.
2.3K github stars. 1M twitter views. Highlighted by YC Hacker News.

Code2Video: A Code-centric Paradigm for Educational Video Generation
Yanzhe Chen†, Kevin QH. Lin†, Mike Z. Shou.

ICML 2026
[project] [paper] [code] [dataset] [twitter]
1.8K github stars.

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Mingyu Ouyang, Kevin QH. Lin, Mike Z. Shou, Hwee Tou Ng.

CVPR 2026
[project] [paper] [code]
CVPR Compute Transparency Champion award.

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie†, Weijia Mao†, Zechen Bai†, David JH. Zhang†, Weihao Wang, Kevin QH. Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Z. Shou.

ICLR 2025
[project] [paper] [code] [huggingface] [demo] [twitter]
Most Influential ICLR Papers #4
Featured in MIT course “6.S978: Deep Generative Models”
1.9K github stars. 600 citations.

VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin QH. Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Z. Shou.

CVPR 2024
[project] [paper] [VideoLLM-MoD] [code] [dataset] [twitter]
660 github stars. 200 citations.

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin QH. Lin, Jae Won Cho, Yale Song, Juho Kim.

CVPR 2026
CVPR 2026 MMRAgI workshop Oral
[project] [paper] [dataset]
Outstanding Paper Award, CVPR 2026 MMRAgI workshop

AssistGPT: Towards Multi-modal Agent for Human-Centric AI Assistant
Difei Gao, Siyuan Hu, Kevin QH. Lin, Mike Z. Shou.

ACMMM 2024 Human-centric Multimedia Analysis
[project] [paper] [twitter]
Best Demo Paper Award

Honors

Forbes 30 Under 30 Asia (Healthcare & Science)

2026
Outstanding Paper Award, CVPR MMRAgI workshop

2026
Tinker Research Grant, Thinking Machines Lab

2025
DAAD AINeT Fellowship

2025
CVPR Doctoral Consortium

2025
Outstanding Paper Award, NeurIPS Open-World Agents

2024
NeurIPS Top Reviewers

2024
Best Demo Paper Award, ACM Multimedia HCMA

2024
CVPR Egocentric Vision (EgoVis) Distinguished Paper Award

2024
CVPR Outstanding Reviewers (Top 2%)

2024
PREMIA Best Student Paper Award, Gold Award

2023
NeurIPS Scholar Award

2022
Tencent Rhino-Bird Research Scholarship, Second Prize

2022
1st Place on Ego4D - Object State Change Classification Challenge, CVPR

2022
1st Place on EPIC-Kitchens - Multi-Instance Retrieval Challenge, CVPR

2022
Show Lab Annual Award

2022, 2024
China National Scholarship

2018, 2021

Service

Area Chair: NeurIPS 2025, NeurIPS 2026, ICML 2026 AI4Science.
Workshop Organizer: Open Multimodal Gathering @ NUS; Multimodal Video Agent @ CVPR 25; Bridging Vision, Language, and Action @ CVPR 26.
Conference Reviewer: CVPR (2024 Outstanding Reviewers), ICCV, ECCV, NeurIPS (2024 Top Reviewers), ICML, ICLR, etc.
Journal Reviewer: TPAMI, IJCV, TMLR, TNNLS, TMM, etc.
Guest Lecture: Multimodal Agent @ NUS CS6212; Multimodal Agent @ NUS EE6934
Teaching Assistant: EE6934, EE6733, EE4212
Co-organizer of The AI Talks.