[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
TL;DR: A Multi-modal Benchmark for Visual-centric GUI Automation from Instructional Videos.

Visual-centric software and tasks: VideoGUI focuses on professional and novel software, such as Premiere Pro (PR) and After Effects (AE) for video editing, or Stable Diffusion and Runway for visual creation. In addition, task queries emphasize visual previews rather than textual instructions.
Instructional videos with human demonstration: We source novel tasks from high-quality instructional videos, and annotators replicate the demonstrations to reproduce the effects.
Hierarchical planning and actions: We provide detailed annotations with planning procedures and recorded actions for hierarchical evaluation.
VideoGUI/
└── DaVinci/                                  # software
    └── DV_8/                                 # task
        └── keylog/                           # recording
            ├── 2024-05-27_11-02-22-126188/   # mid-level
            ├── 2024-05-27_11-03-06-590299/
            ├── 2024-05-27_11-30-30-996960/
            └── 2024-05-27_11-05-04-143397/
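As a rough illustration of how this layout could be traversed, the following sketch walks software, task, and mid-level directories using only the hierarchy shown in the tree. The `Recording` type and the `iter_recordings` helper are hypothetical names introduced here and are not part of the VideoGUI release. The sketch also assumes that every task stores its recordings under a `keylog/` subdirectory, as `DV_8` does.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Recording(NamedTuple):
    """One mid-level recording directory (hypothetical helper type)."""
    software: str    # e.g. "DaVinci"
    task: str        # e.g. "DV_8"
    mid_level: Path  # timestamped directory under keylog/


def iter_recordings(root: Path) -> Iterator[Recording]:
    """Yield one Recording per mid-level directory found under root.

    Assumes the layout root/<software>/<task>/keylog/<mid-level>/.
    Tasks without a keylog/ directory are skipped.
    """
    for software_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for task_dir in sorted(p for p in software_dir.iterdir() if p.is_dir()):
            keylog_dir = task_dir / "keylog"
            if not keylog_dir.is_dir():
                continue
            for mid_dir in sorted(p for p in keylog_dir.iterdir() if p.is_dir()):
                yield Recording(software_dir.name, task_dir.name, mid_dir)
```

Calling `list(iter_recordings(Path("VideoGUI")))` on a local copy would return the recordings sorted by name, which for these timestamped directory names is also chronological order.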
Each task directory contains:
- a .drp file for DaVinci.

Each mid-level directory contains:
- a .mkv video file, which is the screen recording.
- a _full.json file that stores the corresponding action metadata.
- an arranged/ directory, which stores the screenshots in order.

If you want to set up the online environment, refer to the tutorial by GUI-Thinker.
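To make the per-directory contents described above concrete, here is a minimal sketch that inspects one mid-level directory. It rests on several assumptions that the source does not confirm: that the video matches `*.mkv`, that the metadata file name ends in `_full.json`, and that sorting the file names in `arranged/` reproduces the intended screenshot order. `summarize_mid_level` is a hypothetical helper, and since the JSON schema is not documented here, the sketch reports only the top-level JSON type.

```python
import json
from pathlib import Path
from typing import Any, Dict


def summarize_mid_level(mid_dir: Path) -> Dict[str, Any]:
    """Summarize the files found in one mid-level directory.

    File patterns are assumptions: "*.mkv" for the screen recording,
    "*_full.json" for the action metadata, "arranged/" for screenshots.
    """
    videos = sorted(mid_dir.glob("*.mkv"))
    metadata_paths = sorted(mid_dir.glob("*_full.json"))

    arranged_dir = mid_dir / "arranged"
    # Sorting by file name is an assumption about how "in order" is encoded.
    screenshots = (
        sorted(p.name for p in arranged_dir.iterdir() if p.is_file())
        if arranged_dir.is_dir()
        else []
    )

    metadata = None
    if metadata_paths:
        with metadata_paths[0].open(encoding="utf-8") as f:
            metadata = json.load(f)

    return {
        "video": videos[0].name if videos else None,
        # Only the top-level JSON type is reported; the schema is not assumed.
        "metadata_type": type(metadata).__name__ if metadata is not None else None,
        "screenshots": screenshots,
    }
```

A summary like this is mainly useful as a sanity check after downloading the data, for example to spot mid-level directories where the video or the metadata file is missing.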
If you find our work helpful, please kindly consider citing our paper. Thank you!
@inproceedings{linvideogui,
  title={VideoGUI: A Benchmark for GUI Automation from Instructional Videos},
  author={Lin, Kevin Qinghong and Li, Linjie and Gao, Difei and Wu, Qinchen and Yan, Mingyi and Yang, Zhengyuan and Wang, Lijuan and Shou, Mike Zheng},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024}
}