Build video dataset tooling for video generation using the video2dataset pipeline
AI Impact Summary
This post outlines a three-stage tooling stack for building video datasets to fine-tune video generation models. It mirrors existing image-data tooling: video2dataset handles scalable downloads, yt-dlp handles retrieval, and a multi-stage captioning and filtering pipeline (Florence-2, Qwen2.5, OCR) extracts metadata and content-quality signals. Filtering on watermark detection, aesthetic scores, and OCR text regions lets users trade dataset size against safety and usefulness; the post's example targets CogVideoX-5B fine-tuning. Adopting the stack affects data engineering and model fine-tuning workflows, adds dependencies on external models, and raises copyright and NSFW governance considerations.
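The filtering trade-off described above can be illustrated with a minimal sketch. It assumes each clip already carries per-clip scores produced by the captioning and filtering stage; the `ClipMeta` record, its field names, and the threshold values are all hypothetical and are not taken from the post or from video2dataset.

```python
from dataclasses import dataclass


@dataclass
class ClipMeta:
    """Hypothetical per-clip metadata; field names are illustrative only."""
    path: str
    watermark_prob: float   # assumed: probability that a watermark is present
    aesthetic_score: float  # assumed: higher means more visually pleasing
    ocr_text_regions: int   # assumed: number of detected on-screen text regions


def keep_clip(clip, max_watermark=0.1, min_aesthetic=5.5, max_ocr_regions=0):
    """Return True if the clip passes all three (assumed) quality thresholds."""
    return (
        clip.watermark_prob <= max_watermark
        and clip.aesthetic_score >= min_aesthetic
        and clip.ocr_text_regions <= max_ocr_regions
    )


def filter_clips(clips, **thresholds):
    """Keep only the clips that satisfy the given thresholds."""
    return [c for c in clips if keep_clip(c, **thresholds)]


clips = [
    ClipMeta("a.mp4", watermark_prob=0.02, aesthetic_score=6.1, ocr_text_regions=0),
    ClipMeta("b.mp4", watermark_prob=0.30, aesthetic_score=5.0, ocr_text_regions=1),
    ClipMeta("c.mp4", watermark_prob=0.90, aesthetic_score=6.5, ocr_text_regions=0),
]

# Strict thresholds give a smaller, cleaner set.
strict = filter_clips(clips)

# Relaxed thresholds keep more clips at the cost of lower quality.
relaxed = filter_clips(clips, max_watermark=0.5, min_aesthetic=4.0, max_ocr_regions=2)

print([c.path for c in strict])
print([c.path for c in relaxed])
```

Tightening or loosening the three thresholds is the lever that moves the dataset along the size-versus-quality trade-off; suitable values depend on the scoring models actually used and would need to be tuned on real data.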
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info