An adaptive chunking methodology for lecture videos using CLIP embeddings and SSIM to construct multimodal chunks for enhanced RAG performance.
Semantic segmentation using CLIP embeddings and SSIM to detect slide transitions and construct coherent chunks.
Curated collection of Persian and English lectures with medium-to-long durations and 50 QA pairs per video.
Efficient multimodal integration using VLMs with FAISS-based retrieval and temporal mapping.
Comprehensive evaluation using Answer Relevance, Context Relevance, and Faithfulness metrics. Multimodal (text+image) performs best.
We present EduViQA, a bilingual educational dataset comprising slide-based lecture videos with diverse topics and durations. Each video is enriched with 50 synthetic QA pairs to support RAG evaluation and training.
Dataset composition highlighting topic distribution and lecture duration proportions.
We segment videos into semantically coherent chunks by detecting slide transitions using cosine similarity of CLIP embeddings combined with SSIM (Structural Similarity Index). This approach ensures that each chunk captures a complete visual concept.
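As a concrete illustration, the sketch below samples one frame per `interval` seconds, embeds each frame with CLIP, and opens a new chunk when a weighted combination of CLIP cosine similarity and SSIM drops below a threshold. The helper names (`clip_embed`, `detect_boundaries`) and the single combined score are simplifying assumptions made here for brevity; the released `HybridChunker` exposes `alpha`, `threshold_embedding`, and `threshold_ssim` separately.

```python
# Minimal sketch of slide-transition detection, not the packaged HybridChunker itself.
import cv2
import torch
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(frame_bgr):
    """Return an L2-normalized CLIP embedding for one BGR video frame."""
    image = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)[0]

def detect_boundaries(video_path, alpha=0.6, threshold=0.85, interval=1):
    """Return chunk-boundary timestamps (seconds) where a slide transition is detected."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * interval))
    boundaries, prev_emb, prev_gray, idx = [0.0], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            emb = clip_embed(frame)
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_emb is not None:
                cos = float(emb @ prev_emb)          # semantic similarity (CLIP)
                struct = ssim(gray, prev_gray)       # structural similarity (SSIM)
                if alpha * cos + (1 - alpha) * struct < threshold:
                    boundaries.append(idx / fps)     # new slide, start a new chunk
            prev_emb, prev_gray = emb, gray
        idx += 1
    cap.release()
    return boundaries
```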
For each chunk, we select three keyframes: the maximum entropy frame (capturing the most visual information), the first frame, and the last frame. This balances information density with temporal coverage.
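A minimal sketch of this selection step, assuming each visual chunk is available as a list of decoded BGR frames (the function names are illustrative, not the package API):

```python
# Keyframe selection: first frame, maximum-entropy frame, and last frame of a chunk.
import cv2
import numpy as np

def frame_entropy(frame_bgr):
    """Shannon entropy of the grayscale intensity histogram, in bits."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_keyframes(chunk_frames):
    """Return [first, max-entropy, last] frames for one visual chunk."""
    max_idx = int(np.argmax([frame_entropy(f) for f in chunk_frames]))
    return [chunk_frames[0], chunk_frames[max_idx], chunk_frames[-1]]
```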
Audio transcripts are recursively split into text chunks using semantic boundaries. These chunks are temporally mapped to visual chunks to create multimodal alignments.
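The recursive splitting can be done with any standard splitter (e.g. LangChain's `RecursiveCharacterTextSplitter`); the pipeline-specific part is the temporal mapping. The sketch below assumes transcript chunks carry `start`/`end` timestamps (as Whisper-style segments do) and assigns each one to the visual chunk it overlaps most in time; the field names are assumptions for illustration.

```python
# Temporal alignment of transcript chunks to visual chunks by maximum time overlap.
def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def align_text_to_visual(text_chunks, visual_spans):
    """
    text_chunks : list of {"text": str, "start": float, "end": float}
    visual_spans: list of (start, end) pairs returned by the video chunker
    """
    alignments = []
    for tc in text_chunks:
        best = max(range(len(visual_spans)),
                   key=lambda i: overlap(tc["start"], tc["end"], *visual_spans[i]))
        alignments.append({"text": tc["text"], "visual_chunk": best})
    return alignments
```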
We use OpenAI's text-embedding-3-large model to generate embeddings for transcript chunks. FAISS enables efficient cosine similarity search, retrieving top-3 transcript chunks and mapping them to corresponding visual chunks.
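A minimal sketch of this retrieval step, assuming an OpenAI API key is configured; the `retrieve` helper and the placeholder chunk texts are illustrative:

```python
# Embed transcript chunks with text-embedding-3-large and search them with FAISS.
# Inner product over L2-normalized vectors is equivalent to cosine similarity.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    vecs = np.array([d.embedding for d in resp.data], dtype="float32")
    faiss.normalize_L2(vecs)  # in-place normalization so inner product == cosine
    return vecs

# Transcript chunks produced by the alignment step (placeholders for illustration).
chunk_texts = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = faiss.IndexFlatIP(3072)  # text-embedding-3-large returns 3072-d vectors
index.add(embed(chunk_texts))

def retrieve(query, k=3):
    """Return the top-k transcript chunks with their cosine-similarity scores."""
    scores, ids = index.search(embed([query]), k)
    return [(chunk_texts[i], float(s)) for i, s in zip(ids[0], scores[0])]
```

Each retrieved transcript chunk is then mapped back to its aligned visual chunk (and its keyframes) before being passed to the VLM.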
We evaluate using RAGAS metrics (Answer Relevance, Context Relevance, Faithfulness) across three scenarios: text-only, image-only, and multimodal (image+text) retrieval.
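A sketch of how one evaluation record could be scored with the `ragas` library; metric names vary across ragas releases (older versions expose `context_relevancy`, newer ones replace it with `context_precision`), and the LLM-backed metrics require an OpenAI key:

```python
# Score a single question/answer/contexts triple with RAGAS metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, faithfulness

records = {
    "question": ["What are the key principles of computer architecture?"],
    "answer": ["Computer architecture principles include performance, cost, and compatibility."],
    "contexts": [["<top-3 retrieved transcript chunks for this question>"]],
}
result = evaluate(
    Dataset.from_dict(records),
    metrics=[answer_relevancy, context_relevancy, faithfulness],
)
print(result)  # e.g. {'answer_relevancy': ..., 'context_relevancy': ..., 'faithfulness': ...}
```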
Pipeline demonstrating adaptive chunking with CLIP embeddings, frame selection, and audio-transcript alignment for optimal RAG integration.
| Method | Cost | Semantic Awareness | Robustness | Notes |
|---|---|---|---|---|
| Heuristic | Low | ❌ | Low | Fixed time intervals |
| CLIP | Medium | ✅ | Medium | Semantic embeddings |
| BLIP | High | ✅ | High | Vision-language model |
| Saliency | Medium | ⚠️ | Medium | Attention-based |
| SSIM | Low | ❌ | Medium | Image similarity |
| Frame Diff | Low | ❌ | Low | Pixel difference |
| Ours (CLIP+SSIM) | Medium | ✅ | High | Best performance |
Our adaptive chunking approach consistently outperforms simple slicing across all RAGAS metrics. The multimodal (image+text) scenario achieves the strongest results, demonstrating the value of combining visual and textual information in RAG pipelines.
| Chunking | Scenario | GPT-4o AR | GPT-4o CR | GPT-4o F | Llama 3.2 AR | Llama 3.2 CR | Llama 3.2 F |
|---|---|---|---|---|---|---|---|
| Adaptive (Ours) | Image+Text | 0.87 | 0.82 | 0.91 | 0.85 | 0.79 | 0.88 |
| Adaptive (Ours) | Text-only | 0.81 | 0.75 | 0.85 | 0.78 | 0.72 | 0.82 |
| Adaptive (Ours) | Image-only | 0.74 | 0.68 | 0.78 | 0.71 | 0.65 | 0.75 |
| Simple Slicing | Image+Text | 0.75 | 0.70 | 0.80 | 0.72 | 0.67 | 0.77 |
| Simple Slicing | Text-only | 0.72 | 0.67 | 0.76 | 0.69 | 0.64 | 0.73 |
| Simple Slicing | Image-only | 0.68 | 0.63 | 0.72 | 0.65 | 0.60 | 0.69 |

AR = Answer Relevance, CR = Context Relevance, F = Faithfulness.
Install VideoRAC from PyPI and follow the steps below to chunk videos and generate Q&A.
# ⚙️ Installation
pip install VideoRAC
# 🚀 Usage Example — 1️⃣ Hybrid Chunking
from VideoRAC.Modules import HybridChunker
chunker = HybridChunker(
    clip_model="openai/clip-vit-base-patch32",  # CLIP backbone for frame embeddings
    alpha=0.6,                 # weighting between the CLIP-embedding and SSIM signals
    threshold_embedding=0.85,  # cosine-similarity threshold on CLIP embeddings
    threshold_ssim=0.8,        # SSIM threshold for detecting a slide transition
    interval=1,                # frame sampling interval (seconds)
)
# Returns: chunks (list), timestamps (start/end pairs), and total duration (seconds)
chunks, timestamps, duration = chunker.chunk("lecture.mp4")
# Optional: print evaluation metrics or visualizations
chunker.evaluate()
# 🚀 Usage Example — 2️⃣ Q&A Generation
# Make sure your OpenAI API key is set:
# export OPENAI_API_KEY=your_api_key_here
from VideoRAC.Modules import VideoQAGenerator
def my_llm_fn(messages):
    # Example using the OpenAI client (replace with your preferred LLM)
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    return resp.choices[0].message.content
urls = ["https://www.youtube.com/watch?v=2uYu8nMR5O4"]
qa = VideoQAGenerator(video_urls=urls, llm_fn=my_llm_fn)
# Processes input videos and generates Q&A
qa.process_videos()
Each EduViQA record pairs a video's metadata with its QA annotations, for example:
{
  "video_id": "lec_001_persian",
  "language": "persian",
  "duration_seconds": 1800,
  "qa_pairs": [
    {
      "id": "qa_001",
      "question": "What are the key principles of computer architecture?",
      "answer": "Computer architecture principles include performance, cost, and compatibility...",
      "context_video_segment": "0-120",
      "context_frames": [0, 45, 120],
      "metadata": {
        "question_type": "conceptual",
        "difficulty": "medium"
      }
    }
  ]
}
Department of Computer Engineering, University of Isfahan
We thank Mehran Rezaei, Ajarn Olli, Chris Drew, and Rick Hill for providing educational video content that made this dataset possible.
If you use Video-RAC or the EduViQA dataset in your research, please cite our work:
@INPROCEEDINGS{10967455,
  author={Hemmat, Arshia and Vadaei, Kianoosh and Shirian, Melika and Heydari, Mohammad Hassan and Fatemi, Afsaneh},
  booktitle={2025 29th International Computer Conference, Computer Society of Iran (CSICC)},
  title={Adaptive Chunking for VideoRAG Pipelines with a Newly Gathered Bilingual Educational Dataset},
  year={2025},
  pages={1-7},
  keywords={Measurement;Visualization;Large language models;Pipelines;Retrieval augmented generation;Education;Question answering (information retrieval);Multilingual;Standards;Context modeling;Video QA;Datasets Preparation;Academic Question Answering;Multilingual},
  doi={10.1109/CSICC65765.2025.10967455}
}