Video-RAC: Retrieval Adaptive Chunking for Lecture Video RAG

An adaptive chunking methodology for lecture videos using CLIP embeddings and SSIM to construct multimodal chunks for enhanced RAG performance.

VideoQA · RAG · Multimodal · Dataset (EN/Persian)

  • +15% RAGAS score
  • Bilingual: Persian & English
  • Adaptive CLIP-SSIM chunking

Key Contributions

Adaptive Video Chunking

Semantic segmentation using CLIP embeddings and SSIM to detect slide transitions and construct coherent chunks.

Bilingual Educational Dataset

Curated collection of Persian and English lectures of mid and long duration, with 50 QA pairs per video.

Low-Resource Friendly Pipeline

Efficient multimodal integration using VLMs with FAISS-based retrieval and temporal mapping.

RAGAS Evaluation

Comprehensive evaluation using Answer Relevance, Context Relevance, and Faithfulness metrics. Multimodal (text+image) performs best.

Dataset

We present EduViQA, a bilingual educational dataset comprising slide-based lecture videos with diverse topics and durations. Each video is enriched with 50 synthetic QA pairs to support RAG evaluation and training.

  • 20 videos (10 Persian, 10 English)
  • 50 QA pairs per video
  • 40% mid-duration lectures (20-40 minutes)

Topics Covered

Computer Architecture, Data Structures, System Dynamics and Control, Teaching Skills, Descriptive Research, Regions in Human Geography, Differentiated Instruction, Business
Dataset composition showing topic distribution and lecture duration proportions.

Method

Adaptive Chunking (CLIP + SSIM)

We segment videos into semantically coherent chunks by detecting slide transitions using cosine similarity of CLIP embeddings combined with SSIM (Structural Similarity Index). This approach ensures that each chunk captures a complete visual concept.
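
For illustration, here is a minimal sketch of the boundary test, assuming frames arrive as RGB NumPy arrays and using Hugging Face's CLIP implementation together with scikit-image's SSIM. The weighting and thresholds echo the alpha, threshold_embedding, and threshold_ssim parameters of HybridChunker shown in the Usage section; the exact scoring inside the library may differ.

python
# Illustrative CLIP+SSIM transition check (not the exact HybridChunker internals).
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(frame: np.ndarray) -> np.ndarray:
    """Unit-norm CLIP image embedding for an RGB frame of shape (H, W, 3)."""
    inputs = processor(images=Image.fromarray(frame), return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)[0].numpy()
    return emb / np.linalg.norm(emb)

def is_slide_transition(prev: np.ndarray, curr: np.ndarray,
                        alpha: float = 0.6,
                        threshold_embedding: float = 0.85,
                        threshold_ssim: float = 0.8) -> bool:
    """Flag a chunk boundary when the weighted CLIP/SSIM similarity drops below threshold."""
    cos_sim = float(clip_embed(prev) @ clip_embed(curr))
    gray_prev = np.dot(prev[..., :3], [0.299, 0.587, 0.114])
    gray_curr = np.dot(curr[..., :3], [0.299, 0.587, 0.114])
    ssim_score = ssim(gray_prev, gray_curr, data_range=255.0)
    # Assumed combination rule: weighted similarity compared against a weighted threshold.
    combined = alpha * cos_sim + (1 - alpha) * ssim_score
    return combined < alpha * threshold_embedding + (1 - alpha) * threshold_ssim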

Frame Selection by Entropy

For each chunk, we select three keyframes: the maximum entropy frame (capturing the most visual information), the first frame, and the last frame. This balances information density with temporal coverage.
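
A small sketch of the entropy criterion, assuming chunk frames are RGB NumPy arrays and that entropy is taken over the grayscale intensity histogram (our assumption; the library may compute it differently).

python
# Pick the first, maximum-entropy, and last frames of a chunk as keyframes.
import numpy as np

def frame_entropy(frame: np.ndarray) -> float:
    """Shannon entropy (bits) of the grayscale intensity histogram."""
    gray = np.dot(frame[..., :3], [0.299, 0.587, 0.114]).astype(np.uint8)
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    probs = hist / hist.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

def select_keyframes(chunk_frames: list) -> list:
    """Indices of the keyframes: first frame, max-entropy frame, last frame."""
    max_idx = int(np.argmax([frame_entropy(f) for f in chunk_frames]))
    return sorted({0, max_idx, len(chunk_frames) - 1})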

Audio Transcript Chunking

Audio transcripts are recursively split into text chunks using semantic boundaries. These chunks are temporally mapped to visual chunks to create multimodal alignments.
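
A sketch of the temporal mapping step, assuming transcript chunks carry start/end timestamps (e.g., from an ASR model such as Whisper) and that visual chunk spans come from the chunker; the recursive text splitting itself can be done with any standard recursive splitter.

python
# Map a timestamped transcript chunk to the visual chunk it overlaps most.
def map_to_visual_chunk(text_start, text_end, visual_spans):
    """visual_spans: list of (start_sec, end_sec); returns index of best-overlapping span."""
    def overlap(span):
        v_start, v_end = span
        return max(0.0, min(text_end, v_end) - max(text_start, v_start))
    return max(range(len(visual_spans)), key=lambda i: overlap(visual_spans[i]))

# Example: a transcript chunk spanning 95-130 s maps to the visual chunk covering 90-180 s.
spans = [(0.0, 90.0), (90.0, 180.0), (180.0, 300.0)]
assert map_to_visual_chunk(95.0, 130.0, spans) == 1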

Retrieval

We use OpenAI's text-embedding-3-large model to generate embeddings for transcript chunks. FAISS enables efficient cosine similarity search, retrieving top-3 transcript chunks and mapping them to corresponding visual chunks.
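
A minimal retrieval sketch, assuming the official openai and faiss packages and an index built over the transcript chunks; retrieved indices are then mapped back to visual chunks via the temporal alignment described above.

python
# Embed transcript chunks with text-embedding-3-large and search with FAISS.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    vecs = np.array([d.embedding for d in resp.data], dtype="float32")
    faiss.normalize_L2(vecs)  # normalize so inner product equals cosine similarity
    return vecs

transcript_chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = faiss.IndexFlatIP(3072)  # text-embedding-3-large produces 3072-d vectors
index.add(embed(transcript_chunks))

scores, ids = index.search(embed(["What is pipelining?"]), 3)  # top-3 transcript chunks
# ids[0] can then be mapped to the corresponding visual chunks via temporal alignment.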

Generation & Evaluation

We evaluate using RAGAS metrics (Answer Relevance, Context Relevance, Faithfulness) across three scenarios: text-only, image-only, and multimodal (image+text) retrieval.
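
A scoring sketch with the ragas library; column and metric names vary across ragas versions, and evaluation calls an OpenAI model by default, so treat this as illustrative rather than the exact evaluation script.

python
# Score a toy example with RAGAS answer relevancy and faithfulness.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What are the key principles of computer architecture?"],
    "contexts": [["Retrieved transcript chunk text ..."]],
    "answer":   ["Computer architecture principles include performance, cost, ..."],
})

result = evaluate(data, metrics=[answer_relevancy, faithfulness])  # needs OPENAI_API_KEY
print(result)  # per-metric scores averaged over the dataset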

Pipeline overview: Video Input → CLIP+SSIM Chunking → Keyframe Selection → FAISS Retrieval → RAG Generation

Pipeline demonstrating adaptive chunking with CLIP embeddings, frame selection, and audio-transcript alignment for optimal RAG integration.

Algorithms Compared

| Method | Cost | Semantic Awareness | Robustness | Notes |
|---|---|---|---|---|
| Heuristic | Low | | Low | Fixed time intervals |
| CLIP | Medium | | Medium | Semantic embeddings |
| BLIP | High | | High | Vision-language model |
| Saliency | Medium | ⚠️ | Medium | Attention-based |
| SSIM | Low | | Medium | Image similarity |
| Frame Diff | Low | | Low | Pixel difference |
| Ours (CLIP+SSIM) | Medium | | High | Best performance |

Results

Our adaptive chunking approach consistently outperforms simple slicing across all RAGAS metrics. The multimodal (image+text) scenario achieves the strongest results, demonstrating the value of combining visual and textual information in RAG pipelines.

| Chunking | Scenario | GPT-4o AR | GPT-4o CR | GPT-4o F | Llama 3.2 AR | Llama 3.2 CR | Llama 3.2 F |
|---|---|---|---|---|---|---|---|
| Adaptive (Ours) | Image+Text | 0.87 | 0.82 | 0.91 | 0.85 | 0.79 | 0.88 |
| Adaptive (Ours) | Text-only | 0.81 | 0.75 | 0.85 | 0.78 | 0.72 | 0.82 |
| Adaptive (Ours) | Image-only | 0.74 | 0.68 | 0.78 | 0.71 | 0.65 | 0.75 |
| Simple Slicing | Image+Text | 0.75 | 0.70 | 0.80 | 0.72 | 0.67 | 0.77 |
| Simple Slicing | Text-only | 0.72 | 0.67 | 0.76 | 0.69 | 0.64 | 0.73 |
| Simple Slicing | Image-only | 0.68 | 0.63 | 0.72 | 0.65 | 0.60 | 0.69 |

AR = Answer Relevance, CR = Context Relevance, F = Faithfulness.

Faithfulness comparison across chunking strategies and retrieval scenarios.

Key Findings

  • ✅ Adaptive chunking outperforms simple slicing by +12-15% across all metrics
  • ✅ Multimodal (image+text) retrieval consistently achieves the highest RAGAS scores
  • ✅ Our approach demonstrates strong performance with both GPT-4o and Llama 3.2
  • ✅ The combination of CLIP embeddings and SSIM provides optimal chunk boundary detection

Usage

Install VideoRAC from PyPI and follow the steps below to chunk videos and generate Q&A.

bash / python
# ⚙️ Installation
pip install VideoRAC

# 🚀 Usage Example — 1️⃣ Hybrid Chunking
from VideoRAC.Modules import HybridChunker

chunker = HybridChunker(
    clip_model="openai/clip-vit-base-patch32",
    alpha=0.6,
    threshold_embedding=0.85,
    threshold_ssim=0.8,
    interval=1,
)

# Returns: chunks (list), timestamps (start/end pairs), and total duration (seconds)
chunks, timestamps, duration = chunker.chunk("lecture.mp4")

# Optional: print evaluation metrics or visualizations
chunker.evaluate()


# 🚀 Usage Example — 2️⃣ Q&A Generation
# Make sure your OpenAI API key is set:
# export OPENAI_API_KEY=your_api_key_here

from VideoRAC.Modules import VideoQAGenerator

def my_llm_fn(messages):
    # Example using OpenAI client (replace with your preferred LLM)
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    return resp.choices[0].message.content

urls = ["https://www.youtube.com/watch?v=2uYu8nMR5O4"]
qa = VideoQAGenerator(video_urls=urls, llm_fn=my_llm_fn)

# Processes input videos and generates Q&A
qa.process_videos()

Resources

Dataset Schema

json
{
  "video_id": "lec_001_persian",
  "language": "persian",
  "duration_seconds": 1800,
  "qa_pairs": [
    {
      "id": "qa_001",
      "question": "What are the key principles of computer architecture?",
      "answer": "Computer architecture principles include performance, cost, and compatibility...",
      "context_video_segment": "0-120",
      "context_frames": [0, 45, 120],
      "metadata": {
        "question_type": "conceptual",
        "difficulty": "medium"
      }
    }
  ]
}
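
A small loading sketch for records that follow this schema; the file name here is hypothetical.

python
# Load one EduViQA record and parse its segment string into start/end seconds.
import json

with open("lec_001_persian.json", encoding="utf-8") as f:  # hypothetical file name
    record = json.load(f)

for qa in record["qa_pairs"]:
    start, end = (int(t) for t in qa["context_video_segment"].split("-"))
    print(qa["id"], f"{start}-{end} s", qa["question"], qa["context_frames"])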

Team & Acknowledgments

Authors

Department of Computer Engineering, University of Isfahan

Arshia Hemmat
Kianoosh Vadaei
Melika Shirian
Mohammad Hassan Heydari
Afsaneh Fatemi

Acknowledgments

We thank Mehran Rezaei, Ajarn Olli, Chris Drew, and Rick Hill for providing educational video content that made this dataset possible.

Citation

If you use Video-RAC or the EduViQA dataset in your research, please cite our work:

bibtex
@INPROCEEDINGS{10967455,
  author={Hemmat, Arshia and Vadaei, Kianoosh and Shirian, Melika and Heydari, Mohammad Hassan and Fatemi, Afsaneh},
  booktitle={2025 29th International Computer Conference, Computer Society of Iran (CSICC)}, 
  title={Adaptive Chunking for VideoRAG Pipelines with a Newly Gathered Bilingual Educational Dataset}, 
  year={2025},
  volume={},
  number={},
  pages={1-7},
  keywords={Measurement;Visualization;Large language models;Pipelines;Retrieval augmented generation;Education;Question answering (information retrieval);Multilingual;Standards;Context modeling;Video QA;Datasets Preparation;Academic Question Answering;Multilingual},
  doi={10.1109/CSICC65765.2025.10967455}}