Dataset Sources and Licenses

This document describes the datasets used for benchmarking the LLM RAG Data Structures Optimizer. All datasets are publicly available and suitable for research use.

Datasets

Datasets with Published Benchmark Results

We benchmark on three publicly available datasets with published results:

1. BEIR FIQA (Financial Question Answering)

Source: BEIR Paper | Hugging Face Datasets

Description: Financial question-answering dataset from the BEIR benchmark suite. 50,000 documents with financial Q&A pairs. Used as the primary evaluation dataset in our research.

License: Varies by task. Most BEIR tasks use CC-BY or similar open licenses. Check individual task licenses.

Download:

python scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa

Citation:

Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

2. Amazon Reviews 2023 (McAuley Lab)

Source: Hugging Face - McAuley-Lab/Amazon-Reviews-2023

Description: Large corpus of Amazon product reviews with metadata (ratings, categories, product IDs). Excellent for e-commerce-style RAG workloads. Benchmark results available for 10k subset.

License: CC BY 4.0

Download:

python scripts/download_corpus.py --source amazon23 --output data/raw/amazon23 --limit 500000

Note: The full dataset is very large (>100M reviews). Use --limit to download a manageable subset. Benchmark results use a 10k document subset.

3. MS MARCO (Microsoft Machine Reading Comprehension)

Source: MS MARCO Datasets

Description: Large-scale passage ranking dataset with 8.8M passages and 1M queries. Widely used as a canonical information retrieval benchmark. Benchmark results available for 10k subset.

License: Research use only. See MS MARCO Terms for details.

Download:

python scripts/download_corpus.py --source msmarco --output data/raw/msmarco

Citation:

Bajaj, P., et al. (2016). MS MARCO: A human generated machine reading comprehension dataset.

Additional Available Datasets

The following datasets are available in the codebase but do not yet have published benchmark results:

4. Yelp Open Dataset

Source: Yelp Open Dataset

Description: Business listings and reviews from Yelp. Useful for local business and review-based RAG.

License: See the Yelp Dataset License; research use is allowed.

Download:

# First accept license at https://www.yelp.com/dataset/download
python scripts/download_corpus.py --source yelp --output data/raw/yelp

5. Wikipedia (English)

Source: Wikimedia Downloads

Description: English Wikipedia pages-articles dump. Broad factual corpus for general knowledge RAG.

License: CC BY-SA 3.0 and GFDL

Download:

python scripts/download_corpus.py --source wikipedia --output data/raw/wikipedia

Note: The latest dump is ~20GB compressed. The download script extracts plain text and article titles.

6. Common Crawl (Optional)

Source: Common Crawl | cc-downloader

Description: Web-scale corpus from billions of web pages. Use for large-scale testing.

License: Public domain / various site licenses

Download:

# Be respectful of bandwidth - use specific months
python scripts/download_corpus.py --source commoncrawl --cc-month CC-MAIN-2025-14 --output data/raw/cc --limit 10M

Note: Common Crawl is extremely large. Use --limit and specific months for reproducible, manageable subsets.

Data Format

All datasets are normalized to JSONL format:

{"id": "doc_123", "text": "Document text content...", "meta": {"field1": "value1", "field2": 42}}

Each line contains:

  • id: Unique document identifier
  • text: Main text content
  • meta: Optional metadata (ratings, categories, timestamps, etc.)
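As a sketch, records in this format can be read and validated with a few lines of Python. The reader below follows the field names shown above (`id`, `text`, `meta`); the validation itself is an illustration, not part of the repository's pipeline:

```python
import json
from typing import Iterator

def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one document per line from a normalized JSONL corpus."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            doc = json.loads(line)
            # "id" and "text" are required by the format above; "meta" is optional.
            if "id" not in doc or "text" not in doc:
                raise ValueError(f"line {line_no}: missing 'id' or 'text'")
            yield doc
```

Streaming one line at a time keeps memory flat even for multi-gigabyte corpora such as MS MARCO.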

Checksums

Dataset checksums are stored in data/dataset_cards/ as YAML files:

name: amazon_reviews_2023
source: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023
license: CC BY 4.0
sha256: <checksum>
size_bytes: <size>
download_date: 2024-10-30
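The `sha256` and `size_bytes` fields of a dataset card can be produced with a short helper like the following (the function name is illustrative, not one of the repository's scripts):

```python
import hashlib
import os

def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> dict:
    """Compute the SHA256 digest and byte size of a file for a dataset card.

    Hashes in 1 MiB chunks so arbitrarily large corpus files can be
    fingerprinted without loading them into memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {"sha256": digest.hexdigest(), "size_bytes": os.path.getsize(path)}
```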

Quick Start

Download All Datasets

# Create directories
mkdir -p data/raw data/processed data/indices data/embeddings data/dataset_cards

# Download datasets (start with smaller ones)
python scripts/download_corpus.py --source beir:fiqa --output data/raw/beir/fiqa
python scripts/download_corpus.py --source amazon23 --output data/raw/amazon23 --limit 500000
python scripts/download_corpus.py --source msmarco --output data/raw/msmarco

Prepare Embeddings

python scripts/prepare_embeddings.py \
    --input data/raw/beir/fiqa/corpus.jsonl \
    --output data/embeddings/fiqa.npy \
    --dim 384 \
    --seed 42
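The determinism that `--seed 42` provides can be illustrated with seeded placeholder vectors. This is a sketch of the seed-fixed convention only, not the script's actual embedding model:

```python
import numpy as np

def placeholder_embeddings(n_docs: int, dim: int = 384, seed: int = 42) -> np.ndarray:
    """Generate reproducible random unit vectors: same seed -> same matrix."""
    rng = np.random.default_rng(seed)
    vecs = rng.standard_normal((n_docs, dim)).astype(np.float32)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # L2-normalize each row
    return vecs
```

Because the generator is constructed from an explicit seed, rerunning the pipeline yields byte-identical embeddings, which is what makes the benchmark numbers comparable across runs.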

Build Indices

python scripts/build_indices.py \
    --corpus data/raw/beir/fiqa/corpus.jsonl \
    --emb data/embeddings/fiqa.npy \
    --index-dir data/indices/fiqa \
    --bm25 \
    --hnsw \
    --ef 200 \
    --M 16
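For small corpora, the exact nearest-neighbor search that HNSW approximates can be sketched as a brute-force cosine top-k over the embedding matrix. A helper like this is useful as a recall reference when tuning `--ef` and `--M`; it is illustrative, not part of `build_indices.py`:

```python
import numpy as np

def topk_cosine(query: np.ndarray, doc_vecs: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k rows of doc_vecs most cosine-similar to query.

    Assumes query and the rows of doc_vecs are L2-normalized, so the dot
    product equals cosine similarity.
    """
    scores = doc_vecs @ query
    # argpartition finds the top-k in O(n); then sort only those k by score.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]
```

Comparing an HNSW index's results against this exact baseline gives recall@k, the usual quality metric for the `ef`/`M` trade-off.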

Run Benchmarks

python scripts/run_benchmarks.py \
    --corpus fiqa \
    --sizes 10k 50k 100k \
    --ef 50 100 200 \
    --M 8 16 32 \
    --repetitions 5

License Compliance

Important:

  • Always check individual dataset licenses before use
  • MS MARCO: Research use only
  • Amazon Reviews: CC BY 4.0
  • BEIR (FIQA): Varies by task, typically CC-BY or similar open licenses

Do NOT:

  • Scrape websites without permission
  • Redistribute datasets without proper attribution
  • Use datasets for commercial purposes without verifying that their licenses permit it

Reproducibility

All dataset processing is deterministic:

  • Fixed random seeds (42) for sampling and embeddings
  • SHA256 checksums for verification
  • Versioned dataset cards with download dates
  • Exact corpus sizes documented in benchmark results
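Subset sampling follows the same fixed-seed convention. A hypothetical helper showing how seed 42 yields an identical 10k subset on every run:

```python
import random

def sample_subset(doc_ids: list, size: int, seed: int = 42) -> list:
    """Deterministically sample `size` document ids: same seed -> same subset.

    Sorting the result makes the output order independent of sampling order,
    so downstream indexing sees a stable document sequence.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(doc_ids, size))
```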

Dataset Statistics

Datasets with Published Results

| Dataset | Documents | Size | License | Use Case | Benchmark Results |
|---------|-----------|------|---------|----------|-------------------|
| BEIR (FIQA) | 50,000 | ~13MB | Varies | Financial QA | Yes (10k, 25k, 50k subsets) |
| Amazon Reviews 2023 | 100M+ | ~500GB+ | CC BY 4.0 | E-commerce | Yes (10k subset) |
| MS MARCO | 8.8M passages | ~30GB | Research | IR benchmark | Yes (10k subset) |

Available Datasets (No Published Results Yet)

| Dataset | Documents | Size | License | Use Case | Status |
|---------|-----------|------|---------|----------|--------|
| Yelp | ~8M businesses | ~8GB | Yelp License | Local business | Data available, no results |
| Wikipedia | 6.7M articles | ~20GB | CC BY-SA 3.0 | General knowledge | Data available, no results |
| Common Crawl | Billions | TB+ | Public domain | Web-scale | Code available, optional |

Note: Benchmark results are available for 10k document subsets of FIQA, Amazon23, and MS MARCO. FIQA has additional results for 25k and 50k document subsets. Yelp and Wikipedia datasets are available in the codebase but do not yet have published benchmark results.

Full dataset statistics are approximate and vary by version. Benchmark results use manageable subsets for reproducible evaluation.