Carlos Gutierrez 3d2da94ce2 Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
2025-11-06 22:07:41 -05:00

Repository Download Guide

This guide explains how to use the repository downloader scripts to automatically download openly licensed GitHub repositories for use as code training data.

Overview

The repository downloader allows you to automatically find and clone GitHub repositories based on:

  • Categories: Neovim configs, Lua repos, Bash scripts, Zsh configs, Python repos, ethical hacking tools, security tools, and all open-license repos
  • Languages: Python, JavaScript, Go, Rust, and 13 more (see Available Languages)
  • Licenses: MIT, Apache, BSD, GPL, and other open source licenses
  • Quality: Filter by minimum stars (popularity)
  • Size Limits: Downloads stop automatically once the output directory reaches the storage limit (default: 1 TB for download_all_repos.py)

Scripts

There are two scripts available:

  1. download_all_repos.py - Convenience script to download all common categories at once
  2. download_repos.py - Full-featured script with all options and flexibility

Quick Start

The easiest way to download all repository categories:

python3 download_all_repos.py

This will download:

  • 📦 Neovim configurations and plugins
  • 📦 Lua programming repositories
  • 📦 Bash/shell script repositories
  • 📦 Zsh configuration and plugins
  • 📦 Python programming repositories
  • 📦 Ethical hacking and cybersecurity tools

Default settings:

  • Max repos per category: 50
  • Min stars: 100
  • Output directory: data/repos
  • Size limit: 1 TB (1024 GB)
  • Shallow clones (faster, less disk space)

Download Specific Categories

python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50

Download All Open-License Repos

Download repositories with any open license (any language):

python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0

Download by Language

python3 download_repos.py --language python --max-repos 100

Installation

No additional dependencies are required beyond what the project already uses. The scripts rely on:

  • Python standard library (urllib, json, subprocess)
  • tqdm (already in requirements.txt)
  • git (should be installed on your system)

Available Categories

Neovim (nvim)

Neovim configuration files and plugins written in Lua.

python3 download_repos.py --categories nvim --max-repos 100

What it searches for:

  • neovim OR nvim-config OR neovim-config
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

Lua (lua)

Lua programming language repositories.

python3 download_repos.py --categories lua --max-repos 50

What it searches for:

  • Language: Lua
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

Bash (bash)

Bash and shell script repositories.

python3 download_repos.py --categories bash --max-repos 50

What it searches for:

  • Language: Shell
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

Zsh (zsh)

Zsh configuration files and plugins (Oh My Zsh, etc.).

python3 download_repos.py --categories zsh --max-repos 50

What it searches for:

  • zsh-config OR oh-my-zsh OR zsh-plugin
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

Python (python)

Python programming language repositories.

python3 download_repos.py --categories python --max-repos 100

What it searches for:

  • Language: Python
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

Ethical Hacking (hacking)

Ethical hacking and cybersecurity tools.

python3 download_repos.py --categories hacking --max-repos 100

What it searches for:

  • ethical-hacking OR cybersecurity OR penetration-testing OR security-tools OR red-team
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

Security (security)

General security and cybersecurity repositories.

python3 download_repos.py --categories security --max-repos 50

What it searches for:

  • security-tools OR cybersecurity OR penetration-testing OR red-team OR blue-team
  • MIT licensed repositories (default)
  • 100+ stars minimum (default)

All Open Licenses (all-open)

All repositories with open licenses, any language. This is useful for downloading a diverse set of repositories.

python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0

What it searches for:

  • Any open-license repository (no language filter)
  • No specific license filter (searches all open licenses)
  • 100+ stars minimum (default)

Note: This category searches broadly and may return repositories with various licenses. You can still specify --license to filter to a specific license type.
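
The filters listed for each category map naturally onto GitHub's repository search qualifiers (`language:`, `license:`, `stars:`). The following is a minimal sketch of how such a query URL could be assembled, not the script's actual implementation. The function name `build_search_url` and its parameters are hypothetical; only the endpoint and the qualifier syntax come from the public GitHub search API.

```python
from urllib.parse import urlencode

# Public GitHub endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(terms=(), language=None, license_key=None,
                     min_stars=100, per_page=50):
    """Return a repository-search URL for the given filters (hypothetical helper)."""
    parts = []
    if terms:
        # Keyword alternatives, e.g. "neovim OR nvim-config OR neovim-config".
        parts.append(" OR ".join(terms))
    if language:
        parts.append(f"language:{language}")
    if license_key:
        parts.append(f"license:{license_key}")
    parts.append(f"stars:>={min_stars}")
    params = {
        "q": " ".join(parts),
        "sort": "stars",      # most popular repositories first
        "order": "desc",
        "per_page": per_page,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Roughly what the lua category describes: Lua, MIT, 100+ stars.
    print(build_search_url(language="Lua", license_key="mit", min_stars=100))
```

Omitting `license_key` yields a query with no license qualifier, which is one plausible reading of how the all-open category searches broadly.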

Command-Line Options

download_repos.py Options

python3 download_repos.py [OPTIONS]

Options:

  • --output DIR - Output directory (default: data/repos)
  • --categories CAT1 CAT2 ... - Categories to download: nvim, lua, bash, zsh, python, hacking, security, all-open
  • --language LANG - Single language to filter by
  • --languages LANG1 LANG2 ... - Multiple languages to download
  • --license LICENSE - License type (default: mit)
  • --min-stars N - Minimum stars (default: 100)
  • --max-repos N - Maximum repos per category/language (default: 50)
  • --max-size N - Maximum total size in GB; downloading stops once it is reached (e.g., 1024.0 for 1 TB; default: no limit)
  • --full-clone - Perform a full clone instead of a shallow one (slower, but includes the full history)
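
For readers who want to see how the options above fit together, here is a hedged sketch of an equivalent `argparse` definition. It mirrors the flags and defaults documented in this section, but it is an illustration written for this guide and may differ from the parser inside download_repos.py. The name `build_parser` is hypothetical.

```python
import argparse

# Category names as documented in this guide.
CATEGORIES = ["nvim", "lua", "bash", "zsh", "python",
              "hacking", "security", "all-open"]


def build_parser():
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Download open-license GitHub repositories")
    parser.add_argument("--output", default="data/repos",
                        help="output directory")
    parser.add_argument("--categories", nargs="+", choices=CATEGORIES,
                        default=[], help="categories to download")
    parser.add_argument("--language", help="single language to filter by")
    parser.add_argument("--languages", nargs="+", default=[],
                        help="multiple languages to download")
    parser.add_argument("--license", default="mit", help="license type")
    parser.add_argument("--min-stars", type=int, default=100)
    parser.add_argument("--max-repos", type=int, default=50)
    # None stands for "no size limit" in this sketch.
    parser.add_argument("--max-size", type=float, default=None,
                        help="maximum total size in GB")
    parser.add_argument("--full-clone", action="store_true",
                        help="clone full history instead of a shallow clone")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```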

download_all_repos.py Options

python3 download_all_repos.py [OPTIONS]

Options:

  • --max-repos N - Maximum repos per category (default: 50)
  • --min-stars N - Minimum stars (default: 100)
  • --output DIR - Output directory (default: data/repos)
  • --max-size N - Maximum total size in GB (default: 1024.0 = 1 TB)
  • --full-clone - Perform a full clone instead of a shallow one

Example:

python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0

Available Licenses

  • mit (default)
  • apache-2.0
  • bsd-3-clause
  • bsd-2-clause
  • isc
  • unlicense
  • mpl-2.0
  • lgpl-2.1
  • lgpl-3.0
  • gpl-2.0
  • gpl-3.0

Available Languages

  • python
  • javascript
  • typescript
  • java
  • cpp
  • c
  • go
  • rust
  • ruby
  • php
  • swift
  • kotlin
  • scala
  • r
  • sql
  • lua
  • shell (for bash/shell scripts)

Examples

Example 1: Download All Categories (Simple)

python3 download_all_repos.py

Downloads all categories (nvim, lua, bash, zsh, python, hacking) with default settings and 1 TB size limit.

Example 2: Download All Categories with Custom Settings

python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0

Downloads all categories with:

  • 100 repos per category
  • Minimum 200 stars
  • 2 TB size limit

Example 3: Download Specific Categories

python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50

Downloads specific categories with 50 repos each.

Example 4: Download All Open-License Repos with Size Limit

python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0

Downloads up to 1000 repositories with any open license, stopping at 1 TB.

Example 5: Download High-Quality Repos

python3 download_repos.py --categories nvim lua bash zsh python hacking --min-stars 1000 --max-repos 20

Downloads only highly popular repositories (1000+ stars).

Example 6: Download Multiple Languages

python3 download_repos.py --languages python javascript go rust --max-repos 50

Downloads repositories in multiple programming languages.

Example 7: Download with Apache License

python3 download_repos.py --categories nvim --license apache-2.0 --max-repos 50

Downloads Neovim repos with Apache 2.0 license.

Example 8: Custom Output Directory

python3 download_repos.py --categories nvim lua bash zsh python hacking --output /path/to/repos

Saves repositories to a custom directory.

Example 9: Full Clone (with History)

python3 download_repos.py --categories nvim --full-clone --max-repos 10

Performs a full clone, including the complete git history (slower but more complete).

Example 10: Size-Limited Download

python3 download_repos.py --categories all-open --max-repos 2000 --max-size 512.0

Downloads repositories but stops when reaching 512 GB (0.5 TB).

Progress Tracking

The scripts include visual progress bars showing:

  • Category progress: Overall progress across all categories
  • Repository progress: Progress for each category
  • Real-time statistics: Current repo, stars, language, cloned/failed counts
  • Size tracking: Current total size and size limit (when --max-size is used)

Example output:

📊 Current directory size: 45.23 GB
📊 Size limit: 1024.00 GB
📦 Processing 6 categories...
Category: nvim: 100%|████████████| 6/6 [15:23<00:00, Size=156.78 GB, Total Cloned=300, Total Failed=2]
Cloning nvim: 45%|████████████████▌                    | 23/50 [02:15<03:45, Current=awesome-nvim, Stars=5.2k, Lang=Lua, Cloned=22, Failed=1, Size=12.45 GB]

Size limit reached:

When the size limit is reached, the script will stop downloading and show:

⚠️  Size limit reached: 1024.00 GB >= 1024.00 GB
   Stopping all downloads.

GitHub API Rate Limits

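The GitHub REST API enforces rate limits:

  • Unauthenticated: 60 requests/hour
  • Authenticated: 5,000 requests/hour
  • Search endpoints are limited separately: 10 requests/minute unauthenticated, 30 requests/minute authenticated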

Using a GitHub Token

To increase rate limits, set a GitHub Personal Access Token:

export GITHUB_TOKEN=your_token_here
python3 download_repos.py --categories nvim lua bash hacking

How to create a token:

  1. Go to GitHub Settings → Developer settings → Personal access tokens
  2. Generate new token (classic)
  3. Select scope: public_repo (more than enough; reading public repositories requires no scopes at all)
  4. Copy token and set as environment variable
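
Once the token is exported, an API client typically passes it in the `Authorization` header. The sketch below shows one way to do that with the standard library, assuming the token is read from `GITHUB_TOKEN` as described above. The helper name `github_request` and the `User-Agent` value are hypothetical, and the real script may build its requests differently.

```python
import os
import urllib.request


def github_request(url, token=None):
    """Build a urllib Request for the GitHub API (hypothetical helper).

    If no token is passed, fall back to the GITHUB_TOKEN environment variable.
    """
    if token is None:
        token = os.environ.get("GITHUB_TOKEN")
    headers = {
        "Accept": "application/vnd.github+json",
        # GitHub rejects API requests that have no User-Agent header.
        "User-Agent": "repo-downloader-sketch",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = github_request("https://api.github.com/rate_limit")
    # Report whether the request would be authenticated, without printing the token.
    print("authenticated:", req.has_header("Authorization"))
```

Passing the request to `urllib.request.urlopen` would perform the call; the sketch stops short of that so it can run without network access.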

Size Limits

The repository downloader includes automatic size limit checking to prevent running out of disk space.

How It Works

  • Default limit: 1 TB (1024 GB) for download_all_repos.py
  • Customizable: Use --max-size to set any limit
  • Real-time tracking: Size is checked before each repository clone
  • Automatic stopping: Downloads stop when limit is reached
  • Progress display: Current size shown in progress bars

Setting Size Limits

With download_all_repos.py:

# Default 1 TB
python3 download_all_repos.py

# Custom limit (2 TB)
python3 download_all_repos.py --max-size 2048.0

# Smaller limit (512 GB)
python3 download_all_repos.py --max-size 512.0

With download_repos.py:

# No limit (downloads until max-repos reached)
python3 download_repos.py --categories nvim --max-repos 100

# With 1 TB limit
python3 download_repos.py --categories nvim --max-repos 1000 --max-size 1024.0

Size Calculation

The script calculates total size by:

  • Scanning all files in the output directory (data/repos by default)
  • Summing file sizes recursively
  • Checking before each new repository clone
  • Displaying human-readable sizes (B, KB, MB, GB, TB)

Note: Size checking happens before each clone, not during it, so the final size may slightly exceed the limit (by up to the size of the last repository cloned).
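
The size calculation described above can be sketched with `os.walk`. This is an illustrative version under stated assumptions, not the script's own code: the function names are hypothetical, 1 GB is taken as 1024**3 bytes (matching the guide's 1024 GB = 1 TB convention), and symbolic links are skipped.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # assumption: do not follow or count symlinks
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


def format_size(num_bytes):
    """Render a byte count using B, KB, MB, GB, or TB (1024-based)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TB"


def limit_reached(root, max_size_gb):
    """Return True when root already holds at least max_size_gb gigabytes."""
    if max_size_gb is None:  # no limit configured
        return False
    return directory_size_bytes(root) >= max_size_gb * 1024 ** 3


if __name__ == "__main__":
    print(format_size(directory_size_bytes(".")))
```

A downloader would call `limit_reached` before each clone and stop as soon as it returns True.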

Cache and Resuming

The scripts automatically:

  • Skip existing repos: A repository that already exists in the output directory is not downloaded again
  • Resume downloads: You can safely run the script multiple times
  • Track progress: Show what's already downloaded
  • Account for size: Count existing repositories when checking size limits
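
The skip-and-resume behaviour can be pictured as a planning step that only emits `git clone` commands for repositories not yet on disk. The sketch below is a guess at the general shape, not the script's actual logic. The helper names, the `owner_repo` directory naming, and the use of a `.git` directory as the "already cloned" marker are all assumptions made for illustration. `git clone --depth 1` is the standard way to request a shallow clone.

```python
from pathlib import Path


def clone_command(clone_url, dest, full_clone=False):
    """Return the git command for one repository (shallow unless full_clone)."""
    cmd = ["git", "clone", "--quiet"]
    if not full_clone:
        cmd += ["--depth", "1"]  # shallow clone: latest snapshot only
    cmd += [clone_url, str(dest)]
    return cmd


def plan_clones(repos, output_dir, full_clone=False):
    """Return clone commands for repos that are not on disk yet.

    repos is an iterable of (full_name, clone_url) pairs,
    e.g. ("owner/name", "https://github.com/owner/name.git").
    """
    output_dir = Path(output_dir)
    commands = []
    for full_name, clone_url in repos:
        # Assumed layout: one directory per repo, named owner_name.
        dest = output_dir / full_name.replace("/", "_")
        if (dest / ".git").exists():
            continue  # already cloned on a previous run; skip it
        commands.append(clone_command(clone_url, dest, full_clone))
    return commands


if __name__ == "__main__":
    demo = [("owner/name", "https://github.com/owner/name.git")]
    for cmd in plan_clones(demo, "data/repos"):
        print(" ".join(cmd))
```

Each planned command could then be run with `subprocess.run(cmd, timeout=...)`, counting non-zero exit codes as failed clones.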

Downloaded repositories are processed automatically during training:

# Download repos
python3 download_all_repos.py

# Train with all data (text + code)
python3 train.py --data data/ --config config.json --device cuda

The training script will:

  1. Process all your text data (Wiki, Books, Amazon reviews, etc.)
  2. Process all code repositories
  3. Combine everything into training data

Supported File Types

The data processor automatically handles code files from repositories:

  • Text files: .txt, .md, .rst, .log, .csv, .json, .jsonl, .xml, .html, .htm
  • Code files: .py, .js, .ts, .java, .cpp, .c, .go, .rs, .rb, .php, .swift, .lua, .sh, and 30+ more
  • PDF files: .pdf (if pdfplumber is installed)
  • Images: .png, .jpg, etc. (if OCR is set up)
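
To make the extension-based selection concrete, here is a small sketch that walks a cloned repository and yields only files with supported extensions. It is not the project's data processor: the function name `iter_training_files` is hypothetical, the extension sets are limited to the ones spelled out in the list above, and skipping the `.git` directory is an assumption.

```python
from pathlib import Path

# Subsets taken from the list above; the real processor supports more.
TEXT_EXTS = {".txt", ".md", ".rst", ".log", ".csv", ".json",
             ".jsonl", ".xml", ".html", ".htm"}
CODE_EXTS = {".py", ".js", ".ts", ".java", ".cpp", ".c", ".go",
             ".rs", ".rb", ".php", ".swift", ".lua", ".sh"}
SUPPORTED = TEXT_EXTS | CODE_EXTS


def iter_training_files(root):
    """Yield supported text and code files under root, in sorted order."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if ".git" in path.relative_to(root).parts:
            continue  # assumption: git metadata is not training data
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


if __name__ == "__main__":
    for file_path in iter_training_files("data/repos"):
        print(file_path)
```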

Troubleshooting

Rate Limit Exceeded

Error: Rate limit exceeded

Solution:

  1. Wait for the rate limit window to reset (up to an hour for unauthenticated requests), then try again
  2. Use a GitHub token: export GITHUB_TOKEN=your_token
  3. Reduce --max-repos to download fewer repos per run
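
Rather than guessing how long to wait, a client can read the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub returns with API responses (the reset value is a UTC epoch timestamp in seconds). The helper below is a minimal sketch of that calculation. The name `seconds_until_reset` is hypothetical, and nothing here is taken from the scripts themselves.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request is allowed.

    headers is a mapping of response headers. A plain dict is case-sensitive,
    so the keys must be spelled exactly as below; the headers object of a
    urllib response is case-insensitive.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset_at - now)


if __name__ == "__main__":
    sample = {"X-RateLimit-Remaining": "0",
              "X-RateLimit-Reset": str(int(time.time()) + 90)}
    print(f"sleep for about {seconds_until_reset(sample):.0f} s")
```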

Repository Clone Fails

Error: Failed to clone repository

Possible causes:

  • Repository was deleted or made private
  • Network issues
  • Repository is too large (timeout)

Solution:

  • The script continues with other repos
  • Failed repos are counted and reported at the end
  • You can re-run the script to retry failed repos

No Repositories Found

Error: No repositories found

Possible causes:

  • Search query too restrictive
  • License filter too narrow
  • Minimum stars too high

Solution:

  • Lower --min-stars threshold
  • Try different --license options
  • Check if category name is correct

Best Practices

1. Start Small

Test with a small number first:

python3 download_repos.py --categories nvim --max-repos 10

2. Use Size Limits

Always set a size limit to prevent running out of disk space:

# Recommended: 1 TB limit
python3 download_all_repos.py --max-size 1024.0

# Or custom limit based on available space
python3 download_repos.py --categories all-open --max-size 512.0

3. Use Shallow Clones

Shallow clones are faster and use less disk space:

# Default (shallow clone)
python3 download_repos.py --categories nvim

# Full clone (only if you need history)
python3 download_repos.py --categories nvim --full-clone

4. Filter by Quality

Use --min-stars to get quality repositories:

python3 download_repos.py --categories nvim --min-stars 500 --max-repos 50

5. Use GitHub Token

For large downloads, use a GitHub token:

export GITHUB_TOKEN=your_token_here
python3 download_all_repos.py --max-repos 100

6. Monitor Disk Space

Check available disk space before starting:

df -h data/repos

7. Use all-open Category Wisely

The all-open category downloads broadly. Consider:

  • Setting a reasonable --max-repos limit
  • Using --min-stars to filter quality
  • Setting --max-size to prevent excessive downloads

python3 download_repos.py --categories all-open --max-repos 500 --min-stars 200 --max-size 1024.0

Storage Considerations

Size Limits and Disk Space Management

  • Default: 1 TB (1024 GB) for download_all_repos.py
  • Recommended: Set based on available disk space
  • Monitoring: Script shows current size vs limit in progress bars

Shallow vs Full Clones

Shallow clones (default):

  • Faster download
  • Less disk space (~10-50% of full clone)
  • No git history
  • Good for training data

Full clones:

  • Slower download
  • More disk space (includes full history)
  • Includes full git history
  • Useful if you need version history

Typical sizes (shallow clones):

  • Small repo: 1-10 MB
  • Medium repo: 10-100 MB
  • Large repo: 100 MB - 1 GB
  • Very large repo: 1-10 GB+

Example: Downloading 300 repositories with shallow clones typically uses 5-30 GB, depending on repository sizes.

Estimating Storage Needs

To estimate how many repositories you can download:

  1. Check current size:

    du -sh data/repos
    
  2. Calculate average repo size:

    • Small repos: ~5 MB average
    • Medium repos: ~50 MB average
    • Large repos: ~500 MB average
  3. Estimate:

    • 100 small repos: ~500 MB
    • 100 medium repos: ~5 GB
    • 100 large repos: ~50 GB
    • 1000 mixed repos: ~50-200 GB
  4. Set appropriate limit:

    # For 1 TB available space, use 900 GB limit (leave buffer)
    python3 download_all_repos.py --max-size 900.0
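
The estimate in the steps above is simple arithmetic, sketched here for convenience. The average sizes are the rough figures from this guide, the function names are hypothetical, and 1 GB is treated as 1024 MB, so results land slightly below the rounded numbers in the list.

```python
# Rough average shallow-clone sizes from this guide, in MB.
AVG_REPO_MB = {"small": 5, "medium": 50, "large": 500}


def estimate_gb(counts):
    """Estimate total size in GB for a mix such as {"small": 100, "large": 10}."""
    total_mb = sum(AVG_REPO_MB[kind] * n for kind, n in counts.items())
    return total_mb / 1024


def repos_that_fit(limit_gb, current_gb, avg_repo_mb):
    """Return how many more repos of the given average size fit under the limit."""
    free_mb = max(0.0, (limit_gb - current_gb) * 1024)
    return int(free_mb // avg_repo_mb)


if __name__ == "__main__":
    print(f"100 medium repos: ~{estimate_gb({'medium': 100}):.1f} GB")
    print("medium repos that fit in 900 GB:", repos_that_fit(900, 0, 50))
```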
    

Summary

The repository downloader makes it easy to:

  • Automatically find high-quality open-source repositories
  • Filter by category, language, license, and popularity
  • Download with progress tracking and size monitoring
  • Set size limits to prevent running out of disk space
  • Integrate seamlessly with the training pipeline
  • Resume interrupted downloads

Available categories:

  • nvim - Neovim configurations and plugins
  • lua - Lua programming repositories
  • bash - Bash/shell script repositories
  • zsh - Zsh configuration and plugins
  • python - Python programming repositories
  • hacking - Ethical hacking and cybersecurity tools
  • security - Security and cybersecurity repositories
  • all-open - All repositories with open licenses (any language)

Quick commands to get started:

# Download all categories with 1 TB limit (recommended)
python3 download_all_repos.py

# Download specific categories
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50

# Download all open-license repos with size limit
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0

This downloads repositories and prepares them for training!