- Complete transformer implementation from scratch - Training pipeline with gradient accumulation and mixed precision - Optimized inference with KV caching - Multi-format data processing (PDFs, images, code, text) - Comprehensive documentation - Apache 2.0 license - Example training plots included in docs/images/
17 KiB
Repository Download Guide
This guide explains how to automatically download GitHub repositories with open licenses for code training using the repository downloader scripts.
Overview
The repository downloader allows you to automatically find and clone GitHub repositories based on:
- Categories: Neovim configs, Lua repos, Bash scripts, Zsh configs, Python repos, ethical hacking tools, security tools, and all open-license repos
- Languages: Python, JavaScript, Go, Rust, and 15+ more
- Licenses: MIT, Apache, BSD, GPL, and other open source licenses
- Quality: Filter by minimum stars (popularity)
- Size Limits: Automatic stopping when reaching storage limits (default: 1 TB)
Scripts
There are two scripts available:
download_all_repos.py- Convenience script to download all common categories at oncedownload_repos.py- Full-featured script with all options and flexibility
Quick Start
Download All Categories (Recommended)
The easiest way to download all repository categories:
python3 download_all_repos.py
This will download:
- 📦 Neovim configurations and plugins
- 📦 Lua programming repositories
- 📦 Bash/shell script repositories
- 📦 Zsh configuration and plugins
- 📦 Python programming repositories
- 📦 Ethical hacking and cybersecurity tools
Default settings:
- Max repos per category: 50
- Min stars: 100
- Output directory:
data/repos - Size limit: 1 TB (1024 GB)
- Shallow clones (faster, less disk space)
Download Specific Categories
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
Download All Open-License Repos
Download repositories with any open license (any language):
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
Download by Language
python3 download_repos.py --language python --max-repos 100
Installation
No additional dependencies required! The script uses:
- Python standard library (
urllib,json,subprocess) tqdm(already in requirements.txt)git(should be installed on your system)
Available Categories
Neovim (nvim)
Neovim configuration files and plugins written in Lua.
python3 download_repos.py --categories nvim --max-repos 100
What it searches for:
neovim OR nvim-config OR neovim-config- MIT licensed repositories (default)
- 100+ stars minimum (default)
Lua (lua)
Lua programming language repositories.
python3 download_repos.py --categories lua --max-repos 50
What it searches for:
- Language: Lua
- MIT licensed repositories (default)
- 100+ stars minimum (default)
Bash (bash)
Bash and shell script repositories.
python3 download_repos.py --categories bash --max-repos 50
What it searches for:
- Language: Shell
- MIT licensed repositories (default)
- 100+ stars minimum (default)
Zsh (zsh)
Zsh configuration files and plugins (Oh My Zsh, etc.).
python3 download_repos.py --categories zsh --max-repos 50
What it searches for:
zsh-config OR oh-my-zsh OR zsh-plugin- MIT licensed repositories (default)
- 100+ stars minimum (default)
Python (python)
Python programming language repositories.
python3 download_repos.py --categories python --max-repos 100
What it searches for:
- Language: Python
- MIT licensed repositories (default)
- 100+ stars minimum (default)
Ethical Hacking (hacking)
Ethical hacking and cybersecurity tools.
python3 download_repos.py --categories hacking --max-repos 100
What it searches for:
ethical-hacking OR cybersecurity OR penetration-testing OR security-tools OR red-team- MIT licensed repositories (default)
- 100+ stars minimum (default)
Security (security)
General security and cybersecurity repositories.
python3 download_repos.py --categories security --max-repos 50
What it searches for:
security-tools OR cybersecurity OR penetration-testing OR red-team OR blue-team- MIT licensed repositories (default)
- 100+ stars minimum (default)
All Open Licenses (all-open)
All repositories with open licenses, any language. This is useful for downloading a diverse set of repositories.
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
What it searches for:
- Any open-license repository (no language filter)
- No specific license filter (searches all open licenses)
- 100+ stars minimum (default)
Note: This category searches broadly and may return repositories with various licenses. You can still specify --license to filter to a specific license type.
Command-Line Options
download_repos.py Options
python3 download_repos.py [OPTIONS]
Options:
--output DIR- Output directory (default:data/repos)--categories CAT1 CAT2 ...- Categories to download:nvim,lua,bash,zsh,python,hacking,security,all-open--language LANG- Single language to filter by--languages LANG1 LANG2 ...- Multiple languages to download--license LICENSE- License type (default:mit)--min-stars N- Minimum stars (default: 100)--max-repos N- Maximum repos per category/language (default: 50)--max-size N- Maximum total size in GB (stops downloading when reached, e.g.,1024.0for 1 TB)--full-clone- Do full clone instead of shallow (slower but includes full history)
download_all_repos.py Options
python3 download_all_repos.py [OPTIONS]
Options:
--max-repos N- Maximum repos per category (default: 50)--min-stars N- Minimum stars (default: 100)--output DIR- Output directory (default:data/repos)--max-size N- Maximum total size in GB (default: 1024.0 = 1 TB)--full-clone- Do full clone instead of shallow
Example:
python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0
Available Licenses
mit(default)apache-2.0bsd-3-clausebsd-2-clauseiscunlicensempl-2.0lgpl-2.1lgpl-3.0gpl-2.0gpl-3.0
Available Languages
pythonjavascripttypescriptjavacppcgorustrubyphpswiftkotlinscalarsqlluashell(for bash/shell scripts)
Examples
Example 1: Download All Categories (Simple)
python3 download_all_repos.py
Downloads all categories (nvim, lua, bash, zsh, python, hacking) with default settings and 1 TB size limit.
Example 2: Download All Categories with Custom Settings
python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0
Downloads all categories with:
- 100 repos per category
- Minimum 200 stars
- 2 TB size limit
Example 3: Download Specific Categories
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
Downloads specific categories with 50 repos each.
Example 4: Download All Open-License Repos with Size Limit
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
Downloads up to 1000 repositories with any open license, stopping at 1 TB.
Example 5: Download High-Quality Repos
python3 download_repos.py --categories nvim lua bash zsh python hacking --min-stars 1000 --max-repos 20
Downloads only highly popular repositories (1000+ stars).
Example 6: Download Multiple Languages
python3 download_repos.py --languages python javascript go rust --max-repos 50
Downloads repositories in multiple programming languages.
Example 7: Download with Apache License
python3 download_repos.py --categories nvim --license apache-2.0 --max-repos 50
Downloads Neovim repos with Apache 2.0 license.
Example 8: Custom Output Directory
python3 download_repos.py --categories nvim lua bash zsh python hacking --output /path/to/repos
Saves repositories to a custom directory.
Example 9: Full Clone (with History)
python3 download_repos.py --categories nvim --full-clone --max-repos 10
Does full clone including full git history (slower but more complete).
Example 10: Size-Limited Download
python3 download_repos.py --categories all-open --max-repos 2000 --max-size 512.0
Downloads repositories but stops when reaching 512 GB (0.5 TB).
Progress Tracking
The scripts include visual progress bars showing:
- Category progress: Overall progress across all categories
- Repository progress: Progress for each category
- Real-time statistics: Current repo, stars, language, cloned/failed counts
- Size tracking: Current total size and size limit (when
--max-sizeis used)
Example output:
📊 Current directory size: 45.23 GB
📊 Size limit: 1024.00 GB
📦 Processing 6 categories...
Category: nvim: 100%|████████████| 6/6 [15:23<00:00, Size=156.78 GB, Total Cloned=300, Total Failed=2]
Cloning nvim: 45%|████████████████▌ | 23/50 [02:15<03:45, Current=awesome-nvim, Stars=5.2k, Lang=Lua, Cloned=22, Failed=1, Size=12.45 GB]
Size limit reached:
When the size limit is reached, the script will stop downloading and show:
⚠️ Size limit reached: 1024.00 GB >= 1024.00 GB
Stopping all downloads.
GitHub API Rate Limits
GitHub API has rate limits:
- Unauthenticated: 60 requests/hour
- Authenticated: 5,000 requests/hour
Using a GitHub Token
To increase rate limits, set a GitHub Personal Access Token:
export GITHUB_TOKEN=your_token_here
python3 download_repos.py --categories nvim lua bash hacking
How to create a token:
- Go to GitHub Settings → Developer settings → Personal access tokens
- Generate new token (classic)
- Select scope:
public_repo(read-only is enough) - Copy token and set as environment variable
Size Limits
The repository downloader includes automatic size limit checking to prevent running out of disk space.
How It Works
- Default limit: 1 TB (1024 GB) for
download_all_repos.py - Customizable: Use
--max-sizeto set any limit - Real-time tracking: Size is checked before each repository clone
- Automatic stopping: Downloads stop when limit is reached
- Progress display: Current size shown in progress bars
Setting Size Limits
With download_all_repos.py:
# Default 1 TB
python3 download_all_repos.py
# Custom limit (2 TB)
python3 download_all_repos.py --max-size 2048.0
# Smaller limit (500 GB)
python3 download_all_repos.py --max-size 512.0
With download_repos.py:
# No limit (downloads until max-repos reached)
python3 download_repos.py --categories nvim --max-repos 100
# With 1 TB limit
python3 download_repos.py --categories nvim --max-repos 1000 --max-size 1024.0
Size Calculation
The script calculates total size by:
- Scanning all files in the output directory (
data/reposby default) - Summing file sizes recursively
- Checking before each new repository clone
- Displaying human-readable sizes (B, KB, MB, GB, TB)
Note: Size checking happens before cloning, so the actual size may be slightly less than the limit when stopping.
Cache and Resuming
The scripts automatically:
- Skips existing repos: If a repository already exists, it's skipped (no re-download)
- Resumes downloads: You can run the script multiple times safely
- Progress tracking: Shows what's already downloaded
- Size awareness: Accounts for existing repositories when checking size limits
After downloading repositories, they're automatically processed during training:
# Download repos
python3 download_all_repos.py
# Train with all data (text + code)
python3 train.py --data data/ --config config.json --device cuda
The training script will:
- Process all your text data (Wiki, Books, Amazon reviews, etc.)
- Process all code repositories
- Combine everything into training data
Supported File Types
The data processor automatically handles code files from repositories:
- Text files:
.txt,.md,.rst,.log,.csv,.json,.jsonl,.xml,.html,.htm - Code files:
.py,.js,.ts,.java,.cpp,.c,.go,.rs,.rb,.php,.swift,.lua,.sh, and 30+ more - PDF files:
.pdf(if pdfplumber is installed) - Images:
.png,.jpg, etc. (if OCR is set up)
Troubleshooting
Rate Limit Exceeded
Error: Rate limit exceeded
Solution:
- Wait a few minutes and try again
- Use a GitHub token:
export GITHUB_TOKEN=your_token - Reduce
--max-reposto download fewer repos per run
Repository Clone Fails
Error: Failed to clone repository
Possible causes:
- Repository was deleted or made private
- Network issues
- Repository is too large (timeout)
Solution:
- The script continues with other repos
- Failed repos are counted and reported at the end
- You can re-run the script to retry failed repos
No Repositories Found
Error: No repositories found
Possible causes:
- Search query too restrictive
- License filter too narrow
- Minimum stars too high
Solution:
- Lower
--min-starsthreshold - Try different
--licenseoptions - Check if category name is correct
Best Practices
1. Start Small
Test with a small number first:
python3 download_repos.py --categories nvim --max-repos 10
2. Use Size Limits
Always set a size limit to prevent running out of disk space:
# Recommended: 1 TB limit
python3 download_all_repos.py --max-size 1024.0
# Or custom limit based on available space
python3 download_repos.py --categories all-open --max-size 512.0
3. Use Shallow Clones
Shallow clones are faster and use less disk space:
# Default (shallow clone)
python3 download_repos.py --categories nvim
# Full clone (only if you need history)
python3 download_repos.py --categories nvim --full-clone
4. Filter by Quality
Use --min-stars to get quality repositories:
python3 download_repos.py --categories nvim --min-stars 500 --max-repos 50
5. Use GitHub Token
For large downloads, use a GitHub token:
export GITHUB_TOKEN=your_token_here
python3 download_all_repos.py --max-repos 100
6. Monitor Disk Space
Check available disk space before starting:
df -h data/repos
7. Use all-open Category Wisely
The all-open category downloads broadly. Consider:
- Setting a reasonable
--max-reposlimit - Using
--min-starsto filter quality - Setting
--max-sizeto prevent excessive downloads
python3 download_repos.py --categories all-open --max-repos 500 --min-stars 200 --max-size 1024.0
Storage Considerations
Size Limits and Disk Space Management
- Default: 1 TB (1024 GB) for
download_all_repos.py - Recommended: Set based on available disk space
- Monitoring: Script shows current size vs limit in progress bars
Shallow vs Full Clones
Shallow clones (default):
- Faster download
- Less disk space (~10-50% of full clone)
- No git history
- Good for training data
Full clones:
- Slower download
- More disk space (includes full history)
- Includes full git history
- Useful if you need version history
Typical sizes (shallow clones):
- Small repo: 1-10 MB
- Medium repo: 10-100 MB
- Large repo: 100 MB - 1 GB
- Very large repo: 1-10 GB+
Example: Downloading 300 repositories with shallow clones typically uses 5-30 GB, depending on repository sizes.
Estimating Storage Needs
To estimate how many repositories you can download:
-
Check current size:
du -sh data/repos -
Calculate average repo size:
- Small repos: ~5 MB average
- Medium repos: ~50 MB average
- Large repos: ~500 MB average
-
Estimate:
- 100 small repos: ~500 MB
- 100 medium repos: ~5 GB
- 100 large repos: ~50 GB
- 1000 mixed repos: ~50-200 GB
-
Set appropriate limit:
# For 1 TB available space, use 900 GB limit (leave buffer) python3 download_all_repos.py --max-size 900.0
Summary
The repository downloader makes it easy to:
- ✅ Automatically find high-quality open-source repositories
- ✅ Filter by category, language, license, and popularity
- ✅ Download with progress tracking and size monitoring
- ✅ Set size limits to prevent running out of disk space
- ✅ Integrate seamlessly with training pipeline
- ✅ Resume interrupted downloads
Available categories:
nvim- Neovim configurations and pluginslua- Lua programming repositoriesbash- Bash/shell script repositorieszsh- Zsh configuration and pluginspython- Python programming repositorieshacking- Ethical hacking and cybersecurity toolssecurity- Security and cybersecurity repositoriesall-open- All repositories with open licenses (any language)
Quick commands to get started:
# Download all categories with 1 TB limit (recommended)
python3 download_all_repos.py
# Download specific categories
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
# Download all open-license repos with size limit
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
This downloads repositories and prepares them for training!