Initial commit: SheepOp LLM - Transformer-based language model implementation
- Complete transformer implementation from scratch - Training pipeline with gradient accumulation and mixed precision - Optimized inference with KV caching - Multi-format data processing (PDFs, images, code, text) - Comprehensive documentation - Apache 2.0 license - Example training plots included in docs/images/
This commit is contained in:
656
docs/REPOSITORY_DOWNLOAD_GUIDE.md
Normal file
656
docs/REPOSITORY_DOWNLOAD_GUIDE.md
Normal file
@@ -0,0 +1,656 @@
|
||||
# Repository Download Guide
|
||||
|
||||
This guide explains how to automatically download GitHub repositories with open licenses for code training using the repository downloader scripts.
|
||||
|
||||
## Overview
|
||||
|
||||
The repository downloader allows you to automatically find and clone GitHub repositories based on:
|
||||
- **Categories**: Neovim configs, Lua repos, Bash scripts, Zsh configs, Python repos, ethical hacking tools, security tools, and all open-license repos
|
||||
- **Languages**: Python, JavaScript, Go, Rust, and 15+ more
|
||||
- **Licenses**: MIT, Apache, BSD, GPL, and other open source licenses
|
||||
- **Quality**: Filter by minimum stars (popularity)
|
||||
- **Size Limits**: Automatic stopping when reaching storage limits (default: 1 TB)
|
||||
|
||||
## Scripts
|
||||
|
||||
There are two scripts available:
|
||||
|
||||
1. **`download_all_repos.py`** - Convenience script to download all common categories at once
|
||||
2. **`download_repos.py`** - Full-featured script with all options and flexibility
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Download All Categories (Recommended)
|
||||
|
||||
The easiest way to download all repository categories:
|
||||
|
||||
```bash
|
||||
python3 download_all_repos.py
|
||||
```
|
||||
|
||||
This will download:
|
||||
- 📦 Neovim configurations and plugins
|
||||
- 📦 Lua programming repositories
|
||||
- 📦 Bash/shell script repositories
|
||||
- 📦 Zsh configuration and plugins
|
||||
- 📦 Python programming repositories
|
||||
- 📦 Ethical hacking and cybersecurity tools
|
||||
|
||||
**Default settings:**
|
||||
- Max repos per category: 50
|
||||
- Min stars: 100
|
||||
- Output directory: `data/repos`
|
||||
- Size limit: 1 TB (1024 GB)
|
||||
- Shallow clones (faster, less disk space)
|
||||
|
||||
### Download Specific Categories
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
|
||||
```
|
||||
|
||||
### Download All Open-License Repos
|
||||
|
||||
Download repositories with any open license (any language):
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
|
||||
```
|
||||
|
||||
### Download by Language
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --language python --max-repos 100
|
||||
```
|
||||
|
||||
## Installation
|
||||
|
||||
No additional dependencies required! The script uses:
|
||||
- Python standard library (`urllib`, `json`, `subprocess`)
|
||||
- `tqdm` (already in requirements.txt)
|
||||
- `git` (should be installed on your system)
|
||||
|
||||
## Available Categories
|
||||
|
||||
### Neovim (`nvim`)
|
||||
Neovim configuration files and plugins written in Lua.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim --max-repos 100
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- `neovim OR nvim-config OR neovim-config`
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### Lua (`lua`)
|
||||
Lua programming language repositories.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories lua --max-repos 50
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- Language: Lua
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### Bash (`bash`)
|
||||
Bash and shell script repositories.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories bash --max-repos 50
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- Language: Shell
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### Zsh (`zsh`)
|
||||
Zsh configuration files and plugins (Oh My Zsh, etc.).
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories zsh --max-repos 50
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- `zsh-config OR oh-my-zsh OR zsh-plugin`
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### Python (`python`)
|
||||
Python programming language repositories.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories python --max-repos 100
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- Language: Python
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### Ethical Hacking (`hacking`)
|
||||
Ethical hacking and cybersecurity tools.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories hacking --max-repos 100
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- `ethical-hacking OR cybersecurity OR penetration-testing OR security-tools OR red-team`
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### Security (`security`)
|
||||
General security and cybersecurity repositories.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories security --max-repos 50
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- `security-tools OR cybersecurity OR penetration-testing OR red-team OR blue-team`
|
||||
- MIT licensed repositories (default)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
### All Open Licenses (`all-open`)
|
||||
All repositories with open licenses, any language. This is useful for downloading a diverse set of repositories.
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
|
||||
```
|
||||
|
||||
**What it searches for:**
|
||||
- Any open-license repository (no language filter)
|
||||
- No specific license filter (searches all open licenses)
|
||||
- 100+ stars minimum (default)
|
||||
|
||||
**Note:** This category searches broadly and may return repositories with various licenses. You can still specify `--license` to filter to a specific license type.
|
||||
|
||||
## Command-Line Options
|
||||
|
||||
### `download_repos.py` Options
|
||||
|
||||
```bash
|
||||
python3 download_repos.py [OPTIONS]
|
||||
```
|
||||
|
||||
**Options:**
|
||||
|
||||
- `--output DIR` - Output directory (default: `data/repos`)
|
||||
- `--categories CAT1 CAT2 ...` - Categories to download: `nvim`, `lua`, `bash`, `zsh`, `python`, `hacking`, `security`, `all-open`
|
||||
- `--language LANG` - Single language to filter by
|
||||
- `--languages LANG1 LANG2 ...` - Multiple languages to download
|
||||
- `--license LICENSE` - License type (default: `mit`)
|
||||
- `--min-stars N` - Minimum stars (default: 100)
|
||||
- `--max-repos N` - Maximum repos per category/language (default: 50)
|
||||
- `--max-size N` - Maximum total size in GB (stops downloading when reached, e.g., `1024.0` for 1 TB)
|
||||
- `--full-clone` - Do full clone instead of shallow (slower but includes full history)
|
||||
|
||||
### `download_all_repos.py` Options
|
||||
|
||||
```bash
|
||||
python3 download_all_repos.py [OPTIONS]
|
||||
```
|
||||
|
||||
**Options:**
|
||||
|
||||
- `--max-repos N` - Maximum repos per category (default: 50)
|
||||
- `--min-stars N` - Minimum stars (default: 100)
|
||||
- `--output DIR` - Output directory (default: `data/repos`)
|
||||
- `--max-size N` - Maximum total size in GB (default: 1024.0 = 1 TB)
|
||||
- `--full-clone` - Do full clone instead of shallow
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0
|
||||
```
|
||||
|
||||
### Available Licenses
|
||||
|
||||
- `mit` (default)
|
||||
- `apache-2.0`
|
||||
- `bsd-3-clause`
|
||||
- `bsd-2-clause`
|
||||
- `isc`
|
||||
- `unlicense`
|
||||
- `mpl-2.0`
|
||||
- `lgpl-2.1`
|
||||
- `lgpl-3.0`
|
||||
- `gpl-2.0`
|
||||
- `gpl-3.0`
|
||||
|
||||
### Available Languages
|
||||
|
||||
- `python`
|
||||
- `javascript`
|
||||
- `typescript`
|
||||
- `java`
|
||||
- `cpp`
|
||||
- `c`
|
||||
- `go`
|
||||
- `rust`
|
||||
- `ruby`
|
||||
- `php`
|
||||
- `swift`
|
||||
- `kotlin`
|
||||
- `scala`
|
||||
- `r`
|
||||
- `sql`
|
||||
- `lua`
|
||||
- `shell` (for bash/shell scripts)
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Download All Categories (Simple)
|
||||
|
||||
```bash
|
||||
python3 download_all_repos.py
|
||||
```
|
||||
|
||||
Downloads all categories (nvim, lua, bash, zsh, python, hacking) with default settings and 1 TB size limit.
|
||||
|
||||
### Example 2: Download All Categories with Custom Settings
|
||||
|
||||
```bash
|
||||
python3 download_all_repos.py --max-repos 100 --min-stars 200 --max-size 2048.0
|
||||
```
|
||||
|
||||
Downloads all categories with:
|
||||
- 100 repos per category
|
||||
- Minimum 200 stars
|
||||
- 2 TB size limit
|
||||
|
||||
### Example 3: Download Specific Categories
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
|
||||
```
|
||||
|
||||
Downloads specific categories with 50 repos each.
|
||||
|
||||
### Example 4: Download All Open-License Repos with Size Limit
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
|
||||
```
|
||||
|
||||
Downloads up to 1000 repositories with any open license, stopping at 1 TB.
|
||||
|
||||
### Example 5: Download High-Quality Repos
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim lua bash zsh python hacking --min-stars 1000 --max-repos 20
|
||||
```
|
||||
|
||||
Downloads only highly popular repositories (1000+ stars).
|
||||
|
||||
### Example 6: Download Multiple Languages
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --languages python javascript go rust --max-repos 50
|
||||
```
|
||||
|
||||
Downloads repositories in multiple programming languages.
|
||||
|
||||
### Example 7: Download with Apache License
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim --license apache-2.0 --max-repos 50
|
||||
```
|
||||
|
||||
Downloads Neovim repos with Apache 2.0 license.
|
||||
|
||||
### Example 8: Custom Output Directory
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim lua bash zsh python hacking --output /path/to/repos
|
||||
```
|
||||
|
||||
Saves repositories to a custom directory.
|
||||
|
||||
### Example 9: Full Clone (with History)
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim --full-clone --max-repos 10
|
||||
```
|
||||
|
||||
Does full clone including full git history (slower but more complete).
|
||||
|
||||
### Example 10: Size-Limited Download
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories all-open --max-repos 2000 --max-size 512.0
|
||||
```
|
||||
|
||||
Downloads repositories but stops when reaching 512 GB (0.5 TB).
|
||||
|
||||
## Progress Tracking
|
||||
|
||||
The scripts include visual progress bars showing:
|
||||
|
||||
- **Category progress**: Overall progress across all categories
|
||||
- **Repository progress**: Progress for each category
|
||||
- **Real-time statistics**: Current repo, stars, language, cloned/failed counts
|
||||
- **Size tracking**: Current total size and size limit (when `--max-size` is used)
|
||||
|
||||
**Example output:**
|
||||
|
||||
```text
|
||||
📊 Current directory size: 45.23 GB
|
||||
📊 Size limit: 1024.00 GB
|
||||
📦 Processing 6 categories...
|
||||
Category: nvim: 100%|████████████| 6/6 [15:23<00:00, Size=156.78 GB, Total Cloned=300, Total Failed=2]
|
||||
Cloning nvim: 45%|████████████████▌ | 23/50 [02:15<03:45, Current=awesome-nvim, Stars=5.2k, Lang=Lua, Cloned=22, Failed=1, Size=12.45 GB]
|
||||
```
|
||||
|
||||
**Size limit reached:**
|
||||
|
||||
When the size limit is reached, the script will stop downloading and show:
|
||||
|
||||
```text
|
||||
⚠️ Size limit reached: 1024.00 GB >= 1024.00 GB
|
||||
Stopping all downloads.
|
||||
```
|
||||
|
||||
## GitHub API Rate Limits
|
||||
|
||||
GitHub API has rate limits:
|
||||
- **Unauthenticated**: 60 requests/hour
|
||||
- **Authenticated**: 5,000 requests/hour
|
||||
|
||||
### Using a GitHub Token
|
||||
|
||||
To increase rate limits, set a GitHub Personal Access Token:
|
||||
|
||||
```bash
|
||||
export GITHUB_TOKEN=your_token_here
|
||||
python3 download_repos.py --categories nvim lua bash hacking
|
||||
```
|
||||
|
||||
**How to create a token:**
|
||||
1. Go to GitHub Settings → Developer settings → Personal access tokens
|
||||
2. Generate new token (classic)
|
||||
3. Select scope: `public_repo` (read-only is enough)
|
||||
4. Copy token and set as environment variable
|
||||
|
||||
## Size Limits
|
||||
|
||||
The repository downloader includes automatic size limit checking to prevent running out of disk space.
|
||||
|
||||
### How It Works
|
||||
|
||||
- **Default limit**: 1 TB (1024 GB) for `download_all_repos.py`
|
||||
- **Customizable**: Use `--max-size` to set any limit
|
||||
- **Real-time tracking**: Size is checked before each repository clone
|
||||
- **Automatic stopping**: Downloads stop when limit is reached
|
||||
- **Progress display**: Current size shown in progress bars
|
||||
|
||||
### Setting Size Limits
|
||||
|
||||
**With `download_all_repos.py`:**
|
||||
```bash
|
||||
# Default 1 TB
|
||||
python3 download_all_repos.py
|
||||
|
||||
# Custom limit (2 TB)
|
||||
python3 download_all_repos.py --max-size 2048.0
|
||||
|
||||
# Smaller limit (500 GB)
|
||||
python3 download_all_repos.py --max-size 512.0
|
||||
```
|
||||
|
||||
**With `download_repos.py`:**
|
||||
```bash
|
||||
# No limit (downloads until max-repos reached)
|
||||
python3 download_repos.py --categories nvim --max-repos 100
|
||||
|
||||
# With 1 TB limit
|
||||
python3 download_repos.py --categories nvim --max-repos 1000 --max-size 1024.0
|
||||
```
|
||||
|
||||
### Size Calculation
|
||||
|
||||
The script calculates total size by:
|
||||
- Scanning all files in the output directory (`data/repos` by default)
|
||||
- Summing file sizes recursively
|
||||
- Checking before each new repository clone
|
||||
- Displaying human-readable sizes (B, KB, MB, GB, TB)
|
||||
|
||||
**Note:** Size checking happens before cloning, so the actual size may be slightly less than the limit when stopping.
|
||||
|
||||
## Cache and Resuming
|
||||
|
||||
The scripts automatically:
|
||||
|
||||
- **Skips existing repos**: If a repository already exists, it's skipped (no re-download)
|
||||
- **Resumes downloads**: You can run the script multiple times safely
|
||||
- **Progress tracking**: Shows what's already downloaded
|
||||
- **Size awareness**: Accounts for existing repositories when checking size limits
|
||||
|
||||
After downloading repositories, they're automatically processed during training:
|
||||
|
||||
```bash
|
||||
# Download repos
|
||||
python3 download_all_repos.py
|
||||
|
||||
# Train with all data (text + code)
|
||||
python3 train.py --data data/ --config config.json --device cuda
|
||||
```
|
||||
|
||||
The training script will:
|
||||
1. Process all your text data (Wiki, Books, Amazon reviews, etc.)
|
||||
2. Process all code repositories
|
||||
3. Combine everything into training data
|
||||
|
||||
## Supported File Types
|
||||
|
||||
The data processor automatically handles code files from repositories:
|
||||
|
||||
- **Text files**: `.txt`, `.md`, `.rst`, `.log`, `.csv`, `.json`, `.jsonl`, `.xml`, `.html`, `.htm`
|
||||
- **Code files**: `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, `.lua`, `.sh`, and 30+ more
|
||||
- **PDF files**: `.pdf` (if pdfplumber is installed)
|
||||
- **Images**: `.png`, `.jpg`, etc. (if OCR is set up)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Rate Limit Exceeded
|
||||
|
||||
**Error:** `Rate limit exceeded`
|
||||
|
||||
**Solution:**
|
||||
1. Wait a few minutes and try again
|
||||
2. Use a GitHub token: `export GITHUB_TOKEN=your_token`
|
||||
3. Reduce `--max-repos` to download fewer repos per run
|
||||
|
||||
### Repository Clone Fails
|
||||
|
||||
**Error:** `Failed to clone repository`
|
||||
|
||||
**Possible causes:**
|
||||
- Repository was deleted or made private
|
||||
- Network issues
|
||||
- Repository is too large (timeout)
|
||||
|
||||
**Solution:**
|
||||
- The script continues with other repos
|
||||
- Failed repos are counted and reported at the end
|
||||
- You can re-run the script to retry failed repos
|
||||
|
||||
### No Repositories Found
|
||||
|
||||
**Error:** `No repositories found`
|
||||
|
||||
**Possible causes:**
|
||||
- Search query too restrictive
|
||||
- License filter too narrow
|
||||
- Minimum stars too high
|
||||
|
||||
**Solution:**
|
||||
- Lower `--min-stars` threshold
|
||||
- Try different `--license` options
|
||||
- Check if category name is correct
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Start Small
|
||||
|
||||
Test with a small number first:
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim --max-repos 10
|
||||
```
|
||||
|
||||
### 2. Use Size Limits
|
||||
|
||||
Always set a size limit to prevent running out of disk space:
|
||||
|
||||
```bash
|
||||
# Recommended: 1 TB limit
|
||||
python3 download_all_repos.py --max-size 1024.0
|
||||
|
||||
# Or custom limit based on available space
|
||||
python3 download_repos.py --categories all-open --max-size 512.0
|
||||
```
|
||||
|
||||
### 3. Use Shallow Clones
|
||||
|
||||
Shallow clones are faster and use less disk space:
|
||||
|
||||
```bash
|
||||
# Default (shallow clone)
|
||||
python3 download_repos.py --categories nvim
|
||||
|
||||
# Full clone (only if you need history)
|
||||
python3 download_repos.py --categories nvim --full-clone
|
||||
```
|
||||
|
||||
### 4. Filter by Quality
|
||||
|
||||
Use `--min-stars` to get quality repositories:
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories nvim --min-stars 500 --max-repos 50
|
||||
```
|
||||
|
||||
### 5. Use GitHub Token
|
||||
|
||||
For large downloads, use a GitHub token:
|
||||
|
||||
```bash
|
||||
export GITHUB_TOKEN=your_token_here
|
||||
python3 download_all_repos.py --max-repos 100
|
||||
```
|
||||
|
||||
### 6. Monitor Disk Space
|
||||
|
||||
Check available disk space before starting:
|
||||
|
||||
```bash
|
||||
df -h data/repos
|
||||
```
|
||||
|
||||
### 7. Use `all-open` Category Wisely
|
||||
|
||||
The `all-open` category downloads broadly. Consider:
|
||||
- Setting a reasonable `--max-repos` limit
|
||||
- Using `--min-stars` to filter quality
|
||||
- Setting `--max-size` to prevent excessive downloads
|
||||
|
||||
```bash
|
||||
python3 download_repos.py --categories all-open --max-repos 500 --min-stars 200 --max-size 1024.0
|
||||
```
|
||||
|
||||
## Storage Considerations
|
||||
|
||||
### Size Limits and Disk Space Management
|
||||
|
||||
- **Default**: 1 TB (1024 GB) for `download_all_repos.py`
|
||||
- **Recommended**: Set based on available disk space
|
||||
- **Monitoring**: Script shows current size vs limit in progress bars
|
||||
|
||||
### Shallow vs Full Clones
|
||||
|
||||
**Shallow clones (default):**
|
||||
- Faster download
|
||||
- Less disk space (~10-50% of full clone)
|
||||
- No git history
|
||||
- Good for training data
|
||||
|
||||
**Full clones:**
|
||||
- Slower download
|
||||
- More disk space (includes full history)
|
||||
- Includes full git history
|
||||
- Useful if you need version history
|
||||
|
||||
**Typical sizes (shallow clones):**
|
||||
- Small repo: 1-10 MB
|
||||
- Medium repo: 10-100 MB
|
||||
- Large repo: 100 MB - 1 GB
|
||||
- Very large repo: 1-10 GB+
|
||||
|
||||
**Example:** Downloading 300 repositories with shallow clones typically uses 5-30 GB, depending on repository sizes.
|
||||
|
||||
### Estimating Storage Needs
|
||||
|
||||
To estimate how many repositories you can download:
|
||||
|
||||
1. **Check current size:**
|
||||
```bash
|
||||
du -sh data/repos
|
||||
```
|
||||
|
||||
2. **Calculate average repo size:**
|
||||
- Small repos: ~5 MB average
|
||||
- Medium repos: ~50 MB average
|
||||
- Large repos: ~500 MB average
|
||||
|
||||
3. **Estimate:**
|
||||
- 100 small repos: ~500 MB
|
||||
- 100 medium repos: ~5 GB
|
||||
- 100 large repos: ~50 GB
|
||||
- 1000 mixed repos: ~50-200 GB
|
||||
|
||||
4. **Set appropriate limit:**
|
||||
```bash
|
||||
# For 1 TB available space, use 900 GB limit (leave buffer)
|
||||
python3 download_all_repos.py --max-size 900.0
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
The repository downloader makes it easy to:
|
||||
- ✅ Automatically find high-quality open-source repositories
|
||||
- ✅ Filter by category, language, license, and popularity
|
||||
- ✅ Download with progress tracking and size monitoring
|
||||
- ✅ Set size limits to prevent running out of disk space
|
||||
- ✅ Integrate seamlessly with training pipeline
|
||||
- ✅ Resume interrupted downloads
|
||||
|
||||
**Available categories:**
|
||||
- `nvim` - Neovim configurations and plugins
|
||||
- `lua` - Lua programming repositories
|
||||
- `bash` - Bash/shell script repositories
|
||||
- `zsh` - Zsh configuration and plugins
|
||||
- `python` - Python programming repositories
|
||||
- `hacking` - Ethical hacking and cybersecurity tools
|
||||
- `security` - Security and cybersecurity repositories
|
||||
- `all-open` - All repositories with open licenses (any language)
|
||||
|
||||
**Quick commands to get started:**
|
||||
|
||||
```bash
|
||||
# Download all categories with 1 TB limit (recommended)
|
||||
python3 download_all_repos.py
|
||||
|
||||
# Download specific categories
|
||||
python3 download_repos.py --categories nvim lua bash zsh python hacking --max-repos 50
|
||||
|
||||
# Download all open-license repos with size limit
|
||||
python3 download_repos.py --categories all-open --max-repos 1000 --max-size 1024.0
|
||||
```
|
||||
|
||||
This downloads repositories and prepares them for training!
|
||||
Reference in New Issue
Block a user