- Complete transformer implementation from scratch
- Training pipeline with gradient accumulation and mixed precision
- Optimized inference with KV caching
- Multi-format data processing (PDFs, images, code, text)
- Comprehensive documentation
- Apache 2.0 license
- Example training plots included in docs/images/
29 lines
1.2 KiB
Plaintext
# IMPORTANT: On modern Debian/Ubuntu systems (Python 3.12+), you MUST use a virtual environment
# before installing these packages. Run: python3 -m venv venv && source venv/bin/activate
# Or use the automated setup script: ./setup.sh

torch>=2.0.0
transformers>=4.30.0
numpy>=1.24.0
tqdm>=4.65.0
tensorboard>=2.13.0
matplotlib>=3.7.0

# Optional dependencies for data processing
# Install these if you want to process PDFs or images:
# For PDF processing (choose one - pdfplumber is recommended for better quality):
pdfplumber>=0.9.0  # Recommended: better text extraction quality
# PyPDF2>=3.0.0  # Alternative PDF library (lighter weight but less accurate)

# For image OCR (requires Tesseract OCR engine installed on system):
# pytesseract>=0.3.10  # For OCR
# Pillow>=10.0.0  # Required for image processing with pytesseract
#
# To install Tesseract OCR engine:
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# macOS: brew install tesseract
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki

# For downloading datasets from Hugging Face (used by download_large_data.py):
datasets>=2.14.0  # Optional: for downloading WikiText, OpenWebText, BookCorpus, etc.
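Because several of the packages listed above are optional, it can help to check which ones are importable before running the data-processing code. The following is a minimal sketch, not part of the repository: the `OPTIONAL_MODULES` mapping and the `available_optional_features` helper are hypothetical names introduced here for illustration. It relies only on the standard library's `importlib.util.find_spec`, and on the fact that Pillow is imported as `PIL`.

```python
import importlib.util

# Hypothetical mapping (an assumption for this sketch):
# human-readable feature label -> importable top-level module name.
# Note that the Pillow distribution is imported as "PIL".
OPTIONAL_MODULES = {
    "pdf (pdfplumber)": "pdfplumber",
    "pdf (PyPDF2)": "PyPDF2",
    "ocr (pytesseract)": "pytesseract",
    "images (Pillow)": "PIL",
    "hf datasets": "datasets",
}


def available_optional_features(modules=OPTIONAL_MODULES):
    """Return {label: bool} indicating whether each module can be located.

    find_spec() only locates the module; it does not import it, so this
    check is cheap and has no side effects from the packages themselves.
    """
    return {
        label: importlib.util.find_spec(name) is not None
        for label, name in modules.items()
    }


if __name__ == "__main__":
    for label, ok in available_optional_features().items():
        print(f"{label}: {'installed' if ok else 'missing'}")
```

This check only confirms that the Python packages are installed. A successful lookup for `pytesseract` does not mean the Tesseract OCR engine itself is present on the system, since that is installed separately through the OS package manager.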