first commit

suherdy yacob 2025-08-22 16:33:30 +07:00
commit c73b0d247a
13 changed files with 1850 additions and 0 deletions

183
README.md Normal file

@@ -0,0 +1,183 @@
# AI Trainer
A Python application for fine-tuning unsloth models on code collected from GitHub repositories. It supports both Qwen2.5-Coder and Qwen3 models, with settings optimized for an RTX3070 with 8GB VRAM.
## Supported Models
### 1. Qwen2.5-Coder-7B-Instruct (Default)
- **Model**: `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit`
- **Best for**: Code generation, code completion, programming tasks
- **Memory Usage**: Moderate (~6-7GB VRAM)
- **Config**: `configs/training_config.yaml`
### 2. Qwen3-8B
- **Model**: `unsloth/Qwen3-8B-bnb-4bit`
- **Best for**: General instruction following, broader language tasks
- **Memory Usage**: Higher (~7-8GB VRAM)
- **Config**: `configs/training_config_qwen3.yaml`
## Features
- **Dataset Processing**: Automatically processes code from GitHub repositories
- **Memory Optimized**: Designed for RTX3070 8GB VRAM with no CPU offloading
- **Configurable Training**: YAML-based configuration system
- **Progress Logging**: Comprehensive logging and monitoring
- **Modular Design**: Clean separation of concerns with dataset processing, training, and utilities
- **Multi-Model Support**: Easy switching between different model architectures
## Requirements
- Python 3.8+
- CUDA-compatible GPU (tested with RTX3070 8GB VRAM)
- Git
- Dependencies listed in `requirements.txt`
## Installation
1. Clone this repository
2. Install dependencies:
```bash
pip install -r requirements.txt
```
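Optionally verify that PyTorch can see your GPU before starting a run (a minimal sanity check, not part of the project scripts):
```python
import torch

# Training requires a CUDA-capable GPU
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected - training will not run")
```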
## Usage
### Training Qwen2.5-Coder-7B (Default)
```bash
# Using the main script
python src/main.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2 \
--config configs/training_config.yaml \
--output_dir ./models \
--log_level INFO
# Or using the runner script
python run_training.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2
```
### Training Qwen3-8B
```bash
# Using the main script with Qwen3 config
python src/main.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2 \
--config configs/training_config_qwen3.yaml \
--output_dir ./models \
--log_level INFO
# Or using the dedicated Qwen3 runner
python run_training_qwen3.py \
--repo1 https://github.com/user/repo1 \
--repo2 https://github.com/user/repo2
```
### Command Line Arguments
- `--repo1`: First GitHub repository URL (required)
- `--repo2`: Second GitHub repository URL (required)
- `--config`: Path to training configuration file (default: configs/training_config.yaml)
- `--output_dir`: Directory to save trained model (default: ./models)
- `--log_level`: Logging level (DEBUG, INFO, WARNING, ERROR)
## Project Structure
```
ai_trainer/
├── src/
│   ├── __init__.py
│   ├── main.py                    # Main entry point
│   ├── trainer.py                 # Model training logic
│   ├── dataset_processor.py       # GitHub repository processing
│   ├── config.py                  # Configuration management
│   └── utils.py                   # Utility functions
├── configs/
│   ├── training_config.yaml       # Qwen2.5-Coder-7B training configuration
│   └── training_config_qwen3.yaml # Qwen3-8B training configuration
├── data/
│   └── processed/                 # Processed datasets
├── models/                        # Trained models
├── logs/                          # Training logs
├── compare_configs.py             # Side-by-side configuration comparison
├── run_training.py                # Runner for Qwen2.5-Coder-7B
├── run_training_qwen3.py          # Runner for Qwen3-8B
├── requirements.txt
└── README.md
```
## Memory Optimization
This application is specifically optimized for RTX3070 8GB VRAM (see the loading sketch after this list):
- Uses 4-bit quantization (bnb-4bit)
- Gradient checkpointing enabled
- No CPU offloading
- Optimized batch sizes for 8GB VRAM
- Memory-efficient data loading
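These settings map onto the model-loading call in `src/trainer.py`; a trimmed sketch of that call, using the same unsloth API as in this repo:
```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit so it fits in 8GB VRAM (mirrors src/trainer.py)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # auto-detect
    load_in_4bit=True,  # bnb 4-bit quantization
)

# Attach LoRA adapters with gradient checkpointing enabled
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
)
```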
## Configuration
### Qwen2.5-Coder-7B Configuration
**File**: `configs/training_config.yaml`
```yaml
model:
name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
max_seq_length: 2048
training:
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
num_train_epochs: 3
memory:
use_gradient_checkpointing: true
offload_to_cpu: false
max_memory_usage: 0.85
```
### Qwen3-8B Configuration
**File**: `configs/training_config_qwen3.yaml`
```yaml
model:
name: "unsloth/Qwen3-8B-bnb-4bit"
max_seq_length: 2048
training:
per_device_train_batch_size: 1 # More conservative
gradient_accumulation_steps: 8 # Higher accumulation
learning_rate: 1.0e-4 # Lower learning rate
num_train_epochs: 3
memory:
use_gradient_checkpointing: true
offload_to_cpu: false
max_memory_usage: 0.95 # More aggressive memory usage
```
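Both files are parsed by `AppConfig.from_yaml` in `src/config.py`. A minimal loading sketch (assuming you run from the project root):
```python
from src.config import AppConfig

# Loads the YAML file; if it does not exist, defaults are written out instead
config = AppConfig.from_yaml("configs/training_config_qwen3.yaml")
print(config.model.name)                            # unsloth/Qwen3-8B-bnb-4bit
print(config.training.per_device_train_batch_size)  # 1
```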
### Key Differences
| Setting | Qwen2.5-Coder | Qwen3-8B | Reason |
|---------|---------------|----------|---------|
| Batch Size | 2 | 1 | Larger model needs smaller batches |
| Gradient Accumulation | 4 | 8 | Maintains effective batch size |
| Learning Rate | 2e-4 | 1e-4 | Larger model needs more conservative LR |
| Memory Usage | 85% | 95% | Larger model is allowed a higher VRAM ceiling |
| Effective Batch Size | 8 | 8 | Same training dynamics |
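The effective batch size in the last row is simply the per-device batch size multiplied by the gradient accumulation steps:
```python
# Both configurations train with the same effective batch size
qwen25_effective = 2 * 4   # per_device_train_batch_size * gradient_accumulation_steps
qwen3_effective = 1 * 8
assert qwen25_effective == qwen3_effective == 8
```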
## Model Selection Guide
### Choose Qwen2.5-Coder-7B when:
- You want to fine-tune specifically for **code generation** tasks
- Working with **programming languages** and technical content
- Need **code completion** and **code understanding** capabilities
- Prefer **moderate memory usage** (~6-7GB VRAM)
### Choose Qwen3-8B when:
- You need **general instruction following** capabilities
- Working with **mixed content** (code + natural language)
- Want **broader language understanding** and generation
- Have **sufficient VRAM** (~7-8GB) and prefer a newer architecture
## License
MIT License

99
compare_configs.py Normal file

@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""
Compare training configurations for different models
"""
import yaml
from pathlib import Path
from colorama import init, Fore, Style
init(autoreset=True)
def load_config(config_path):
"""Load YAML configuration"""
with open(config_path, 'r') as f:
return yaml.safe_load(f)
def compare_configs():
"""Compare the two training configurations"""
print(f"\n{Fore.CYAN}{'='*80}{Style.RESET_ALL}")
print(f"{Fore.CYAN}AI TRAINER - MODEL CONFIGURATION COMPARISON{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*80}{Style.RESET_ALL}")
# Load configurations
qwen25_config = load_config('configs/training_config.yaml')
qwen3_config = load_config('configs/training_config_qwen3.yaml')
# Model comparison
print(f"\n{Fore.GREEN}📊 MODEL COMPARISON{Style.RESET_ALL}")
print(f"{'Setting':<25} {'Qwen2.5-Coder-7B':<20} {'Qwen3-8B':<15}")
print(f"{'-'*60}")
print(f"{'Model Name':<25} {qwen25_config['model']['name']:<20} {qwen3_config['model']['name']:<15}")
print(f"{'Max Seq Length':<25} {qwen25_config['model']['max_seq_length']:<20} {qwen3_config['model']['max_seq_length']:<15}")
# Training comparison
print(f"\n{Fore.GREEN}⚙️ TRAINING PARAMETERS{Style.RESET_ALL}")
print(f"{'Parameter':<25} {'Qwen2.5-Coder-7B':<20} {'Qwen3-8B':<15} {'Difference':<15}")
print(f"{'-'*75}")
training_params = [
('Batch Size', 'per_device_train_batch_size'),
('Gradient Accumulation', 'gradient_accumulation_steps'),
('Learning Rate', 'learning_rate'),
('Warmup Steps', 'warmup_steps'),
('Epochs', 'num_train_epochs')
]
for param_name, param_key in training_params:
qwen25_val = qwen25_config['training'][param_key]
qwen3_val = qwen3_config['training'][param_key]
diff = "🔻" if qwen3_val < qwen25_val else "🔺" if qwen3_val > qwen25_val else "➡️"
print(f"{param_name:<25} {qwen25_val:<20} {qwen3_val:<15} {diff}")
# Memory comparison
print(f"\n{Fore.GREEN}🧠 MEMORY SETTINGS{Style.RESET_ALL}")
print(f"{'Setting':<25} {'Qwen2.5-Coder-7B':<20} {'Qwen3-8B':<15}")
print(f"{'-'*60}")
memory_params = [
('Max Memory Usage', 'max_memory_usage'),
('Gradient Checkpointing', 'use_gradient_checkpointing'),
('CPU Offloading', 'offload_to_cpu')
]
for param_name, param_key in memory_params:
qwen25_val = qwen25_config['memory'][param_key]
qwen3_val = qwen3_config['memory'][param_key]
print(f"{param_name:<25} {qwen25_val:<20} {qwen3_val:<15}")
# Usage guide
print(f"\n{Fore.YELLOW}💡 RECOMMENDATION GUIDE{Style.RESET_ALL}")
print(f"{'='*80}")
print(f"\n{Fore.BLUE}Use Qwen2.5-Coder-7B when:{Style.RESET_ALL}")
print(f" • You want to fine-tune for code generation tasks")
print(f" • Working primarily with programming languages")
print(f" • Need code completion and understanding")
print(f" • Prefer moderate memory usage (~6-7GB VRAM)")
print(f"\n{Fore.BLUE}Use Qwen3-8B when:{Style.RESET_ALL}")
print(f" • You need general instruction following")
print(f" • Working with mixed code and natural language")
print(f" • Want broader language understanding")
print(f" • Have sufficient VRAM (~7-8GB)")
print(f"\n{Fore.GREEN}🚀 QUICK START COMMANDS{Style.RESET_ALL}")
print(f"{'='*80}")
print(f"\n{Fore.CYAN}For Qwen2.5-Coder-7B:{Style.RESET_ALL}")
print(f"python run_training.py --repo1 <repo1> --repo2 <repo2>")
print(f"\n{Fore.CYAN}For Qwen3-8B:{Style.RESET_ALL}")
print(f"python run_training_qwen3.py --repo1 <repo1> --repo2 <repo2>")
print(f"\n{Fore.CYAN}{'='*80}{Style.RESET_ALL}")
if __name__ == "__main__":
compare_configs()

132
configs/training_config.yaml Normal file

@@ -0,0 +1,132 @@
# Training configuration optimized for RTX3070 8GB VRAM
# AI Trainer for unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit
model:
name: "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
max_seq_length: 2048
trust_remote_code: true
use_fast_tokenizer: true
padding_side: "left"
truncation_side: "left"
training:
# Memory-optimized batch size for RTX3070 8GB
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
# Training parameters
num_train_epochs: 3
learning_rate: 2.0e-4
warmup_steps: 10
warmup_ratio: 0.1
# Logging and saving
logging_steps: 1
save_steps: 100
save_total_limit: 3
# Evaluation
evaluation_strategy: "steps"
eval_steps: 100
load_best_model_at_end: true
metric_for_best_model: "loss"
greater_is_better: false
# Data loading
dataloader_num_workers: 2
dataloader_pin_memory: true
remove_unused_columns: false
# Memory optimization - CRITICAL for RTX3070 8GB
use_gradient_checkpointing: true
offload_to_cpu: false # Explicitly no CPU offloading
# Optimizer settings
optim: "adamw_torch"
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-8
max_grad_norm: 1.0
# Learning rate scheduler
lr_scheduler_type: "cosine"
# Precision - BF16 for better stability on modern GPUs
bf16: true
fp16: false
tf32: true
# Dataset settings
dataset_shuffle: true
dataset_seed: 42
# Output settings
output_dir: "./models"
logging_dir: "./logs"
report_to: ["tensorboard"]
dataset:
# File filtering
min_file_size: 10
max_file_size: 10000
# Supported programming languages
supported_languages:
- python
- javascript
- typescript
- java
- cpp
- c
- csharp
- php
- ruby
- go
- rust
- swift
- kotlin
- scala
- sql
- bash
- yaml
- json
- xml
- html
- css
- markdown
# Files and directories to exclude
exclude_patterns:
- "\\.git/"
- "__pycache__/"
- "\\.pytest_cache/"
- "node_modules/"
- "\\.venv/"
- "venv/"
- "package-lock\\.json$"
- "yarn\\.lock$"
- "\\.log$"
- "\\.tmp$"
- "\\.bak$"
- "~\\$.*"
- "\\.swp$"
- "\\.swo$"
- "\\.DS_Store"
- "\\.pyc$"
- "\\.pyo$"
- "\\.pyd$"
- "\\.so$"
- "\\.dll$"
- "\\.exe$"
memory:
# Memory management for RTX3070 8GB
max_memory_usage: 0.85 # Use up to 85% of GPU memory
enable_memory_tracking: true
clear_cache_between_epochs: true
# Attention optimization
use_memory_efficient_attention: true
attention_slicing: true
slice_size: 1

132
configs/training_config_qwen3.yaml Normal file

@@ -0,0 +1,132 @@
# Training configuration optimized for RTX3070 8GB VRAM - Qwen3-8B Model
# AI Trainer for unsloth/Qwen3-8B-bnb-4bit
model:
name: "unsloth/Qwen3-8B-bnb-4bit"
max_seq_length: 2048
trust_remote_code: true
use_fast_tokenizer: true
padding_side: "left"
truncation_side: "left"
training:
# Memory-optimized batch size for RTX3070 8GB with Qwen3-8B
per_device_train_batch_size: 1 # More conservative for larger model
gradient_accumulation_steps: 8 # Higher accumulation to maintain effective batch size
# Training parameters
num_train_epochs: 3
learning_rate: 1.0e-4 # Slightly lower for larger model
warmup_steps: 15
warmup_ratio: 0.1
# Logging and saving
logging_steps: 1
save_steps: 100
save_total_limit: 3
# Evaluation
evaluation_strategy: "steps"
eval_steps: 100
load_best_model_at_end: true
metric_for_best_model: "loss"
greater_is_better: false
# Data loading
dataloader_num_workers: 2
dataloader_pin_memory: true
remove_unused_columns: false
# Memory optimization - CRITICAL for RTX3070 8GB with 8B model
use_gradient_checkpointing: true
offload_to_cpu: false # Explicitly no CPU offloading
# Optimizer settings
optim: "adamw_torch"
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-8
max_grad_norm: 1.0
# Learning rate scheduler
lr_scheduler_type: "cosine"
# Precision - BF16 for better stability on modern GPUs
bf16: true
fp16: false
tf32: true
# Dataset settings
dataset_shuffle: true
dataset_seed: 42
# Output settings
output_dir: "./models"
logging_dir: "./logs"
report_to: ["tensorboard"]
dataset:
# File filtering
min_file_size: 10
max_file_size: 10000
# Supported programming languages
supported_languages:
- python
- javascript
- typescript
- java
- cpp
- c
- csharp
- php
- ruby
- go
- rust
- swift
- kotlin
- scala
- sql
- bash
- yaml
- json
- xml
- html
- css
- markdown
# Files and directories to exclude
exclude_patterns:
- "\\.git/"
- "__pycache__/"
- "\\.pytest_cache/"
- "node_modules/"
- "\\.venv/"
- "venv/"
- "package-lock\\.json$"
- "yarn\\.lock$"
- "\\.log$"
- "\\.tmp$"
- "\\.bak$"
- "~\\$.*"
- "\\.swp$"
- "\\.swo$"
- "\\.DS_Store"
- "\\.pyc$"
- "\\.pyo$"
- "\\.pyd$"
- "\\.so$"
- "\\.dll$"
- "\\.exe$"
memory:
# Memory management for RTX3070 8GB with Qwen3-8B
max_memory_usage: 0.95 # Use up to 95% for more aggressive memory usage
enable_memory_tracking: true
clear_cache_between_epochs: true
# Attention optimization
use_memory_efficient_attention: true
attention_slicing: true
slice_size: 1

55
requirements.txt Normal file

@@ -0,0 +1,55 @@
# Core ML libraries
torch>=2.1.0
torchvision>=0.16.0
torchaudio>=2.1.0
# Unsloth for efficient model training
unsloth[cu121]>=2024.5
unsloth_zoo>=2024.5
# Transformers and tokenizers
transformers>=4.38.0
tokenizers>=0.15.0
sentencepiece>=0.1.99
# Datasets and data processing
datasets>=2.18.0
pandas>=2.0.0
numpy>=1.24.0
# Git and repository handling
GitPython>=3.1.0
requests>=2.31.0
# Configuration and utilities
PyYAML>=6.0.0
tqdm>=4.65.0
colorama>=0.4.6
python-dotenv>=1.0.0
# Memory optimization
bitsandbytes>=0.43.0
accelerate>=0.27.0
# Logging and monitoring
tensorboard>=2.14.0
wandb>=0.16.0
# Code processing
tree-sitter>=0.20.0
tree-sitter-python>=0.20.0
tree-sitter-javascript>=0.20.0
tree-sitter-typescript>=0.20.0
tree-sitter-java>=0.20.0
tree-sitter-go>=0.20.0
tree-sitter-rust>=0.20.0
# Optional: for model quantization and optimization
optimum>=1.17.0
auto-gptq>=0.6.0
# Development and testing
pytest>=7.4.0
black>=23.0.0
isort>=5.12.0
flake8>=6.0.0

22
run_training.py Normal file

@@ -0,0 +1,22 @@
#!/usr/bin/env python3
"""
Simple training runner script for AI Trainer
"""
import os
import sys
from pathlib import Path
# Add src to path
sys.path.append(str(Path(__file__).parent / "src"))
from main import main
if __name__ == "__main__":
# Set environment variables for better CUDA performance
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
# Run the main training application
main()

31
run_training_qwen3.py Normal file

@@ -0,0 +1,31 @@
#!/usr/bin/env python3
"""
Training runner script for unsloth/Qwen3-8B-bnb-4bit model
Optimized for RTX3070 8GB VRAM
"""
import os
import sys
from pathlib import Path
# Add src to path
sys.path.append(str(Path(__file__).parent / "src"))
from main import main
if __name__ == "__main__":
# Set environment variables for better CUDA performance with Qwen3
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
# Use Qwen3 configuration by default
if '--config' not in sys.argv:
sys.argv.extend(['--config', 'configs/training_config_qwen3.yaml'])
print("🚀 Starting training with unsloth/Qwen3-8B-bnb-4bit model")
print("📊 Configuration: configs/training_config_qwen3.yaml")
print("🧠 Memory optimization: RTX3070 8GB mode")
# Run the main training application
main()

6
src/__init__.py Normal file

@@ -0,0 +1,6 @@
"""
AI Trainer - Training framework for unsloth Qwen models (Qwen2.5-Coder-7B-Instruct and Qwen3-8B)
"""
__version__ = "1.0.0"
__author__ = "AI Trainer"

227
src/config.py Normal file

@@ -0,0 +1,227 @@
"""
Configuration management for AI Trainer
Handles training parameters and model settings
"""
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional, Union
import yaml
@dataclass
class ModelConfig:
"""Model-specific configuration"""
name: str = "unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit"
max_seq_length: int = 2048
trust_remote_code: bool = True
use_fast_tokenizer: bool = True
padding_side: str = "left"
truncation_side: str = "left"
@dataclass
class TrainingConfig:
"""Training configuration"""
per_device_train_batch_size: int = 2
gradient_accumulation_steps: int = 4
num_train_epochs: int = 3
learning_rate: float = 2e-4
warmup_steps: int = 10
logging_steps: int = 1
save_steps: int = 100
save_total_limit: int = 3
evaluation_strategy: str = "steps"
eval_steps: int = 100
load_best_model_at_end: bool = True
metric_for_best_model: str = "loss"
greater_is_better: bool = False
dataloader_num_workers: int = 2
dataloader_pin_memory: bool = True
remove_unused_columns: bool = False
label_names: List[str] = None
# Memory optimization for RTX3070 8GB
use_gradient_checkpointing: bool = True
offload_to_cpu: bool = False # Explicitly no CPU offloading
use_reentrant: bool = True
gradient_checkpointing_kwargs: Dict = None
# Optimizer settings
optim: str = "adamw_torch"
weight_decay: float = 0.01
adam_beta1: float = 0.9
adam_beta2: float = 0.999
adam_epsilon: float = 1e-8
max_grad_norm: float = 1.0
# Learning rate scheduler
lr_scheduler_type: str = "cosine"
warmup_ratio: float = 0.1
# Precision settings
bf16: bool = True
fp16: bool = False
tf32: bool = True
# Dataset processing
dataset_shuffle: bool = True
dataset_seed: int = 42
# Output settings
output_dir: str = "./models"
logging_dir: str = "./logs"
report_to: List[str] = None
def __post_init__(self):
if self.label_names is None:
self.label_names = ["labels"]
if self.gradient_checkpointing_kwargs is None:
self.gradient_checkpointing_kwargs = {"use_reentrant": self.use_reentrant}
if self.report_to is None:
self.report_to = ["tensorboard"]
@dataclass
class DatasetConfig:
"""Dataset processing configuration"""
min_file_size: int = 10
max_file_size: int = 10000 # Characters
supported_languages: List[str] = None
exclude_patterns: List[str] = None
def __post_init__(self):
if self.supported_languages is None:
self.supported_languages = [
'python', 'javascript', 'typescript', 'java', 'cpp', 'c',
'csharp', 'php', 'ruby', 'go', 'rust', 'swift', 'kotlin',
'scala', 'sql', 'bash', 'yaml', 'json', 'xml', 'html', 'css'
]
if self.exclude_patterns is None:
self.exclude_patterns = [
r'\.git/',
r'__pycache__/',
r'node_modules/',
r'\.venv/',
r'package-lock\.json$',
r'\.log$'
]
@dataclass
class MemoryConfig:
"""Memory optimization settings for RTX3070 8GB"""
max_memory_usage: float = 0.85 # Use up to 85% of GPU memory
enable_memory_tracking: bool = True
clear_cache_between_epochs: bool = True
use_memory_efficient_attention: bool = True
attention_slicing: bool = True
slice_size: int = 1
@dataclass
class AppConfig:
"""Main application configuration"""
model: ModelConfig
training: TrainingConfig
dataset: DatasetConfig
memory: MemoryConfig
@classmethod
def from_yaml(cls, config_path: Union[str, Path]) -> "AppConfig":
"""Load configuration from YAML file"""
config_path = Path(config_path)
if not config_path.exists():
# Create default configuration
config = cls(
model=ModelConfig(),
training=TrainingConfig(),
dataset=DatasetConfig(),
memory=MemoryConfig()
)
config.save_yaml(config_path)
return config
with open(config_path, 'r', encoding='utf-8') as f:
config_dict = yaml.safe_load(f)
# Parse nested configurations
model_config = ModelConfig(**config_dict.get('model', {}))
training_config = TrainingConfig(**config_dict.get('training', {}))
dataset_config = DatasetConfig(**config_dict.get('dataset', {}))
memory_config = MemoryConfig(**config_dict.get('memory', {}))
return cls(
model=model_config,
training=training_config,
dataset=dataset_config,
memory=memory_config
)
def save_yaml(self, config_path: Union[str, Path]):
"""Save configuration to YAML file"""
config_path = Path(config_path)
config_path.parent.mkdir(parents=True, exist_ok=True)
config_dict = {
'model': {
'name': self.model.name,
'max_seq_length': self.model.max_seq_length,
'trust_remote_code': self.model.trust_remote_code,
'use_fast_tokenizer': self.model.use_fast_tokenizer,
'padding_side': self.model.padding_side,
'truncation_side': self.model.truncation_side
},
'training': {
'per_device_train_batch_size': self.training.per_device_train_batch_size,
'gradient_accumulation_steps': self.training.gradient_accumulation_steps,
'num_train_epochs': self.training.num_train_epochs,
'learning_rate': self.training.learning_rate,
'warmup_steps': self.training.warmup_steps,
'logging_steps': self.training.logging_steps,
'save_steps': self.training.save_steps,
'save_total_limit': self.training.save_total_limit,
'evaluation_strategy': self.training.evaluation_strategy,
'eval_steps': self.training.eval_steps,
'load_best_model_at_end': self.training.load_best_model_at_end,
'metric_for_best_model': self.training.metric_for_best_model,
'greater_is_better': self.training.greater_is_better,
'dataloader_num_workers': self.training.dataloader_num_workers,
'dataloader_pin_memory': self.training.dataloader_pin_memory,
'remove_unused_columns': self.training.remove_unused_columns,
'use_gradient_checkpointing': self.training.use_gradient_checkpointing,
'offload_to_cpu': self.training.offload_to_cpu,
'optim': self.training.optim,
'weight_decay': self.training.weight_decay,
'lr_scheduler_type': self.training.lr_scheduler_type,
'warmup_ratio': self.training.warmup_ratio,
'bf16': self.training.bf16,
'fp16': self.training.fp16,
'tf32': self.training.tf32,
'dataset_shuffle': self.training.dataset_shuffle,
'dataset_seed': self.training.dataset_seed
},
'dataset': {
'min_file_size': self.dataset.min_file_size,
'max_file_size': self.dataset.max_file_size,
'supported_languages': self.dataset.supported_languages,
'exclude_patterns': self.dataset.exclude_patterns
},
'memory': {
'max_memory_usage': self.memory.max_memory_usage,
'enable_memory_tracking': self.memory.enable_memory_tracking,
'clear_cache_between_epochs': self.memory.clear_cache_between_epochs,
'use_memory_efficient_attention': self.memory.use_memory_efficient_attention,
'attention_slicing': self.memory.attention_slicing,
'slice_size': self.memory.slice_size
}
}
with open(config_path, 'w', encoding='utf-8') as f:
yaml.dump(config_dict, f, default_flow_style=False, indent=2)

250
src/dataset_processor.py Normal file

@@ -0,0 +1,250 @@
"""
Dataset processor for GitHub repositories
Processes code from GitHub repositories into training datasets
"""
import json
import logging
import os
import re
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import git
from datasets import Dataset
from tqdm import tqdm
from config import AppConfig
class DatasetProcessor:
"""Processes GitHub repositories into training datasets"""
# Supported file extensions for code training
CODE_EXTENSIONS = {
'.py': 'python',
'.js': 'javascript',
'.ts': 'typescript',
'.java': 'java',
'.cpp': 'cpp',
'.c': 'c',
'.h': 'c',
'.hpp': 'cpp',
'.cs': 'csharp',
'.php': 'php',
'.rb': 'ruby',
'.go': 'go',
'.rs': 'rust',
'.swift': 'swift',
'.kt': 'kotlin',
'.scala': 'scala',
'.sql': 'sql',
'.sh': 'bash',
'.yaml': 'yaml',
'.yml': 'yaml',
'.json': 'json',
'.xml': 'xml',
'.html': 'html',
'.css': 'css',
'.md': 'markdown'
}
# Files and directories to exclude
EXCLUDE_PATTERNS = [
r'\.git/',
r'__pycache__/',
r'\.pytest_cache/',
r'node_modules/',
r'\.venv/',
r'venv/',
r'\.DS_Store',
r'\.pyc$',
r'\.pyo$',
r'\.pyd$',
r'\.so$',
r'\.dll$',
r'\.exe$',
r'\.bin$',
r'package-lock\.json$',
r'yarn\.lock$',
r'\.log$',
r'\.tmp$',
r'\.bak$',
r'~\$.*',
r'\.swp$',
r'\.swo$'
]
def __init__(self):
self.logger = logging.getLogger(__name__)
self.temp_dirs = []
def process_github_repos(self, repo_urls: List[str], config: AppConfig) -> Dataset:
"""
Process multiple GitHub repositories into a training dataset
Args:
repo_urls: List of GitHub repository URLs
config: Training configuration
Returns:
Dataset ready for training
"""
all_code_samples = []
for repo_url in repo_urls:
try:
self.logger.info(f"Processing repository: {repo_url}")
repo_samples = self._process_single_repo(repo_url, config)
all_code_samples.extend(repo_samples)
self.logger.info(f"Extracted {len(repo_samples)} samples from {repo_url}")
except Exception as e:
self.logger.error(f"Failed to process repository {repo_url}: {str(e)}")
continue
if not all_code_samples:
raise ValueError("No code samples extracted from any repository")
self.logger.info(f"Total samples collected: {len(all_code_samples)}")
# Create HuggingFace dataset
dataset = Dataset.from_list(all_code_samples)
# Filter by sequence length
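# (word count is used here as a rough proxy for token count)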
dataset = dataset.filter(
lambda x: len(x['text'].split()) <= config.model.max_seq_length
)
self.logger.info(f"Dataset size after filtering: {len(dataset)}")
return dataset
def _process_single_repo(self, repo_url: str, config: AppConfig) -> List[Dict]:
"""
Process a single GitHub repository
Args:
repo_url: GitHub repository URL
config: Training configuration
Returns:
List of code samples with metadata
"""
temp_dir = tempfile.mkdtemp()
self.temp_dirs.append(temp_dir)
try:
# Clone repository
repo_name = repo_url.split('/')[-1].replace('.git', '')
repo_path = os.path.join(temp_dir, repo_name)
self.logger.info(f"Cloning {repo_url} to {repo_path}")
repo = git.Repo.clone_from(repo_url, repo_path)
# Extract code samples
code_samples = self._extract_code_samples(repo_path, config)
return code_samples
finally:
# Cleanup
shutil.rmtree(temp_dir, ignore_errors=True)
def _extract_code_samples(self, repo_path: str, config: AppConfig) -> List[Dict]:
"""
Extract code samples from a repository
Args:
repo_path: Path to cloned repository
config: Training configuration
Returns:
List of code samples
"""
code_samples = []
repo_path_obj = Path(repo_path)
# Find all code files
code_files = []
for ext in self.CODE_EXTENSIONS:
code_files.extend(repo_path_obj.rglob(f'*{ext}'))
self.logger.info(f"Found {len(code_files)} code files")
for code_file in tqdm(code_files, desc="Processing code files"):
try:
if self._should_exclude_file(str(code_file.relative_to(repo_path))):
continue
sample = self._process_code_file(code_file, repo_path_obj, config)
if sample:
code_samples.append(sample)
except Exception as e:
self.logger.warning(f"Failed to process {code_file}: {str(e)}")
continue
return code_samples
def _should_exclude_file(self, relative_path: str) -> bool:
"""Check if a file should be excluded based on patterns"""
for pattern in self.EXCLUDE_PATTERNS:
if re.search(pattern, relative_path):
return True
return False
def _process_code_file(self, file_path: Path, repo_path: Path, config: AppConfig) -> Optional[Dict]:
"""
Process a single code file into a training sample
Args:
file_path: Path to the code file
repo_path: Path to the repository root
config: Training configuration
Returns:
Dictionary containing the processed sample or None if invalid
"""
try:
# Read file content
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
# Skip if file is too small or too large
if len(content.strip()) < 10:
return None
if len(content) > config.model.max_seq_length * 4: # Rough character limit
return None
# Get relative path for context
relative_path = file_path.relative_to(repo_path)
# Determine language
extension = file_path.suffix.lower()
language = self.CODE_EXTENSIONS.get(extension, 'unknown')
# Create training sample
sample = {
'text': content,
'language': language,
'file_path': str(relative_path),
'repo_name': repo_path.name,
'file_size': len(content),
'line_count': len(content.splitlines())
}
return sample
except Exception as e:
self.logger.warning(f"Error processing {file_path}: {str(e)}")
return None
def cleanup(self):
"""Clean up temporary directories"""
for temp_dir in self.temp_dirs:
try:
shutil.rmtree(temp_dir, ignore_errors=True)
except Exception as e:
self.logger.warning(f"Failed to cleanup {temp_dir}: {str(e)}")
self.temp_dirs.clear()

110
src/main.py Normal file

@@ -0,0 +1,110 @@
#!/usr/bin/env python3
"""
Main entry point for the AI Trainer application
Fine-tunes unsloth Qwen models (Qwen2.5-Coder-7B by default; Qwen3-8B via --config)
"""
import argparse
import logging
import os
import sys
from pathlib import Path
# Add src to path for imports
sys.path.append(str(Path(__file__).parent))
from trainer import ModelTrainer
from dataset_processor import DatasetProcessor
from config import AppConfig
from utils import setup_logging, check_gpu_memory
def parse_arguments():
"""Parse command line arguments"""
parser = argparse.ArgumentParser(description="AI Trainer for unsloth Qwen models")
parser.add_argument(
"--config",
type=str,
default="configs/training_config.yaml",
help="Path to training configuration file"
)
parser.add_argument(
"--repo1",
type=str,
required=True,
help="First GitHub repository URL"
)
parser.add_argument(
"--repo2",
type=str,
required=True,
help="Second GitHub repository URL"
)
parser.add_argument(
"--output_dir",
type=str,
default="./models",
help="Directory to save trained model"
)
parser.add_argument(
"--log_level",
type=str,
default="INFO",
choices=["DEBUG", "INFO", "WARNING", "ERROR"],
help="Logging level"
)
return parser.parse_args()
def main():
"""Main application entry point"""
args = parse_arguments()
# Setup logging
setup_logging(args.log_level)
logger = logging.getLogger(__name__)
logger.info("Starting AI Trainer for Qwen2.5-Coder-7B-Instruct-bnb-4bit")
logger.info(f"Repository 1: {args.repo1}")
logger.info(f"Repository 2: {args.repo2}")
try:
# Check GPU memory
gpu_info = check_gpu_memory()
logger.info(f"GPU Memory Info: {gpu_info}")
# Load configuration
config = AppConfig.from_yaml(args.config)
logger.info("Configuration loaded successfully")
# Process datasets from GitHub repositories
dataset_processor = DatasetProcessor()
logger.info("Processing datasets from GitHub repositories...")
train_dataset = dataset_processor.process_github_repos(
repo_urls=[args.repo1, args.repo2],
config=config
)
logger.info(f"Dataset processed successfully. Size: {len(train_dataset)}")
# Initialize and run trainer
trainer = ModelTrainer(config=config, output_dir=args.output_dir)
logger.info("Starting model training...")
trained_model_path = trainer.train(train_dataset)
logger.info(f"Training completed! Model saved to: {trained_model_path}")
except Exception as e:
logger.error(f"Training failed with error: {str(e)}")
sys.exit(1)
if __name__ == "__main__":
main()

284
src/trainer.py Normal file

@@ -0,0 +1,284 @@
"""
Model trainer for unsloth Qwen models (default: unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit)
Optimized for RTX3070 8GB VRAM with no CPU offloading
"""
import logging
import os
import gc
import torch
from pathlib import Path
from typing import Optional, Dict, Any
import torch.nn as nn
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
Trainer,
TrainingArguments,
DataCollatorForLanguageModeling
)
from datasets import Dataset
from unsloth import FastLanguageModel, is_bfloat16_supported
from config import AppConfig
from utils import check_gpu_memory, clear_gpu_cache, get_memory_usage
class ModelTrainer:
"""Trainer class for fine-tuning the Qwen2.5-Coder model"""
def __init__(self, config: AppConfig, output_dir: str = "./models"):
"""
Initialize the model trainer
Args:
config: Application configuration
output_dir: Directory to save the trained model
"""
self.config = config
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.logger = logging.getLogger(__name__)
# Model and tokenizer
self.model = None
self.tokenizer = None
# Training components
self.trainer = None
# Memory tracking
self.initial_memory = None
def train(self, train_dataset: Dataset) -> str:
"""
Train the model on the provided dataset
Args:
train_dataset: Dataset for training
Returns:
Path to the saved model
"""
try:
self.logger.info("Starting model training...")
# Check initial GPU memory
self._check_initial_setup()
# Load model and tokenizer
self._load_model_and_tokenizer()
# Prepare dataset
tokenized_dataset = self._prepare_dataset(train_dataset)
# Setup trainer
self._setup_trainer(tokenized_dataset)
# Start training
self.logger.info("Beginning training loop...")
self.trainer.train()
# Save final model
final_model_path = self._save_model()
self.logger.info(f"Training completed successfully! Model saved to: {final_model_path}")
return str(final_model_path)
except Exception as e:
self.logger.error(f"Training failed: {str(e)}")
raise
finally:
self._cleanup()
def _check_initial_setup(self):
"""Check initial GPU memory and setup"""
gpu_info = check_gpu_memory()
self.logger.info(f"GPU Memory Info: {gpu_info}")
# Store initial memory usage
self.initial_memory = get_memory_usage()
self.logger.info(".2f")
# Verify CUDA availability
if not torch.cuda.is_available():
raise RuntimeError("CUDA is not available. This trainer requires a CUDA-compatible GPU.")
self.logger.info(f"CUDA device: {torch.cuda.get_device_name()}")
self.logger.info(f"CUDA version: {torch.version.cuda}")
def _load_model_and_tokenizer(self):
"""Load the model and tokenizer with memory optimization"""
self.logger.info(f"Loading model: {self.config.model.name}")
# Clear cache before loading
clear_gpu_cache()
try:
# Load model with unsloth for memory efficiency
self.model, self.tokenizer = FastLanguageModel.from_pretrained(
model_name=self.config.model.name,
max_seq_length=self.config.model.max_seq_length,
dtype=None, # Auto-detect
load_in_4bit=True, # Use 4-bit quantization
token=None, # Use default token
)
# Configure model for training
self.model = FastLanguageModel.get_peft_model(
self.model,
r=16, # LoRA rank
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_alpha=16,
lora_dropout=0, # Supports any, but = 0 is optimized
bias="none", # Supports any, but = "none" is optimized
use_gradient_checkpointing=self.config.training.use_gradient_checkpointing,
random_state=3407,
use_rslora=False, # We support rank stabilized LoRA
loftq_config=None, # And LoftQ
)
self.logger.info("Model and tokenizer loaded successfully")
except Exception as e:
self.logger.error(f"Failed to load model: {str(e)}")
raise
def _prepare_dataset(self, train_dataset: Dataset) -> Dataset:
"""Prepare and tokenize the dataset"""
self.logger.info("Preparing dataset...")
def tokenize_function(examples):
return self.tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=self.config.model.max_seq_length,
return_tensors="pt"
)
# Tokenize dataset
tokenized_dataset = train_dataset.map(
tokenize_function,
batched=True,
remove_columns=["text", "language", "file_path", "repo_name", "file_size", "line_count"],
desc="Tokenizing dataset"
)
self.logger.info(f"Dataset tokenized. Size: {len(tokenized_dataset)}")
return tokenized_dataset
def _setup_trainer(self, tokenized_dataset: Dataset):
"""Setup the HuggingFace trainer with memory optimizations"""
self.logger.info("Setting up trainer...")
# Training arguments optimized for RTX3070 8GB
training_args = TrainingArguments(
output_dir=str(self.output_dir / "checkpoints"),
num_train_epochs=self.config.training.num_train_epochs,
per_device_train_batch_size=self.config.training.per_device_train_batch_size,
gradient_accumulation_steps=self.config.training.gradient_accumulation_steps,
learning_rate=self.config.training.learning_rate,
warmup_steps=self.config.training.warmup_steps,
warmup_ratio=self.config.training.warmup_ratio,
logging_steps=self.config.training.logging_steps,
save_steps=self.config.training.save_steps,
save_total_limit=self.config.training.save_total_limit,
evaluation_strategy=self.config.training.evaluation_strategy,
eval_steps=self.config.training.eval_steps,
load_best_model_at_end=self.config.training.load_best_model_at_end,
metric_for_best_model=self.config.training.metric_for_best_model,
greater_is_better=self.config.training.greater_is_better,
optim=self.config.training.optim,
weight_decay=self.config.training.weight_decay,
lr_scheduler_type=self.config.training.lr_scheduler_type,
adam_beta1=self.config.training.adam_beta1,
adam_beta2=self.config.training.adam_beta2,
adam_epsilon=self.config.training.adam_epsilon,
max_grad_norm=self.config.training.max_grad_norm,
dataloader_num_workers=self.config.training.dataloader_num_workers,
dataloader_pin_memory=self.config.training.dataloader_pin_memory,
remove_unused_columns=self.config.training.remove_unused_columns,
bf16=self.config.training.bf16 if is_bfloat16_supported() else False,
fp16=self.config.training.fp16,
tf32=self.config.training.tf32,
report_to=self.config.training.report_to,
logging_dir=self.config.training.logging_dir,
seed=self.config.training.dataset_seed,
data_seed=self.config.training.dataset_seed,
dataloader_drop_last=True, # Better memory management
gradient_checkpointing=self.config.training.use_gradient_checkpointing,
# Memory optimization settings
ddp_find_unused_parameters=False,
per_device_eval_batch_size=self.config.training.per_device_train_batch_size,
)
# Data collator
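# mlm=False makes the collator copy input_ids into labels (padding becomes -100) for causal LM training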
data_collator = DataCollatorForLanguageModeling(
tokenizer=self.tokenizer,
mlm=False # Causal language modeling
)
# Initialize trainer
self.trainer = Trainer(
model=self.model,
args=training_args,
train_dataset=tokenized_dataset,
eval_dataset=tokenized_dataset, # Using same dataset for eval (for demo)
data_collator=data_collator,
tokenizer=self.tokenizer,
)
self.logger.info("Trainer setup completed")
def _save_model(self) -> Path:
"""Save the trained model"""
self.logger.info("Saving model...")
# Create final model directory
final_model_dir = self.output_dir / "final_model"
final_model_dir.mkdir(parents=True, exist_ok=True)
try:
# Save the model
self.model.save_pretrained(str(final_model_dir))
self.tokenizer.save_pretrained(str(final_model_dir))
# Save configuration
self.config.save_yaml(final_model_dir / "training_config.yaml")
self.logger.info(f"Model saved to: {final_model_dir}")
return final_model_dir
except Exception as e:
self.logger.error(f"Failed to save model: {str(e)}")
raise
def _cleanup(self):
"""Clean up resources"""
try:
# Clear GPU cache
clear_gpu_cache()
# Force garbage collection
gc.collect()
# Delete model and tokenizer to free memory
if self.model is not None:
del self.model
if self.tokenizer is not None:
del self.tokenizer
if self.trainer is not None:
del self.trainer
# Final memory cleanup
if torch.cuda.is_available():
torch.cuda.empty_cache()
except Exception as e:
self.logger.warning(f"Error during cleanup: {str(e)}")

319
src/utils.py Normal file

@@ -0,0 +1,319 @@
"""
Utility functions for AI Trainer
Memory management, logging, and helper functions optimized for RTX3070 8GB VRAM
"""
import gc
import logging
import os
import sys
from pathlib import Path
from typing import Dict, Optional, Tuple, Any
import torch
import psutil
from colorama import init, Fore, Back, Style
# Initialize colorama for cross-platform colored output
init(autoreset=True)
def setup_logging(log_level: str = "INFO", log_file: Optional[str] = None) -> logging.Logger:
"""
Setup logging configuration with colored console output
Args:
log_level: Logging level (DEBUG, INFO, WARNING, ERROR)
log_file: Optional log file path
Returns:
Configured logger
"""
# Create formatter with colors
class ColoredFormatter(logging.Formatter):
COLORS = {
'DEBUG': Fore.CYAN,
'INFO': Fore.GREEN,
'WARNING': Fore.YELLOW,
'ERROR': Fore.RED,
'CRITICAL': Fore.RED + Back.WHITE
}
def format(self, record):
# Add color to the level name
if record.levelname in self.COLORS:
colored_levelname = f"{self.COLORS[record.levelname]}{record.levelname}{Style.RESET_ALL}"
record.levelname = colored_levelname
return super().format(record)
# Create logger
logger = logging.getLogger()
logger.setLevel(getattr(logging, log_level.upper()))
# Console handler with colors
console_handler = logging.StreamHandler(sys.stdout)
console_formatter = ColoredFormatter(
'%(asctime)s - %(name)s - %(levelname)s - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
console_handler.setFormatter(console_formatter)
logger.addHandler(console_handler)
# File handler if specified
if log_file:
log_path = Path(log_file)
log_path.parent.mkdir(parents=True, exist_ok=True)
file_handler = logging.FileHandler(log_path)
file_formatter = logging.Formatter(
'%(asctime)s - %(name)s - %(levelname)s - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)
file_handler.setFormatter(file_formatter)
logger.addHandler(file_handler)
return logger
def check_gpu_memory() -> Dict[str, Any]:
"""
Check GPU memory status and availability
Returns:
Dictionary with GPU memory information
"""
if not torch.cuda.is_available():
return {"error": "CUDA not available"}
try:
device = torch.cuda.current_device()
total_memory = torch.cuda.get_device_properties(device).total_memory
allocated_memory = torch.cuda.memory_allocated(device)
reserved_memory = torch.cuda.memory_reserved(device)
free_memory = total_memory - allocated_memory
return {
"device": torch.cuda.get_device_name(device),
"device_id": device,
"total_memory_gb": round(total_memory / (1024**3), 2),
"allocated_memory_gb": round(allocated_memory / (1024**3), 2),
"reserved_memory_gb": round(reserved_memory / (1024**3), 2),
"free_memory_gb": round(free_memory / (1024**3), 2),
"memory_utilization": round((allocated_memory / total_memory) * 100, 2),
"cuda_version": torch.version.cuda,
"cudnn_version": torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else "N/A"
}
except Exception as e:
return {"error": f"Failed to get GPU info: {str(e)}"}
def get_memory_usage() -> Dict[str, float]:
"""
Get system memory usage
Returns:
Dictionary with memory usage information
"""
try:
# GPU memory
gpu_memory = check_gpu_memory()
# System memory
system_memory = psutil.virtual_memory()
return {
"gpu_total_gb": gpu_memory.get("total_memory_gb", 0),
"gpu_allocated_gb": gpu_memory.get("allocated_memory_gb", 0),
"gpu_free_gb": gpu_memory.get("free_memory_gb", 0),
"system_total_gb": round(system_memory.total / (1024**3), 2),
"system_available_gb": round(system_memory.available / (1024**3), 2),
"system_used_gb": round(system_memory.used / (1024**3), 2),
"system_memory_percent": system_memory.percent
}
except Exception as e:
return {"error": f"Failed to get memory usage: {str(e)}"}
def clear_gpu_cache():
"""Clear GPU cache and perform garbage collection"""
try:
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
# Force garbage collection
gc.collect()
except Exception as e:
print(f"Warning: Failed to clear GPU cache: {str(e)}")
def optimize_memory_settings():
"""Apply memory optimization settings for RTX3070"""
try:
if torch.cuda.is_available():
# Set memory fraction to prevent out-of-memory
torch.cuda.set_per_process_memory_fraction(0.85) # Use 85% of GPU memory
# Enable TF32 for better performance
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Optimize CUDA memory allocator
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
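# Note: this env var is only read when the CUDA allocator initializes, so it must be set before the first CUDA allocation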
except Exception as e:
print(f"Warning: Failed to optimize memory settings: {str(e)}")
def format_bytes(bytes_value: int) -> str:
"""
Format bytes into human readable format
Args:
bytes_value: Number of bytes
Returns:
Formatted string (e.g., "1.5 GB")
"""
for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
if bytes_value < 1024.0:
return ".1f"
bytes_value /= 1024.0
return ".1f"
def print_system_info():
"""Print comprehensive system information"""
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
print(f"{Fore.CYAN}SYSTEM INFORMATION{Style.RESET_ALL}")
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
# GPU Information
gpu_info = check_gpu_memory()
if "error" not in gpu_info:
print(f"\n{Fore.GREEN}GPU Information:{Style.RESET_ALL}")
print(f" Device: {gpu_info['device']}")
print(f" CUDA Version: {gpu_info['cuda_version']}")
print(f" Total Memory: {gpu_info['total_memory_gb']} GB")
print(f" Allocated Memory: {gpu_info['allocated_memory_gb']} GB")
print(f" Free Memory: {gpu_info['free_memory_gb']} GB")
print(f" Memory Utilization: {gpu_info['memory_utilization']}%")
else:
print(f"\n{Fore.RED}GPU Information: {gpu_info['error']}{Style.RESET_ALL}")
# System Memory
system_memory = psutil.virtual_memory()
print(f"\n{Fore.GREEN}System Memory:{Style.RESET_ALL}")
print(f" Total: {format_bytes(system_memory.total)}")
print(f" Available: {format_bytes(system_memory.available)}")
print(f" Used: {format_bytes(system_memory.used)}")
print(f" Usage: {system_memory.percent}%")
# CPU Information
print(f"\n{Fore.GREEN}CPU Information:{Style.RESET_ALL}")
print(f" Cores: {psutil.cpu_count(logical=False)} physical, {psutil.cpu_count(logical=True)} logical")
print(f" CPU Usage: {psutil.cpu_percent()}%")
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
def validate_environment():
"""Validate that the environment is suitable for training"""
issues = []
# Check CUDA availability
if not torch.cuda.is_available():
issues.append("CUDA is not available. A CUDA-compatible GPU is required.")
# Check GPU memory
if torch.cuda.is_available():
gpu_info = check_gpu_memory()
if "total_memory_gb" in gpu_info:
total_memory = gpu_info["total_memory_gb"]
if total_memory < 7.5:  # 8 GB cards typically report slightly under 8 GB total
issues.append(f"GPU memory ({total_memory} GB) may be insufficient. Recommended: 8GB+")
# Check required Python modules
required_modules = ['torch', 'transformers', 'datasets', 'git']
for module in required_modules:
try:
__import__(module)
except ImportError:
issues.append(f"Required module '{module}' is not installed.")
if issues:
print(f"\n{Fore.YELLOW}Environment Validation Issues:{Style.RESET_ALL}")
for issue in issues:
print(f" - {issue}")
return False
print(f"\n{Fore.GREEN}Environment validation passed!{Style.RESET_ALL}")
return True
def create_training_summary(config, training_time: float, final_model_path: str) -> str:
"""
Create a summary of the training session
Args:
config: Training configuration
training_time: Training time in seconds
final_model_path: Path to the saved model
Returns:
Formatted summary string
"""
summary = ".1f"".2f"f"""
{Fore.CYAN}{'='*60}{Style.RESET_ALL}
TRAINING SUMMARY
{Fore.CYAN}{'='*60}{Style.RESET_ALL}
Configuration:
Model: {config.model.name}
Epochs: {config.training.num_train_epochs}
Batch Size: {config.training.per_device_train_batch_size}
Gradient Accumulation: {config.training.gradient_accumulation_steps}
Learning Rate: {config.training.learning_rate}
Max Sequence Length: {config.model.max_seq_length}
Performance:
Training Time: {training_time:.2f} seconds ({training_time/3600:.2f} hours)
Effective Batch Size: {config.training.per_device_train_batch_size * config.training.gradient_accumulation_steps}
Output:
Model Saved To: {final_model_path}
Memory Settings:
Gradient Checkpointing: {config.training.use_gradient_checkpointing}
CPU Offloading: {config.training.offload_to_cpu}
BF16 Enabled: {config.training.bf16}
{Fore.CYAN}{'='*60}{Style.RESET_ALL}
"""
return summary
def safe_import(module_name: str, fallback: Any = None):
"""
Safely import a module with fallback
Args:
module_name: Name of the module to import
fallback: Fallback value if import fails
Returns:
Imported module or fallback
"""
try:
return __import__(module_name)
except ImportError:
return fallback
# Initialize memory optimization settings on import
try:
optimize_memory_settings()
except Exception:
pass # Ignore errors during initialization