ai_html_document_trainer/README.md
2025-08-22 16:30:56 +07:00


# Odoo AI Model Trainer
A Python project for fine-tuning an AI model on Odoo documentation using Unsloth, configured for an RTX3070 with 8GB of VRAM. It scrapes the English and Indonesian Odoo documentation and fine-tunes the `unsloth/Qwen3-8B-bnb-4bit` model on the result.
## Features
- 🌐 **Bilingual Support**: Scrapes both English and Indonesian Odoo documentation
- 🚀 **Optimized Training**: Uses Unsloth, which advertises about 2x faster training and about 70% lower memory use
- 🎯 **RTX3070 Optimized**: Configured for 8GB VRAM with memory-efficient settings
- 📊 **Data Pipeline**: Complete pipeline from data collection to model training
- 🔧 **Modular Design**: Separate scripts for scraping, preprocessing, and training
- 📈 **Progress Tracking**: Built-in statistics and progress monitoring
## Requirements
### Hardware
- NVIDIA RTX3070 (8GB VRAM) or better
- 16GB+ RAM recommended
- 50GB+ free disk space
### Software
- Python 3.8+
- CUDA 11.8+
- PyTorch with CUDA support
## Installation
1. **Clone or download this project**
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Verify CUDA installation**:
```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
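For a more detailed check than the one-liner, the sketch below also reports the CUDA version PyTorch was built against and the GPU's total memory. `describe_gpu` is a hypothetical helper name, not part of this project; it returns a minimal dictionary when PyTorch is missing or no GPU is visible.

```python
def describe_gpu():
    """Return basic CUDA/GPU facts, or a minimal dict if PyTorch is unavailable."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    info = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
    }
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu_name"] = props.name
        info["vram_gib"] = round(props.total_memory / 1024**3, 1)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False` while `cuda_build` is set, the usual suspects are the GPU driver or a mismatch between the driver and the installed PyTorch build.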
## Usage
### Full Pipeline (Recommended)
Run the complete training pipeline:
```bash
python main.py
```
### Step-by-Step Execution
1. **Data Collection Only**:
```bash
python main.py --only-collection
```
2. **Data Preprocessing Only**:
```bash
python main.py --only-preprocessing
```
3. **Model Training Only**:
```bash
python main.py --only-training
```
### Skip Specific Steps
```bash
# Skip data collection if you already have data
python main.py --skip-collection
# Skip preprocessing if you already have training data
python main.py --skip-preprocessing
# Skip training for testing other components
python main.py --skip-training
```
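The flags above could be resolved into a list of steps roughly as follows. This is a sketch, not the actual contents of `main.py`: `select_steps` and `build_parser` are hypothetical names, and the rule that an `--only-*` flag wins over any `--skip-*` flag is an assumption.

```python
import argparse

STEPS = ("collection", "preprocessing", "training")


def build_parser():
    parser = argparse.ArgumentParser(description="Pipeline step selection (illustrative)")
    for step in STEPS:
        parser.add_argument(f"--only-{step}", action="store_true")
        parser.add_argument(f"--skip-{step}", action="store_true")
    return parser


def select_steps(argv):
    """Return the ordered list of steps to run for the given CLI arguments."""
    # argparse stores "--only-collection" under the key "only_collection".
    args = vars(build_parser().parse_args(argv))
    only = [step for step in STEPS if args[f"only_{step}"]]
    if only:
        # Assumption: an --only-* flag takes precedence over any --skip-* flag.
        return only
    return [step for step in STEPS if not args[f"skip_{step}"]]


print(select_steps([]))                     # ['collection', 'preprocessing', 'training']
print(select_steps(["--skip-collection"]))  # ['preprocessing', 'training']
print(select_steps(["--only-training"]))    # ['training']
```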
### Individual Scripts
You can also run individual scripts directly:
```bash
# Scrape Odoo documentation
python data_scraper.py
# Preprocess the scraped data
python data_preprocessor.py
# Train the model
python train_model.py
```
## Project Structure
```
.
├── main.py # Main orchestrator script
├── data_scraper.py # Web scraping for Odoo docs
├── data_preprocessor.py # Data cleaning and formatting
├── train_model.py # Model training with Unsloth
├── requirements.txt # Python dependencies
├── README.md # This file
├── odoo_docs_data.csv # Scraped raw data (generated)
├── training_data.json # Processed training data (generated)
├── odoo_model_output/ # Trained model (generated)
└── odoo_model_output_gguf/ # GGUF quantized model (generated)
```
## Output Files
- **odoo_docs_data.csv**: Raw scraped documentation
- **training_data.json**: Processed training data in instruction format
- **odoo_model_output/**: Directory containing the fine-tuned model
- **odoo_model_output_gguf/**: GGUF quantized model for deployment
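To illustrate what "instruction format" can look like, here is a minimal sketch that turns scraped CSV rows into Alpaca-style records. The column names (`title`, `content`, `language`), the prompt wording, and the record keys are assumptions for illustration, and the sample rows are made up; the real schema is whatever `data_scraper.py` and `data_preprocessor.py` produce.

```python
import csv
import io
import json


def rows_to_instructions(csv_text):
    """Convert scraped CSV rows into Alpaca-style instruction records."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        content = (row.get("content") or "").strip()
        if not content:
            continue  # skip pages with no extracted text
        records.append({
            "instruction": f"Explain the Odoo topic: {row['title'].strip()}",
            "input": "",
            "output": content,
        })
    return records


# Made-up sample rows, standing in for odoo_docs_data.csv.
sample_csv = (
    "title,content,language\n"
    "Invoicing,Invoices are created from sales orders.,en\n"
    "Empty page,,en\n"
)
records = rows_to_instructions(sample_csv)
print(json.dumps(records, indent=2, ensure_ascii=False))
```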
## Configuration
### Memory Optimization for RTX3070
The training is configured with:
- Batch size: 1 (per device)
- Gradient accumulation: 4 (effective batch size: 4)
- Max sequence length: 2048 tokens
- 4-bit quantization to save VRAM
- Gradient checkpointing enabled
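The arithmetic behind these settings can be checked with a few lines. The weight estimate is a rough sketch: it assumes about 8 billion parameters stored at 4 bits each and ignores quantization metadata, LoRA adapters, activations, and optimizer state, which account for the rest of the VRAM used during training.

```python
per_device_batch_size = 1
gradient_accumulation_steps = 4
max_seq_length = 2048

# Gradients from 4 micro-batches are summed before each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
max_tokens_per_update = effective_batch_size * max_seq_length

# Assumption: ~8e9 parameters at 0.5 bytes (4 bits) each.
approx_params = 8e9
weight_gib = approx_params * 0.5 / 1024**3

print(effective_batch_size)   # 4
print(max_tokens_per_update)  # 8192
print(round(weight_gib, 1))   # 3.7
```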
### Training Parameters
- Learning rate: 2e-4
- Max steps: 100 (increase for production)
- Warmup steps: 5
- LoRA rank: 16
- LoRA alpha: 16
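As a rough guide to what these numbers mean: standard LoRA scales its low-rank update by `alpha / rank`, so rank 16 with alpha 16 gives a scale of 1.0. The sketch below also counts trainable parameters for one adapted matrix and shows a linear warmup. The 4096 x 4096 matrix size is an illustrative assumption, and the warmup shape is assumed to be linear; the actual scheduler is whatever `train_model.py` configures.

```python
learning_rate = 2e-4
warmup_steps = 5
lora_rank = 16
lora_alpha = 16

# Standard LoRA scales the low-rank update B @ A by alpha / rank.
lora_scale = lora_alpha / lora_rank


def lora_param_count(d_in, d_out, rank=lora_rank):
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def warmup_lr(step, base_lr=learning_rate, warmup=warmup_steps):
    """Learning rate under an assumed linear warmup; held at base_lr afterwards in this sketch."""
    return base_lr * (min(step, warmup) / warmup)


print(lora_scale)                    # 1.0
print(lora_param_count(4096, 4096))  # 131072
print([warmup_lr(step) for step in range(0, 6)])
```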
## Troubleshooting
### CUDA Out of Memory
If you encounter CUDA OOM errors:
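1. Reduce batch size in `train_model.py` (the default is already 1, so this only applies if you raised it)
2. Increase gradient accumulation steps to keep the effective batch size when you lower the batch size; this does not reduce memory use on its own
3. Reduce max sequence length (for example, from 2048 to 1024)
4. Restart your Python session to release VRAM held by a previous run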
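The adjustments above can be applied one at a time. A minimal sketch of such a fallback order follows; `next_lighter_config`, the dictionary keys, and the 512-token floor are hypothetical and not taken from `train_model.py`.

```python
def next_lighter_config(config, min_seq_len=512):
    """Return a less memory-hungry copy of config, or None if nothing is left to reduce."""
    cfg = dict(config)
    if cfg["batch_size"] > 1:
        # Halve the per-device batch and double accumulation: for even batch
        # sizes the effective batch is unchanged while peak activations shrink.
        cfg["batch_size"] //= 2
        cfg["grad_accum"] *= 2
        return cfg
    if cfg["max_seq_len"] // 2 >= min_seq_len:
        # Batch size is already 1, so shorten the sequences instead.
        cfg["max_seq_len"] //= 2
        return cfg
    return None


config = {"batch_size": 1, "grad_accum": 4, "max_seq_len": 2048}
while config is not None:
    print(config)
    config = next_lighter_config(config)
```

In practice each candidate would be tried only after the previous one raised an out-of-memory error, ideally in a fresh Python session.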
### Data Collection Issues
- Check your internet connection
- The Odoo website may block rapid requests; the script includes delays between requests
- If scraping the Indonesian docs fails, they may have moved to a different URL
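If requests still fail intermittently, retrying with a growing pause is one option. The sketch below uses only the standard library and is not the logic in `data_scraper.py`; `fetch_url`, `fetch_with_retry`, the `User-Agent` string, and the retry and delay values are all assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_url(url, timeout=30):
    """Download a page body as text using only the standard library."""
    request = urllib.request.Request(url, headers={"User-Agent": "odoo-docs-trainer/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, fetch=fetch_url, retries=3, delay=2.0, sleep=time.sleep):
    """Try fetch(url) up to `retries` times, waiting longer after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # linear backoff: 2s, then 4s, ...
    raise last_error
```

The `fetch` and `sleep` parameters are injectable so the retry behaviour can be exercised without touching the network.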
### Training Issues
- Ensure CUDA is properly installed
- Check that your GPU drivers are up to date
- Verify PyTorch CUDA compatibility
## Model Usage
After training, you can use the model for Odoo-related questions:
```python
from train_model import OdooModelTrainer
trainer = OdooModelTrainer()
trainer.load_model()
# To answer with the fine-tuned weights, replace the model with the one saved in odoo_model_output/:
# trainer.model = ... (load from odoo_model_output)
response = trainer.generate_response("How do I install Odoo?")
print(response)
```
## Performance Notes
- **Training Time**: ~30-60 minutes for 100 steps on RTX3070
- **Memory Usage**: ~6-7GB VRAM during training
- **Data Size**: ~20-50MB of documentation data
- **Model Size**: ~4-5GB for the fine-tuned model
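Since 100 steps is a short run, it helps to extrapolate before raising `max_steps`. The sketch below scales the training-time figure above linearly, which is an assumption: it ignores model loading, checkpoint saving, and any slowdown from thermal throttling. `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(target_steps, measured_minutes=(30, 60), measured_steps=100):
    """Linearly scale the measured (low, high) minutes to a different step count."""
    low, high = measured_minutes
    scale = target_steps / measured_steps
    return low * scale, high * scale


# Per-step cost implied by 30-60 minutes for 100 steps.
seconds_per_step = tuple(minutes * 60 / 100 for minutes in (30, 60))

print(seconds_per_step)        # (18.0, 36.0)
print(estimate_minutes(100))   # (30.0, 60.0)
print(estimate_minutes(1000))  # (300.0, 600.0)
```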
## Contributing
Feel free to submit issues and enhancement requests!
## License
This project is open source. Please check individual component licenses for details.
## Disclaimer
This project is for educational and research purposes. Ensure compliance with Odoo's terms of service when scraping documentation.