
# Odoo AI Model Trainer

A Python project for fine-tuning an AI model on Odoo documentation using Unsloth, optimized for an RTX3070 with 8GB of VRAM. The project scrapes both the English and Indonesian Odoo documentation and fine-tunes the `unsloth/Qwen3-8B-bnb-4bit` model on the result.

## Features

- 🌐 **Bilingual Support**: Scrapes both English and Indonesian Odoo documentation
- 🚀 **Optimized Training**: Uses Unsloth, which advertises roughly 2x faster training and 70% less memory use
- 🎯 **RTX3070 Optimized**: Configured for 8GB VRAM with memory-efficient settings
- 📊 **Data Pipeline**: Complete pipeline from data collection to model training
- 🔧 **Modular Design**: Separate scripts for scraping, preprocessing, and training
- 📈 **Progress Tracking**: Built-in statistics and progress monitoring

## Requirements

### Hardware
- NVIDIA RTX3070 (8GB VRAM) or better
- 16GB+ RAM recommended
- 50GB+ free disk space

### Software
- Python 3.8+
- CUDA 11.8+
- PyTorch with CUDA support

## Installation

1. **Clone or download this project**

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Verify CUDA installation**:
   ```bash
   python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
   ```
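
For a slightly more detailed check, the following sketch also reports the GPU name and total VRAM. It is only an illustration: `describe_gpu` is a hypothetical helper, not part of this project, and it returns an "unavailable" result instead of failing when PyTorch is not installed.

```python
def describe_gpu():
    """Return a small dict describing the first CUDA device, if any."""
    info = {"cuda_available": False, "name": None, "vram_gb": None}
    try:
        import torch  # imported lazily so the check degrades gracefully
    except ImportError:
        return info

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["cuda_available"] = True
        info["name"] = props.name
        # total_memory is reported in bytes; convert to GiB.
        info["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return info


print(describe_gpu())
```

If `vram_gb` comes back well below 8, the default settings in this README may not fit on your card.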

## Usage

### Full Pipeline (Recommended)
Run the complete training pipeline:
```bash
python main.py
```

### Step-by-Step Execution

1. **Data Collection Only**:
   ```bash
   python main.py --only-collection
   ```

2. **Data Preprocessing Only**:
   ```bash
   python main.py --only-preprocessing
   ```

3. **Model Training Only**:
   ```bash
   python main.py --only-training
   ```

### Skip Specific Steps
```bash
# Skip data collection if you already have data
python main.py --skip-collection

# Skip preprocessing if you already have training data
python main.py --skip-preprocessing

# Skip training for testing other components
python main.py --skip-training
```
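
To make the interaction between the `--only-*` and `--skip-*` flags concrete, here is a minimal sketch of how such flags could be resolved into a list of steps. This is an assumption about the behavior, not the actual code in `main.py`; `build_parser` and `resolve_steps` are hypothetical names, and rejecting more than one `--only-*` flag is a choice made for this example.

```python
import argparse

STEPS = ["collection", "preprocessing", "training"]


def build_parser():
    """Create a parser with one --only-* and one --skip-* flag per step."""
    parser = argparse.ArgumentParser(description="Pipeline flags (sketch)")
    for step in STEPS:
        parser.add_argument(f"--only-{step}", action="store_true")
        parser.add_argument(f"--skip-{step}", action="store_true")
    return parser


def resolve_steps(argv):
    """Return the pipeline steps to run, in order, for the given arguments."""
    args = vars(build_parser().parse_args(argv))
    only = [step for step in STEPS if args[f"only_{step}"]]
    if len(only) > 1:
        raise SystemExit("Use at most one --only-* flag")
    if only:
        return only
    # No --only-* flag: run everything that was not explicitly skipped.
    return [step for step in STEPS if not args[f"skip_{step}"]]


print(resolve_steps(["--skip-collection"]))  # ['preprocessing', 'training']
```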

### Individual Scripts
You can also run individual scripts directly:

```bash
# Scrape Odoo documentation
python data_scraper.py

# Preprocess the scraped data
python data_preprocessor.py

# Train the model
python train_model.py
```

## Project Structure

```
.
├── main.py                    # Main orchestrator script
├── data_scraper.py           # Web scraping for Odoo docs
├── data_preprocessor.py      # Data cleaning and formatting
├── train_model.py           # Model training with Unsloth
├── requirements.txt          # Python dependencies
├── README.md                # This file
├── odoo_docs_data.csv       # Scraped raw data (generated)
├── training_data.json       # Processed training data (generated)
├── odoo_model_output/       # Trained model (generated)
└── odoo_model_output_gguf/  # GGUF quantized model (generated)
```

## Output Files

- **odoo_docs_data.csv**: Raw scraped documentation
- **training_data.json**: Processed training data in instruction format
- **odoo_model_output/**: Directory containing the fine-tuned model
- **odoo_model_output_gguf/**: GGUF quantized model for deployment
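
As a rough illustration of the step from the raw CSV to instruction-format JSON, the sketch below converts CSV rows into records. The column names (`title`, `content`, `language`) and the `instruction`/`input`/`output` keys are assumptions made for this example; the real files produced by `data_scraper.py` and `data_preprocessor.py` may use a different schema.

```python
import csv
import io
import json


def rows_to_records(csv_text):
    """Turn scraped rows into instruction-style records (assumed schema)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        content = row["content"].strip()
        if not content:
            continue  # skip pages where no text was extracted
        records.append({
            "instruction": f"Explain the Odoo documentation topic: {row['title']}",
            "input": "",
            "output": content,
        })
    return records


# Tiny in-memory sample standing in for odoo_docs_data.csv.
sample = "title,content,language\nInstall,Run the installer.,en\nEmpty,,en\n"
print(json.dumps(rows_to_records(sample), indent=2))
```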

## Configuration

### Memory Optimization for RTX3070
The training is configured with:
- Batch size: 1 (per device)
- Gradient accumulation: 4 (effective batch size: 4)
- Max sequence length: 2048 tokens
- 4-bit quantization to save VRAM
- Gradient checkpointing enabled
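
The relationship between these settings is simple arithmetic. The helper names below are hypothetical and only illustrate the calculation for a single GPU:

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Examples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


def max_tokens_per_update(per_device_batch, grad_accum_steps, max_seq_length):
    """Upper bound on tokens per update if every sequence is full length."""
    return effective_batch_size(per_device_batch, grad_accum_steps) * max_seq_length


print(effective_batch_size(1, 4))         # 4
print(max_tokens_per_update(1, 4, 2048))  # 8192
```

Only the per-device batch size and the sequence length affect how much is held in VRAM at once; gradient accumulation spreads the effective batch over several forward and backward passes.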

### Training Parameters
- Learning rate: 2e-4
- Max steps: 100 (increase for production)
- Warmup steps: 5
- LoRA rank: 16
- LoRA alpha: 16
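
A few derived quantities help put these numbers in context. In standard LoRA the adapter update is scaled by `alpha / rank`; the sketch below computes that factor along with the number of examples seen in a run, assuming the effective batch size of 4 from the memory settings above. The function names are hypothetical.

```python
def lora_scaling(alpha, rank):
    """Scaling factor applied to the LoRA update in standard LoRA."""
    return alpha / rank


def examples_seen(max_steps, effective_batch_size):
    """Training examples consumed over the whole run."""
    return max_steps * effective_batch_size


def warmup_fraction(warmup_steps, max_steps):
    """Share of the run spent in learning-rate warmup."""
    return warmup_steps / max_steps


print(lora_scaling(16, 16))     # 1.0
print(examples_seen(100, 4))    # 400
print(warmup_fraction(5, 100))  # 0.05
```

With the defaults, a run sees only 400 examples, which is why the step count should be raised for production use.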

## Troubleshooting

### CUDA Out of Memory
If you encounter CUDA OOM errors:
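1. Reduce the per-device batch size in `train_model.py` (the default is already 1, so this only applies if you raised it)
2. Increase gradient accumulation steps to keep the same effective batch size after lowering the batch size
3. Reduce the max sequence length
4. Restart your Python session to release GPU memory held by earlier runs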
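
The steps above can be applied in order as a fallback ladder. The following sketch is one possible way to do that: it first halves the batch size while raising gradient accumulation to preserve the effective batch size, then halves the sequence length. The function name and the 512-token floor are assumptions for this example, not values taken from `train_model.py`.

```python
import math


def next_safer_settings(batch_size, grad_accum, max_seq_length, min_seq_length=512):
    """Return a more conservative (batch, accum, seq_len) tuple, or None."""
    if batch_size > 1:
        new_batch = batch_size // 2
        # Raise accumulation so the effective batch size does not shrink.
        new_accum = math.ceil(batch_size * grad_accum / new_batch)
        return new_batch, new_accum, max_seq_length
    if max_seq_length // 2 >= min_seq_length:
        return batch_size, grad_accum, max_seq_length // 2
    return None  # nothing left to reduce


print(next_safer_settings(1, 4, 2048))  # (1, 4, 1024)
```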

### Data Collection Issues
- Check your internet connection
- The Odoo website may block rapid requests; the script includes delays to reduce this risk
- If the Indonesian docs fail to download, they may be at a different URL
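
If requests are still being rejected, retrying with increasing delays can help. The sketch below shows the general idea with exponential backoff. It is not the logic used in `data_scraper.py`; `fetch_with_retries` is a hypothetical helper, and the fetch function is passed in so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), doubling the wait after each failed attempt."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as error:  # urllib.error.URLError is an OSError subclass
            last_error = error
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demo with a fake fetcher that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("temporarily blocked")
    return "<html>ok</html>"


waits = []
print(fetch_with_retries(flaky_fetch, "https://example.invalid/docs", sleep=waits.append))
print(waits)  # [1.0, 2.0]
```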

### Training Issues
- Ensure CUDA is properly installed
- Check that your GPU drivers are up to date
- Verify PyTorch CUDA compatibility

## Model Usage

After training, you can use the model for Odoo-related questions:

```python
from train_model import OdooModelTrainer

trainer = OdooModelTrainer()
trainer.load_model()

# Load your trained model
# trainer.model = ... (load from odoo_model_output)

response = trainer.generate_response("How do I install Odoo?")
print(response)
```

## Performance Notes

- **Training Time**: ~30-60 minutes for 100 steps on RTX3070
- **Memory Usage**: ~6-7GB VRAM during training
- **Data Size**: ~20-50MB of documentation data
- **Model Size**: ~4-5GB for the fine-tuned model
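
To get a rough idea of how long a longer run might take, you can scale the figure above linearly. This assumes the time per step stays constant, which is only an approximation; `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(steps, ref_steps=100, ref_minutes=(30, 60)):
    """Scale the reference timing (30-60 min per 100 steps) to another step count."""
    low, high = ref_minutes
    return steps * low / ref_steps, steps * high / ref_steps


low, high = estimate_minutes(1000)
print(f"1000 steps: about {low / 60:.0f}-{high / 60:.0f} hours")
```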

## Contributing

Feel free to submit issues and enhancement requests!

## License

This project is open source. Please check individual component licenses for details.

## Disclaimer

This project is for educational and research purposes. Ensure compliance with Odoo's terms of service when scraping documentation.