# Dataset Processing from GitHub Repositories
This guide explains how to build and process training datasets from GitHub repositories using the provided tools.
## Prerequisites
Make sure you have installed the required dependencies:
```bash
pip install -r requirements.txt
```
## Using the DatasetProcessor Class
The `DatasetProcessor` class in `src/dataset_processor.py` turns GitHub repositories into training datasets of raw code samples (see the Output Format section below).
### Example Usage
```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
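For a quick sanity check after processing, you can inspect a single sample. The field names below follow the Output Format section at the end of this guide; integer indexing is assumed to be supported by the returned dataset object:
```python
sample = dataset[0]
print(sample["language"], sample["file_path"], sample["line_count"])
print(sample["text"][:200])  # first 200 characters of the file content
```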
## Using the DatasetProcessorSynthetic Class
The `DatasetProcessorSynthetic` class in `src/dataset_processor_synthetic.py` turns GitHub repositories into question-answer (QA) training datasets in ChatML format, with the QA pairs generated by a local AI model served through Ollama.
### Example Usage
```python
from src.dataset_processor_synthetic import DatasetProcessorSynthetic
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessorSynthetic()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
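Internally, the synthetic processor relies on a local Ollama model to turn each source file into a QA conversation. The sketch below is illustrative only: it uses Ollama's public HTTP API, but the prompt wording, model name, and helper function are assumptions rather than the project's actual code (see `src/dataset_processor_synthetic.py` for that):
```python
import requests

def generate_qa_sample(code: str, model: str = "llama3") -> dict:
    """Hypothetical helper: ask a local Ollama model for a QA pair about a code file."""
    prompt = f"Write one question and its answer about the following code:\n\n{code}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["response"]
    # Package the exchange in ChatML-style messages (system, user, assistant)
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }
```
Running a sketch like this requires a local Ollama server (`ollama serve`) with the chosen model already pulled.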
## Saving and Loading Datasets
Both dataset processors support saving and loading datasets to/from disk to avoid reprocessing:
```python
# Save dataset
processor.save_dataset(dataset, "./my_processed_dataset")
# Load dataset
loaded_dataset = processor.load_dataset("./my_processed_dataset")
```
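Continuing from the examples above, a common pattern is to reuse a saved dataset when one exists and only reprocess otherwise. A minimal sketch, assuming the dataset is saved as a directory:
```python
import os

dataset_path = "./my_processed_dataset"

if os.path.isdir(dataset_path):
    # A saved dataset exists: load it instead of reprocessing
    dataset = processor.load_dataset(dataset_path)
else:
    # No saved dataset yet: process the repositories and save the result
    dataset = processor.process_github_repos(repo_urls=repo_urls, config=config)
    processor.save_dataset(dataset, dataset_path)
```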
The main script also supports saving and loading datasets via command-line arguments. The same invocation covers both cases: if a dataset already exists at `--dataset_path`, it is loaded; otherwise the repositories are processed and the result is saved there:
```bash
# First run: process the repositories and save the dataset to ./my_dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset

# Later runs: load the existing dataset from ./my_dataset and train
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset
```
## Using the Example Script
You can run the example script directly:
```bash
python example_dataset_processing.py
```
This will process the example repository and show information about the processed dataset.
## Using in Google Colab
The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:
1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)
## Supported File Types
The `DatasetProcessor` supports the following file types (a filtering sketch follows the list):
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
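For illustration, filtering by these extensions could look like the sketch below; the set and helper names are hypothetical, not the actual identifiers used in `src/dataset_processor.py`:
```python
from pathlib import Path

# Hypothetical extension set mirroring the list above
SUPPORTED_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".cpp", ".hpp", ".c", ".h", ".cs",
    ".php", ".rb", ".go", ".rs", ".swift", ".kt", ".scala", ".sql",
    ".sh", ".yaml", ".yml", ".json", ".xml", ".html", ".css", ".md",
}

def is_supported(path: str) -> bool:
    """Return True when a file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```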
## Configuration
The dataset processing can be configured through the `DatasetConfig` class:
```python
dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]      # Patterns to exclude
)
```
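Filled in with concrete values, a configuration might look like this; the list entries are illustrative, not project defaults:
```python
dataset_config = DatasetConfig(
    min_file_size=10,
    max_file_size=10000,
    supported_languages=["python", "javascript", "go"],       # illustrative
    exclude_patterns=["node_modules/", "*.min.js", ".git/"]   # illustrative
)
```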
## Output Format
The processed dataset contains the following fields for each sample (illustrative samples follow the field lists):
For the standard `DatasetProcessor`:
- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
For the `DatasetProcessorSynthetic`:
- `messages`: List of messages in ChatML format (system, user, assistant)
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
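For reference, one sample from each processor might look like the following; every value is illustrative:
```python
standard_sample = {
    "text": "def add(a, b):\n    return a + b\n",  # raw file content
    "language": "python",
    "file_path": "utils/math.py",  # illustrative path
    "repo_name": "nanoGPT",
    "file_size": 32,
    "line_count": 2,
}

synthetic_sample = {
    "messages": [  # ChatML-style conversation
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "What does the add function do?"},
        {"role": "assistant", "content": "It returns the sum of its two arguments."},
    ],
    "language": "python",
    "file_path": "utils/math.py",
    "repo_name": "nanoGPT",
    "file_size": 32,
    "line_count": 2,
}
```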