
Dataset Processing from GitHub Repositories

This guide explains how to get and process datasets from GitHub repositories using the provided tools.

Prerequisites

Make sure you have installed the required dependencies:

pip install -r requirements.txt

Using the DatasetProcessor Class

The DatasetProcessor class in src/dataset_processor.py processes GitHub repositories into a training dataset of raw code samples (see Output Format below for the fields each sample carries).

Example Usage

from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")

Using the DatasetProcessorSynthetic Class

The DatasetProcessorSynthetic class in src/dataset_processor_synthetic.py processes GitHub repositories into a question-answer training dataset in ChatML format, generating the QA pairs with a local AI model served by Ollama.

Example Usage

from src.dataset_processor_synthetic import DatasetProcessorSynthetic
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessorSynthetic()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")

Saving and Loading Datasets

Both dataset processors support saving and loading datasets to/from disk to avoid reprocessing:

# Save dataset
processor.save_dataset(dataset, "./my_processed_dataset")

# Load dataset
loaded_dataset = processor.load_dataset("./my_processed_dataset")
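A common pattern is to load a previously saved dataset when it exists and only reprocess otherwise. The sketch below combines the calls shown above with an ordinary path check; the check itself is not part of the processor API:

import os

dataset_path = "./my_processed_dataset"
if os.path.isdir(dataset_path):
    dataset = processor.load_dataset(dataset_path)
else:
    dataset = processor.process_github_repos(repo_urls=repo_urls, config=config)
    processor.save_dataset(dataset, dataset_path)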

The main script also supports saving/loading datasets via command-line arguments:

# First run: process the repositories and save the resulting dataset to ./my_dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset

# Later runs with the same --dataset_path: load the saved dataset and train without reprocessing
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset

Using the Example Script

You can run the example script directly:

python example_dataset_processing.py

This will process the example repository and show information about the processed dataset.

Using in Google Colab

The ai_trainer_t4_colab.ipynb notebook includes sections for processing GitHub repositories:

  1. Simple repository processing (Section 5)
  2. Advanced dataset processing (Section 5.1)

Supported File Types

The DatasetProcessor supports the following file types:

  • Python (.py)
  • JavaScript (.js)
  • TypeScript (.ts)
  • Java (.java)
  • C++ (.cpp, .hpp)
  • C (.c, .h)
  • C# (.cs)
  • PHP (.php)
  • Ruby (.rb)
  • Go (.go)
  • Rust (.rs)
  • Swift (.swift)
  • Kotlin (.kt)
  • Scala (.scala)
  • SQL (.sql)
  • Bash (.sh)
  • YAML (.yaml, .yml)
  • JSON (.json)
  • XML (.xml)
  • HTML (.html)
  • CSS (.css)
  • Markdown (.md)
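Files are selected by extension. The snippet below is an illustrative sketch of how such an extension-to-language filter can work; the real mapping lives inside DatasetProcessor and may differ in detail:

from pathlib import Path
from typing import Optional

# Illustrative subset of an extension-to-language map (hypothetical, not the project's actual table)
EXTENSION_LANGUAGES = {
    ".py": "python",
    ".js": "javascript",
    ".rs": "rust",
    ".md": "markdown",
}

def detect_language(file_path: str) -> Optional[str]:
    """Return the language for a supported file, or None if the file should be skipped."""
    return EXTENSION_LANGUAGES.get(Path(file_path).suffix.lower())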

Configuration

The dataset processing can be configured through the DatasetConfig class:

dataset_config = DatasetConfig(
    min_file_size=10,  # Minimum file size in characters
    max_file_size=10000,  # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]  # Patterns to exclude
)
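As a concrete example, a configuration that keeps small-to-medium Python and JavaScript files and skips common dependency folders might look like this (the values are hypothetical, not the project defaults):

dataset_config = DatasetConfig(
    min_file_size=50,  # skip trivially small files
    max_file_size=20000,  # skip very large files
    supported_languages=["python", "javascript"],
    exclude_patterns=["node_modules/", ".git/", "__pycache__/"]
)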

Output Format

The processed dataset contains the following fields for each sample:

For the standard DatasetProcessor:

  • text: The content of the code file
  • language: The programming language detected
  • file_path: Relative path to the file within the repository
  • repo_name: Name of the repository
  • file_size: Size of the file in characters
  • line_count: Number of lines in the file
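As a concrete illustration, a single record from the standard processor could look like the following (the values are made up for this example):

sample = {
    "text": "def add(a, b):\n    return a + b\n",
    "language": "python",
    "file_path": "src/math_utils.py",
    "repo_name": "example-repo",
    "file_size": 32,
    "line_count": 2,
}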

For the DatasetProcessorSynthetic:

  • messages: List of messages in ChatML format (system, user, assistant)
  • language: The programming language detected
  • file_path: Relative path to the file within the repository
  • repo_name: Name of the repository
  • file_size: Size of the file in characters
  • line_count: Number of lines in the file
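A synthetic sample instead carries the generated question-answer exchange as ChatML-style messages alongside the same file metadata. The record below is purely illustrative; the actual system prompt and generated content come from the Ollama model used during processing:

sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "What does the function in src/math_utils.py do?"},
        {"role": "assistant", "content": "It defines add(a, b), which returns the sum of its two arguments."},
    ],
    "language": "python",
    "file_path": "src/math_utils.py",
    "repo_name": "example-repo",
    "file_size": 32,
    "line_count": 2,
}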