
# Dataset Processing from GitHub Repositories

This guide explains how to fetch and process datasets from GitHub repositories using the provided tools.

## Prerequisites

Make sure you have installed the required dependencies:

```bash
pip install -r requirements.txt
```

## Using the DatasetProcessor Class

The `DatasetProcessor` class in `src/dataset_processor.py` provides comprehensive functionality for processing GitHub repositories into training datasets.

### Example Usage

```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```

## Using the Example Script

You can run the example script directly:

```bash
python example_dataset_processing.py
```

This will process the example repository and show information about the processed dataset.

## Using in Google Colab

The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:


1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)

## Supported File Types

The `DatasetProcessor` supports the following file types:

- Python (`.py`)
- JavaScript (`.js`)
- TypeScript (`.ts`)
- Java (`.java`)
- C++ (`.cpp`, `.hpp`)
- C (`.c`, `.h`)
- C# (`.cs`)
- PHP (`.php`)
- Ruby (`.rb`)
- Go (`.go`)
- Rust (`.rs`)
- Swift (`.swift`)
- Kotlin (`.kt`)
- Scala (`.scala`)
- SQL (`.sql`)
- Bash (`.sh`)
- YAML (`.yaml`, `.yml`)
- JSON (`.json`)
- XML (`.xml`)
- HTML (`.html`)
- CSS (`.css`)
- Markdown (`.md`)
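
Language detection by file extension can be pictured as a simple lookup table. The sketch below is illustrative only; the actual mapping (and the language names it produces) lives in `src/dataset_processor.py` and may differ:

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-language table covering the file types above;
# the real implementation in src/dataset_processor.py may use different names.
EXTENSION_TO_LANGUAGE = {
    ".py": "python", ".js": "javascript", ".ts": "typescript",
    ".java": "java", ".cpp": "cpp", ".hpp": "cpp", ".c": "c", ".h": "c",
    ".cs": "csharp", ".php": "php", ".rb": "ruby", ".go": "go",
    ".rs": "rust", ".swift": "swift", ".kt": "kotlin", ".scala": "scala",
    ".sql": "sql", ".sh": "bash", ".yaml": "yaml", ".yml": "yaml",
    ".json": "json", ".xml": "xml", ".html": "html", ".css": "css",
    ".md": "markdown",
}

def detect_language(path: str) -> Optional[str]:
    """Return the language for a file path, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
```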

## Configuration

Dataset processing can be configured through the `DatasetConfig` class:

```python
from src.config import DatasetConfig

dataset_config = DatasetConfig(
    min_file_size=10,  # Minimum file size in characters
    max_file_size=10000,  # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]  # Patterns to exclude
)
```
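
A customized `DatasetConfig` takes effect by passing it into `AppConfig`, mirroring the example at the top of this guide:

```python
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=dataset_config,  # the customized settings from above
    memory=MemoryConfig()
)
```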

## Output Format

The processed dataset contains the following fields for each sample:

- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
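
For illustration, a single sample might look like this (all values are hypothetical):

```python
{
    "text": "def main():\n    print('hello')\n",
    "language": "python",
    "file_path": "src/main.py",
    "repo_name": "nanoGPT",
    "file_size": 31,
    "line_count": 2,
}
```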