# Dataset Processing from GitHub Repositories

This guide explains how to fetch and process datasets from GitHub repositories using the provided tools.

## Prerequisites

Make sure you have installed the required dependencies:

```bash
pip install -r requirements.txt
```

## Using the DatasetProcessor Class

The `DatasetProcessor` class in `src/dataset_processor.py` provides comprehensive functionality for processing GitHub repositories into training datasets.

### Example Usage

```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```

## Using the DatasetProcessorSynthetic Class

The `DatasetProcessorSynthetic` class in `src/dataset_processor_synthetic.py` processes GitHub repositories into training datasets in QA ChatML format using a local AI model (Ollama).

### Example Usage

```python
from src.dataset_processor_synthetic import DatasetProcessorSynthetic
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessorSynthetic()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```

## Saving and Loading Datasets

Both dataset processors support saving and loading datasets to/from disk to avoid reprocessing (a combined process-or-load sketch appears after the Colab section below):

```python
# Save dataset
processor.save_dataset(dataset, "./my_processed_dataset")

# Load dataset
loaded_dataset = processor.load_dataset("./my_processed_dataset")
```

The main script also supports saving/loading datasets via command-line arguments:

```bash
# Process and save dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset

# Load and train with existing dataset
python src/main.py --repo1 https://github.com/repo1 --repo2 https://github.com/repo2 --dataset_path ./my_dataset
```

## Using the Example Script

You can run the example script directly:

```bash
python example_dataset_processing.py
```

This will process the example repository and show information about the processed dataset.

## Using in Google Colab

The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:

1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)
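## Combining Processing with Saving and Loading

Putting the sections above together, the sketch below shows one way to reuse a previously processed dataset instead of reprocessing the repositories on every run. It only uses the `DatasetProcessor`, `process_github_repos`, `save_dataset`, and `load_dataset` calls shown earlier; the `os.path.isdir` check and the `./my_processed_dataset` path are illustrative assumptions, not part of the library.

```python
import os

from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

processor = DatasetProcessor()
dataset_path = "./my_processed_dataset"  # illustrative path, adjust as needed

if os.path.isdir(dataset_path):
    # Reuse the dataset saved by a previous run
    dataset = processor.load_dataset(dataset_path)
else:
    # Process the repositories once, then cache the result on disk
    dataset = processor.process_github_repos(
        repo_urls=["https://github.com/karpathy/nanoGPT.git"],
        config=config,
        github_token=None,
    )
    processor.save_dataset(dataset, dataset_path)

print(f"Dataset ready with {len(dataset)} samples")
```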
## Supported File Types

The `DatasetProcessor` supports the following file types:

- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)

## Configuration

The dataset processing can be configured through the `DatasetConfig` class:

```python
dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]      # Patterns to exclude
)
```

## Output Format

The processed dataset contains the following fields for each sample (a short inspection sketch follows the lists below).

For the standard `DatasetProcessor`:

- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file

For the `DatasetProcessorSynthetic`:

- `messages`: List of messages in ChatML format (system, user, assistant)
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
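To sanity-check the output format, the hedged sketch below prints the fields of a single sample. It assumes `dataset` was produced by one of the examples above and that samples can be indexed as dictionaries keyed by the field names listed here; the `role`/`content` keys inside `messages` follow the usual ChatML convention and are an assumption about this project's exact structure.

```python
# Inspect one sample from the processed dataset (assumes dict-like samples)
sample = dataset[0]

# Fields shared by both processors
print(sample["language"], sample["file_path"], sample["repo_name"])
print(f'{sample["file_size"]} characters, {sample["line_count"]} lines')

# Standard DatasetProcessor: raw file content
if "text" in sample:
    print(sample["text"][:200])

# DatasetProcessorSynthetic: ChatML-style conversation
if "messages" in sample:
    for message in sample["messages"]:
        print(message["role"], "->", message["content"][:80])
```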