# Dataset Processing from GitHub Repositories
This guide explains how to fetch GitHub repositories and process them into training datasets using the provided tools.
## Prerequisites
Make sure you have installed the required dependencies:
```bash
pip install -r requirements.txt
```
## Using the DatasetProcessor Class
The `DatasetProcessor` class in `src/dataset_processor.py` provides comprehensive functionality for processing GitHub repositories into training datasets.
### Example Usage
```python
from src.dataset_processor import DatasetProcessor
from src.config import AppConfig, ModelConfig, TrainingConfig, DatasetConfig, MemoryConfig

# Initialize configuration
config = AppConfig(
    model=ModelConfig(),
    training=TrainingConfig(),
    dataset=DatasetConfig(),
    memory=MemoryConfig()
)

# Initialize dataset processor
processor = DatasetProcessor()

# Process GitHub repositories
repo_urls = [
    "https://github.com/karpathy/nanoGPT.git",
    # Add more repository URLs as needed
]

dataset = processor.process_github_repos(
    repo_urls=repo_urls,
    config=config,
    github_token=None  # Add your token for private repositories
)

print(f"Dataset processed successfully with {len(dataset)} samples")
```
## Using the Example Script
You can run the example script directly:
```bash
python example_dataset_processing.py
```
This will process the example repository and show information about the processed dataset.
## Using in Google Colab
The `ai_trainer_t4_colab.ipynb` notebook includes sections for processing GitHub repositories:
1. Simple repository processing (Section 5)
2. Advanced dataset processing (Section 5.1)
## Supported File Types
The `DatasetProcessor` supports the following file types:
- Python (.py)
- JavaScript (.js)
- TypeScript (.ts)
- Java (.java)
- C++ (.cpp, .hpp)
- C (.c, .h)
- C# (.cs)
- PHP (.php)
- Ruby (.rb)
- Go (.go)
- Rust (.rs)
- Swift (.swift)
- Kotlin (.kt)
- Scala (.scala)
- SQL (.sql)
- Bash (.sh)
- YAML (.yaml, .yml)
- JSON (.json)
- XML (.xml)
- HTML (.html)
- CSS (.css)
- Markdown (.md)
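Internally, language detection from the list above can be thought of as a simple extension lookup. The sketch below is an illustration only; the actual mapping and language names used by `DatasetProcessor` may differ:

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-language map mirroring the supported types above.
EXT_TO_LANGUAGE = {
    ".py": "python", ".js": "javascript", ".ts": "typescript",
    ".java": "java", ".cpp": "cpp", ".hpp": "cpp",
    ".c": "c", ".h": "c", ".cs": "csharp", ".php": "php",
    ".rb": "ruby", ".go": "go", ".rs": "rust", ".swift": "swift",
    ".kt": "kotlin", ".scala": "scala", ".sql": "sql", ".sh": "bash",
    ".yaml": "yaml", ".yml": "yaml", ".json": "json", ".xml": "xml",
    ".html": "html", ".css": "css", ".md": "markdown",
}

def detect_language(path: str) -> Optional[str]:
    """Return the language for a file path, or None if the type is unsupported."""
    return EXT_TO_LANGUAGE.get(Path(path).suffix.lower())
```

Files whose extension is not in the map are simply skipped during processing.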
## Configuration
The dataset processing can be configured through the `DatasetConfig` class:
```python
dataset_config = DatasetConfig(
    min_file_size=10,           # Minimum file size in characters
    max_file_size=10000,        # Maximum file size in characters
    supported_languages=[...],  # List of supported programming languages
    exclude_patterns=[...]      # Patterns to exclude
)
```
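The size limits and exclude patterns translate into straightforward filters. The helper functions below are a minimal sketch of that filtering logic, not the actual implementation; only the parameter names follow the config above:

```python
import fnmatch

def passes_size_filter(text, min_file_size=10, max_file_size=10000):
    """Keep files whose character count falls within the configured bounds."""
    return min_file_size <= len(text) <= max_file_size

def is_excluded(path, exclude_patterns):
    """True if the path matches any of the exclude glob patterns."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in exclude_patterns)
```

For example, an `exclude_patterns` list like `["node_modules/*", "*.min.js"]` would drop vendored dependencies and minified bundles before they reach the dataset.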
## Output Format
The processed dataset contains the following fields for each sample:
- `text`: The content of the code file
- `language`: The programming language detected
- `file_path`: Relative path to the file within the repository
- `repo_name`: Name of the repository
- `file_size`: Size of the file in characters
- `line_count`: Number of lines in the file
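Assuming the dataset behaves like a sequence of dicts with these fields, a single sample might look like the following (values are illustrative, not actual output):

```python
# Illustrative sample following the schema above (values are made up).
text = "def add(a, b):\n    return a + b\n"

sample = {
    "text": text,                      # raw file content
    "language": "python",              # detected language
    "file_path": "src/math_utils.py",  # path relative to the repo root
    "repo_name": "nanoGPT",
    "file_size": len(text),            # size in characters
    "line_count": text.count("\n"),    # number of lines
}
```

Note that `file_size` counts characters rather than bytes, consistent with the `min_file_size`/`max_file_size` settings in `DatasetConfig`.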