This project provides a complete solution for extracting and analyzing root-level files (e.g., `README`, `CONTRIBUTING`, `CODE_OF_CONDUCT`) from GitHub repositories. The tool is designed to:
- Extract relevant files from repositories and organize them by programming language.
- Analyze the extracted content using predefined prompts and AI-powered classification.
This tool is particularly useful for understanding various aspects of open-source projects, such as governance, community engagement, and documentation quality.
- **Repository Extraction:**
  - Automatically fetches specified root-level files from GitHub repositories.
  - Organizes extracted data by programming language.
  - Supports configurable file patterns (e.g., `README`, `CODE_OF_CONDUCT`).
- **Text Classification:**
  - Uses AI-powered prompts to analyze extracted files.
  - Identifies governance structures, user testing mentions, and non-coding contributions.
  - Generates structured JSON results for further analysis.
- **Modular Design:**
  - **Extractor:** Fetches and organizes repository data.
  - **Classifier:** Processes extracted files and applies structured prompts to analyze content.
  - **Main Runner:** Automates execution of both modules.
- **Logging and Validation:**
  - Logs all operations in dedicated log files (see the sketch below).
  - Includes validation features for sampling and debugging.
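For orientation, per-module logging of this kind can be wired up roughly as follows (a minimal sketch, assuming the `logs/` layout described later in this document; the helper name is illustrative, not the project's actual API):

```python
import logging
from pathlib import Path

def get_logger(name: str, log_file: str) -> logging.Logger:
    """Return a logger that writes to its own dedicated log file."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# e.g., extractor_log = get_logger("extractor", "logs/extractor.log")
```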
This section guides you through setting up the environment, installing dependencies, and executing the Diversity Card Analyzer pipeline, which includes both extraction and classification of GitHub repository files.
- **Create a Virtual Environment (Recommended):**

  ```bash
  python -m venv venv
  source venv/bin/activate   # On macOS/Linux
  venv\Scripts\activate      # On Windows
  ```
- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```
- **Modify Configuration Files:**
  - **Extractor Configuration:** Update `config/extractor.yaml` with your GitHub API token and extraction parameters.
  - **Classifier Configuration:** Ensure `config/classifier.yaml` contains valid API credentials and classification settings.
  - **Repositories List:** Add target repositories to `repositories.json`.
Once the setup is complete, execute the main script to run both the extractor and the classifier sequentially:

```bash
python main.py
```

This will:

- Extract files from GitHub repositories and store them in `data/root_files/`.
- Classify the extracted files using predefined prompts and save the output in `data/classification/`.
- `requirements.txt` → Lists all required dependencies.
- `config/extractor.yaml` → Configuration for repository extraction.
- `config/classifier.yaml` → Configuration for text classification.
- `repositories.json` → Defines repositories to be processed.
- `main.py` → Orchestrates the execution of the extractor and classifier (see the sketch below).
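As a rough illustration of what that orchestration might look like (the script paths follow the usage commands shown later in this document; the actual `main.py` may be structured differently):

```python
# Illustrative sketch of a sequential runner -- not the actual main.py.
import subprocess
import sys

def main() -> None:
    # Run the extractor first, then the classifier; stop if either fails.
    for script in ("src/extractor/repositories_extractor.py",
                   "src/classifier/classifier.py"):
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            sys.exit(f"{script} failed with exit code {result.returncode}")

if __name__ == "__main__":
    main()
```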
- Execution logs are stored in `logs/main.log`.
- Each module has its own log:
  - Extractor → `logs/extractor.log`
  - Classifier → `logs/classifier.log`
- Ensure all required API tokens are set in the configuration files.
- Check logs for any errors or missing configurations.
- Modify `repositories.json` to include or exclude repositories as needed.
The Extractor module automates the retrieval of specific root-level files from GitHub repositories. These files, such as `README`, `CONTRIBUTING`, and `CODE_OF_CONDUCT`, provide essential insights into the structure, guidelines, and governance of open-source projects. Extracted files are categorized by programming language and stored locally for further analysis.
- **Configuration:**
  - The extraction process is configured through `config/extractor.yaml`, which defines:
    - GitHub API authentication
    - Target file patterns (e.g., `README.md`, `CODE_OF_CONDUCT.md`)
    - Output directory structure
  - `repositories.json` contains the list of repositories to be processed, specifying the repository owner, name, and programming language.
- **Target File Matching:**
  - The extractor scans the root directory of each repository and identifies files that match the predefined patterns (see the sketch after this list).
  - Only matching files are downloaded, to avoid unnecessary processing.
- **Data Organization:**
  - Extracted files are stored in `data/root_files/<language>/`.
  - Each repository's files are combined into a single text file named `<owner>_<repo>.txt` to facilitate structured processing.
- **Logging:**
  - Every extraction run is logged in `logs/extractor.log`.
  - The log file contains details of processed repositories, extracted files, skipped files, and any errors encountered.
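Taken together, the configuration, matching, and combination steps above might look roughly like the following (a minimal sketch, not the actual implementation: the configuration keys `github_token`, `file_patterns`, and `output_dir` are assumptions, and only the GitHub REST contents endpoint is taken as given):

```python
import fnmatch
import json
from pathlib import Path

import requests
import yaml  # PyYAML

# Hypothetical config keys -- consult config/extractor.yaml for the real schema.
config = yaml.safe_load(Path("config/extractor.yaml").read_text())
token = config["github_token"]
patterns = config["file_patterns"]      # e.g., ["README*", "CODE_OF_CONDUCT*"]
out_root = Path(config["output_dir"])   # e.g., "data/root_files"

headers = {"Authorization": f"token {token}"}

for repo in json.loads(Path("repositories.json").read_text())["repos"]:
    owner, name, language = repo["owner"], repo["name"], repo["language"]

    # List the repository's root directory via the GitHub contents API.
    url = f"https://api.github.com/repos/{owner}/{name}/contents/"
    entries = requests.get(url, headers=headers).json()

    # Keep only root-level files whose names match a configured pattern.
    matched = [
        e for e in entries
        if e["type"] == "file"
        and any(fnmatch.fnmatch(e["name"].lower(), p.lower()) for p in patterns)
    ]

    # Combine the matched files into data/root_files/<language>/<owner>_<repo>.txt.
    out_dir = out_root / language
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / f"{owner}_{name}.txt").open("w", encoding="utf-8") as out:
        for e in matched:
            out.write(f"=== {e['name']} ===\n")
            out.write(requests.get(e["download_url"]).text + "\n\n")
```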
- `repositories_extractor.py` → Main script for extracting files from repositories.
- `config/extractor.yaml` → Configures extraction settings (API authentication, file patterns, output paths).
- `repositories.json` → Defines the list of repositories to be processed.
- **Prepare Configuration:**
  - Ensure `config/extractor.yaml` is correctly set up with API credentials and extraction parameters.
  - Define repositories in `repositories.json` with the following structure:

    ```json
    {
      "repos": [
        { "owner": "OWNER_NAME", "name": "REPO_NAME", "language": "LANGUAGE" },
        { "owner": "OWNER_NAME", "name": "REPO_NAME", "language": "LANGUAGE" }
      ]
    }
    ```
- **Run the Extractor:** Execute the extractor script to fetch and organize repository files:

  ```bash
  python src/extractor/repositories_extractor.py
  ```
- **Output:**
  - Extracted files are stored in `data/root_files/<language>/`.
  - A log file is available at `logs/extractor.log` for debugging and tracking the extraction process.
✅ Automates the retrieval of essential documentation across multiple repositories.
✅ Organizes extracted data efficiently for structured analysis.
✅ Provides detailed logs for traceability and debugging.
The Classifier module processes and analyzes extracted root files from GitHub repositories using AI-driven text classification. It applies structured prompts to assess various aspects of open-source project documentation, such as governance participation, non-coding contributions, and user testing. The classification results are stored in JSON format for easy interpretation and further analysis.
- **File Selection:**
  - The classifier automatically detects and processes files stored in `data/root_files/`.
  - Files are organized by programming language, and all extracted files are processed systematically.
- **Prompt-Based Analysis:**
  - The classification process is guided by predefined prompts stored in `config/prompts.yaml`.
  - Each prompt is designed to extract specific information, such as governance structures, diversity indicators, and user testing considerations.
- **AI Processing:**
  - The classifier interacts with an AI language model to analyze the content of each file (see the sketch after this list).
  - The model's responses are parsed into structured JSON containing categorized insights.
- **Output Organization:**
  - Classification results are stored in `data/classification/<language>/`.
  - Each processed file generates a corresponding JSON output named `<file_name>.json`.
- **Logging:**
  - Execution details, including processed files and errors, are logged in `logs/classifier.log`.
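As a rough sketch of this classification loop (assumptions throughout: an OpenAI-compatible client, a placeholder model name, and a `config/prompts.yaml` structured as a flat name-to-text mapping whose prompts instruct the model to answer with JSON only):

```python
import json
from pathlib import Path

import yaml  # PyYAML
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()  # API key read from the environment
prompts = yaml.safe_load(Path("config/prompts.yaml").read_text())  # hypothetical structure

for txt_file in Path("data/root_files").glob("*/*.txt"):
    language = txt_file.parent.name
    content = txt_file.read_text(encoding="utf-8")

    # Apply each structured prompt and parse the model's JSON reply.
    results = {}
    for prompt_name, prompt_text in prompts.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": prompt_text},
                {"role": "user", "content": content},
            ],
        )
        results[prompt_name] = json.loads(response.choices[0].message.content)

    # Mirror the per-language layout under data/classification/.
    out_dir = Path("data/classification") / language
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{txt_file.stem}.json").write_text(json.dumps(results, indent=2))
```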
- `classifier.py` → Main script for analyzing extracted files.
- `config/classifier.yaml` → Configures classification settings (API authentication, output paths, prompts).
- `prompts.py` → Contains predefined prompts for structured analysis.
- **Prepare Configuration:**
  - Ensure `config/classifier.yaml` is correctly set up with API credentials and classification parameters.
  - Verify that the `data/root_files/` directory contains extracted files organized by language.
- **Run the Classifier:** Execute the classifier script to analyze extracted files:

  ```bash
  python src/classifier/classifier.py
  ```
- **Output:**
  - Processed results are stored in `data/classification/<language>/`.
  - A log file is available at `logs/classifier.log` for debugging and tracking the classification process.
An example classification output:

```json
{
"development_team": {
"mention_to_dev_team": "yes",
"profile_aspects": {
"mentioned": "yes",
"aspects": ["geographic diversity"]
}
},
"non_coding_contributors": {
"mention_non_coding_contributors": "no",
"non_coding_roles": {
"explained": "no",
"roles": []
}
},
"tests_with_potential_users": {
"mention_tests_with_users": "yes",
"mention_labor_force": "no",
"mention_reporting_platforms": "no"
},
"deployment_context": {
"mention_specific_use_case": "no",
"mention_target_population": "no",
"mention_specific_adaptation": "no"
},
"governance_participants": {
"mention_governance_participants": "no",
"mention_funders": "no"
}
}
```
✅ Provides structured analysis of open-source project documentation.
✅ Uses AI-driven classification for deeper insights.
✅ Outputs JSON files for seamless integration with other tools or dashboards.
✅ Comprehensive logging ensures transparency and debugging support.