This project provides a complete solution for extracting and analyzing root-level files (e.g., `README`, `CONTRIBUTING`, `CODE_OF_CONDUCT`) from GitHub repositories. The tool is designed to:
- Extract relevant files from repositories and organize them by programming language.
- Analyze the extracted content using predefined prompts and AI-powered classification.
This tool is particularly useful for understanding various aspects of open-source projects, such as governance, community engagement, and documentation quality.
- **Repository Extraction:**
  - Automatically fetches specified root-level files from GitHub repositories.
  - Organizes extracted data by programming language.
  - Supports configurable file patterns (e.g., `README`, `CODE_OF_CONDUCT`).
- **Text Classification:**
  - Uses AI-powered prompts to analyze extracted files.
  - Identifies governance structures, user testing mentions, and non-coding contributions.
  - Generates structured JSON results for further analysis.
- **Modular Design:**
  - **Extractor:** Fetches and organizes repository data.
  - **Classifier:** Processes extracted files and applies structured prompts to analyze content.
  - **Main Runner:** Automates execution of both modules.
- **Logging and Validation:**
  - Logs all operations in dedicated log files (see the sketch below).
  - Includes validation features for sampling and debugging.
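For orientation, per-module logging of this kind can be wired up roughly as follows (a minimal sketch, assuming the `logs/` layout described later in this document; the helper name is illustrative, not the project's actual API):

```python
import logging
from pathlib import Path

def get_logger(name: str, log_file: str) -> logging.Logger:
    """Return a logger that writes to its own dedicated log file."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# e.g., extractor_log = get_logger("extractor", "logs/extractor.log")
```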
This section guides you through setting up the environment, installing dependencies, and executing the Diversity Card Analyzer pipeline, which includes both extraction and classification of GitHub repository files.
- **Create a Virtual Environment (Recommended):**

  ```bash
  python -m venv venv
  source venv/bin/activate   # On macOS/Linux
  venv\Scripts\activate      # On Windows
  ```
- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```
- **Modify Configuration Files:**
  - **Extractor Configuration:** Update `config/extractor.yaml` with your GitHub API token and extraction parameters.
  - **Classifier Configuration:** Ensure `config/classifier.yaml` contains valid API credentials and classification settings.
  - **Repositories List:** Add target repositories to `repositories.json`.
Once the setup is complete, execute the main script to run both the extractor and the classifier sequentially:

```bash
python main.py
```

This will:

- Extract files from GitHub repositories and store them in `data/root_files/`.
- Classify the extracted files using predefined prompts and save the output in `data/classification/`.
- `requirements.txt` → Lists all required dependencies.
- `config/extractor.yaml` → Configuration for repository extraction.
- `config/classifier.yaml` → Configuration for text classification.
- `repositories.json` → Defines repositories to be processed.
- `main.py` → Orchestrates the execution of the extractor and classifier (see the sketch below).
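As a rough illustration of what that orchestration might look like (the script paths follow the usage commands shown later in this document; the actual `main.py` may be structured differently):

```python
# Illustrative sketch of a sequential runner -- not the actual main.py.
import subprocess
import sys

def main() -> None:
    # Run the extractor first, then the classifier; stop if either fails.
    for script in ("src/extractor/repositories_extractor.py",
                   "src/classifier/classifier.py"):
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            sys.exit(f"{script} failed with exit code {result.returncode}")

if __name__ == "__main__":
    main()
```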
- Execution logs are stored in `logs/main.log`.
- Each module has its own log:
  - Extractor → `logs/extractor.log`
  - Classifier → `logs/classifier.log`
- Ensure all required API tokens are set in the configuration files.
- Check logs for any errors or missing configurations.
- Modify `repositories.json` to include or exclude repositories as needed.
The Extractor module automates the retrieval of specific root-level files from GitHub repositories. These files, such as `README`, `CONTRIBUTING`, and `CODE_OF_CONDUCT`, provide essential insights into the structure, guidelines, and governance of open-source projects. Extracted files are categorized by programming language and stored locally for further analysis.
- **Configuration:**
  - The extraction process is configured through `config/extractor.yaml`, which defines:
    - GitHub API authentication
    - Target file patterns (e.g., `README.md`, `CODE_OF_CONDUCT.md`)
    - Output directory structure
  - `repositories.json` contains the list of repositories to be processed, specifying the repository owner, name, and programming language.
- **Target File Matching:**
  - The extractor scans the root directory of each repository and identifies files that match the predefined patterns (see the sketch after this list).
  - Only matching files are downloaded, to avoid unnecessary processing.
- **Data Organization:**
  - Extracted files are stored in `data/root_files/<language>/`.
  - Each repository's files are combined into a single text file named `<owner>_<repo>.txt` to facilitate structured processing.
- **Logging:**
  - Every extraction run is logged in `logs/extractor.log`.
  - The log file contains details of processed repositories, extracted files, skipped files, and any errors encountered.
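Taken together, the configuration, matching, and combination steps above might look roughly like the following (a minimal sketch, not the actual implementation: the configuration keys `github_token`, `file_patterns`, and `output_dir` are assumptions, and only the GitHub REST contents endpoint is taken as given):

```python
import fnmatch
import json
from pathlib import Path

import requests
import yaml  # PyYAML

# Hypothetical config keys -- consult config/extractor.yaml for the real schema.
config = yaml.safe_load(Path("config/extractor.yaml").read_text())
token = config["github_token"]
patterns = config["file_patterns"]      # e.g., ["README*", "CODE_OF_CONDUCT*"]
out_root = Path(config["output_dir"])   # e.g., "data/root_files"

headers = {"Authorization": f"token {token}"}

for repo in json.loads(Path("repositories.json").read_text())["repos"]:
    owner, name, language = repo["owner"], repo["name"], repo["language"]

    # List the repository's root directory via the GitHub contents API.
    url = f"https://api.github.com/repos/{owner}/{name}/contents/"
    entries = requests.get(url, headers=headers).json()

    # Keep only root-level files whose names match a configured pattern.
    matched = [
        e for e in entries
        if e["type"] == "file"
        and any(fnmatch.fnmatch(e["name"].lower(), p.lower()) for p in patterns)
    ]

    # Combine the matched files into data/root_files/<language>/<owner>_<repo>.txt.
    out_dir = out_root / language
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / f"{owner}_{name}.txt").open("w", encoding="utf-8") as out:
        for e in matched:
            out.write(f"=== {e['name']} ===\n")
            out.write(requests.get(e["download_url"]).text + "\n\n")
```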
- `repositories_extractor.py` → Main script for extracting files from repositories.
- `config/extractor.yaml` → Configures extraction settings (API authentication, file patterns, output paths).
- `repositories.json` → Defines the list of repositories to be processed.
- **Prepare Configuration:**
  - Ensure `config/extractor.yaml` is correctly set up with API credentials and extraction parameters.
  - Define repositories in `repositories.json` with the following structure:

    ```json
    {
      "repos": [
        { "owner": "OWNER_NAME", "name": "REPO_NAME", "language": "LANGUAGE" },
        { "owner": "OWNER_NAME", "name": "REPO_NAME", "language": "LANGUAGE" }
      ]
    }
    ```
- **Run the Extractor:** Execute the extractor script to fetch and organize repository files:

  ```bash
  python src/extractor/repositories_extractor.py
  ```
- **Output:**
  - Extracted files are stored in `data/root_files/<language>/`.
  - A log file is available at `logs/extractor.log` for debugging and tracking the extraction process.
✅ Automates the retrieval of essential documentation across multiple repositories.
✅ Organizes extracted data efficiently for structured analysis.
✅ Provides detailed logs for traceability and debugging.
The Classifier module processes and analyzes extracted root files from GitHub repositories using AI-driven text classification. It applies structured prompts to assess various aspects of open-source project documentation, such as governance participation, non-coding contributions, and user testing. The classification results are stored in JSON format for easy interpretation and further analysis.
- **File Selection:**
  - The classifier automatically detects and processes files stored in `data/root_files/`.
  - Files are organized by programming language, and all extracted files are processed systematically.
- **Prompt-Based Analysis:**
  - The classification process is guided by predefined prompts stored in `config/prompts.yaml`.
  - Each prompt is designed to extract specific information, such as governance structures, diversity indicators, and user testing considerations.
- **AI Processing:**
  - The classifier interacts with an AI language model to analyze the content of each file (see the sketch after this list).
  - The model's responses are parsed into structured JSON containing categorized insights.
- **Output Organization:**
  - Classification results are stored in `data/classification/<language>/`.
  - Each processed file generates a corresponding JSON output named `<file_name>.json`.
- **Logging:**
  - Execution details, including processed files and errors, are logged in `logs/classifier.log`.
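As a rough sketch of this classification loop (assumptions throughout: an OpenAI-compatible client, a placeholder model name, and a `config/prompts.yaml` structured as a flat name-to-text mapping whose prompts instruct the model to answer with JSON only):

```python
import json
from pathlib import Path

import yaml  # PyYAML
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()  # API key read from the environment
prompts = yaml.safe_load(Path("config/prompts.yaml").read_text())  # hypothetical structure

for txt_file in Path("data/root_files").glob("*/*.txt"):
    language = txt_file.parent.name
    content = txt_file.read_text(encoding="utf-8")

    # Apply each structured prompt and parse the model's JSON reply.
    results = {}
    for prompt_name, prompt_text in prompts.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": prompt_text},
                {"role": "user", "content": content},
            ],
        )
        results[prompt_name] = json.loads(response.choices[0].message.content)

    # Mirror the per-language layout under data/classification/.
    out_dir = Path("data/classification") / language
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{txt_file.stem}.json").write_text(json.dumps(results, indent=2))
```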
- `classifier.py` → Main script for analyzing extracted files.
- `config/classifier.yaml` → Configures classification settings (API authentication, output paths, prompts).
- `prompts.py` → Contains predefined prompts for structured analysis.
- **Prepare Configuration:**
  - Ensure `config/classifier.yaml` is correctly set up with API credentials and classification parameters.
  - Verify that the `data/root_files/` directory contains extracted files organized by language.
- **Run the Classifier:** Execute the classifier script to analyze extracted files:

  ```bash
  python src/classifier/classifier.py
  ```
- **Output:**
  - Processed results are stored in `data/classification/<language>/`.
  - A log file is available at `logs/classifier.log` for debugging and tracking the classification process.
An example classification output:

```json
{
"development_team": {
"mention_to_dev_team": "yes",
"profile_aspects": {
"mentioned": "yes",
"aspects": ["geographic diversity"]
}
},
"non_coding_contributors": {
"mention_non_coding_contributors": "no",
"non_coding_roles": {
"explained": "no",
"roles": []
}
},
"tests_with_potential_users": {
"mention_tests_with_users": "yes",
"mention_labor_force": "no",
"mention_reporting_platforms": "no"
},
"deployment_context": {
"mention_specific_use_case": "no",
"mention_target_population": "no",
"mention_specific_adaptation": "no"
},
"governance_participants": {
"mention_governance_participants": "no",
"mention_funders": "no"
}
}
```
✅ Provides structured analysis of open-source project documentation.
✅ Uses AI-driven classification for deeper insights.
✅ Outputs JSON files for seamless integration with other tools or dashboards.
✅ Comprehensive logging ensures transparency and debugging support.