This application periodically fetches data from a file in a public GitHub repository that contains patterns for Personally Identifiable Information (PII), using a GitHub access token. It then synchronizes the data with a MongoDB collection.
- It fetches data periodically from the GitHub file specified by the link.
- It stores the fetched data in a MongoDB collection.
- It handles additions, updates, and deletes of entries in the GitHub file, reflecting the changes in the MongoDB collection on the next run.
- The code is written in Python.
- It handles edge cases: it catches GitHub API errors, tracks the last synchronized commit, compares commits to determine whether there are new changes, and logs errors for debugging.
Note: it only collects the useful information (entries for which sensitive is marked as true in the file).
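The filtering described in the note could look roughly like the sketch below. It assumes the file parses into a list of dictionaries with a boolean "sensitive" key; the sample entries and the "name" and "pattern" fields are hypothetical and may not match the real file.

```python
from __future__ import annotations

from typing import Any


def filter_sensitive(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only entries whose "sensitive" field is exactly True."""
    return [entry for entry in entries if entry.get("sensitive") is True]


# Hypothetical sample data; the real file's field names may differ.
sample = [
    {"name": "email", "pattern": r"[^@\s]+@[^@\s]+", "sensitive": True},
    {"name": "color", "pattern": r"#[0-9a-f]{6}", "sensitive": False},
    {"name": "note", "pattern": r".*"},  # no "sensitive" key, so it is dropped
]

print([entry["name"] for entry in filter_sensitive(sample)])  # prints ['email']
```

Comparing with `is True` drops entries where the flag is missing or stored as a string; whether that strictness matches main.py is an assumption.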
- Given a GitHub link to a file that contains patterns for PII data, a cron job periodically fetches the data from this file and stores it in the MongoDB collection identified by the mongodb_uri in main.py.
- If a new entry is added to the file, it is reflected in MongoDB on the next run. The same goes for updates and deletes.
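One way to work out which additions, updates, and deletes a run has to apply is to compare the stored entries with the freshly fetched ones. The sketch below is not taken from main.py: the function name plan_sync and the idea of keying each entry by a unique string are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any

Entries = dict[str, dict[str, Any]]


def plan_sync(stored: Entries, fetched: Entries) -> tuple[Entries, Entries, list[str]]:
    """Compare stored and fetched entries, both keyed by a unique entry key.

    Returns (to_insert, to_update, to_delete).
    """
    # Present in the file but not yet in the collection.
    to_insert = {key: doc for key, doc in fetched.items() if key not in stored}
    # Present in both, but the content differs.
    to_update = {
        key: doc
        for key, doc in fetched.items()
        if key in stored and stored[key] != doc
    }
    # Present in the collection but removed from the file.
    to_delete = sorted(key for key in stored if key not in fetched)
    return to_insert, to_update, to_delete


stored = {"email": {"pattern": "a"}, "phone": {"pattern": "b"}}
fetched = {"email": {"pattern": "a2"}, "ssn": {"pattern": "c"}}
print(plan_sync(stored, fetched))
```

With pymongo, the three results could then be applied with calls such as insert_many, replace_one, and delete_many. Documents read back from MongoDB carry an _id field, so it would have to be stripped before a comparison like this one.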
- Python 3.x
- pip (Python package installer)
- GitHub access token
- MongoDB URI
- Clone the repository:
$ git clone https://github.com/mayankesh239/PII_Tracker.git
- Navigate to the project directory:
$ cd PII_Tracker
- Install the required Python packages:
$ pip install -r requirements.txt
- Generate a GitHub access token:
- Go to https://github.com/settings/tokens.
- Click on "Generate new token".
- Give the token a suitable description and select the necessary scopes (e.g., repo access).
- Click on "Generate token" and copy the generated access token.
- Set the GitHub access token as an environment variable:
- Open the terminal and execute the following command, replacing "your-access-token" with the GitHub access token you generated:
$ export GITHUB_ACCESS_TOKEN="your-access-token"
- Set the MongoDB URI:
- Open the main.py file in a text editor.
- Replace the value of the mongodb_uri variable (at line 14) with your MongoDB connection URI. You can refer to Create Cluster Using MongoDB Atlas to create a cluster in MongoDB Atlas.
- Configure the application:
- Open the main.py file.
- Update the following variables in the code:
- repository_url: Set it to the GitHub repository URL containing the PII data file.
- file_path: Set it to the file path of the PII data file within the repository.
- mongodb_uri: Set it to the connection URI for your MongoDB database.
- database_name: Set it to the name of the MongoDB database.
- collection_name: Set it to the name of the MongoDB collection.
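To show how these settings might fit together, the sketch below reads the token from the GITHUB_ACCESS_TOKEN environment variable, splits a repository URL into owner and repository name, and builds a request for the most recent commit that touched the file. The helper names, the example URL, and the example file path are hypothetical; main.py may be organized differently.

```python
from __future__ import annotations

import os
import urllib.request
from urllib.parse import urlencode, urlparse


def parse_repository_url(repository_url: str) -> tuple[str, str]:
    """Split https://github.com/<owner>/<repo>[.git] into (owner, repo)."""
    parts = urlparse(repository_url).path.strip("/").split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a repository URL: {repository_url!r}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


def build_latest_commit_request(
    repository_url: str, file_path: str, token: str | None
) -> urllib.request.Request:
    """Build (but do not send) a request for the latest commit touching file_path."""
    owner, repo = parse_repository_url(repository_url)
    query = urlencode({"path": file_path, "per_page": 1})
    url = f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Hypothetical values; replace them with your own repository and file path.
token = os.environ.get("GITHUB_ACCESS_TOKEN")
request = build_latest_commit_request(
    "https://github.com/example-owner/example-repo", "data/pii.json", token
)
print(request.full_url)
```

Sending the request with urllib.request.urlopen and comparing the returned commit SHA with the last synchronized one would tell a run whether anything has changed.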
To run the application and perform data synchronization, execute the following command in the project directory:
$ python3 main.py
The application fetches data from the GitHub repository file, filters the sensitive information based on the "sensitive" attribute, and updates the MongoDB collection with the filtered data. It logs the execution status and any errors encountered in the pii_sync.log file.
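The logging behaviour described above could be set up along the following lines. This is a sketch using the standard logging module, not the configuration from main.py: the logger name, the message format, and the helper names build_logger and run_once are assumptions.

```python
import logging
from typing import Callable


def build_logger(log_path: str = "pii_sync.log") -> logging.Logger:
    """Return a logger that appends timestamped records to log_path."""
    logger = logging.getLogger("pii_sync")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_once(sync: Callable[[], None], logger: logging.Logger) -> bool:
    """Run one synchronization and log its outcome; return True on success."""
    try:
        sync()
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("PII sync failed")
        return False
    logger.info("PII sync completed")
    return True
```

Calling run_once(sync_function, build_logger()) would append either a completion line or a failure with its traceback to pii_sync.log in the current working directory.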
To set up a cron job for periodic execution, you can use the crontab command on Linux systems:
- Open the terminal and execute the following command:
$ crontab -e
If prompted to select an editor, choose your preferred editor (e.g., nano, vim).
- Add the following line to the crontab file to schedule the job at 10:32 PM every day:
32 22 * * * /usr/bin/python3 /path/to/your/pii_tracker/main.py
Replace /path/to/your/pii_tracker with the actual path to the project directory. Save the crontab file and exit the editor.
- Install Postfix (cron uses a local mail transfer agent to deliver job output) by executing the following command:
sudo apt install postfix
During the installation, you will be prompted to choose the general type of configuration. Select "Internet Site" and press Enter. Then, enter your fully qualified domain name (FQDN) when prompted. If you don't have a registered domain name, you can use the hostname of your server as the FQDN. To find out the hostname, you can run the following command in your terminal:
hostname
The cron job will now run at the specified time and execute the PII synchronization process. You can check the execution and any potential error messages in the log file specified in your script's logging configuration (pii_sync.log in this case).
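To read the schedule in the crontab entry, it helps to remember that the first five fields are minute, hour, day of month, month, and day of week, followed by the command. The small sketch below splits an entry into those parts; the helper name split_cron_line is hypothetical and it does not validate the field values.

```python
from __future__ import annotations

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_line(line: str) -> dict[str, str]:
    """Split a crontab entry into its five schedule fields and the command."""
    # At most five splits, so the command keeps its internal spaces.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    fields = dict(zip(CRON_FIELDS, parts[:5]))
    fields["command"] = parts[5]
    return fields


entry = split_cron_line(
    "32 22 * * * /usr/bin/python3 /path/to/your/pii_tracker/main.py"
)
print(entry["hour"], entry["minute"])  # prints: 22 32
```

Here hour 22 and minute 32 correspond to 10:32 PM, and the three asterisks mean every day of the month, every month, and every day of the week.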
You can check the scheduled cron jobs by running the following command in the terminal:
crontab -l
If the cron job is not working, you can refer to this doc, which lists some of the ways to fix the issues.
Note: this works on Linux. On Windows, you can use Task Scheduler instead (refer to this).