An accidental "Disallow: /" can happen to anyone, but it doesn't have to go unnoticed. Whether you're a webmaster, developer, or SEO, this tool can help you quickly discover unwelcome robots.txt changes.
- Easily check results. Robots.txt check results are printed to the screen, logged in files, and optionally emailed (refer to the setup instructions).
- Snapshots. The robots.txt content is saved following the first check and whenever the file changes.
- Diffs. A diff file is created after every change to help you view the difference at a glance.
- Comprehensive logging. Check results and any errors are recorded in a main log and website log (where relevant), so you can refer back to historic data and investigate if anything goes wrong.
- Designed for reliability. Errors such as a mistyped URL or connection issue are handled gracefully and won't break other robots.txt checks. Unexpected issues are caught and logged.
- A /data/ directory is created/accessed to store robots.txt file data and logs.
- All monitored robots.txt files are downloaded and compared against the previous version.
- The check result is logged/printed and categorised as either "first run", "no change", "change", or "error".
- Timestamped snapshots ("first run" and "change") and diffs ("change") are saved in the relevant site directory.
- If enabled, site-specific email alerts are sent out ("first run", "change", and "error").
- If enabled, an administrative email alert is sent out detailing overall check results and any unexpected errors.
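For illustration, a single site's check cycle could look something like the sketch below. The function, paths, and file layout here are hypothetical simplifications, not the tool's actual internals:

```python
# Hypothetical sketch of one check cycle; names and paths are illustrative only.
import difflib
from pathlib import Path

import requests

def check_site(url, name, data_dir="data"):
    """Fetch a site's robots.txt and compare it with the stored snapshot."""
    snapshot = Path(data_dir) / name / "robots.txt"
    snapshot.parent.mkdir(parents=True, exist_ok=True)
    try:
        response = requests.get(url + "robots.txt", timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return "error"  # one failed check shouldn't break the other sites' checks
    current = response.text
    if not snapshot.exists():
        snapshot.write_text(current)
        return "first run"
    previous = snapshot.read_text()
    if current == previous:
        return "no change"
    # On a change, save a diff and overwrite the snapshot with the new content.
    diff = "\n".join(difflib.unified_diff(
        previous.splitlines(), current.splitlines(), lineterm=""))
    (snapshot.parent / "diff.txt").write_text(diff)
    snapshot.write_text(current)
    return "change"
```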
- Clone a local copy of this repository to your computer and/or a server, depending on your chosen setup (detailed below).
- Create a new `config.py` file in the /app/ subdirectory. Copy the contents of `example_config.py` into `config.py`. Edit the `PATH` variable to reference the location of the project root directory (see the sketch after this list).
- Install all requirements (specific versions of Python and packages) using the Pipfile. For more information, please read the Pipenv documentation.
- Create a "monitored_sites.csv" file in the top-level directory (which contains the README file). Populate the file with a header row (column labels) and details of the website(s) you want to monitor, as defined below:
- URL (column 1): the absolute URL of the website homepage, with a trailing slash.
- Name (column 2): the website's name identifier (letters/numbers only).
- Email (column 3): the email address of the website monitor, i.e. the person who will receive email notifications of errors/changes (if enabled). Note: an email header row label is required but email addresses don't need to be populated unless emails are enabled (detailed below).
URL | Name | Email |
---|---|---|
https://github.com/ | Github | person@example.com |
https://www.example.com/ | Example | another.person@example.com |
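In the CSV file itself, those two example rows would look something like this (assuming plain comma-separated values with the header labels described above):

```csv
URL,Name,Email
https://github.com/,Github,person@example.com
https://www.example.com/,Example,another.person@example.com
```

And for the `config.py` step above, a freshly created file might contain little more than the `PATH` variable (the path shown is a placeholder):

```python
# config.py — minimal sketch; replace the placeholder with your project root.
PATH = "/home/user/robots-txt-monitor/"
```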
The quickest setup, suitable if you plan to run the tool on your computer for sites that you're personally monitoring (or are otherwise willing to notify other people about changes/errors manually).
- Run `main.py` (in the /app/ subdirectory). The results should be printed to the console and the data/logs should be saved in a newly created /data/ directory.
- Run `main.py` again whenever you want to re-check the robots.txt files. It's recommended that you check the print output or main log regularly, or at least after new sites are added, in case there are any unexpected errors.
More setup, suitable if you plan to run the tool on your computer for yourself and others.
- Edit `config.py` (in the /app/ subdirectory), as sketched after this list:
  - Set `ADMIN_EMAIL` to an email address that will receive the summary report.
  - Set `SENDER_EMAIL` to an address that will send all emails (unless this isn't required by your email sending functionality, e.g. if using a third-party email service).
  - Set `EMAILS_ENABLED` to `True`.
- Implement email sending functionality in `emails.py` (in the /app/ subdirectory) within the `send_emails()` function stub. The stub receives all pre-created email content (as defined in the function docstring) and includes generic exception (uncaught error) handling, but it's up to you to implement the sending of emails using your chosen package/library/email service (see the sketch after this list).
- Run `main.py`. The results should be printed to the console and the data/logs should be saved in a newly created /data/ directory. Check that emails are being sent/received as intended.
- Run `main.py` again whenever you want to re-check the robots.txt files. It's recommended that you check the print output or main log regularly, or at least after new sites are added, in case there are any unexpected errors.
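Once edited, the three settings in `config.py` might look like this (both addresses are placeholders):

```python
# config.py — email settings sketch; replace the placeholder addresses.
ADMIN_EMAIL = "admin@example.com"    # receives the summary report
SENDER_EMAIL = "alerts@example.com"  # "From" address for outgoing mail
EMAILS_ENABLED = True
```

And here is one possible `send_emails()` implementation using `smtplib` from the standard library. The real stub's parameters are defined in its docstring; the email structure assumed below (recipient/subject/body triples), the SMTP host, and the credentials are all placeholders you'd replace:

```python
# emails.py — a sketch only; adapt the signature to the real stub's docstring.
import smtplib
from email.message import EmailMessage

from config import SENDER_EMAIL  # assuming config.py sits alongside emails.py

SMTP_HOST = "smtp.example.com"  # placeholder SMTP server
SMTP_PORT = 587

def send_emails(emails):
    """Assumes `emails` is an iterable of (recipient, subject, body) tuples."""
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.starttls()
        server.login("username", "password")  # placeholder credentials
        for recipient, subject, body in emails:
            message = EmailMessage()
            message["From"] = SENDER_EMAIL
            message["To"] = recipient
            message["Subject"] = subject
            message.set_content(body)
            server.send_message(message)
```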
The most setup for a fully automated experience.
Refer to "Local with emails enabled", with the following considerations:
- You may need to refer to your server's documentation for details of setting up a cron job to regularly run `main.py` (a sketch follows this list).
- You may need to edit the shebang line at the top of `main.py`.
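For example, a crontab entry along these lines would run the checker daily at 07:00 (the paths and interpreter are placeholders; adjust them for your server and Pipenv setup):

```
# Placeholder crontab entry: check all monitored robots.txt files daily at 07:00.
0 7 * * * cd /path/to/project/app && /usr/bin/python3 main.py
```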
In general, errors will be reported/logged if anything is clearly breaking. However, it's a good idea to review the saved website data (in the /data/ directory) after the first and second runs, to check that everything looks correct.
Feel free to create an issue or open a new discussion.
If the tool repeatedly reports an error or inaccurate robots.txt content for a site, but you're able to view the file via your browser, this is likely due to an invalid URL format or the tool being blocked in some way.
Try the following:
- Check that the monitored URL is in the correct format.
- Request that your IP address (or your server's IP address) be whitelisted.
- Try adjusting the `USER_AGENT` string (e.g. to spoof a common browser, as sketched below) in `config.py`.
- Troubleshoot with your IT/development teams.
There are no major updates/features planned and I'm not looking for any substantial contributions (but you're welcome to fork your own version).
However, I plan to maintain the project by periodically upgrading to newer (actively supported) versions of Python/Requests and fixing any significant bugs.