An easy-to-use, extensible robots.txt parser library with full support for every known directive and specification.
- Permission checks
- Fetch crawler rules
- Sitemap discovery
- Host preference
- Dynamic URL parameter discovery
- robots.txt rendering
Advantages (compared to most other robots.txt libraries):
- Automatic robots.txt download. (optional)
- Integrated caching system. (optional)
- Crawl Delay handler.
- Documentation available.
- Support for literally every single directive, from every specification.
- HTTP Status code handler, according to Google's spec.
- Dedicated User-Agent parser and group determiner library, for maximum accuracy.
- Provides additional data like preferred host, dynamic URL parameters, Sitemap locations, etc.
- Protocols supported: HTTP, HTTPS, FTP, SFTP and FTP/S.
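The status-code handling mentioned in the list above can be pictured with a small classifier. This is a minimal sketch, assuming Google's published guidance (2xx: use the file; 3xx: follow the redirect; 4xx other than 429: no restrictions; 5xx and 429: treat the site as fully disallowed for now). The function name and the returned labels are hypothetical and are not the library's API; the library's actual handling may differ in detail.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): maps the HTTP status code
 * of a robots.txt response to a coarse crawling policy.
 */
function robotsTxtPolicyForStatus(int $statusCode): string
{
    // Server errors, and 429 (Too Many Requests): assume everything is disallowed for now.
    if ($statusCode === 429 || ($statusCode >= 500 && $statusCode <= 599)) {
        return 'disallow-all';
    }
    // Other client errors: behave as if no robots.txt exists, so nothing is restricted.
    if ($statusCode >= 400 && $statusCode <= 499) {
        return 'allow-all';
    }
    // Redirects: the fetcher should follow them and classify the final response.
    if ($statusCode >= 300 && $statusCode <= 399) {
        return 'follow-redirect';
    }
    // Success: the body contains the rules to parse.
    if ($statusCode >= 200 && $statusCode <= 299) {
        return 'parse-rules';
    }
    // Codes outside 200-599 are not covered by this sketch.
    throw new InvalidArgumentException("Unsupported status code: {$statusCode}");
}
```

When the robots.txt content has been fetched separately, the status code is passed to `TxtClient` as its second constructor argument, so a caller would not normally need a helper like this; it only illustrates the kind of decision being made.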
The recommended way to install the robots.txt parser is through Composer. Add this to your composer.json file:
{
    "require": {
        "vipnytt/robotstxtparser": "^2.0"
    }
}
Then run: php composer update
<?php
$client = new vipnytt\RobotsTxtParser\UriClient('http://example.com');
if ($client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html')) {
    // Access is granted
}
if ($client->userAgent('MyBot')->isDisallowed('http://example.com/admin')) {
    // Access is denied
}
<?php
// Syntax: $baseUri, [$statusCode:int|null], [$robotsTxtContent:string], [$encoding:string], [$byteLimit:int|null]
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);
// Permission checks
$allowed = $client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html'); // bool
$denied = $client->userAgent('MyBot')->isDisallowed('http://example.com/admin'); // bool
// Crawl delay rules
$crawlDelay = $client->userAgent('MyBot')->crawlDelay()->getValue(); // float | int
// Dynamic URL parameters
$cleanParam = $client->cleanParam()->export(); // array
// Preferred host
$host = $client->host()->export(); // string | null
$host = $client->host()->getWithUriFallback(); // string
$isPreferred = $client->host()->isPreferred(); // bool
// XML Sitemap locations
$sitemaps = $client->sitemap()->export(); // array
The above is just a taste of the basics; many more advanced and specialized methods are available for almost any purpose. Visit the cheat-sheet for the technical details.
Visit the Documentation for more information.
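As a rough illustration of how the permission and crawl-delay calls from the excerpt above might be combined, the sketch below filters a list of URLs and pairs each allowed one with the delay to wait before fetching it. `planCrawl` is a hypothetical helper, not part of the library. It relies only on the `userAgent()`, `isAllowed()` and `crawlDelay()->getValue()` calls shown in the excerpt, and takes the client untyped so that any object exposing those calls will work.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): builds a list of URLs that
 * the given user agent may fetch, each with the crawl delay in seconds.
 *
 * @param object   $client    Client exposing userAgent(), as in the excerpt above
 * @param string   $userAgent User agent name, e.g. 'MyBot'
 * @param string[] $urls      Candidate URLs on the client's host
 * @return array<int, array{url: string, waitSeconds: float}>
 */
function planCrawl(object $client, string $userAgent, array $urls): array
{
    // The excerpt documents getValue() as float | int, so normalise to float.
    $delay = (float) $client->userAgent($userAgent)->crawlDelay()->getValue();

    $plan = [];
    foreach ($urls as $url) {
        // Keep only URLs the rules allow for this user agent.
        if ($client->userAgent($userAgent)->isAllowed($url)) {
            $plan[] = ['url' => $url, 'waitSeconds' => $delay];
        }
    }

    return $plan;
}
```

With `$client` created as in the examples above, `planCrawl($client, 'MyBot', $urls)` returns the entries to fetch; a caller could then wait with `usleep((int) round($entry['waitSeconds'] * 1000000))` before each request. The library's own crawl delay handler may cover this more completely, so treat this as an illustration only.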
- Google robots.txt specifications
- Yandex robots.txt specifications
- W3C Recommendation HTML 4.01 specification
- Sitemaps.org protocol
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
- RFC 7231, 2616
- RFC 7230, 2616
- RFC 5322, 2822, 822
- RFC 3986, 1808
- RFC 1945
- RFC 1738
- RFC 952