Robots.txt parser

An easy-to-use, extensible robots.txt parser library with full support for every directive and specification on the Internet.

Use cases:

Permission checks
Fetch crawler rules
Sitemap discovery
Host preference
Dynamic URL parameter discovery
robots.txt rendering

Advantages

(compared to most other robots.txt libraries)

Automatic robots.txt download (optional).
Integrated caching system (optional).
Crawl Delay handler.
Documentation available.
Support for literally every single directive, from every specification.
HTTP Status code handler, according to Google's spec.
Dedicated User-Agent parser and group determiner library, for maximum accuracy.
Provides additional data like preferred host, dynamic URL parameters, Sitemap locations, etc.
Protocols supported: HTTP, HTTPS, FTP, SFTP and FTP/S.
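
To illustrate what a status code handler has to decide, here is a minimal standalone sketch of the rules in Google's robots.txt specification as commonly summarized. It is not the library's implementation: the function name `robotsPolicyForStatus` and the returned labels are hypothetical, and the fallback for codes outside 2xx-5xx is a cautious assumption.

```php
<?php
/**
 * Hypothetical helper, not part of the library: maps the HTTP status code
 * of a robots.txt response to a coarse crawl policy label.
 */
function robotsPolicyForStatus(int $statusCode): string
{
    // Server errors (and 429 Too Many Requests) give no definite answer,
    // so the site is treated as fully disallowed for the time being.
    if ($statusCode === 429 || ($statusCode >= 500 && $statusCode <= 599)) {
        return 'disallow-all';
    }
    // Other client errors are treated as "no robots.txt exists": no restrictions.
    if ($statusCode >= 400 && $statusCode <= 499) {
        return 'allow-all';
    }
    // Success: the body is parsed and its rules are applied.
    if ($statusCode >= 200 && $statusCode <= 299) {
        return 'parse';
    }
    // Redirects are followed; the final response is what gets evaluated.
    if ($statusCode >= 300 && $statusCode <= 399) {
        return 'follow-redirect';
    }
    // Assumption: anything else is treated conservatively.
    return 'disallow-all';
}
```

With the library itself, the status code is simply passed to the client (see the TxtClient syntax under "Getting started"), so a helper like this would not normally be needed.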

Requirements:

PHP 7.3+ or 8.0+
PHP extensions:
- cURL
- mbstring

Installation

The recommended way to install the robots.txt parser is through Composer. Add this to your composer.json file:

{
  "require": {
    "vipnytt/robotstxtparser": "^2.1"
  }
}

Then run: php composer update

Getting started

Basic usage example

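<?php
// Load Composer's autoloader (adjust the path to match your project layout)
require __DIR__ . '/vendor/autoload.php';

$client = new vipnytt\RobotsTxtParser\UriClient('http://example.com');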

if ($client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html')) {
    // Access is granted
}
if ($client->userAgent('MyBot')->isDisallowed('http://example.com/admin')) {
    // Access is denied
}

A small excerpt of basic methods

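<?php
// Raw robots.txt body, e.g. as fetched earlier by your own HTTP client
$robotsTxtContent = "User-agent: *\nDisallow: /admin\n";

// Syntax: $baseUri, [$statusCode:int|null], [$robotsTxtContent:string], [$encoding:string], [$byteLimit:int|null]
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);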

// Permission checks
$allowed = $client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html'); // bool
$denied = $client->userAgent('MyBot')->isDisallowed('http://example.com/admin'); // bool

// Crawl delay rules
$crawlDelay = $client->userAgent('MyBot')->crawlDelay()->getValue(); // float | int

// Dynamic URL parameters
$cleanParam = $client->cleanParam()->export(); // array

// Preferred host
$host = $client->host()->export(); // string | null
$hostOrFallback = $client->host()->getWithUriFallback(); // string
$isPreferred = $client->host()->isPreferred(); // bool

// XML Sitemap locations
$sitemaps = $client->sitemap()->export(); // array

The above is just a taste of the basics. Many more advanced and specialized methods are available for almost any purpose. See the cheat-sheet for the technical details.

Visit the Documentation for more information.
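
As one way to put the crawl delay value to use, the sketch below computes how long a crawler should still wait before its next request. It shows only the arithmetic and is independent of the library's own crawl delay handler, which is covered in the documentation. The function name `secondsToWait` and its parameters are hypothetical.

```php
<?php
/**
 * Hypothetical helper, not part of the library: returns the number of seconds
 * still to wait so that two requests are at least $crawlDelay seconds apart.
 *
 * @param float $crawlDelay    Delay in seconds, e.g. the result of crawlDelay()->getValue()
 * @param float $lastRequestAt Unix timestamp of the previous request
 * @param float $now           Current Unix timestamp, e.g. microtime(true)
 */
function secondsToWait(float $crawlDelay, float $lastRequestAt, float $now): float
{
    $elapsed = $now - $lastRequestAt;
    // Never return a negative wait: if enough time has passed, wait 0 seconds.
    return max(0.0, $crawlDelay - $elapsed);
}
```

A crawler could then call something like `usleep((int) round(secondsToWait($crawlDelay, $lastRequestAt, microtime(true)) * 1000000));` before each request, where `$lastRequestAt` is a timestamp the crawler tracks itself.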

Directives

Clean-param
Host
Sitemap
User-agent
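
For orientation, a robots.txt file that uses each of these directives might look like the sample below. It is held in a PHP string so that it could be passed as `$robotsTxtContent` to `TxtClient`. The host name, paths and parameter name are made up for illustration.

```php
<?php
// Illustrative robots.txt body; every value here is an invented example.
$robotsTxtContent = <<<'ROBOTS'
User-agent: *
Disallow: /admin
Crawl-delay: 5
Clean-param: ref /articles/
Host: example.com
Sitemap: http://example.com/sitemap.xml
ROBOTS;
```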

Specifications

Google robots.txt specifications
Yandex robots.txt specifications
W3C Recommendation HTML 4.01 specification
Sitemaps.org protocol
Sean Conner: "An Extended Standard for Robot Exclusion"
Martijn Koster: "A Method for Web Robots Control"
Martijn Koster: "A Standard for Robot Exclusion"
RFC 7231, ~~2616~~
RFC 7230, ~~2616~~
RFC 5322, ~~2822~~, ~~822~~
RFC 3986, ~~1808~~
RFC 1945
RFC 1738
RFC 952
