Web Scraping

Robots.txt

Definition updated April 2026

What is robots.txt?

Robots.txt is a plain-text file placed at the root of a website (e.g., example.com/robots.txt) that tells automated crawlers which paths they may access. It is the core mechanism of the Robots Exclusion Protocol, standardized as RFC 9309, and site operators use it to state which parts of a site bots should stay out of.

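Rules are grouped under a User-agent line that names the crawler they apply to. Common directives include Disallow (block a path), Allow (explicitly permit a path within a blocked section), and Crawl-delay (a suggested wait time between requests; this is a non-standard extension that some major crawlers, including Googlebot, ignore). Robots.txt is advisory, not enforceable: malicious bots ignore it, but reputable crawlers and legally cautious data tools honor it as standard practice.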
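As a rough illustration of how these directives are read, the sketch below parses a small made-up robots.txt with Python's standard-library urllib.robotparser. The file contents, the ExampleBot name, and the parse_rules helper are assumptions for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
# Python's parser applies the first rule that matches a path, so the
# narrower Allow line is listed before the broader Disallow line.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5
"""


def parse_rules(robots_text):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = parse_rules(SAMPLE_ROBOTS_TXT)

# "ExampleBot" is a placeholder user-agent; it falls under the "*" group.
print(rules.can_fetch("ExampleBot", "/private/press/release.html"))
print(rules.can_fetch("ExampleBot", "/private/accounts.html"))
print(rules.can_fetch("ExampleBot", "/blog/"))
print(rules.crawl_delay("ExampleBot"))
```

Under these assumptions, the press page is allowed, the other /private/ page is blocked, paths that match no rule are allowed by default, and the suggested delay is reported in seconds. Crawlers that follow RFC 9309 pick the longest matching rule instead of the first, so rule order would not matter to them.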

Before scraping any website, check its robots.txt: doing so is the ethical first step and can also matter legally. In some jurisdictions, scraping in violation of robots.txt has been treated as a factor in computer fraud claims. Data APIs provided by the platform itself avoid this concern, since access is granted under the API's own terms rather than inferred from crawl rules.
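A pre-scrape check along these lines might look like the following minimal sketch. The robots_url_for and may_fetch helpers, the ExampleBot user-agent, and the example.com URLs are hypothetical names chosen for illustration; only the urllib calls come from Python's standard library.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url):
    """Return the robots.txt URL at the root of the page's host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_fetch(page_url, user_agent, robots_text=None):
    """Report whether robots.txt permits user_agent to request page_url.

    If robots_text is given it is parsed directly (handy for testing);
    otherwise the site's robots.txt is downloaded over the network.
    """
    parser = RobotFileParser()
    if robots_text is None:
        parser.set_url(robots_url_for(page_url))
        parser.read()  # performs an HTTP request
    else:
        parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Offline usage with made-up rules; drop robots_text to check a live site.
EXAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"
print(robots_url_for("https://example.com/private/report?id=1"))
print(may_fetch("https://example.com/private/report", "ExampleBot", EXAMPLE_RULES))
print(may_fetch("https://example.com/pricing", "ExampleBot", EXAMPLE_RULES))
```

A scraper could call may_fetch before each request and skip any URL for which it returns False. This covers only the robots.txt check; it does not address a site's terms of service or rate limits.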

Related Terms

Web Crawler, Data Scraping, Anti-Scraping
