Summary
Implement a feature in the web crawler that automatically discovers, fetches, parses, and enforces the rules specified in a website's robots.txt file before crawling any URLs from that domain.
This includes respecting the Disallow, Allow, and Crawl-delay directives, and ensuring that the crawler never accesses or queues URLs forbidden by the site's robots.txt policy. The crawler should cache robots.txt files per host so they are not refetched for every URL. One possible shape for this is sketched below.
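A minimal sketch of how this could work, using Python's standard-library urllib.robotparser. The parser choice, the RobotsCache class, its method names, and the 1-second default delay are illustrative assumptions, not part of this proposal:

```python
# Sketch only: assumes the stdlib urllib.robotparser is acceptable.
# RobotsCache and its method names are hypothetical, for illustration.
import urllib.robotparser
from urllib.parse import urlparse


class RobotsCache:
    """Fetches and caches one robots.txt parser per host."""

    def __init__(self, user_agent: str = "MyCrawler"):
        self.user_agent = user_agent
        self._parsers: dict[str, urllib.robotparser.RobotFileParser] = {}

    def _parser_for(self, url: str) -> urllib.robotparser.RobotFileParser:
        host = urlparse(url).netloc
        if host not in self._parsers:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{host}/robots.txt")
            rp.read()  # fetched once per host; network errors not handled here
            self._parsers[host] = rp
        return self._parsers[host]

    def allowed(self, url: str) -> bool:
        """Check Allow/Disallow rules before queueing or fetching a URL."""
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def delay(self, url: str) -> float:
        """Crawl-delay for this host, or 1.0s as an assumed default."""
        d = self._parser_for(url).crawl_delay(self.user_agent)
        return float(d) if d is not None else 1.0


# Example usage at the queueing step:
cache = RobotsCache(user_agent="MyCrawler")
if cache.allowed("https://example.com/some/page"):
    pass  # enqueue the URL, throttled by cache.delay(...) per host
```

Caching per host keeps robots.txt to a single fetch per domain; a production version would also need a TTL for cache entries and explicit handling of fetch failures.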
Affected Area(s)
Apps:
Libraries:
Other:
Motivation
Respecting robots.txt prevents overloading servers and avoids crawling restricted areas, aligning with industry best practices and ethical standards.