C.2.4 Describe how a web-crawler functions
A web crawler, also known as a web spider, web robot or simply bot, is a program that browses the web in a methodical and automated manner. For each page it finds, a copy is downloaded and indexed. In this process it extracts all links from the given page and then repeats the same process for all found links. This way, it tries to find as many pages as possible.
- They might look at metadata contained in the head of web pages, but this depends on the crawler
- A crawler might not be able to read pages with dynamic content, as many crawlers are very simple programs
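The download, extract links, repeat loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the names `crawl`, `LinkExtractor`, `fetch` and `max_pages` are made up for this example, and the "web" is a small in-memory dictionary so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    fetch is a hypothetical callable (an assumption of this sketch): it takes
    a URL and returns the page's HTML as a string, or None if the page cannot
    be downloaded. Returns a dict mapping each visited URL to its HTML copy.
    """
    index = {}                  # url -> downloaded copy of the page
    queue = deque([seed_url])   # pages still waiting to be visited
    seen = {seed_url}           # avoids visiting the same page twice

    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        index[url] = html       # "download and index" step

        parser = LinkExtractor()
        parser.feed(html)       # extract all links from this page
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index


# A tiny fake "web" (hypothetical pages) used instead of real HTTP requests
pages = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.com/b": "<p>no links</p>",
}
index = crawl("http://example.com/", pages.get)
print(sorted(index))
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, and `max_pages` is a simple way to limit how many resources a crawl consumes.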
Stop bots from using bandwidth
Saves bandwidth: bots spend less time crawling the site
Issue: A crawler consumes resources, and a site might not wish to be “crawled”. For this reason “robots.txt” files were created, in which a site states which pages may be crawled and which may not.
- A file containing rules that specify which pages on a website should not be crawled by search engine bots
- File is placed in root directory of the site
- The standard for robots.txt is called “Robots Exclusion Protocol”
- Rules can target one specific web crawler or apply to all crawlers
- Not all bots follow this standard: malicious bots and malware (“illegal” bots) can simply ignore robots.txt
- Still considered to be better to include a robots.txt instead of leaving it out
- It keeps bots away from the less “noteworthy” content of a website, so more time is spent on indexing the important/relevant content
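As a rough illustration of the points above, the sketch below shows a small robots.txt file and how a well-behaved crawler could check it before downloading a page, using Python's standard `urllib.robotparser` module. The file contents, the bot names `MyCrawler` and `BadBot`, and the example.com URLs are all invented for this example; a real crawler would first download the file from the root of the site (for instance `/robots.txt`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule group for all crawlers ("*")
# and one group that applies only to a crawler calling itself "BadBot".
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to crawl url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for agent, url in [
    ("MyCrawler", "http://example.com/index.html"),
    ("MyCrawler", "http://example.com/private/data.html"),
    ("BadBot", "http://example.com/index.html"),
]:
    print(agent, url, allowed(parser, agent, url))
```

Note that this check happens inside the crawler itself: nothing forces a bot to perform it, which is why bots that do not follow the standard can ignore the file entirely.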