C.2.2 Distinguish between the surface web and the deep web
The surface web is the part of the web that can be reached (and indexed) by a search engine. For this, pages need to be static, so that they can be reached through links from other sites on the surface web, and they need to be accessible without special configuration. Examples include Google, Facebook, YouTube, etc.
- Pages that are reachable (and indexed) by a search engine
- Pages that can be reached through links from other sites in the surface web
- Pages that do not require special access configurations
The deep web is the part of the web that is not searchable by normal search engines. Reasons for this include proprietary content that requires authentication or VPN access, e.g. private social media, emails; commercial content that is protected by paywalls, e.g. online newspapers, academic research databases; personal information that is protected, e.g. bank information, health records; and dynamic content. Dynamic content is usually the result of some query, where data are fetched from a database.
- Pages not reachable by search engines
- Substantially larger than the surface web
- Common characteristics:
- Password protected pages, e.g. emails, private social media
- Paywalls, e.g. online newspapers, academic research databases
- Personal information, e.g. health records
- Pages without any incoming links
C.2.3 Outline the principles of searching algorithms used by search engines
The best-known search algorithms are PageRank and the HITS algorithm, but it is important to know that most search engines take various other factors into account as well, e.g.
- the time that a page has existed
- the frequency of the search keywords on the page
- other unknown factors (undisclosed)
For the following description the terms “inlinks” and “outlinks” are used. Inlinks are links that point to the page in question, i.e. if page W has an inlink, there is a page Z containing the URL of page W. Outlinks are links that point from the page in question to a different page, i.e. if page W has an outlink, it contains the URL of another page, e.g. page Z.
PageRank works by counting the number and quality of inlinks of a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
The value of an inlink from a page A is proportional to

P(A) / C(A)

where P(A) is the PageRank score of page A and C(A) is the number of outlinks page A has. For example, if P(A) = 0.6 and page A has three outlinks, each of them passes on a value proportional to 0.2.
As mentioned, it is important to note that many other factors are considered. For instance, the anchor text of a link is often far more important than the PageRank score of the linking page.
- Pages are given a score (rank)
- Rank determines the order in which pages appear
- Inlinks add value to a page
- The importance of an inlink depends on the PageRank (score) of the linking page
- PageRank counts links per page and determines which pages are most important
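The idea above can be sketched as a short Python simulation on a hypothetical four-page web (the page names and the damping factor are illustrative, not from these notes). Each page splits its score evenly among its outlinks, which is exactly the P(A) / C(A) term.

```python
# Minimal PageRank sketch on a made-up four-page web (illustrative only).

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # the P(A) / C(A) term
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
scores = pagerank(web)
# C receives the most (and best) inlinks, so it ends up with the highest
# score; D has no inlinks at all and ends up with the lowest.
```

Note how a page with no inlinks (D here) still gets a small base score from the (1 − damping) term, but can never rank highly.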
The HITS (Hyperlink-Induced Topic Search) algorithm is based on the idea that keywords are not all that matters: some sites may be more relevant even if they do not contain the most keywords. It introduces the idea of two different types of pages, authorities and hubs.
Authorities: A page is called an authority if it contains valuable information and is truly relevant for the search query. It is assumed that such a page has a high number of inlinks.
Hubs: These are pages that are relevant for finding authorities. They contain useful links towards them. It is therefore assumed that these pages have a high number of outlinks.
The algorithm is based on mathematical graph theory, where a page is represented by a vertex and links between pages are represented by directed edges.
Figure 1: A simple graph
The algorithm starts by creating a graph:
- It first finds the top 200 pages based on the occurrence of keywords from the query. Let’s call the set of these pages RQ
- It then finds all pages that link to the pages in RQ and all pages which these link to (basically all pages linked in or out). Together with RQ, this makes up the set SQ
- The algorithm gives each page in the set SQ a hub weight and an authority weight. These are updated iteratively: a page’s authority weight depends on the hub weights of the pages linking to it, and its hub weight depends on the authority weights of the pages it links to
- The algorithm then lists the pages based on their weight
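The hub/authority weighting can be sketched in Python on a hypothetical mini-graph (the page names below are invented for illustration). A real implementation repeats the two update rules until the weights converge; here a fixed number of iterations stands in for that.

```python
# HITS sketch on a made-up graph of two hub pages and two content pages.

def hits(links, iterations=20):
    """links maps each page in the set S_Q to the pages it links to."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        # Hub weight: sum of authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so the weights stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {
    "hub1": ["siteA", "siteB"],   # link lists pointing at the content pages
    "hub2": ["siteA", "siteB"],
    "siteA": [],
    "siteB": [],
}
hub, auth = hits(graph)
# siteA/siteB get high authority weights; hub1/hub2 get high hub weights.
```

Sorting SQ by authority weight then gives the result list described in the last step above.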
C.2.4 Describe how a web-crawler functions.
A web crawler, also known as a web spider, web robot or simply bot, is a program that browses the web in a methodical and automated manner. For each page it finds, a copy is downloaded and indexed. In this process it extracts all links from the given page and then repeats the same process for all found links. This way, it tries to find as many pages as possible.
- They might look at meta data contained in the head of web pages, but this depends on the crawler
- A crawler might not be able to read pages with dynamic content as they are very simple programs
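The "download, extract links, repeat" loop can be sketched with an in-memory stand-in for the web (the page names and link structure below are invented). A real crawler would fetch each page over HTTP and parse its HTML for links instead of reading a dictionary.

```python
from collections import deque

# Sketch of a crawler's core loop over a hypothetical in-memory "web":
# each page name maps to the links found on that page.

def crawl(web, seed):
    index = {}               # page -> stored copy ("download and index")
    queue = deque([seed])    # frontier of pages still to visit
    visited = {seed}
    while queue:
        url = queue.popleft()
        page_links = web.get(url, [])
        index[url] = page_links      # stand-in for the downloaded copy
        for link in page_links:      # extract links, repeat for each
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return index

web = {
    "home": ["about", "blog"],
    "blog": ["post1", "home"],
    "about": [],
    "post1": [],
}
found = crawl(web, "home")   # reaches all four pages from the seed
```

The `visited` set is what keeps the crawler from looping forever on cycles such as home → blog → home.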
Issue: A crawler consumes a site’s resources, and a site might not wish to be “crawled”. For this reason “robots.txt” files were created, in which a site states what should be indexed and what shouldn’t.
- A file that specifies which pages on a website must not be crawled by search engine bots
- File is placed in root directory of the site
- The standard for robots.txt is called “Robots Exclusion Protocol”
- Can be specific to a special web crawler, or apply to all crawlers
- Not all bots follow this standard (malicious bots, malware) -> “illegal” bots can ignore robots.txt
- Still considered to be better to include a robots.txt instead of leaving it out
- It keeps the bots away from less “noteworthy” content of a website, so more time is spent indexing the important/relevant content of the website
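As a sketch of how a well-behaved crawler honours these rules, Python’s standard library ships a parser for the Robots Exclusion Protocol. The robots.txt contents and URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: the rules apply to all crawlers (User-agent: *)
# and keep them out of two directories while allowing everything else.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite bot checks can_fetch() before downloading each URL.
parser.can_fetch("SomeBot", "https://example.com/index.html")    # allowed
parser.can_fetch("SomeBot", "https://example.com/private/data")  # blocked
```

Replacing `User-agent: *` with a specific crawler name (e.g. `User-agent: Googlebot`) is how a site targets rules at one particular bot; and, as noted above, nothing technically stops a malicious bot from ignoring the file entirely.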