C.2.3 Outline the principles of searching algorithms used by search engines


The best-known search algorithms are PageRank and HITS, but most search engines also take various other factors into account, e.g.

  • the time that a page has existed
  • the frequency of the search keywords on the page
  • other unknown factors (undisclosed)

For the following description the terms “inlinks” and “outlinks” are used. Inlinks are links that point to the page in question, i.e. if page W has an inlink, there is another page Z that contains the URL of page W. Outlinks are links on the page in question that point to a different page, i.e. if page W has an outlink, page W contains the URL of another page, e.g. page Z.
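The relationship between the two terms can be sketched in a few lines of Python. This is only an illustration: the page names and the `toy_outlinks` map are made up, and `compute_inlinks` is a hypothetical helper, not part of any search engine.

```python
# Hypothetical toy web: each page maps to the pages it links to (its outlinks).
toy_outlinks = {
    "W": ["Z"],
    "X": ["W", "Z"],
    "Z": ["W"],
}


def compute_inlinks(outlinks):
    """Invert an outlink map: for each page, list the pages that link to it."""
    inlinks = {page: [] for page in outlinks}
    for source, targets in outlinks.items():
        for target in targets:
            # An outlink of `source` is, by definition, an inlink of `target`.
            inlinks.setdefault(target, []).append(source)
    return inlinks


toy_inlinks = compute_inlinks(toy_outlinks)
```

Every outlink of one page is an inlink of another, so the inlink map is simply the outlink map read in the opposite direction.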

PageRank algorithm

PageRank works by counting the number and quality of inlinks of a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

The value of an inlink from a page A is proportional to

P(A) / C(A)

where P(A) is the PageRank score of page A and C(A) is the number of outlinks page A has.
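A minimal sketch of how this share P(A) / C(A) might be used to compute scores iteratively is shown below. It rests on assumptions not stated above: the damping factor of 0.85 comes from the commonly published formulation of PageRank, the fixed iteration count is arbitrary, every page is assumed to have at least one outlink, and the four-page graph is invented. Real search engines do not disclose their exact computation.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Iteratively estimate PageRank scores for a small link graph.

    `outlinks` maps each page to the list of pages it links to.
    Assumption: every linked page is also a key of `outlinks`.
    """
    pages = list(outlinks)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page receives a small base score (assumed damping model).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in outlinks.items():
            if not targets:
                continue  # pages without outlinks pass nothing on in this sketch
            share = rank[page] / len(targets)  # P(A) / C(A)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank


# Hypothetical graph: C is linked from A, B and D; nothing links to D.
example_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(example_web)
ordering = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph, page C collects shares from three pages and ends up ranked first, while page D, which has no inlinks, keeps only the base score and ranks last.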

As mentioned, many other factors are considered as well. For instance, the anchor text of a link is often far more important than the PageRank score of the linking page.

Summary:

  • Pages are given a score (rank)
  • Rank determines the order in which pages appear
  • Incoming links add value to a page
  • The importance of an inlink depends on the PageRank (score) of the linking page
  • PageRank counts links per page and determines which pages are most important

HITS algorithm

HITS (Hyperlink-Induced Topic Search) is based on the idea that keywords are not everything that matters; there are sites that might be more relevant even if they don’t contain the most keywords. It introduces two types of pages: authorities and hubs.

Authorities: A page is called an authority if it contains valuable information and is truly relevant to the search query. It is assumed that such a page has a high number of inlinks.

Hubs: These are pages that are relevant for finding authorities because they contain useful links to them. It is therefore assumed that these pages have a high number of outlinks.

The algorithm is based on graph theory: each page is represented by a vertex, and each link between pages is represented by a directed edge.
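The hub and authority idea can be sketched as follows. The update rule used here is the commonly described one, which is not spelled out in the text above and is therefore an assumption: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph, the iteration count and the function name `hits` are illustrative only.

```python
import math


def hits(outlinks, iterations=20):
    """Compute hub and authority scores for a small directed link graph."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: add up the hub scores of all pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in outlinks.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {page: v / norm for page, v in auth.items()}

        # Hub update: add up the authority scores of all pages linked to.
        hub = {page: sum(auth[t] for t in outlinks.get(page, []))
               for page in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {page: v / norm for page, v in hub.items()}

    return hub, auth


# Hypothetical graph: two link-collection pages pointing at content pages.
link_graph = {
    "hub1": ["a", "b", "c"],
    "hub2": ["a", "b"],
    "a": [],
    "b": [],
    "c": [],
}
hub_scores, auth_scores = hits(link_graph)
```

Here "hub1" gets the highest hub score because it links to the most authorities, and pages "a" and "b" get higher authority scores than "c" because both hubs link to them. The scores are normalised in each round so that they stay bounded.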
