12 Class Dec Activity

Update your notes with how search has continually evolved since the start of the www in early 90's. Describe how the need for more relevant results enabled Google to become the dominant search engine. Briefly describe the theory behind the HITS algorithm.

If you were designing a WEB CRAWLER state 3 features/qualities that you would include in the design

Describe how the web can be represented as a directed graph ( sketch o include image of both strong and not string connected graphs )

Draw both the Truth Tables and logic gates for the following logic operators

  • OR Gate
  • AND Gate
  • NOT Gate
  • NAND
  • NOR
  • XOR

Web - History - Democratic - Static/Dynamic and SPAM

Team 3  Features of a Web Crawlers Politeness and Robustness Click Here

Features a Web Crawler Should Have  Scaleable , Distributed, Freshness Click Here

Instructions : With your selected topic spend 10 mins reading researching . 

After 10 mins  Team 1 and Team 2 transfer topic knowledge same for Team 3 and 4.  

Team 1 present findings from Team 2  -  by sharing knowledge with rest of class

Team 2 present findings from Team 1  -  by sharing knowledge with rest of class

Team 3 present findings from Team 4  -  by sharing knowledge with rest of class

Team 4 present findings from Team 3  -  by sharing knowledge with rest of class

A Brief History of the Web

HTTP hypertext transfer protocol

Since 1990 web usage has shown tremendous growth to the point where it now claims a good fraction of humanity as participants, by relying on a simple, open client-server design: (1) the server communicates with the client via a protocol ( HTTP hypertext transfer protocol ). Easy to use and able to transfer TEXT, IMAGES, VIDEO and  AUDIO 

Browser accepts a URL (for Uniform Resource Locator)

EXAMPLE :   http://www.stanford.edu/home/atoz/contact.html

  • 1
    HTTP  The protocol
  • 2
    www.stanford.edu : The Domain 
  • 3
    /home/atoz/contact.html  : Is  the path to file contact.html 

Browser Features

  • 1
    The designers of the first browsers made it easy to view the HTML markup tags (CTRL U)
  • 2
    Browser  ignored what they did not understand - Supporting web page creation by everyone - Display what it can rather than displaying nothing. 

What it did promote was amateur content creators who could freely experiment with and learn from their newly created web pages without fear that a simple syntax error would ``bring the system down.''

Making web information ``discoverable''

  • 1
    Taxonomies or directories ( like yahoo index or DMOZ ) - Manual and Difficult to Scale
  • 2
    full-text index search engines such as Altavista, Excite and Infoseek - Search Box - Indexes

Relevance/Quality of Results

  • 1
    Left much to be desired
  • 2
    This necessitated the invention of new ranking and spam-fighting techniques in order to ensure the quality of the search results. Quality = Relevance

New Approach  - Ranking Based on Authority

  • 1
    Google Page Rank which is a measure of a web site trust based on incoming links from establised authorative sites 
  • 2
    HITS - Use Hubs and Authorities

Features of A Web Crawler

Features of a Web Crawlers Politness and Robustness Click Here

Features a Web Crawler Should Have  Scaleable , Distributed, Freshness Click Here

The Web as A Directed Graph Is this Graph Strongly Connected ? Why ?

Further Notes next class

Insert Video

get studets to depict linked lists  3 types on paper

Open office example competing against MS office click here / click here

Hubs and Authorities  and Page Rank

We now develop a scheme in which, given a query, every web page is assigned two scores. One is called its hub score and the other its authority score . For any query, we compute two ranked lists of results rather than one. The ranking of one list is induced by the hub scores and that of the other by the authority scores.

This approach stems from a particular insight into the creation of web pages, that there are two primary kinds of web pages useful as results for broad-topic searches. By a broad topic search we mean an informational query such as "I wish to learn about leukemia". There are authoritative sources of information on the topic; in this case, the National Cancer Institute's page on leukemia would be such a page. We will call such pages authorities; in the computation we are about to describe, they are the pages that will emerge with high authority scores.

On the other hand, there are many pages on the Web that are hand-compiled lists of links to authoritative web pages on a specific topic. These hub pages are not in themselves authoritative sources of topic-specific information, but rather compilations that someone with an interest in the topic has spent time putting together. The approach we will take, then, is to use these hub pages to discover the authority pages. In the computation we now develop, these hub pages are the pages that will emerge with high hub scores.

A good hub page is one that points to many good authorities; a good authority page is one that is pointed to by many good hub pages. 

Page Rank

We now focus on scoring and ranking measures derived from the link structure alone. Our first technique for link analysis assigns to every node in the web graph a numerical score between 0 and 1, known as its PageRank . The PageRank of a node will depend on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features such as cosine similarity (Section 6.3 ) and term proximity (Section 7.2.2 ), together with the PageRank score. This composite score, developed using the methods of Section 15.4.1 , is used to provide a ranked list of results for the query.

Consider a random surfer who begins at a web page (a node of the web graph) and executes a random walk on the Web as follows. At each time step, the surfer proceeds from his current page A to a randomly chosen web page that A hyperlinks to. Figure 21.1 shows the surfer at a node A, out of which there are three hyperlinks to nodes B, C and D; the surfer proceeds at the next time step to one of these three nodes, with equal probabilities 1/3.

Figure 21.1: The random surfer at node A proceeds with probability 1/3 to each of B, C and D.
\begin{figure}\begin{picture}(700,100)\thicklines \put(125,50){\circle{20}} \... ...50){\vector(1,1){20}} \put(135,50){\vector(1,-1){20}} \end{picture} \end{figure}

As the surfer proceeds in this random walk from node to node, he visits some nodes more often than others; intuitively, these are nodes with many links coming in from other frequently visited nodes. The idea behind PageRank is that pages visited more often in this walk are more important.

The Web as a Graph

We can view the static Web consisting of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge.
Figure 19.2: Two nodes of the web graph joined by a link.
\begin{figure}\begin{picture}(300,100)\thicklines \put(60,40){\circle{40}} \p... ...}\thinlines \put(45,45){\framebox{\small{anchor}}} \end{picture} \end{figure}

Figure 19.2 shows two nodes A and B from the web graph, each corresponding to a web page, with a hyperlink from A to B. We refer to the set of all such nodes and directed edges as the web graph. Figure 19.2 also shows that (as is the case with most links on web pages) there is some text surrounding the origin of the hyperlink on page A. This text is generally encapsulated in the href attribute of the <a> (for anchor) tag that encodes the hyperlink in the HTML code of page A, and is referred to as anchor text . As one might suspect, this directed graph is not strongly connected: there are pairs of pages such that one cannot proceed from one page of the pair to the other by following hyperlinks. We refer to the hyperlinks into a page as in-links and those out of a page as out-links . The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15, in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it. These notions are represented in Figure 19.3 .

\includegraphics[width=9cm]{smallgraph.eps} A sample small web graph.In this example we have six pages labeled A-F. Page B has in-degree 3 and out-degree 1. This example graph is not strongly connected: there is no path from any of pages B-F to page A.

There is ample evidence that these links are not randomly distributed; for one thing, the distribution of the number of links into a web page does not follow the Poisson distribution one would expect if every web page were to pick the destinations of its links uniformly at random. Rather, this distribution is widely reported to be a power law , in which the total number of web pages with in-degree $i$ is proportional to $1/i^\alpha$; the value of $\alpha$ typically reported by studies is 2.1.[*]Furthermore, several studies have suggested that the directed graph connecting web pages has a bowtie shape: there are three major categories of web pages that are sometimes referred to as IN, OUT and SCC. A web surfer can pass from any page in IN to any page in SCC, by following hyperlinks. Likewise, a surfer can pass from page in SCC to any page in OUT. Finally, the surfer can surf from any page in SCC to any other page in SCC. However, it is not possible to pass from a page in SCC to any page in IN, or from a page in OUT to a page in SCC (or, consequently, IN). Notably, in several studies IN and OUT are roughly equal in size, whereas SCC is somewhat larger; most web pages fall into one of these three sets. The remaining pages form into tubes that are small sets of pages outside SCC that lead directly from IN to OUT, and tendrils that either lead nowhere from IN, or from nowhere to OUT. Figure 19.4 illustrates this structure of the Web.

\includegraphics[width=10cm]{bowtie.eps} The bowtie structure of the Web.Here we show one tube and three tendrils.