Category Archives for Web Science

C.2.10 Explain why the effectiveness of a search engine is determined by the assumptions made when developing it.


A search engine must serve up results that are relevant to what the user searches for. Before Google introduced PageRank, search engines relied largely on title tags and keyword meta tags. These could be easily manipulated (stuffed with the keywords you wished to rank for) to get a site onto page 1. Google devised the PageRank algorithm, which played a big part in its overall search algorithm. The assumptions built into a search engine therefore determine how effective it is, for example:

  • Avoid indexing spam sites (e.g. duplicate, copied content); detect sites that use black-hat techniques and remove them from the index
  • Do not re-crawl static sites (that do not change) as often; crawl authoritative, frequently changing (fresh content) news sites more often
  • Respect robots.txt files
  • Determine which sites change on a regular basis and cache these
  • The spider should not overload servers by continually hitting the same site
  • The algorithm must be able to avoid spider traps
  • Ignore paid-for links (this can be difficult)
  • Ignore exact-match anchor text if it is being used to rank for keywords/search terms (a backlink profile should look natural to the search engine)
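
As a concrete illustration of how one such assumption shapes results, here is a minimal, hypothetical sketch (in Python) of the assumption that an unnaturally high keyword density signals stuffing. The 8% threshold and the helper names are invented for illustration and are not any real engine's rule.

def keyword_density(text: str, keyword: str) -> float:
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

def looks_stuffed(text: str, keyword: str, threshold: float = 0.08) -> bool:
    # Invented threshold: treat anything above 8% density as suspicious.
    return keyword_density(text, keyword) > threshold

print(looks_stuffed("buy shoes buy shoes buy shoes online today", "shoes"))  # True

An engine built on a poor version of this assumption (say, a threshold that legitimate pages often exceed) would wrongly demote relevant results, which is exactly why the assumptions matter.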


      C.2.8-9 Suggest how developers can create pages that appear more prominently in search engine results. Describe the different metrics used by search engines


      The process of making pages appear more prominently in search engine results is called SEO (search engine optimisation). There are many different techniques, considered in section C.2.11. This field is a big part of web marketing; because search engines do not disclose exactly how they work, it is hard for developers to optimise pages perfectly.

      To rank a web site Google uses a great many metrics; below are a few of the important ones.

      Top Metrics

      On Page

      • Make sure your site can be crawled and therefore indexed: avoid Flash, provide a sitemap and good web site architecture
      • Title: create a title tag with your key phrase at or near the beginning. The title should be crafted to get the user to click on your result when it is displayed in the search results, and it must reflect the content of your site
      • Content will always be important: it must be high quality and factual, with at least 1000 words for the home page
      • Freshness of content
      • Mobile friendly
      • Page load speed under 3 seconds
      • If a link is broken the browser will receive an HTTP 404 response code. The web designer should detect this and provide a help page with user navigation
      • Text formatting (use of h1, h2, bold etc.)
      • HTTPS
      • Do keyword research to find what users actually search for and build pages for these terms (several of these checks are automated in the sketch after this list)
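
      Below is a minimal, hypothetical sketch (in Python) of the kind of on-page audit an SEO tool, not Google itself, might run against a page's HTML; the 20-character and 1000-word thresholds mirror the notes above and are assumptions for illustration.

      import re

      def on_page_checks(html: str, key_phrase: str) -> dict:
          # Pull out the <title> text and a rough plain-text version of the page.
          m = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
          title = m.group(1).strip() if m else ""
          text = re.sub(r"<[^>]+>", " ", html)
          pos = title.lower().find(key_phrase.lower())
          return {
              "has_title": bool(title),
              "phrase_early_in_title": 0 <= pos < 20,        # key phrase near the beginning
              "has_h1": bool(re.search(r"<h1[\s>]", html, re.I)),
              "enough_content": len(text.split()) >= 1000,   # the 1000-word guideline above
          }

      page = "<html><head><title>Dog insurance quotes</title></head><body><h1>Dog insurance</h1><p>...</p></body></html>"
      print(on_page_checks(page, "dog insurance"))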

      These are only a fraction of the metrics Google uses; more recently it has given a slight ranking boost to sites that use HTTPS.

      Describe the different metrics used by search engines

      Naturally there is an overlap with what the web site developer should do to get a site high in the SERPs (search engine results pages).

      On Page

      Relevancy: does your site provide the information the user is searching for? The user experience (UX) is becoming a big factor, because it cannot easily be manipulated, and in future it will play a much bigger role. UX signals include the time a user stays on the site and the bounce rate. Many factors feed into the user experience: load speed, easy navigation (no broken links), spelling, quality and factually correct content, a structured layout, use of images, video and infographics, page design and colours, and formatting that makes the page easy to scan for relevant information. The idea is to get the user to stay on your site (make it "sticky"). If a user lands on your site after doing a search and leaves after a few seconds, or even before the page loads (slow loading), that is a very big signal to Google that it should not have served up that result.
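
      To make those engagement signals concrete, here is a small sketch of how bounce rate and average dwell time could be computed from session data; the session records and the single-page definition of a "bounce" are invented for illustration.

      # Hypothetical session records: (seconds_on_page, pages_viewed_in_visit)
      sessions = [(4, 1), (95, 3), (2, 1), (180, 5), (8, 1)]

      bounces = sum(1 for secs, pages in sessions if pages == 1)
      bounce_rate = bounces / len(sessions)
      avg_dwell = sum(secs for secs, _ in sessions) / len(sessions)

      print(f"bounce rate: {bounce_rate:.0%}, average dwell time: {avg_dwell:.0f}s")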

      Off Page

      Backlinks from other web sites: the more authoritative the linking site, the better (for example the Huffington Post). The site that links to your site should also be relevant. For example, if you are selling dog insurance, a link from a respected charitable dog web site would be a very big boost, whereas a link from a site that provides car rental would have little impact as it is totally irrelevant.

      Social media marketing (Facebook etc.): be a leader in your field and comment on relevant, authoritative forums or blogs. Other users sharing your content via social bookmarking sites also helps.

      This is an area in which search results can be manipulated. If Google discovers this, your web site will be dropped from the index, so you need to ensure any links are natural, ideally pointing to an authoritative article or infographic on your web site.


          C.2.7 Outline the purpose of web-indexing in search engines


           Search engines index websites in order to respond to search queries with relevant information as quickly as possible. For this reason, a search engine stores information about indexed web pages, e.g. keywords, titles or descriptions, in its database. This way it can quickly identify pages relevant to a search query.

           Indexing has the additional purpose of giving a page a certain weight, as described under the searching algorithms (C.2.3). This way search results can be ranked after being indexed.
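
           The core data structure behind this is an inverted index mapping keywords to the pages that contain them. The following is a minimal sketch with an invented set of pages; real indexes also store term weights, positions and other metadata.

           from collections import defaultdict

           pages = {
               "page1.html": "dog insurance quotes for healthy dogs",
               "page2.html": "cheap car rental deals",
               "page3.html": "dog health records and insurance advice",
           }

           # Build an inverted index: keyword -> set of pages containing it.
           index = defaultdict(set)
           for url, text in pages.items():
               for word in text.lower().split():
                   index[word].add(url)

           # Answering a query becomes a fast lookup plus a set intersection.
           query = ["dog", "insurance"]
           print(set.intersection(*(index[w] for w in query)))  # page1 and page3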

            C.2.6 Discuss the use of parallel web-crawling


             • As the size of the web grows, the time it would take to download all pages with a single process increases
             • To keep this reasonable, “it becomes imperative to parallelize the crawling process” (Stanford)

            Advantages

             • Scalability: as the web grows, a single crawling process cannot handle everything; multi-threaded/parallel processing can solve the problem
             • Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load
             • Network load reduction: each crawler can fetch pages that are “close” to it, improving scalability, efficiency and throughput

            Issues of parallel web crawling

             • Overlapping: parallel web crawlers might download and index the same page multiple times
             • Quality: if a crawler wants to download ‘important’ pages first, this might not work well when the decision is split across parallel processes (see “Why search engines take the quality approach” below)
             • Communication bandwidth: parallel crawlers need to communicate to deal with the issues above, which for many processes can take significant communication bandwidth
             • If parallel crawlers request pages from the same server too frequently over a short time they will overload it
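
             One common way to avoid the overlap problem is to partition the URL space between crawler processes, for example by hashing the host name, so that every URL on a given host is always handled by the same crawler. This is a simplified sketch, not a description of any particular engine.

             import hashlib
             from urllib.parse import urlparse

             NUM_CRAWLERS = 4

             def assigned_crawler(url: str) -> int:
                 # All URLs on the same host hash to the same crawler,
                 # so no two crawlers ever fetch the same page.
                 host = urlparse(url).netloc
                 return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_CRAWLERS

             for u in ["http://example.com/a", "http://example.com/b", "http://news.example.org/"]:
                 print(u, "->", assigned_crawler(u))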

            Discuss the use of parallel web crawling

            A crawler is a program that downloads and stores Web pages, often for a Web search engine. Roughly, a crawler starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop. Collected pages are later used for other applications, such as a Web search engine or a Web cache. As the size of the Web grows, it becomes more difficult to retrieve the whole or a significant portion of the Web using a single process. Therefore, many search engines often run multiple processes in parallel to perform the above task, so that download rate is maximized (reference http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.8408&rep=rep1&type=pdf )

            Why search engines take the quality approach ( dated )

            According to a study released in October 2000, the directly accessible "surface web" consists of about 2.5 billion pages, while the "deep web" (dynamically generated web pages) consists of about 550 billion pages, 95% of which are publicly accessible [LVDSS00].

            By comparison, the Google index released in June 2000 contained 560 million full-text-indexed pages [Goo00]. In other words, Google — which, according to a recent measurement [HHMN00], has the greatest coverage of all search engines — covers only about 0.1% of the publicly accessible web, and the other major search engines do even worse.

            Increasing the coverage of existing search engines by three orders of magnitude would pose a number of technical challenges, both with respect to their ability to discover, download, and index web pages, as well as their ability to serve queries against an index of that size. (For query engines based on inverted lists, the cost of serving a query is linear to the size of the index.) Therefore, search engines should attempt to download the best pages and include (only) them in their index.

            Search Methods

            Breadth-first, Depth First, Backlink count, PageRank and Random

            Mercator is an extensible, multithreaded, high-performance web crawler [HN99, Mer00]. It is written in Java and is highly configurable. Its default download strategy is to perform a breadth-first search of the web, with the following three modifications:

            1. It downloads multiple pages (typically 500) in parallel. This modification allows us to download about 10 million pages a day; without it, we would download well under 100,000 pages per day.
            2. Only a single HTTP connection is opened to any given web server at any given time. This modification is necessary due to the prevalence of relative URLs on the web (about 80% of the links on an average web page refer to the same host), which leads to a high degree of host locality in the crawler's download queue. If we were to download many pages from the same host in parallel, we would overload or even crash that web server.
            3. If it took t seconds to download a document from a given web server, then Mercator will wait for 10t seconds before contacting that web server again. This modification is not strictly necessary, but it further eases the load our crawler places on individual servers on the web. We found that this policy reduces the rate of complaints we receive while crawling.
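
             Rule 3 above translates naturally into a per-host delay. The sketch below is a simplified illustration of that 10t policy, not Mercator's actual code; the download argument stands in for the real HTTP fetch.

             import time

             next_allowed = {}   # host -> earliest time we may contact it again

             def polite_fetch(host, download):
                 # Respect the per-host delay before opening a new connection.
                 wait = next_allowed.get(host, 0) - time.time()
                 if wait > 0:
                     time.sleep(wait)
                 start = time.time()
                 download()                                  # placeholder for the real fetch
                 t = time.time() - start
                 next_allowed[host] = time.time() + 10 * t   # wait 10t before the next request

             polite_fetch("example.com", lambda: time.sleep(0.1))
             polite_fetch("example.com", lambda: time.sleep(0.1))   # sleeps about 1 second first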


              C.2.5 Discuss the relationship between data in a meta tag and how it is accessed by a web-crawler


              Answer depends on different crawlers, but generally speaking:

               • The title tag, while not strictly a meta-tag, is what the indexer shows as the headline of a result
               • The description meta-tag provides the indexer with a short description of the page
               • The keywords meta-tag provides, as the name suggests, keywords describing the page

               While meta-tags used to play a role in ranking, they have been abused by so many pages that most search engines no longer consider them for ranking.

               Crawlers now mostly use the keywords and description meta-tags to compare against the content of the page when giving it a certain weight. For this reason, while meta-tags do not play the big role they used to, it is still important to include them.
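
               A minimal sketch of how a crawler might read the title and meta tags out of a page, using Python's built-in HTML parser; real crawlers are far more robust, and the example page is invented.

               from html.parser import HTMLParser

               class MetaReader(HTMLParser):
                   def __init__(self):
                       super().__init__()
                       self.in_title = False
                       self.title = ""
                       self.meta = {}

                   def handle_starttag(self, tag, attrs):
                       attrs = dict(attrs)
                       if tag == "title":
                           self.in_title = True
                       elif tag == "meta" and "name" in attrs:
                           self.meta[attrs["name"].lower()] = attrs.get("content", "")

                   def handle_endtag(self, tag):
                       if tag == "title":
                           self.in_title = False

                   def handle_data(self, data):
                       if self.in_title:
                           self.title += data

               html = ('<head><title>Dog insurance</title>'
                       '<meta name="description" content="Compare dog insurance quotes.">'
                       '<meta name="keywords" content="dog, insurance, quotes"></head>')

               reader = MetaReader()
               reader.feed(html)
               print(reader.title, reader.meta)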

                C.2.4 Describe how a web-crawler functions


                A web crawler, also known as a web spider, web robot or simply bot, is a program that browses the web in a methodical and automated manner. For each page it finds, a copy is downloaded and indexed. In this process it extracts all links from the given page and then repeats the same process for all found links. This way, it tries to find as many pages as possible.
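
                 A minimal sketch of that fetch/extract/enqueue loop, using only the Python standard library; a real crawler would add politeness delays, robots.txt checks and far better error handling.

                 from collections import deque
                 from html.parser import HTMLParser
                 from urllib.parse import urljoin
                 from urllib.request import urlopen

                 class LinkExtractor(HTMLParser):
                     def __init__(self):
                         super().__init__()
                         self.links = []
                     def handle_starttag(self, tag, attrs):
                         if tag == "a":
                             href = dict(attrs).get("href")
                             if href:
                                 self.links.append(href)

                 def crawl(seed, max_pages=10):
                     queue, seen, pages = deque([seed]), {seed}, {}
                     while queue and len(pages) < max_pages:
                         url = queue.popleft()
                         try:
                             html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
                         except Exception:
                             continue
                         pages[url] = html                 # download and store a copy of the page
                         parser = LinkExtractor()
                         parser.feed(html)
                         for link in parser.links:         # extract all links and repeat for each
                             absolute = urljoin(url, link)
                             if absolute not in seen:
                                 seen.add(absolute)
                                 queue.append(absolute)
                     return pages

                 # crawl("https://example.com")  # returns a {url: html} dictionary for indexing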

                Limitations:

                • They might look at meta data contained in the head of web pages, but this depends on the crawler
                • A crawler might not be able to read pages with dynamic content as they are very simple programs

                Robots.txt

                 Stops bots from using up bandwidth

                 Saves bandwidth, as bots spend less time crawling the site

                 Issue: a crawler consumes server resources, and a site might not wish certain pages to be “crawled”. For this reason “robots.txt” files were created, in which a site states what may be crawled and what may not.

                • A file that contains components to specify pages on a website that must not be crawled by search engine bots
                • File is placed in root directory of the site
                • The standard for robots.txt is called “Robots Exclusion Protocol”
                • Can be specific to a special web crawler, or apply to all crawlers
                • Not all bots follow this standard (malicious bots, malware) -> “illegal” bots can ignore robots.txt
                • Still considered to be better to include a robots.txt instead of leaving it out
                 • It keeps bots away from the less “noteworthy” content of a website, so more time is spent indexing the important/relevant content
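
                 A short sketch of how a well-behaved bot checks these rules, using Python's built-in robots.txt parser; the example rules are invented.

                 from urllib.robotparser import RobotFileParser

                 # Rules a site might publish at http://example.com/robots.txt
                 rules = [
                     "User-agent: *",
                     "Disallow: /private/",
                     "Disallow: /tmp/",
                 ]

                 rp = RobotFileParser()
                 rp.parse(rules)

                 print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))      # True
                 print(rp.can_fetch("MyCrawler", "http://example.com/private/a.html"))  # False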

                  C.2.3 Outline the principles of searching algorithms used by search engines


                   The best-known searching algorithms are PageRank and the HITS algorithm, but it is important to know that most search engines include various other factors as well, e.g.

                  • the time that a page has existed
                  • the frequency of the search keywords on the page
                  • other unknown factors (undisclosed)

                   For the following descriptions the terms “inlinks” and “outlinks” are used. Inlinks are links that point to the page in question, i.e. if page W has an inlink, there is a page Z containing the URL of page W. Outlinks are links that point from the page in question to a different page, i.e. if page W has an outlink, it is a URL of another page, e.g. page Z.

                  PageRank algorithm

                  PageRank works by counting the number and quality of inlinks of a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

                  The value of an in-link from a page A is proportional to

                  P(A) / C(A)
                  

                  Where P(A) is the PageRank score of page A, and C(A) is the number of out-links page A has.

                  As mentioned it is important to note that there are many other factors considered. For instance, the anchor text of a link is often far more important than its PageRank score.

                  Summary:

                  • Pages are given a score (rank)
                  • Rank determines the order in which pages appear
                  • Incoming links add value to a page
                  • The importance of an inlink depends on the PageRank (score) of the linking page
                   • PageRank counts links per page and determines which pages are most important
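
                   A minimal sketch of the iterative calculation behind the P(A)/C(A) formula above, run on a tiny invented link graph; the damping factor of 0.85 is the value commonly quoted from the original PageRank paper.

                   # Invented link graph: page -> pages it links to (its out-links).
                   links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

                   damping, n = 0.85, len(links)
                   rank = {page: 1 / n for page in links}           # start with equal scores

                   for _ in range(50):                              # iterate until the scores settle
                       new_rank = {}
                       for page in links:
                           # Each inlink from page q contributes P(q) / C(q).
                           incoming = sum(rank[q] / len(outs) for q, outs in links.items() if page in outs)
                           new_rank[page] = (1 - damping) / n + damping * incoming
                       rank = new_rank

                   print({p: round(r, 3) for p, r in rank.items()})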

                  HITS algorithm

                   Based on the idea that keywords are not everything that matters: there are sites that might be more relevant even if they do not contain the most keywords. It introduces the idea of two different types of pages, authorities and hubs.

                  Authorities: A page is called an authority, if it contains valuable information and if it is truly relevant for the search query. It is assumed that such a page has a high number of in-links.

                  Hubs: These are pages that are relevant for finding authorities. They contain useful links towards them. It is therefore assumed that these pages have a high number of out-links.

                   The algorithm is based on mathematical graph theory, where a page is represented by a vertex and links between pages are represented by directed edges.
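
                   A minimal sketch of the hub/authority updates on a tiny invented graph: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with normalisation after each round.

                   # Invented link graph restricted to pages matching a query.
                   links = {"A": ["C", "D"], "B": ["C"], "C": ["D"], "D": []}

                   hub = {p: 1.0 for p in links}
                   auth = {p: 1.0 for p in links}

                   for _ in range(20):
                       # Authority: sum of the hub scores of pages that link here (inlinks).
                       auth = {p: sum(hub[q] for q, outs in links.items() if p in outs) for p in links}
                       # Hub: sum of the authority scores of the pages this page links to (outlinks).
                       hub = {p: sum(auth[t] for t in links[p]) for p in links}
                       # Normalise so the scores stay bounded.
                       a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1
                       h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1
                       auth = {p: v / a_norm for p, v in auth.items()}
                       hub = {p: v / h_norm for p, v in hub.items()}

                   print("authorities:", {p: round(v, 2) for p, v in auth.items()})
                   print("hubs:", {p: round(v, 2) for p, v in hub.items()})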

                    C.2.2 Distinguish between the surface web and the deep web


                    Surface web

                    The surface web is the part of the web that can be reached by a search engine. For this, pages need to be static and fixed, so that they can be reached through links from other sites on the surface web. They also need to be accessible without special configuration. Examples include Google, Facebook, Youtube, etc.

                    Summary:

                    • Pages that are reachable (and indexed) by a search engine
                    • Pages that can be reached through links from other sites in the surface web
                    • Pages that do not require special access configurations

                    Deep web

                     The deep web is the part of the web that is not searchable by normal search engines. Reasons for this include proprietary content that requires authentication or VPN access, e.g. private social media and emails; commercial content that is protected by paywalls, e.g. online newspapers and academic research databases; protected personal information, e.g. bank information and health records; and dynamic content. Dynamic content is usually the result of some query, where data are fetched from a database.

                    Summary:

                    • Pages not reachable by search engines
                    • Substantially larger than the surface web
                    • Common characteristics:
                      • Dynamically generated pages, e.g. through queries, JavaScript, AJAX, Flash
                      • Password protected pages, e.g. emails, private social media
                       • Paywalls, e.g. online newspapers, academic research databases
                       • Personal information, e.g. health records
                      • Pages without any incoming links

                      Distinguish between the surface web and the deep web 2

                      [ TOK: Data is always accessible? ]

                      Q What's an example of a site on the surface web, deep web and dark web?

                      Q How can you access the dark web? (We can't install/configure/access it from school)

                      Q What are some ethical/moral uses of the dark web? Are they justified?

                    C.2.1 Define the term search engine


                    "A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler."   Source Wiki

                    C.1.3 HTTP(S), HTML, URL, XML, XSLT, JS & CSS


                    HTTP – Hypertext Transfer Protocol

                    • Application layer protocol from the Internet Protocol suite to transfer and exchange hypermedia
                    • request-response protocol based on client-server model
                     • a user agent (e.g. a web browser) requests a resource from a server through a URL, and the web server gives a response
                     • different HTTP request methods, e.g. for retrieving or submitting data (GET and POST); see the sketch below
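
                     As a small illustration of the request-response cycle, here is a sketch of a GET request using Python's standard library; example.com is just a placeholder host.

                     from urllib.request import urlopen

                     # The user agent sends a GET request for the resource identified by the URL...
                     response = urlopen("http://example.com/")

                     # ...and the server replies with a status code, headers and a body.
                     print(response.status)                    # e.g. 200
                     print(response.headers["Content-Type"])   # e.g. text/html; charset=UTF-8
                     print(response.read(80))                  # first bytes of the HTML body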

                    HTTPS – Hypertext Transfer Protocol Secure

                    • Based on HTTP
                    • Adds an additional security layer of SSL or TLS
                    • ensures authentication of website by using digital certificates
                    • ensures integrity and confidentiality through encryption of communication
                     • an observer can still see the IP address and port number of the web server, which is why HTTPS websites can still be blocked (e.g. in China)

                     Think about why China would block sites that use SSL

                    HTML – Hypertext Mark-up Language

                    • semantic markup language
                    • standard language for web documents
                    • uses elements enclosed by tags to markup a document

                    URL – Uniform Resource Locator

                    • unique string that identifies a web resource
                    • reference to a web resource
                    • primarily used for HTTP, but also for other protocols like FTP or email (mailto)
                    • includes name AND access method (e.g. ‘http://’)
                    • serves as a mechanism to retrieve a resource
                    • follows a specific syntax
                       scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
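
                     Python's standard library can split a URL into those components; the URL below is an invented example.

                     from urllib.parse import urlparse

                     parts = urlparse("https://user:pass@www.example.com:8080/docs/page?topic=seo#top")
                     print(parts.scheme)    # https
                     print(parts.netloc)    # user:pass@www.example.com:8080
                     print(parts.hostname)  # www.example.com
                     print(parts.port)      # 8080
                     print(parts.path)      # /docs/page
                     print(parts.query)     # topic=seo
                     print(parts.fragment)  # top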

                    XML – Extensible Mark-up Language

                    • markup language with a set of rules defining how to encode a document
                    • human-readable
                    • similar to HTML in using tags
                    • used for representation of arbitrary data structures
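
                     A small sketch of representing and reading an arbitrary data structure as XML with Python's standard library; the document and its tag names are invented.

                     import xml.etree.ElementTree as ET

                     doc = ("<library>"
                            "<book isbn='123'><title>Web Science</title><year>2016</year></book>"
                            "<book isbn='456'><title>Search Engines</title><year>2011</year></book>"
                            "</library>")

                     root = ET.fromstring(doc)
                     for book in root.findall("book"):
                         print(book.get("isbn"), book.find("title").text, book.find("year").text)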

                     XSLT – Extensible Stylesheet Language Transformations

                     • styling language for XML, used for data presentation and transformation
                     • data presentation means displaying data in some format/medium; it is about style
                     • data transformation means parsing a source tree of XML nodes and transforming it into a different result tree
                     • XSLT can be used to transform XML files into other XML files, HTML, PDF, PNG and other formats

                    JavaScript

                    • interpreted programming language
                    • core technology of most websites with HTML and CSS
                    • high-level, dynamic and untyped; therefore relatively easy for beginners
                     • allows developers to dynamically manipulate the content of HTML documents
                    • makes websites dynamic

                    CSS – Cascading style sheet

                    • style sheet language to describe the presentation of a mark-up document, usually HTML
                    • used to create better designed websites
                     • intended to separate content (HTML) from presentation (CSS)
                     • it uses selectors to describe particular elements of a document and assigns them properties that define things ranging from font color to page position

                    URI – Uniform Resource Identifier

                    • more general definition than URL
                    • a string serving as an identifier for some resource(document, image, mailbox, video, files, etc.)

                     Figure 2: The Difference Between URLs and URIs (Daniel Miessler)