Web Science

C.1.2 Evolving Web

C.1.2 Describe how the web is constantly evolving


The beginnings of the web (Web 1.0, Web of content)

The World Wide Web started around 1990/91 as a system of servers connected over the Internet that deliver static documents formatted as Hypertext Markup Language (HTML) files. These documents support links to other documents as well as multimedia such as graphics, video or audio. In the beginning, documents consisted mainly of static information and text; multimedia was added later. Some experts describe this as a “read-only web”, because users mostly searched and read information, while there was little user interaction or content contribution.

Web 2.0 – “Web of the Users”

However, the web started to evolve towards the delivery of more dynamic documents, enabling user interaction and even content contribution. The appearance of blogging platforms such as Blogger in 1999 marks a rough birth date for Web 2.0. Continuing the model from before, this was the evolution to a “read-write” web. It opened new possibilities and led to new concepts such as blogs, social networks and video-streaming platforms. Web 2.0 can also be looked at from the perspective of the websites themselves becoming more dynamic and feature-rich. For instance, improved design, JavaScript and dynamic content loading could be considered Web 2.0 features.

Web 3.0 – “Semantic Web”

The Internet, and thus the World Wide Web, is constantly developing and evolving in new directions, and while the changes described for Web 2.0 are clear to us today, the definition of Web 3.0 is not definitive yet. Continuing the read/read-write description from earlier, it might be argued that Web 3.0 will be the “read-write-execute” web. One interpretation of this is that the web enables software agents to work with documents by using semantic markup. This allows for smarter searches and the presentation of relevant data fitting the context. This is why Web 3.0 is sometimes called the semantic executive web.

But what does this mean?

It is about user input becoming more meaningful, more semantic: users attach tags or other kinds of data to their documents, which allows software agents to work with the input, e.g. to make it more searchable. The idea is to better connect information that is semantically related.
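To make this more concrete, here is a small sketch (illustrative only: the event data and the schema.org-style property names are invented) of what such semantic tags can look like and how a simple Python “software agent” could read them.

from html.parser import HTMLParser

# A hypothetical snippet of semantically tagged HTML (schema.org-style microdata).
DOCUMENT = """
<div itemscope itemtype="https://schema.org/Event">
  <span itemprop="name">Web Science Lecture</span>
  <time itemprop="startDate" datetime="2016-05-04T10:00">4 May, 10:00</time>
  <span itemprop="location">Room 2.01</span>
</div>
"""

class SemanticTagReader(HTMLParser):
    """A toy 'software agent' that collects itemprop/value pairs."""

    def __init__(self):
        super().__init__()
        self._current_prop = None
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None

reader = SemanticTagReader()
reader.feed(DOCUMENT)
print(reader.properties)   # machine-readable view of the tagged content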

Later developments

However, it might also be argued that Web 3.0 is what some people call the Internet of Things, which is basically connecting everyday devices to the internet to make them smarter. In some way this also fits the read-write-execute model, as it allows the user to control real-life actions on a device over the internet. Either way, the web keeps evolving, and the following image provides a good overview and an idea of where the web is heading.

The Web Expansion (TrendOne, 2008)

C.1.1 Internet & Web

Evolution of the web, different protocols and web technologies. Difference between static and dynamic web pages. External data sources. Role of the browser.

C.1.1 Distinguish between the Internet and World Wide Web



The Internet

Tech:

• A network of networks (the networking infrastructure)

Many people label the WWW and the Internet as the same thing. However, the Internet connects many different computers together, giving people the ability to exchange data with one another, such as news, pictures or even videos.

Non-Tech: The Internet can be thought of as the hardware or the operator, while the WWW (World Wide Web) is like the operating system running on it. The difference between the Internet and the WWW is that without the Internet there won't be a WWW; the WWW needs the Internet to operate.


World Wide Web (www):

Tech: The World Wide Web, also known as “WWW”, is a part of the Internet that uses web browsers to share information across the globe via hyperlinks.

Non-Tech: “WWW” is short for World Wide Web. It is accessed using browsers such as Google Chrome and Firefox to view information online. It is essentially software that allows us to connect to other people around the world.

The Internet is the global network of networks of computers: computers, cables and wireless connections governed by the Internet Protocol (IP), which deals with data and packets. The World Wide Web, also known as the Web, is one set of software running on the Internet: a collection of web pages, files and folders connected through hyperlinks and URLs. The Internet is the hardware part and the Web is the software part; therefore the Web relies on the Internet to run, but not vice versa. Besides the WWW, other examples of services running on the Internet include VoIP and mail, which have their own protocols.
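To make the layering concrete, here is a minimal Python sketch (using example.com purely as a placeholder host): resolving a host name to an IP address is an Internet-level operation, while fetching a page over HTTP is a Web-level operation built on top of it.

import socket
import urllib.request

# Internet level: use DNS/IP to find the address of a host on the network of networks.
ip_address = socket.gethostbyname("example.com")
print("Internet (IP) level:", ip_address)

# Web level: use HTTP (one service running on top of the Internet) to fetch a hyperlinked resource.
with urllib.request.urlopen("http://example.com/") as response:
    page = response.read()
print("Web (HTTP) level:", len(page), "bytes of HTML")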




Q) Distinguish between the internet and World Wide Web (web)

(Distinguish: Make clear the differences between two or more concepts or items)

C.2.5 – C.2.12

C.2.5 Discuss the relationship between data in a meta tag and how it is accessed by a web-crawler


    The answer depends on the specific crawler, but generally speaking:

    • The title tag, while not strictly a meta-tag, is what is shown in the search results via the indexer
    • The description meta-tag provides the indexer with a short description of the page
    • The keywords meta-tag provides the indexer with keywords describing the page

    While meta-tags used to play a role in ranking, they have been abused by so many pages that most search engines no longer consider them for ranking.

    Crawlers now mostly use meta-tags to compare the keywords and description to the content of the page and give it a certain weight. So while meta-tags no longer play the big role they used to, it is still important to include them.
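
    As a rough illustration of how a crawler’s parser might read these tags, here is a minimal Python sketch using the standard-library html.parser module; the sample HTML is invented for the example.

    from html.parser import HTMLParser

    SAMPLE_PAGE = """
    <html><head>
      <title>Web Science Notes</title>
      <meta name="description" content="Revision notes for the Web Science case study">
      <meta name="keywords" content="web science, crawler, indexing">
    </head><body>...</body></html>
    """

    class MetaTagExtractor(HTMLParser):
        """Collects the title text and the name/content pairs of meta-tags."""

        def __init__(self):
            super().__init__()
            self.meta = {}
            self.title = ""
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and "name" in attrs:
                self.meta[attrs["name"]] = attrs.get("content", "")
            elif tag == "title":
                self._in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data

    parser = MetaTagExtractor()
    parser.feed(SAMPLE_PAGE)
    print(parser.title)   # shown in the results listing
    print(parser.meta)    # description/keywords available to the indexer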

    Resources

    http://www.wordstream.com/meta-tags

    C.2.7 Outline the purpose of web-indexing in search engines


    Search engines index websites in order to respond to search queries with relevant information as quickly as possible. For this reason, the search engine stores information about indexed web pages, e.g. keywords, titles or descriptions, in its database. This way it can quickly identify pages relevant to a search query.

    Indexing has the additional purpose of giving a page a certain weight, as described under the search algorithms. This way search results can be ranked after being indexed.
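
    A common way to picture this is an inverted index that maps each keyword to the pages containing it; the following minimal Python sketch (with made-up URLs and page texts) shows the idea.

    # Tiny inverted index: keyword -> set of pages containing that keyword.
    # Page contents and URLs are invented for the example.
    pages = {
        "http://example.com/pagerank": "pagerank links ranking search",
        "http://example.com/crawler": "crawler robots links indexing",
        "http://example.com/seo": "seo ranking keywords search",
    }

    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)

    # Answering a query is now a fast lookup instead of scanning every page.
    print(index.get("ranking", set()))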

    C.2.8-9 Suggest how developers can create pages that appear more prominently in search engine results. Describe the different metrics used by search engines.


    The process of making pages appear more prominently in search engine results is called search engine optimisation (SEO). There are many different techniques, which are considered in section C.2.11. This field is a big aspect of web marketing, because search engines do not disclose exactly how they work, which makes it hard for developers to optimise pages perfectly.

    In order to assess the web presence of a website, several different metrics can be used.

    Metrics

    • Search engine share of referring visits: how the page was accessed (through direct access, referring pages or search engine results); can indicate how meaningful the traffic is
    • Search engine referral: different search engines have different market shares; knowing which search engine traffic comes from helps to find potential improvements for specific search engines
    • Search terms and phrases: identify the most common search keywords and optimise for them
    • Conversion rate by search phrase/term: the percentage of users arriving via a search term who convert (e.g. sign up)
    • Number of pages receiving traffic from search engines: as large websites consist of many pages, it is important to see whether individual pages are being reached through search engine results
    • Time taken: time spent by a user on a page after arriving from the search engine; an indicator of how relevant the page is and which resources were accessed
    • Number of hits: a hit is counted whenever a page is downloaded; this gives a rough idea of the traffic to the page
    • Quality of returns: how well a site is placed within the search results, i.e. how high it is ranked by search engines
    • Quantity of returns: how many of the site's pages are indexed by a search engine

    Parameters Search Engines use to compare

    • Relevance:
      • Is determined by algorithms such as PageRank, which evaluate the quality of websites and place high-quality pages higher in the index
      • The bigger the index, the more pages with relevance to each query the search engine can return
    • User experience:
      • Search engines look to find the “best” results for the searcher and part of this is the user experience a site provides. This includes ease of use, navigation; direct and relevant information; professional, modern and compatible design; high-quality, legitimate and credible content

    C.2.6 Discuss the use of parallel web-crawling


      • The size of the web keeps growing, increasing the time it would take a single process to download all pages
      • To keep crawling times reasonable, “it becomes imperative to parallelize the crawling process” (Stanford)

      Advantages

      • Scalability: as the web grows, a single process cannot handle everything; multi-threaded/parallel processing can solve the problem
      • Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load
      • Network load reduction

      Issues of parallel web crawling

      • Overlapping: parallel web crawlers might index the same page multiple times (the sketch below avoids this with a shared set of visited pages)
      • Quality: if a crawler wants to download “important” pages first, this might not work in a parallel process
      • Communication bandwidth: parallel crawlers need to communicate with each other for the reasons above, which for many processes can take up significant bandwidth
      • Server load: if parallel crawlers request pages from the same server too frequently over a short time, they can overload it
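
      The following minimal Python sketch illustrates parallel crawling with a small thread pool; the seed URL is a placeholder and the link extraction is deliberately crude, so treat it as a sketch rather than a real crawler. A shared, locked set of visited URLs prevents the overlap problem mentioned above.

      import re
      import threading
      import urllib.request
      from concurrent.futures import ThreadPoolExecutor

      visited = set()                  # shared between crawler threads
      visited_lock = threading.Lock()  # protects the shared set (avoids overlapping)

      LINK_PATTERN = re.compile(r'href="(http[^"]+)"')  # crude link extraction for the sketch

      def crawl(url, depth):
          """Download one page, record it, and return the links found on it."""
          with visited_lock:
              if url in visited or depth > 1:
                  return []
              visited.add(url)
          try:
              with urllib.request.urlopen(url, timeout=5) as response:
                  html = response.read().decode("utf-8", errors="replace")
          except OSError:
              return []
          return LINK_PATTERN.findall(html)

      def parallel_crawl(seed_urls):
          frontier = [(url, 0) for url in seed_urls]
          with ThreadPoolExecutor(max_workers=4) as pool:
              while frontier:
                  # Crawl the whole frontier in parallel, then build the next frontier.
                  results = list(pool.map(lambda item: crawl(*item), frontier))
                  next_depth = frontier[0][1] + 1
                  frontier = [(link, next_depth) for links in results for link in links]

      parallel_crawl(["http://example.com/"])   # placeholder seed URL
      print(len(visited), "pages visited")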

      C.2.1 – C.2.5

      C.2.2 Distinguish between the surface web and the deep web


      Surface web

      The surface web is the part of the web that can be reached (and indexed) by a search engine. For this, pages need to be static and fixed, so that they can be reached through links from other sites on the surface web, and they must be accessible without any special configuration. Examples include Google, Facebook, YouTube, etc.

      Summary:

      • Pages that are reachable (and indexed) by a search engine
      • Pages that can be reached through links from other sites in the surface web
      • Pages that do not require special access configurations

      Deep web

      The deep web is the part of the web that is not searchable by normal search engines. Reasons for this include: proprietary content that requires authentication or VPN access, e.g. private social media, email; commercial content protected by paywalls, e.g. online newspapers, academic research databases; protected personal information, e.g. bank information, health records; and dynamic content. Dynamic content is usually the result of a query, where data is fetched from a database on demand.

      Summary:

      • Pages not reachable by search engines
      • Substantially larger than the surface web
      • Common characteristics:
        • Dynamically generated pages, e.g. through queries, JavaScript, AJAX, Flash
        • Password-protected pages, e.g. email, private social media
        • Paywalled content, e.g. online newspapers, academic research databases
        • Personal information, e.g. health records
        • Pages without any incoming links

      C.2.3 Outline the principles of searching algorithms used by search engines


      The best-known search algorithms are PageRank and the HITS algorithm, but it is important to note that most search engines take various other factors into account as well, e.g.

      • the time that a page has existed
      • the frequency of the search keywords on the page
      • other unknown factors (undisclosed)

      For the following descriptions the terms “inlinks” and “outlinks” are used. Inlinks are links that point to the page in question, i.e. if page W has an inlink, there is a page Z containing the URL of page W. Outlinks are links that point from the page in question to a different page, i.e. if page W has an outlink, it contains the URL of another page, e.g. page Z.

      PageRank algorithm

      PageRank works by counting the number and quality of inlinks of a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

      The value of an inlink from a page A is proportional to

      P(A) / C(A)

      where P(A) is the PageRank score of page A and C(A) is the number of outlinks page A has.

      As mentioned, it is important to note that many other factors are considered as well. For instance, the anchor text of a link is often far more important than its PageRank score.

      Summary:

      • Pages are given a score (rank)
      • Rank determines the order in which pages appear
      • Inlinks add value to a page
      • The importance of an inlink depends on the PageRank (score) of the linking page
      • PageRank counts links per page and determines which pages are most important
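
      For illustration, here is a minimal Python sketch of the iterative PageRank computation on a tiny, made-up link graph; the damping factor of 0.85 is the commonly quoted value, and the graph itself is invented for the example.

      # Tiny made-up web: page -> list of pages it links to (outlinks).
      graph = {
          "A": ["B", "C"],
          "B": ["C"],
          "C": ["A"],
          "D": ["C"],
      }

      damping = 0.85
      pages = list(graph)
      rank = {page: 1.0 / len(pages) for page in pages}   # start with equal scores

      for _ in range(50):   # iterate until the scores settle
          new_rank = {page: (1 - damping) / len(pages) for page in pages}
          for page, outlinks in graph.items():
              share = rank[page] / len(outlinks)   # each page passes P(A)/C(A) to its outlinks
              for target in outlinks:
                  new_rank[target] += damping * share
          rank = new_rank

      print(sorted(rank.items(), key=lambda item: -item[1]))   # C should come out on top

      In this toy graph page C has the most inlinks, so it ends up with the highest score, matching the idea that inlinks add value to a page.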

      HITS algorithm

      HITS is based on the idea that keywords are not everything that matters: some sites might be more relevant even if they do not contain the most keywords. It introduces the idea of two different types of pages, authorities and hubs.

      Authorities: A page is called an authority if it contains valuable information and is truly relevant to the search query. It is assumed that such a page has a high number of inlinks.

      Hubs: These are pages that are relevant for finding authorities, as they contain many useful links towards them. It is therefore assumed that these pages have a high number of outlinks.

      The algorithm is based on graph theory: each page is represented by a vertex, and each link between pages by a directed edge.

        Figure 1: A simple graph

        The algorithm starts by creating a graph:

        • It first finds the top 200 pages based on the occurrence of keywords from the query. Let's call this set of pages RQ
        • It then finds all pages that link to pages in RQ and all pages that RQ links to (i.e. all inlinks and outlinks of RQ). Together with RQ this makes up the set SQ
        • The algorithm gives each page in the set SQ a hub weight and an authority weight, based on how many good hubs link towards it (authority weight) and how many good authorities it links to (hub weight); the two weights are refined iteratively
        • The algorithm then ranks the pages based on these weights (a minimal sketch of the update step follows below)
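
        Here is a minimal, illustrative Python sketch of the iterative hub/authority update on a small invented base set SQ; the scores are normalised each round so they stay comparable, which is a simplification of the usual normalisation.

        # Invented base set SQ: page -> pages it links to.
        graph = {
            "hub1": ["auth1", "auth2"],
            "hub2": ["auth1", "auth2", "auth3"],
            "auth1": [],
            "auth2": ["auth1"],
            "auth3": [],
        }

        hub = {page: 1.0 for page in graph}
        auth = {page: 1.0 for page in graph}

        for _ in range(20):
            # Authority weight: sum of the hub weights of the pages linking to it.
            auth = {page: sum(hub[p] for p in graph if page in graph[p]) for page in graph}
            # Hub weight: sum of the authority weights of the pages it links to.
            hub = {page: sum(auth[target] for target in graph[page]) for page in graph}
            # Normalise so the weights do not grow without bound.
            auth_total = sum(auth.values()) or 1.0
            hub_total = sum(hub.values()) or 1.0
            auth = {page: weight / auth_total for page, weight in auth.items()}
            hub = {page: weight / hub_total for page, weight in hub.items()}

        print("authorities:", sorted(auth, key=auth.get, reverse=True))
        print("hubs:", sorted(hub, key=hub.get, reverse=True))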

        C.2.4 Describe how a web-crawler functions.


        A web crawler, also known as a web spider, web robot or simply bot, is a program that browses the web in a methodical, automated manner. For each page it finds, it downloads a copy to be indexed. In the process it extracts all links from that page and then repeats the procedure for every link found. This way it tries to discover as many pages as possible.
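
        The following minimal Python sketch (with a placeholder seed URL and deliberately crude link extraction) shows this download, extract and repeat cycle using a queue of pages still to visit and a set of pages already seen.

        from collections import deque
        import re
        import urllib.request

        LINK_PATTERN = re.compile(r'href="(http[^"]+)"')   # crude link extraction for the sketch

        def index_page(url, html):
            print("indexed", url, len(html), "bytes")       # stand-in for a real indexer

        def crawl(seed, max_pages=20):
            frontier = deque([seed])   # pages still to visit
            seen = set()               # pages already visited, to avoid loops
            while frontier and len(seen) < max_pages:
                url = frontier.popleft()
                if url in seen:
                    continue
                seen.add(url)
                try:
                    with urllib.request.urlopen(url, timeout=5) as response:
                        html = response.read().decode("utf-8", errors="replace")
                except OSError:
                    continue
                index_page(url, html)                        # download a copy and index it
                frontier.extend(LINK_PATTERN.findall(html))  # repeat the process for every link found
            return seen

        crawl("http://example.com/")   # placeholder seed URL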

        Limitations:

        • They might look at metadata contained in the head of web pages, but this depends on the crawler
        • A crawler might not be able to read pages with dynamic content, as crawlers are typically very simple programs

        Robots.txt

        Issue: a crawler consumes resources, and a site might not wish to be “crawled”. For this reason “robots.txt” files were created, in which a site states what should be indexed and what should not.

        • A file that contains directives specifying which pages of a website must not be crawled by search engine bots
        • The file is placed in the root directory of the site
        • The standard for robots.txt is called the “Robots Exclusion Protocol”
        • Directives can be specific to one particular web crawler, or apply to all crawlers
        • Not all bots follow this standard (malicious bots, malware); such “illegal” bots can simply ignore robots.txt
        • It is still considered better to include a robots.txt than to leave it out
        • It keeps bots away from the less “noteworthy” content of a website, so more time is spent indexing the important/relevant content (a minimal example follows below)
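
        As an illustration, the sketch below uses Python's standard urllib.robotparser module to check what a polite crawler may fetch; the robots.txt directives, bot names and URLs are invented for the example.

        import urllib.robotparser

        # An invented robots.txt: one directory is off-limits for all crawlers,
        # and one (hypothetical) misbehaving bot is blocked entirely.
        ROBOTS_LINES = [
            "User-agent: *",
            "Disallow: /private/",
            "",
            "User-agent: BadBot",
            "Disallow: /",
        ]

        parser = urllib.robotparser.RobotFileParser()
        parser.parse(ROBOTS_LINES)

        # A well-behaved crawler checks before fetching a page.
        print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))      # True
        print(parser.can_fetch("MyCrawler", "http://example.com/private/a.html"))  # False
        print(parser.can_fetch("BadBot", "http://example.com/index.html"))         # False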

        C.1

        Evolution of the web, different protocols and web technologies. Difference between static and dynamic web pages. External data sources. Role of the browser.

        C.1.1 Distinguish between the Internet and World Wide Web


        Internet

        • An interconnected set of networks and computers
        • Permits transfer of data (e.g. WWW, email, P2P, VoIP, FTP)
        • Permits delivery of services
        • Data transfer is governed by protocols (TCP/IP)

        World Wide Web

        • A set of hypertext-linked resources
        • Resources are identified by URIs (Uniform Resource Identifiers)
        • Resources are transferred between client and server via the Internet
        • Resources can be read using a browser
