
C.1.1 Distinguish between the Internet and World Wide Web

World Wide Web (www):

Tech: The World Wide Web, also known as “www”, is a part of the internet; it uses web browsers and hyperlinks to share information across the globe.

Non-Tech: “WWW” is short for World Wide Web. We use browsers such as Google Chrome and Firefox to access its information online. It is essentially software that lets us connect to people and information around the world.

The Internet is the global network of networks of computers: computers, cables and wireless connections governed by the Internet Protocol (IP), which deals with data packets. The World Wide Web, also known as the Web, is one set of software services running on the Internet: a collection of web pages, files and folders connected through hyperlinks and URLs. Put simply, the Internet is the hardware part and the Web is the software part; therefore the Web relies on the Internet to run, but not vice versa. In addition to the WWW, other services such as VoIP and email run on the Internet with their own protocols.

The Internet

Tech

a network of networks (the underlying network infrastructure)

Many people label the WWW and the internet as the same thing. However, the internet is what connects many different computers together, giving people the ability to exchange data with one another, such as news, pictures or even videos.

A simple analogy: the internet is the hardware or operator, while the WWW is like an operating system running on top of it. The difference between the internet and the WWW is that without the internet there would be no WWW; the WWW needs the internet to operate.

C.1.1 Question Section

Q) Distinguish between the internet and the World Wide Web (web). (Command term “distinguish”: make clear the difference between two or more items/concepts.)

Past Questions

C.1.2 Describe how the web is constantly evolving

The beginnings of the web (Web 1.0 , Web of content)

The world wide web started around 1990/91 as a system of servers connected over the internet that deliver static documents. These documents are formatted as hypertext markup language (HTML) files, which support links to other documents as well as multimedia such as graphics, video or audio. In the beginnings of the web, these documents consisted mainly of static information and text, with multimedia added later. Some experts describe this as a “read-only web”, because users mostly searched and read information, while there was little user interaction or content contribution.

Web 2.0 – “Web of the Users”

However, the web started to evolve towards the delivery of more dynamic documents, enabling user interaction and even allowing content contribution. The appearance of blogging platforms such as Blogger in 1999 gives a time mark for the birth of Web 2.0. Continuing the model from before, this was the evolution to a “read-write” web. It opened new possibilities and led to new concepts such as blogs, social networks and video-streaming platforms. Web 2.0 can also be looked at from the perspective of the websites themselves becoming more dynamic and feature-rich. For instance, improved design, JavaScript and dynamic content loading could be considered Web 2.0 features.

Web 3.0 – “Semantic Web”

The internet, and thus the world wide web, is constantly developing and evolving in new directions, and while the changes described for Web 2.0 are clear to us today, the definition of Web 3.0 is not definitive yet. Continuing the read-only to read-write description from earlier, it might be argued that Web 3.0 will be the “read-write-execute” web. One interpretation of this is that the web enables software agents to work with documents by using semantic markup. This allows for smarter searches and the presentation of relevant data fitting the context. This is why Web 3.0 is sometimes called the semantic executing web.

But what does this mean?

It’s about user input becoming more meaningful, more semantic: users add tags or other kinds of metadata to their documents that allow software agents to work with the input, e.g. to make it more searchable. The idea is to be able to better connect information that is semantically related.

Later developments

However, it might also be argued that Web 3.0 is what some people call the Internet of Things, which is basically connecting everyday devices to the internet to make them smarter. In some way this also fits the read-write-execute model, as it allows the user to control a real-life action on a device over the internet. Either way, the web keeps evolving.

Video Section

The Web Expansion (TrendOne, 2008)

C.1.4 Identify the characteristics of the following: • uniform resource identifier (URI) • URL

URIs are a standard for identifying documents using a short string of numbers, letters, and symbols. They are defined by RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. URLs, URNs, and URCs are all types of URI.

URL – Uniform Resource Locator

  • Contains information about how to fetch a resource from its location. For example:
  • http://example.com/mypage.html
  • ftp://example.com/download.zip
  • mailto:user@example.com
  • file:///home/user/file.txt
  • tel:1-888-555-5555
  • http://example.com/resource?foo=bar#fragment
  • /other/link.html (A relative URL, only useful in the context of another URL)
  • URLs always start with a scheme (usually a protocol such as http) and usually contain information such as the network host name (example.com) and often a document path (/foo/mypage.html). URLs may have query parameters and fragment identifiers (see the parsing sketch below, after the URN examples).

URN – Uniform Resource Name

    Identifies a resource by a unique and persistent name, but doesn't necessarily tell you how to locate it on the internet. It usually starts with the prefix urn: For example:

  • urn:isbn:0451450523 to identify a book by its ISBN number.
  • urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66 a globally unique identifier
  • urn:publishing:book - An XML namespace that identifies the document as a type of book.
  • URNs can identify ideas and concepts. They are not restricted to identifying documents. When a URN does represent a document, it can be translated into a URL by a "resolver". The document can then be downloaded from the URL.
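
A small sketch using Python's standard urllib.parse module to split a URL into the characteristic parts listed above (the URL itself is an invented example):

```python
from urllib.parse import urlparse, parse_qs

# Invented URL chosen to show every characteristic part at once.
url = "http://example.com:8080/other/link.html?foo=bar#fragment"

parts = urlparse(url)
print(parts.scheme)            # 'http'             -> the scheme/protocol
print(parts.hostname)          # 'example.com'      -> network host name
print(parts.port)              # 8080
print(parts.path)              # '/other/link.html' -> document path
print(parse_qs(parts.query))   # {'foo': ['bar']}   -> query parameters
print(parts.fragment)          # 'fragment'         -> fragment identifier
```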

    C.1.4 Question Section

    Question Section Past papers

    C.1.3 HTTP(S), HTML, URL, XML, XSLT, JS & CSS

    HTTPS – Hypertext Transfer Protocol Secure

  • Based on HTTP
  • Adds an additional security layer of SSL or TLS
  • ensures authentication of website by using digital certificates
  • ensures integrity and confidentiality through encryption of communication
  • the IP address and port number of the web server are still visible, so HTTPS sites can still be tracked or blocked (which is why HTTPS websites are also blocked in China); see the sketch below for how a client checks a site's certificate
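
A minimal Python sketch (assuming network access; example.com is just a placeholder host) of what the TLS layer underneath HTTPS provides: the client verifies the site's digital certificate against trusted certificate authorities before any encrypted data is exchanged.

```python
import socket
import ssl

# Placeholder host used for illustration.
HOSTNAME = "example.com"

# Load the system's trusted CA certificates; these are what make
# certificate-based authentication of the website possible.
context = ssl.create_default_context()

with socket.create_connection((HOSTNAME, 443)) as raw_sock:
    # wrap_socket performs the TLS handshake: it verifies the server's
    # certificate and negotiates keys for encrypted communication.
    with context.wrap_socket(raw_sock, server_hostname=HOSTNAME) as tls_sock:
        print("Protocol:", tls_sock.version())     # e.g. TLSv1.3
        cert = tls_sock.getpeercert()
        print("Issued to:", cert["subject"])
        print("Valid until:", cert["notAfter"])
```
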
    HTML – Hypertext Mark-up Language

  • semantic markup language
  • standard language for web documents
  • uses elements enclosed by tags to markup a document
    XML – Extensible Mark-up Language

  • markup language with a set of rules defining how to encode a document
  • human-readable
  • similar to HTML in using tags
  • used for representation of arbitrary data structures
    XSLT – Extensible Stylesheet Language Transformations

  • language for transforming XML documents into other formats, e.g. HTML, plain text or other XML
  • an XSLT stylesheet is itself an XML document
  • uses template rules and XPath expressions to select which parts of the source document to transform
  • commonly used to present the data in an XML document as a web page
    JavaScript

  • interpreted programming language
  • core technology of most websites with HTML and CSS
  • high-level, dynamic and loosely typed; therefore relatively easy for beginners to pick up
  • allows web pages to dynamically manipulate the content of HTML documents (via the DOM)
    CSS – Cascading Style Sheets

  • style sheet language to describe the presentation of a mark-up document, usually HTML
  • used to create better designed websites
  • intended to separate content (HTML) from presentation (CSS)
  • it uses selectors to target particular elements of a document and assigns them properties that define everything from font color to page position
    C.1.8 Outline the different components of a web page

    A web page can contain a variety of components. The basic structure of an HTML document is:

    head

    This is not visible on the page itself, but contains important information about it in the form of metadata.

    title

    The title goes inside the head and is usually displayed at the top of the browser window or in the browser tab.

    meta tags

    There are various types of meta tags, which can give search engines information about the page, but are also used for other purposes, such as to specify the charset used.

    body

    The main part of the page document. This is where all the (visible) content goes.

    Some other typical components:

    Navigation bar

    Usually a collection of links that helps to navigate the website, typically placed at the top of the page or behind a “hamburger” menu on mobile.

    Hyperlinks

    A hyperlink is a reference to another web page.

    Table Of Contents

    Might be contained in a sidebar and is used for navigation and orientation within the website.

    Banner

    Area at the top of a web page linking to other big topic areas.

    Sidebar

    Usually used for a table of contents or navigation bar.

    C.2.1 Define the term search engine

    A search engine is a program that allows a user to search for information, normally on the web.

    C.2.2 Distinguish between the surface web and the deep web

    Surface Web

    The surface web is the part of the web that can be reached by a search engine. For this, pages need to be static and fixed, so that they can be reached through links from other sites on the surface web. They also need to be accessible without special configuration. Examples include Google, Facebook, Youtube, etc.

  • Pages that are reachable (and indexed) by a search engine
  • Pages that can be reached through links from other sites in the surface web
  • Pages that do not require special access configurations
    Deep web

    The deep web is the part of the web that is not searchable by normal search engines. Reasons for this include proprietary content that requires authentication or VPN access, e.g. private social media and emails; commercial content that is protected by paywalls, e.g. online newspapers and academic research databases; personal information that is protected, e.g. bank information and health records; and dynamic content. Dynamic content is usually the result of some query, where data are fetched from a database.

  • Pages not reachable by search engines
  • Substantially larger than the surface web
  • Common characteristics:
    • Dynamically generated pages, e.g. through queries, JavaScript, AJAX, Flash
    • Password protected pages, e.g. emails, private social media
    • Paywalls, e.g. online newspapers, academic research databases
    • Personal information, e.g. health records
    • Pages without any incoming links
    C.2.3 Outline the principles of searching algorithms used by search engines

    The best-known search algorithms are PageRank and the HITS algorithm, but it is important to know that most search engines include various other factors as well, e.g.

  • the time that a page has existed
  • the frequency of the search keywords on the page
  • other unknown factors (undisclosed)
    For the following description the terms “inlinks” and “outlinks” are used. Inlinks are links that point to the page in question, i.e. if page W has an inlink, there is a page Z containing the URL of page W. Outlinks are links that point to a different page than the one in question, i.e. if page W has an outlink, it is a URL of another page, e.g. page Z.

    PageRank algorithm

    PageRank works by counting the number and quality of inlinks of a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

    As mentioned it is important to note that there are many other factors considered. For instance, the anchor text of a link is often far more important than its PageRank score.

    • Pages are given a score (rank)
    • Rank determines the order in which pages appear
    • Incoming links add value to a page
    • The importance of an inlink depends on the PageRank (score) of the linking page, i.e. its page authority
    • PageRank counts links per page and determines which pages are most important
    • Links from sites that are relevant carry more weight than links from unrelated sites (see the sketch below)
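
A minimal sketch of the basic PageRank iteration on a tiny invented link graph (real search engines combine this kind of score with many other, undisclosed factors):

```python
# Tiny invented link graph: page -> pages it links out to.
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

DAMPING = 0.85          # probability of following a link rather than jumping
pages = list(outlinks)
rank = {p: 1 / len(pages) for p in pages}   # start with equal scores

for _ in range(50):                         # iterate until the scores settle
    new_rank = {}
    for p in pages:
        # Sum the share of rank passed on by every page that links to p.
        incoming = sum(rank[q] / len(outlinks[q])
                       for q in pages if p in outlinks[q])
        new_rank[p] = (1 - DAMPING) / len(pages) + DAMPING * incoming
    rank = new_rank

# Pages with more (and better) inlinks end up with higher scores.
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page C, which has the most inlinks, ends up with the highest score, while page D, which has none, only receives the base score.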

    HITS algorithm

    Based on the idea that keywords are not everything that matters; there are sites that might be more relevant even if they don’t contain the most keywords. It introduces the idea of different types of pages, authorities and hubs.

    Authorities: A page is called an authority, if it contains valuable information and if it is truly relevant for the search query. It is assumed that such a page has a high number of in-links.

    Hubs: These are pages that are relevant for finding authorities. They contain useful links towards them. It is therefore assumed that these pages have a high number of out-links.

    The algorithm is based on mathematical graph theory, where a page is represented by a vertex and links between pages are represented by directed edges.


    Attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts: Hubs point to lots of authorities. Authorities are pointed to by lots of hubs.
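
A minimal sketch of this mutually recursive hub/authority update on a small invented subgraph (in practice the subgraph is first assembled from pages relevant to the query):

```python
# Invented subgraph for a query: page -> pages it links to.
links = {
    "P1": ["P3", "P4"],
    "P2": ["P3"],
    "P3": ["P4"],
    "P4": [],
}

pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority score: sum of hub scores of pages pointing to it.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub score: sum of authority scores of the pages it points to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalise so the scores do not grow without bound.
    a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", {p: round(v, 2) for p, v in auth.items()})
print("hubs:       ", {p: round(v, 2) for p, v in hub.items()})
```

Here P1, which links to both well-linked pages, emerges as the strongest hub, while P3 and P4 emerge as the authorities.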

    C.2.4  Describe how a web-crawler functions

    A web crawler, also known as a web spider, web robot or simply bot, is a program that browses the web in a methodical and automated manner. For each page it finds, a copy is downloaded and indexed. In this process it extracts all links from the given page and then repeats the same process for all found links. This way, it tries to find as many pages as possible.
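
A minimal sketch of that crawl loop using only Python's standard library (the seed URL is a placeholder; a real crawler would also respect robots.txt, apply politeness delays and store what it downloads):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    queue, seen, visited = deque([seed]), {seed}, []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                          # skip pages that fail to download
        visited.append(url)                   # ...index/store the page here...
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

print(crawl("http://example.com/"))           # placeholder seed URL
```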

    Limitations:

    • They might look at meta data contained in the head of web pages, but this depends on the crawler
    • A crawler might not be able to read pages with dynamic content as they are very simple programs

    Robots.txt

    Robots.txt can stop bots from using up bandwidth, saving bandwidth and reducing the time crawlers spend on the site.

    Issue: A crawler consumes resources and a page might not wish to be “crawled”. For this reason “robots.txt” files were created, where a page states what should be indexed and what shouldn’t.

    • A file that contains directives specifying pages on a website that must not be crawled by search engine bots
    • File is placed in root directory of the site
    • The standard for robots.txt is called “Robots Exclusion Protocol”
    • Can be specific to a special web crawler, or apply to all crawlers
    • Not all bots follow this standard (malicious bots, malware) -> “illegal” bots can ignore robots.txt
    • Still considered to be better to include a robots.txt instead of leaving it out
    • It keeps the bots away from less “noteworthy” content of a website, so more time is spent indexing the important/relevant content of the website (see the sketch below)
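
A short sketch of how a well-behaved crawler could apply the Robots Exclusion Protocol using Python's built-in parser (the robots.txt contents and URLs are invented examples):

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt following the Robots Exclusion Protocol: all crawlers
# are kept out of /private/, and one named bot is excluded entirely.
robots_txt = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "http://example.com/index.html"))       # True
print(parser.can_fetch("*", "http://example.com/private/a.html"))   # False
print(parser.can_fetch("BadBot", "http://example.com/index.html"))  # False
```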

    C.2.5 Discuss the relationship between data in a meta tag and how it is accessed by a web-crawler

    Students should be aware that this is not always a transitive relationship.

  • Meta Keywords Attribute - A series of keywords you deem relevant to the page in question.
  • Title Tag - This is the text you'll see at the top of your browser. Search engines view this text as the "title" of your page.
  • Meta Description Attribute - A brief description of the page.
  • Meta Robots Attribute - An indication to search engine crawlers (robots or "bots") as to what they should do with the page.
  • In the past the meta keyword tag could be spammed full of keywords, sometimes not even relevant to the content on the page; this tag is now mostly ignored by search engines. The meta description can sometimes be shown in the results, but it is not a factor in actual ranking.

    Robots Meta Tag

    The robots meta tag is super important: it can be used to disallow crawlers from indexing the page, and you can target all crawlers or list the specific ones you do not wish to be crawled by.

    Answer depends on different crawlers, but generally speaking:

    • The title tag, not strictly a meta-tag, is what is shown in the results, through the indexer
    • The description meta-tag provides the indexer with a short description of the page and this can also be displayed in the SERPS
    • The keywords meta-tag provides, well, keywords about your page; it is largely ignored by modern search engines (see the sketch below)
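
A small sketch of how an indexer might pull the title and meta tags out of a downloaded page with Python's standard html.parser (the HTML snippet is invented); a crawler that respects the robots meta tag would skip indexing this page because of the noindex value:

```python
from html.parser import HTMLParser

# Invented page used for illustration only.
html = """<html><head>
<title>Tea varieties</title>
<meta name="description" content="A short guide to popular teas.">
<meta name="robots" content="noindex, nofollow">
</head><body>...</body></html>"""

class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

reader = MetaReader()
reader.feed(html)
print(reader.title)    # often shown as the result's title in the SERP
print(reader.meta)     # {'description': ..., 'robots': 'noindex, nofollow'}
```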

    C.2.6 Discuss the use of parallel web-crawling

  • The size of the web keeps growing, increasing the time it would take to download pages
  • To make this feasible, “it becomes imperative to parallelize the crawling process” (Stanford)

    Advantages

  • Scalability: as the web grows, a single crawl process cannot handle everything; multithreaded processing can solve the problem
  • Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load
  • Network load reduction (scalability, efficiency and throughput)
    Issues of parallel web crawling

  • Overlapping: parallel web crawlers might index the same page multiple times
  • Quality: If a crawler wants to download ‘important’ pages first, this might not work in a parallel process
  • Communication bandwidth: parallel crawlers need to communicate for the reasons above, which for many processes might take up significant communication bandwidth (see “Why search engines take the quality approach” below)
  • If parallel crawlers request the same page frequently over a short time it will overload servers

    A crawler is a program that downloads and stores Web pages, often for a Web search engine. Roughly, a crawler starts off by placing an initial set of seed URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop. Collected pages are later used for other applications, such as a Web search engine or a Web cache. As the size of the Web grows, it becomes more difficult to retrieve the whole or a significant portion of the Web using a single process. Therefore, many search engines often run multiple processes in parallel to perform the above task, so that the download rate is maximized (reference http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.8408&rep=rep1&type=pdf )
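
A minimal sketch of parallelizing just the download step with a thread pool (the URLs are placeholders); a real parallel crawler would additionally partition the URL space between crawler processes to limit the overlap and communication problems described above:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Placeholder URLs standing in for the crawler's shared download queue.
urls = [
    "http://example.com/",
    "http://example.org/",
    "http://example.net/",
]

def fetch(url):
    try:
        return url, len(urlopen(url, timeout=5).read())
    except OSError as err:
        return url, f"failed: {err}"

# Several pages are downloaded in parallel instead of one after another,
# which is where the throughput gain of parallel crawling comes from.
with ThreadPoolExecutor(max_workers=3) as pool:
    for url, result in pool.map(fetch, urls):
        print(url, result)
```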

    Why search engines take the quality approach ( dated )

    According to a study released in October 2000, the directly accessible "surface web" consists of about 2.5 billion pages, while the "deep web" (dynamically generated web pages) consists of about 550 billion pages, 95% of which are publicly accessible [LVDSS00].

    By comparison, the Google index released in June 2000 contained 560 million full-text-indexed pages [Goo00]. In other words, Google — which, according to a recent measurement [HHMN00], has the greatest coverage of all search engines — covers only about 0.1% of the publicly accessible web, and the other major search engines do even worse.

    Increasing the coverage of existing search engines by three orders of magnitude would pose a number of technical challenges, both with respect to their ability to discover, download, and index web pages, as well as their ability to serve queries against an index of that size. (For query engines based on inverted lists, the cost of serving a query is linear to the size of the index.) Therefore, search engines should attempt to download the best pages and include (only) them in their index.

    Mercator is an extensible, multithreaded, high-performance web crawler [HN99, Mer00]. It is written in Java and is highly configurable. Its default download strategy is to perform a breadth-first search of the web, with the following three modifications:

  • It downloads multiple pages (typically 500) in parallel. This modification allows us to download about 10 million pages a day; without it, we would download well under 100,000 pages per day.
  • Only a single HTTP connection is opened to any given web server at any given time. This modification is necessary due to the prevalence of relative URLs on the web (about 80% of the links on an average web page refer to the same host), which leads to a high degree of host locality in the crawler's download queue. If we were to download many pages from the same host in parallel, we would overload or even crash that web server.
  • If it took t seconds to download a document from a given web server, then Mercator will wait for 10t seconds before contacting that web server again. This modification is not strictly necessary, but it further eases the load our crawler places on individual servers on the web. We found that this policy reduces the rate of complaints we receive while crawling.

    C.2.7 Outline the purpose of web-indexing in search engines

    Search engines index websites in order to respond to search queries with relevant information as quickly as possible. For this reason, a search engine stores information about indexed web pages, e.g. keywords, titles or descriptions, in its database. This way search engines can quickly identify pages relevant to a search query.

    Indexing has the additional purpose of giving a page a certain weight, as described in the search algorithms. This way search results can be ranked, after being indexed.
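
A toy sketch of the core data structure behind web indexing, an inverted index that maps each keyword to the (invented) pages containing it, so that a query never has to scan every stored page:

```python
from collections import defaultdict

# Invented pages standing in for crawled documents.
pages = {
    "page1.html": "green tea is popular in japan",
    "page2.html": "black tea is popular in england",
    "page3.html": "coffee is popular everywhere",
}

# Build the inverted index: keyword -> set of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(*keywords):
    """Return pages containing every keyword (ranking would be applied next)."""
    results = [index.get(k, set()) for k in keywords]
    return set.intersection(*results) if results else set()

print(search("tea", "popular"))   # {'page1.html', 'page2.html'}
```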

    C.2.8 Suggest how developers can create pages that appear more prominently in search engine results


    Naturally an overlap exists with what the website developer should do to get the site high in the SERPs (search engine results pages).

    On Page

    Relevancy: does your site provide the information the user is searching for? The user experience (UX) is becoming a big factor, because it cannot easily be manipulated, and in the future it will play a much bigger role. UX is measured by signals such as the time a user stays on the site and the bounce rate, and many factors play a role in it: load speed, easy navigation (no broken links), spelling, quality and factually correct content, a structured layout, use of images/video, page design (colors, images, video, infographics) and formatting, so that it is easy to scan the page for relevant information. The idea is to get the user to stay on your site (“sticky”). If a user lands on your site after doing a search and leaves after a few seconds, or even before the page loads (slow loading), that is a very big signal to Google that it should not have served up that result.

    Off Page

    Backlinks from other websites: the more authoritative the linking site, the better (for example, the Huffington Post). The site that links to your site should also be relevant. For example, if you are selling dog insurance, a link from a respected charitable dog website would be a very big boost, while a link from a site that provides car rental would have little impact, as it is totally irrelevant.

    Social media marketing (Facebook etc.): be a leader in your field and comment on relevant authoritative forums or blogs. Other users sharing your content via social bookmarking sites also helps.

    This is an area in which you can manipulate the search results. If Google discovers this, your website will be dropped from the index, so you need to ensure any links look natural, e.g. links to an authoritative article or infographic on your website.

    C.2.9 Describe the different metrics used by search engines.


    The process of making pages appear more prominently in search engine results is called SEO. There are many different techniques, considered in section C.2.11. This field is a big aspect of web marketing, as search engines do not disclose how exactly they work, making it hard for developers to perfectly optimise pages.

    In order to rank websites, Google uses many, many metrics; below are a few of the important ones.

    Top Metrics

    On Page

    • Make sure your site can be crawled and thus indexed: avoid Flash, provide a sitemap and use good website architecture
    • The title: create a title tag with your key phrase near or at the beginning. The title should be crafted to get the user to click on your website when it is displayed in the search results, and it must reflect the content of your site
    • Content will always be important: it must be high quality and any information must be factual (aim for at least 1000 words for the home page)
    • Freshness of content 
    • Mobile Friendly
    • Page load speed under 3 seconds
    • If a link is broken, the browser will receive an HTTP 404 response code. The web designer should detect this and provide a helpful error page with navigation
    • Text Formatting (use of h1,h2,bold etc)
    • HTTPS 
    • Do Keyword Research to find what users actually search for and build pages for these terms

    These are only a fraction of the metrics Google uses; more recently, sites served over HTTPS have been given a very slight ranking boost.

    C.2.10 Explain why the effectiveness of a search engine is determined by the assumptions made when developing it.


    The search engine must serve up results that are relevant to what users search for. Google introduced PageRank; prior to that, search engines relied largely on title tags and keyword tags. These could be easily manipulated (stuffed with keywords that you wish to rank for) to get a site onto page 1. Google's PageRank algorithm played a big part in its overall search algorithm.

    • Avoid indexing spam sites (duplicate/copied content). Detect sites that use black hat techniques and remove them from the index
    • Don't re-process static sites (that do not change); crawl authoritative, frequently changing (fresh content) news sites more often
    • Respect robots.txt files
    • Determine sites that change on a regular basis and cache these
    • The spider should not overload servers by continually hitting the same site
    • The algorithm must be able to avoid spider traps
    • Ignore paid-for links (can be difficult)
    • Ignore exact-match anchor text if it is being used to rank keywords/search terms (a backlink profile should look natural to the search engine)
    • Use the comments box to add more assumptions, or to question why these matter

    C.2.11 Discuss the use of white hat and black hat search engine optimization.


    BLACK HAT

    Definition: black hat SEO is, in simple words, any technique for getting top positions or higher rankings in the major search engines like Google, Yahoo and Bing that breaks the rules and regulations of the search engines’ guidelines (see Google’s webmaster guidelines for an example).

    Keyword stuffing

    This worked at one time. You still need the keywords/search terms in your title and page content, but you need to ensure that you do not overuse the keywords/phrases, as that will trip a search engine filter.

    PBN – Private Blog Network

    Google (currently) favors older sites, sites with history. In this approach you buy an expired domain with good metrics, build it up and add links to your own sites, giving them a boost in ranking. This works, but it is costly to set up and you need to use aliases etc.

    Paid For Links

    Similar to PBNs, the aim is to get good quality links from high-authority sites. Have a look at Fiverr, where you can buy such links. This is difficult for Google to detect and it is also very effective.

    Syndicated / Copied Content

    Rather than creating good quality content, use content copied from other sites; the content may be changed using automated techniques. Google is now much better at detecting this (see the Panda update).

    Over Use of Key Words in Anchor Text

    The anchor text tells Google what your site is about, for example "fleet insurance", but if you overuse it or your backlinks look unnatural you will be penalized (see the Penguin update). Before Penguin this was very effective for getting ranked.

    Web 2.0 Links

    Build a website on a free platform such as Tumblr for the sole purpose of sending links to your money site.

    WHITE HAT

    Guest Blogging

    The process of writing a blog post for someone else’s blog is called guest blogging.

    Link Baiting

    Create an amazing article or infographic that other sites may use; if you include a link to your site in the article, you get more backlinks as a result (natural acquisition of backlinks, as opposed to paid links).

    Quality Content

    Search engines evaluate the content of a web page, so a web page might get a higher ranking with more information. This will make it more valuable in the index, and other web pages might link to your web page if it has a high standard of content.

    Site Optimization / Design

    Good menu navigation; proper use of title tags and header tags; adding images with keyword alt tags; interlinking, again with keyword anchor text. Create a sitemap to get the site crawled and to inform the spiders how often to visit the site.

    A good User Experience (UX)

    This is a broad term that overlaps with some other areas mentioned, for example page load speed. The purpose is to ensure that if a user clicks through to your site they stay, without clicking back to the SERPs immediately. Google is happy, as this is a quality signal, and its main purpose is to provide the user with relevant results.

    Page/Site Load Speed

    Fast-loading pages give the user a good experience; aim for under 3 seconds.

    Freshness

    Provide fresh content on a regular basis.

    Google (as are other search engines) is continually fighting black hat techniques that webmasters employ to rank high in the SERPs. Investigate the two major algorithm updates, Panda and Penguin. A good example of a current black hat practice is the use of PBNs.

    Students to investigate PBNs, Panda and Penguin: a quick discussion on these, what Google was targeting, and how PBNs are currently being used effectively to rank sites higher (if caught, you will wake up one morning and your website(s) will have been de-indexed from Google).

    C.2.12 Future challenges to search engines as the web continues to grow

    Search engines must be fast enough to crawl the exploding volume of new Web pages in order to provide the most up-to-date information. As the number of pages on the Web grows, so will the number of results search engines return. So, it will be increasingly important for search engines to present results in a way that makes it quick and easy for users to find exactly the information they’re looking for. Search engines have to overcome both these challenges.

    • Improvements in the search interface, for example voice search
    • Use of natural language processing will also become more prevalent. Today, the search engine takes a set of keywords as the input and returns a list of rank-sorted links as the output. This will slowly fade and the new search framework will have questions as the input and answers as the output. The nascent form of this new framework is already available in search engines like Google and Bing.
    • Semantic searching by machine learning: see RankBrain. RankBrain is designed to help better interpret queries and effectively translate them, behind the scenes, to find the best pages for the searcher
    • Personalized search: because mobile is becoming the primary form of consumption, future search engines will try to use powerful sensing technologies like the accelerometer, digital compass, gyroscope and GPS. Google recently bought a company called Behavio, which predicts what a user might do next using the information acquired from the different sensors on the user’s phone

    C.4.1 Discuss how the web has supported new methods of online interaction such as social networking.

    keywords & Phrases: Web 1.0 and Web 2.0

    Web 1.0, Web 2.0, Semantic Web, ubiquitous, Berners-Lee, open protocols (HTML, HTTP), decentralization, read-only, read-write, hyperlinks, web of linked documents, “successful companies that emerge at each stage of its evolution become monopolies, so market economics don’t apply”.

    keywords & Phrases  Semantic Web

    The aim of the Semantic Web is to shift the emphasis of associative linking from documents to data

    Abundantly available information can be placed in new contexts and reused in unanticipated ways. This is the dynamic that enabled the WWW to spread, as the value of Web documents was seen to be greater in information rich contexts (O’Hara & Hall, 2009).

    Web of data, relational databases, Excel spreadsheets, Web 3.0, ubiquitous, open protocols (HTML, HTTP), decentralization, datasets, URL/URI, read-only, read-write, hyperlinks, web of linked documents, “democracy rules: open and free”, “successful companies that emerge at each stage of its evolution become monopolies, so market economics don’t apply”.

    Governments are making data available, see https://data.gov.uk/

    Students should be aware of issues linked to the growth of new internet technologies such as Web 2.0 and how they have shaped interactions between different stakeholders of the web.

    Google Maps: is it free?

    If your business is missing, can you add it?

    Can this information be monetized? How?

    Should Google have a monopoly on location information?

    Are there alternatives to Google Maps?

    Watch the video below, then create an account on OpenStreetMap and add a building (your condo/house, Wells school, etc.). Read this post and describe in your own words how OpenStreetMap is different from Google Maps. Post your response to Google Classroom.


    C.4.2 Describe how cloud computing is different from a client-server architecture


    It’s worth noting that this comparison is not about two opposites. Both concepts do not exclude each other and can complement one another.

    Client-server architecture

    An application gets split into the client side and the server side. The server can be a central communicator between clients (e.g. an email/chat server) or allow different clients to access and manipulate data in a database. A client-server application also does not necessarily need to work over the internet; it could be limited to a local network, e.g. for enterprise applications.

    Cloud computing



    Cloud computing still relies on the client-server architecture, but puts the focus on sharing computing resources over the internet. Cloud applications are often offered as a service to individuals and companies - this way companies don’t have to build and maintain their own computing infrastructure in house. Benefits of cloud computing include:

    • Pay per use: elasticity allows the user to only pay for the resources that they actually use.
    • Elasticity: cloud applications can scale up or down depending on current demand. This allows a better use of resources and reduces the need for companies to make large investments in a local infrastructure. Wikipedia describes elasticity as “the degree to which a system is able to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner, such that at each point in time the available resources match the current demand as closely as possible”
    • Self-provisioning: allows the user to set up applications in the cloud without the intervention of the cloud provider
    • Companies have the option to use any of the service models: SaaS, PaaS or IaaS
    • Using these services offers many advantages over a self-hosted client-server model: can you think of some?

    Azure is Microsoft's cloud service; the other major one is Amazon Web Services (AWS). Watch the intro video.

    C.4.3 Discuss the effects of the use of cloud computing for specified organizations

    To include public and private clouds

    Note: cloud computing creates an environment conducive to innovative startups and thus the potential for disruptive innovation.

    Private cloud

    In a private cloud model a company owns the data centers that deliver the services to internal users only.

  • Scalability
  • Self-provisioning
  • Direct control
  • Changing computer resources on demand
  • Limited access through firewalls improves security
  • Can you think of any disadvantages?

  • Same high costs for maintenance, staffing, management
  • Additional costs for cloud software
    Public cloud

    In a public cloud services are provided by a third party and are usually available to the general public over the Internet.

    Advantages

  • Easy and inexpensive because the provider covers hardware, application and bandwidth costs
  • Scalability to meet needs
  • No wasted resources
  • Costs calculated by resource consumption only
    Disadvantages

  • No control over sensitive data
  • Security risks
    Hybrid cloud

    The idea of a hybrid cloud is to use the best of both private and public clouds by combining both. Sensitive and critical applications run in a private cloud, while the public cloud is used for applications that require high scalability on demand. As TechTarget explains, the goal of a hybrid cloud is to “create a unified, automated, scalable environment that takes advantage of all that a public cloud infrastructure can provide […]”.

    Summary of obstacles/Concerns

    • Service availability
    • Data lock-in: if you wish to change provider, what format will your data be in? It could be very expensive to convert to a new data format
    • The company goes bust with all your data
    • Data confidentiality and auditability (security)
    • Data transfer bottlenecks
    • Performance unpredictability
    • Data conversions
    • Bugs in large-scale distributed systems

    C.4.5 Describe the interrelationship between privacy, identification and authentication

    Privacy

    Identification

    Authentication

    C.3.1 Define the terms: mobile computing, ubiquitous computing, peer-2-peer network, grid computing

    C.3.2 Compare the major features of: • mobile computing • ubiquitous computing • peer-2-peer network • grid computing

    C.3.3 Distinguish between interoperability and open standards.

    C.3.4 Describe the range of hardware used by distributed networks.

    Students should be aware of developments in mobile technology that have facilitated the growth of distributed networks.

    Compression & Decompression  Week 2

    Graphic File Formats


    The primary web file formats are gif (pronounced “jiff”), jpeg (“jay-peg”), and, to a much lesser extent, png (“ping”) files. All three common web graphic formats are so-called bitmap graphics, made up of a checkerboard grid of thousands of tiny colored square picture elements, or pixels. Bitmap files are the familiar types of files produced by cell phone and digital cameras, and are easily created, edited, resized, and optimized for web use with such widely available tools as Adobe’s Photoshop or Elements, Corel’s Paint Shop Pro and Painter, and other photo editing programs.

    For efficient delivery over the Internet, virtually all web graphics are compressed to keep file sizes as small as possible. Most web sites use both gif and jpeg images. Choosing between these file types is largely a matter of assessing:

    • The nature of the image (is the image a “photographic” collection of smooth tonal transitions or a diagrammatic image with hard edges and lines?)
    • The effect of various kinds of file compression on image quality
    • The efficiency of a compression technique in producing the smallest file size that looks good

    GIF Graphics

    The CompuServe Information Service popularized the Graphic Interchange Format (gif) in the 1980s as an efficient means to transmit images across data networks. In the early 1990s the original designers of the World Wide Web adopted gif for its efficiency and widespread familiarity. Many images on the web are in gif format, and virtually all web browsers that support graphics can display gif files. gif files incorporate a “lossless” compression scheme to keep file sizes at a minimum without compromising quality. However, gif files are 8-bit graphics and thus can only accommodate 256 colors.

    GIF file compression

    The gif file format uses a relatively basic form of file compression (Lempel Ziv Welch, or lzw) that squeezes out inefficiencies in data storage without losing data or distorting the image. The lzw compression scheme is best at compressing images with large fields of homogeneous color, such as logos and diagrams. It is much less efficient at compressing complicated “photographic” pictures with many colors and complex textures (fig. 11.4).

    A two-part illustration showing a diagrammatic image suitable for GIF graphics on the right, and a photographic image on the left that is better served by the JPEG graphic file format.

    Figure 11.4 — The LZW compression built into the GIF graphic format is very good at efficiently saving diagrammatic graphics (right) but poor at compressing more complex photographic images (left).
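
A compact sketch of the LZW idea applied to a string standing in for a row of pixels (a simplified illustration of the algorithm, not the exact GIF implementation): sequences seen before are replaced by a single dictionary code, which is why large flat areas of color compress so well.

```python
def lzw_compress(data):
    """Return a list of dictionary codes for the input string."""
    # Start with one code per single character.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    current, output = "", []
    for ch in data:
        if current + ch in dictionary:
            current += ch                    # keep growing a known sequence
        else:
            output.append(dictionary[current])
            dictionary[current + ch] = len(dictionary)  # learn a new sequence
            current = ch
    if current:
        output.append(dictionary[current])
    return output

# A 'diagrammatic' row of pixels with large flat color areas compresses well...
print(len(lzw_compress("AAAAAAAABBBBBBBBAAAAAAAA")))   # few codes
# ...while a noisy 'photographic' row of the same length needs many more codes.
print(len(lzw_compress("ABACBDACDBACABDCABDBCADA")))
```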

    Dithering

    Full-color photographs can contain an almost infinite range of color values; gif images can contain no more than 256 colors. The process of reducing many colors to 256 or fewer is called dithering. With dithering, pixels of two colors are juxtaposed to create the illusion that a third color is present. Dithering a photographic image down to 256 colors produces an unpleasantly grainy image (fig. 11.5). In the past this technique was necessary to create images that would look acceptable on 256-color computer screens, but with today’s full-color displays there is seldom any need to dither an image. If you need a wider range of colors than the gif format can handle, try using your image editor to save the image in both jpeg and png formats (described below), compare the resulting file sizes and image qualities, and pick the best balance of file size and image quality.

    A two-part comparision of the same photographic image in full color on the left, and at right the image reduced to 256 colors with heavy color dithering.

    Figure 11.5 — Dithering a full-color photograph (left) to a 256-color image (right) results in an image with lots of visual noise and harsh transitions between pixels of different colors. Luckily, such color dithering compromises are now mostly a thing of the past, since most users now have “true color” monitors that can display thousands or millions of colors.

    Improving GIF compression

    You can take advantage of the characteristics of lzw compression to improve its efficiency and thereby reduce the size of your gif graphics. The strategy is to reduce the number of colors in your gif image to the minimum number necessary and to remove colors that are not required to represent the image. A gif graphic cannot have more than 256 colors, but it can have fewer. Images with fewer colors will compress more efficiently under lzw compression. For example, when creating gif graphics in Photoshop, don’t save every file automatically with 256 colors. A simple gif image may look fine at 8, 16, or 32 colors, and the file size savings can be substantial. For maximum efficiency in gif graphics, use the minimum number of colors that gives you a good visual result.
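
The same strategy can be tried outside Photoshop, for example with the Pillow imaging library in Python. This sketch assumes a local source image called logo.png exists and simply compares GIF file sizes at different palette sizes:

```python
from PIL import Image  # Pillow imaging library (pip install Pillow)
import os

image = Image.open("logo.png").convert("RGB")   # assumed local source image

for colors in (256, 32, 8):
    # Quantize to a palette of at most `colors` entries, save as GIF,
    # then compare the resulting file sizes.
    reduced = image.quantize(colors=colors)
    filename = f"logo_{colors}.gif"
    reduced.save(filename)
    print(colors, "colors:", os.path.getsize(filename), "bytes")
```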

    Transparent GIF

    The gif format allows you to pick colors from the color lookup table of the gif to be transparent. You can use image-editing software such as Photoshop (and many shareware utility programs) to select colors in a gif graphic’s color palette to become transparent. Usually the color selected for transparency is the background color in the graphic. Unfortunately, the transparent property is not selective; if you make a color transparent, every pixel in the graphic that shares that color will also become transparent, which can cause unexpected results.

    Adding transparency to a gif graphic can produce disappointing results when the image contains anti-aliasing. If you use an image-editing program like Photoshop to create a shape set against a background color, Photoshop will smooth the shape by inserting pixels of intermediate colors along the shape’s boundary edges. This smoothing, or anti-aliasing, improves the look of screen images by softening what would otherwise look like jagged edges. The trouble comes when you set the background color to transparent and then use the image on a Web page against a different background color. The anti-aliased pixels in the image will still correspond to the original background color. In the example below, when we change the background color from white to transparent (letting the gray web page background show through), an ugly white halo appears around the graphic (fig. 11.6).

    A two-part illustration, showing at left a simple cartoon drawing of a black fish on a white background, and at right he same fish cartoon superimposed on a blue background. The fish graphic shows a white fringe when seen on the blue background.

    Figure 11.6 — The dreaded “white halo” in transparent GIF graphics.

    The same problem exists with printing. Most browsers do not print background colors, and a transparent gif anti-aliased against a colored background will not blend smoothly into the white of the printed page.

    JPEG graphics

    The other graphic file format commonly used on the web to minimize graphics file sizes is the Joint Photographic Experts Group (jpeg) compression scheme. Unlike gif graphics, jpeg images are full-color images that dedicate at least 24 bits of memory to each pixel, resulting in images that can incorporate 16.8 million colors.

    jpeg images are used extensively among photographers, artists, graphic designers, medical imaging specialists, art historians, and other groups for whom image quality and color fidelity is important. A form of jpeg file called “progressive jpeg” gives jpeg graphics the same gradually built display seen in interlaced gifs. Like interlaced gifs, progressive jpeg images often take longer to load onto the page than standard jpegs, but they do offer the user a quicker preview.

    jpeg compression uses a sophisticated mathematical technique called a discrete cosine transformation to produce a sliding scale of graphics compression. You can choose the degree of compression you wish to apply to an image in jpeg format, but in doing so you also determine the image’s quality. The more you squeeze a picture with jpeg compression, the more you degrade its quality. jpeg can achieve incredible compression ratios, squeezing graphics down to as much as one hundred times smaller than the original file. This is possible because the jpeg algorithm discards “unnecessary” data as it compresses the image, and it is thus called a “lossy” compression technique. Notice in figure 11.7 how increasing the jpeg compression progressively degrades the details of the image. The checkered pattern and the dark “noise” pixels in the compressed image are classic jpeg compression artifacts. Note the extensive compression noise and distortion present in the image below, particularly around the leading edge of the fish's head.

    A color illustration of a tropical fish, with a section of the fish graphic blown up to show how heavy JPEG compression results in visual noise that diminishes the overall quality of the graphic.

    Figure 11.7 — JPEG compression comes at a cost: a big increase in visual noise and other compression artifacts that degrade the image quality if over-used.

    Save your original uncompressed images!

    Once an image is compressed using jpeg compression, data is lost and you cannot recover it from that image file. Always save an uncompressed original file of your graphics or photographs as backup. If your digital camera produces jpeg images, set aside the “camera original” jpeg files and work with copies when you edit the files for web use. Each time you save or resave an image in jpeg format, the image is compressed further and the artifacts and noise in the image increase.
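
A short sketch with the Pillow imaging library (assuming a local camera-original file called photo.jpg) that makes the trade-off visible: lower quality settings shrink the file but add artifacts, and every re-save of an already compressed JPEG throws away more data:

```python
from PIL import Image  # Pillow imaging library (pip install Pillow)
import os

original = Image.open("photo.jpg")       # assumed camera-original file

# Lower quality -> smaller file, but more compression artifacts.
for quality in (95, 60, 20):
    name = f"photo_q{quality}.jpg"
    original.save(name, quality=quality)
    print("quality", quality, "->", os.path.getsize(name), "bytes")

# Re-saving an already compressed JPEG discards more data each time,
# which is why the uncompressed/camera-original file should be kept.
```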

    PNG graphics

    Portable Network Graphic (png) is an image format developed by a consortium of graphic software developers as a nonproprietary alternative to the gif image format. As mentioned above, CompuServe developed the gif format, and gif uses the proprietary lzw compression scheme, which was patented by Unisys Corporation, meaning that any graphics tool developer making software that saved in gif format had to pay a royalty to Unisys and CompuServe. The patent has since expired, and software developers can use the gif format freely.

    png graphics were designed specifically for use on web pages, and they offer a range of attractive features, including a full range of color depths, support for sophisticated image transparency, better interlacing, and automatic corrections for display monitor gamma. png images can also hold a short text description of the image’s content, which allows Internet search engines to search for images based on these embedded text descriptions.

    png supports full-color images and can be used for photographic images. However, because it uses lossless compression, the resulting file is much larger than with lossy jpeg compression. Like gif, png does best with line art, text, and logos—images that contain large areas of homogenous color with sharp transitions between colors. Images of this type saved in the png format look good and have a similar or even smaller file size than when saved as gifs. However, widespread adoption of the png format has been slow. This is due in part to inconsistent support in web browsers. In particular, Internet Explorer does not fully support all the features of png graphics. As a result, most images that would be suitable for png compression use the gif format instead, which has the benefit of full and consistent browser support.

    C.3.6 Distinguish between lossless and lossy compression.

    C.3.7 Evaluate the use of decompression software in the transfer of information.

    Compression 1 – Objectives: understand compression techniques and the need for compression


    Compression definition: reducing the size of data, i.e. the number of bits used to store or transmit it. Most services charge based on the number of bits you transmit, so compression reduces bandwidth use.

    Benefits: reduced storage needs and associated costs; better UX (less latency, faster loading); less bandwidth used.

    Possible downside ?

    What can we compress?

    How?

    Types of Compression ?

    Compression 2 – Huffman Example. Objective: apply a compression method; introduction to binary trees.


    Text compression definition: reduce the size of data, i.e. the number of bits used to store it.

    Compress “Hello World” to 33 bits (“Hello World” has 11 characters drawn from 8 distinct symbols, so a fixed-length code of 3 bits per character gives 11 × 3 = 33 bits, compared with 88 bits in 8-bit ASCII)

    Better than 33 bits – how?

    How can we compress text?

    Huffman coding

    Why we need compression techniques

    The storage capacity of computers is growing at an unbelievable rate—in the last 25 years, the amount of storage provided on a typical computer has grown about a millionfold—but we still find more to put into our computers. Computers can store whole books or even libraries, and now music and movies too, if only they have the room. Large files are also a problem on the Internet, because they take a long time to download. We also try to make computers smaller—even a cellphone or wristwatch can be expected to store lots of information!

    Text Compression Huffman

    Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, where the lengths of the assigned codes are based on the frequencies of the corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.
    The variable-length codes assigned to input characters are prefix codes, meaning the codes (bit sequences) are assigned in such a way that the code assigned to one character is never a prefix of the code assigned to any other character. This is how Huffman coding makes sure that there is no ambiguity when decoding the generated bit stream.
    Let us understand prefix codes with a counterexample. Let there be four characters a, b, c and d, and let their corresponding variable-length codes be 00, 01, 0 and 1. This coding leads to ambiguity, because the code assigned to c is a prefix of the codes assigned to a and b. If the compressed bit stream is 0001, the decompressed output may be “cccd” or “ccb” or “acd” or “ab”.
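
A minimal Huffman coding sketch in Python using the standard heapq module; it builds a code table for "Hello World" and shows that the result beats the 33-bit fixed-length encoding from above:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each character to its variable-length bit code."""
    freq = Counter(text)
    # Each heap entry: (frequency, tie-breaker, {char: code-so-far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # edge case: only one symbol
        return {ch: "0" for ch in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Prepend 0 to codes in the left subtree and 1 in the right subtree.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

text = "Hello World"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)                                 # frequent chars get short codes
print(len(encoded), "bits vs", 3 * len(text), "bits with fixed 3-bit codes")
```

Running the sketch prints a 32-bit encoding, one bit shorter than the 33-bit fixed-length code; the saving is small here because the character frequencies in "Hello World" are nearly uniform.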

    Question Section

    Investigate Lempel-Ziv coding; you just need a high-level understanding of how it works.

    Update your Blog with responses to the following :

    State 2 text, 2 graphic and 2 video compression techniques.

    Describe one way that a compressed file may be decompressed.

    Can lossy compression be used on text files? Explain your answer.

    Explain how compression of data may lead to negative consequences. [3]

    Also explain the importance of compression now and in the future.

    HL C.5 Analyzing the web

    Reference for this section: see the linked paper. Please note that its definition of a “tube” is conflicting; a simpler, more general definition can be found in the second linked paper.


    Past Paper Question

    C.4.7 Explain why the web may be creating unregulated monopolies

    In theory the world wide web should be a free place where anybody can have a website. However, hosting a website usually comes with a cost: registering a domain name, getting a hosting service or investing in servers oneself, and creating and maintaining the website (which requires technical knowledge or the cost of hiring a web developer). In addition, to reach an audience, further marketing through SEO (see C.2) is usually necessary to get good rankings in search engine results. This means that for the normal individual a traditional website is not the best option. A better alternative is to publish content on an existing platform, e.g. micro blogging on Twitter, blogging on WordPress or Blogspot, sharing social updates on Facebook, sharing photos on Flickr, etc. This comes with improved comfort for users.

    However, it easily leads to unregulated monopolies in the market because users usually stick to one platform. Tim Berners-Lee describes today’s social networks as centralized silos, which hold all user information in one place. This can be a problem, as such monopolies usually control a large quantity of personal information which could be misused commercially or stolen by hackers. There are certainly many more concerns which won’t fit into the scope of this site.

    C.4.8 Decentralized and democratic web

    A Decentralized Web is free of corporate or government overlords. It is to communication what local farming is to food. With it, people can grow their own information.

    Eric Newton, Innovation Chief, Cronkite News, Arizona State University

    Search Bubbles

    A Filter Bubble Demonstration - Try this at home!

    One way to see how filter bubbles work with search engines that do personalization (like Google) is to take a word that can have multiple meanings in different contexts and build up different search histories using those contexts. Then, when you search for the same word after having built up different search histories, the search engine should return results that look a bit different.

    For this demonstration to work, you need to be sure to clear your search history before you start each round. This works even better if you have 2 or 3 people working side by side at different computers. That way you can compare the results more easily.

    Try this with the word Tea.

    1. Have someone build a search history using names of countries where tea is popular or names of countries where teas originated. Remember, do not use the word "tea" as a search term quite yet. Examples would be England, Japan, China, Latin America, etc.

    2. Have another person build a search history using different spices, herbs, and flowers that make up common teas. Examples would be roses, cinnamon, chrysanthemum, lavender, etc.

    3. Have a third person search for anything related to politics, such as names of political parties (not the Tea Party just yet, though!), names of political movements, words like "activism," or "conservative" and "liberal."

    4. When you are performing these searches, click on some of the results (preferably general ones that might somehow later be connected to tea!). This will contribute to your search history.

    5. Finally, have everyone search for the word "Tea." Have fun comparing results!

    Note: Your results may still look very similar; the differences may be subtle. Whether or not the filter bubble is really something to be concerned about will be discussed in the next tab.

    Who does this?

    These are just a few of the websites that tailor results to you and your clicking history:

    Google, Amazon, Washington Post, Netflix, Yahoo News, New York Times, Facebook, Huffington Post

    Resources


    https://barefootcas.org.uk/wp-content/uploads/2015/02/KS2-Search-Results-Selection-Activity-Barefoot-Computing.pdf

    https://www.hpe.com/us/en/insights/articles/how-search-worked-before-google-1703.html

    Further Reading 

    http://www.ftsm.ukm.my/ss/Book/EVOLUTION%20OF%20WWW.pdf

    https://eprints.soton.ac.uk/272374/1/evolvingwebfinal.pdf

    http://dig.csail.mit.edu/2007/Papers/AIMagazine/fractal-paper.pdf