C.2.5 Discuss the relationship between data in a meta tag and how it is accessed by a web-crawler
The answer depends on the crawler, but generally speaking:
- The title tag, which is not strictly a meta-tag, provides the title shown in the search results, via the indexer
- The description meta-tag provides the indexer with a short description of the page
- The keywords meta-tag provides the indexer with keywords describing the page
While meta-tags used to play a role in ranking, they were overused by many pages, so most search engines no longer consider them directly for ranking.
Crawlers now mostly use meta-tags to compare the keywords and description with the actual content of the page, which gives the page a certain weight. For this reason, while meta-tags no longer play the big role they used to, it is still important to include them.
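As a rough illustration of how a crawler might read this data, the sketch below uses Python's standard html.parser module to pull the title, description and keywords out of a page. The class name MetaTagExtractor, the function extract_head_data and the sample page are made up for this example; real crawlers differ in how they parse and use these tags.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects the <title> text and the name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            content = attrs.get("content")
            # Only keep meta tags that carry both a name and a content value.
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_data(html):
    """Return the title, description and keyword list found in the HTML."""
    parser = MetaTagExtractor()
    parser.feed(html)
    raw_keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": [k.strip() for k in raw_keywords.split(",") if k.strip()],
    }


# Hypothetical page used only for this example.
SAMPLE_PAGE = """<html><head>
<title>Web Crawlers Explained</title>
<meta name="description" content="A short introduction to web crawlers.">
<meta name="keywords" content="crawler, indexing, search engine">
</head><body><p>Crawlers download pages.</p></body></html>"""

print(extract_head_data(SAMPLE_PAGE))
```

An indexer could then compare the extracted keywords and description with the words in the page body before deciding how much weight to give them.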
C.2.7 Outline the purpose of web-indexing in search engines
Search engines index websites in order to respond to search queries with relevant information as quickly as possible. For this reason, they store information about indexed web pages, e.g. keywords, titles or descriptions, in a database. This way search engines can quickly identify pages relevant to a search query.
Indexing has the additional purpose of giving a page a certain weight, as described in the search algorithms. This way search results can be ranked, after being indexed.
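One common way to make lookups fast is an inverted index, which maps each word to the pages that contain it. The sketch below is a minimal version under simplifying assumptions: the page texts and URLs are invented, tokenisation is a plain lowercase split, and no weighting or ranking is applied.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages used only for this example.
PAGES = {
    "example.com/crawlers": "Web crawlers download pages for the indexer",
    "example.com/ranking": "Ranking orders indexed pages by weight",
    "example.com/seo": "SEO helps pages rank higher",
}

index = build_index(PAGES)
print(sorted(search(index, "pages")))
print(sorted(search(index, "indexed pages")))
```

Answering a query only requires looking up each query word in the index instead of scanning every stored page, which is why indexing is done ahead of time.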
C.2.8-9 Suggest how developers can create pages that appear more prominently in search engine results. Describe the different metrics used by search engines.
The process of making pages appear more prominently in search engine results is called search engine optimisation (SEO). There are many different techniques, considered in section C.2.11. This field is a big aspect of web marketing. Because search engines do not disclose exactly how they work, it is hard for developers to optimise pages perfectly.
Several metrics can be used to check the web presence of a website.
- Search engine share of referring visits: the proportion of visits that arrive through search engine results, as opposed to direct access or referral pages. Can indicate how meaningful the traffic is.
- Search engine referral: different search engines have different market shares; knowing which search engine the traffic comes from helps to find potential improvements for specific search engines
- Search terms and phrases: identify the most common search keywords and optimise for them
- Conversion rate by search phrase/term: percentage of users arriving via a given search term who sign up
- Number of pages receiving traffic from search engines: as large websites have many pages, it is important to see whether individual pages are being accessed through search engines
- Time taken: time spent by a user on a page after reaching it through the search engine. Indicates how relevant the page is and which resources were accessed
- Number of hits: a page hit occurs when a page is downloaded. This counts the visits to the page and gives a rough idea of its traffic
- Quality of returns: how well a site is placed in the results, i.e. how high it is ranked by search engines
- Quantity of returns: how many pages are indexed by a search engine
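Two of the metrics above, the share of referring visits and the conversion rate by search term, can be computed from a simple visit log. The sketch below assumes a made-up log format (the field names source, engine, term and signed_up are invented for this example) and invented sample data; real analytics tools record far more detail.

```python
from collections import Counter

# Hypothetical visit log used only for this example.
VISITS = [
    {"source": "search", "engine": "Google", "term": "web crawler", "signed_up": True},
    {"source": "search", "engine": "Google", "term": "web crawler", "signed_up": False},
    {"source": "search", "engine": "Bing", "term": "seo tips", "signed_up": False},
    {"source": "direct", "engine": None, "term": None, "signed_up": True},
    {"source": "referral", "engine": None, "term": None, "signed_up": False},
]


def share_by_source(visits):
    """Fraction of all visits that came from each source."""
    counts = Counter(v["source"] for v in visits)
    total = len(visits)
    return {source: count / total for source, count in counts.items()}


def conversion_rate_by_term(visits):
    """Fraction of search visits per term that ended in a sign-up."""
    totals = Counter()
    conversions = Counter()
    for v in visits:
        if v["source"] == "search":
            totals[v["term"]] += 1
            conversions[v["term"]] += int(v["signed_up"])
    return {term: conversions[term] / totals[term] for term in totals}


print(share_by_source(VISITS))
print(conversion_rate_by_term(VISITS))
```

In this sample, three of five visits come from search engines, and half of the visits for "web crawler" convert, while none of the visits for "seo tips" do.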
Parameters Search Engines use to compare
- Ranking programs such as PageRank evaluate the quality of web sites and place high-quality sites high in the index
- The bigger the index, the more pages the search engine can return that are relevant to each query
- User experience: search engines look to find the “best” results for the searcher, and part of this is the user experience a site provides. This includes ease of use and navigation; direct and relevant information; professional, modern and compatible design; high-quality, legitimate and credible content
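To give an idea of how a program like PageRank can turn links into a quality score, here is a simplified sketch of the iterative calculation on a tiny link graph. The graph, the damping value 0.85 and the fixed iteration count are assumptions for this example; this is a textbook-style simplification, not how any search engine actually ranks pages today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to.

    Every linked page must also appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its current rank equally among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page site used only for this example.
LINKS = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page with the most incoming link weight ("home" in this graph) ends up with the highest score, which matches the idea that pages linked to by many other pages are treated as higher quality.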
C.2.6 Discuss the use of parallel web-crawling
- The web keeps growing, which increases the time it takes to download pages
- To keep this time reasonable, “it becomes imperative to parallelize the crawling process” (Stanford)
- Scalability: as the web grows, a single process cannot handle everything. Multithreaded processing can solve the problem
- Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load
- Network load reduction
Issues of parallel web crawling
- Overlapping: parallel web crawlers might index the same page multiple times
- Quality: if a crawler wants to download ‘important’ pages first, this might not work in a parallel process, because each crawler only knows about the part of the web it has seen itself
- Communication bandwidth: parallel crawlers need to communicate to avoid the two issues above, which for many processes might take significant communication bandwidth
- If parallel crawlers request the same page frequently over a short time, they can overload the server
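The sketch below shows one simple way to crawl in parallel while avoiding overlap: worker threads download pages concurrently, and a single coordinator keeps the set of visited URLs so that no page is fetched twice. To stay self-contained it crawls an invented in-memory "web" (FAKE_WEB) instead of making network requests, and the function names are made up for this example. Real parallel crawlers usually run on several machines and need more elaborate coordination.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link structure standing in for real web pages.
FAKE_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "b.example/"],
    "b.example/": ["b.example/1", "a.example/1"],
    "b.example/1": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def parallel_crawl(seed, workers=4):
    """Crawl level by level, fetching each level's pages in parallel."""
    visited = {seed}
    frontier = [seed]
    fetch_count = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            fetch_count += len(frontier)
            # All pages of the current level are fetched concurrently.
            results = pool.map(fetch_links, frontier)
            next_frontier = []
            # Only the coordinator touches the visited set, so a URL is
            # never scheduled twice even if several pages link to it.
            for links in results:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited, fetch_count


visited, fetch_count = parallel_crawl("a.example/")
print(sorted(visited), fetch_count)
```

Keeping the visited set in one place avoids overlap but makes the coordinator a point through which all workers must communicate, which is the communication bandwidth trade-off described above.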