The answer depends on the specific crawler, but generally speaking:
While meta-tags used to play a role in ranking, they were abused by so many pages that most search engines no longer consider them for ranking.
Crawlers now mostly use meta-tags to compare the stated keywords and description against the actual content of the page and weight it accordingly. So while meta-tags no longer play the big role they once did, it is still important to include them.
A web crawler, also known as a web spider, web robot or simply bot, is a program that browses the web in a methodical and automated manner. For each page it finds, a copy is downloaded and indexed. In this process it extracts all links from the given page and then repeats the same process for all found links. This way, it tries to find as many pages as possible.
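As a minimal sketch of this fetch-extract-repeat loop, using only the Python standard library (the seed URL is a placeholder, and a real crawler would also need politeness delays, robots.txt checks, and more robust error handling):

# Minimal breadth-first crawler sketch using only the Python standard library.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        # "Download and index" the page: here we just record its size.
        print(f"fetched {url} ({len(html)} bytes)")
        parser = LinkExtractor()
        parser.feed(html)
        # Repeat the same process for every link found on this page.
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl("https://example.com")  # placeholder seed URL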
Issue: A crawler consumes resources (bandwidth and server time), and a site might not wish to be “crawled”. For this reason “robots.txt” files were created, in which a site states what should be indexed and what should not. Stopping bots this way also saves bandwidth and reduces the time crawlers spend on the site.
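For illustration, here is how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser; the site URL, path, and user-agent string are placeholders:

# Checking robots.txt before fetching, using Python's standard library.
# A robots.txt file might look like:
#   User-agent: *
#   Disallow: /private/
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # downloads and parses the robots.txt file

# Only fetch the page if the site allows our bot to crawl it.
if robots.can_fetch("MyCrawlerBot", "https://example.com/private/data.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")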
The best-known search algorithms are PageRank and the HITS algorithm, but it is important to know that most search engines take various other factors into account as well, e.g. the anchor text of links (see below).
For the following description the terms “inlinks” and “outlinks” are used. Inlinks are links that point to the page in question, i.e. if page W has an inlink, there is a page Z containing the URL of page W. Outlinks are links that point from the page in question to a different page, i.e. if page W has an outlink, page W contains the URL of another page, e.g. page Z.
PageRank works by counting the number and quality of inlinks of a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
The value of an inlink from a page A is proportional to
P(A) / C(A)
where P(A) is the PageRank score of page A, and C(A) is the number of outlinks page A has.
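As a rough illustration, the sketch below runs the PageRank iteration on a tiny three-page graph; the damping factor of 0.85 is the commonly cited default, and the graph itself is invented for this example.

# Power-iteration sketch of PageRank on a tiny, made-up link graph.
# outlinks maps each page to the pages it links to.
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

damping = 0.85          # commonly used damping factor
pages = list(outlinks)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):     # iterate until the scores stabilize
    new_rank = {}
    for p in pages:
        # Each inlink from page q contributes P(q) / C(q),
        # matching the formula above.
        inlink_sum = sum(
            rank[q] / len(outlinks[q])
            for q in pages if p in outlinks[q]
        )
        new_rank[p] = (1 - damping) / len(pages) + damping * inlink_sum
    rank = new_rank

print(rank)  # page C, with the most valuable inlinks, scores highest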
As mentioned, it is important to note that many other factors are considered. For instance, the anchor text of a link is often far more important than the PageRank score of the linking page.
The HITS algorithm is based on the idea that keywords are not all that matters; some sites may be more relevant even if they do not contain the most keywords. It introduces the idea of two different types of pages, authorities and hubs.
Authorities: A page is called an authority if it contains valuable information and is truly relevant for the search query. It is assumed that such a page has a high number of inlinks.
Hubs: These are pages that are relevant for finding authorities, as they contain useful links to them. It is therefore assumed that these pages have a high number of outlinks.
The algorithm is based on mathematical graph theory, where a page is represented by a vertex and each link between pages by a directed edge.
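The sketch below illustrates this mutual reinforcement on a small graph: authority scores are updated from the hub scores of inlinking pages, hub scores from the authority scores of outlinked pages, and both are normalized each round. The graph and page names are invented for illustration.

# Sketch of the HITS hub/authority iteration on a small, invented link graph.
import math

outlinks = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "auth1": [],
    "auth2": ["auth1"],
    "auth3": [],
}

pages = list(outlinks)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # A page's authority score is the sum of the hub scores
    # of the pages linking to it (its inlinks).
    auth = {p: sum(hub[q] for q in pages if p in outlinks[q]) for p in pages}
    # A page's hub score is the sum of the authority scores
    # of the pages it links to (its outlinks).
    hub = {p: sum(auth[q] for q in outlinks[p]) for p in pages}
    # Normalize so the scores do not grow without bound.
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", auth)  # auth1, linked to by both hubs, scores highest
print("hubs:", hub)          # hub2, linking to all authorities, scores highest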
The surface web is the part of the web that can be reached by a search engine. For this, pages need to be static and fixed, so that they can be reached through links from other sites on the surface web, and they need to be accessible without special configuration. Examples include Google, Facebook, YouTube, etc.
The deep web is the part of the web that is not searchable by normal search engines. Reasons for this include proprietary content that requires authentication or VPN access (e.g. private social media, email); commercial content protected by paywalls (e.g. online newspapers, academic research databases); protected personal information (e.g. bank information, health records); and dynamic content. Dynamic content is usually the result of a query, where data are fetched from a database on demand.
"A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler." Source Wiki
If you publish politically sensitive content, full-site SSL is best for overall security. However, your site is then more likely to be completely blocked in China. For example, China completely blocked Wikipedia after its recent switch to full-site encryption.
Because of how SSL works, China is unable to block individual web pages on full-SSL sites (as the government previously did with the non-SSL Wikipedia). Blocking a single unsecured page (such as the article on the 1989 Tiananmen Square protests) is easier because it is essentially a man-in-the-middle attack on the site: when SSL is not used, government censorship tools inject themselves into the end user's browser session with the site and break the connection.
Because SSL prevents these kinds of intrusions, the government has no choice but to block the entire domain. Wikipedia's recent change, while more secure for users, results in millions of articles being blocked because of a few politically sensitive ones.
Figure 2: The Difference Between URLs and URIs (Daniel Miessler)