How Do Search Engines Like Google and Bing Work

A high level look at how search engines such as Google, Bing and Baidu work

What are indexing and crawling? A look at the basics behind how search engines like Google, Baidu and Bing work…


May 13, 2019    By Team YoungWonks *

In this blog, we shall look at what the term search engine means and how a search engine works. In an age where the Internet has made the world a small place and every query is just a few clicks away, it is worth understanding how these queries are interpreted and how the search engine comes up with relevant answers in response.

 

To begin with, what is a search engine? A web search engine, also called an Internet search engine, is a software system built to carry out web searches: it searches the World Wide Web in a systematic way for the particular information specified in a web search query. The list of content/links that a search engine shows in response to a query is known as a search engine results page (SERP). Each web search engine has its own secret search algorithm, but here we shall take a high-level look at how crawling, indexing and search broadly work across these engines.

 

Brief History of Search Engines 

Interestingly, search engines existed even before the World Wide Web was opened to the public in 1991. The first well-documented search engine that searched content files, namely FTP files, was Archie, which debuted on 10 September 1990. Archie downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names. This was followed by the rise of other search engines, namely Veronica and Jughead, after which came Wandex, or the World Wide Web Wanderer, whose main goal was to measure the size of the World Wide Web. Several other web search engines came about, and in 1998 Google, inspired by a small search engine company named goto.com, took to selling search terms and, in time, became one of the most profitable businesses on the Internet. Today, it has the largest market share in the search engine industry; some of the other major search engines in the world today are Bing, Yahoo, Baidu, Yandex and Naver.

 

How Does A Search Engine Work?

To begin with, let’s look at what happens when you go to a search engine and type in a query.

 

Now the basic functions of a search engine are crawling, data mining, indexing and query processing. Let’s look at them one by one.

 

Crawling 

Crawling is the act of sending small programmed bots out to collect information. 

These bots, also called crawlers or spiders, are computer programs that automatically search documents on the Web. Crawlers are primarily programmed to perform repetitive actions, which makes browsing entirely automated. A crawler starts at one website and collects all its information and links; these links are also recorded. The websites behind those links then lead the crawler to further links, and so it continues on its path of data collection.

After a particular period of time, or when a crawler/spider is full, it returns and uploads the content of the web pages and all the links back to the central computer. Entire web pages, preserved in HTML, are stored on the servers of the search engine. Keep in mind that the stored version is not the live version of the web page, i.e. what you see when you enter the URL in your browser, but a historical version called the cached version. To track changes in web page content, spiders can be told to return to web pages often. For example, a news website would typically request that spiders return often because its content changes frequently.

That said, spiders/bots do not find everything on the web. If there are no links to a page, it is basically invisible to search engines. Likewise, web pages that require a password are not stored by a search engine.

The best-known web crawler is Googlebot, which collects data for the search engine Google.
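The link-following step described above can be sketched in a few lines of Python. This is a minimal, hypothetical example using only the standard library: it extracts the links from one (made-up) page, which is the core of what a crawler does before queuing those links for its next visits. A real crawler would fetch the HTML over the network (e.g. with urllib.request), respect robots.txt and revisit schedules, and store the page content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links like "/about" into full URLs
                    self.links.append(urljoin(self.base_url, value))

# A crawler would normally download this HTML from the live site;
# here we use a hypothetical page so the sketch is self-contained.
html = ('<html><body>'
        '<a href="/about">About</a> '
        '<a href="https://example.org/news">News</a>'
        '</body></html>')

parser = LinkExtractor("https://example.com")
parser.feed(html)
print(parser.links)  # the URLs the crawler would visit next
```

Each URL collected here would go into the crawler’s queue, and the same extraction would run on those pages in turn, which is how the crawl spreads across the web.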

 

Data Mining 

Broadly speaking, data mining refers to the collection and storage of all the data gathered by the spiders/bots. It is an empirical method that uses algorithms, artificial intelligence and statistics programs not just to collect large quantities of data (Big Data) but also to evaluate the collected data effectively. A common goal of data mining is detecting patterns, classifying data according to these patterns and serving them in response to web queries. This is also why data mining helps optimize online commerce on an empirical basis.

 

Indexing 

As mentioned earlier, search engines use crawlers to browse the World Wide Web and store the web links; doing this helps the spiders build an index. Indexing, then, is ordering this information systematically. In other words, indexing is the process of recording each word and character in a web page along with its location. We find the same concept at the back of a book, where major words are listed along with the pages on which they can be found. Similarly, the search engine version of indexing is about tracking and recording where a particular word occurs within a page that has been crawled. The largest known Internet index is that of the search engine Google and is stored in its Bigtable system; it is so large that it has indices for its indices. The indexing process relies on the binary system of 1s and 0s: everything is converted to these two numbers, so the process of searching is based not on words and letters but on math. A new website is typically said to take anywhere between 4 days and 4 weeks to be crawled and indexed by Google, though some say Google at times takes less than 4 days to do so.
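The book-index analogy above maps directly onto a data structure called an inverted index: a mapping from each word to the pages (and positions within those pages) where it occurs. Here is a minimal sketch in Python; the page URLs and text are invented for illustration, and a real index would of course be vastly larger and compressed.

```python
import re

def build_index(pages):
    """Map each word to the pages and word positions where it occurs,
    much like the index at the back of a book."""
    index = {}
    for url, text in pages.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, {}).setdefault(url, []).append(position)
    return index

# Hypothetical crawled pages (in reality, the cached HTML content)
pages = {
    "example.com/physics": "the speed of light is constant",
    "example.com/optics":  "light bends when it changes speed",
}

index = build_index(pages)
print(index["light"])  # every page containing "light", with positions
```

Recording positions, not just page names, is what lets an engine later check whether query words appear close together or in a title, which feeds into relevance scoring.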

 

Query Processing

Query processing is the mathematical process by which a web search query is compared to the index and the results are shown to the user. For example, if the search words entered by the user are ‘speed of light’, the software will go through billions of pages looking for the ones that contain these search words.

How does this process work? To begin with, the query entered in the search box is converted to numbers. The search engine will, however, ignore several stop words, i.e. words that will not be searched; examples are of, the, and, it, be and will. These short words are just filler to the computer. If you really wish to include these words in the search, you must put them in quotation marks (or, in Google, add a plus sign before the word). Once the key words are converted to numbers, the engine then looks for the indexed terms that mathematically match, or come closest to, your query.

The algorithm is complex, but it works really quickly: in less than a second it shows results/links ranked by how closely they mathematically match your query. The closer matches are listed higher, and some engines even show a percentage of relevance. Relevance is determined by factors such as whether the words appear in the title or URL as opposed to just the body text, whether the search word is in bold or italics on the page, how many times the word appears on the page, the number and quality of links to that page, and whether the words are in the header.

All the above factors are considered by the search engine, which then assigns a score to the pages that suit the query; as per the score, each page gets a corresponding rank. So a page with a high rank is preferred by the search engine over other pages and will show up on the first page, possibly high up, of the said SERP. Each result shown has a title, the URL and a short text snippet that the user can read at a glance to decide if the page is worth exploring.
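The stop-word filtering and scoring steps above can be sketched as follows. This is a deliberately simplified, hypothetical example: it scores pages only by how often the query words occur on them, whereas a real engine combines the many relevance signals listed above (titles, links, formatting and more) in a secret formula. The tiny index here is invented for illustration.

```python
import re

STOP_WORDS = {"of", "the", "and", "it", "be", "will", "a", "is"}

# Hypothetical inverted index: word -> {page: occurrence count}
index = {
    "speed": {"example.com/physics": 2, "example.com/cars": 1},
    "light": {"example.com/physics": 3, "example.com/optics": 1},
}

def search(query):
    """Drop stop words, then score each page by how often the
    remaining query words occur on it; highest score ranks first."""
    words = [w for w in re.findall(r"[a-z0-9]+", query.lower())
             if w not in STOP_WORDS]
    scores = {}
    for word in words:
        for page, count in index.get(word, {}).items():
            scores[page] = scores.get(page, 0) + count
    # Sort pages so the best mathematical match is listed first
    return sorted(scores, key=scores.get, reverse=True)

print(search("the speed of light"))  # physics page ranks first
```

Note how the stop words ‘the’ and ‘of’ never touch the index at all, and how ties and weighting are exactly where real engines differ: swapping the raw count for a smarter score is what separates one ranking algorithm from another.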

Often the search engine also shows the user similar pages, the most recent versions of the pages in the search result, and related pages that contain words related to those entered by the user. It is also important to note that the search result does not only cover relevant text available online; it also includes elements such as maps, images and videos wherever applicable.

It’s important to note that when you enter words/phrases in the search query box, you are in fact only looking at an index of the Internet and not the entire Internet itself. Google has the largest index and will typically return billions of hits, while other search engines are smaller and thus have smaller indices and, accordingly, fewer hits. Different search engines send spiders/bots in different directions, so they tend to index different parts of the web. Also remember that they work with different algorithms (some of which are well-guarded secrets).

 

Paid and Organic Search 

Typically, a search engine makes its money from advertising revenue as it also shows relevant ads or paid search results to the user. Usually, these are highlighted as ads or sponsored content.  

Paid search is a form of digital marketing where search engines let advertisers show ads on their search engine results pages (SERPs). Paid search works on a pay-per-click model: the client advertising on the search engine doesn’t pay until someone clicks on their ad. This makes it a measurable and controllable marketing channel compared to more traditional forms of advertising.

Organic search, meanwhile, involves entering one or more search terms as a single string of text into a search engine. Organic search results, displayed as paginated lists, are based on relevance to the search terms and do not feature advertisements.

It is important to note how organic search has given rise to search engine optimization (SEO). SEO refers to the process of increasing the visibility of a website or a web page to users of a web search engine. It covers techniques such as editing the web page content, adding content, and modifying the HTML and associated coding, both to increase the page’s relevance to specific keywords and to remove barriers to the indexing activities of search engines. The term SEO excludes paid placements and sponsorships.

 

*Contributors: Written by Vidya Prabhu; Lead image by: Leonel Cruz

This blog is presented to you by YoungWonks, the leading coding program for kids and teens.

YoungWonks offers instructor led one-on-one online classes and in-person classes with 4:1 student teacher ratio.
