Web Systems Lecture 7 - Search Engines

This lecture is divided into hyperlinked sections

Introduction
What is the function of a Search Engine?
Why do we need a Search Engine?
Use of a Search Tool
Search Methods
Searching by Name
Searching by Content
How does a search service operate?
Gathering the Search Information
Using a Search Engine
String Matching
Searching using Multiple Keys
Advanced Searching
Early Internet Search Tools
Conclusion
Resources
References
Tutorial Questions


Introduction

This section of the course is concerned with Search Engine technology.

To use the wide resources that the WWW has available, it is necessary to specify a URL that corresponds to a website. Having found a set of resources, further links are provided that will automatically take the user to a new resource which may be part of the original website or may be part of a completely different website held in another part of the world.

The user may not know the URL of a website and so can take advantage of the services that are offered by a Search Engine.


What is the function of a Search Engine?

A Search Engine is a web-based system that allows the user on a client machine to search for specific words or character strings that are part of the content of web pages across the WWW. Even when the user knows neither the location nor the title of a web page, a search engine is able to provide a listing of web pages that match the search string that was requested.

By using an automated search service, the user can find the URLs of web pages held on remote computers that relate to the desired search topic.

The search engines provide a form field in which to type the desired search topic or string, and the results of the search are displayed as a web page. Each result provides a short description of the web page and a hyperlink that may be clicked to take the user to that particular URL.


Why do we need a Search Engine?

When searching for information on a certain topic, the user may browse the WWW by following hyperlinks from various sites. However, this is not a feasible method of finding all information, as websites are appearing and disappearing from the web faster than a human can browse them.

To help speed up the search, some form of automation is needed and this may be referred to as a search tool, indexing tool or a search engine. The service provided is referred to as an automated search service or an automated index.


Use of a Search Tool

Imagine that you are searching for information on networking hardware and will be using the WWW. Perhaps you already are aware of the names of a few companies that market this equipment, 3Com, Nortel, Cisco Systems etc. These companies may form the start of a search for the item in which you have interest. However, the first hurdle is to translate the company name into a URL so that the browser can take the user to the company home page.

A company is not obliged to use its own name as part of the URL of its website, so converting a company name to a URL is not always obvious, although guessing will work in some cases.

If the information search is more abstract, for instance if you are looking for a phrase from a poem, then guessing a URL is unlikely to be fruitful.

This is an occasion where a search engine will be able to provide the user with a list of web pages that potentially have some link to the desired phrase. The user must then browse the pages that were returned by the search engine to discover which of the returned URLs was of interest.

A search tool can also be useful when the URL of an interesting page is lost. The user can enter information that he/she remembers from the web page and a search engine may be able to find the site once more.


Search Methods

There are different methods by which URLs may be searched. Some tools may search for an item by name and will look for page and file titles, whilst other methods may look within web pages for specific content.


Searching by Name

This can be useful if the user knows the specific name of a file such as an exe file or similar. By typing the name into a search engine, the results returned can then be examined by the user for relevance by looking at the part of the URL that indicates the company's name.

This can be likened to using the dir command in DOS or ls on a UNIX system. This method is less popular as most users do not know the specific name of the resource that they wish to access.
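Searching by name can be sketched in a few lines of Python. This is only an illustration: the catalogue of URLs and the function name search_by_name are invented for this example and do not describe any particular tool.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse
import posixpath

# Hypothetical catalogue of file URLs, invented for this example.
CATALOGUE = [
    "ftp://ftp.example.com/pub/tools/unzip.exe",
    "ftp://ftp.example.org/mirror/unzip.exe",
    "ftp://ftp.example.com/pub/docs/readme.txt",
]

def search_by_name(pattern, urls):
    """Return the URLs whose file name (last path component) matches the pattern."""
    matches = []
    for url in urls:
        filename = posixpath.basename(urlparse(url).path)
        # Compare in lower case so the match does not depend on letter case.
        if fnmatch(filename.lower(), pattern.lower()):
            matches.append(url)
    return matches

print(search_by_name("unzip.exe", CATALOGUE))
```

Only the file names are compared, so the user still has to inspect the host part of each returned URL to judge its relevance.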


Searching by Content

When searching for a specific text string, it is necessary to examine the contents of files rather than their names. Perhaps you might want to find a specific phrase from a book. To perform this manually would require that you read the entire book. To perform this on a stand-alone computer you might use the find command on a DOS system or grep on a UNIX system. These commands search through the contents of a user-specified list of files to discover the specified character string.

This is the search type used by most modern search tools.
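A content search in the spirit of find or grep might look like the following sketch. The documents are held in a dictionary rather than read from disk so that the example is self-contained; the document names and the function find_in_contents are assumptions made for illustration.

```python
def find_in_contents(phrase, documents):
    """Return (name, line_number, line) for every line that contains the phrase."""
    hits = []
    for name, text in documents.items():
        for number, line in enumerate(text.splitlines(), start=1):
            # Case-insensitive comparison; this is a choice made for the sketch.
            if phrase.lower() in line.lower():
                hits.append((name, number, line.strip()))
    return hits

# Hypothetical documents standing in for files on disk.
DOCUMENTS = {
    "poem.txt": "I wandered lonely as a cloud\nThat floats on high o'er vales and hills",
    "notes.txt": "Cloud cover was heavy today\nNo rain fell",
}

for hit in find_in_contents("lonely as a cloud", DOCUMENTS):
    print(hit)
```

Every line of every document is read for each request, which is why this approach does not scale to the whole WWW without some form of index.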


How does a Search Service Operate?

To get an overview of the search process, consider how you would look up the telephone number of a land line subscriber. The information that you require is kept in a book known as a telephone directory, in which the customer names that correspond to the telephone numbers are arranged in alphabetical order. The search for a given name does not take more than a few seconds as we do not have to look through the entire book, just the few possible names that are close to the one we require.

Similarly, when searching the WWW for a certain web page, the search tool that you use is searching through the index of a set of resources held on the search facility's hard disk(s). It is not searching the hard disks of all computers connected to the Internet.
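The idea of searching a locally held index rather than the web itself can be illustrated with a small inverted index, which maps each word to the set of pages that contain it. This is one common way to build such an index and is offered as an assumption, not as a description of any specific search engine; the pages and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pages already gathered and stored by the search facility.
PAGES = {
    "http://www.example.com/hardware": "Network switches and routers for sale",
    "http://www.example.org/poetry": "Poems about rivers and bridges",
    "http://www.example.net/routers": "Configuring routers on a home network",
}

def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def lookup(index, word):
    """Answer a query from the index alone, without touching the pages."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)
print(lookup(INDEX, "routers"))
```

Like the telephone directory, the index is built once in advance, so each lookup only consults the entry for the requested word.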

This means that the information must be gathered before a user makes a request. Unfortunately, the nature of the WWW means that information about a page can be held on the disks of a search tool even though the page is no longer available at that URL, or indeed at all. Therefore, the information must be gathered regularly in order to keep the search facility's results up to date.

The request for service from a search facility is a client/server process where the user's computer runs a search client and makes a request of the server belonging to the search facility. The server searches its hard disk(s) and returns the result of its search to the client.


Gathering the Search Information

Because the WWW is so vast in terms of information available, the various search facilities run software that searches the web constantly for web pages and their contents. These pages are returned to the search site and stored on disk(s) for searching purposes.

Because of the vast amount of information held on the web, some search engine technologies filter the web pages to remove words such as a, an, and, the, for and with because these words will not help with indexing.
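The filtering of common words might be sketched as follows. The word list is the one given above; the function name index_terms and the rule used to split text into words are assumptions made for this example.

```python
import re

# The common words listed in the text; real systems may use longer lists.
STOP_WORDS = {"a", "an", "and", "the", "for", "with"}

def index_terms(text):
    """Return the words of the text, in order, with the common words removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]

print(index_terms("A guide to the web for students with an interest in search"))
```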

Some search engines will only keep a record of the web page's title. This is easy to find because it lies within HTML tags. The HTML that forms this page describes the title as

<title>Search Engine Technology</title>

Other search engines look at the META tag in the page heading.

<meta name="description" content="Cohousing is the modern approach to reclaiming a traditional village lifestyle where rich, intergenerational relationships, cooperation, and sustainability are the norm." />

<meta name="keywords" content="cohousing intentional community communities coop co-op cooperative ecovillage eco-village communal living alternative sustainable future housing permaculture" />

<meta name="author" content="The Cohousing Network" />

The inclusion of META tags allows the search engines to separate the sections within the page header so that the pages can be indexed more succinctly and efficiently.

Some search engines also store the first few hundred words on each web page so that they may be searched too.
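The three kinds of information described in this section (the title, the META tags and the first words of the page) could be extracted with Python's standard html.parser module, as in the sketch below. The class name PageSummary, the helper summarise and the sample page are hypothetical, and a real gatherer would need to cope with far less tidy HTML.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect the title, named META tags and the first words of the body."""

    def __init__(self, max_words=200):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.words = []
        self.max_words = max_words
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and len(self.words) < self.max_words:
            room = self.max_words - len(self.words)
            self.words.extend(data.split()[:room])

def summarise(html, max_words=200):
    parser = PageSummary(max_words)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta, parser.words

# Hypothetical page assembled from the fragments shown above.
SAMPLE = """<html><head>
<title>Search Engine Technology</title>
<meta name="keywords" content="cohousing intentional community" />
<meta name="author" content="The Cohousing Network" />
</head><body><p>Cohousing is the modern approach to village life.</p></body></html>"""

print(summarise(SAMPLE, max_words=3))
```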


Using a Search Engine

Whilst modern search engines operate on a client/server basis, there is no need for the end user to download and run a client software application. Today's search engines use a web browser and HTML to transfer requests and results.

The browser displays the home page for a web searching service that contains information on how to use the service, advertising links, a form (dialogue box) to enter the search string and a button to click to initiate the search.

Once the client's search request has been delivered to the search engine, the entries that form the result are compiled dynamically into a web page that is then returned to the client via the Internet.

This page is displayed in the user's browser, displaying the set of URLs that matched the client's request.
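The dynamic compilation of a results page might be sketched as follows. The function name render_results and the page layout are assumptions made for this example; they are not taken from any real search service.

```python
from html import escape

def render_results(query, results):
    """Build an HTML page listing (url, description) pairs for a query."""
    items = []
    for url, description in results:
        # Escape everything that came from outside before placing it in HTML.
        items.append(
            '<li><a href="{0}">{0}</a><br>{1}</li>'.format(
                escape(url, quote=True), escape(description)
            )
        )
    return (
        "<html><head><title>Results for {0}</title></head>"
        "<body><h1>Results for {0}</h1><ul>{1}</ul></body></html>"
    ).format(escape(query), "".join(items))

page = render_results(
    "web spider",
    [("http://www.example.com/crawler", "How a web spider gathers pages")],
)
print(page)
```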


String Matching

This is a simple method by which the indices of the search engine may be searched. The user enters a character string and the search engine trawls through its disk(s) for exact matches of the given string.

The advantage of string matching is that it is a fairly simple process to implement in software and the string searching can be performed quickly and efficiently.

The disadvantage of string searching is that the process has no intelligence and may provide results that are of no relevance at all to the user's search. There is no understanding of semantics (meaning) in string searches, as the following example illustrates.

If I am searching for information on a web spider and enter the term 'spider' to a search engine, I am just as likely to receive results concerning Spiderman - the movie and tarantulas as the information I was originally seeking.
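The spider example can be reproduced with a few lines of Python. The three pages are invented for this illustration, and the comparison ignores letter case, which is an assumption of this sketch and not a property of every search engine.

```python
# Hypothetical stored pages, invented for this example.
PAGES = {
    "http://www.example.com/crawler": "How a web spider gathers pages for a search engine",
    "http://www.example.com/film": "Spiderman - the movie: reviews and trailers",
    "http://www.example.com/pets": "Caring for a tarantula or other pet spider",
}

def string_match(key, pages):
    """Return every URL whose text contains the key as a character string."""
    key = key.lower()
    return [url for url, text in pages.items() if key in text.lower()]

print(string_match("spider", PAGES))      # all three pages match
print(string_match("web spider", PAGES))  # only the crawler page matches
```

The match is made purely on characters, so the film and pet pages are returned alongside the page that was actually wanted.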

A character string can be referred to as a key. Single key searching is likely to produce ambiguous or irrelevant results due to the way in which we use language.

This situation can be vastly improved by entering more than one character string or key into the search engine. This will cut down on the number of results returned and improve the specificity of the search.


Searching using Multiple Keys

The example above that searches for information on a web spider will return improved results if we include both words in the search, but the words will still be treated independently of each other and may still produce irrelevant results.

To tie the search down more tightly, it is necessary to tell the search engine that the words are to be treated together rather than as independent items. Some search engines require a + sign before each word to signify that every marked word must appear in a matching page. The string to enter into the search engine dialogue box would be

+web +spider

Other search engines (e.g. Google) allow the search string to be placed in inverted commas so that the words are matched together as a phrase

"web spider"
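The difference between the two forms of query can be sketched as follows. In this illustration every word, with or without a + sign, is treated as required, and a quoted string must appear as consecutive words; both rules and all of the names are assumptions of the sketch, since the exact behaviour varies between search engines.

```python
import re

def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def parse_query(query):
    """Split a query into required words and quoted phrases."""
    required, phrases = [], []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            phrases.append(phrase)
        else:
            required.append(word.lstrip("+").lower())
    return required, phrases

def matches(query, text):
    """True if the text has every required word and every quoted phrase."""
    required, phrases = parse_query(query)
    words = tokenize(text)
    # Pad with spaces so that phrases only match on whole-word boundaries.
    padded = " " + " ".join(words) + " "
    words_ok = all(word in words for word in required)
    phrases_ok = all(" " + " ".join(tokenize(p)) + " " in padded for p in phrases)
    return words_ok and phrases_ok

GARDEN = "The spider spun a web in the garden"
CRAWLER = "A web spider gathers pages"

print(matches("+web +spider", GARDEN), matches('"web spider"', GARDEN))
print(matches("+web +spider", CRAWLER), matches('"web spider"', CRAWLER))
```

Both texts contain the two words, but only the second contains them next to each other in the order given.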


Advanced Searching

Sometimes the search string that you supply will return unwanted results due to similarity of usage of certain words. If you search for a particular word you may find that it is used by another interest group too. Perhaps you may wish to exclude websites that include the word sex from your search. This is easily accomplished by including the - sign before any words that you wish to be excluded from the search results.

To exclude websites containing the word sex from being returned in a search for fun and entertainment the string to enter into the search engine dialogue box would be

fun entertainment -sex
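Exclusion with the - sign might be implemented along the following lines. The sketch uses a different, hypothetical query (spider -movie) and invented pages; the function names and the rule that every remaining word is required are assumptions made for illustration.

```python
import re

def parse(query):
    """Split a query into words to include and words to exclude."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])
        else:
            include.append(token.lstrip("+"))
    return include, exclude

def search(query, pages):
    """Return URLs that contain every included word and no excluded word."""
    include, exclude = parse(query)
    results = []
    for url, text in pages.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        wanted = all(word in words for word in include)
        unwanted = any(word in words for word in exclude)
        if wanted and not unwanted:
            results.append(url)
    return results

# Hypothetical stored pages, invented for this example.
PAGES = {
    "http://www.example.com/crawler": "How a web spider gathers pages",
    "http://www.example.com/film": "Spider themed movie reviews",
    "http://www.example.com/pets": "Caring for a pet spider",
}

print(search("spider -movie", PAGES))
```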


Early Internet Search Tools

The search for information on the Internet has spawned many search methods, some of which are briefly introduced below.

archie      This tool searches the Internet for files available for FTP by name.

WAIS        Wide-area information servers (WAIS) is an Internet system in which specialised subject databases are created at multiple server locations, kept track of by a directory of servers at one location, and made accessible for searching by users with WAIS client programs.

Gopher      This was a search facility before HTTP existed. Gopher provided a way to bring text files from all over the world to a viewer on your computer. It is still supported by some universities. Two tools for searching Gopher file hierarchies are Veronica and Jughead.

NOTE:  Veronica and Archie are perhaps of most use for serious researchers who have already tried the Web's main search engines or who already know that the topic of their search is likely to be found on Gopher and FTP servers.


Conclusion

There are many search tools and methods available to discover the URLs of web pages having information of interest. Older methods required that you knew the exact title of the file for which you were searching.

Regardless of the tool being used, it is still a client/server process.

The search services compile lists of web pages and their contents and store them locally on hard disks to speed the search requests from users. The search services are constantly reviewing websites for new or updated webpages.

Today it is no longer necessary to run specialised client software to access the databases of information that are collected by the search facility.

Today's search engines look at the content of the web page, either via the meta tag or by reviewing the text of the page.

Being precise as to the content of the web page(s) that a user is searching for can dramatically cut down on the number of vague or irrelevant results that are returned to the user.



Resources

The Internet Book, Douglas E. Comer, Prentice Hall, 1997, pp. 249 et seq.

This section contains links to the following topics:

Search engines
Advanced search engines
Automated search tools
Standards



Comparisons of the Major Search Engines

Read the online resources published by searchenginewatch.com.

Search engines

AltaVista  http://www.altavista.com

Ask Jeeves  http://www.askjeeves.com

Direct Hit  http://www.directhit.com

Excite  http://www.excite.com

FAST Search  http://www.alltheweb.com

Go (Infoseek)  http://www.go.com

Lycos  http://www.lycos.com

HotBot  http://www.hotbot.com

Inktomi  http://www.inktomi.com/products/portal/search/tryit.html

Northern Light  http://www.northernlight.com

Advanced search engines

Google  http://www.google.com

IBM Clever project  http://www.almaden.ibm.com/cs/k53/clever.html

ResearchIndex  http://www.researchindex.com

Metasearch engine

SherlockHound  http://www.sherlockhound.com

Automated search tools

Autonomy  http://www.autonomy.com

Kenjin  http://www.kenjin.com

Standards

The Santa Fe Convention  http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html

W3C XML page  http://www.w3.org/XML/Activity

MathML  http://www.w3.org/Math

XLink  http://www.w3.org/TR/xlink

XPointer  http://www.w3.org/TR/xptr




References

http://whatis.techtarget.com



Tutorial questions

How would knowledge of the use of Boolean operators be of use when searching for a certain character string across the WWW using a modern search engine?

How does the technology employed by the Google search engine differ from that of earlier search engines?
 
 

(c) MM Clements 2001