This lecture is divided into hyperlinked sections:
Introduction
What is the function of a Search Engine?
Why do we need a Search Engine?
Use of a Search Tool
Search Methods
Searching by Name
Searching by Content
How does a search service operate?
Gathering the Search Information
Using a Search Engine
String Matching
Searching using Multiple Keys
Advanced Searching
Early Internet Search Tools
Conclusion
Resources
References
Tutorial Questions
This section of the course is concerned with Search Engine technology.
To use the wide resources that the WWW has available, it is necessary to specify a URL that corresponds to a website. Having found a set of resources, further links are provided that will automatically take the user to a new resource which may be part of the original website or may be part of a completely different website held in another part of the world.
The user may not know the URL of a website and so can take advantage of the services that are offered by a Search Engine.
What is the function of a Search Engine?
A Search Engine is a web-based system that allows the user on a client machine to search for specific words or character strings that are part of the content of web pages across the WWW. Without knowing either the location or the title of a web page, the user can obtain from a search engine a listing of web pages that match the search string that was requested.
By using an automated search service, the user can find the URLs of web pages, held on remote computers, that relate to the desired search topic.
The search engines provide a form field to type the desired search topic or string and the results of the search are displayed as a web page. Each link provides a short description of the web page and a hyperlink that may be clicked to take the user to that particular URL.
Why do we need a Search Engine?
When searching for information on a certain topic, the user may browse the WWW by following hyperlinks from various sites. However, this is not a feasible method of finding all information, as websites are appearing and disappearing from the web faster than a human can browse them.
To help speed up the search, some form of automation is needed and this may be referred to as a search tool, indexing tool or a search engine. The service provided is referred to as an automated search service or an automated index.
Imagine that you are searching the WWW for information on networking hardware. Perhaps you are already aware of the names of a few companies that market this equipment, such as 3Com, Nortel and Cisco Systems. These companies may form the start of a search for the item in which you have an interest. However, the first hurdle is to translate the company name into a URL so that the browser can take you to the company home page.
A company is not obliged to use its own name as part of the URL of its website, so guessing the URL from the company name will work in some cases but not in all.
If the information search is more abstract, for instance if you are looking for a phrase from a poem, then guessing a URL is unlikely to be fruitful.
This is an occasion where a search engine will be able to provide the user with a list of web pages that potentially have some link to the desired phrase. The user must then browse the pages that were returned by the search engine to discover which of the returned URLs was of interest.
A search tool can also be useful when the URL of an interesting page is lost. The user can enter information that they remember from the web page and a search engine may be able to find the site once more.
There are different methods by which URLs may be searched. Some tools may search for an item by name and will look for page and file titles, whilst other methods may look within web pages for specific content.
Searching by name can be useful if the user knows the specific name of a file, such as an exe file or similar. The user types the name into a search engine and then examines the returned results for relevance by looking at the part of each URL that indicates the company's name.
This can be likened to using the dir command in DOS or ls on a UNIX system. This method is less popular, as most users do not know the specific name of the resource that they wish to access.
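As a rough illustration of searching by name, the sketch below matches a filename pattern against the last path component of each URL in a small list. The URLs and the function name `search_by_name` are invented for this example and do not describe any particular search tool.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse
import posixpath


def search_by_name(urls, pattern):
    """Return the URLs whose final path component matches a filename pattern."""
    matches = []
    for url in urls:
        # Take only the file name at the end of the URL path.
        filename = posixpath.basename(urlparse(url).path)
        # Compare in lower case so that SETUP.EXE and setup.exe both match.
        if fnmatchcase(filename.lower(), pattern.lower()):
            matches.append(url)
    return matches


# Hypothetical URLs, invented for illustration only.
KNOWN_URLS = [
    "http://www.example.com/downloads/setup.exe",
    "http://www.example.org/tools/SETUP.EXE",
    "http://www.example.net/docs/setup.html",
]

for url in search_by_name(KNOWN_URLS, "setup.exe"):
    # The host name is the part of the URL a user would inspect for relevance.
    print(urlparse(url).netloc, url)
```

Only the names are examined here; the contents of the files are never read.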
When searching for a specific text string, it is necessary to examine the contents of files rather than their names. Perhaps you might want to find a specific phrase from a book. To perform this manually would require that you read the entire book. To perform this on a stand-alone computer you might use find on a DOS system or grep on a UNIX system. These commands search through the contents of a user-specified list of files to discover the specified character string.
This is the search type used by most modern search tools.
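A content search in the spirit of find or grep can be sketched in a few lines. The example below is only an approximation under simple assumptions: the "files" are strings held in a dictionary, the match ignores case, and the names `search_by_content` and `DOCUMENTS` are made up for illustration.

```python
def search_by_content(documents, phrase):
    """Return (name, line_number, line) for every line that contains the phrase."""
    hits = []
    needle = phrase.lower()
    for name, text in documents.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            # Case-insensitive substring test on each line of the document.
            if needle in line.lower():
                hits.append((name, line_number, line.strip()))
    return hits


# Hypothetical documents standing in for files on disk.
DOCUMENTS = {
    "poem.txt": "I wandered lonely as a cloud\nThat floats on high o'er vales and hills",
    "notes.txt": "Cloud storage prices\nRouter configuration",
}

for hit in search_by_content(DOCUMENTS, "lonely as a cloud"):
    print(hit)
```

Unlike a search by name, every line of every document has to be read to answer the query.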
How does a Search Service Operate?
To overview the search process, consider how you would look up the telephone number of a given land line customer. The information that you require is kept in a book known as a telephone directory, and the customer names that correspond to the telephone numbers are arranged in alphabetical order within the book. The search for a given name does not take more than a few seconds as we do not have to look through the entire book, just the few possible names that are close to the one we require.
Similarly, when searching the WWW for a certain web page, the search tool that you use is searching through the index of a set of resources held on the search facility's hard disk(s). It is not searching the hard disks of all computers connected to the Internet.
This brings about the idea that the information is gathered before a user makes a request for a URL. Unfortunately, the nature of the WWW means that sometimes information can be held on the disks of a search tool, but the site to which the URL refers is no longer available either at that URL or indeed at all. Therefore, the information must be gathered regularly in order to keep the search facility results up to date.
The request for service from a search facility is a client/server process where the user's computer runs a search client and makes a request of the server belonging to the search facility. The server searches its hard disk(s) and returns the result of its search to the client.
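The idea that a query consults a previously built index, rather than the pages themselves, can be sketched as follows. This is a minimal in-memory model and not a description of how any real search service stores its data; the page texts, URLs and function names are assumptions made for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it (built in advance)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, word):
    """Answer a query from the index alone, without touching the pages."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical pages gathered before any user makes a request.
PAGES = {
    "http://www.example.com/hubs": "network hubs and switches",
    "http://www.example.org/routers": "network routers",
}

INDEX = build_index(PAGES)
print(lookup(INDEX, "network"))
print(lookup(INDEX, "routers"))
```

As with the telephone directory, the lookup goes straight to the entry for the word instead of reading every page.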
Gathering the Search Information
Because the WWW is so vast in terms of information available, the various search facilities run software that searches the web constantly for web pages and their contents. These pages are returned to the search site and stored on disk(s) for searching purposes.
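The gathering step can be pictured as a program that starts at one page, stores its text and follows its links to further pages. The sketch below simulates this over a small dictionary instead of the real network, so the "web", its URLs and the function name `crawl` are all hypothetical.

```python
from collections import deque


def crawl(web, start_url):
    """Visit every page reachable from start_url and return {url: text}."""
    store = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        page = web.get(url)
        if page is None:
            # The link points at a page that no longer exists; skip it.
            continue
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


# A tiny simulated web: url -> (page text, outgoing links). Invented data.
WEB = {
    "http://www.example.com/": (
        "Home page",
        ["http://www.example.com/a", "http://www.example.com/gone"],
    ),
    "http://www.example.com/a": ("Page A", ["http://www.example.com/"]),
}

print(crawl(WEB, "http://www.example.com/"))
```

The `seen` set stops the program from revisiting pages that link back to each other.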
Because of the vast amount of information held on the web, some search engine technologies filter the web pages to remove words such as a, an, and, the, for and with because these words will not help with indexing.
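The filtering described above can be sketched as follows, using exactly the words listed in the paragraph as the set to discard. Real search engines may use different or longer lists; the function name `index_terms` is an assumption for this example.

```python
import re

# The common words named in the text above; real lists may differ.
STOP_WORDS = {"a", "an", "and", "the", "for", "with"}


def index_terms(text):
    """Split text into lower-case words and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


print(index_terms("A guide to the routers and hubs for a network"))
```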
Some search engines will only keep a record of the web page's title. This is easy to find because it lies within HTML tags. The HTML that forms this page describes the title as
<title>Search Engine Technology</title>
Other search engines look at the META tag in the page heading.
<meta name="description" content="Cohousing is the modern approach to reclaiming a traditional village lifestyle where rich, intergenerational relationships, cooperation, and sustainability are the norm." />
<meta name="keywords" content="cohousing intentional community communities coop co-op cooperative ecovillage eco-village communal living alternative sustainable future housing permaculture" />
<meta name="author" content="The Cohousing Network" />
The inclusion of meta tags allows the search engines to separate the sections within the page header so that the pages can be indexed more succinctly and efficiently.
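One way a program might pull the title and meta tags out of a page header is shown below, using the HTML parser from the Python standard library. The class name `HeadExtractor`, the helper `extract_head` and the shortened sample page are assumptions for illustration; actual search engines may extract this information quite differently.

```python
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collect the <title> text and the name/content pairs of <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head(html_text):
    """Return (title, meta dictionary) for one page of HTML."""
    parser = HeadExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.meta


# A shortened sample page, assembled for this example.
PAGE = (
    "<html><head><title>Search Engine Technology</title>"
    '<meta name="keywords" content="cohousing ecovillage" />'
    '<meta name="author" content="The Cohousing Network" />'
    "</head><body><p>Body text</p></body></html>"
)

title, meta = extract_head(PAGE)
print(title)
print(meta)
```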
Some search engines also store the first few hundred words on each web page so that they may be searched too.
Whilst modern search engines operate on a client/server basis, there is no need for the end user to download and run a client software application. Today's search engines use a web browser and HTML to transfer requests and results.
The browser displays the home page for a web searching service that contains information on how to use the service, advertising links, a form (dialogue box) to enter the search string and a button to click to initiate the search.
Once the client's search request has been delivered to the search engine, the results to return to the user are compiled dynamically into a web page that is then returned to the client via the Internet.
This page is displayed in the user's browser, displaying the set of URLs that matched the client's request.
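Compiling the results dynamically into a web page can be sketched as a function that turns a list of matches into HTML. The layout, the function name `render_results` and the sample result are invented for this example; the point is only that the page is generated per request rather than stored.

```python
from html import escape


def render_results(query, results):
    """Build an HTML page listing (url, description) pairs for a query."""
    items = []
    for url, description in results:
        # Escape all values so that they cannot break the generated HTML.
        safe_url = escape(url, quote=True)
        items.append(
            '<li><a href="{0}">{0}</a> - {1}</li>'.format(safe_url, escape(description))
        )
    return (
        "<html><head><title>Results for {0}</title></head>"
        "<body><ul>{1}</ul></body></html>"
    ).format(escape(query), "".join(items))


# Hypothetical search result used to exercise the function.
page = render_results(
    "web <spider>",
    [("http://www.example.com/", "A page about web spiders")],
)
print(page)
```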
String matching is a simple method by which the indices of the search engine may be searched. The user enters a character string and the search engine trawls through its disk(s) for exact matches of the given string.
The advantage of string matching is that it is a fairly simple process to implement in software and the string searching can be performed quickly and efficiently.
The disadvantage of string searching is that the process does not have intelligence and may provide results that are of no relevance at all to the user's search. There is no understanding of semantics (meaning) in string searches, as illustrated below.
If I am searching for information on a web spider and enter the term 'spider' to a search engine, I am just as likely to receive results concerning Spiderman - the movie and tarantulas as the information I was originally seeking.
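The 'spider' example can be reproduced with a plain substring test. The three pages below are invented for illustration, and the function name `string_match` is an assumption; the sketch simply shows that every page containing the characters is returned, whatever its subject.

```python
def string_match(pages, key):
    """Return the URLs of pages whose text contains the key (ignoring case)."""
    return [url for url, text in pages.items() if key.lower() in text.lower()]


# Hypothetical pages on three unrelated subjects.
PAGES = {
    "http://www.example.com/crawler": "A web spider is a program that follows hyperlinks.",
    "http://www.example.org/film": "Spiderman - the movie, now showing.",
    "http://www.example.net/pets": "Caring for tarantulas and other spiders.",
}

for url in string_match(PAGES, "spider"):
    print(url)
```

All three pages are returned, although only the first concerns the web spider that was being sought.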
A character string can be referred to as a key. Single key searching is likely to produce ambiguous or irrelevant results due to the way in which we use language.
This situation can be vastly improved by entering more than one character string or key into the search engine. This will cut down on the number of results returned and improve the specificity of the search.
The example above that searches for information on a web spider will return improved results if we include both words in the search, but the words will still be treated independently of each other and may produce irrelevant results.
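The behaviour of independent keys can be sketched as follows. Each key must appear somewhere in the page, but not necessarily next to the other key. The pages and the function name `match_all_keys` are made up for this illustration.

```python
def match_all_keys(pages, keys):
    """Return the URLs of pages that contain every key, each tested independently."""
    lowered = [key.lower() for key in keys]
    return [
        url
        for url, text in pages.items()
        if all(key in text.lower() for key in lowered)
    ]


# Hypothetical pages; only the first is about a web spider.
PAGES = {
    "http://www.example.com/crawler": "A web spider follows hyperlinks between pages.",
    "http://www.example.org/garden": "The spider spun a web in the garden.",
    "http://www.example.net/film": "Spiderman - the movie.",
}

print(match_all_keys(PAGES, ["web", "spider"]))
```

The film page is no longer returned, but the garden page still is, because it happens to contain both words.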
To tie the search down more tightly, it is necessary to tell the search engine that the words are to be treated as a pair rather than as independent items. Some search engines require a + sign before each word to signify that the words must be considered together. The string to enter into the search engine dialogue box would be
+web +spider
Other search engines (e.g. Google) allow the search string to be placed in inverted commas:
"web spider"
Sometimes the search string that you supply will return unwanted results due to similarity of usage of certain words. If you search for a particular word you may find that it is used by another interest group too. Perhaps you may wish to exclude websites that include the word sex from your search. This is easily accomplished by including the - sign before any words that you wish to be excluded from the search results.
To exclude websites containing the word sex from being returned in a search for fun and entertainment the string to enter into the search engine dialogue box would be
fun entertainment -sex
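A simplified reading of the query forms shown in this lecture is sketched below: a + prefix (or no prefix) marks a term that must be present, inverted commas group words into a phrase, and a - prefix marks a term that must be absent. The exact rules vary between search engines, so this is an assumption made for the example, as are the pages and the function names.

```python
import re
import shlex


def parse_query(query):
    """Split a query into (required terms, excluded terms); quotes form phrases."""
    required, excluded = [], []
    for term in shlex.split(query.lower()):
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            term = term.lstrip("+")
            if term:
                required.append(term)
    return required, excluded


def contains(text, term):
    """True if the term (a word or a phrase) occurs in the text as whole words."""
    words = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return " {0} ".format(term) in " {0} ".format(words)


def search(pages, query):
    """Return URLs of pages with every required term and no excluded term."""
    required, excluded = parse_query(query)
    return [
        url
        for url, text in pages.items()
        if all(contains(text, term) for term in required)
        and not any(contains(text, term) for term in excluded)
    ]


# Hypothetical pages, invented for illustration only.
PAGES = {
    "http://www.example.com/funfair": "Fun and entertainment for all the family",
    "http://www.example.org/adult": "Fun, entertainment and sex",
    "http://www.example.net/crawler": "A web spider follows hyperlinks",
    "http://www.example.net/garden": "The spider spun a web in the garden",
}

print(search(PAGES, "fun entertainment -sex"))
print(search(PAGES, '"web spider"'))
```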
The search for information on the Internet has spawned many search methods, some of which are briefly introduced below.
Archie
This tool searches the Internet by name for files available for FTP.
WAIS
Wide-area information servers (WAIS) is an Internet system in which specialised subject databases are created at multiple server locations, kept track of by a directory of servers at one location, and made accessible for searching by users with WAIS client programs.
Gopher
This was a search facility before HTTP existed. Gopher provided a way to bring text files from all over the world to a viewer on your computer. It is still supported by some universities. Two tools for searching Gopher file hierarchies are Veronica and Jughead.
NOTE: Veronica and Archie are perhaps of most use for serious researchers who have already tried the Web's main search engines first or who already know that the topic of their search is likely to be found on Gopher and FTP servers.
There are many search tools and methods available to discover the URLs of web pages having information of interest. Older methods required that you knew the exact title of the file for which you were searching.
Regardless of the tool being used, searching is still a client/server process.
The search services compile lists of web pages and their contents and store them locally on hard disks to speed the search requests from users. The search services are constantly reviewing websites for new or updated webpages.
Today it is no longer necessary to run specialised client software to access the databases of information that are collected by the search facility.
Today's search engines look at the content of the web page, either via the meta tag or by reviewing the text of the page.
Being precise about the content of the web page(s) that a user is searching for can dramatically cut down on the number of vague or irrelevant results that are returned to the user.
The Internet Book, Douglas E Comer Prentice Hall 1977 pp249 et seq
This section contains links to the following topics:
Search engines
Advanced search engines
Automated search tools
Standards
Read the online resources published by searchenginewatch.com.
AltaVista http://www.altavista.com
Ask Jeeves http://www.askjeeves.com
Direct Hit http://www.directhit.com
Excite http://www.excite.com
FAST Search http://www.alltheweb.com
Go (Infoseek) http://www.go.com
Lycos http://www.lycos.com
HotBot http://www.hotbot.com
Inktomi http://www.inktomi.com/products/portal/search/tryit.html
Northern Light http://www.northernlight.com
Google http://www.google.com
IBM Clever project http://www.almaden.ibm.com/cs/k53/clever.html
ResearchIndex http://www.researchindex.com
Metasearch engine
SherlockHound http://www.sherlockhound.com
Autonomy http://www.autonomy.com
Kenjin http://www.kenjin.com
The Santa Fe Convention http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
W3C XML page http://www.w3.org/XML/Activity
MathML http://www.w3.org/Math
XLink http://www.w3.org/TR/xlink
XPointer http://www.w3.org/TR/xptr
How would knowledge of the use of Boolean operators be of use when searching for a certain character string across the WWW using a modern search engine?
How does the technology employed by the Google search engine differ from that of earlier search engine algorithms?
(c) MM Clements 2001