Search engines are one of the most important gateways to the Internet. It is important to refine your search skills, not only for your own use as a student and professional, but also to help you understand how consumers and others find Web sites and their content. These notes and the included multi-task homework will strengthen your understanding of these important tools.
The main search engines that you should know about are:
Of course, a portal provides most of its services by partnering with other content or service providers, for instance, a stock market data service vendor. Even the search engine capability can be outsourced (Inktomi is the biggest provider of outsourced search solutions, and Altavista and others are moving aggressively into that market).
Among other services that a portal can provide are:
We can make an interesting observation here. The difference between a search engine and an information service (such as one for stock market data) is not that big. Both provide information in response to visitors' queries. But the former searches throughout the web, while the latter goes to a specific place to get the required information. There are also many gradations between these two extremes. For instance, a news search (or information service?) may look only at news stories in respectable newspapers and magazines. A job listing will search through jobs, but only at a specific database, not on the web as a whole. One approach is, therefore, centralized: the information comes from sources identified and approved in advance. The other is decentralized, and may look for information at a website that was just created today. There are benefits and disadvantages to both approaches, so both will probably survive.
A critical feature that crosses over all services of a portal is customization. A user (typically recognized by cookies and/or explicit login) can specify where she lives, which stocks she'd like to follow, etc. Ironically, the core of many portals, a search engine, is today never customized: technology for doing that is still lacking. This presents an entrepreneurial and academic opportunity.
There are three first-tier portals (Yahoo, MSN, and AOL) and a host of second-tier competitors (Altavista, Excite, Lycos, Go.com, Snap, etc). Some portals have their roots in a search engine that expanded beyond purely search and began to offer other services to visitors. Others were originally ISPs that differentiated themselves by offering content to their subscribers. Sometimes, an ISP and a search engine merge to create a more powerful portal, as was the case with Excite AtHome.
1. Password-protected pages. For instance, you will never find a Wall Street Journal article through a search engine. Even if it is free to register and obtain username and password for a website, no search engine is clever enough to do it (for a computer program, it is virtually impossible to understand registration instructions written for humans). The solution sometimes used today is integration of a search engine with especially important password-protected sites. This could mean, for instance, that the technical teams from Altavista and Wall Street Journal Interactive get together, and write a special program to allow Altavista to access the Journal's articles.
2. Recently-created pages. Due to the enormous amount of time required to go over the whole web, a search engine indexes the web only periodically, say once every one or two months. As a result, pages that have just come up are not indexed yet. One solution is to index the more important websites, such as magazines, more frequently. Yet, search engines do not do that. For instance, at the time of this writing (October 14, 2000), Google had last indexed Red Herring's site 6 weeks earlier (August 28, 2000). Altavista indexed different parts of the site at different times; the last page it knows from the site is dated September 13, 2000; but many other pages have not been updated by Altavista since spring 2000! Of course, if pages are replaced by others at the same URL, or otherwise disappear, they may never make it to a search engine at all.
3. Very large websites. Consider Amazon.com's website. Amazon may list all the millions of products it has so that a search engine can navigate to them by simply following all possible links. However, no search engine does that, since the amount of time it requires is too large. The solution (used by Amazon) is the same as for password-protected sites: to integrate the search engine with the website, providing the information about the site's content in a compact and convenient form specifically for the purpose of indexing. Few sites go to the trouble of doing that (Amazon, of course, is ahead as usual).
4. Text in non-HTML formats. A lot of information is stored in PDF, Word, Excel, PostScript, and PowerPoint files. A search engine can download and read such files, but none does -- it is considered too much hassle. Solutions are available today, but they just are not used. Even worse, some words may be written in a graphics file, and then it is virtually impossible for a software program to read that text.
5. Multimedia. Today's technology is hopelessly insufficient to allow you to search for a picture of Big Sur at sunset -- unless that picture is accompanied by text that identifies it as such. Similarly, searches for music in MP3 format must rely just on the file name and description -- not the actual content.
6. Pages behind some complex interface that a search engine cannot penetrate. For instance, when several links are buried inside a picture (based on the specific point which the user clicks), a search engine typically does not see the links. While a user sees the text or other marks to tell her where to click, a search engine only sees a million pixels that it cannot interpret in any way -- and they are too numerous to try clicking them all. Java-based and some JavaScript-based menus are also an obstacle. The worst example is a dictionary that does not explicitly list all its words but rather expects a user to first type the word in. Of course, unless a search engine is specifically taught that trick, and is given a list of all English words, it will never be able to access any entries in that dictionary. A good website should always provide text links to all its content, not only for the benefit of search engines, but also for those users who do not want to use Java or extensive graphics.
7. Pages to which there are no (or very few) links. A search engine never has time to index the whole web; it stops indexing after some time. If a page cannot be reached within a reasonable number of clicks from some original list of pages, it may be ignored. As search engines improve their computational and bandwidth resources, this problem will become less serious. In any case, the more interesting pages usually have quite a lot of links to them.
Keep in mind that even the part of the web which is indexed is not always available to a user, due to limitations in the query language. For instance, suppose you look for a page that will help you to solve some homework problem from your statistics class. Suppose you are lucky and there is a page on the web that solves precisely that problem, only phrased differently and using different numbers. E.g., your problem may say "Bob invests $10,000 into a bond with 10% coupon rate and 5 year maturity". The web page will read "John put $25,000 into a fixed-income security that pays 8% annually for 3 years". However cleverly you try to phrase your search query, you will never find that page. Today, no technology exists that lets you search for a certain meaning, regardless of the words used to express it.
Each newsgroup is devoted to some more or less specific topic. Postings fall in several categories:
Indexing is done by a spider program. It starts from some large list of known pages (such as Yahoo's web directory), and follows all links in them, and then all links in the newly discovered pages, etc. As a result, an index of about a billion pages is built, and very complex proprietary technology is used to store them in a limited space and make them searchable with the highest speed possible. Inside each Web page, the spider looks through all HTML code, and it can tell which part is text, which part is links, and which is pictures or other objects. To a limited extent, you can search for non-text parts of the website (e.g., you can search for a picture by name).
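The link-following behavior of a spider can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the miniature "web" (TOY_WEB), the page names, and the max_clicks limit are all hypothetical, and a real spider downloads pages over the network and parses their HTML rather than reading a dictionary.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links found on it.
TOY_WEB = {
    "directory": ["news", "sports"],
    "news": ["sports", "article"],
    "sports": ["directory"],
    "article": ["deep"],
    "deep": [],
    "orphan": ["news"],  # no page links here, so the spider never finds it
}

def crawl(web, seeds, max_clicks):
    """Breadth-first crawl: return {page: clicks from the nearest seed}."""
    found = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if found[page] == max_clicks:
            continue  # click budget spent; do not follow this page's links
        for link in web.get(page, []):
            if link not in found:
                found[link] = found[page] + 1
                queue.append(link)
    return found

indexed = crawl(TOY_WEB, ["directory"], max_clicks=2)
print(indexed)
```

With a limit of two clicks, the page "deep" is left out because it is three clicks from the starting list, and "orphan" is left out because no page links to it. Both cases correspond to pages that a search engine may never index.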
The total number of web pages is hard to estimate, but Google has indexed over a billion pages. At the start of 1998, the total number of Web pages was estimated at 150 million. Thus, the growth rate has exceeded 100% per year for the last two years. Because of the Web's huge size, search engines have to devise various tricks to keep their searches reasonably fast, inexpensive, and useful for the user. These tricks are mostly technical, but a couple of them are important to understand.
The most straightforward approach is to search through the full text of all documents. An alternative approach is to create short "abstracts" or excerpts of each page, and then search only through these abstracts when you submit a query. This reduces the amount of information to search through, and if a word is found in the abstract one can hope that it represents the document's topic. So instead of 10,000 documents you may get just 100, each of which has more relevance to your topic. Unfortunately, the abstracts created automatically are far from perfect, and to create them manually is far too expensive.
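The difference between full-text search and abstract search can be illustrated with a small Python sketch. It assumes a deliberately crude abstract (the first five words of each page); the sample pages and the make_abstract helper are hypothetical and are not how any particular search engine builds its excerpts.

```python
def make_abstract(text, max_words=5):
    """Crude automatic abstract: keep only the first few words."""
    return " ".join(text.split()[:max_words])

def search(texts, word):
    """Return the pages whose text contains the word (case-insensitive)."""
    return sorted(
        url for url, text in texts.items()
        if word.lower() in text.lower().split()
    )

# Hypothetical sample pages.
PAGES = {
    "a": "NBA playoff schedule and results for this season",
    "b": "Local restaurant reviews, written by a former NBA scout",
    "c": "NBA trade rumors updated daily by our staff writers",
}
ABSTRACTS = {url: make_abstract(text) for url, text in PAGES.items()}

full_text_hits = search(PAGES, "NBA")
abstract_hits = search(ABSTRACTS, "NBA")
print(full_text_hits, abstract_hits)
```

The full-text search returns all three pages, while the abstract search drops page "b", where the word appears only in passing. The same sketch also shows the weakness: an abstract that cuts off too early would miss a page that really is about the topic.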
An index looks roughly as follows. For every word ever encountered on the Web, a list of pages containing this word is maintained. E.g., there would be a word "NBA", and next to it there will be links to all the hundreds of thousands of pages mentioning NBA. So if you search for "NBA", AltaVista will just go through that list. This index requires huge amounts of storage, but it does make search many times faster.
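The word-to-pages structure described above can be sketched in Python as follows. The three sample pages and the function names build_index and lookup are hypothetical; a production index stores far more information and uses proprietary compression, so this is a minimal sketch only.

```python
from collections import defaultdict

# Hypothetical pages standing in for documents a spider has fetched.
PAGES = {
    "page1": "NBA scores and NBA standings",
    "page2": "stock market data",
    "page3": "NBA draft news",
}

def build_index(pages):
    """Map every lowercased word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, word):
    """Return the pages listed under one word, in sorted order."""
    return sorted(index.get(word.lower(), set()))

index = build_index(PAGES)
print(lookup(index, "NBA"))
```

Answering a query is then a single dictionary lookup instead of a scan through every page, which is where the speed gain comes from. The cost is the storage needed to keep a list of pages for every word.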
Another trick that affects what the user sees (and not what the search engine stores) is to rank documents by relevance. Using complicated rules, the search engine tries to assign a "relevance score" to each document found, and shows you the documents in order of this score. This reduces the chances of your getting 10,000 results among which the most relevant ones come somewhere in the middle.
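To make the idea of a relevance score concrete, here is a toy Python sketch. The scoring rule used (counting how often the query words occur on the page) is an assumption chosen for simplicity; actual search engines use much more complicated, unpublished rules. The sample pages are hypothetical.

```python
def relevance(text, query):
    """Toy score: how many times the query words occur in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def rank(pages, query):
    """Return (url, score) pairs for matching pages, best score first."""
    scored = [(url, relevance(text, query)) for url, text in pages.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample pages.
PAGES = {
    "a": "bond prices fell",
    "b": "bond yields and bond coupon rates for every bond",
    "c": "basketball scores",
}

ranking = rank(PAGES, "bond coupon")
print(ranking)
```

Page "b" mentions the query words four times and is listed first, page "a" mentions them once and comes second, and page "c" is not returned at all.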
For Usenet search, Deja.com archives every single posting coming to Usenet (because newsgroup messages, unlike Web pages, are normally deleted after a couple of weeks). It also archives all messages posted to Usenet that it can find from past years. Currently its database extends back to 1995. As it finds older archives, it adds them to its database. Deja.com searches through the full text of messages, but to allow this it has to ignore certain very common words. Also, if your search returns too many (>1000) results, Deja.com may show you just the first 100 or so. An important service that Deja.com has is to filter all messages so that spam is mostly eliminated.
Google has taken Deja's approach and also stores every single page it indexes. This is an incredibly valuable service; many pages go down for some reason, but you can still look them up on Google.
See the link How Search Engines Work at the end of this document.
Additional sources of revenue are agreements with on-line vendors such as Amazon.com. Such vendors get a permanent direct link on the home page of the search engine, or on its "shopping" subsection. This is partly a service to the users, since these vendors usually are screened quite carefully for quality.
As discussed above, many successful search engines expanded their offering to become portals. However, as the Altavista example shows, this is not easy. Altavista recently announced that it would scale back its portal activities, and focus again on being just a search engine.
To enhance revenues, search engines use focused advertising, showing different advertising depending on what search the user does. Suppose Toshiba wants to place an ad for its notebooks. Without focused advertising, Lycos would just show this ad on a certain proportion of pages it returns to users. Since most people are not interested in notebooks, this ad won't get many clickthroughs, and Lycos would be unwilling to place it unless it can charge Toshiba a fee that works out to be fairly high per clickthrough. With focused advertising, Lycos shows this ad only when search words (as selected by Toshiba) imply the visitor is likely to be interested in notebooks. The number of clicks on the ad per user seeing it increases perhaps hundreds of times. Still greater advertising focus is made possible with customized pages: when you register, you often provide information about your interests that can be used to fine-tune the ads shown to you.
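The keyword matching behind focused advertising can be sketched in Python. Everything in this example is hypothetical: the ad inventory, the trigger words, and the focused_ads function are illustrations only and do not describe how Lycos or any other search engine actually selects ads.

```python
# Hypothetical ad inventory: each advertiser picks its own trigger words.
ADS = {
    "Toshiba notebooks": {"notebook", "laptop", "portable"},
    "Online bookstore": {"book", "novel", "author"},
}

def focused_ads(query, ads):
    """Return ads whose trigger words overlap the words in the query."""
    query_words = set(query.lower().split())
    return sorted(
        name for name, triggers in ads.items() if triggers & query_words
    )

print(focused_ads("cheap laptop deals", ADS))
print(focused_ads("NBA scores", ADS))
```

A search for "cheap laptop deals" triggers the notebook ad, while a search for "NBA scores" triggers neither ad, so the advertiser's message is shown only to visitors whose search suggests an interest in the product.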
Just as important as revenues are costs. Search engines normally get just 1-2 cents of advertising revenue for each search, so they have to keep costs per search below that. This leads to a lot of pioneering data collection, processing and retrieval techniques being used in the search engines. These techniques are the most closely guarded secrets in the business.
Task 2: Search for your two topics on the Web using both search engines you selected. If you get too many irrelevant results, restrict your search. E.g., if searching for "Anderson School" yields too many results from the Anderson School itself, try excluding pages whose URL contains "ucla.edu". Write down your most successful search strings.
Task 3: Search for your two topics using Deja News. Write down your most successful search string(s). Comment very briefly on the relevance of the results you obtained. Conjecture, in one or two sentences, the types of topics that are better searched for in Usenet than on the Web.
Task 4: Now use some more sophisticated search requests to look again for the same two topics using all three search engines. Record your search string(s), and comment very briefly on the results.
Task 5: The next few times you use any search engine, note the advertising you saw after you submitted the search. Do these look like focused ads to you? If focused, do you feel like a beneficiary or a privacy victim?