Clustering With Search Engines (Tara Calishain)
Posted 2004/5/16 13:40:00; 4260 views
Categories: Search Technology, Development Technology
Published June 3, 2002
Search engines still aren't as smart as we'd like 'em to be. Sure, Google's great, and Yahoo comes in real handy sometimes, but sometimes your search terms just aren't finding what you're looking for.
Enter clustering. With clustering, search engines gather results into groups around a certain theme, or in some cases just provide you with related keywords that perhaps you wouldn't have thought of yourself, helping you zero in on your goal.
The Internet Archive (IA) is a virtual time machine. A non-profit company, the IA is working to "prevent the Internet - a new medium with major historical significance - and other 'born-digital' materials from disappearing into the past." To date, the archive's collection consists of 10 billion web pages, 16 million Usenet postings, 360 archival movies, and 5,000 pages from Arpanet (from the U.S. Department of Defense). Not only is IA a wonderful way to preserve the Internet, but it is most helpful in answering reference questions and has been my assistant (or should that be the other way around) in many a legal research project.
In Part I of Clustering With Search Engines, we'll look at regular search engines that cluster -- and boy, are there a lot of 'em! In Part II of this article, we'll look at meta-search engines that cluster as well as specialty clustering search engines and a search engine that is still offering clustering on a limited basis.
We'll start with one you might not have heard of yet: Google Labs' clustering agent, Google Sets.
Google Sets -- http://labs.google.com/sets
Google Sets doesn't provide search results. Instead, it helps you find similar terms to the ones you've already entered, letting you create more complex queries in one area.
Enter a couple of words -- Tamoxifen and Arimidex work; they're drugs used to treat breast cancer. You'll get a small set of results, but it'll include items you might not have heard of. Be sure to click on them to get Google search results to see how they're related to your original search terms.
Let's do a more general example -- say dog breeds. Enter collie, chihuahua, and german shepherd in the set boxes. You'll get back an enormous list of dog breeds. You don't want to use all of these, of course, but it'll give you an idea of how to narrow your search.
Use Google Sets to build queries when you're looking for similar items, or to brainstorm how to put a search together. The other search engines in this article cluster in a more traditional way; we'll start with Wisenut.
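As a rough illustration of that query-building step, the terms a tool like Google Sets hands back can be joined into one OR query. This is only a minimal Python sketch: the function name build_or_query is made up for the example, and it assumes the search engine you paste the query into accepts OR between terms and double quotes around phrases.

```python
def build_or_query(terms):
    """Join related terms into one OR query, quoting multi-word phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip blank entries
        # Quote phrases so they are matched as a unit.
        parts.append(f'"{term}"' if " " in term else term)
    return " OR ".join(parts)


# Terms from the dog breed example in the article.
breeds = ["collie", "chihuahua", "german shepherd"]
print(build_or_query(breeds))
```

Pick just the handful of terms you care about before building the query; pasting in the whole enormous list would defeat the purpose of narrowing the search.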
Wisenut -- http://www.wisenut.com
Wisenut is a full-text search engine that was recently bought by LookSmart. Enter a search in it -- we'll use "neurosurgery" as the primary example for the rest of the article -- and you'll see that the search results include a black area at the top of the page which has related topics (neurosurgery university, pediatric neurosurgery, etc.) and a number of results. WiseNut calls this the WiseGuide. Some results have a + beside them; click on the + for subtopics. The subtopics will show up in a gray area underneath the clustered results.
There's also a [search this] link next to each of the clustered results, which runs another search with those keywords. Those keywords take you to a different set of clustered results in addition to Web page results, and so on and so on.
Teoma -- http://www.teoma.com
Teoma was recently purchased by Ask Jeeves, and has gotten a lot of press as a potential "Google Killer." While I don't think I'd go that far, it does have interesting clustering technology.
Run the neurosurgery search and you'll get four sets of results. Top left are sponsored results. Bottom left are Web site (non-sponsored) results. Top right are the suggestions for refining the result (that's what we'll focus on), and bottom right are the "Link Collections from Experts and Enthusiasts," as Teoma calls them. If you're just looking for general information then use the link collections. If you're interested in narrowing your search, though, use the suggestions.
Just click on one and your search will be run again, with the suggested term you clicked on included in the query. You'll get a different set of site results, suggestions, and expert link collections, too.
Infonetware.com -- http://www.infonetware.com/
This site isn't a search engine per se but is rather a demonstration of Infonetware's "RealTerm Technology."
Enter a search term at the top of the page. The results page is framed. The area on the left provides you with topics related to your search term, while the frame on the right shows the Web page search results. The topics have a number in parens beside them that shows how many results are in that particular topic.
Click on a topic and the results for that topic will appear in the right frame. With some of the terms, you'll see subtopics that allow you to narrow your search results even more.
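To make the topic list with counts a little more concrete, here is a minimal Python sketch of grouping results by topic and printing each topic with its count in parens. It is not Infonetware's actual method: the sample titles are made up, the topics are assigned by hand, and the function names are hypothetical. A real clustering engine would derive the topics from the page text.

```python
from collections import defaultdict

# Hypothetical sample data: (page title, topic) pairs standing in for results.
results = [
    ("Pediatric Neurosurgery Clinic", "pediatric neurosurgery"),
    ("Department of Neurosurgery", "university programs"),
    ("Children's Brain Tumor Center", "pediatric neurosurgery"),
    ("Neurosurgery Residency", "university programs"),
    ("Spine Surgery Overview", "spine"),
]


def group_by_topic(results):
    """Collect page titles under their topic label."""
    groups = defaultdict(list)
    for title, topic in results:
        groups[topic].append(title)
    return dict(groups)


def topic_menu(groups):
    """Topic labels with counts in parens, largest topic first."""
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    return [f"{topic} ({len(titles)})" for topic, titles in ordered]


groups = group_by_topic(results)
for line in topic_menu(groups):
    print(line)

# "Clicking" a topic amounts to showing only that topic's results.
print(groups["pediatric neurosurgery"])
```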
While Infonetware works with full-text searching, the Oingo engine uses the Open Directory Project and offers suggestions for searching.
Oingo -- http://www.oingo.com/
Since Oingo uses the Open Directory Project as its search source, it's already clustered in a way. (ODP is a searchable subject index like Yahoo.) When you do a search, the search results page will first give you a drop-down list of potential meanings for your search, if any. Beneath that is a list of categories which relate to your search (listed in order of relevance). Finally, there are site results from the directory itself.
Unfortunately, the suggestions are limited; searching for neurosurgery provides very few suggestions. Only when you search for a more general term does Oingo's usefulness come through. Searching for Rose, for example, provides several suggestions (plant life, pink wine, several different American towns, etc.) and a manageable list of categories.
If you pick a suggested definition, Oingo will run a search again using the definition you specified. All the definitions I looked at for "rose" provided just category results, not results of individual sites. This is a good one to try if you're searching for something that's in a pretty broad category, like flowers, trees, animals, etc.
AlltheWeb -- http://www.alltheweb.com
Now that the Northern Light Web search is no longer publicly available (supposedly), my favorite search engine that nobody remembers is AlltheWeb. AlltheWeb provides two ways to narrow search results. They're both on the right side of the results screen.
The first way is FAST Topics, which apparently uses both ODP topics and dynamically generated topics. Click on a topic and you'll get a list of Web sites related to that topic.
There's also a "Narrow Your Search" option that lists search terms related to your search. Click on one of those and your search will be run again with the term you clicked. Not all search terms have both Topics and Narrow Your Search terms, but all the ones I looked at had either one or the other.
That's it. Next week we'll look at meta-clusterers, and a full-text search engine that's still testing its clustering.
In part one of this article we took a look at general search engines that offer clustering features. In this episode we're going to look at one more general search engine, AltaVista, that is testing clustering but not yet offering it to the general public. Then we'll take a look at a few meta-search engines that cluster, and a specialty search engine that clusters.
AltaVista -- http://www.altavista.com
You may remember that several weeks ago AltaVista was testing their clustering technology with a small percentage of their users. They're still testing it, but I was able to take a second look at it.
AltaVista's paraphrase looks a little like AlltheWeb's recommended terms results; once you run a search, AltaVista's recommendations for narrowing down search results show up at the top of the page. A search for "neurosurgery" shows about a dozen suggestions, including brain, functional results, and Johns Hopkins. Clicking on one of the suggestions to narrow down the search leads to another collection of recommended narrowing terms (clicking on Johns Hopkins leads to suggestions that include pediatric neurosurgery, Johns Hopkins Hospital, and Johns Hopkins University), and so on.
As I mentioned, this is not yet publicly available, but I like the suggestions it makes. If you use AltaVista keep an eye out for it.
In addition to AltaVista and many other general search engines, there are some meta-search engines that cluster their results. Vivisimo is probably the most famous, but there are other ones available too.
Vivisimo -- http://www.vivisimo.com
Vivisimo has a very simple front page, but the search results are organized in groups. A search for neurosurgery provides 163 results. On the left side of the screen are the groups of results, which in this case include Neurosurgeons, Programs, and Nervous System. Click on the + beside the search results to get narrower and narrower search results, until you get to actual page listings. Click on the page title and get the page on the right side of the screen. This page design makes it really easy to explore several categories without "losing your place."
Don't forget to check out Vivisimo's advanced search, which allows you to specify the search engines you want to use and how many results you want (the more results you specify the more interesting the categories get -- that's what my experimenting showed, anyway). You can also specify what language your search results should be in and how you want your pages to display (in a frame, in a window, or in a new window). There's even a filter for removing offensive content (though that does limit the number of search engines available).
While Vivisimo is fairly well known, Query Server is more just a demo site. But it's a demo site worth looking at -- it offers clustering search for several different categories of Web search.
Query Server -- http://www.queryserver.com/
Query Server offers several different types of search on the left side of the front page. You'll see links to search there for Web, News, Health, Money, and Government. Each of these searches clusters results, and they all have pretty much the same interface. But they each delve into different resources.
Search results are presented in a frame on the right side of the site. The top of the frame has a query box. Below that is a listing of the search engines queried. Below that is a listing of the groups that search results were clustered into, while below that are the results themselves. Results are divided by cluster and assigned scores based on how relevant they are. A search for "neurosurgery" provided several different clusters, including Cyber Museum of Neurosurgery, UCLA Neurosurgery, and Harvard Medical School.
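For readers curious how a meta-search engine might combine and score results from several engines, here is a minimal Python sketch. The scoring rule (each URL earns 1/rank from every engine that returned it) is an assumption chosen for illustration, not Query Server's actual formula, and the engine names, URLs, and the function name merge_results are all hypothetical.

```python
def merge_results(engine_results):
    """Combine ranked URL lists from several engines into one scored list.

    engine_results maps an engine name to its URLs in rank order.
    Each URL earns 1/rank from every engine that returned it
    (an assumed scoring rule, for illustration only).
    """
    scores = {}
    for urls in engine_results.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical engines and URLs.
sample = {
    "engine_a": ["http://example.com/a", "http://example.com/b"],
    "engine_b": ["http://example.com/b", "http://example.com/c"],
}
merged = merge_results(sample)
for url, score in merged:
    print(f"{score:.2f}  {url}")
```

A page that more than one engine returns rises to the top, which is one plausible reason a meta-search engine's ordering can differ from that of any single engine it queries.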
The other search engines provide results in much the same way, but I encourage you to check out each engine, and especially the small customize link on the lower right of each query box. The customize link lets you specify the engines used, whether you want to search for ALL or ANY term given, how many results you want in total, and how long you want Query Server to search.
Surfwax -- http://www.surfwax.com
Before you start playing with Surfwax, I have to tell you something: I have never been able to get Surfwax to work except with Internet Explorer.
Surfwax is a service that offers both subscription-based and free services. The subscription-based service gives you access to more search engines and more features, but there is some searching that you can do for free.
After you've done a search, you'll see a "focus" link in the upper-left corner. Click on the little box beside the word. You'll get "focus words" that you can add to the search. Focus words are divided into narrower or broader, and the big difference between this list and others you've seen is that this list contains generic words, and not links to specific people or places like Johns Hopkins or Harvard Medical School. This makes for a different set of search results than the other ones I've mentioned in this article.
Surfwax has been around for a while, but it's not been around nearly as long as the old reliable Northern Light. And while Northern Light no longer offers Web search, it still uses its clustering technology for news search.
Northern Light News Search -- http://www.northernlight.com/news.html
I'm not able to use neurosurgery for this example since a search has to have a certain number of results in order to be classified into folders.
"George Bush" works well for a search, though. Search results are divided into several different folders, including stock markets, macroeconomics, terrorism, and Pakistan. Pick a folder and you'll get the results that appear in that folder. Unfortunately the folder listing does not provide information about what's in a particular folder, but there are subfolders provided if the topic is broad enough. It also appears that the search results are listed in order of date; handy if you're looking for recent stuff.
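The two behaviors described above (folders appear only when a search returns enough results, and results come back in date order) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Northern Light's implementation: the cutoff value, the headlines, the dates, and the function name organize are all made up.

```python
from datetime import date

# Hypothetical cutoff; the article does not say how many results are needed.
MIN_RESULTS_FOR_FOLDERS = 4


def organize(results):
    """Sort (headline, folder, published) tuples newest first, then
    group them into folders only if there are enough results."""
    newest_first = sorted(results, key=lambda r: r[2], reverse=True)
    if len(newest_first) < MIN_RESULTS_FOR_FOLDERS:
        # Too few results to be worth splitting up.
        return {"all results": [r[0] for r in newest_first]}
    folders = {}
    for headline, folder, _published in newest_first:
        folders.setdefault(folder, []).append(headline)
    return folders


# Made-up headlines and dates, for illustration only.
news = [
    ("Markets react to speech", "stock markets", date(2002, 6, 1)),
    ("Talks continue", "Pakistan", date(2002, 6, 3)),
    ("Stocks close lower", "stock markets", date(2002, 6, 4)),
    ("Budget outlook", "macroeconomics", date(2002, 5, 30)),
]
folders = organize(news)
for name, headlines in folders.items():
    print(name, headlines)
```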
You can't always come up with a search query that's specific enough that you'll only find a few search results. In that case, using clustering search engines can break out several hundred results into manageable packages, or provide you suggestions that reduce the ocean of information to a reasonable level. Enjoy!