The Penn State search engine, http://search.psu.edu/, is pretty nifty. It allows us to type in keywords or phrases and, presto, we find exactly what we're looking for, really quickly! Maybe. Alas, we are disappointed when the search engine returns results that we would never dream of associating with our desired search. So what gives? Is there a way that Penn State Webmasters and server administrators can control how their pages are ranked by the search engine? Likewise, can search engine users better optimize their searches? Yes to both. Keep reading!
Before we get into tips for searching and getting searched, let's do a rundown of search engine basics. The Advanced Information Technologies (AIT) group of the Center for Academic Computing maintains the Penn State search engine. It uses Inktomi's Search/Site (formerly UltraSEEK Server) search engine, version 3.1.10, running on the Sun Microsystems Solaris 2.5.1 operating system. Under the license agreement Penn State holds with Inktomi, we index only .psu.edu sites. The search engine currently indexes over 420,000 documents!
So, what is an indexed site anyway?
The search engine administrator adds servers to the search engine on a per-request basis, so an indexed site is a server that the search engine administrator has included in the search engine index. An example of an indexed site is http://cac.psu.edu/. When a site is added to the list of indexed sites, the search engine spider, a program that scans the Web, looks for a particular file in the root directory of that site. Once a site is on the list, it is added to the index overnight.
The file name the spider looks for depends on the type of server being indexed. For example, www.psu.edu uses the default index.html. The spider indexes index.html and follows all the links in that file. The search engine administrator does not need to add every document of an indexed site individually; files that are linked from indexed files are indexed automatically. If the spider does not find the server-specific default file (index.html/.htm or default.html/.htm in most cases), it visits all the sub-directories, looking for the default file in each one, and the indexing process continues in the same manner.
The search engine will not follow links from your pages to URLs that are not on an indexed site. For example, the indexed site http://www.aaa.psu.edu/ includes a link to http://www.some.org/. The site http://www.some.org/ will not be indexed by the search engine.
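The rule that the spider follows links only within indexed sites can be illustrated with a small Python sketch. This is not the search engine's actual code; the host list and the function name `should_follow` are assumptions made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts the search engine administrator has added.
INDEXED_HOSTS = {"www.aaa.psu.edu", "cac.psu.edu"}

def should_follow(link, indexed_hosts=INDEXED_HOSTS):
    """Return True if the link points at a host that is in the index."""
    host = urlparse(link).hostname  # hostname is lowercased, or None
    return host is not None and host in indexed_hosts

# Links found on a page of the indexed site http://www.aaa.psu.edu/
links_on_page = [
    "http://www.aaa.psu.edu/staff.html",
    "http://www.some.org/",
]

# Only the link that stays on an indexed host is kept.
followed = [link for link in links_on_page if should_follow(link)]
print(followed)
```

In this sketch the link to http://www.some.org/ is dropped, which mirrors the example in the text above.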
If you have departmental Web space and would like your documents to be searched, make sure that every file on your site is linked from either the server-specific default file or another file that the spider can reach. Consider the case where the server's main index.html file, which is indexed, links to the "index.html" file in your directory. Suppose that you also have the files "staff.html," "meetings.html," and "contactinfo.html" in your directory. If "staff.html" and "meetings.html" are the only files that are referenced from your "index.html" file, then "contactinfo.html" will not be indexed, because it is not referenced by any file that the search engine spider can find. To fix this, reference "contactinfo.html" from your "index.html" file or from your "staff.html" or "meetings.html" files.
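One way to spot files the spider cannot reach is to walk the links yourself. The following Python sketch does this for the example directory above. The link map is typed in by hand here and is purely illustrative; a real check would have to extract the links from your HTML files.

```python
from collections import deque

# Hypothetical link map: each file and the files it links to.
links = {
    "index.html": ["staff.html", "meetings.html"],
    "staff.html": [],
    "meetings.html": [],
    "contactinfo.html": [],
}

def reachable_from(start, links):
    """Return every file reachable by following links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def unreferenced_files(start, links):
    """Return the files that no chain of links from start leads to."""
    return sorted(set(links) - reachable_from(start, links))

print(unreferenced_files("index.html", links))  # ['contactinfo.html']
```

Adding "contactinfo.html" to the list for "staff.html" (or any other reachable file) makes the result empty, which corresponds to the fix described above.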
All of this sounds great, but there is a tidbit of bad news. As of June 1999, faculty, staff, and student personal Web pages on the personal server http://www.personal.psu.edu/ are no longer indexed with our search engine. Please do not send requests to webmaster@psu.edu asking us to index your personal page. The good news is that we do not prohibit other robots from searching our site. Penn State pages and sites can be registered with other search engines, such as Yahoo, Google, and Lycos, to name a few.
Becoming a member of the indexed crowd
Now that you know what an indexed site is, you're probably wondering how you can enter this rite of Web passage and index your site. It isn't quite that dramatic nor is it time-consuming. If a site is not indexed, it is very likely that the Webmaster/server administrator for the site has not requested that the site be added to the Penn State search engine index. It would be impossible for us to contact every single Webmaster/server administrator within the Penn State network so we rely on Penn State Webmasters to send their requests to webmaster@psu.edu to index sites.
Now that you're in, make it advantageous
Once a site is indexed, Webmasters can earn good rank for their respective sites with the search engine by adding description and keyword META tags to their pages. The description META tag summarizes the content of a page and looks like this:
<META NAME="description" CONTENT="this document provides information about scholarship opportunities for undergraduate students in mass communications studies">
Keywords help the search engine to categorize your site. Choose keywords that best describe your content. The search engine gives priority to the first few keywords it finds. The keyword META tag looks like this:
<META NAME="keywords" CONTENT="scholarships, mass communications, undergraduate students">
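To check what a spider might see in these tags, you can read them back out of a page. The sketch below uses Python's standard `html.parser` module; the class name `MetaTagReader`, the helper `read_meta`, and the sample page are assumptions for illustration and say nothing about how the Penn State search engine itself parses pages.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the NAME/CONTENT pairs of META tags in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def read_meta(html):
    reader = MetaTagReader()
    reader.feed(html)
    reader.close()
    return reader.meta

# Hypothetical sample page using the tags shown above.
page = """<HTML><HEAD>
<META NAME="description" CONTENT="information about scholarship opportunities">
<META NAME="keywords" CONTENT="scholarships, mass communications, undergraduate students">
</HEAD><BODY></BODY></HTML>"""

meta = read_meta(page)
# Split the keyword list on commas, keeping the original order.
keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]
print(meta.get("description"))
print(keywords)
```

Because the keyword order is preserved, a quick look at the first entries shows which keywords the search engine would meet first.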
Invoking the search engine
You can help visitors to search the Web from your site or to search your site exclusively by invoking Penn State's search engine, that is, making it available from your site. Penn State colleges, departments, and units can invoke the search engine from Web pages by creating a form that either passes the requested parameters to the search engine or that directly starts the search process. Only indexed sites can invoke the search engine from Web pages. Instructions on how to invoke the search engine from your Web pages can be found in the Search Engine FAQ at http://search.psu.edu/psu/searchfaq.html.
Some common problems and solutions
Penn State Webmasters, from time to time, have reported some "strange happenings" with indexed sites. Luckily, we have answers for the more common problems. You might have noticed that files from your old site are still being indexed with the search engine, even though you created a new site, which is also indexed. To make the old site inaccessible to the search engine, the old site itself must be removed from its server. If the old site was hosted through Penn State Departmental Web space, then the site's administrator/supervisor will need to contact the CAC Computer Accounts Office to request that the space be removed. If the old site was hosted on an independent server (for example, http://www.aaa.psu.edu), then the site administrator for the independent server will need to be contacted to remove the site. After the site has been removed, it takes about ten days for the search engine to purge the files from the index. If there are only a few URLs in the index, the search engine administrator can delete them manually. If there are many URLs, it is more efficient to wait until the search engine purges them from its index.
Another common problem is a site that does not appear in search results even though it resides on an indexed server. The most likely cause is that the indexed server on which your site resides does not provide links to your site from any of its pages/sites. If this is the case, contact the server administrator or Webmaster for the server to request that a link to your site be added. If a link can be added, it may then take up to ten days for the search engine to recognize the change and find your page. If for some reason a link cannot be added, contact webmaster@psu.edu and the site can be added as a separate site for the search engine to index.
For those problems that leave you still scratching your head, send a descriptive message to webmaster@psu.edu and we'll investigate it as best we can.
Reaping the benefits of good searching practices
There are many things that searchers can do to produce optimal search results when using the simple search interface at http://search.psu.edu/. Always keep in mind that not every Penn State server is indexed by our search engine. Since the search engine searches documents for keywords, a search for "ice cream" will bring up all the indexed documents that contain the phrase "ice cream." If no pages are returned from the site http://www.icecream.psu.edu/, however, that server has not been indexed with the Penn State search engine.
Limit yourself: Using defined search engine collections and URL strings to limit your search
Wouldn't it be great if you were able to restrict your search to a particular group of sites? You can! The search engine's collection names let you search a particular group or collection of sites. Our search engine has three collections: psu, polreg, and uinfonet. The table below shows the different collections. For example, if you want to search for advising information, you can restrict your search to the uinfonet collection rather than search the entire psu collection.
| Value Name | String Name | Contents of Collection |
|---|---|---|
| psu | Penn State | All allowed/indexed Penn State Web servers |
| polreg | Penn State Policies | Only policies register pages |
| uinfonet | Penn State Undergraduate Information Network | Only the Undergraduate Information Network pages |
To limit your search to a particular collection, click one of the collection check boxes, located in the search field area at http://search.psu.edu/. Learn more about limiting your search in the Introduction to the Search Engine document at http://search.psu.edu/psu/search.html.
You can also limit your search by using a URL string, that is, a portion of a Web address. This type of search is limited to URLs that contain the characters following the URL tag. This is a great way to search for information when you have only a rough idea about the URL. You may have visited a particular site before, but not created a bookmark for it. If you wanted to find Penn State policies, for example, but cannot remember the URL, you could type the following in the search field:
The search engine will return all documents that contain the word "policies" in their URL and reside in the site: www.psu.edu.
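The effect of such a URL-restricted search can be pictured as a simple filter over document addresses. The Python sketch below illustrates only that filtering idea; it does not show the search engine's query syntax. The function name, the document list, and the case-insensitive comparison are all assumptions.

```python
from urllib.parse import urlparse

def matches_url_filter(url, url_contains, site):
    """True if the URL is on the given site and contains the given text."""
    on_site = urlparse(url).hostname == site
    # Assumption: the URL text is compared without regard to case.
    return on_site and url_contains.lower() in url.lower()

# Hypothetical document addresses, for illustration only.
documents = [
    "http://www.psu.edu/policies/index.html",
    "http://www.psu.edu/admissions/index.html",
    "http://cac.psu.edu/policies/",
]

hits = [u for u in documents if matches_url_filter(u, "policies", "www.psu.edu")]
print(hits)
```

Only the first address passes both conditions: it contains "policies" and it resides on www.psu.edu.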
Be Picky: Use specific terms and capitalization
Try to use discriminating terms that are likely to be found only in the documents you want. The more words you give, the better the results you'll get. The search engine will find documents containing as many of these words and phrases as possible, ranked so that the documents most relevant to your query are presented first. You can also identify phrases with quotation marks and separate phrases with commas. A phrase is entered using double quotation marks and matches only those words that appear adjacent to each other, as in the following phrase:
Multiple phrases or proper names can be separated with a comma. To get an exact match, use capitalization where appropriate for terms or words. For example, a query on cac will find matches for cac, CAC, or Cac. A query on CAC will only match CAC.
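The capitalization rule can be written down in a few lines of Python. This is a sketch of the behavior described above, not the engine's implementation; in particular, treating any query that contains a capital letter as an exact-case match is an assumption that goes slightly beyond the cac/CAC example.

```python
def term_matches(query_term, document_word):
    """Apply the capitalization rule to a single word."""
    if query_term == query_term.lower():
        # An all-lowercase query matches any capitalization.
        return query_term == document_word.lower()
    # Assumption: a query with capitals must match the case exactly.
    return query_term == document_word

words = ["cac", "CAC", "Cac"]
print([w for w in words if term_matches("cac", w)])  # ['cac', 'CAC', 'Cac']
print([w for w in words if term_matches("CAC", w)])  # ['CAC']
```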
Be Really Picky: Use the Advanced Search Option
The Advanced Search option is another great way to refine your search. This option allows you to perform very precise searches by providing the syntax tools for you, taking away all the guesswork. You can narrow your search to include or exclude documents that use specific words and phrases, and to documents that were updated at any time, or before or after a specific date. You can also use Advanced Search to show a specific number of hits, which can be sorted by relevance, date, or title, with or without summaries.
The query suggestions above only outline some very basic ways to search. For more information about refining your search, special searches, and search syntax, please see the "Search Engine Help" documents at http://search.psu.edu/help/. When in doubt, refer to the help documents. They will likely provide you with the assistance you need.
Resources for Webmasters and Searchers
Getting Help
If all else fails and the documents listed above do not answer your questions, send an inquiry to webmaster@psu.edu.