FP98: Limiting the Access of the Import Web Wizard

ID: Q193942


The information in this article applies to:

 - Microsoft FrontPage 98 for Windows

SUMMARY

This article describes a method you can use to prevent Web robots, also called Web spiders or Web crawlers (such as the FrontPage Import Web Wizard), from searching through your Web and retrieving files that are meant to be private.

This article also provides examples of how to use this method on your server to prevent a FrontPage 98 user (or any Web robot) from bypassing your security.


MORE INFORMATION

Overview

Web robots are programs that traverse many pages on the World Wide Web by recursively retrieving linked pages. FrontPage 98 has an Import Web Wizard that works just like a Web robot.

There have been occasions where Web robots have visited Web servers and retrieved pages from areas where they were not welcome.

Situations like this have led many Web server administrators to implement a method that prevents Web robots, such as the Import Web Wizard, from accessing areas where they are not allowed or wanted.

The Method

The method used to exclude Web robots from a server is to create a file named "robots.txt" on the server that specifies an access policy for them. This approach was chosen because it can be easily implemented on any existing Web server, and a Web robot can find the access policy with only a single document retrieval.
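To illustrate how a cooperating robot honors this policy, the following sketch uses the urllib.robotparser module from Python's standard library. The server name www.example.com and the page name somepage.htm are placeholders.

    from urllib import robotparser

    # Fetch the server's access policy; this is the single document
    # retrieval mentioned above.
    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # Before retrieving any page, a well-behaved robot asks whether
    # its user-agent is allowed to fetch that URL.
    if rp.can_fetch("*", "http://www.example.com/somepage.htm"):
        pass  # the policy permits retrieving this page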

The Format of Robots.txt

Each record starts with one or more User-agent lines, followed by one or more Disallow lines. The following lines describe the structure of the "robots.txt" file.

NOTE: Unrecognized headers are ignored.

    # The pound sign (#) is used for comments.

    User-agent: *
    Disallow: /folder name/
The following example restricts access to the "_private" folder in the subweb named "myweb" and to the "bak" folder in the Root Web. To use the example, follow these steps:
  1. Create a file in the Root Web called robots.txt.


  2. Place the following lines of text in the "robots.txt" file:

    # do not access these 2 folders

    User-agent: *
    Disallow: /myweb/_private/ # This is my private URL space
    Disallow: /bak/ # these are backup folders

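To confirm what this record permits, the following sketch parses the same lines with Python's standard urllib.robotparser module and tests a few URLs; the file names index.htm and backup.htm are hypothetical:

    from urllib import robotparser

    record = [
        "# do not access these 2 folders",
        "User-agent: *",
        "Disallow: /myweb/_private/ # This is my private URL space",
        "Disallow: /bak/ # these are backup folders",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(record)

    print(rp.can_fetch("*", "/myweb/_private/index.htm"))  # False: excluded
    print(rp.can_fetch("*", "/bak/backup.htm"))            # False: excluded
    print(rp.can_fetch("*", "/myweb/index.htm"))           # True: allowed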

If you want to restrict access to the entire Web site, follow these steps:
  1. Create a file in the Root Web called robots.txt.


  2. Place the following lines of text in the "robots.txt" file:

    # do not access anything on the web

    User-agent: *
    Disallow: /


NOTE: An empty "robots.txt" file imposes no explicit access restrictions. It is treated as if it were not present, and Web robots are allowed throughout the Web.
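The difference between an empty file and a file that disallows everything can be demonstrated with the same urllib.robotparser sketch used above; the path /anything.htm is a placeholder:

    from urllib import robotparser

    # A record that restricts access to the entire Web site.
    deny_all = robotparser.RobotFileParser()
    deny_all.parse(["User-agent: *", "Disallow: /"])
    print(deny_all.can_fetch("*", "/anything.htm"))  # False: everything excluded

    # An empty robots.txt file imposes no restrictions.
    empty = robotparser.RobotFileParser()
    empty.parse([])
    print(empty.can_fetch("*", "/anything.htm"))     # True: treated as absent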


REFERENCES

For more information about Web robots, go to a search engine on the World Wide Web (for example, www.yahoo.com or www.infoseek.com) and search on "robots.txt".

For more information about Web robots, please visit the following World Wide Web site:

http://info.webcrawler.com/mak/projects/robots/robots.html

Additional query words: 99 Import Web Wizard WWW bot


Keywords          : 
Version           : WINDOWS:
Platform          : WINDOWS 
Issue type        : 

Last Reviewed: July 30, 1999