Filter Types

Crawl Depth

Specifies how 'deep' WebReaper is permitted to crawl for links. If a page contains a link, then the object referred to by that link is a 'child link' of the original page. Each child link represents one level of crawl depth. For example, if MYPAGE.HTML has a link to HISPAGE.HTML, which in turn has a link to HERPAGE.HTML, then HERPAGE.HTML is two levels deep from MYPAGE.HTML. The maximum crawl depth specifies the maximum depth to which WebReaper can crawl; when this depth is reached, any links within the current page will be ignored.
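The depth rule can be sketched in a few lines of Python. This is only an illustration of the idea, not WebReaper's actual implementation; the function name crawl and the LINKS table are hypothetical, and the link graph is supplied by hand rather than fetched from a server.

```python
from collections import deque

def crawl(start, links, max_depth):
    """Breadth-first walk over a link graph.

    Returns a dict mapping each reached page to its crawl depth.
    Links found on a page already at max_depth are ignored.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # depth limit reached: links on this page are ignored
        for child in links.get(page, []):
            if child not in depth:
                depth[child] = depth[page] + 1
                queue.append(child)
    return depth

# Hypothetical link graph matching the example in the text, plus one extra page.
LINKS = {
    "MYPAGE.HTML": ["HISPAGE.HTML"],
    "HISPAGE.HTML": ["HERPAGE.HTML"],
    "HERPAGE.HTML": ["OTHERPAGE.HTML"],
}
```

With a maximum crawl depth of 2, the walk starting at MYPAGE.HTML reaches HERPAGE.HTML at depth 2 and never follows its link to OTHERPAGE.HTML.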

The exclusions page allows you to specify exact URLs which will not be downloaded during a WebReaper session. Any objects which contain that URL will also be excluded. This feature enables you to exclude entire areas of a site from the reap session.

For example, were you to reap www.otway.com, you could specify www.otway.com/family as a URL exclusion. This would exclude the entire family area from the site (e.g., www.otway.com/family/index.html and www.otway.com/family/familytree.html would both be excluded). You can add URLs by typing them into the edit box and hitting the 'Add' button, or you can drag & drop links from IE, Netscape or the desktop.
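As a rough sketch of the matching rule, an object is excluded when its URL contains any excluded URL. The function name is_excluded is hypothetical, and the plain case-sensitive substring test is an assumption; WebReaper may compare URLs differently.

```python
def is_excluded(url, exclusions):
    """Return True if the URL contains any of the excluded URLs."""
    # Assumption: a plain, case-sensitive substring test.
    return any(excluded in url for excluded in exclusions)

# The exclusion from the example in the text.
EXCLUSIONS = ["www.otway.com/family"]
```

Here www.otway.com/family/index.html is excluded because it contains the excluded URL, while www.otway.com/index.html is not.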

Date Last Modified

Filters out files depending on how recently they changed. If most of a website hasn't been updated for several months, but a couple of pages have up-to-date information on them, it would be useful to download only those pages which have changed recently. In particular, this can be used to refresh a site that was reaped a while ago, to update the local copy with recent changes.

Days Since Last Modification

Similar to the 'Date Last Modified' filter, but this filter uses a relative date rather than an absolute one.
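The difference between the two date filters can be shown with a short sketch: the relative filter is simply the absolute one with a cutoff computed from the current date. The function names are hypothetical, and treating a file modified exactly on the cutoff as 'recent' is an assumption.

```python
from datetime import datetime, timedelta

def modified_since(last_modified, cutoff):
    """Absolute filter: keep files changed on or after a fixed date."""
    return last_modified >= cutoff

def modified_within_days(last_modified, days, now):
    """Relative filter: keep files changed within the last `days` days."""
    return modified_since(last_modified, now - timedelta(days=days))
```

An absolute cutoff stays fixed between reap sessions, whereas the relative form moves forward with the current date each time the filter is tested.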

Download Time

This filter sets a maximum download time for each object. Once that time limit has been reached, the download of that object will be aborted.

File Extension

Allows filtering by the file extension. Bear in mind that this is the file extension of the URL of an object (with any extra URL parameters after a '?' or '#' removed), so it may not always yield sensible results.

The 'Exclude files with these extensions' checkbox determines whether this is an 'exclusive' or 'inclusive' filter. If it is checked, files with the extensions listed will be excluded from the download. If it is unchecked, only files with extensions matching those listed will be downloaded (i.e., all others will be excluded from the download).
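The following sketch shows how an extension might be taken from a URL and how the checkbox switches between the two modes. The function names are hypothetical, and lower-casing the extension before comparison is an assumption rather than documented WebReaper behaviour.

```python
import posixpath

def url_extension(url):
    """Extension of a URL with anything after '?' or '#' removed."""
    for separator in ("?", "#"):
        url = url.split(separator, 1)[0]
    # Assumption: extensions are compared in lower case, without the dot.
    return posixpath.splitext(url)[1].lstrip(".").lower()

def passes_extension_filter(url, extensions, exclude):
    """exclude=True drops listed extensions; exclude=False keeps only them."""
    listed = url_extension(url) in extensions
    return not listed if exclude else listed
```

Note that a URL with no path, such as www.example.com, yields the 'extension' com in this sketch, which is one way the results can fail to be sensible.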

HTML/Binary File

Determines whether HTML files or binary files will be filtered from the download. By adding a filter for both HTML and binary files at the same node in a filter tree (as in the example above), the overall filter behaviour can be subdivided, with different filters for binary and HTML files.

MIME Type

The MIME type is the object type (for example "text/html" or "image/jpeg") which is passed back in response to the original request made to the server. This filter allows you to limit the download to particular types. Check the box next to each type you want to be included in the download; all unchecked types will be ignored. You can select/unselect groups of MIME types by right-clicking on the list (for example, right-clicking on the "image/jpeg" entry will allow you to check/uncheck all 'image' entries in the list).

If WebReaper should come across any new (previously unseen) MIME types, it will add them to the list of choices for future use. The new types won't be downloaded, but will be available for use in future filters.
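A minimal sketch of this behaviour, assuming the list of known types and the set of checked types are held as simple Python sets. The names allow_mime_type and group_of are hypothetical and do not correspond to anything in WebReaper itself.

```python
def allow_mime_type(mime_type, checked, known):
    """Return True if an object of this MIME type should be downloaded."""
    if mime_type not in known:
        # New type: remember it for future filters, but leave it unchecked.
        known.add(mime_type)
        return False
    return mime_type in checked

def group_of(mime_type, known):
    """All known types sharing the top-level group, e.g. every 'image/...'."""
    prefix = mime_type.split("/", 1)[0] + "/"
    return {t for t in known if t.startswith(prefix)}
```

The group_of helper mirrors the right-click behaviour: starting from "image/jpeg" it collects every known 'image' entry so that they can be checked or unchecked together.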

File Size

This filter allows the download to be limited to either a maximum or minimum file size. For example, when downloading a site with a lot of graphics, it could filter out large images, thereby increasing the overall download speed for the site. Alternatively, by specifying a minimum size

Server Depth

Limits how 'far' away from the server specified in the starting URL WebReaper is allowed to crawl. For example, if the starting URL was, say, www.microsoft.com, then a link from that site to another server (e.g., www.dept-of-justice.com) would be counted as a distance of 'one'. A further link from that site to a third server would be a distance of two, and so on. So specifying a distance of one will allow downloads from all servers referenced on the starting URL's server, but not any others. A distance of two will download from all servers referenced on the starting URL's server, plus any which are referenced on those servers.

Because each page can have a large number of links to other servers, the effect of increasing the server depth option will be exponential. For example, if every page linked to, say, 10 other servers, then a distance of 1 would allow downloads from 11 servers (root + 10 others). A distance of 2 would allow downloads from 111 servers (root + 10 + (10 x 10)). Obviously this depends on the site being reaped.
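The arithmetic is a geometric sum: one server at distance zero, then 10, then 10 x 10, and so on. The sketch below (with a hypothetical function name) assumes every server links to the same number of other servers and that none of them repeat, so it gives an upper bound rather than a prediction for any real site.

```python
def reachable_servers(links_per_server, distance):
    """Upper bound on the servers reachable within a given server depth."""
    # Level 0 is the root server; each further level multiplies the count.
    return sum(links_per_server ** level for level in range(distance + 1))
```

With 10 links per server this gives 1, 11 and 111 servers for distances of 0, 1 and 2, and 1111 for a distance of 3.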

A server depth of zero indicates that WebReaper should restrict downloads to the server on which the starting URL was located.

URL Contents

Similar to the file extension filter, but matches any word/phrase within the entire URL, rather than just the file extension. For example, you may wish to filter out all URLs containing the word 'adserver' which might remove advertising links. Since any part of the URL can be filtered, this can also be used to filter out links on particular servers, or in particular sub-directories.

This filter has a special keyword: "%root%". This keyword will be substituted with the directory of the current starting URL whenever the filter is tested. This allows a filter to be generated to force the download to ignore all files except for those in the same directory as the starting URL.
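The substitution can be sketched as follows. The function names are hypothetical, URLs are assumed to be written without a scheme (as in the examples in this document), and the plain substring test is an assumption about how the match is performed.

```python
def root_directory(start_url):
    """Directory part of a scheme-less starting URL."""
    return start_url.rsplit("/", 1)[0] if "/" in start_url else start_url

def url_matches(url, pattern, start_url):
    """Expand %root% in the pattern, then test whether the URL contains it."""
    expanded = pattern.replace("%root%", root_directory(start_url))
    return expanded in url
```

For a starting URL of www.otway.com/family/index.html, the pattern %root% expands to www.otway.com/family, so an inclusive filter on it keeps only URLs containing that directory.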

Keywords

Allows you to filter pages by their actual content. So, for example, you might just want to download links from pages containing the string 'webreaper'.

The string check can be made case sensitive or case insensitive by setting the option on the configuration dialog.
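A minimal sketch of the keyword test, assuming the page content is already available as a string. The function name and the case_sensitive parameter are hypothetical stand-ins for the option on the configuration dialog.

```python
def contains_keyword(page_text, keyword, case_sensitive=False):
    """Return True if the page content contains the keyword string."""
    if case_sensitive:
        return keyword in page_text
    # Case-insensitive comparison: fold both sides before searching.
    return keyword.casefold() in page_text.casefold()
```

With the case-sensitive option off, a page mentioning 'WebReaper' matches the keyword 'webreaper'; with it on, the same page does not match.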