Constructing Complex Filters

When building complex filters, bear in mind that the relative positions of the filters are important.

A filter node which is the child of another means that an object will be filtered out unless it satisfies both criteria, effectively ANDing the two conditions. For example, in the following filter, only files which are on the root server and which have been modified in the last 2 days will be downloaded.
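The nesting rule above can be sketched in a few lines of Python. This is only an illustrative model, not the program's own implementation: the `FilterNode` class, its `accepts` method, and the `server` and `age_days` fields are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FilterNode:
    """Hypothetical filter node: one test plus any child filters."""
    test: Callable[[dict], bool]
    children: List["FilterNode"] = field(default_factory=list)

    def accepts(self, obj: dict) -> bool:
        # The object must satisfy this node's own test...
        if not self.test(obj):
            return False
        # ...and, if the node has children, get through at least one of them.
        return not self.children or any(c.accepts(obj) for c in self.children)


# Child filter: modified in the last 2 days (assumed field name: age_days).
recent = FilterNode(lambda o: o["age_days"] <= 2)
# Parent filter: on the root server, with the date filter nested beneath it.
on_root = FilterNode(lambda o: o["server"] == "root", [recent])

print(on_root.accepts({"server": "root", "age_days": 1}))
print(on_root.accepts({"server": "root", "age_days": 10}))
print(on_root.accepts({"server": "other", "age_days": 1}))
```

Only the first object gets through, because it is the only one that passes both the parent test and the child test.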

If more than one filter object is configured at the same level in the filter hierarchy, then these conditions will be OR'd together. That is, only one of the conditions needs to be met for the object to get through the filter. For example, in the following filter, objects will be downloaded if they are binary objects or if they have been modified in the last 2 days.

This means that all binary objects will be downloaded regardless of their modification date, but non-binary (i.e., HTML) objects will only be downloaded if they've recently changed. Combining filters like this can allow real flexibility when reaping sites. For example, many sites update their web pages regularly, but may use bitmaps or illustrations which were created a long time ago. In order to reap just the pages which have changed recently, but to still include graphics embedded within those recently changed pages regardless of when the image was created, a filter such as the one above could be used.
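As a rough illustration of the sibling rule, the sketch below models two filters at the same level as a list of tests, any one of which is enough. The function names and the `kind` and `age_days` fields are assumptions made for the example, not part of the program itself.

```python
def is_binary(obj: dict) -> bool:
    # Assumed field: "kind" is either "binary" or "html".
    return obj["kind"] == "binary"


def modified_recently(obj: dict) -> bool:
    # Assumed field: "age_days" is the number of days since last modification.
    return obj["age_days"] <= 2


# Filters configured at the same level are OR'd together.
siblings = [is_binary, modified_recently]


def passes(obj: dict) -> bool:
    # One satisfied condition is enough to get through.
    return any(test(obj) for test in siblings)


old_image = {"kind": "binary", "age_days": 900}
fresh_page = {"kind": "html", "age_days": 1}
stale_page = {"kind": "html", "age_days": 30}

print(passes(old_image), passes(fresh_page), passes(stale_page))
```

The old image and the fresh page both get through, while the stale page is filtered out, which matches the behaviour described above.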

Here's an example which shows how a complicated filter can be built up, producing a precisely targeted download:

Firstly, this filter will remove any links whose URL contains the word 'adserver', saving bandwidth normally wasted by downloading advertisements from automatic servers.

Next, the tree branches: one half will only accept HTML files, the other half will only accept binary files. Between them, the two branches accept every file, but the split allows different filters to be configured for each type of file. HTML files are limited to the current server, and will only be downloaded if they have been modified in the last 7 days, whereas binary files can be on the 'next door' server, and must be either GIF or JPEG files to be downloaded.
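The whole tree described above might be modelled as follows. This is a sketch under stated assumptions rather than the program's real configuration: `FilterNode`, the `url`, `kind`, `server` and `age_days` fields, and the server labels are hypothetical, and the binary branch is assumed to allow both the current server and the 'next door' server.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FilterNode:
    """Hypothetical filter node: children are ANDed with the parent, siblings are OR'd."""
    test: Callable[[dict], bool]
    children: List["FilterNode"] = field(default_factory=list)

    def accepts(self, obj: dict) -> bool:
        if not self.test(obj):
            return False
        return not self.children or any(c.accepts(obj) for c in self.children)


# HTML branch: HTML file -> current server -> modified in the last 7 days.
html_branch = FilterNode(
    lambda o: o["kind"] == "html",
    [FilterNode(lambda o: o["server"] == "current",
                [FilterNode(lambda o: o["age_days"] <= 7)])])

# Binary branch: binary file -> allowed server -> GIF or JPEG (sibling filters).
binary_branch = FilterNode(
    lambda o: o["kind"] == "binary",
    [FilterNode(lambda o: o["server"] in ("current", "next-door"),  # assumption
                [FilterNode(lambda o: o["url"].lower().endswith(".gif")),
                 FilterNode(lambda o: o["url"].lower().endswith((".jpg", ".jpeg")))])])

# Root: drop anything whose URL contains 'adserver', then branch by file type.
root = FilterNode(lambda o: "adserver" not in o["url"], [html_branch, binary_branch])

page = {"url": "http://example.com/news.html", "kind": "html",
        "server": "current", "age_days": 3}
logo = {"url": "http://img.example.com/logo.gif", "kind": "binary",
        "server": "next-door", "age_days": 400}
advert = {"url": "http://adserver.example.com/banner.gif", "kind": "binary",
          "server": "next-door", "age_days": 1}

print(root.accepts(page), root.accepts(logo), root.accepts(advert))
```

The recently changed page and the old GIF are both accepted, while the advertisement is rejected at the root before either branch is consulted.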