Internet Researcher ($59.95)

Offline Commander ($39.95)
User Manual

How to Use Filters

You can use filters to limit the paths and types of files the spider will explore. For instance, you can tell the spider not to download graphics embedded on webpages, to download only from a particular website, not to download from a certain directory on a website, or to exclude particular websites.

The scope of a filter can be project-wide or limited to a single task. Project-wide filters are set in the Project Properties dialog box. To open the Project Properties window, click the Properties button on the Project Toolbar. Task filters are set in the Task Properties dialog box. To open the Task Properties window, select a task and click the Properties button on the Task Toolbar, or click the New Task button to create a new task and set its properties. Every time the program needs to decide whether to retrieve a URL, it combines the task filters with the current project filters using the logical AND operation. Thus, project filters affect every task in a project.

There are four types of filters:

  • link filters are used to prevent the spider from following certain links on webpages. The most common link filter you will use is "Retrieve only from" which is used to limit the downloads to a particular website. This is the default option when you download websites several levels deep. If you want more control over the link filters, you should use advanced pattern matching filters.
  • image filters are used to exclude embedded images on all or some of the pages. The most common image filters you will use are "Retrieve only from" which can be used to retrieve images only from a particular website or a directory, and "Do not retrieve" which can be set if you are interested only in text of webpages and do not want to download embedded graphics. If you want more control over the image filters, you should use advanced pattern matching filters.
  • embedded object filters are used to exclude embedded objects other than images such as sounds, Flash movies, or Java Applets.
  • dynamic object filters are project-wide filters used to exclude images, pages, and other objects dynamically generated by scripts when you browse them in the built-in browser. This filter is available only in the Project Properties window and takes effect only when you browse webpages using the built-in browser.

You can also filter files by size. You can specify that the program should retrieve pages or files that are no more than a certain maximum size and/or no less than a certain minimum size. Type in the maximum size or a range of sizes in the Max Size or Min Size - Max Size text box for each type of filter. This is an optional parameter. If you do not provide the sizes, the program will download files regardless of their sizes.
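The size rule described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name size_accepted, its parameters, and the use of bytes as the unit are assumptions made for the example.

```python
def size_accepted(size, min_size=None, max_size=None):
    """Return True if a file of the given size passes the optional limits.

    Assumption: all sizes use the same unit (bytes here). None means
    that the corresponding limit was left empty in the dialog box.
    """
    if min_size is not None and size < min_size:
        return False  # smaller than the minimum size
    if max_size is not None and size > max_size:
        return False  # larger than the maximum size
    return True       # no limit set, or the size is within the range


print(size_accepted(500))                                # no limits given
print(size_accepted(500, max_size=100))                  # over the maximum
print(size_accepted(500, min_size=100, max_size=1000))   # within the range
```

Leaving both limits empty accepts every file, which corresponds to the default behavior of downloading files regardless of their sizes.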

Pattern Matching Filters:

A pattern matching filter is a logical expression that is evaluated for each URL of a task to determine whether the program should retrieve the URL. A logical expression evaluates to either true or false. If a filter applied to a URL evaluates to true, the URL is accepted and put into the download queue. If the filter evaluates to false, the URL is rejected and not retrieved.

The operands of the logical expression are URLs or URL patterns. The operators are AND, OR, and NOT (operators can be written in lowercase or uppercase). Expressions can include subexpressions enclosed in parentheses. URL patterns can include two kinds of wildcard characters: an asterisk (*) and a question mark (?).

Character | Usage | Example
* | Matches zero or more characters. In most cases it is used as the first or last character in a URL pattern. | http://domain.com/* matches http://domain.com/, http://domain.com/index.com, http://domain.com/images/logo.jpg
? | Matches any single character. | img??.gif matches img01.gif, img35.gif

  • In almost every case you will use only the asterisk. And in almost every case you will use at least one asterisk per pattern.
  • In most cases you will have to type two asterisks: one at the start of a pattern, and the other at the end of a pattern.
  • If a pattern does not start with http, https, ftp, or file, the program will automatically prepend an asterisk to the pattern when matching the pattern to a URL.
  • If a pattern is not meant to match the endings of URLs (e.g. file extensions like *.shtm or *.jpeg), you must type an asterisk at the end of a URL pattern.
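The wildcard rules above can be sketched in Python by translating a pattern into a regular expression. This is a minimal sketch, not the program's own implementation: the names pattern_to_regex and matches are hypothetical, and case-sensitive matching is an assumption, since the manual does not say how letter case is handled.

```python
import re

# Prefixes that suppress the automatic leading asterisk.
SCHEMES = ("http", "https", "ftp", "file")


def pattern_to_regex(pattern):
    """Translate a URL pattern with * and ? into a compiled regex."""
    if not pattern.lower().startswith(SCHEMES):
        pattern = "*" + pattern  # automatic leading asterisk
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, url):
    """Return True if the whole URL matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(url) is not None


print(matches("http://domain.com/*", "http://domain.com/images/logo.jpg"))
print(matches("img??.gif", "http://domain.com/img01.gif"))
print(matches("domain.com", "http://domain.com/index.htm"))
```

The last call illustrates why the trailing asterisk matters: domain.com receives only the automatic leading asterisk, so it matches URLs that end with domain.com and nothing else.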

Note that a filter is not a list of URLs that must be included or excluded. It is a logical expression that is evaluated for each URL to decide whether to download it, and a URL must satisfy the whole expression to be accepted. This is why you must use OR to join alternative URLs. If you join two URL patterns with AND, a URL must match both patterns to be accepted. Joining two fully qualified URLs (containing no wildcard characters) with AND makes no sense, because no URL can be equal to both.
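To make the difference between AND and OR concrete, here is a small Python sketch of how such an expression could be evaluated for a single URL. It is an illustration only: the function names are hypothetical, the operator precedence (NOT first, then AND, then OR) is an assumption the manual does not state, and patterns are assumed to contain no spaces or parentheses.

```python
import re


def wildcard_match(pattern, url):
    """Match one pattern: * is any run of characters, ? is any one character."""
    if not pattern.lower().startswith(("http", "https", "ftp", "file")):
        pattern = "*" + pattern  # automatic leading asterisk
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, url, re.DOTALL) is not None


def evaluate(filter_text, url):
    """Evaluate a filter expression for one URL (assumed precedence: NOT, AND, OR)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", filter_text)
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "or":
            take()
            right = parse_and()  # always parse the right side to consume its tokens
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "and":
            take()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() == "not":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return value
        if peek() in (None, ")", "and", "or"):
            raise ValueError("URL pattern expected")
        return wildcard_match(take(), url)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


url = "http://www.domain-2.com/about.htm"
print(evaluate("*domain-1.com* or *domain-2.com*", url))   # one side matches
print(evaluate("*domain-1.com* and *domain-2.com*", url))  # both sides must match
```

With OR, the URL is accepted as soon as either pattern matches. With AND, the same URL is rejected because it does not contain domain-1.com.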

You can check whether a URL is accepted or rejected by a task filter on the Test Filters page. To view the Test Filters page, click on a task and select Test Filters from the Task Menu. Type in a URL, select a type of filter, and press the Test button. The page will reload and show the result of the test (accepted or not), together with a detailed explanation of why the filter accepted or rejected the URL. Every pattern in both the task filter and the project filter is highlighted in green or red, depending on whether the pattern accepts or rejects the URL.

Examples for link filters:

Filter: *domain-1.com* or *domain-2.com*
Description: Matches any URL that contains either domain-1.com or domain-2.com, such as:
  • http://domain-1.com/
  • http://domain-2.com/
  • http://www.domain-1.com/
  • http://www.domain-2.com/about.htm
  • ftp://ftp.domain-1.com/archive.zip
  • http://go.to/cgi-bin/redirect.exe?domain-1.com
Does not match URLs that contain neither domain-1.com nor domain-2.com.

Filter: http://domain.com/* or http://www.domain.com/*
Description: Matches any URL that begins with http://domain.com/ or with http://www.domain.com/. Does not match:
  • ftp://ftp.domain.com/archive.zip
  • http://mail.domain.com/
  • https://www.domain.com/

Filter: http://*.domain.com/*
Description: Matches http://www.domain.com/. Does not match http://domain.com/ because there is no dot before domain.com.

Filter: http://*domain.com/* and (*.shtm or *.htm)
Description: Matches any URL from the domain.com website ending with .shtm or .htm. Does not match http://domain.com/ because it does not end with .shtm or .htm.

Filter: (*domain-1.com* or *domain-2.com*) and (*.shtm or *.htm)
Description: Matches any URL from domain-1.com or domain-2.com ending with either .shtm or .htm.

Filter: *domain.com* and not *domain.com/cgi-bin/*
Description: Matches any page from domain.com except pages in the cgi-bin directory on that site.

Filter: *domain.com* and not (*domain.com/dir1* and not *domain.com/dir1/dir2*)
This can also be written as: *domain.com* and not *domain.com/dir1* or *domain.com/dir1/dir2*
Description: Matches any page from domain.com except pages within the dir1 directory. Pages in its subdirectory dir2 are an exception and are accepted.
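The equivalence of the two forms of the dir1/dir2 filter can be checked with a short truth table. The Python sketch below is an illustration under one assumption: that the program gives NOT the highest precedence, then AND, then OR, as Python does. The names a, b, and c are shorthand introduced for this example.

```python
from itertools import product

# a: URL matches *domain.com*
# b: URL matches *domain.com/dir1*
# c: URL matches *domain.com/dir1/dir2*


def form_1(a, b, c):
    return a and not (b and not c)


def form_2(a, b, c):
    return a and not b or c  # parsed as (a and (not b)) or c


def forms_agree():
    """Compare both forms on every combination a real URL can produce."""
    for a, b, c in product([False, True], repeat=3):
        # A URL matching the dir2 pattern also matches the dir1 pattern,
        # and a URL matching the dir1 pattern also matches *domain.com*.
        if (c and not b) or (b and not a):
            continue  # impossible combination, skip it
        if form_1(a, b, c) != form_2(a, b, c):
            return False
    return True


print(forms_agree())  # prints True
```

The two forms differ only for combinations that no URL can produce, such as a URL that matches the dir2 pattern without matching *domain.com*. This is why the shorter form is safe for this particular filter but is not a general rule for rewriting expressions.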

Examples for image filters:

Filter: *.jpeg OR *.jpg
Description: Downloads only JPEG files.

Filter: not (*doubleclick.com* or *humanclick.com*)
Description: Does not retrieve images from those two ad banner serving websites.

Filter: *cool-images.com*
Description: Downloads images only from the cool-images.com website.

Filter: *my-domain.com/my-photos/*
Description: Downloads images only from the my-photos directory on the my-domain.com website.

Filter: not *
Description: Excludes all embedded images. This filter does not match any URL, so it has the same effect as setting the "Do not retrieve" option.