How to Use Filters
You can use filters to limit the paths and types of files the spider will explore.
For instance, you can tell the
spider not to download graphics embedded on webpages,
to download only from a particular website, not to download
from a certain directory on a website, or to exclude particular
websites.
Filters can apply to the whole project or to a single task. Project-wide filters are set
in the Project Properties dialog box. To open it, click the Properties button on the Project Toolbar.
Task filters are set in the Task Properties dialog box. To open it, select a task and click
the Properties button on the Task Toolbar, or click the New Task button to create a new task
and set its properties.
Task filters are combined with the current project filters using a logical AND
every time the program decides whether to retrieve a URL: a URL is retrieved
only if both the project filter and the task filter accept it. Thus, project
filters affect every task in a project.
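The AND combination described above can be pictured with a minimal sketch. The `UrlFilter` type, the `should_retrieve` function, and the two sample filters are hypothetical names chosen for illustration; they are not part of the program.

```python
from typing import Callable

# Assumption: a filter is modelled as a function that takes a URL and
# returns True (accept) or False (reject).
UrlFilter = Callable[[str], bool]


def should_retrieve(url: str, project_filter: UrlFilter, task_filter: UrlFilter) -> bool:
    """Accept a URL only if both the project filter and the task filter accept it."""
    return project_filter(url) and task_filter(url)


# Hypothetical project filter: exclude anything in a cgi-bin directory.
def project_filter(url: str) -> bool:
    return "/cgi-bin/" not in url


# Hypothetical task filter: retrieve only from one website.
def task_filter(url: str) -> bool:
    return url.startswith("http://domain.com/")


print(should_retrieve("http://domain.com/index.htm", project_filter, task_filter))
print(should_retrieve("http://domain.com/cgi-bin/form.pl", project_filter, task_filter))
print(should_retrieve("http://other.com/index.htm", project_filter, task_filter))
```

Only the first URL passes both filters. The second is rejected by the project filter and the third by the task filter, which shows why a project filter restricts every task in the project.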
There are four types of filters:
- link filters are used to prevent the spider from following certain links on webpages.
The most common link filter you will use is "Retrieve only from" which is used to limit
the downloads to a particular website. This is the default option when you download websites
several levels deep. If you want more control over the link filters, you should use advanced
pattern matching filters.
- image filters are used to exclude embedded images on all or some of the pages. The most
common image filters you will use are "Retrieve only from" which can be used to retrieve
images only from a particular website or a directory, and "Do not retrieve" which can be set
if you are interested only in text of webpages and do not want to download embedded graphics.
If you want more control over the image filters, you should use advanced pattern matching filters.
- embedded object filters are used to exclude embedded objects other than images such as
sounds, Flash movies, or Java Applets.
- dynamic object filters are project-wide filters used to exclude images, pages, and
other objects dynamically generated by scripts when you browse them in the built-in browser.
This filter is available only in the Project Properties window and takes effect only when you
browse webpages using the built-in browser.
You can also filter files by size. You can specify that the program should retrieve pages or files
that are no more than a certain maximum size and/or no less than a certain minimum size.
Type in the maximum size or a range of sizes in the Max Size or Min Size - Max Size text
box for each type of filter. This is an optional parameter. If you do not provide the sizes,
the program will download files regardless of their sizes.
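As a rough illustration of the optional size limits, the check could look like the sketch below. The function name `size_allowed` and the use of bytes as the unit are assumptions; the text above does not state which unit the size boxes use.

```python
from typing import Optional


def size_allowed(size: int,
                 min_size: Optional[int] = None,
                 max_size: Optional[int] = None) -> bool:
    """Return True if a file of the given size passes the optional limits.

    Assumption: all sizes are in the same unit (bytes here).
    A limit of None means that limit was left blank.
    """
    if min_size is not None and size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True


print(size_allowed(50_000))                                   # no limits set
print(size_allowed(50_000, max_size=10_000))                  # over the maximum
print(size_allowed(50_000, min_size=1_000, max_size=100_000)) # inside the range
```

When both limits are left as `None`, every file passes, which mirrors the behavior described above for empty size boxes.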
Pattern Matching Filters:
A pattern matching filter is a logical expression that is evaluated for each URL of a task
to determine whether the program should retrieve the URL. A logical expression
evaluates to either true or false. If a filter applied to a URL evaluates
to true, the URL is accepted and put into the download queue. If a
filter evaluates to false, the URL is rejected and not retrieved.
The operands of the logical expression are URLs or URL patterns. The operators are
AND, OR, and NOT (operators can be written in lowercase or uppercase).
Expressions can include subexpressions enclosed in parentheses.
URL patterns can include two kinds of wildcard characters: an asterisk (*)
and a question mark (?).
| Character | Usage | Example |
| * | Matches zero or more characters. In most cases it is used as the first or last character in a URL pattern. | http://domain.com/* matches http://domain.com/, http://domain.com/index.com, http://domain.com/images/logo.jpg |
| ? | Matches any single character. | img??.gif matches img01.gif, img35.gif |
- In almost every case you will use only the asterisk. And in almost every case you will use at least one asterisk per pattern.
- In most cases you will have to type two asterisks: one at the start of a pattern, and the other at the end of a pattern.
- If a pattern does not start with http, https, ftp, or file, the program will automatically prepend an asterisk to the pattern when matching the pattern to a URL.
- If a pattern is not meant to match the endings of URLs (e.g. file extensions like *.shtm or *.jpeg),
you must type an asterisk at the end of a URL pattern.
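The wildcard rules above can be approximated with a short sketch that translates a URL pattern into a regular expression. This is not the program's actual matching code: the function names are hypothetical, and the sketch assumes that matching is case-sensitive and that a pattern must match the whole URL.

```python
import re

# Prefixes named in the text; patterns starting with one of them are not
# given an automatic leading asterisk.
SCHEME_PREFIXES = ("http", "https", "ftp", "file")


def pattern_to_regex(pattern: str) -> str:
    """Translate a URL pattern with * and ? wildcards into a regular expression."""
    # Rule from the text: prepend an asterisk unless the pattern starts with
    # a known scheme. (A pattern already starting with * needs no extra one.)
    if not pattern.startswith(SCHEME_PREFIXES) and not pattern.startswith("*"):
        pattern = "*" + pattern
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # zero or more characters
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def url_matches(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern (assumed case-sensitive)."""
    return re.fullmatch(pattern_to_regex(pattern), url) is not None


print(url_matches("http://domain.com/*", "http://domain.com/images/logo.jpg"))  # True
print(url_matches("img??.gif", "http://domain.com/img01.gif"))                  # True
print(url_matches("domain.com", "http://domain.com/index.htm"))                 # False
```

The last call returns `False` because the pattern has no trailing asterisk, so it would only match URLs that end in `domain.com`. This is the situation the final bullet above warns about.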
Note that a filter is not a list of URLs that must be included
or excluded. It is a logical expression that is evaluated for each URL to
decide whether to download it, and each URL must satisfy the whole expression
to be accepted. This is why you must use OR to join different URLs. If you join
two URL patterns with AND, a URL must match both patterns to be accepted. Joining
two fully qualified URLs (not containing wildcard characters) with AND makes no
sense, because no URL can be equal to two different URLs at once.
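To make the difference between OR and AND concrete, here is a minimal sketch of how such an expression could be evaluated for one URL. It is an illustration, not the program's parser. The names `wildcard_match` and `evaluate` are hypothetical, and the sketch assumes the usual operator precedence (NOT binds tightest, then AND, then OR) and that patterns contain no spaces or parentheses.

```python
import re

OPERATORS = {"and", "or", "not"}


def wildcard_match(pattern: str, url: str) -> bool:
    """Match a URL against a pattern in which * is any run and ? is one character."""
    if not pattern.startswith(("http", "https", "ftp", "file", "*")):
        pattern = "*" + pattern  # automatic leading asterisk, as described in the text
    regex = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c)
                    for c in pattern)
    return re.fullmatch(regex, url) is not None


def evaluate(filter_text: str, url: str) -> bool:
    """Evaluate a filter expression for one URL with a small recursive-descent parser."""
    tokens = re.findall(r"\(|\)|[^\s()]+", filter_text)
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "or":
            advance()
            rhs = parse_and()  # always parse the right side to consume its tokens
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "and":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "not":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        if peek() is None:
            raise ValueError("unexpected end of filter")
        tok = advance()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if tok == ")" or tok.lower() in OPERATORS:
            raise ValueError(f"unexpected token: {tok!r}")
        return wildcard_match(tok, url)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result


url = "http://www.domain-2.com/about.htm"
print(evaluate("*domain-1.com* or *domain-2.com*", url))   # True
print(evaluate("*domain-1.com* and *domain-2.com*", url))  # False
```

With OR, the URL is accepted because it matches one of the two patterns. With AND, it is rejected because it does not contain both domain names.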
You can check if a URL is accepted or rejected by a task filter
on the Test Filters page. To view the Test Filters page,
click on a task and select Test Filters from the Task Menu.
Type in a URL, select a type of filter, and press the Test button.
The page will reload showing you the result of the test (accepted or not)
as well as the detailed explanation of why the URL was accepted or
rejected by the filter. Every pattern in both the task filter and
the project filter is highlighted in green or red depending on whether
the pattern accepts or rejects the URL.
Examples for link filters:
| Filter | Description |
| *domain-1.com* or *domain-2.com* | Matches any URL that contains either domain-1.com or domain-2.com, such as http://domain-1.com/, http://domain-2.com/, http://www.domain-1.com/, http://www.domain-2.com/about.htm, ftp://ftp.domain-1.com/archive.zip, and http://go.to/cgi-bin/redirect.exe?domain-1.com. Does not match URLs that contain neither domain-1.com nor domain-2.com. |
| http://domain.com/* or http://www.domain.com/* | Matches any URL that begins with http://domain.com/ or with http://www.domain.com/. Does not match ftp://ftp.domain.com/archive.zip, http://mail.domain.com/, or https://www.domain.com/. |
| http://*.domain.com/* | Matches http://www.domain.com/. Does not match http://domain.com/ because there is no dot before domain.com. |
| http://*domain.com/* and (*.shtm or *.htm) | Matches any URL from the domain.com website ending with .shtm or .htm. Does not match http://domain.com/ because it does not end with .shtm or .htm. |
| (*domain-1.com* or *domain-2.com*) and (*.shtm or *.htm) | Matches any URL from domain-1.com or domain-2.com ending with either .shtm or .htm. |
| *domain.com* and not *domain.com/cgi-bin/* | Matches any page from domain.com but not from the cgi-bin directory on that site. |
| *domain.com* and not (*domain.com/dir1* and not *domain.com/dir1/dir2*), which can also be written as *domain.com* and not *domain.com/dir1* or *domain.com/dir1/dir2* | Matches any page from domain.com except pages within the dir1 directory. Pages in its subdirectory dir2 are still accepted. |
Examples for image filters:
| *.jpeg OR *.jpg | will download only jpeg files |
| not (*doubleclick.com* or *humanclick.com*) | will not retrieve images from those two ad banner serving websites. |
| *cool-images.com* | will download images only from the cool-images.com website. |
| *my-domain.com/my-photos/* | will download images only from my-photos directory on the my-domain.com website. |
| not * | can be used to exclude all embedded images. This pattern does not match any URL, so it has the same effect as setting the "Do not retrieve" option. |