Custom Parsers in Internet Researcher
The HTML parser in Offline Commander and Internet Researcher does not
extract JavaScript links from Web pages. The problem can be partially
solved by loading the page in the built-in browser and manually
clicking the JavaScript links. However, this works only if the number of
JavaScript links on a Web site is relatively small. If you intend to
download a large Web site with essential information hidden behind
JavaScript links, clicking through thousands of links is not an option.
Internet Researcher 1.6 solves this problem by letting you plug in
your own custom parsers. The following description of custom parsers is intended
only for programmers.
A custom parser is a Windows DLL. You can use any programming language capable
of creating Windows DLLs. The DLL must export this function:
BOOL __stdcall CpFunc(LPCTSTR pszUrl, LPBYTE pBody, DWORD dwBodySize, LPCTSTR pszResultsFileName);
Internet Researcher (IR) calls this function each time it downloads a new page.
The name of the function must be CpFunc. The 1st parameter is a pointer to a null-terminated string containing the URL of the page.
The 2nd parameter is a pointer to the memory buffer holding the contents of the page.
The 3rd parameter is a 32-bit integer holding the size of the memory buffer pointed to by the 2nd parameter.
You must not read past the end of the buffer. You can modify the contents of the page.
The 4th parameter is a pointer to a null-terminated string containing the name of the file to which
your parser writes its results (see below). The function should return 1 if the page was parsed and the
results were written to the file named by the 4th parameter. Otherwise the function should
return zero.
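The contract above can be sketched as a minimal parser in C. This is only an illustration under several assumptions that are not part of the IR documentation: the site is assumed to write its links as javascript:go('...') with an absolute URL as the argument, the DLL is assumed to be built without UNICODE so that LPCTSTR is a plain char string, and find_bytes is a hypothetical helper. The stand-in typedefs exist only so the sketch also compiles outside Windows.

```c
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#else
/* Stand-in definitions so the sketch compiles off Windows (assumption). */
typedef int BOOL;
typedef unsigned char *LPBYTE;
typedef unsigned int DWORD;
typedef const char *LPCTSTR;
#ifndef __stdcall
#define __stdcall
#endif
#endif

/* Hypothetical helper: bounded search that never reads past hay + hay_len. */
static const unsigned char *find_bytes(const unsigned char *hay, size_t hay_len,
                                       const char *needle)
{
    size_t n = strlen(needle);
    size_t i;
    if (hay == NULL || n == 0 || hay_len < n)
        return NULL;
    for (i = 0; i + n <= hay_len; i++)
        if (memcmp(hay + i, needle, n) == 0)
            return hay + i;
    return NULL;
}

BOOL __stdcall CpFunc(LPCTSTR pszUrl, LPBYTE pBody, DWORD dwBodySize,
                      LPCTSTR pszResultsFileName)
{
    /* Assumed site convention, not something IR defines. */
    static const char marker[] = "javascript:go('";
    const unsigned char *p = pBody;
    const unsigned char *end;
    FILE *out = NULL;
    int found = 0;

    (void)pszUrl; /* the page URL is not needed by this sketch */
    if (pBody == NULL || pszResultsFileName == NULL)
        return 0;
    end = pBody + dwBodySize;

    while ((p = find_bytes(p, (size_t)(end - p), marker)) != NULL) {
        const unsigned char *start = p + sizeof marker - 1;
        const unsigned char *q = memchr(start, '\'', (size_t)(end - start));
        if (q == NULL)
            break; /* unterminated argument: stop scanning */
        if (q > start) {
            /* Open the results file lazily, in binary mode, on the first hit. */
            if (out == NULL && (out = fopen(pszResultsFileName, "wb")) == NULL)
                return 0;
            fwrite(start, 1, (size_t)(q - start), out);
            fputs("\r\n0\r\n", out); /* relation 0 = A HREF */
            found++;
        }
        p = q + 1;
    }
    if (out != NULL)
        fclose(out);
    return found > 0 ? 1 : 0; /* 0 tells IR this parser produced nothing */
}
```

With 32-bit Microsoft compilers a __stdcall function is normally exported under a decorated name such as _CpFunc@16, so listing CpFunc under EXPORTS in a module-definition (.def) file is one common way to export the plain name.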
The name of the DLL should start with "cp" (example: cp123.dll). The DLL must be placed in the
"Parsers" subdirectory of the directory where the file ir.exe is located. In most installations this
will be "c:\Program Files\Internet Researcher\Parsers\".
The Parsers directory may contain more than one parser DLL. All DLLs in this directory are loaded
at run time. For each downloaded page IR calls the DLLs in unspecified order until one of them returns 1.
If a DLL returns zero, the next DLL is called, and so on. IR logs all plug-in activity
to the file cp.log, which you may use to debug your custom parser.
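Because the call order is unspecified, a parser written for one site would normally return zero for every other page so that the remaining DLLs still get a chance. A minimal sketch of such a check in C follows; url_has_prefix is a hypothetical helper, not part of IR, and it assumes the URL is a plain char string.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if url starts with prefix, ignoring ASCII
 * case, and 0 otherwise (including when either pointer is NULL). */
static int url_has_prefix(const char *url, const char *prefix)
{
    if (url == NULL || prefix == NULL)
        return 0;
    for (; *prefix != '\0'; url++, prefix++) {
        /* If url ends first, '\0' never matches a prefix character. */
        if (tolower((unsigned char)*url) != tolower((unsigned char)*prefix))
            return 0;
    }
    return 1;
}
```

Inside CpFunc the check could be used as the first statement, for example: if (!url_has_prefix(pszUrl, "http://www.example.com/")) return 0;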
The format of the file with results must be as follows:
URL\r\n
Relation\r\n
URL\r\n
Relation\r\n
......
URL\r\n
Relation\r\n
where URL is a URL extracted from the page (its scheme must be one of http, https, file, ftp),
"\r\n" is the end of line (CR LF), and Relation is a number specifying the relation
of this URL to the parent URL. Possible relation values:
| Value | Meaning |
| 0 | A HREF |
| 1 | IMG SRC |
| 2 | BODY BACKGROUND |
| 3 | TABLE BACKGROUND |
| 4 | FRAME SRC |
| 5 | IFRAME SRC |
| 6 | SCRIPT SRC |
| 7 | BGSOUND SRC |
| 8 | INPUT SRC |
| 9 | EMBED SRC |
| 10 | APPLET |
| 11 | CSS |
| 12 | Refresh |
| 13 | Redirect |
| 14 | used internally |
| 15 | TD BACKGROUND |
| 16 | TH BACKGROUND |
C programmers may use this enumeration:
enum
{
    REL_A_HREF           = 0,
    REL_IMG_SRC          = 1,
    REL_BODY_BACKGROUND  = 2,
    REL_TABLE_BACKGROUND = 3,
    REL_FRAME_SRC        = 4,
    REL_IFRAME_SRC       = 5,
    REL_SCRIPT_SRC       = 6,
    REL_BGSOUND_SRC      = 7,
    REL_INPUT_SRC        = 8,
    REL_EMBED_SRC        = 9,
    REL_APPLET           = 10,
    REL_CSS              = 11,
    REL_REFRESH          = 12,
    REL_REDIRECT         = 13,
    REL_DYNAMIC          = 14, /* used internally */
    REL_TD_BCKGRND       = 15,
    REL_TH_BCKGRND       = 16
};
Example of a results file:
http://www.example.com/pages/somepage.html
0
http://www.example.com/somepage1.html
0
http://www.example.com/img/pic.gif
1
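A file like the one above could be produced with a small helper along the following lines. This is a sketch, not IR code: scheme_allowed, write_result and write_example are hypothetical names, the scheme check is a simple case-sensitive prefix test, and relation values are passed as plain integers from the table above. The file is opened in binary mode ("wb") because a text-mode stream on Windows would turn the "\n" in "\r\n" into a second CR LF pair.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 1 if url starts with one of the accepted schemes. */
static int scheme_allowed(const char *url)
{
    static const char *const schemes[] = { "http://", "https://", "ftp://", "file://" };
    size_t i;
    for (i = 0; i < sizeof schemes / sizeof schemes[0]; i++)
        if (strncmp(url, schemes[i], strlen(schemes[i])) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: appends one URL/Relation record, each line ending
 * in CR LF. Returns 1 on success, 0 if the record was rejected or the
 * write failed. */
static int write_result(FILE *out, const char *url, int relation)
{
    if (out == NULL || url == NULL || !scheme_allowed(url))
        return 0;
    if (relation < 0 || relation > 16) /* values outside the table */
        return 0;
    return fprintf(out, "%s\r\n%d\r\n", url, relation) > 0;
}

/* Writes the three records of the example results file to path. */
static int write_example(const char *path)
{
    FILE *out = fopen(path, "wb"); /* binary mode keeps "\r\n" untouched */
    int ok;
    if (out == NULL)
        return 0;
    ok = write_result(out, "http://www.example.com/pages/somepage.html", 0)
      && write_result(out, "http://www.example.com/somepage1.html", 0)
      && write_result(out, "http://www.example.com/img/pic.gif", 1);
    /* fclose is evaluated first so the file is always closed. */
    return (fclose(out) == 0) && ok;
}
```

In a real parser the path would be the 4th argument of CpFunc, and CpFunc would return 1 only if at least one record was written successfully.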