Custom Parsers in Internet Researcher

The HTML parser in Offline Commander and Internet Researcher does not extract JavaScript links from Web pages. The problem can be partially solved by loading the page in the built-in browser and clicking the JavaScript links manually. However, this works only if the number of JavaScript links on a Web site is relatively small. If you intend to download a large Web site with essential information hidden behind JavaScript links, clicking through thousands of links is not an option.

Internet Researcher 1.6 solves this problem by letting you plug in your own custom parsers. The following description of custom parsers is intended for programmers only.

A custom parser is a Windows DLL. You can use any programming language capable of creating Windows DLLs. The DLL must export this function:

BOOL __stdcall CpFunc(LPCTSTR pszUrl, LPBYTE pBody, DWORD dwBodySize, LPCTSTR pszResultsFileName);

Internet Researcher (IR) calls this function each time it downloads a new page.

The name of the function must be CpFunc. The 1st parameter is a pointer to a null-terminated string containing the URL of the page. The 2nd parameter is a pointer to the memory buffer holding the contents of the page. The 3rd parameter is a 32-bit integer holding the size of the memory buffer pointed to by the 2nd parameter. You must not read past the end of the buffer, but you may modify the contents of the page. The 4th parameter is a pointer to a null-terminated string containing the name of the file to which your parser writes its results (see below). The function should return 1 if the page was parsed and the results were written to the file named by the 4th parameter. Otherwise, the function should return zero.
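The parameter rules above can be illustrated with a minimal sketch of CpFunc. It is only one possible shape, written under several assumptions: the link pattern `window.open('` is hypothetical and stands in for whatever JavaScript construct your target site uses, the build is assumed to be non-Unicode (so LPCTSTR is a plain char pointer and fopen can take it directly), and the CP_CALL macro and the stand-in typedefs exist only so the sketch also compiles outside Windows. The scan is bounded by dwBodySize throughout, because the page buffer is not assumed to be null-terminated.

```c
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#define CP_CALL __stdcall
#else
/* Stand-in types so the sketch also compiles outside Windows (assumption). */
typedef int BOOL;
typedef const char *LPCTSTR;
typedef unsigned char *LPBYTE;
typedef unsigned int DWORD;
#define CP_CALL
#endif

/* Hypothetical link pattern; a real site will need its own. */
static const char MARKER[] = "window.open('";

/* Bounded prefix test: never looks past p + len. */
static int has_prefix(const unsigned char *p, size_t len, const char *prefix)
{
    size_t n = strlen(prefix);
    return len >= n && memcmp(p, prefix, n) == 0;
}

BOOL CP_CALL CpFunc(LPCTSTR pszUrl, LPBYTE pBody, DWORD dwBodySize,
                    LPCTSTR pszResultsFileName)
{
    const size_t markerLen = sizeof(MARKER) - 1;
    size_t size = dwBodySize;
    size_t i = 0;
    int found = 0;
    FILE *out = NULL;

    (void)pszUrl; /* the page URL is not needed in this sketch */

    while (i + markerLen <= size) {
        size_t start, end;
        if (memcmp(pBody + i, MARKER, markerLen) != 0) {
            i++;
            continue;
        }
        start = i + markerLen;
        end = start;
        while (end < size && pBody[end] != '\'')
            end++;
        if (end == size)
            break; /* no closing quote inside the buffer */
        /* Only absolute http/https URLs are reported in this sketch. */
        if (has_prefix(pBody + start, end - start, "http://") ||
            has_prefix(pBody + start, end - start, "https://")) {
            if (out == NULL) {
                out = fopen(pszResultsFileName, "wb");
                if (out == NULL)
                    return 0;
            }
            fwrite(pBody + start, 1, end - start, out);
            fputs("\r\n0\r\n", out); /* relation 0 = A HREF */
            found++;
        }
        i = end + 1;
    }
    if (out != NULL)
        fclose(out);
    return found > 0 ? 1 : 0; /* 1 only if results were written */
}
```

The sketch returns zero when it finds nothing, so that other parsers still get a chance at the page. Exporting the function from the DLL under the undecorated name CpFunc (for example with a module-definition file) is a build-tool detail that is left out here.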

The name of the DLL should start with "cp" (example: cp123.dll). The DLL must be placed in the "Parsers" subdirectory of the directory where the file ir.exe is located. In most installations this will be "c:\Program Files\Internet Researcher\Parsers\".

The Parsers directory may contain more than one parser DLL. All DLLs in this directory are loaded at run time. For each downloaded page, IR calls the DLLs in unspecified order until one of them returns 1. If a DLL returns zero, the next DLL is called, and so on. IR logs all plug-in activity to the file cp.log, which you can use to debug your custom parser.
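The calling rule above can be modelled roughly as follows. This is not IR's actual code, only a sketch of the described behaviour: the names parser_fn, dispatch_page, declines_everything and accepts_example_com are all hypothetical, and plain C types replace the Windows ones. What IR does when no parser returns 1 is not described here, so the sketch simply reports -1 in that case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function-pointer type mirroring the shape of CpFunc. */
typedef int (*parser_fn)(const char *url, unsigned char *body,
                         unsigned int size, const char *results_file);

/* Calls each parser in array order until one returns 1.
   Returns the index of the parser that accepted the page, or -1. */
int dispatch_page(const parser_fn *parsers, size_t count, const char *url,
                  unsigned char *body, unsigned int size,
                  const char *results_file)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (parsers[i](url, body, size, results_file) == 1)
            return (int)i;
    }
    return -1;
}

/* Toy parser that never handles a page. */
int declines_everything(const char *url, unsigned char *body,
                        unsigned int size, const char *results_file)
{
    (void)url; (void)body; (void)size; (void)results_file;
    return 0;
}

/* Toy parser that claims only pages from one site. */
int accepts_example_com(const char *url, unsigned char *body,
                        unsigned int size, const char *results_file)
{
    static const char prefix[] = "http://www.example.com/";
    (void)body; (void)size; (void)results_file;
    return strncmp(url, prefix, sizeof(prefix) - 1) == 0 ? 1 : 0;
}
```

Because the order is unspecified, a parser should not rely on running before or after another one. Returning zero for pages it does not understand leaves them to the remaining parsers.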

The format of the file with results must be as follows:

URL\r\n
Relation\r\n
URL\r\n
Relation\r\n
......
URL\r\n
Relation\r\n

where URL is a URL extracted from the page (its scheme must be one of http, https, file, ftp), "\r\n" is the end of line (CR LF), and Relation is a number specifying the relation of this URL to the parent URL. Possible relation values:

Value Meaning
0 A HREF
1 IMG SRC
2 BODY BACKGROUND
3 TABLE BACKGROUND
4 FRAME SRC
5 IFRAME SRC
6 SCRIPT SRC
7 BGSOUND SRC
8 INPUT SRC
9 EMBED SRC
10 APPLET
11 CSS
12 Refresh
13 Redirect
14 used internally
15 TD BACKGROUND
16 TH BACKGROUND

C programmers may use this enumeration:

enum
{
      REL_A_HREF           = 0,
      REL_IMG_SRC          = 1,
      REL_BODY_BACKGROUND  = 2,
      REL_TABLE_BACKGROUND = 3,
      REL_FRAME_SRC        = 4,
      REL_IFRAME_SRC       = 5,
      REL_SCRIPT_SRC       = 6,
      REL_BGSOUND_SRC      = 7,
      REL_INPUT_SRC        = 8,
      REL_EMBED_SRC        = 9,
      REL_APPLET           = 10,
      REL_CSS              = 11,
      REL_REFRESH          = 12,
      REL_REDIRECT         = 13,
      REL_DYNAMIC          = 14,
      REL_TD_BCKGRND       = 15,
      REL_TH_BCKGRND       = 16
};
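Before writing a URL/Relation pair to the results file, a parser may want to check it against the constraints stated above. The helper below is one possible check; the names is_allowed_scheme and is_valid_entry are hypothetical. It assumes lower-case schemes followed by "://", and it accepts the whole range 0 to 16 without deciding whether a plug-in should ever emit the internally used value 14.

```c
#include <stddef.h>
#include <string.h>

/* Range of relation values listed in the table (assumed inclusive). */
enum { REL_MIN = 0, REL_MAX = 16 };

/* Returns 1 if url starts with one of the four accepted schemes. */
int is_allowed_scheme(const char *url)
{
    static const char *const schemes[] = {
        "http://", "https://", "file://", "ftp://"
    };
    size_t i;
    for (i = 0; i < sizeof(schemes) / sizeof(schemes[0]); i++) {
        if (strncmp(url, schemes[i], strlen(schemes[i])) == 0)
            return 1;
    }
    return 0;
}

/* Returns 1 if the pair looks acceptable for the results file. */
int is_valid_entry(const char *url, int relation)
{
    return url != NULL
        && is_allowed_scheme(url)
        && relation >= REL_MIN
        && relation <= REL_MAX;
}
```

With this check, a link such as mailto:someone@example.com is rejected, and so is any relation value outside the table.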

Example of a results file:

http://www.example.com/pages/somepage.html
0
http://www.example.com/somepage1.html
0
http://www.example.com/img/pic.gif
1
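A file like the one above could be produced with a small helper such as the following sketch. The struct cp_link and the function write_results are hypothetical names introduced for illustration. The file is opened in binary mode so that each "\r\n" is written exactly as given; in text mode on Windows the C runtime would expand the "\n" again.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical pair type: one extracted URL and its relation value. */
struct cp_link {
    const char *url;
    int relation;
};

/* Writes each link as "URL\r\nRelation\r\n".
   Returns 1 on success, 0 if the file could not be opened or written. */
int write_results(const char *path, const struct cp_link *links, size_t count)
{
    size_t i;
    int ok = 1;
    FILE *out = fopen(path, "wb"); /* binary mode keeps CR LF verbatim */

    if (out == NULL)
        return 0;
    for (i = 0; i < count; i++) {
        if (fprintf(out, "%s\r\n%d\r\n", links[i].url, links[i].relation) < 0)
            ok = 0;
    }
    if (fclose(out) != 0)
        ok = 0;
    return ok;
}
```

Called with the three links from the example and the file name received in the 4th parameter of CpFunc, this writes the six lines shown above, each terminated by CR LF.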