Web Content Extractor Documentation

Crawling Rules

Crawling Rules determine which links the program should follow.

First, select the level of pages the rules will apply to. The page level is the number of links the program has to follow to move from the start page to the current one. All start pages are level 1. All pages that can be opened from a start page are level 2. All pages that can be opened from a level 2 page are level 3, and so on.
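The level numbering above can be illustrated with a short sketch: a breadth-first walk over a small in-memory link graph (the page names here are hypothetical, purely for illustration):

```python
from collections import deque

# Hypothetical link graph: page -> pages reachable from it by one click.
links = {
    "start": ["catalog", "about"],
    "catalog": ["product1", "product2"],
    "product1": [],
    "product2": [],
    "about": [],
}

def page_levels(start):
    """Assign each page the level defined above: start page = 1,
    pages one click away = 2, and so on."""
    levels = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links[page]:
            if target not in levels:  # keep the first (smallest) level found
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels

print(page_levels("start"))
# {'start': 1, 'catalog': 2, 'about': 2, 'product1': 3, 'product2': 3}
```

Note that a page reachable by several routes gets the level of the shortest route, since the breadth-first walk reaches it that way first.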

There are two ways to add a level: by clicking the "Add" button or by clicking the "Duplicate" button. The "Add" button simply adds a new level that contains no rules. The "Duplicate" button adds a new level and copies all rules from the current level to the new one. To delete a level, select it and click the "Delete" button (only the last level can be deleted).

The "Page URL" field contains the URL of the page that will be used to create rules. You can type the URL manually or click the button and open the necessary page using the built-in browser. For the first level, the wizard sets the initial value of this field to the URL of the start page. You can leave this field empty.

There are two types of rules: Basic Rules and Advanced Rules:

  • Basic Rules determine the positions of the links the program should follow on a page.
  • Advanced Rules determine the patterns of the links the program should or should not follow.

Basic Rules allow you to specify the positions of the necessary links on a page. The program looks through these positions on every page of the corresponding level; if there is a link in a position, the program extracts its URL and adds it to the task list to download later.

To specify link positions, click the "Add" button, wait until the page has loaded, and click the links the program should follow. The positions of these links will be automatically selected in the right window. If you click a link by mistake, click it once again to clear the selection. Click the "OK" button and all the selected positions will be saved to the list. You can edit this list later by clicking the "Edit" button.

You can specify a script to change link URLs. This parameter is optional and you can skip it. It improves the performance of the program when a link URL is written in JavaScript, because the script can be turned into the URL of the target page without executing it. For example, suppose a link contains the script "javascript:goto('http://www.domain.com')", and executing it takes the browser to the page "http://www.domain.com". Here you can specify a script that transforms "javascript:goto('http://www.domain.com')" into "http://www.domain.com", so the program does not waste time loading the page and executing the script, but loads the target page at once. To specify the script, click the "URL Transformation Script" button. The script is applied to all links.
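The transformation described above can be sketched in Python (the tool accepts its own script format; this is only an illustration of the idea, and the "javascript:goto(...)" pattern is the example from the text):

```python
import re

def transform_url(url):
    """Turn "javascript:goto('http://...')" into the target URL directly,
    so the page can be loaded without executing the script.
    URLs that do not match the pattern are returned unchanged."""
    match = re.match(r"javascript:goto\('([^']+)'\)", url)
    return match.group(1) if match else url

print(transform_url("javascript:goto('http://www.domain.com')"))
# http://www.domain.com
print(transform_url("http://www.domain.com/page.html"))
# http://www.domain.com/page.html  (unchanged)
```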

The "Delete" button deletes the selected position. The "Up" and "Down" buttons move the selected position up or down the list respectively.

Advanced Rules allow you to specify the patterns of the links the program should or should not follow. If a page contains many links, or links are located in different positions on different pages, it is difficult or impossible to use Basic Rules, so you have to use Advanced Rules in these cases. You can specify the following types of rules:

  • "Follow links if link text equals" - if the link text equals the pattern, the program will follow this link. For example, if the pattern is "Next", the program will follow all links whose text is "Next". You can use the * wildcard character in the pattern; it stands for any sequence of characters. For example, if the pattern is "Next*", the program will follow all links whose text begins with "Next" ("Next >", "Next Page", "Next Record", etc.). If the pattern is "*Next*", the program will follow all links whose text contains the "Next" substring ("Go To Next Page", "Go To Next Record", etc.).
  • "Do NOT follow links if link text equals" - if the link text equals the pattern, the program will not follow this link. For example, if the pattern is "Last", the program will not follow any links whose text is "Last". You can use the * wildcard character in the pattern; it stands for any sequence of characters. For example, if the pattern is "Last*", the program will not follow any links whose text begins with "Last" ("Last >>", "Last Page", "Last Record", etc.). If the pattern is "*Last*", the program will not follow any links whose text contains the "Last" substring ("Go To Last Page", "Go To Last Record", etc.).
  • "Follow links if URL contains" - if the link URL contains the pattern, the program will follow this link. For example, if the pattern is "ProductDetails.php", the program will follow all links whose URL contains "ProductDetails.php" (http://www.domain.com/ProductDetails.php?id=1, http://www.domain.com/ProductDetails.php?id=2).
  • "Do NOT follow links if URL contains" - if the link URL contains the pattern, the program will not follow this link. For example, if the pattern is "&print=", the program will not follow any links whose URL contains "&print=" (http://www.domain.com/ProductDetails.php?id=1&print=1, http://www.domain.com/ProductDetails.php?id=2&print=1).
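The * wildcard semantics described above can be sketched with Python's fnmatch module. This is only an illustration, not the tool's actual matcher (fnmatch also treats "?" and "[]" specially, which the real rules may not):

```python
from fnmatch import fnmatchcase

def follow_by_text(text, pattern):
    """True if the link text matches the pattern, where * stands for
    any sequence of characters, as in the rules above."""
    return fnmatchcase(text, pattern)

print(follow_by_text("Next Page", "Next*"))         # True
print(follow_by_text("Go To Next Page", "Next*"))   # False
print(follow_by_text("Go To Next Page", "*Next*"))  # True
```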

It is recommended to use the first two types of rules for links whose text remains the same on different pages ("Next", "More", "Details", etc.). It is recommended to use the other two types of rules if the necessary links have some common part in their URLs (to see the URL of a link, move the mouse pointer over it and the URL will appear in the status bar at the bottom of the browser window). For example, for these links:

  • http://www.domain.com/Category.php?id=1,
  • http://www.domain.com/Category.php?id=2,
  • http://www.domain.com/ProductDetails.php?id=1,
  • http://www.domain.com/ProductDetails.php?id=2,
  • http://www.domain.com/ProductDetails.php?id=3,

you can specify the following rule: "Follow links if URL contains: /Category.php; /ProductDetails.php"

For these links:

  • http://rateyourmusic.com/artist/the_beatles,
  • http://rateyourmusic.com/artist/radiohead,

you can specify the following rule: "Follow links if URL contains: /artist/".
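The URL-based rules are plain substring checks, with several patterns separated by a semicolon. A minimal sketch of this behavior (an illustration only, using the example rule from the text):

```python
def follow_by_url(url, rule):
    """'Follow links if URL contains': the link is followed if its URL
    contains any of the semicolon-separated patterns in the rule."""
    patterns = [p.strip() for p in rule.split(";")]
    return any(p in url for p in patterns)

rule = "/Category.php; /ProductDetails.php"
print(follow_by_url("http://www.domain.com/Category.php?id=1", rule))        # True
print(follow_by_url("http://www.domain.com/ProductDetails.php?id=3", rule))  # True
print(follow_by_url("http://www.domain.com/index.php", rule))                # False
```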

You can type the patterns manually (separating them with a semicolon) or select them from the list generated by the program; note that this list includes only the most obvious patterns, not all possible ones. To select patterns from the list, click the "Edit" button and, after the page has loaded in the new window, click one of its links. If the list contains the corresponding pattern, its checkbox becomes selected and all links matching this pattern are automatically highlighted. If the list does not contain the corresponding pattern, add it manually by clicking the "Add" button.

Options:

  • "Follow all internal links" - the program will follow all links with the same domain as the current page.
  • "Follow all external links" - the program will follow all links whose domain differs from the domain of the current page.
  • "Simulate User Click" - if this option is enabled, the program will generate JavaScript to simulate a user click instead of following the link URL.
  • "Simulate Scroll Page" - if this option is enabled, the program will automatically scroll down a page.
  • "Extract URL from onclick attribute" - if this option is enabled, the program will extract the URL from the "onclick" attribute of the tag.
  • "Use these rules for the next levels" - a very useful option. If it is enabled, all rules for the current level are also applied to all pages of the following levels. This lets you specify rules for one level and apply them to every subsequent level, instead of specifying rules for each level separately, of which there can be dozens or hundreds. Use this option with Basic Rules when the link positions do not change on pages of the next levels. For example, links to details are located in the same positions on every page with search results, so you can specify these link positions for the first page only and enable this option. You should also use this option if you use Advanced Rules.
  • "Maximum Crawling Depth" - the page level at which the program stops applying rules and adding new links.
  • "Crawling Order" - the order in which pages are loaded:
    • "Breadth-First Crawling" - if this option is selected, pages of a lower level are loaded first.
    • "Depth-First Crawling" - if this option is selected, pages of a higher level are loaded first.
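The difference between the two orders comes down to whether newly found links wait behind the current level (a FIFO queue) or are followed immediately (a LIFO stack). A sketch over a small hypothetical link graph:

```python
from collections import deque

# Hypothetical link graph: page -> pages linked from it.
links = {
    "start": ["page_a", "page_b"],
    "page_a": ["page_a1"],
    "page_b": ["page_b1"],
    "page_a1": [],
    "page_b1": [],
}

def crawl(start, depth_first=False):
    """Return the order in which pages would be visited.
    Breadth-first: finish each level before going deeper (FIFO queue).
    Depth-first: follow each branch to the end first (LIFO stack)."""
    frontier = deque([start])
    seen = {start}
    visited = []
    while frontier:
        page = frontier.pop() if depth_first else frontier.popleft()
        visited.append(page)
        for target in links[page]:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited

print(crawl("start"))
# ['start', 'page_a', 'page_b', 'page_a1', 'page_b1']  (level by level)
print(crawl("start", depth_first=True))
# ['start', 'page_b', 'page_b1', 'page_a', 'page_a1']  (branch by branch)
```

Breadth-first is usually preferable when you want complete results for the early levels before the deep pages start loading; depth-first reaches the deepest pages sooner.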