Part 6: Designing and Testing Raid Scripts
Introduction
In order to write Raid Scripts easily, you will need to test them as they are constructed. To do this, use the program RaidScriptTester.exe, which is installed along with NewsRaider in C:\artciles\scripts\test\. For the time being, all Raid Scripts operate in C:\artciles\scripts\ or C:\artciles\scripts\test\.
In order to be picked up by the Raid Script engine, the script files must end with the extension raid.txt, such as WorldNewsraid.txt or WorldNews_raid.txt.
Using RaidScriptTester.exe
When you run RaidScriptTester.exe you will see a list of the available Raid Scripts; these are all the files in C:\artciles\scripts\test\ that end with raid.txt.
If you select a script from the list and click the Test Raid Script button, the script will be processed and the output will be placed in the C:\artciles\scripts\test\ folder as a single file called:
ARTICLE_TEST_OUTPUT.txt
This file can be opened in a text editor or imported into TomeRaider 3 for Windows ( http://www.tomeraider.com ) so you can check that the script has worked properly.
If you have used the command TEST_LINK_PROCESSING then another file will be created called:
LINK_TEST_OUTPUT.txt
All Raid Script testing writes its output to one of these two test files.
The Designing and Testing process
Stage 1: Determine which Links are needed from the start page.
Once you have a site you wish to raid, the first thing to do is to work out which links are articles and which are not. The easiest way to see this is to get all of the links from the page using the INCLUDE_ALL_LINKS command.
The following example shows how you can do this:
BEGIN_DEF
START_URL "http://www.cnn.com"
SOURCE "CNN"
CALL Start
END
BEGIN_PROCESS Start
DOWNLOAD_PAGE
INCLUDE_ALL_LINKS
GET_LINKS
TEST_LINK_PROCESSING
END
This will create the text file LINK_TEST_OUTPUT.txt in the scripts test folder, containing all of the links from CNN.com. You can then fine-tune the selection using EXCLUDE_LINKS commands. If there are a huge number of links to exclude, it may be better not to use the INCLUDE_ALL_LINKS command at all, but rather to specify exactly which links you want using INCLUDE_LINKS commands, combined with EXCLUDE_LINKS where needed.
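For instance, assuming INCLUDE_LINKS and EXCLUDE_LINKS take a quoted text pattern in the same way as the other commands in this guide (the patterns below are purely illustrative, not taken from a real script), a more selective link-gathering sequence might look something like this:

//Example (hypothetical): selecting links explicitly instead of using INCLUDE_ALL_LINKS
DOWNLOAD_PAGE
INCLUDE_LINKS "/2005/WORLD/"    //assumed pattern syntax: keep only article links containing this text
EXCLUDE_LINKS "/video/"         //assumed pattern syntax: drop video pages from the kept links
GET_LINKS
TEST_LINK_PROCESSING

Run this through RaidScriptTester.exe and inspect LINK_TEST_OUTPUT.txt to confirm that only the links you want remain.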
Stage 2: Designing the Article Processing
Once you have your set of article links, it is time to start writing the script for processing the individual articles. You need to get at least two pieces of information from the article page:
- The article title
- The article text
In order to allow easy testing, it is advisable to test extracting a single article rather than waiting until all articles are extracted; this will save valuable time. Example 2.1 shows how you can do this.
//Example 2.1
BEGIN_DEF
//START_URL "http://www.cnn.com"
START_URL "http://www.cnn.com/2005/WORLD/americas/04/20/ecuador/index.html"
SOURCE "CNN"
//CALL Start
CALL GetArticle
END
BEGIN_PROCESS Start
DOWNLOAD_PAGE
INCLUDE_ALL_LINKS
GET_LINKS
//TEST_LINK_PROCESSING
//REPEAT_FOR_ALL_LINKS GetArticle
END
BEGIN_PROCESS GetArticle
DOWNLOAD_PAGE
ARTICLE_FROM "<!--endclickprintexclude--><p>" to "<!--endclickprintinclude-->"
//Images
INCLUDE_IMAGES = ".jpg"
//Acquire Title
FIND_LINE "<title>CNN.com -"
VAR=LINE
VAR_REMOVE_FROM VAR_START to "<title>CNN.com - "
VAR_REMOVE_FROM " -" TO VAR_END
TITLE=VAR
WRITE_ARTICLE
END
You will notice that START_URL has been changed from CNN's index page to a specific article, and that the DEF block calls the "GetArticle" process instead of "Start". This writes the output to the ARTICLE_TEST_OUTPUT.txt file. Once you are satisfied with the formatting of the article, make the DEF block call the "Start" process again and change START_URL back to CNN's index page. Then remove the comment marks from //REPEAT_FOR_ALL_LINKS GetArticle and you have a working script.
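For reference, this is a sketch of what Example 2.1 looks like once those testing changes are reversed (START_URL restored to the index page, DEF calling Start, and REPEAT_FOR_ALL_LINKS uncommented):

//Example 2.2: the finished script from Example 2.1
BEGIN_DEF
START_URL "http://www.cnn.com"
SOURCE "CNN"
CALL Start
END
BEGIN_PROCESS Start
DOWNLOAD_PAGE
INCLUDE_ALL_LINKS
GET_LINKS
REPEAT_FOR_ALL_LINKS GetArticle
END
BEGIN_PROCESS GetArticle
DOWNLOAD_PAGE
ARTICLE_FROM "<!--endclickprintexclude--><p>" to "<!--endclickprintinclude-->"
//Images
INCLUDE_IMAGES = ".jpg"
//Acquire Title
FIND_LINE "<title>CNN.com -"
VAR=LINE
VAR_REMOVE_FROM VAR_START to "<title>CNN.com - "
VAR_REMOVE_FROM " -" TO VAR_END
TITLE=VAR
WRITE_ARTICLE
END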
Modifying Other People's Raid Scripts
Learning Raid Script lets you write your own scripts, but it is also useful for tailoring other people's Raid Scripts to your own needs.
For example, you might have a script for your favorite news site but not want any sports news; you could change that script to exclude such articles.
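One way to do this is with the EXCLUDE_LINKS command described in Stage 1. The quoted-pattern syntax and the "/SPORT/" path below are assumptions for illustration; check how the site's sports URLs are actually structured (for example by running the script with TEST_LINK_PROCESSING) before choosing a pattern:

//Example (hypothetical): dropping sports links from someone else's Start process
BEGIN_PROCESS Start
DOWNLOAD_PAGE
INCLUDE_ALL_LINKS
EXCLUDE_LINKS "/SPORT/"    //assumed pattern syntax; adjust to match the site's sports section URLs
GET_LINKS
REPEAT_FOR_ALL_LINKS GetArticle
END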