en spider advanced

斟酌 鵬兄 edited this page Sep 18, 2016 · 3 revisions

Advanced Spider

This section covers the remaining procedures that are not mentioned in the general guidelines.

  1. URL Generator
  2. Script
  3. Result View
  4. Encoding
  5. List Builder
  6. Parameter

Encoding

Translates the result of the previous step into the target encoding.
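
As a rough illustration of what re-encoding involves, the sketch below decodes bytes from a source encoding and re-encodes the text as UTF-8. It is not the spider's own implementation: `transcodeToUtf8` is a hypothetical helper, and UTF-8 as the target is an assumption made because the standard `TextEncoder` only produces UTF-8.

```javascript
// Hypothetical helper: decode bytes using a known source encoding,
// then re-encode the resulting text as UTF-8.
function transcodeToUtf8(bytes, sourceEncoding) {
  const text = new TextDecoder(sourceEncoding).decode(bytes);
  return new TextEncoder().encode(text); // TextEncoder always emits UTF-8
}

// "Hi" stored as UTF-16LE takes two bytes per character.
const utf16Bytes = new Uint8Array([0x48, 0x00, 0x69, 0x00]);
const utf8Bytes = transcodeToUtf8(utf16Bytes, "utf-16le");
```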

URL Generator

![URL Generator](https://tgckpg.github.io/wenku10/en-us/url generator.png)

The URL Generator differs from a general text enumerator. It is not about setting a range of characters and letting the computer generate the URLs for you. It is, in fact, a crawler that crawls the page content and determines what the next URL could be based on that content.

Arguments:

  1. Incoming: If checked, the entry point will be the parameter passed from the previous step. The entry point defined here will be ignored.
  2. Entry point: The starting point of the stepping procedure
  3. Continue if matched with Url: A regular expression pattern. Its matches will become the next URL to download. ( Multiple patterns can be defined, but only the match of the first pattern will be passed as the next URL. The rest act as conditions to continue. )
  4. Stop if: Stop stepping if any of the defined patterns matches
    1. "2": Skip the first match, useful for excluding matches on the entry point
    2. "X": Discard the unmatched, useful for discarding the stopping point
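
The stepping behaviour described by these arguments can be sketched roughly as follows. This is an illustration under stated assumptions, not the spider's actual code: `generateUrls`, `fetchPage` and `maxSteps` are hypothetical names, pages are fetched synchronously to keep the example short, the patterns are assumed to be non-global regular expressions, the next URL is taken from capture group 1 when the pattern has one, and the page that triggers "Stop if" is kept in the output. The "2" and "X" options are not modelled.

```javascript
// Hypothetical sketch of the stepping loop.
// fetchPage(url) returns the page content as a string.
function generateUrls(entryPoint, fetchPage, continuePatterns, stopPatterns, maxSteps = 100) {
  const urls = [];
  let current = entryPoint;

  // Guard against loops and runaway crawls.
  while (urls.length < maxSteps && !urls.includes(current)) {
    urls.push(current);
    const content = fetchPage(current);

    // "Stop if": any matching pattern ends the stepping.
    if (stopPatterns.some((re) => re.test(content))) break;

    // "Continue if matched": every pattern must match to continue.
    const matches = continuePatterns.map((re) => content.match(re));
    if (matches.length === 0 || matches.some((m) => m === null)) break;

    // Only the first pattern's match becomes the next URL.
    const first = matches[0];
    current = first[1] !== undefined ? first[1] : first[0];
  }
  return urls;
}

// Usage with an in-memory "site" instead of real downloads.
const samplePages = {
  "/p1": '<a href="/p2">next</a>',
  "/p2": '<a href="/p3">next</a>',
  "/p3": "THE END",
};
const visited = generateUrls(
  "/p1",
  (url) => samplePages[url] || "",
  [/href="([^"]+)">next/],
  [/THE END/]
);
```

Passing the fetch function in as a parameter keeps the loop independent of how pages are actually downloaded.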

Script

If none of the other procedures suits your needs, the Script procedure is your last resort. It is very powerful because it runs a custom script inside the retrieved HTML.

A sample script looks like this:

// This script retrieves all <p> and <hX> tags, takes their displayable content, and encloses it in [C]/[V] tags.
var c = document.all;
var n = [ "P", "H1", "H2", "H3", "H4", "H5" ];
var p = [];

for( var i = 0; i < c.length; i ++ ) {
	if( n.indexOf( c[i].nodeName ) != -1 ) 	{
		p.push( c[i] );
	}
}

var result = "";
var pOpened = false;
for( var i = 0; i < p.length; i ++ ) {
	var cont = p[i].textContent.replace( /^\s+|\s+$/g, "" );
	switch( p[i].nodeName ) {
		case "P":
		if( !pOpened ) {
			result += "[C]";
			pOpened = true;
		}
		result += cont;
		break;
		default:
		if( pOpened ) {
			result += "[/C]";
			pOpened = false;
		}
		result += "[V]" + cont + "[/V]";
	}
}

// Close the [C] block if the last collected element was a <p>.
if( pOpened ) {
	result += "[/C]";
}

return result;
  • The value returned by the script is carried forward as the result of this procedure.
  • There are 2 special return values:
    1. WAIT - Tells the spider that the value will be returned asynchronously
    2. ERROR - An error occurred

Asynchronous WAIT

The magic return value WAIT tells the spider that the result will be returned asynchronously. However, the TTL of a script procedure is currently set to 6 seconds. If your script does not return a value in time, it will be terminated.

Here is an example of how to return a result asynchronously.

var result = "GGG";

// Use SetResult to return the asynchronous result to the spider
setTimeout( () => { SetResult( result ) }, 1000 );

// Tell the spider that this script is asynchronous.
return "WAIT";
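
Because the script is terminated once the TTL expires, it can help to set your own deadline below 6 seconds and make sure only one result is ever reported. The sketch below shows one possible way to do that. `createOnceReporter` is a hypothetical helper, and the reporting function is passed in as a parameter so the example also runs outside the spider, where `SetResult` does not exist.

```javascript
// Hypothetical helper: wraps a reporting function so that only the
// first call goes through. Later calls are ignored and return false.
function createOnceReporter(setResult) {
  let done = false;
  return function report(value) {
    if (done) return false;
    done = true;
    setResult(value);
    return true;
  };
}

// Possible use inside a Script procedure (commented out because
// SetResult is only available there; loadSomething is hypothetical):
//
//   var report = createOnceReporter( SetResult );
//   setTimeout( () => report( "timed out" ), 5000 ); // below the 6 s TTL
//   loadSomething().then( report );
//   return "WAIT";
```

Whichever happens first, the real result or the self-imposed timeout, is what gets reported.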

Exception Handling

You can throw an error as you would in any other script. The error will be caught and forwarded to the console.
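
For instance, a script might throw as soon as the content it expects is missing, rather than passing an empty result on to the next procedure. This is only a sketch: `requireNonEmpty` and its error message are hypothetical and not part of the spider.

```javascript
// Hypothetical helper: returns the trimmed text, or throws when it is empty.
function requireNonEmpty(text, label) {
  const trimmed = String(text).trim();
  if (trimmed === "") {
    throw new Error("Empty content: " + label);
  }
  return trimmed;
}

// Possible use inside a Script procedure:
//
//   var title = requireNonEmpty( document.title, "page title" );
//   return title;
```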

Actual script location

The script works like this:

  1. The <head> in the original HTML will be replaced entirely.
  2. Any <style> or <script> tag will also be removed.
  3. The <body>'s style will be set to display: hidden;, so it will not be possible to get an element's height, as no rendering is done.
  4. The custom script will then be put inside this setup. The setup script will become the new <head> of the HTML.
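
To make these steps concrete, here is a rough string-based sketch of the same transformation. It is an assumption-laden illustration rather than the spider's real code: `prepareHtml` is a hypothetical name, regular expressions stand in for a proper HTML parser, existing attributes on <body> are simply dropped, and pages without a <head> are left without one.

```javascript
// Hypothetical sketch of the page preparation described above.
function prepareHtml(html, customScript) {
  // Step 2: drop every <style> and <script> element. Doing this first
  // keeps the newly inserted setup script from being removed.
  let out = html.replace(/<(style|script)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "");

  // Steps 1 and 4: replace the whole <head> with one carrying the script.
  // A replacer function avoids "$" in the script being interpreted.
  const newHead = "<head><script>" + customScript + "</script></head>";
  out = out.replace(/<head\b[^>]*>[\s\S]*?<\/head\s*>/i, () => newHead);

  // Step 3: set the body style exactly as the wiki describes it.
  out = out.replace(/<body\b[^>]*>/i, '<body style="display: hidden;">');
  return out;
}

const prepared = prepareHtml(
  '<html><head><title>t</title><style>a{}</style></head>' +
    '<body class="x"><p>Hi</p><script>alert(1)</script></body></html>',
  "return 1;"
);
```

One consequence worth remembering is that anything the original page's own scripts or styles would have produced is not available to the custom script.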

Result View

The Result View joins the results from the previous procedure. It generally acts as a content transfer. You can also inspect the content of the previous procedure by pressing the Test button.
