Skraper offers two ways of scraping web pages: the easy way and the hard way. This tutorial covers the hard way, where you configure a scraper completely from scratch by writing your own commands.


Prerequisites

  • Understanding of basic programming principles
  • HTML/CSS/JS knowledge (you will write your own CSS/XPath selectors)

How does it work?

A job is a chain of commands that runs on a web page. Skraper opens the Start URL of your job and runs your commands until the chain ends (e.g. navigation is no longer possible, or the item you want to scrape doesn't exist). There are commands to navigate between pages, find elements, extract data, and more.
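For example, a minimal job might chain commands like this (the selectors are illustrative, not from a real page):

find('div.article')
set(false, 'headline', 'text', 'h2')
paginate('a.next@href')

Each command operates on the result of the one before it.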

Skraper commands provide a clean, promise-like chaining interface and support hybrid CSS 3.0 and XPath 1.0 selectors. If you are a developer who has worked with JavaScript before, you will feel at home 🏡


Commands

These are all of the commands available for chaining in Skraper:

  • find ( selector )

Find elements based on selector anywhere within the current document and set the context for the next commands to these elements.
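For example (the selector is illustrative):

find('div.product') // narrow the context to every product card on the page

Subsequent commands then run against these elements instead of the whole document.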

  • filter ( selector )

Discard any nodes from the current context that do not match selector.
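A sketch, assuming the context already holds table rows (the class name is illustrative):

find('table tr')
filter('.in-stock') // keep only the rows that carry the `in-stock` class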

  • match ( RegExp )

Discard any nodes from the current context whose contents do not match RegExp.
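A sketch with an illustrative pattern:

find('td.price')
match(/\$\d+/) // keep only cells whose text contains a dollar amount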

  • set ( isMultiple, key, dataType [, selector] )

Set key to the value of selector (if a selector is not given, the text of the current context element is used). If isMultiple is true, all of the values found are set as an array. dataType tells Skraper how to parse the values (e.g. text, number).

set(false, 'title', 'text') // set current element text to `title` column
set(false, 'title', 'text', 'a.title') // set text of `a.title` to `title` column
set(true, 'images', 'text', 'img@src') // set `src` of images to `images` column
  • set-nested ( isMultiple, key, commands )

Just like the set command, but it evaluates the provided commands and sets their value to the key column so you can create dynamic columns and nested values.

// follow the link in the 3rd column (`td[3] a@href`) and set `lat` to the text of the
// `.latitude` element and `lng` to the text of the `.longitude` element on the linked page
        follow('td[3] a@href'),
        set(false, 'lat', 'text', '.latitude'),
        set(false, 'lng', 'text', '.longitude')
  • paginate ( selectorOrJsonString [, limit] )

Paginate the next commands limit times based on selector. If a limit is not given, Skraper will paginate until it can't find the selector.

selectorOrJsonString: A string for either:

  • an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
  • an element whose name and value attributes will respectively be added or replaced in the next page query.
  • a JSON object where each key is a query parameter name and each value is either a selector string or an increment amount (+1, -1, etc.).
paginate('a.nextPage') // go to `a.nextPage` `@href`
paginate('link[rel="next"]@href') // go to `link` `@href`
paginate('input[name="offset"]') // update `offset` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
paginate({ startIndex: +20, page: 'a.nextPage' }, 15)
  • follow ( selector )

Follow URLs found via selector.
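For example (the selector is illustrative):

follow('h2.headline a@href') // open each linked article before the next commands run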

  • delay ( seconds )

Delay starting the next command for seconds.
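For example:

delay(2) // wait 2 seconds before the next command, e.g. to be gentle on the server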

  • login ( user , pass [, success] [, fail] )

Submit a login form.

user - A string containing a username, email address, etc.
pass - A password string
success (optional) - A selector string determining if the login attempt succeeded
fail (optional) - A selector string determining if the login attempt failed

How it works: login finds the first form containing input[type="password"] and uses that input as the password field. It will use the preceding input element as the user field.
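A sketch with illustrative credentials and selectors:

login('jane@example.com', 'hunter2', '.account-menu', '.login-error')
// logged in if `.account-menu` appears, failed if `.login-error` appears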

  • submit ( selector [, data] )

Submit a form.

selector - A selector for the form element or submit button.
data (optional) - An object where each key and value represents a form input name and value
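A sketch, assuming a search form whose input is named q (both names are illustrative):

submit('form#search', { q: 'population' }) // fill the `q` input and submit the form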


Examples

The examples below assume the Start URL of the job is this page on Wikipedia, which contains population information for US states.

1. Simple Page Scrape

First, let's scrape some basic information from the page using basic selectors.


set(false, 'heading', 'text', 'h1')
set(false, 'title', 'text', 'title')


| heading | title |
| --- | --- |
| List of U.S. states and territories by population | List of U.S. states and territories by population - Wikipedia |

2. Scrape Using find

Next, let's say you want a list of all the states along with their populations. You can see that this data is in the first table on the page. To get it, you can use the find command, which sets the current context using a selector. Once you tell Skraper to select the rows of the first table, you can pick out the state and population values and pull them into an object using the set command.


find('.wikitable:first tr:gt(0)')
set(false, 'state', 'text', 'td[3]')
set(false, 'population', 'text', 'td[4]')


| state | population |
| --- | --- |
| California | 39,250,017 |
| Texas | 27,862,596 |
| Florida | 20,612,439 |
| New York | 19,745,289 |
| ... | ... |

3. Scrape Multiple Parts

You can also use multiple set calls in different contexts to pull out different pieces of data.


set(false, 'title', 'text', 'title')
find('.wikitable:first tr:gt(0)')
set(false, 'state', 'text', 'td[3]')
set(false, 'population', 'text', 'td[4]')


| title | state | population |
| --- | --- | --- |
| List of U.S. states and territories by population - Wikipedia | California | 39,250,017 |
| List of U.S. states and territories by population - Wikipedia | Texas | 27,862,596 |
| List of U.S. states and territories by population - Wikipedia | Florida | 20,612,439 |
| List of U.S. states and territories by population - Wikipedia | New York | 19,745,289 |
| ... | ... | ... |

Notice how the title value is repeated on every row extracted after it. Skraper takes the Cartesian product of the data collected in different contexts and flattens it into rows.

4. Follow Pages

Now, let's say you want some information from each state's own page. You can use the follow command to scrape each state page. Since the URL is inside the table you're scraping, you can pass it to the follow command using the a@href selector inside the 3rd column (td[3]). After following each page, you can use set again to pull data from each state page.


find('.wikitable:first tr:gt(0)')
set(false, 'state', 'text', 'td[3]')
set(false, 'population', 'text', 'td[4]')
follow('td[3] a@href')
set(false, 'longitude', 'text', '.longitude')
set(false, 'latitude', 'text', '.latitude')


| state | population | longitude | latitude |
| --- | --- | --- | --- |
| California | 39,250,017 | 119°21'19"W | 35°27'31"N |
| Illinois | 12,801,539 | 88°22'49"W | 41°16'42"N |
| ... | ... | ... | ... |

5. Pagination

When you're scraping a news site, an e-commerce site, or a blog, you'll often need to navigate between many pages of content. This is where the paginate command comes in handy. You can use it to continuously follow a next-page link, or to dynamically change the query parameters, and scrape data from all of the pages.

For this example, we're going to scrape this page, which contains Shopify app names and their ratings. The following commands scrape all of the app names and their ratings (parsing the ratings as numbers) and navigate to the next page continuously until the end.


// the next-page selector below is illustrative; substitute the page's actual next-page link
paginate('a[rel="next"]@href')
set(false, 'appName', 'text', 'h4.ui-app-card__name')
set(false, 'ratings', 'number', 'span.ui-review-count-summary')


| appName | ratings |
| --- | --- |
| Loox ‑ Photo Reviews | 4702 |
| Free Shipping Bar | 9471 |
| Ultimate Sales Boost | 6084 |
| ... | ... |

What's next?