Skraper offers two ways of scraping web pages: the easy way and the hard way. This tutorial covers the hard way, where you configure a data scraper completely from scratch by providing your own commands.
Prerequisites:
- Understanding of basic programming principles
- HTML/CSS/JS knowledge (you will write your own CSS/XPath selectors)
How does it work?
A job is a chain of commands that runs on a web page. Skraper opens the Start URL of your job and runs your commands until the end (e.g. until navigation is no longer possible or the item you want to scrape doesn't exist). There are commands to navigate between pages, find elements, extract data, and more.
Skraper commands provide a clean, promise-like interface and support hybrid CSS 3.0 and XPath 1.0 selectors. If you are a developer who has worked with JavaScript before, you will feel at home 🏡
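For instance, a short job might chain commands like the sketch below (the selectors and column names are purely illustrative; note the `@href` part of the last selector, which picks an attribute instead of the element text):
paginate('a.next') // illustrative next-page link
find('div.article') // illustrative item container
set(false, 'headline', 'h2.title') // plain CSS selector
set(false, 'url', 'h2.title a@href') // hybrid selector: CSS plus an `@attribute` part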
Commands
These are all of the commands that are available for chaining in Skraper:
find ( selector )
Find elements based on selector anywhere within the current document and set the context for the next commands to these elements.
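For example (the table selector here is illustrative):
find('table.results tr') // set the current context to every row of the results table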
filter ( selector )
Discard any nodes from the current context that do not match selector.
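For example (the selectors are illustrative):
find('li.product')
filter('.in-stock') // keep only the products that also have an `in-stock` class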
match ( RegExp )
Discard any nodes from the current context whose contents do not match RegExp.
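For example (the selector and pattern are illustrative):
find('td.price')
match(/\$\d+/) // keep only price cells whose text contains a dollar amount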
set ( isMultiple, key, dataType [, selector] )
Set key to the value of selector (if a selector is not given, sets the current element text in the current context). If isMultiple is true, sets all of the values found as an array. dataType lets Skraper parse the values appropriately.
set(false, 'title') // set current element text to `title` column
set(false, 'title', 'a.title') // set text of 'a.title' to `title` column
set(true, 'images', 'img@src') // set `src` of images to `images` column
set-nested ( isMultiple, key, commands )
Just like the set command, but it evaluates the provided commands and sets their value to the key column, so you can create dynamic columns and nested values.
// follow the link in the 3rd column (`td[3] a@href`), then set `lat` to the text of the `.latitude` element and `lng` to the text of the `.longitude` element on the next page
set-nested(
  false,
  'location',
  [
    follow('td[3] a@href'),
    set(false, 'lat', '.latitude'),
    set(false, 'lng', '.longitude')
  ]
)
paginate ( selectorOrJsonString [, limit] )
Paginate the next commands limit times based on selector. If a limit is not given, Skraper will paginate until it can't find the selector.
selectorOrJsonString: A selector string or a JSON object, specifying one of:
- an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
- an element whose name and value attributes will respectively be added or replaced in the next page query
- a JSON object where each key is a query parameter name and each value is either a selector string or an increment amount (+1, -1, etc.)
paginate('a.nextPage') // go to `a.nextPage` `@href`
paginate('link[rel="next"]@href') // go to `link` `@href`
paginate('input[name="offset"]') // update `offset` parameter of the next query
// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
paginate({ startIndex: +20, page: 'a.nextPage' }, 15)
follow ( selector )
Follow URLs found via selector.
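For example, using a selector that appears again later in this tutorial:
follow('td[3] a@href') // open the link in the 3rd column and run the next commands on that page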
delay ( seconds )
Delay starting the next command for seconds.
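For example:
delay(5) // wait 5 seconds before running the next command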
login ( user, pass [, success] [, fail] )
Submit a login form.
user - A string containing a username, email address, etc.
pass - A password string
success (optional) - A selector string determining if the login attempt succeeded
fail (optional) - A selector string determining if the login attempt failed
How it works: login finds the first form containing input[type="password"] and uses that input as the password field. It will use the preceding input element as the user field.
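For example (the credentials and the success/fail selectors below are hypothetical):
login('jane@example.com', 'hunter2', '.account-menu', '.login-error') // succeeded if `.account-menu` appears, failed if `.login-error` appears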
submit ( selector [, data] )
Submit a form.
selector - A selector for the form element or submit button.
data (optional) - An object where each key and value represents a form input name and value.
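For example (the form selector and field name are hypothetical):
submit('form.search', { q: 'population' }) // set the `q` input to 'population' and submit the search form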
Examples
The examples below assume the Start URL of the job is this page on Wikipedia, which contains population information for U.S. states.
1. Simple Page Scrape
First, let's scrape some basic information from the page using simple selectors.
Commands:
set(false, 'heading', 'text', 'h1')
set(false, 'title', 'text', 'title')
Results:
heading | title |
---|---|
List of U.S. states and territories by population | List of U.S. states and territories by population - Wikipedia |
2. Scrape Using find
Next, let's say you want to get a list of all the states along with their populations. You can see that this data is in the first table on the page. To do this, you can use the find command, which sets the current context using selectors. Once you tell Skraper to select the rows from the first table, you can pick out the state and population values and pull them into an object using the set command.
Commands:
find('.wikitable:first tr:gt(0)')
set(false, 'state', 'text', 'td[3]')
set(false, 'population', 'text', 'td[4]')
Results:
state | population |
---|---|
California | 39,250,017 |
Texas | 27,862,596 |
Florida | 20,612,439 |
New York | 19,745,289 |
... | ... |
3. Scrape Multiple Parts
You can also use multiple set calls in different contexts to pull out different pieces of data.
Commands:
set(false, 'title', 'text', 'title')
find('.wikitable:first tr:gt(0)')
set(false, 'state', 'text', 'td[3]')
set(false, 'population', 'text', 'td[4]')
Results:
title | state | population |
---|---|---|
List of U.S. states and territories by population - Wikipedia | California | 39,250,017 |
List of U.S. states and territories by population - Wikipedia | Texas | 27,862,596 |
List of U.S. states and territories by population - Wikipedia | Florida | 20,612,439 |
List of U.S. states and territories by population - Wikipedia | New York | 19,745,289 |
... | ... | ... |
Notice how the title field is appended to all the data extracted after it. Skraper takes the Cartesian product of the data collected in different contexts, thereby flattening them.
4. Following Links
Now, let's say you want some information from each state. You can use the follow command to scrape each state's page. Since the URL is inside the table that you're scraping, you can pass that URL to the follow command using the a@href selector inside the 3rd column (td[3]). After you follow each page, you can use set again to pull data from each state's page.
Commands:
find('.wikitable:first tr:gt(0)')
set(false, 'state', 'text', 'td[3]')
set(false, 'population', 'text', 'td[4]')
follow('td[3] a@href')
set(false, 'longitude', 'text', '.longitude')
set(false, 'latitude', 'text', '.latitude')
Results:
state | population | longitude | latitude |
---|---|---|---|
California | 39,250,017 | 119°21'19"W | 35°27'31"N |
Illinois | 12,801,539 | 88°22'49"W | 41°16'42"N |
... | ... | ... | ... |
5. Pagination
When you're scraping a news site, an e-commerce site, or a blog, you'll often need to navigate between many pages of content. This is where the paginate command comes in handy. You can use that command to continuously click a next page button or dynamically change the query parameters, and scrape data from all of the pages.
For this example, we're going to scrape this page, which contains Shopify app names and their ratings. The following commands scrape all of the app names and their ratings (parsing the ratings as a number) and navigate to the next page by continuously clicking a.search-pagination__next-page-text until the end.
Commands:
paginate('a.search-pagination__next-page-text')
find('div.ui-app-card')
set(false, 'appName', 'text', 'h4.ui-app-card__name')
set(false, 'ratings', 'number', 'span.ui-review-count-summary')
Results:
appName | ratings |
---|---|
Loox ‑ Photo Reviews | 4702 |
Free Shipping Bar | 9471 |
Ultimate Sales Boost | 6084 |
... | ... |
What's next?