
Introduction
First, web scraping can be illegal or against a site's terms of service. Be sure to ask permission and confirm you are allowed to scrape the website to avoid legal issues. Second, this post is not meant to promote stealing private data on the internet. Its purpose is to demonstrate how to use Node.js's xray package.
I used to scrape the internet for a Big Data company and it was not fun. I had to use multiple Node.js packages like express, request, cheerio, async, and some utility packages. I had to work through a very complicated callback hell just to build a nested object at least 3 levels deep. Sigh.
But now, good guy Mat has created xray and shared it with the public. It's so easy to use, so let's start scraping. For this demo I'll be getting some data from IMDB. (Shhhhh… I did not ask permission to use their data).
In this post, we will learn how to:
- scrape
- crawl the links
- proceed to the next page
- and stream our data
1. Basic Scraping
First, you need to install xray.
npm install x-ray --save
Getting content from a given URL is very easy. Require xray in your application and then initialize it.
var Xray = require('x-ray');
var x = Xray();
xray's x() function needs some arguments to work:
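- URL - the address of the page you want to scrape
- data structure - a selector string, or an object or array of selectors, describing the data you want back
- callback - a function(err, data) that receives either an error or the scraped data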
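Here is a minimal sketch of those three arguments working together to read a page title. The names scrapeTitle and report are made up for this example, and the URL is whatever page you are allowed to scrape; treat it as an illustration rather than the exact code from this post.

```javascript
// Callback in Node's usual (err, data) style. It returns the line it logs.
// "report" is a hypothetical helper name used only in this sketch.
function report(err, title) {
  var line = err ? 'Scrape failed: ' + err.message : 'Page title: ' + title;
  console.log(line);
  return line;
}

// "scrapeTitle" is also a hypothetical name.
function scrapeTitle(url, done) {
  var Xray = require('x-ray'); // loaded lazily, only when a scrape starts
  var x = Xray();
  // 1) URL, 2) data structure (here a single selector string), 3) callback
  x(url, 'title')(done);
}
```

To try it, call scrapeTitle(url, report) with the URL of a page you have permission to scrape.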
So the code above will display the title of the page at the given URL. Now let's get all the posts listed on that page. With the help of a selector, that is an easy task.
// code here
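As a rough sketch of what such a list scrape could look like: wrapping a selector in an array asks xray for every match instead of only the first one. The '.post' selector is borrowed from the crawling example later in this post and is an assumption about the page's markup; scrapePosts and printPosts are hypothetical names.

```javascript
// Prints a numbered list of post titles and returns the printed lines.
// "printPosts" is a hypothetical helper name used only in this sketch.
function printPosts(err, posts) {
  if (err) {
    console.log(err);
    return [];
  }
  var lines = posts.map(function (title, i) {
    return (i + 1) + '. ' + title.trim();
  });
  console.log(lines.join('\n'));
  return lines;
}

// "scrapePosts" is also a hypothetical name.
function scrapePosts(url, done) {
  var Xray = require('x-ray'); // loaded lazily, only when a scrape starts
  var x = Xray();
  // ['.post'] means "the text of every element matching .post"
  x(url, ['.post'])(done);
}
```

Calling scrapePosts(url, printPosts) would then log one numbered line per matched element, assuming the page really uses a .post class.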
That code will display the list of posts on that page.
// list of posts
2. A Bit More Exciting Scraping
Now that we can get each post on that page, let's try to crawl the link of every post. To do that, all you have to do is nest another x() call as the value of an object property:
x(
  url,
  [{
    // text of the element matching .post
    postTitle: '.post',
    // follow the post's href and scrape the linked page
    postDetails: x(
      '.post@href',
      [{
        content: '.content'
      }]
    )
  }]
)(function (err, data) {
  if (err) {
    console.log(err);
  } else {
    console.log(data);
  }
});
What this code does is loop over every post, crawl to the link from .post@href, get the data from the .content element on that page, and save it in the postDetails property of that iteration. If you don't encounter an error, you'll see a new item on your post object called postDetails with the blog article as its value.
// sample result
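Once the crawl finishes, you will probably want to walk the nested result. The sketch below assumes the data comes back in the shape the selectors describe (an array of posts, each with a postDetails array). The sample values are invented for illustration and are not real scraped output, and summarize is a hypothetical helper name.

```javascript
// Invented sample data, shaped like the selectors in the crawl above.
var sample = [
  { postTitle: 'First post', postDetails: [{ content: 'Body of the first post' }] },
  { postTitle: 'Second post', postDetails: [] }
];

// Builds one summary line per post, tolerating posts whose crawl found nothing.
function summarize(posts) {
  return posts.map(function (post) {
    var details = post.postDetails || [];
    var content = details.length ? details[0].content : '(no content found)';
    return post.postTitle + ': ' + content;
  });
}

console.log(summarize(sample).join('\n'));
```

Guarding against an empty postDetails keeps one failed or empty crawl from breaking the whole loop.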