Scraping with NightmareJS – getting started

A quick overview of setting up and using the browser automation library.

What is NightmareJS, and when should you use it?
Installing and setting up
Basic example
Scrape and save your data to disk
Looping through urls
Run automatically/on a server
Other headless browsers/tools

What is NightmareJS, and when should you use it?

What is it?

Nightmare is a ‘high-level browser automation library’, more commonly described as a headless browser.

You can run it with a live view to watch what your script is doing, run it in the background, or even run it on your server.

It is built on Electron, so you are essentially running Chromium. You control it using Node and the Nightmare API.

What is it used for?

Headless browsers are useful when you need to automatically interact with a webpage, or collect information that is only visible after the page's JavaScript has run.

It can also be used for testing the UI when developing your own apps: https://segment.com/blog/ui-testing-with-nightmare/

Not illegal, but...

Website owners would probably prefer that you did not use these tools on their sites at all, and some have decent countermeasures in place to prevent it. Check first whether the site has an API, and whether you can get the information you want that way.

Note: Always run your scraper through a proxy or VPN. You could quickly find yourself blocked, or even blacklisted across multiple sites.

Try to be aware of the effect your scraper might have on the website. Badly written ones can take up a large amount of resources.

You can prevent assets such as CSS, images, or video from loading, which speeds up your scraper and makes it less resource intensive.

Installing and setting up

Install node js

Nightmare uses node.js, so make sure you have it installed first. https://nodejs.org/en/download

Note: npm (node package manager) will be installed automatically with node.

Install nightmare js

Create a new folder (call it scraper, for example). Inside that folder, create a new project by initialising npm:

$ npm init

To install Nightmare locally, run this in the same folder:

$ npm install --save nightmare

That’s it. When your nightmare script is written, just call it with node:

$ node yourscript.js

Basic example

Using Nightmare is similar to how you would browse normally: go to the URL, click stuff, get the info you need, then close the tab or go to another URL.

It is promise based, so you can easily chain functions in a human-readable way.

Starting with the very same basic example from the docs:

const Nightmare = require('nightmare')
const nightmare = Nightmare({ show: true })

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'github nightmare')
  .click('#search_button_homepage')
  .wait('#r1-0 a.result__a')
  .evaluate(() => document.querySelector('#r1-0 a.result__a').href)
  .end()
  .then(console.log)
  .catch(error => {
    console.error('Search failed:', error)
  })

Some things to note

In your script, you can only visit one URL at a time, and the Nightmare instance (nightmare = Nightmare()) will store cookies, cache, and local and session storage until you end it with end().

You need to call end() before then() because:

Promises don’t expose .end(), Nightmare does. Once you call .then(), you’re not dealing with a Nightmare instance anymore, you’re dealing with a native promise. If you wanted .then() to execute (and clear the queue) before calling .end(), something like:
nightmare
  .goto(url)
  .click('some-selector')
  .then(function(){
    //current chain has run
    //do some other logic
    return nightmare.end();
  })
  .then(function(){
    //nightmare is now ended
  })
– rosshinkly https://github.com/segmentio/nightmare/issues/546#issuecomment-208173589

You can use new instances after ending the previous one. The author even recommends doing it regularly.

What is evaluate()?

evaluate() allows you to run your own JavaScript on the page, so you can interact with the DOM to carry out tasks or scrape the info you need.

Variables and functions declared outside evaluate() are not accessible inside it. It runs in a completely separate process. You can pass variables into the evaluate() context, but not functions:

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

const name = "dr.acula";
const url = "https://example.com";

nightmare
  .goto('https://google.com')
  .evaluate((name, url) => {
    let string = `passed variables: ${name}, ${url}`;
    return string
  }, name, url)
  .end()
  .then(console.log);

Here we passed global variables name and url into evaluate(), created a new string within this context, and then returned it back, where we simply console log it. Running this example you should see the completed string in your terminal.
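
To picture why values can cross into evaluate() but functions cannot, it helps to remember that the arguments have to be serialized before they reach the page. The sketch below uses a JSON round trip as a stand-in for that step (an assumption for illustration; the args object and its shout property are made up). It runs in plain Node, without Nightmare:

```javascript
// Stand-in for the hand-off into evaluate(): serialize, then rebuild on the "other side".
// Assumption: a JSON round trip approximates how the arguments are transferred.
const args = {
  name: 'dr.acula',
  url: 'https://example.com',
  shout: text => text.toUpperCase() // a function, which JSON cannot represent
};

const received = JSON.parse(JSON.stringify(args));

console.log(received.name);         // dr.acula
console.log(received.url);          // https://example.com
console.log(typeof received.shout); // undefined
```

Plain data survives the trip, while the function is silently dropped. Any helper you need on the page has to be defined inside the evaluate() callback itself.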

Anything you return you can process in then() as you will see in the next example.

Scrape and save your data to disk

In this example we will be visiting a website, typing in a search term, collecting the results, and saving to disk.

But first, in case you are completely new to Node, I will go over how to access the file system and save our data as JSON.

Saving as JSON

Note: For larger amounts of data I would suggest creating a CSV file and periodically appending to it instead, rather than risk running out of memory by keeping an ever-growing object in a variable.

Saving in node is pretty simple. First you need to require the fs module.

const fs = require('fs');

This exposes an API to interact with the file system. You can now read and write files, either synchronously (blocking) or asynchronously.

Save it

I will be using the synchronous method writeFileSync() and saving the data as JSON:

const fs = require('fs');

let scrapedData = [{id:1, name:'Dr Frankenstein'}]; //our data
let data = JSON.stringify(scrapedData); //convert to JSON
fs.writeFileSync('scraped_data.json', data); //save to disk

This will save it as a single-line string. If you would like to save it in a more human-readable format, you can add newlines and indentation with:

let data = JSON.stringify(scrapedData, null, 2);

Reading it back

As I will be saving the scraped info as JSON, it is also worth mentioning how to read the data and convert it back into a JavaScript object:

const fs = require('fs');

let rawdata = fs.readFileSync('scraped_data.json');
let dataObj = JSON.parse(rawdata);
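
For the CSV approach mentioned in the note above, a minimal sketch could look like the following. The helper names (toCsvField, toCsvRow), the column list, the sample record, and the output path are all made up for illustration. The idea is to write the header once and then append each batch as it arrives, so nothing has to accumulate in memory:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Quote a value for CSV: wrap it in double quotes and double any embedded quotes.
const toCsvField = value => `"${String(value).replace(/"/g, '""')}"`;

// Turn one scraped object into one CSV line, using a fixed column order.
const toCsvRow = (record, columns) =>
  columns.map(col => toCsvField(record[col])).join(',') + '\n';

const columns = ['title', 'price', 'location'];
const csvFile = path.join(os.tmpdir(), 'scraped_data.csv'); // hypothetical output path

// Write the header once (this overwrites any previous file)...
fs.writeFileSync(csvFile, columns.join(',') + '\n');

// ...then append each batch of results as soon as it has been scraped.
const batch = [{ title: 'N64 "boxed"', price: 40, location: 'Huddersfield' }];
batch.forEach(record => fs.appendFileSync(csvFile, toCsvRow(record, columns)));
```

Quoting every field keeps commas and quotes inside titles or descriptions from breaking the columns.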

Full Example

In this example, we will go to gumtree.com (a local selling site, a bit like Craigslist), search for N64s for sale, and scrape the information from every advert posted: the ad title, price, location, and description. We will then save that as a JSON file.

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });
const fs = require('fs');

nightmare
  .goto('https://www.gumtree.com')
  .wait('.search-bar .keyword-search-container input')
  .type('.search-bar .keyword-search-container input', 'n64 console')
  .click('.search-bar button[type="submit"]')
  .wait('#srp-results')
  .evaluate(() => {

    //get all ads on a page and filter out the non for sale ones
    let ads = [...document.querySelectorAll('li.natural')];
    let forSaleAds = ads.filter(ad=> ad.querySelector('.listing-price'));

    //loop through and extract info from each ad into an object
    let data = forSaleAds.map(ad => {
      let title = ad.querySelector('.listing-title').innerText;
      let price = Number.parseInt(ad.querySelector('.listing-price strong').innerText.slice(1));
      let location = ad.querySelector('.listing-location').innerText;
      let desc  = ad.querySelector('.listing-description').innerText;
			
      return {title, price, location, desc};
    });
    //return the array of objects
    return data;
  })
  .end()
  .then(data => {
    //convert to JSON and save as file
    data = JSON.stringify(data, null, 2);
    fs.writeFileSync('gumtree.json', data);
  })
  .catch(error => {
    console.error('Scraping failed:', error)
  })

And you should end up with something like this in the ‘gumtree.json’ file:

[
  {
    "desc": "Tested working all wires. Can post out for additional costs Nintendo n64 console wetup. Posted by lisa in Consoles, Other Consoles in Huddersfield. 20 June 2019",
    "location": "Huddersfield, West Yorkshire\n",
    "price": 40,
    "title": "Nintendo n64 console"
  },
  {
    "desc": "Retro game console for any TV or PC screen with HDMI input Play all games from your childhood. Great present which will bring your family together. PLUG&PLAY - already configured, no setup required You can save progress in all games at any time",
    "location": "Wallasey, Merseyside\n",
    "price": 60,
    "title": "Retro Game Console, PlayStation + Nintendo N64 Games 128GB Retropie"
  },
  {
    "desc": "N64 console. 4 controllers and selection of games; Mario Kart Goldeneye Quake Bomberman",
    "location": "Angus\n",
    "price": 75,
    "title": "N64 console, 4 controllers and games"
  }
]
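
The price parsing in the example assumes a one-character currency symbol and no thousands separator, so a price shown as "£1,200" would be read as 1. A slightly more defensive helper might look like the sketch below. parsePrice is a hypothetical name, and because outside functions are not visible inside evaluate(), it would have to be declared within the evaluate() callback to be used there:

```javascript
// Hypothetical helper: pull a numeric price out of text such as "£1,200".
const parsePrice = text => {
  const digits = text.replace(/[^0-9.]/g, ''); // drop currency symbols, commas, spaces
  const value = Number.parseFloat(digits);
  return Number.isNaN(value) ? null : value;   // null when no number was found
};

console.log(parsePrice('£40'));    // 40
console.log(parsePrice('£1,200')); // 1200
console.log(parsePrice('Free'));   // null

// The trailing "\n" seen in the location values can be removed with trim().
console.log(JSON.stringify('Angus\n'.trim())); // "Angus"
```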

Looping through urls

The next thing you probably want to do is loop through an array of urls. But there is a problem: Nightmare uses a single instance, and chains actions such as .goto(), .click() and .evaluate() off it using promises.

This asynchronous nature makes it difficult to figure out how to do it. If you used a normal loop, even if you created a new instance each time, the chained operations would happen out of order.

One common solution is to use Array.reduce() to resolve a series of promises sequentially.

Basic example:

This is the example given in the nightmare docs. (slightly modified, to end when done)

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

const urls = [
  'http://example1.com', 
  'http://example2.com',
  'http://example3.com'
];

urls.reduce(function(accumulator, url) {
  return accumulator.then(function(results) {
    return nightmare.goto(url)
      .wait('body')
      .title()
      .then(function(result){
        results.push(result);
        return results;
      });
  });
}, Promise.resolve([])).then(function(results){
    console.dir(results);
    return nightmare.end();
});

This will go to each url, fetch the title, and push it to an array. The completed array is returned to the final then(), where the results are printed to the console.
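
To see what the reduce() chain buys you, here is a sketch with no browser involved. fakeVisit and the pages array are made up for illustration. Each fake visit takes a different amount of time, so any overlap between visits would show up in the log:

```javascript
// Stand-in for a page visit: resolves after `ms` milliseconds and records
// when it started and when it finished. No browser is involved.
const log = [];
const fakeVisit = (url, ms) =>
  new Promise(resolve => {
    log.push(`start ${url}`);
    setTimeout(() => {
      log.push(`done ${url}`);
      resolve(url);
    }, ms);
  });

const pages = [['a', 30], ['b', 10], ['c', 20]]; // made-up [url, delay] pairs

// reduce() chains each visit onto the previous promise, so the next visit
// only starts once the one before it has resolved.
const sequential = pages.reduce(
  (chain, [url, ms]) =>
    chain.then(results => fakeVisit(url, ms).then(r => [...results, r])),
  Promise.resolve([])
);

sequential.then(results => {
  console.log(results); // [ 'a', 'b', 'c' ]
  console.log(log.join(', '));
  // start a, done a, start b, done b, start c, done c
});
```

Starting all three visits at once (with forEach, say) would instead let them finish in the order b, c, a, which is the out-of-order behaviour described earlier.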

Example using async/await

You can also write this using async/await, which may make more sense to you, as it is written in a similar way to the css-tricks explanation linked above:

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

const urls = ['https://google.com', 'https://www.bing.com'];

let counter = 0;

const getTitle = async url => {
  counter++;
  console.log(`scraping [${counter}/${urls.length}] ${url}`);
  let result = await nightmare
    .goto(url)
    .title()
    .then(result=>result);
  return result;
}

const results = urls.reduce(async (accumulator, url) => {
  const dataArray = await accumulator;
  dataArray.push(await getTitle(url));
  return dataArray;
}, Promise.resolve([]));

results.then(data => {
  console.dir(data);
  return nightmare.end();
})

I also added a counter that logs to the console, which is useful for keeping track of your progress.
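
A plain for...of loop inside an async function is another way to visit pages one after another, and it leaves room for a pause between requests and for skipping a url that fails. This is a sketch rather than anything from the Nightmare docs: fetchTitle is a hypothetical stand-in for the Nightmare chain so that it runs without a browser, and the urls and delay are made up.

```javascript
// Pause helper, used to space requests out.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical stand-in for the Nightmare chain (goto + title) so the sketch
// runs without a browser; swap in the real getTitle from the example above.
const fetchTitle = async url => {
  if (url.includes('broken')) throw new Error(`could not load ${url}`);
  return `Title of ${url}`;
};

const scrapeAll = async (urlList, delayMs) => {
  const titles = [];
  const failed = [];
  for (const url of urlList) {
    try {
      titles.push(await fetchTitle(url)); // one page at a time, in order
    } catch (error) {
      failed.push(url); // note the failure and carry on with the next url
    }
    await sleep(delayMs); // be gentle on the target site
  }
  return { titles, failed };
};

const run = scrapeAll(
  ['https://example.com/1', 'https://example.com/broken', 'https://example.com/2'],
  50 // made-up delay in milliseconds
);
run.then(console.log);
```

Keeping a list of failed urls means one bad page does not throw away the results already collected, and the failures can be retried later.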

Run automatically/on a server

It is possible to run your script on a server, assuming you have node already installed.

Install xvfb

Because Electron is a graphical application and you are running it on a headless server, you need to install xvfb, which allows Electron to run.

xvfb, or X virtual framebuffer, runs an in-memory display server that performs graphical operations without a screen output or graphics card.

Run this command to install it and its dependencies:

$ sudo apt-get install -y xvfb x11-xkb-utils xfonts-100dpi xfonts-75dpi xfonts-scalable xfonts-cyrillic x11-apps clang libdbus-1-dev libgtk2.0-dev libnotify-dev libgnome-keyring-dev libgconf2-dev libasound2-dev libcap-dev libcups2-dev libxtst-dev libxss1 libnss3-dev gcc-multilib g++-multilib

Running it

Now it should be possible to run your nightmare scripts on the server:

$ xvfb-run node scraper.js

Automate it with cron

It is easy to set your scraping script to run automatically at a set time, date, or interval with cron, a Linux utility for scheduling jobs.

Open crontab file:

$ crontab -e

And add this line to run your scraper every day at 12:00:

0 12 * * * cd /path/to/working/directory && xvfb-run node scraper.js

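If you don’t want to learn how to write cron expressions, you can use one of the many free online generators, such as CronMaker.com. Check that the generator outputs the standard five-field crontab format (minute, hour, day of month, month, day of week). CronMaker produces Quartz-style expressions with extra seconds and year fields, which crontab will not accept.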

Other headless browsers/tools

There are lots of other tools you can use for scraping websites. Here is a selection of some of them.

Daydream – A Chrome extension to record your actions into a Nightmare or Puppeteer script. UPDATE 08/01/21: This is no longer maintained and is not available on the Chrome Web Store, but there is an actively maintained alternative web extension called Headless Recorder. I have not personally tested or used it, so I can give no guarantees as to its performance.
Puppeteer – A Node library for controlling headless Chrome.
Scrapy – A fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
Axios – Promise based HTTP client for the browser and node.js.
Cheerio – Parses markup and provides an API for traversing/manipulating the resulting data structure.
BeautifulSoup – Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

5 replies on “Scraping with NightmareJS – getting started”

Zahidul Islam says:

December 16, 2019 at 6:45 pm

console.log doesn’t work for me in evalulate(). Did you test it out?
1. Nick Hart says:
  
  January 6, 2020 at 5:55 pm
  
  Hello Zahidul,
  
  Sorry for the late reply. You are completely right It did not run, thank you for notifying me. I have now fixed this mistake and updated the example.
  
  – Nick
HTML5VideoBank says:

August 15, 2020 at 8:38 pm

Nice article. Concise and clear. Thanks. I like NightmareJS much better than Puppeteer which I think is a bit too bloated for simple stuff.
Sam says:

January 8, 2021 at 11:27 am

Need to remove the daydream link. Slightly annoying finding out it’s no longer available
1. Nick Hart says:
  
  January 8, 2021 at 11:59 am
  
  Thank you Sam, for mentioning this. I have updated the post to include an (hopefully useful) alternative.
