PhantomJS – More than UI testing

August 21, 2015

PhantomJS is well known as a “headless website testing” tool. People often use PhantomJS alongside a WebDriver tool (such as Selenium) to test whether their web pages work as expected.

In this blog post, I will demonstrate a part of this functionality and show you how I use PhantomJS to crawl/scrape web pages. Crawling is fun!

Foreword: This blog post assumes you’re at least familiar with JavaScript. It doesn’t cover in detail how to install PhantomJS, for example.

I. Installation

I chose to install the latest version of PhantomJS from source.

II. Capture page differences

Imagine the development team updates the website with a new version, and we want to capture what changed, down to individual DOM elements. How can we even do that? Yes, we can use our eyes, but what about subtle changes that we can’t spot by eye? PhantomJS to the rescue!

First, install page-monitor, an npm package built on top of PhantomJS (*)

Let’s say we’ll check http://vnexpress.net. Create a file named monitor.js with the following content:

var Monitor = require('page-monitor');

var url = 'http://vnexpress.net/';
var monitor = new Monitor(url);
monitor.capture(function(code){
    console.log(monitor.log); // from phantom
    console.log('done, exit [' + code + ']');
});

Execute by running:

$ node monitor.js

The output looks something like this:

Later (30 minutes or so), we come back and capture the page again. The output will look *roughly* the same.

We know that vnexpress.net is a news site, so articles move up and down the page all the time. In 30 minutes or so, something on the page will have changed.

We then create a new file named check_diff.js to compare the two captures:

var Monitor = require('page-monitor');

var url = 'http://vnexpress.net/';
var monitor = new Monitor(url);

// 1440132126640 is the timestamp of the first capture - get this number from the output of the first run
// 1440138104855 is the timestamp of the second capture
monitor.diff(1440132126640, 1440138104855, function(code){
    console.log(monitor.log.info); // diff result
    console.log('[DONE] exit [' + code + ']');
});

The output of this process is an image that shows exactly what changed, visualized!

[Large image to demonstrate]
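A side note on the two numbers passed to `monitor.diff`: they appear to be Unix timestamps in milliseconds (that is my reading of the values, not something the page-monitor output states here). If that assumption holds, a tiny helper like the hypothetical `toIso` below can turn them into readable dates, which helps when deciding which two captures to compare.

```javascript
// Hypothetical helper: convert a millisecond Unix timestamp into a readable UTC string.
// Assumes the capture identifiers are millisecond timestamps.
function toIso(timestampMs) {
  return new Date(timestampMs).toISOString();
}

console.log(toIso(1440132126640)); // 2015-08-21T04:42:06.640Z
console.log(toIso(1440138104855)); // 2015-08-21T06:21:44.855Z
```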

We can go much further, depending on what we need. For example, we could build a dashboard that stores each captured state of a web page as a revision and shows the diff between revisions.

III. Crawling

There are a couple of important problems to solve when using PhantomJS to crawl a website:

– How to log in (1)

– How to maintain that logged-in status (store and reuse cookies) (2)

(1) depends on how the website’s login system works.

(2) is just a technique we should remember when working with PhantomJS.
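As a small addition to (2): when cookies are read back from a file, it can be useful to drop the ones that have already expired before handing them to PhantomJS. The sketch below is my own illustration, not part of the original script. It assumes each stored cookie may carry a numeric `expiry` field in seconds since the epoch, and `keepFreshCookies` is a hypothetical name.

```javascript
// Hypothetical helper: keep only cookies that have not expired yet.
// Assumption: `expiry` (if present) is a number of seconds since the Unix epoch.
function keepFreshCookies(cookies, nowSeconds) {
  return cookies.filter(function (cookie) {
    // Cookies without a numeric expiry are treated as session cookies and kept.
    return typeof cookie.expiry !== "number" || cookie.expiry > nowSeconds;
  });
}

// Made-up sample data standing in for the parsed contents of a cookie file.
var saved = JSON.parse(
  '[{"name":"sid","value":"abc","expiry":2000},' +
  '{"name":"old","value":"x","expiry":1000},' +
  '{"name":"tmp","value":"y"}]'
);
var fresh = keepFreshCookies(saved, 1500);
console.log(fresh.map(function (c) { return c.name; }).join(",")); // sid,tmp
```

In a real run, `nowSeconds` would be something like `Math.floor(Date.now() / 1000)`, and only the surviving cookies would be passed on to `phantom.addCookie`.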

Here is the code. In it, we:

– Try to log in to Pluralsight.

– Check whether the existing cookies are still valid. If they are, we don’t need to log in again. If not, we go through the login process.

– Render the homepage to show that we are logged in.

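// Setup. The file name and URLs below are placeholders, not values from the original script.
var page = require("webpage").create();
var fs = require("fs");

var CookieJar = "cookiejar.json";                // placeholder file name for the stored cookies
var historyUrl = "[TYPE-HISTORY-PAGE-URL-HERE]"; // a page that redirects when not logged in
var loginUrl = "[TYPE-LOGIN-PAGE-URL-HERE]";
var loggedIn = true;        // assumed until a redirect away from historyUrl proves otherwise
var loadInProgress = false;
var requestingUrl;

// pretend to be a different browser, helps with some shitty browser-detection scripts
page.settings.userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36";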

// Load previously saved cookies (if the file exists) into PhantomJS
if (fs.isFile(CookieJar)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(CookieJar)), function(x) {
    phantom.addCookie(x);
  });
}

// Save the current cookies to disk every time a response is received
page.onResourceReceived = function(response) {
  fs.write(CookieJar, JSON.stringify(phantom.cookies), "w");
};

page.onLoadStarted = function() {
  loadInProgress = true;

  // Mark requestingUrl to check "redirect" behaviour while logging-in
  requestingUrl = page.evaluate(function() {
    return window.location.href;
  });
  console.log("load started");
};

page.onLoadFinished = function() {
  loadInProgress = false;
  console.log("load finished");
};

page.onUrlChanged = function(targetUrl) {
  // RULE: if the URL changes while we are on historyUrl, we were redirected => not logged in
  if (requestingUrl === historyUrl) {
    loggedIn = false;
  }
};

var steps = [
  function() {
    // Needed to stop the script from moving on to the next step before the page has loaded
    loadInProgress = true;
    page.open(historyUrl);
  },
  function() {
    if (!loggedIn) {
      // Load the login page
      console.log("Not logged in yet. Attempting to log in now");
      page.open(loginUrl);
    } else {
      console.log(">> Cookies are still valid. Logged in!");
    }
  },
  function() {
    if (!loggedIn) {
      // Needed to stop the script from moving on to the next step before the form is submitted
      loadInProgress = true;

      window.setTimeout(function() {
        // Enter Credentials
        page.evaluate(function() {

          // Prefer the form with class "reg", else the first form (assumes the page loads jQuery)
          var loginForm = $("form.reg")[0] || $("form")[0];

          loginForm.elements["Username"].value = "[TYPE-USERNAME-HERE]";
          loginForm.elements["Password"].value = "[TYPE-PASSWORD-HERE]";
          // Call submit() from the prototype so it still works if the form has a field named "submit",
          // and submit the same form we just filled in
          HTMLFormElement.prototype.submit.call(loginForm);
        });
      }, 1500);
    }
  },
  function() {
    loadInProgress = true;
    window.setTimeout(function() {
      // Render screenshot of current page
      page.render("plural.png");
      loadInProgress = false;
    }, 1000);
  }
];
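The excerpt above defines `steps` but does not show the loop that runs them one after another. A minimal sketch of such a loop follows, assuming a polling approach: every few milliseconds, run the next step unless a page load is still in progress. `createStepRunner` and `runSteps` are hypothetical names of mine and may differ from what the full script does.

```javascript
// Hypothetical helper: wraps a list of step functions and a "busy" check.
function createStepRunner(steps, isBusy) {
  var index = 0;
  return {
    // Run the next step unless something is still loading.
    // Returns true once all steps have run and nothing is pending.
    tick: function () {
      if (isBusy()) {
        return false;
      }
      if (index < steps.length) {
        steps[index]();
        index += 1;
        return false;
      }
      return true;
    }
  };
}

// Poll the runner until it reports completion, then call onDone.
function runSteps(steps, isBusy, onDone, intervalMs) {
  var runner = createStepRunner(steps, isBusy);
  var timer = setInterval(function () {
    if (runner.tick()) {
      clearInterval(timer);
      onDone();
    }
  }, intervalMs || 50);
}

// In the PhantomJS script this could be wired up roughly like so:
// runSteps(steps, function () { return loadInProgress; }, function () { phantom.exit(); });
```

Because completion is only reported when nothing is busy, the last step (which sets `loadInProgress` to true until the screenshot is rendered) gets to finish before the script exits.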

Execute by running:

$ phantomjs script.js

Full code here: https://gist.github.com/tuannh99/ef247247fda68793efcb

I hope this is helpful. If you would like to discuss anything, drop me an email at tuan_nh@septeni-technology.jp, or leave a comment on the GitHub page.

(*) Install it with npm, using this command:

$ npm install --save page-monitor

Tags:Crawler, crawling, headless, phantomjs, test, testing, webdriver

About The Author

Nguyen Huy Tuan