PhantomJS – More than UI testing
August 21, 2015
PhantomJS is well known as a “headless website testing” tool. People often use PhantomJS together with a WebDriver tool (such as Selenium) to test whether their web pages work as expected.
In this blog post, I will demonstrate a part of this functionality and show you how I use PhantomJS to crawl/scrape web pages. Crawling is fun!
Foreword: This blog post assumes you’re at least familiar with JavaScript code. It doesn’t cover how to install PhantomJS, for example.
I. Installation
I chose to install the latest version of PhantomJS from source.
II. Capture page differences
Imagine the development team updates the website with a new version, and we want to capture what changed in any DOM element. How can we do that? We can use our eyes, but what about subtle changes that we can’t spot by eye? PhantomJS comes to the rescue!
First, install page-monitor, a Node.js module built on top of PhantomJS (*)
Let’s say we’ll check http://vnexpress.net. Create a file named monitor.js with the following content:
var Monitor = require('page-monitor');

var url = 'http://vnexpress.net/';
var monitor = new Monitor(url);

monitor.capture(function(code){
    console.log(monitor.log); // from phantom
    console.log('done, exit [' + code + ']');
});
Execute by running:
$ node monitor.js
The output looks something like this:
Later (30 minutes or so), we come back and capture the page again. The output will be *almost* the same.
We know that vnexpress.net is a news site, so articles move up and down the page all the time. Within 30 minutes or so, something on the page will have changed.
We create a new file named check_diff.js to compare the two captures:
var Monitor = require('page-monitor');

var url = 'http://vnexpress.net/';
var monitor = new Monitor(url);

// 1440132126640 is the time of the first capture - get this number from the output of the first run
// 1440138104855 is the time of the second capture
monitor.diff(1440132126640, 1440138104855, function(code){
    console.log(monitor.log.info); // diff result
    console.log('[DONE] exit [' + code + ']');
});
The output of this process is an image that shows exactly what changed, visualized!
We can go much further, depending on what we need. For example, we could build a dashboard that stores each captured page state as a revision and shows the diff between revisions.
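As a small step in that direction, here is a rough sketch of how such a dashboard might pick which two captures to compare. The helper name latestTwoCaptures is hypothetical and not part of page-monitor; it only assumes we have collected the capture timestamps printed by earlier runs.

```javascript
// Hypothetical helper (not part of page-monitor): given capture timestamps
// as numbers or numeric strings, return the two most recent ones as
// [previous, latest], or null when fewer than two valid timestamps exist.
function latestTwoCaptures(timestamps) {
    var sorted = timestamps
        .map(Number)
        .filter(function (t) { return isFinite(t); }) // drop non-numeric entries
        .sort(function (a, b) { return a - b; });     // oldest first

    if (sorted.length < 2) {
        return null;
    }
    return [sorted[sorted.length - 2], sorted[sorted.length - 1]];
}
```

The resulting pair could then be passed to monitor.diff(pair[0], pair[1], callback) in the same way as the two hard-coded numbers in check_diff.js.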
III. Crawling
There are some important things to resolve when using PhantomJS to crawl a website:
– How to log in (1)
– How to maintain that logged-in status (store and reuse cookies) (2)
(1) depends on how the website’s login system works.
(2) is just a technique we should remember when working with PhantomJS.
Here is the code. In it, we:
– Try to log in to Pluralsight.
– Check whether the existing cookies are still valid. If yes, we don’t need to log in again. If no, we go through the login process.
– Render the homepage to show that we are logged in.
// Excerpt: page, fs, CookieJar, historyUrl, loginUrl, loggedIn, requestingUrl
// and loadInProgress are used here but not defined in this excerpt;
// see the full code linked below.

// pretend to be a different browser, helps with some shitty browser-detection scripts
page.settings.userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36";

// Load previously saved cookies (if any) into phantom
if (fs.isFile(CookieJar)) {
    Array.prototype.forEach.call(JSON.parse(fs.read(CookieJar)), function(x) {
        phantom.addCookie(x);
    });
}

// Write the current cookies to the cookie jar whenever a resource is received
page.onResourceReceived = function(response) {
    fs.write(CookieJar, JSON.stringify(phantom.cookies), "w");
};

page.onLoadStarted = function() {
    loadInProgress = true;
    // Mark requestingUrl to check "redirect" behaviour while logging-in
    requestingUrl = page.evaluate(function() {
        return window.location.href;
    });
    console.log("load started");
};

page.onLoadFinished = function() {
    loadInProgress = false;
    console.log("load finished");
};

page.onUrlChanged = function(targetUrl) {
    // RULE: If from historyUrl, we got new URL => not logged in
    if (requestingUrl === historyUrl) {
        loggedIn = false;
    }
};

var steps = [
    function() {
        // This is needed to prevent script auto go to next step
        loadInProgress = true;
        page.open(historyUrl);
    },
    function() {
        if (!loggedIn) {
            // Load Login Page
            console.log("Not logged in yet. Gonna attempt to login now");
            page.open(loginUrl);
        } else {
            console.log(">> Cookies are still valid. Logged In!");
        }
    },
    function() {
        if (!loggedIn) {
            // This is needed to prevent script auto go to next step
            loadInProgress = true;
            window.setTimeout(function() {
                // Enter Credentials
                page.evaluate(function() {
                    var loginForm = $("form.reg")[0] || $("form")[0];
                    loginForm.elements["Username"].value = "[TYPE-USERNAME-HERE]";
                    loginForm.elements["Password"].value = "[TYPE-PASSWORD-HERE]";
                    // Submit the same form we just filled in. Calling submit via the
                    // prototype works even if a field named "submit" shadows form.submit.
                    HTMLFormElement.prototype.submit.call(loginForm);
                });
            }, 1500);
        }
    },
    function() {
        loadInProgress = true;
        window.setTimeout(function() {
            // Render screenshot of current page
            page.render("plural.png");
            loadInProgress = false;
        }, 1000);
    }
];
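The steps array above only defines what to do; it does not show how the steps are executed one after another. The full script linked below handles that. As a rough illustration only, here is one way it could be done, assuming a polling loop that waits while loadInProgress is true. The names createStepRunner and isBusy are hypothetical and not taken from the original script.

```javascript
// Sketch of a step runner (hypothetical, not the original implementation).
// steps:  array of functions to run in order
// isBusy: function returning true while a page load or delayed action is pending
function createStepRunner(steps, isBusy) {
    var index = 0;
    return {
        // Runs at most one step per call. Returns true only when every step
        // has run and nothing is pending any more.
        tick: function () {
            if (isBusy()) {
                return false; // wait until the current load/timeout finishes
            }
            if (index < steps.length) {
                steps[index]();
                index += 1;
                return false;
            }
            return true;
        }
    };
}

// Usage inside the PhantomJS script (assumes `steps`, `loadInProgress`
// and the global `phantom` object from the code above):
//
//   var runner = createStepRunner(steps, function () { return loadInProgress; });
//   var timer = window.setInterval(function () {
//       if (runner.tick()) {
//           window.clearInterval(timer);
//           phantom.exit();
//       }
//   }, 50);
```

Because the runner only reports completion once nothing is pending, the last step gets time to render the screenshot before the script exits.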
Execute by running:
$ phantomjs script.js
Full code here: https://gist.github.com/tuannh99/ef247247fda68793efcb
Hope this is helpful to you all. If you would like to discuss anything, drop me an email at tuan_nh@septeni-technology.jp, or leave a comment on the GitHub page.
(*): Install it using npm with this command:
$ npm install --save page-monitor