Build a full RSS crawling system using cURL & phpQuery

In this post, I will show you how to create a simple PHP RSS crawler using cURL, the Google Feed API and phpQuery.

http://i.imgur.com/i51lHIM.jpg

 

What’s phpQuery?

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library.

Great, jQuery on the server side. Awesome!

First of all, you have to download the phpQuery class from:

http://code.google.com/p/phpquery/downloads/list

Extract the downloaded file to get a folder named phpQuery, copy it to our project folder, then make sure the PHP cURL and OpenSSL extensions are enabled.

OK, let's write some code 🙂

First, we create a PHP class named Crawler that contains the crawling functions we need.
File: crawler.class.php

http://i.imgur.com/qC6IdiB.jpg

The ENTRY constant is the maximum number of entries returned from the feed URL, the GAPI constant is the URL of the Google Feed API that we use to convert the RSS feed into a JSON object, and the $url variable is passed in through the class constructor.

Now create the main function, which receives an RSS URL and returns the feed data as JSON (using the Google Feed API):
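Since the class is only shown as a screenshot, here is a minimal sketch of what such a skeleton might look like. The value of ENTRY, the exact GAPI endpoint string and the buildRequestUrl helper are assumptions made for illustration, not taken from the original code.

```php
<?php
/**
 * Sketch of a Crawler skeleton with the two constants and the constructor
 * described above. Values and the helper method are illustrative assumptions.
 */
class Crawler
{
    // Maximum number of entries to request from the feed (assumed value).
    const ENTRY = 10;

    // Assumed Google Feed API endpoint: %d is the entry limit, %s the encoded feed URL.
    const GAPI = 'https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=%d&q=%s';

    // The RSS feed URL passed in through the constructor.
    protected $url;

    public function __construct($url)
    {
        $this->url = $url;
    }

    // Hypothetical helper: builds the full API request URL for this feed.
    public function buildRequestUrl()
    {
        return sprintf(self::GAPI, self::ENTRY, urlencode($this->url));
    }
}
```

Keeping the limit and the endpoint as class constants means they can be changed in one place without touching the request logic.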

http://i.imgur.com/QcSFxCc.png

As shown above, we use cURL to send the RSS URL to the Google Feed API, then return the API's response. The response is a JSON string, so we use json_decode to parse it into an object or array.

Looks good enough. Now we create the main index.php file to display the full content from an RSS feed using the Crawler class above.
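The request-and-decode step can be sketched roughly as below. The function names fetchBody and parseFeedEntries are hypothetical, and the response layout (responseData, then feed, then entries) is an assumption about the API's JSON rather than something confirmed by the post.

```php
<?php
/**
 * Fetch a URL with cURL and return the raw response body, or null on failure.
 * (Hypothetical helper name.)
 */
function fetchBody($requestUrl)
{
    $ch = curl_init($requestUrl);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds
    ));
    $body = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return $body === false ? null : $body;
}

/**
 * Decode the JSON string and return the feed entries as arrays.
 * Returns an empty array when the JSON is invalid or has no entries.
 * Assumed layout: responseData -> feed -> entries.
 */
function parseFeedEntries($json)
{
    $data = json_decode($json, true); // true = decode objects as associative arrays
    if (!is_array($data) || !isset($data['responseData']['feed']['entries'])) {
        return array();
    }
    return $data['responseData']['feed']['entries'];
}
```

Splitting the network call from the parsing makes the parsing easy to check with a saved JSON string, without hitting the API each time.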

File: index.php

http://i.imgur.com/V8bgCWZ.png

A variable keeps our feed URL in one place, which makes it easy to maintain 🙂

$crawler = new Crawler($rssFeed);
$data = $crawler->getFeed();

This initializes our Crawler class and calls the getFeed method to convert the RSS feed into an array.

Next, we loop over the returned array and crawl deeper into the target website using each link from the RSS feed.

We use phpQuery's newDocumentHTML method to load the HTML fetched from the remote URL. After that, we can use the pq function, which works like a jQuery selector, as you can see above:
– Select the main div (div.cxtLeft)
– Remove unwanted tags such as script and the social share divs
– Print out the content
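The three steps above could look roughly like this with phpQuery. This is a sketch that assumes the phpQuery folder sits next to the script. The function name extractMainContent and the .social-share class are hypothetical placeholders, so adjust the selectors to the target site.

```php
<?php
// Assumes the phpQuery folder from the download is in the project folder.
require_once 'phpQuery/phpQuery.php';

/**
 * Return the inner HTML of the main content block with unwanted tags removed.
 * $html is the page markup already fetched from the entry's link.
 */
function extractMainContent($html, $selector = 'div.cxtLeft')
{
    // Parse the markup; this becomes the default document for pq().
    phpQuery::newDocumentHTML($html);

    // Select the main content container.
    $main = pq($selector);

    // Drop script tags and share widgets (.social-share is a placeholder class).
    $main->find('script, .social-share')->remove();

    // Return the cleaned inner HTML.
    return trim($main->html());
}
```

In index.php this would be called once per feed entry, passing in the HTML downloaded from that entry's link and echoing the result.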

And you’re done 😉

Check out our work:

http://i.imgur.com/u0LTmLt.jpg

You may customize this script to make it more useful, or grow it into a bigger crawling system, using only cURL and the phpQuery class.

You can download my script here:

https://drive.google.com/file/d/0B00Slak--E0hR0xuTlhGT3YxMkU/edit?usp=sharing
