Thursday 13 September 2012

PHP Web Scraper : Extract RSS Title

Let's Grab the SkaliCloud RSS
Assalamualaikum w.b.t. and greetings,

Today I will show you how to grab an RSS feed and parse its titles. The functions and flow are:

  1. file_get_contents to grab the RSS page
  2. Regex to extract the desired text

file_get_contents can read remote files as well as local ones. Even though it is slow, I hope you can learn from it as a beginner. Next time I will show you how to use my favorite, cURL.

Back to the tutorial. Here is how we can grab the RSS:

$result = file_get_contents("http://www.skalicloud.com/category/blog/feed/");

$result will then contain XML with all the post information, such as the title, date of post, author, etc.

Here is what the RSS XML looks like:
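If the feed is down or slow, file_get_contents returns false instead of the XML. Here is a minimal sketch of a safer fetch, assuming a 10-second timeout is acceptable. The helper name fetchFeed is hypothetical and not part of the original tutorial.

```php
<?php
// Hypothetical helper (not from the original post): returns the feed body,
// or null when the download fails.
function fetchFeed(string $url, int $timeoutSeconds = 10): ?string
{
    // A stream context lets file_get_contents() apply an HTTP timeout.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // The @ operator hides the PHP warning; failure is reported
    // through the return value instead.
    $result = @file_get_contents($url, false, $context);

    return $result === false ? null : $result;
}
```

You would call it as fetchFeed("http://www.skalicloud.com/category/blog/feed/") and check for null before parsing.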

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    >

<channel>
    <title>SKALI Cloud : Malaysia First True Public Cloud Service &#187; Blog</title>
    <atom:link href="http://www.skalicloud.com/category/blog/feed/" rel="self" type="application/rss+xml" />
    <link>http://www.skalicloud.com</link>
    <description>On-Demand Cloud Computing Highly Elastic</description>
    <lastBuildDate>Tue, 04 Oct 2011 01:53:46 +0000</lastBuildDate>
    <language>en</language>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <generator>http://wordpress.org/?v=3.0.1</generator>
        <item>
        <title>Cloud vs Traditional Hosting</title>
        <link>http://www.skalicloud.com/blog/article/cloud-vs-traditional-hosting/</link>
        <comments>http://www.skalicloud.com/blog/article/cloud-vs-traditional-hosting/#comments</comments>
        <pubDate>Tue, 04 Oct 2011 01:53:14 +0000</pubDate>
        <dc:creator>admin</dc:creator>
                <category><![CDATA[Article]]></category>

        <guid isPermaLink="false">http://www.skalicloud.com/?p=1264</guid>
        <description><![CDATA[An article by Mr. Edward from Brilliant Thinking compared traditional hosting methods with cloud hosting. About 3 years ago, I<a href="http://www.skalicloud.com/blog/article/cloud-vs-traditional-hosting/">(more...)</a>]]></description>


Our objective is to get every title between <item> and </item>, where all the post information is located.

The problem is that there is another <title> before the first <item> tag, so we would need to cut the source down so that it starts from the first <item> element. Alternatively, we can match everything and then filter the unwanted entry out of the array of regex matches.

My best weapon for complex extraction is XPath, but for beginners I will show you my second weapon: regex. Regex is very good if we know how to write the pattern.

Here is my regex pattern:
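The first option, cutting the source so it starts at the first <item>, could look like the sketch below. The sample feed is made up and shortened for illustration, and the sketch assumes the <item> tags carry no attributes.

```php
<?php
// Made-up sample feed, shortened for illustration.
$result = '<channel><title>Blog Name</title>'
        . '<item><title>First Post</title></item>'
        . '<item><title>Second Post</title></item></channel>';

// Find where the first <item> begins; strpos() returns false if there is none.
$start = strpos($result, '<item>');
$items = ($start === false) ? '' : substr($result, $start);

// Only titles that come after the first <item> are left to match.
preg_match_all('~<title>(.*?)</title>~is', $items, $match);
print_r($match[1]);
```

Here $match[1] holds only the two post titles, because the blog title was cut away before the regex ran.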
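For comparison, here is a rough sketch of the XPath way. It assumes the SimpleXML extension is enabled (it usually is) and uses a made-up, shortened feed rather than the real one.

```php
<?php
// Made-up sample feed, shortened for illustration.
$result = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<rss version="2.0"><channel><title>Blog Name</title>'
        . '<item><title>First Post</title></item>'
        . '<item><title>Second Post</title></item>'
        . '</channel></rss>';

// Parse the string into an XML tree; false means the XML is not well formed.
$xml = simplexml_load_string($result);
if ($xml === false) {
    die('The feed is not valid XML.');
}

// //item/title selects only <title> elements that sit directly inside an <item>.
$titles = [];
foreach ($xml->xpath('//item/title') as $title) {
    $titles[] = (string) $title;
}
print_r($titles);
```

Because the XPath expression names <item> explicitly, the blog's own <title> is never selected and nothing has to be removed afterwards.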

~<title>(.*?)<\/title>~is // this extracts everything between each <title> and </title> pair

OK, here is my full source code:
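To see what each part of the pattern does, here is a small sketch on a made-up string. The ? makes (.*) lazy so it stops at the nearest closing tag, the i modifier ignores case, and the s modifier lets the dot match line breaks. Since ~ is the delimiter, the backslash before / is optional.

```php
<?php
// Made-up sample: note the upper-case tag and the line break in the second title.
$sample = "<title>First Post</title>\n<TITLE>Second\nPost</TITLE>";

// Lazy version: one match per <title>...</title> pair.
preg_match_all('~<title>(.*?)</title>~is', $sample, $match);

// Greedy version for comparison: (.*) runs on to the LAST closing tag,
// so both titles are swallowed into a single match.
preg_match_all('~<title>(.*)</title>~is', $sample, $greedy);

echo count($match[1]), " lazy matches, ", count($greedy[1]), " greedy match\n";
// prints "2 lazy matches, 1 greedy match"
```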

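<?php
$url = 'http://www.skalicloud.com/category/blog/feed/';

// Download the feed; file_get_contents returns false on failure
$result = file_get_contents($url);
if ($result === false) {
    die('Could not download the feed');
}

// Capture the text between every <title> and </title>
$pat = "~<title>(.*?)<\/title>~is";
preg_match_all($pat, $result, $match);

echo '<pre>';

print_r($match[1]); // $match[1] holds the captured titles
?>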

And the result:

Array
(
    [0] => SKALI Cloud : Malaysia First True Public Cloud Service » Blog
    [1] => Cloud vs Traditional Hosting
    [2] => When is the Public Cloud Most Useful for Enterprises?
    [3] => Planting the Seeds for Your Business Growth…
    [4] => Announcing a Community Discount Model of SKALI Cloud Services
    [5] => TV News Coverage of the SKALI-Microsoft Partnership
    [6] => PRESS RELEASE: SKALI Adds Microsoft Windows-based Hosting
    [7] => SKALI Cloud @ Intuit Technology Day 2011
    [8] => SKALI Cloud Capacity Increased, Serving More Clients
    [9] => Reseller Program Officially Launched at the NEF-Awani ICT Fair
    [10] => Securing with Virtual Private Cloud (VPC)
)
 
This is how it looks.

OK. If you compare the RSS with our result, you will notice that array index zero (0) is not an item title. It is the title of the blog.

To remove index 0, I prefer using array_slice.

Here is how:

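<?php
$url = 'http://www.skalicloud.com/category/blog/feed/';

// Download the feed; file_get_contents returns false on failure
$result = file_get_contents($url);
if ($result === false) {
    die('Could not download the feed');
}

// Capture the text between every <title> and </title>
$pat = "~<title>(.*?)<\/title>~is";
preg_match_all($pat, $result, $match);

echo '<pre>';

print_r(array_slice($match[1], 1)); // drop index 0 (the blog title)
?>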


 
And the result:
 
Array
(
    [0] => Cloud vs Traditional Hosting
    [1] => When is the Public Cloud Most Useful for Enterprises?
    [2] => Planting the Seeds for Your Business Growth…
    [3] => Announcing a Community Discount Model of SKALI Cloud Services
    [4] => TV News Coverage of the SKALI-Microsoft Partnership
    [5] => PRESS RELEASE: SKALI Adds Microsoft Windows-based Hosting
    [6] => SKALI Cloud @ Intuit Technology Day 2011
    [7] => SKALI Cloud Capacity Increased, Serving More Clients
    [8] => Reseller Program Officially Launched at the NEF-Awani ICT Fair
    [9] => Securing with Virtual Private Cloud (VPC)
) 
 
It's fun, right?
 
 
Wahuallam... 

7 comments:

  1. Good job!
    But if I want to extract ALL the titles, how can I do that?

    1. Thanks for the comment.
      As you can see, the code shown above extracts all the titles in the RSS.

    2. Mmh, no! :-)
      Because if the RSS file only contains the last 5 elements, I can't see every post :-)

    3. To grab all the titles from a specific site or blog, we need a special script that can:

      1. Extract specific HTML elements or DOM nodes
      2. Find and follow links to post pages

      Actually, that is no longer an RSS crawler but a more advanced bot.

      Thank you for your comments.

    4. I solved it using the XML sitemap :-)

    5. Thanks for pointing this out and reminding me. Yes, sitemap.xml will contain all the titles.
