PHP Web Scraper : Extract RSS Title

Thursday, 13 September 2012

By skalian0712 at 11:33 am Labels: code, php, rss, scraper, web

Let's Grab the SkaliCloud RSS

Assalamualaikum w.b.t. and greetings,

Today I will show you how to grab an RSS feed and parse its titles. The functions and flow are:

file_get_contents to grab the RSS page
Regex to extract the desired text

file_get_contents can handle remote file access. Even though it is slow, I hope you can learn from it as a beginner. Next time I will show you how to use my favorite, cURL.

Back to the tutorial. Here is how we can grab the RSS:

$result = file_get_contents("http://www.skalicloud.com/category/blog/feed/");

Then $result will contain the XML with all the post information, such as the title, post date, author, and so on.

Here is what the RSS XML looks like:
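The fetch step can fail (network down, wrong URL), and in that case file_get_contents returns false instead of a string. Here is a minimal sketch of a safer fetch. The helper name fetch_feed, the 10 second timeout, and the user agent string are my own assumptions for illustration, not part of the original code. Remote URLs also need allow_url_fopen enabled in php.ini.

```php
<?php
// Hypothetical helper (not from the original post): download a URL and
// return its body, or null when the download fails.
function fetch_feed(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for the http stream wrapper; both values are assumptions.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'rss-title-demo',
        ],
    ]);

    // file_get_contents returns false on failure; @ hides the PHP warning
    // so that the caller can decide how to report the problem.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

You would call it as $result = fetch_feed($url); and check for null before running the regex on it.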

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    >

<channel>
    <title>SKALI Cloud : Malaysia First True Public Cloud Service » Blog</title>
    <atom:link href="http://www.skalicloud.com/category/blog/feed/" rel="self" type="application/rss+xml" />
    <link>http://www.skalicloud.com</link>
    <description>On-Demand Cloud Computing Highly Elastic</description>
    <lastBuildDate>Tue, 04 Oct 2011 01:53:46 +0000</lastBuildDate>
    <language>en</language>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <generator>http://wordpress.org/?v=3.0.1</generator>
        <item>
        <title>Cloud vs Traditional Hosting</title>
        <link>http://www.skalicloud.com/blog/article/cloud-vs-traditional-hosting/</link>
        <comments>http://www.skalicloud.com/blog/article/cloud-vs-traditional-hosting/#comments</comments>
        <pubDate>Tue, 04 Oct 2011 01:53:14 +0000</pubDate>
        <dc:creator>admin</dc:creator>
                <category><![CDATA[Article]]></category>

        <guid isPermaLink="false">http://www.skalicloud.com/?p=1264</guid>
        <description><![CDATA[An article by Mr. Edward from Brilliant Thinking compared traditional hosting methods with cloud hosting. About 3 years ago, I<a href="http://www.skalicloud.com/blog/article/cloud-vs-traditional-hosting/">(more...)</a>]]></description>

Our objective is to get every title between <item> and </item>, where all the post information is located.

The problem is that there is another <title> before the first <item> tag, so we would need to cut the source down so that it starts from the first <item> element. Alternatively, we can keep the whole source and filter the unwanted entry out of the array of regex matches.
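The first option, cutting the source so it starts from the first <item>, can be sketched like this. The helper name item_titles and the small sample string are made up for illustration. The idea is just stripos plus substr before running the same regex.

```php
<?php
// Hypothetical helper: return only the titles that appear from the
// first <item> onward, so the channel title is never matched.
function item_titles(string $xml): array
{
    // Find where the first <item> starts (case-insensitive).
    $start = stripos($xml, '<item>');
    if ($start === false) {
        return []; // the feed has no items
    }

    // Drop everything before the first <item>, then match the titles.
    $items = substr($xml, $start);
    preg_match_all('~<title>(.*?)</title>~is', $items, $match);

    return $match[1];
}

// Tiny made-up feed fragment, not the real SkaliCloud feed.
$sample = '<channel><title>Blog</title>'
        . '<item><title>First post</title></item>'
        . '<item><title>Second post</title></item></channel>';

print_r(item_titles($sample)); // prints only "First post" and "Second post"
```

This way it does not matter how many <title> tags come before the first item.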

My best weapon for complex extraction is XPath, but for beginners I will show my second weapon: regex. Regex is very good if we know how to write the pattern.

Here is my regex pattern:
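For comparison, here is roughly what the XPath way could look like. This is only a sketch, assuming the SimpleXML extension is available (it usually is). The function name xpath_item_titles and the sample feed are made up for illustration. The expression //item/title selects only titles that sit directly inside an <item>.

```php
<?php
// Made-up sample feed, not the real SkaliCloud feed.
$sample = <<<XML
<rss version="2.0">
<channel>
    <title>Blog</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
</channel>
</rss>
XML;

// Hypothetical helper: return the item titles using XPath.
function xpath_item_titles(string $xml): array
{
    // Collect parse errors internally instead of printing warnings.
    libxml_use_internal_errors(true);

    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return []; // not well-formed XML
    }

    $titles = [];
    // //item/title matches <title> elements whose parent is an <item>.
    foreach ($feed->xpath('//item/title') as $node) {
        $titles[] = (string) $node;
    }

    return $titles;
}

print_r(xpath_item_titles($sample)); // the channel title "Blog" is not included
```

With XPath the channel title is never selected, so nothing has to be removed afterwards.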

~<title>(.*?)<\/title>~is //this will extract any characters between each pair of <title> tags

OK, here is my full source code:
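To see what each part of the pattern does, here is a small test on a made-up string (not the real feed). The i modifier ignores letter case, the s modifier lets the dot match newlines, and the ? after .* makes the match lazy so it stops at the nearest closing tag.

```php
<?php
// Made-up input: one lowercase tag pair, one uppercase pair with a newline inside.
$xml = "<title>One</title><TITLE>Two\nlines</TITLE>";

// Lazy match with the i and s modifiers, the same idea as the pattern above.
preg_match_all('~<title>(.*?)</title>~is', $xml, $lazy);

// Greedy variant for comparison: .* runs on to the LAST closing tag.
preg_match_all('~<title>(.*)</title>~is', $xml, $greedy);

var_export($lazy[1]);   // two separate titles: 'One' and "Two\nlines"
echo "\n";
var_export($greedy[1]); // one long match that swallows the tags in between
```

Without the ? the whole feed would come back as a single giant "title", which is why the lazy form is used.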

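<?php
$url = 'http://www.skalicloud.com/category/blog/feed/';

// Download the feed; file_get_contents returns false on failure
$result = file_get_contents($url);
if ($result === false) {
    die('Could not fetch the feed');
}

// Capture the text between every <title> and </title>
$pat = "~<title>(.*?)<\/title>~is";
preg_match_all($pat, $result, $match);

echo '<pre>';

// $match[1] holds the captured titles
print_r($match[1]);
?>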

And the result:

Array
(
    [0] => SKALI Cloud : Malaysia First True Public Cloud Service » Blog
    [1] => Cloud vs Traditional Hosting
    [2] => When is the Public Cloud Most Useful for Enterprises?
    [3] => Planting the Seeds for Your Business Growth…
    [4] => Announcing a Community Discount Model of SKALI Cloud Services
    [5] => TV News Coverage of the SKALI-Microsoft Partnership
    [6] => PRESS RELEASE: SKALI Adds Microsoft Windows-based Hosting
    [7] => SKALI Cloud @ Intuit Technology Day 2011
    [8] => SKALI Cloud Capacity Increased, Serving More Clients
    [9] => Reseller Program Officially Launched at the NEF-Awani ICT Fair
    [10] => Securing with Virtual Private Cloud (VPC)
)

This is how it looks.

OK. If you compare the RSS with our result, you will notice that array index zero (0) is not an item title. It is the title of the blog.

To remove index 0, I prefer using array_slice.

Here is how:

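<?php
$url = 'http://www.skalicloud.com/category/blog/feed/';

// Download the feed; file_get_contents returns false on failure
$result = file_get_contents($url);
if ($result === false) {
    die('Could not fetch the feed');
}

$pat = "~<title>(.*?)<\/title>~is";
preg_match_all($pat, $result, $match);

echo '<pre>';

// Skip index 0 (the blog title) and keep the item titles
print_r(array_slice($match[1], 1));
?>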

And the result:

Array
(
    [0] => Cloud vs Traditional Hosting
    [1] => When is the Public Cloud Most Useful for Enterprises?
    [2] => Planting the Seeds for Your Business Growth…
    [3] => Announcing a Community Discount Model of SKALI Cloud Services
    [4] => TV News Coverage of the SKALI-Microsoft Partnership
    [5] => PRESS RELEASE: SKALI Adds Microsoft Windows-based Hosting
    [6] => SKALI Cloud @ Intuit Technology Day 2011
    [7] => SKALI Cloud Capacity Increased, Serving More Clients
    [8] => Reseller Program Officially Launched at the NEF-Awani ICT Fair
    [9] => Securing with Virtual Private Cloud (VPC)
)

It's fun, right?

You can test the code here:

SEE LIVE ACTION

Wahuallam...

7 comments:

Anonymous, 28 September 2013 at 04:05
Good Job!
But if i want extract ALL the title how can I do that?