In a previous post I explained that PHP web scraping involves several techniques. Today I will show you a very basic PHP web scraping example.
As an example, I will show you how to extract the latest news from http://www.skali.net/web/guest/news and put its headlines into a list.
[Screenshot: Skali.Net News Page]
First we need some very basic PHP code to obtain a copy of the Skali news page. For this basic lesson I will use this code:
<?php
$data = file_get_contents(
'http://www.skali.net/web/guest/news'
);
?>
file_get_contents is a quick way to grab a remote page as a string (it needs allow_url_fopen enabled in php.ini to open URLs). The $data variable now contains the HTML source code of the Skali news page.
Next, we need to parse the headlines. For this basic example I will use the regex method. Before writing the regex, we need to know what pattern the news headlines follow and what makes that pattern unique compared with the rest of the HTML code.
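file_get_contents returns false when the download fails, so it is worth checking the result before parsing it. Here is a minimal sketch, assuming a hypothetical helper called fetch_html and a 10 second timeout (both are my own choices for illustration, not part of the code above):

```php
<?php
// Hypothetical helper (not from the original code): fetch a page and
// return null instead of false when the request fails.
function fetch_html(string $url, int $timeoutSeconds = 10): ?string
{
    // Assumed setting: a timeout so a slow server cannot hang the script.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // @ hides the PHP warning; the return value is checked instead.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

You would call it as fetch_html('http://www.skali.net/web/guest/news') and stop early if it returns null, instead of passing an empty result to the regex.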
[Screenshot: Skali News HTML Source Code]
In the screenshot, I have highlighted the text pattern that marks the Skali news headlines.
For example:
<span style="font-size: larger;">EmbunWeb.com Are at Wordcamp Malaysia!</span>
My regex pattern should look like this to match all headlines:
#<span style="font-size\: larger;">(.*?)<\/span>#is
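The regex pattern only needs to match the text between a <span> with the larger font style and its closing </span>. To make sure it works, dump the matches as an array.
<?php
// Download the news page as one string.
$data = file_get_contents('http://www.skali.net/web/guest/news');
// Capture the text inside every "font-size: larger" span.
$regex = '#<span style="font-size\: larger;">(.*?)<\/span>#is';
preg_match_all($regex, $data, $match);
// $match[1] holds the captured headlines. print_r prints by itself,
// so it does not need an echo in front of it.
echo '<pre>';
print_r($match[1]);
echo '</pre>';
?>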
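To see why the pattern uses the lazy (.*?) instead of a greedy (.*), here is a small sketch that runs both on a made-up sample string (the sample markup and variable names are my own, only the lazy pattern comes from the code above):

```php
<?php
// Made-up sample markup: two headline spans side by side.
$sample = '<span style="font-size: larger;">First headline</span>'
        . '<span style="font-size: larger;">Second headline</span>';

// Lazy version, same pattern as in the post.
$lazy   = '#<span style="font-size\: larger;">(.*?)<\/span>#is';
// Greedy variant, for comparison only.
$greedy = '#<span style="font-size\: larger;">(.*)<\/span>#is';

preg_match_all($lazy, $sample, $lazyMatch);
preg_match_all($greedy, $sample, $greedyMatch);

print_r($lazyMatch[1]);   // two entries, one per headline
print_r($greedyMatch[1]); // one entry that swallows both spans
```

The lazy version stops at the first </span> it finds, so each headline ends up in its own array entry. The i flag makes the match case-insensitive and the s flag lets the dot match line breaks, which helps when a headline is wrapped over several lines in the source.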
Here is the output on my localhost test system:
[Screenshot: Weehee... we got it!!]
Output:
Array
(
    [0] => EmbunWeb.com Are at Wordcamp Malaysia!
    [1] => Kepakaran tempatan memacu MEB Oleh Aimi Aizal Nasharuddin
    [2] => Govt websites have markedly improved: MDeC
    [3] => Skali sees 20% growth in web hosting revenue
    [4] => Utusan Malaysia - Program SPIKE lahirkan Aisoft Solution
    [5] => The Edge Malaysia - Sure and Steady Skali
    [6] => Skali Looks Forward To Stage 2 Of MPS Project Feb 8 2010
)
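Once the headlines are in an array, turning them into a real list for a page is a short step. This is only a sketch, assuming a hypothetical helper called headlines_to_list; the three headlines are copied from the output above so the example runs without downloading anything:

```php
<?php
// Three headlines copied from the output above (stand-in for $match[1]).
$headlines = [
    'EmbunWeb.com Are at Wordcamp Malaysia!',
    'Govt websites have markedly improved: MDeC',
    'Skali sees 20% growth in web hosting revenue',
];

// Hypothetical helper: turn the captured strings into an HTML list.
function headlines_to_list(array $headlines): string
{
    $items = '';
    foreach ($headlines as $headline) {
        // Drop nested tags, decode entities, and trim whitespace.
        $text = trim(html_entity_decode(strip_tags($headline), ENT_QUOTES, 'UTF-8'));
        if ($text === '') {
            continue; // skip empty captures
        }
        // Escape again so the text is safe to print inside HTML.
        $items .= '  <li>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . "</li>\n";
    }
    return "<ul>\n" . $items . "</ul>\n";
}

echo headlines_to_list($headlines);
```

In the real script you would pass $match[1] instead of the hard-coded array. The strip_tags and htmlspecialchars calls are there because a captured headline can contain extra markup, such as a link wrapped around the text.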