Using RegEx to extract certain links based on anchor text



David Bowley
24 Oct 2007, 11:59 AM
<?php


$url = "http://www.somedomain.com/curltest/testpage.html";

$ch = curl_init(); // initialize curl handle

curl_setopt($ch, CURLOPT_URL, $url); // set the URL to fetch

curl_setopt($ch, CURLOPT_FAILONERROR, 1);

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);// allow redirects

curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable

curl_setopt($ch, CURLOPT_TIMEOUT, 3); // times out after 3s

$myHtml = curl_exec($ch); // run the whole process

curl_close($ch);

$pattern = "<HTML\b[^>]*>(.*?)</HTML>"

preg_match_all($pattern, $myHtml, $result, PREG_SET_ORDER);

echo $pattern;

?>

At the moment this script outputs nothing. As a test of the RegEx, everything between the HTML tags on the fetched page should be printed to the page, but nothing appears.
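For reference, a few things in the script above would likely each keep it from printing anything: the $pattern line has no terminating semicolon (a parse error), the pattern has no PCRE delimiters, "." does not match newlines without the "s" modifier, and the script echoes $pattern rather than the match result. Below is a minimal sketch of the same test, assuming a small hard-coded string in place of the curl response; the sample markup and the $innerHtml name are illustrative only.

```php
<?php
// Hypothetical stand-in for the page returned by curl_exec().
$myHtml = "<html lang=\"en\">\n<body>\n<p>Hello</p>\n</body>\n</html>";

// PCRE patterns need delimiters (# here).
// "i" makes the match case-insensitive, "s" lets "." match newlines.
$pattern = '#<html\b[^>]*>(.*?)</html>#is';

if (preg_match($pattern, $myHtml, $match)) {
    // Group 1 holds everything between the opening and closing tags.
    $innerHtml = $match[1];
    // Escape the markup so the browser shows it instead of rendering it.
    echo htmlspecialchars($innerHtml);
} else {
    $innerHtml = null;
    echo "No match";
}
```

Escaping with htmlspecialchars() is a choice made for this sketch so the captured markup is visible in a browser; echoing $innerHtml directly would render it instead.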

Once I've got this sorted out, I want a different RegEx that extracts all links on a page whose anchor text contains a certain phrase.
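One possible way to approach the link extraction, sketched under some assumptions: the href values are quoted, anchors are not nested, and the phrase should match case-insensitively. The function name linksWithAnchorText, the sample markup, and the phrase are all made up for illustration. Regexes are fragile on real-world HTML, so treat this as a starting point rather than a robust parser.

```php
<?php
// Hypothetical sample markup; in the script above this would be $myHtml.
$myHtml = '<p><a href="/a.html">Cheap flights to Rome</a> '
        . '<a class="x" href=\'/b.html\'>Hotels</a> '
        . '<A HREF="/c.html">Book <b>cheap flights</b> now</A></p>';

$phrase = 'cheap flights'; // assumed phrase to look for in the anchor text

function linksWithAnchorText(string $html, string $phrase): array
{
    // Group 1: the quote character, group 2: the href value,
    // group 3: everything between <a ...> and </a>.
    $pattern = '#<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)</a>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Drop nested tags such as <b> before comparing the text.
        $text = strip_tags($m[3]);
        if (stripos($text, $phrase) !== false) {
            $links[] = $m[2];
        }
    }
    return $links;
}

$links = linksWithAnchorText($myHtml, $phrase);
print_r($links);
```

With the sample markup above, $links ends up holding /a.html and /c.html. The phrase check is done in PHP with stripos() rather than inside the pattern, which keeps the RegEx simpler and avoids having to escape the phrase.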

Any help anyone?