Want to become the next Google? Creating your own search engine may seem naively ambitious, but building a PHP link scraper with cURL could very well be your starting point for doing just that. Of course, you can find better ways, but they aren't half as much fun.
By following Marc Plotz's PHPBuilder tutorial, you will build a robot that scrapes links from web pages and dumps them in a database. Then it reads those links from the database and follows them, scraping up the links on those pages, and so on ad infinitum (or until your server times out or your database fills up, whichever comes first). This little robot has some quite interesting applications if you really have the time to play with and fine-tune it.
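The fetch-scrape-store-follow loop described above can be sketched in a few lines. This is a minimal, simplified sketch only: it uses an in-memory queue as a stand-in for the tutorial's database, and the `$fetch_links` callable (a hypothetical helper, not part of the tutorial's code) stands in for the cURL-and-DOM scraping step so the loop itself is easy to see and test.

```php
<?php
// Simplified sketch of the robot's crawl loop. An in-memory queue and
// visited set stand in for the database described in the tutorial.
// $fetch_links is a hypothetical injected helper that, in the real robot,
// would fetch a page with cURL and scrape its links.
function crawl(string $start, callable $fetch_links, int $max_pages = 100): array
{
    $queue   = [$start];   // links waiting to be followed
    $visited = [];         // links already scraped

    while ($queue && count($visited) < $max_pages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;      // already scraped this page
        }
        $visited[$url] = true;

        // Scrape the page's links and enqueue any we haven't seen yet.
        foreach ($fetch_links($url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

The `$max_pages` cap is what keeps "ad infinitum" from actually happening; without it, the loop runs until the link graph is exhausted or the server gives out.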
Using cURL with PHP
cURL (or “client for URLs”) is a command-line tool for getting or sending files using URL syntax. As an example, the following command is a basic way to retrieve a page from example.com with cURL:

curl http://www.example.com/
PHP is one of the languages that provide full support for cURL. (Find a listing of all the PHP functions you can use for cURL here.) Luckily, PHP also enables you to use cURL without invoking the command line, making it much easier to work with cURL while the server is executing your script. The example below demonstrates how to retrieve a page from example.com using cURL and PHP.
<?php
// Initialize a cURL session for the page we want to retrieve.
$ch = curl_init("http://www.example.com/");
// Open a local file to write the response into.
$fp = fopen("example_homepage.txt", "w");
curl_setopt($ch, CURLOPT_FILE, $fp);   // write the output to the file
curl_setopt($ch, CURLOPT_HEADER, 0);   // leave the HTTP headers out
curl_exec($ch);
curl_close($ch);
fclose($fp);
?>
The Link Scraper
For the link scraper, you will use cURL to get the content of the page you are looking for, and then you will use PHP's DOM functions to grab the links and insert them into your database. Read Marc Plotz's entire PHPBuilder tutorial to find out how to build a PHP link scraper with cURL and to get the complete source code.
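The DOM half of that step can be sketched as follows. This is a minimal illustration, not the tutorial's own code: `extract_links()` is a hypothetical helper name, and the commented-out cURL lines show where the fetched page content would come from in the real robot.

```php
<?php
// Hypothetical helper: pull every <a href> value out of a chunk of HTML
// using PHP's built-in DOMDocument.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings from imperfect real-world markup.
    @$dom->loadHTML($html);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    // Drop duplicates and reindex so the result is a clean list.
    return array_values(array_unique($links));
}

// In the real scraper, the HTML would come from a cURL fetch, e.g.:
// $ch = curl_init("http://www.example.com/");
// curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the page as a string
// $html = curl_exec($ch);
// curl_close($ch);
// $links = extract_links($html);   // ready to insert into the database
```

Setting `CURLOPT_RETURNTRANSFER` makes `curl_exec()` return the page as a string instead of printing it, which is what you want when the content is headed for a parser rather than the browser.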