I finished writing a web crawler in PHP.
Here's a tutorial about it. I hope you find it useful.
The web crawler is very easy to use. To run it, just execute this in the console from the main directory of the application:
~$ php main.php
The configuration is read from the config.ini file. For example:
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
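As a hypothetical sketch of how a config like this can be loaded, PHP's built-in parse_ini_string()/parse_ini_file() handle the format directly (the real crawler's loading code isn't shown in this post). Note that the host value bundles the port, so it needs to be split before handing it to mysqli:

```php
<?php
// Hypothetical sketch: parsing the config shown above with PHP's
// built-in parse_ini_string(). In the application you would call
// parse_ini_file('config.ini') instead.
$ini = <<<INI
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
INI;

$config = parse_ini_string($ini);

// Split "host:port" for the mysqli constructor (port defaults to 3306).
[$dbHost, $dbPort] = array_pad(explode(':', $config['host']), 2, 3306);
$maxDepth = (int) $config['max_depth'];

// A connection would then look like:
// $db = new mysqli($dbHost, $config['user'], $config['pass'],
//                  $config['db'], (int) $dbPort);
```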
The first four parameters configure the database connection; I assume these are familiar to you.
The start_url parameter is the URL where crawling starts. Note: the URL must be complete! Don't omit the http:// or https:// prefix.
You can specify the maximum recursion depth in the max_depth parameter. 0 crawls only the start URL. 1 crawls the start_url plus all the URLs found on that page. 2 also crawls all the URLs found on those pages, and so on. Warning: a depth of 3 or greater can take hours, days, months, or even years!
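To make the depth semantics concrete, here is a minimal sketch of depth-limited crawling; it is not the crawler's real code. The $links array stands in for the link extraction a real crawler would do by fetching and parsing each page:

```php
<?php
// Minimal sketch of depth-limited crawling (not the crawler's real code).
// $links maps each URL to the URLs found on that page; in a real crawler
// this would come from fetching and parsing the page.
function crawl(string $url, int $depth, array $links, array &$seen): void
{
    if (isset($seen[$url])) {
        return;                       // already visited this URL
    }
    $seen[$url] = true;
    if ($depth === 0) {
        return;                       // max_depth = 0: only the start URL
    }
    foreach ($links[$url] ?? [] as $next) {
        crawl($next, $depth - 1, $links, $seen);
    }
}

// Tiny fake site: a links to b, b links to c.
$links = ['a' => ['b'], 'b' => ['c']];

$seen = [];
crawl('a', 0, $links, $seen);   // depth 0: visits only 'a'

$seen = [];
crawl('a', 1, $links, $seen);   // depth 1: visits 'a' and 'b'
```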
Finally, the log parameter indicates whether the application prints the crawled URLs to the console.
The config.ini file can also be edited through the web UI:
It's very intuitive, and you can even start a crawl from it. You can also see everything crawled so far by clicking the "see what's crawled" button.
Finally, here is a list of features of the PHP Web Crawler:
– The crawler can run as multiple instances at the same time.
– It can be run from a cron job.
– All crawled data is saved in a MySQL database. The crawler generates a table called "urls" to store the results.
– For each link it saves the source URL, the destination URL, and the anchor text.
– URLs are validated with a regular expression, which skips links to static data on the site, including unnecessary media files. Even so, I can't guarantee the crawler avoids all media files; that would be more complex to validate.
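The post doesn't show the actual regular expression, but a filter of the kind described could look like the hedged sketch below: require a full http(s) URL, then reject common static/media file extensions. As noted above, an extension blacklist cannot catch every media file:

```php
<?php
// Illustrative sketch of a URL filter like the one described above;
// the crawler's real regex is not shown in the post.
function isCrawlableUrl(string $url): bool
{
    if (!preg_match('#^https?://#i', $url)) {
        return false;                          // require a complete URL
    }
    // Reject obvious static assets (images, styles, scripts, media).
    return !preg_match(
        '/\.(jpe?g|png|gif|css|js|ico|pdf|zip|mp[34]|avi)(\?.*)?$/i',
        $url
    );
}

// isCrawlableUrl('http://example.com/page')     -> true
// isCrawlableUrl('http://example.com/logo.png') -> false
// isCrawlableUrl('ftp://example.com/file')      -> false
```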
And here is a demo of six processes crawling at the same time.
The crawler is now hosted at: http://www.binpress.com/app/php-web-crawler/113
Regards, Juan Manuel