Python Web Crawler

Python Web Crawler is a reimplementation of a crawler that I wrote in PHP some time ago.

This crawler provides functionality similar to the older one, but with the advantages of Python. The code is much cleaner, more readable, more efficient, and more extensible than the PHP version.

Here’s a getting started guide (tested on Ubuntu 10.10):

Pre-requisites:

apt-get install python-mysqldb

Usage:

To configure the crawler, edit the config.ini file. For example:

========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,https://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================

The connection section contains the usual connection settings for a MySQL database.

The params section contains:

start_urls: A comma-separated list of URLs to start the crawl from. Each URL must be complete! Don’t forget the http:// or https:// prefix, whichever applies.

max_depth: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start_urls plus all the URLs found inside them. 2 adds all the URLs found at the previous level, and so on… Warning: a depth of 3 or greater can take hours, days, months, or even years!

log: Indicates whether the application prints the crawled URLs to the console.
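For reference, here is a minimal sketch of how a crawler might load this configuration with Python 2’s ConfigParser module and open the MySQL connection with MySQLdb. This is only an illustration of the config format above, not the project’s actual code:

========================================================================
# Illustrative sketch only -- not the crawler's actual source.
import ConfigParser  # named configparser on Python 3
import MySQLdb

config = ConfigParser.ConfigParser()
config.read('config.ini')

# [connection] section: standard MySQL connection settings
db = MySQLdb.connect(
    host=config.get('connection', 'host'),
    user=config.get('connection', 'user'),
    passwd=config.get('connection', 'pass'),
    db=config.get('connection', 'db'))

# [params] section: crawl parameters
start_urls = [u.strip() for u in config.get('params', 'start_urls').split(',')]
max_depth = config.getint('params', 'max_depth')
log = config.getint('params', 'log')
========================================================================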

Run:

~$ python run.py
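Conceptually, the crawl is a breadth-first traversal bounded by max_depth. The following rough sketch shows the idea; it is assumed behavior for illustration, not the real run.py:

========================================================================
# Rough sketch of a depth-limited crawl -- assumed behavior, not the real run.py.
import re
import urllib2

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl(start_urls, max_depth, log=True):
    frontier = list(start_urls)
    crawled = set()
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            if url in crawled:
                continue
            try:
                html = urllib2.urlopen(url).read()
            except Exception:
                continue  # skip pages that fail to load
            crawled.add(url)
            if log:
                print url
            next_frontier.extend(LINK_RE.findall(html))
        frontier = next_frontier  # next level: the URLs found at this depth
    return crawled
========================================================================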

Check Out:

http://www.binpress.com/app/python-web-crawler/208

 

Regards,

Juan Manuel


PHP Web Crawler

I finished writing a web crawler in PHP.

Here’s a tutorial about it. I hope you find it useful.

The web crawler is very easy to use. To run it, just do this in the console:

~$ php main.php

Run this command from inside the application’s main directory.

The configuration will be taken from the config.ini file. For example:

config.ini
=========================================
[connection]
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"

[params]
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
=========================================

The first four parameters are the database connection settings; I assume these are familiar to you.

The start_url param is the URL to start crawling from. Note: the URL must be complete! Don’t leave out the http:// or https:// prefix, whichever applies.

You can specify the maximum number of recursive searches with the max_depth param. 0 crawls only the start URL. 1 crawls the start_url plus all the URLs found inside it. 2 adds all the URLs found at the previous level, and so on… Warning: a depth of 3 or greater can take hours, days, months, or even years!

Finally, the log parameter indicates whether the application prints the crawled URLs to the console.

The config.ini can also be edited through the web UI:

It’s very intuitive, and you can also start a crawl from it. You can see everything crawled so far by clicking the “see what’s crawled” button.

Finally, here is a list of the PHP Web Crawler’s features:

– The crawler can be run as multiple instances in parallel.

– It can be run by a cron job.

– All the crawled links are saved in a MySQL database. The crawler creates the table “urls” to store them.

– For each link it saves the source URL, the destination URL, and the anchor text.

– URLs are validated with a regular expression, which skips links to static content within the site, including unnecessary media files. Even so, I can’t guarantee that the crawler avoids all media files; that would be more complex to validate. A rough sketch of the idea follows below.
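To make the storage and filtering concrete, here is a rough sketch (in Python, for consistency with the sketches above) of the kind of table and URL filter described in this list. The column names, types, and regular expressions are assumptions for illustration, not the crawler’s actual schema or validation code:

========================================================================
# Illustrative sketch only: the real table layout and validation regex may differ.
import re
import MySQLdb

# One row per crawled link: source page, destination URL, and anchor text.
CREATE_URLS_TABLE = """
CREATE TABLE IF NOT EXISTS urls (
    id INT AUTO_INCREMENT PRIMARY KEY,
    source_url VARCHAR(2048) NOT NULL,
    destination_url VARCHAR(2048) NOT NULL,
    anchor_text VARCHAR(1024)
)
"""

# Accept absolute http(s) URLs and reject common static/media file extensions.
VALID_URL_RE = re.compile(r'^https?://[^\s"\']+$', re.IGNORECASE)
MEDIA_RE = re.compile(r'\.(jpe?g|png|gif|css|js|pdf|zip|mp3|mp4|avi)(\?.*)?$',
                      re.IGNORECASE)

def is_crawlable(url):
    return bool(VALID_URL_RE.match(url)) and not MEDIA_RE.search(url)

def save_link(db, source, destination, anchor_text):
    cur = db.cursor()
    cur.execute("INSERT INTO urls (source_url, destination_url, anchor_text) "
                "VALUES (%s, %s, %s)", (source, destination, anchor_text))
    db.commit()
========================================================================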

And here is a demo of 6 processes crawling at the same time.

 

The crawler is now hosted at: http://www.binpress.com/app/php-web-crawler/113

 

Regards, Juan Manuel