Python Web Crawler

Python Web Crawler is a reimplementation of a crawler that I wrote in PHP some time ago.

This crawler provides functionality similar to the older one, but with the advantages of Python. The code is much cleaner, more readable, more efficient and more extensible than the PHP version.

Here’s a getting started guide (tested on Ubuntu 10.10):


apt-get install python-mysqldb


To configure the crawler, edit the config.ini file. For example:

[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls =,,
max_depth = 1
log = 1

The connection section holds the usual connection settings for a MySQL database.
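
As a rough illustration of how these settings might be consumed, the sketch below reads them with Python 3's standard configparser module and builds the keyword arguments that MySQLdb.connect accepts. The section name connection is taken from the description above; the helper load_connection_settings and the inline sample text are assumptions made for this example, not part of the crawler's code.

```python
import configparser

# Sample config text; in practice this would be read from config.ini.
SAMPLE_CONFIG = """
[connection]
host = localhost
user = root
pass = root
db = testDB
"""


def load_connection_settings(config_text):
    """Return MySQLdb.connect-style keyword arguments (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["connection"]  # assumed section name
    return {
        "host": section["host"],
        "user": section["user"],
        "passwd": section["pass"],  # MySQLdb names this argument "passwd"
        "db": section["db"],
    }


settings = load_connection_settings(SAMPLE_CONFIG)
# With MySQLdb installed, the dictionary could be passed as:
#   MySQLdb.connect(**settings)
print(settings)
```

Keeping the mapping in one small function means the config key pass only has to be translated to passwd in a single place.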

The params section contains:

start_urls: A list of URLs to start the crawl from. Each must be a complete URL (don’t forget to include http:// or https://, whichever applies). The list must be separated by commas.

max_depth: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start URLs and all the URLs found inside them. 2 also crawls all the URLs found inside those, and so on. Warning: a depth of 3 or greater can take hours, days, months or years!

log: Indicates whether the application prints the crawled URLs to the console.
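
To make the meaning of these three parameters more concrete, here is a minimal sketch of a depth-limited crawl. It assumes a breadth-first traversal, which may not be how the crawler itself walks the links. The functions parse_start_urls and crawl, the get_links callback and the FAKE_LINKS graph are all hypothetical; the fake graph stands in for real HTTP fetches so the example runs offline.

```python
from collections import deque


def parse_start_urls(raw_value):
    """Split a comma-separated start_urls value, dropping empty entries."""
    return [url.strip() for url in raw_value.split(",") if url.strip()]


def crawl(start_urls, max_depth, get_links, log=True):
    """Breadth-first crawl limited to max_depth; returns {url: depth}."""
    seen = {url: 0 for url in start_urls}
    queue = deque(seen)  # iterating the dict skips duplicate start URLs
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if log:
            print("depth %d: %s" % (depth, url))
        if depth >= max_depth:
            continue  # deepest level reached: do not follow links from here
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Hypothetical link graph standing in for real HTTP fetches.
FAKE_LINKS = {
    "http://a.example/": ["http://a.example/1", "http://a.example/2"],
    "http://a.example/1": ["http://a.example/1/x"],
}


def fake_get_links(url):
    """Return the outgoing links of a page in the fake graph."""
    return FAKE_LINKS.get(url, [])


start = parse_start_urls("http://a.example/, ")
visited = crawl(start, max_depth=1, get_links=fake_get_links, log=False)
```

With max_depth = 1 the page http://a.example/1/x is never reached, because links found on depth-1 pages are not followed. Each extra level multiplies the number of pages by roughly the average number of links per page, which is why depths of 3 or more grow so quickly.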


~$ python

