Python Web Crawler is a reimplementation of a crawler I wrote in PHP some time ago.
This crawler provides functionality similar to the older one, but with the advantages of Python: the code is much cleaner, more readable, more efficient, and more extensible than the PHP version.
Here’s a getting started guide (tested on Ubuntu 10.10):
Pre-requisites:
sudo apt-get install python-mysqldb
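To check that the dependency is in place, a quick sanity test like the following should work (Python 2, which is what the python-mysqldb package targets; the credentials are just placeholders taken from the config example below):
========================================================================
# Sanity check: verify that MySQLdb imports and can reach the database.
# The credentials are placeholders from the config.ini example below.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="root", passwd="root", db="testDB")
print "MySQLdb is installed and the connection works"
conn.close()
========================================================================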
Usage:
To configure the crawler, edit the config.ini file. For example:
========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB
[params]
start_urls = http://www.google.com,https://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================
The connection section holds the usual connection settings for a MySQL database.
The params section contains the following (a short parsing sketch follows the list):
start_urls: A comma-separated list of URLs to start the crawl from. Each must be a complete URL; don’t forget to include http:// or https://, whichever is applicable.
max_depth: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start URLs plus all the URLs found inside them. 2 also crawls the URLs found at the previous level, and so on (a simplified crawl loop illustrating this appears after the Run section). Warning: a depth of 3 or greater can take hours, days, months, or even years!
log: Indicates whether the application prints the crawled URLs to the console.
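For reference, here is a minimal sketch of how a config.ini like the one above maps to Python values using the standard ConfigParser module (Python 2). It only illustrates the format; it is not necessarily the crawler's own loading code:
========================================================================
# Illustrative only: reads the config.ini shown above with ConfigParser.
import ConfigParser

config = ConfigParser.ConfigParser()
config.read("config.ini")

# [connection] section: MySQL credentials
host = config.get("connection", "host")
user = config.get("connection", "user")
password = config.get("connection", "pass")
db = config.get("connection", "db")

# [params] section: crawl settings
start_urls = [u.strip() for u in config.get("params", "start_urls").split(",")]
max_depth = config.getint("params", "max_depth")
log = config.getint("params", "log") == 1

print start_urls, max_depth, log
========================================================================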
Run:
~$ python run.py
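If you are curious about what max_depth actually controls, here is a simplified, illustrative breadth-first crawl loop. It is not the project's real code, just a sketch using the Python 2 standard library:
========================================================================
# Illustrative depth-limited crawl: depth 0 = the start urls themselves,
# depth 1 = the urls found inside them, and so on. NOT the project's code.
import re
import urllib2

def crawl(start_urls, max_depth):
    seen = set(start_urls)
    frontier = list(start_urls)
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            try:
                html = urllib2.urlopen(url, timeout=10).read()
            except Exception:
                continue
            print depth, url  # the "log" option would control this output
            # Naive link extraction; a real crawler would use an HTML parser.
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier

crawl(["http://www.python.org"], 1)
========================================================================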
Check Out:
http://www.binpress.com/app/python-web-crawler/208
Regards,
Juan Manuel
Is this a joke??
Be open, man!
It’s open source. It’s just that open source doesn’t mean it has to be free.
I’m developing a crawling framework and it’s totally free.
https://github.com/jmg/crawley
Check it out if you like.
“Open Source”
So why can’t I get the source code?
http://www.opensource.org/docs/osd
And thanks for the reply! 🙂
You should make a video or something to show us how to use the program. I can’t figure out how to start the application.