Python Web Crawler

Python Web Crawler is a reimplementation of a crawler that I wrote in PHP some time ago.

This crawler provides similar functionality to the older one, but with the advantages of Python. The code is much cleaner, more readable, efficient, and extensible than the PHP version.

Here’s a getting started guide (tested on Ubuntu 10.10):

Prerequisites:

~$ sudo apt-get install python-mysqldb

Usage:

To configure the crawler, edit the config.ini file. For example:

========================================================================
[connection]
host = localhost
user = root
pass = root
db = testDB

[params]
start_urls = http://www.google.com,https://codescience.wordpress.com/,http://www.python.org
max_depth = 1
log = 1
========================================================================

The connection section holds the usual connection settings for a MySQL database.
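As a sketch of what presumably happens under the hood (the crawler’s actual internals may differ), that section can be read with Python’s standard ConfigParser and turned into a MySQLdb connection like this:

import ConfigParser  # Python 2 standard library
import MySQLdb

config = ConfigParser.ConfigParser()
config.read('config.ini')

# Open the MySQL connection described by the [connection] section
db = MySQLdb.connect(
    host=config.get('connection', 'host'),
    user=config.get('connection', 'user'),
    passwd=config.get('connection', 'pass'),
    db=config.get('connection', 'db'))

cursor = db.cursor()
cursor.execute('SELECT VERSION()')
print cursor.fetchone()  # quick check that the connection works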

The params section contains:

start_urls: A comma-separated list of URLs to start the crawl from. Each must be a complete URL; don’t forget the http:// or https:// prefix, whichever applies.

max_depth: The depth to crawl. 0 crawls only the start URLs. 1 crawls the start URLs plus every URL found inside them. 2 also crawls every URL found inside those, and so on… Warning: the number of pages grows roughly exponentially with depth, so a depth of 3 or greater can take hours, days, months or even years! (See the sketch below.)

log: Indicates whether the application prints the crawled URLs to the console.
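To make the max_depth semantics concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. This is not the crawler’s actual code; it is Python 2 (matching the MySQLdb dependency above), and link extraction is simplified to a naive regex:

import re
import urllib2

def crawl(start_urls, max_depth):
    # frontier holds the URLs discovered at the current depth
    seen, frontier = set(start_urls), list(start_urls)
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            try:
                html = urllib2.urlopen(url).read()
            except Exception:
                continue  # unreachable pages are simply skipped
            print depth, url
            # Naive link extraction, for illustration only
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier

With max_depth = 0 only the start URLs are fetched; every extra level multiplies the work by the average number of links per page, which is why a depth of 3 can explode.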

Run:

~$ python run.py

Check Out:

http://www.binpress.com/app/python-web-crawler/208

 

Regards,

Juan Manuel

Proxy Server Over Node.js

I finished writing an HTTP proxy server using Node.js.

Node.js is a non-blocking, event-driven I/O framework built on the V8 JavaScript engine (Chrome’s JS engine).

To clarify: nothing blocks in Node.js. Everything is handled by events. This lets networking applications handle thousands of requests while doing other work at the same time!

In my experience, complex and critical applications, like web servers, run very fast on Node.js.
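As a tiny illustration of that event-driven style, here is essentially the canonical Node.js hello-world (not part of the proxy itself): the callback fires once per request event, so the process stays free to accept other connections while any single response is in flight.

var http = require('http');

// The callback runs on each 'request' event; nothing blocks while waiting
http.createServer(function(request, response) {
    response.writeHead(200, {'Content-Type': 'text/plain'});
    response.end('Hello from an event-driven server!');
}).listen(8124);

console.log('Server running at http://localhost:8124/');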

Let’s take a look at the features of my HTTP proxy server.

  • A complete proxy server
  • Written on the Node.js framework
  • Customizable request handlers
  • JavaScript for writing server-side code!
  • Easy configuration to work with all browsers

Setting up the Server

To run the server you need to install Node.js first:

Just go to http://nodejs.org/ and download the latest version. Then go to https://github.com/joyent/node/wiki/Installation and follow the instructions to install Node.

Then you can check out the proxy:

svn checkout http://node-proxy-server.googlecode.com/svn/trunk/ node-proxy-server-read-only

Next, open a terminal, navigate to the proxy directory and type:

~$ node proxy.js

Now the server is running on localhost:8000.

It’s time to set up your browser to use it. Do the following (in Mozilla Firefox):

  • Go to Edit > Preferences on the menu bar
  • In the “Advanced” tab, select “Network” and click on “Settings”
  • Select the option “Manual proxy configuration”
  • In the “HTTP Proxy” textbox write localhost and set the port to 8000
  • Click OK and that’s it! You’re ready to browse.
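If you prefer to verify the proxy without touching browser settings, you can also send a request through it from another terminal (assuming curl is installed):

~$ curl -x http://localhost:8000 http://www.google.com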

Extending the server

You can extend your server by adding request handlers in the handlers.js file.

A handler must be an object containing a ‘pattern’ (a string matched against the URLs) and an ‘action’ (a function that indicates what to do with the request). For example:

var handler = {
    // Matched against each requested URL; any URL containing
    // 'facebook' is handled here instead of being proxied
    pattern : 'facebook',
    // Called with the response object; we answer the request ourselves
    action : function(response) {
        response.writeHead(200, {'Content-Type': 'text/html'});
        response.end("Hello facebook!");
    }
};

The handler must be placed in the list of handlers that is exported to the server module:

 exports.handlers = [handler];
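Several handlers can live in that list. For example, here is a hypothetical second handler of my own (not shipped with the proxy) that blocks matching requests instead of answering them:

// Block any URL containing 'doubleclick' with a 403
var adBlocker = {
    pattern : 'doubleclick',
    action : function(response) {
        response.writeHead(403, {'Content-Type': 'text/plain'});
        response.end('Blocked by the proxy.');
    }
};

exports.handlers = [handler, adBlocker];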

That’s all! If you want to write more complex modules, you can read the Node.js documentation (http://nodejs.org/docs/v0.4.3/api/http.html) or contact me.

Check out the repository

svn checkout http://node-proxy-server.googlecode.com/svn/trunk/ node-proxy-server-read-only

Enjoy the code!