Crawley – A Scraping / Crawling Framework Built On Eventlet

A few weeks ago I started a new project: a crawling / scraping framework that aims to make it easy to extract data from the web and store it in a relational database.

Today I released the early version 0.0.4 and wrote several examples which explain what the framework can do. I promise to add more real-world examples and more documentation in the next few days. In the meantime, you can follow the project's progress in the official repository on GitHub and play with the examples.

You can also install crawley with pip by running:

~$ pip install crawley 

and check the documentation.

That’s all for now. Keep watching the repository =).

Non-blocking I/O, Node Js and Python’s Eventlet

Non-blocking I/O and Node JS

A while ago I did some research on non-blocking I/O. I started with Node Js (a non-blocking I/O framework built on Google Chrome’s JS engine, intended for writing highly scalable networking applications) and I was surprised at how quickly an HTTP server built with this framework can handle a thousand concurrent requests, and how little memory it uses while doing so.

This is possible because Node Js doesn’t start a new thread or process when a new request comes to the server. Your code in Node Js runs in a single thread and nothing blocks. It makes asynchronous I/O calls and asks the operating system to notify it when the I/O task is completed, using epoll (Linux), kqueue (FreeBSD), select or whatever your OS provides for this kind of thing. In the meantime, Node Js can continue processing other requests or doing other work. It never blocks.
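To make the idea concrete, here is a minimal Python sketch of a single-threaded loop that waits for the OS to say which sockets are ready. It uses the standard library selectors module, which picks epoll, kqueue or select depending on the platform. This is only an illustration of the model, not how Node Js is implemented; the names run_echo_loop and demo, and the use of socketpair instead of a real listening port, are my own assumptions.

```python
import selectors
import socket


def run_echo_loop(connections, expected_messages):
    """Serve several sockets from one thread, upper-casing whatever arrives.

    `connections` is a list of already-connected sockets. The loop stops after
    `expected_messages` chunks have been handled (a simplification so the
    sketch terminates on its own).
    """
    selector = selectors.DefaultSelector()  # epoll, kqueue or select, per OS
    for conn in connections:
        conn.setblocking(False)
        selector.register(conn, selectors.EVENT_READ)

    handled = 0
    while handled < expected_messages:
        # Sleep until the OS reports that at least one socket is readable.
        for key, _events in selector.select():
            conn = key.fileobj
            data = conn.recv(1024)  # does not block: the OS said data is ready
            if data:
                conn.sendall(data.upper())
                handled += 1
            else:
                selector.unregister(conn)  # peer closed the connection
    selector.close()
    return handled


def demo():
    # socketpair gives connected sockets without opening a real port.
    pairs = [socket.socketpair() for _ in range(3)]
    clients = [client for client, _server in pairs]
    servers = [server for _client, server in pairs]

    for number, client in enumerate(clients):
        client.sendall(("request %d" % number).encode("ascii"))

    handled = run_echo_loop(servers, expected_messages=3)
    replies = [client.recv(1024).decode("ascii") for client in clients]

    for sock in clients + servers:
        sock.close()
    return handled, replies
```

Calling demo() serves three connections from one thread. No thread or process is created per connection; the loop only touches a socket after the OS has reported it as readable.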

Another remarkable thing is that there is no separate stack for each connection, since there are no threads. That saves a lot of memory when your server has to handle high concurrency levels.

Read more at Node’s official page. It’s a very promising project and it’s still in its early stages.

The issues of the Node Js non-blocking programming model

An issue with this programming model is that your code must be written as a set of callbacks that are invoked when each I/O operation is done. To be more explicit, let’s look at this example:

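var http = require("http")

var server = http.createServer(function (req, res) {

    // Start an outgoing request; the callback runs when Google's response arrives
    http.get({ 'host' : 'google.com'}, function (google_response) {

        // Wait 2 seconds, then finish the response with the Location header
        setTimeout(function () {
            res.end(google_response.headers['location'])
        }, 2000)
    })

    // These lines run immediately, before either callback above fires
    res.writeHead(200, {'Content-Type': 'text/plain'})
    res.write("hello ")
})

server.listen(8000)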

The code just runs a server on localhost at port 8000. When you make a request to http://localhost:8000 it writes “hello”, makes an HTTP GET request to Google, waits for 2 seconds and then prints the Location header. Note that I wrote the code using callback functions. In Node Js, almost all of your code normally looks like this.

In addition, you need to write JavaScript on the server side. If you don’t like JS, though, you can write CoffeeScript for Node instead. If you come from languages like Python or Ruby, you will probably like CoffeeScript.

Eventlet (The Pythonic Way)

Because I’m a Python enthusiast and I don’t want to write code the way Node Js proposes, I switched my research to Eventlet, a Python library that provides a synchronous interface for doing asynchronous I/O operations.

Green Threads And Coroutines

Eventlet uses green threads to achieve cooperative sockets. Eventlet’s green threads are built on top of greenlet, a spin-off of Stackless Python that implements coroutines for the Python language. One good thing about green threads is that they are cheap. Spawning a new green thread is much faster than creating a new POSIX thread, and it consumes much less memory too!
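Greenlet itself is a C extension, but the cooperative idea can be sketched with plain Python generators. The following is an illustration only, not Eventlet’s or greenlet’s actual implementation; worker, run_round_robin and demo are hypothetical names. Each task runs until it yields, and then the scheduler switches to the next one, all inside a single OS thread.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative task: does one step of work, then yields control."""
    for step in range(steps):
        log.append("%s step %d" % (name, step))
        yield  # voluntarily hand control back to the scheduler


def run_round_robin(tasks):
    """Run generator-based tasks in one OS thread until all have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the task until its next yield
        except StopIteration:
            continue            # the task finished; drop it
        ready.append(task)      # otherwise put it at the back of the queue


def demo():
    log = []
    run_round_robin([worker("a", 2, log), worker("b", 3, log)])
    return log
```

Switches only happen at a yield, so the interleaving of the two tasks is the same on every run. A green thread library does the switching for you whenever a task would otherwise wait on I/O.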

Taking advantage of coroutines, Eventlet can patch the socket-related modules of the Python standard library so that they cooperate with green threads, turning their blocking behaviour into non-blocking behaviour under the hood. This means you don’t need to rewrite your synchronous code to make it asynchronous!

If you want examples of what Eventlet can do, read this.
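Patching works because Python looks up module attributes at call time, so replacing a function on a module changes what existing code does without editing that code. Here is a tiny standard-library sketch of that mechanism alone. It does not use Eventlet and does not make anything asynchronous; legacy_wait, cooperative_sleep and demo are hypothetical names used only for illustration.

```python
import time


def legacy_wait(rounds):
    """Unmodified 'synchronous' code: it simply calls time.sleep."""
    for _ in range(rounds):
        time.sleep(0.5)
    return "done"


def demo():
    calls = []

    def cooperative_sleep(seconds):
        # Stand-in for a green sleep: a real library would switch to another
        # green thread here instead of just recording the call.
        calls.append(seconds)

    original_sleep = time.sleep
    time.sleep = cooperative_sleep      # the monkey patch
    try:
        result = legacy_wait(3)         # same code, new behaviour
    finally:
        time.sleep = original_sleep     # always restore the real function
    return result, calls
```

legacy_wait was not changed, yet while the patch is active it calls the replacement instead of really sleeping. A library that patches the socket module relies on the same lookup rule.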

Benchmarking 

Finally, I ran a little benchmark comparing a Node Js server, a WSGI server using Eventlet and the Python HTTPServer from the standard library.

The Node Js Server:

var http = require('http');

http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
    console.log(req.headers['host'] + " - - [" + req.client._idleStart + "] \"" + req.method + " " + req.url + " " + req.httpVersion + "\" " + res.statusCode + " -");
}).listen(6000, "127.0.0.1");

console.log('Server running at http://127.0.0.1:6000/');

The Eventlet Wsgi Server:

from eventlet import wsgi
import eventlet

def handler(env, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello, World!\r\n']

wsgi.server(eventlet.listen(('', 7000)), handler)

The Stdlib HTTP Server:

from SocketServer import ThreadingMixIn
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        self.wfile.write('Hello, World!\r\n')

class SimpleHTTPServer(ThreadingMixIn, HTTPServer):
    pass

server = SimpleHTTPServer(("localhost", 8000), Handler)
print "Serving on port: %s" % 8000
server.serve_forever()

Now I have the Node Js server running on port 6000, the Eventlet Wsgi server on port 7000 and the python Http Server on port 8000.

Let’s use the Apache benchmark tool (ab) on Linux to make 10K requests to each server with a concurrency level of 5:

Python Http Server Results:

Server Software:        BaseHTTP/0.3
Server Hostname:        localhost
Server Port:            8000

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      5
Time taken for tests:   8.956 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      1320000 bytes
HTML transferred:       150000 bytes
Requests per second:    1116.51 [#/sec] (mean)
Time per request:       4.478 [ms] (mean)
Time per request:       0.896 [ms] (mean, across all concurrent requests)
Transfer rate:          143.93 [Kbytes/sec] received

Eventlet Wsgi Server Results:

Server Software:
Server Hostname:        localhost
Server Port:            7000

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      5
Time taken for tests:   3.796 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      1360000 bytes
HTML transferred:       150000 bytes
Requests per second:    2634.18 [#/sec] (mean)
Time per request:       1.898 [ms] (mean)
Time per request:       0.380 [ms] (mean, across all concurrent requests)
Transfer rate:          349.85 [Kbytes/sec] received

Node Js Server Results:

Server Software:
Server Hostname:        localhost
Server Port:            6000

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      5
Time taken for tests:   1.821 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      790000 bytes
HTML transferred:       150000 bytes
Requests per second:    5489.98 [#/sec] (mean)
Time per request:       0.911 [ms] (mean)
Time per request:       0.182 [ms] (mean, across all concurrent requests)
Transfer rate:          423.54 [Kbytes/sec] received

Now let’s increase the concurrency level and set it to 100.

Eventlet Wsgi Server Results:

Server Software:
Server Hostname:        localhost
Server Port:            7000

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      100
Time taken for tests:   9.063 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      1360000 bytes
HTML transferred:       150000 bytes
Requests per second:    1103.35 [#/sec] (mean)
Time per request:       90.633 [ms] (mean)
Time per request:       0.906 [ms] (mean, across all concurrent requests)
Transfer rate:          146.54 [Kbytes/sec] received

Node Js Server Results:

Server Software:
Server Hostname:        localhost
Server Port:            6000

Document Path:          /
Document Length:        15 bytes

Concurrency Level:      100
Time taken for tests:   1.463 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      790000 bytes
HTML transferred:       150000 bytes
Requests per second:    6834.49 [#/sec] (mean)
Time per request:       14.632 [ms] (mean)
Time per request:       0.146 [ms] (mean, across all concurrent requests)
Transfer rate:          527.27 [Kbytes/sec] received

Python Http Server Results:

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
apr_socket_recv: Connection reset by peer (104)
Total of 7830 requests completed

Oops! The server breaks with a broken pipe (I ran the test several times and it never completed the 10K requests).

Note: I also ran the test with a concurrency level of 1K and only Node Js could pass it. Both Python servers broke at some point.
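For easier comparison, the requests-per-second figures reported by ab above can be turned into ratios. This small Python sketch only reuses the numbers from the output above; the dictionary layout and the helper names are my own.

```python
# Requests per second as reported by ab in the runs above, keyed by
# concurrency level. The stdlib server did not finish the run at 100.
REQUESTS_PER_SECOND = {
    5: {
        "python_httpserver": 1116.51,
        "eventlet_wsgi": 2634.18,
        "node_js": 5489.98,
    },
    100: {
        "eventlet_wsgi": 1103.35,
        "node_js": 6834.49,
    },
}


def relative_throughput(results, baseline):
    """Return each server's throughput as a multiple of the baseline server."""
    base = results[baseline]
    return {name: round(value / base, 2) for name, value in results.items()}


def demo():
    at_5 = relative_throughput(REQUESTS_PER_SECOND[5], "python_httpserver")
    at_100 = relative_throughput(REQUESTS_PER_SECOND[100], "eventlet_wsgi")
    return at_5, at_100
```

At a concurrency level of 5, Eventlet handles roughly 2.4 times and Node Js roughly 4.9 times as many requests per second as the stdlib server. At a concurrency level of 100, Node Js handles roughly 6.2 times as many as Eventlet.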

Conclusion

Based on these benchmarks, I think there is little room for discussion about which framework scales better and is more efficient.

However, if you don’t need to handle a huge number of concurrent requests and you want to write your app in pure Python, I recommend Eventlet over the standard synchronous socket library. Cheap green threads make the difference when you need to do concurrent I/O operations. In addition, green threads give you deterministic behaviour and avoid the kernel-level context-switch overhead of POSIX threads and processes. This video shows it better.

A great feature of Eventlet is that you don’t have to rewrite your code to make it asynchronous. You can start with this and learn how to change your application’s behaviour by patching the socket library using Eventlet.

Looking forward

This post was not intended to build an opinion about which framework or library is better, more efficient or more beautiful. It’s just a mind-opening article. I showed you a different model for doing I/O in networking applications. This is just the start! I recommend digging deeper into this model of I/O. It looks set to become stronger in the coming years with the advent of real-time web applications and Comet technologies.

Now it’s time to think about my new project… And by the way, it includes non-blocking I/O, a bunch of networking, and of course, Python =).