Proxy Dispatcher implemented in PHP

I want to share a piece of code that might be very useful when you have to deal with object introspection in PHP. I played with Python’s introspection system for years and I loved it.

But now I’m back on PHP, a language that has very good metaprogramming tools but which, from my point of view, is less pragmatic than Python or Ruby in this respect (and maybe in almost every respect).

In this piece of code I’m trying to replace Python’s *args with the PHP function call_user_func_array. The functionality behind these different implementations is very similar in the end, but I still think Python’s approach is far better =).

Let the code talk:

/**
* Proxy Dispatcher using php call_user_func_array (http://us2.php.net/manual/en/function.call-user-func-array.php)
* */

class Foo {

    function bar1($arg, $arg2, $arg3, $arg4) {
        return "arg: $arg, arg2: $arg2, arg3: $arg3, arg4: $arg4\n";
    }
    function bar2($arg, $arg2) {
        return "arg: $arg, arg2: $arg2\n";
    }
    function bar3($arg) {
        return "arg: $arg\n";
    }
}

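class FooWrapper {

    // Declared explicitly: dynamic properties are deprecated as of PHP 8.2.
    private $_foo;

    public function __construct() {
        $this->_foo = new Foo();
    }

    // Invoked for any method not defined on the wrapper; forwards it to Foo.
    public function __call($method, $arguments) {
        return call_user_func_array(array($this->_foo, $method), $arguments);
    }
}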

$fooWrapper = new FooWrapper();
echo $fooWrapper->bar1(1,2,3,4);
echo $fooWrapper->bar2(1,2);
echo $fooWrapper->bar3(1);
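
As a side note, here is a minimal sketch of a slightly stricter variant, assuming PHP 5.6 or later. It uses the ... unpacking operator, which is the closest PHP equivalent to Python's *args, and it fails loudly when the wrapped object has no such method. The class names Greeter and StrictProxy are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical target class, standing in for Foo.
class Greeter {
    public function greet($greeting, $name) {
        return "$greeting, $name!";
    }
}

// Hypothetical proxy: forwards unknown method calls to the wrapped object.
class StrictProxy {
    private $target;

    public function __construct($target) {
        $this->target = $target;
    }

    public function __call($method, $arguments) {
        // Reject calls the target cannot handle instead of failing obscurely.
        if (!is_callable(array($this->target, $method))) {
            throw new BadMethodCallException("Unknown method: $method");
        }
        // The ... operator unpacks the array into positional arguments.
        return $this->target->$method(...$arguments);
    }
}

$proxy = new StrictProxy(new Greeter());
echo $proxy->greet('Hello', 'world'), "\n";
```

Whether to use call_user_func_array or the unpacking operator is mostly a matter of taste and of the PHP version you have to support.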

And here is the equivalent Python code:

class Foo(object):

    def bar1(self, arg, arg2, arg3, arg4):
        print("arg: %s, arg2: %s, arg3: %s, arg4: %s" % (arg, arg2, arg3, arg4))

    def bar2(self, arg, arg2):
        print("arg: %s, arg2: %s" % (arg, arg2))

    def bar3(self, arg):
        print("arg: %s" % arg)


class FooWrapper(object):

    def __init__(self):
        # One Foo per wrapper, mirroring the PHP constructor.
        self.foo = Foo()

    def __getattr__(self, name):
        # Called only when normal lookup fails; forwards the call to Foo.
        return lambda *args, **kwargs: getattr(self.foo, name)(*args, **kwargs)


fooWrapper = FooWrapper()
fooWrapper.bar1(1, 2, 3, 4)
fooWrapper.bar2(1, 2)
fooWrapper.bar3(1)

PHP Web Crawler

I have finished writing a web crawler in PHP.

Here’s a tutorial about it. I hope you find it useful.

The web crawler is very easy to use. To run it, execute this in the console from the application’s main directory:

~$ php main.php

The configuration will be taken from the config.ini file. For example:

config.ini
=========================================
[connection]
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"

[params]
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
=========================================

The first four parameters configure the database connection. I assume you already know these.
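
As a rough illustration of how a file like this can be read in PHP (this is a sketch, not the crawler's actual loading code), the snippet below parses an embedded copy of the sample config with parse_ini_string so that it runs on its own. Reading the real file would presumably use parse_ini_file('config.ini', true) instead. The variable names are my own.

```php
<?php
// Embedded copy of the sample config so the sketch is self-contained.
$ini = <<<'INI'
[connection]
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"

[params]
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
INI;

// true => keep [connection] and [params] as nested arrays.
$config = parse_ini_string($ini, true);

// In the default scanner mode every value arrives as a string, so cast.
$maxDepth = (int) $config['params']['max_depth'];
$logEnabled = $config['params']['log'] === '1';

echo $config['connection']['host'], "\n";
echo $config['params']['start_url'], "\n";
```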

The start_url param is the URL where the crawl starts. Note: the URL must be complete! Don’t leave out the http:// or https:// prefix.
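
A quick way to check that a URL is complete in this sense could look like the sketch below. The helper name isCompleteUrl is hypothetical, and I am not claiming the crawler performs this exact check.

```php
<?php
// Hypothetical helper: true only for absolute http:// or https:// URLs.
function isCompleteUrl($url) {
    // parse_url returns null for a missing component, so cast before use.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host = (string) parse_url($url, PHP_URL_HOST);
    return in_array($scheme, array('http', 'https'), true) && $host !== '';
}

var_dump(isCompleteUrl('http://www.google.com/')); // bool(true)
var_dump(isCompleteUrl('www.google.com/'));        // bool(false)
```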

You can set the maximum recursion depth with the max_depth param. 0 crawls only the start URL. 1 crawls the start_url and all the URLs found in it. 2 also crawls all the URLs found in those pages, and so on… Warning: a depth of 3 or greater can take hours, days, months or even years!
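
To make the max_depth semantics concrete, here is a small sketch of a depth-limited, breadth-first crawl. It is not the crawler's real code: the function crawl and the $fetchLinks callback are hypothetical, and a hard-coded link graph stands in for real HTTP requests.

```php
<?php
// Hypothetical depth-limited crawl. $fetchLinks stands in for "download the
// page and extract its links"; here it just reads an in-memory graph.
function crawl($startUrl, $maxDepth, $fetchLinks) {
    $crawled = array();
    $queue = array(array($startUrl, 0));
    while ($queue) {
        list($url, $depth) = array_shift($queue);
        if (isset($crawled[$url])) {
            continue; // already crawled, e.g. pages linking back
        }
        $crawled[$url] = true;
        $links = $fetchLinks($url);
        if ($depth < $maxDepth) {
            // Follow links only while below the configured depth.
            foreach ($links as $link) {
                $queue[] = array($link, $depth + 1);
            }
        }
    }
    return array_keys($crawled);
}

// Made-up link graph: a links to b and c, b links to d, c links back to a.
$graph = array('a' => array('b', 'c'), 'b' => array('d'), 'c' => array('a'), 'd' => array());
$fetch = function ($url) use ($graph) {
    return isset($graph[$url]) ? $graph[$url] : array();
};

echo implode(',', crawl('a', 0, $fetch)), "\n"; // a
echo implode(',', crawl('a', 1, $fetch)), "\n"; // a,b,c
echo implode(',', crawl('a', 2, $fetch)), "\n"; // a,b,c,d
```

Each extra level multiplies the number of pages by the average number of links per page, which is why a depth of 3 or more grows so quickly on real sites.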

Finally, the log parameter indicates whether the application shows the crawled URLs in the console.

The config.ini file can also be edited through the web UI.

It’s very intuitive, and you can also start a crawl from it. You can see everything crawled so far by clicking the “see what’s crawled” button.

Finally, here is a list of the PHP Web Crawler’s features:

– The crawler can be run as multiple instances.

– It can be run by a cron job.

– All the crawls are saved in a MySQL database. It generates the table “urls” to store the crawls.

– For each URL it saves the source URL, the destination URL and the anchor text.

– It validates URLs with a regular expression and skips links to static content within the site, including unnecessary media files. Even so, I can’t guarantee that the crawler skips every media file; validating that would be more complex.
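
The kind of regular-expression filtering described in the last item could be sketched as follows. The helper name isCrawlable, both patterns and the extension list are assumptions made for this illustration; the crawler's actual expression may differ.

```php
<?php
// Hypothetical filter: accept http(s) URLs whose path does not end in a
// known static or media extension. The extension list is illustrative only.
function isCrawlable($url) {
    // Require an http:// or https:// prefix followed by a host.
    if (!preg_match('~^https?://[^\s/?#]+~i', $url)) {
        return false;
    }
    // Check only the path, so query strings such as ?v=2 are ignored.
    $path = (string) parse_url($url, PHP_URL_PATH);
    return !preg_match('~\.(?:jpe?g|png|gif|ico|css|js|pdf|zip|mp3|mp4|avi)$~i', $path);
}

var_dump(isCrawlable('http://example.com/about.html'));    // bool(true)
var_dump(isCrawlable('http://example.com/logo.PNG'));      // bool(false)
var_dump(isCrawlable('http://example.com/style.css?v=2')); // bool(false)
```

An extension check like this cannot catch media served from extension-less URLs, which matches the caveat above.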

And here is a demo of 6 processes crawling at the same time.

 

The crawler is now hosted at: http://www.binpress.com/app/php-web-crawler/113

 

Regards, Juan Manuel