PHP Web Crawler

I finished writing a web crawler in PHP.

Here’s a tutorial about it. I hope you find it useful.

The web crawler is very easy to use. To run it, just execute the following in the console from the application’s main directory:

~$ php main.php

The configuration is read from the config.ini file. For example:

config.ini
=========================================
[connection]
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"

[params]
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
=========================================

The first four parameters configure the database connection; I assume these are familiar to you.
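As a minimal sketch of how such settings can be read, here is PHP’s built-in parse_ini_file() applied to the sample file above. This is an illustration only, not necessarily the crawler’s actual code; the sample config is written to a temp file so the snippet is self-contained, and the database connection step is only indicated in a comment.

```php
<?php
// Write the sample config to a temp file so the sketch is self-contained.
$ini = <<<INI
[connection]
host = "localhost:3307"
user = "root"
pass = "root"
db = "jm"

[params]
start_url = "http://www.google.com/"
max_depth = 0
log = "1"
INI;
$path = tempnam(sys_get_temp_dir(), 'cfg');
file_put_contents($path, $ini);

// true => keep the [connection] and [params] sections separate
$config = parse_ini_file($path, true);

echo $config['connection']['db'], "\n";      // jm
echo $config['params']['start_url'], "\n";   // http://www.google.com/

// A real run would then open the connection, e.g.:
// $db = new mysqli('localhost', $config['connection']['user'], ...);
unlink($path);
```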

The start_url param is the URL where crawling begins. Note: the URL must be complete! Don’t omit the http:// or https:// scheme.

You can specify the maximum recursion depth with the max_depth param. 0 crawls only the start URL; 1 crawls the start_url plus every URL found on that page; 2 also crawls every URL found on those pages, and so on. Warning: a depth of 3 or greater can take hours, days, months, or even years!
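To illustrate what the depth values mean, here is a toy depth-limited crawl over a fake in-memory link graph. This is only an illustration of the max_depth semantics, not the crawler’s own code; the URLs and the $web array are made up.

```php
<?php
// A fake "web": each URL maps to the links found on that page.
$web = [
    'http://a/' => ['http://b/', 'http://c/'],
    'http://b/' => ['http://d/'],
    'http://c/' => [],
    'http://d/' => [],
];

function crawl(string $url, int $maxDepth, array $web, array &$seen, int $depth = 0): void {
    if (isset($seen[$url]) || !isset($web[$url])) {
        return;
    }
    $seen[$url] = true;                  // "crawl" the page
    if ($depth >= $maxDepth) {
        return;                          // depth limit reached
    }
    foreach ($web[$url] as $link) {
        crawl($link, $maxDepth, $web, $seen, $depth + 1);
    }
}

$seen = [];
crawl('http://a/', 0, $web, $seen);
echo count($seen), "\n";   // 1: only the start URL

$seen = [];
crawl('http://a/', 1, $web, $seen);
echo count($seen), "\n";   // 3: the start URL plus its two links

$seen = [];
crawl('http://a/', 2, $web, $seen);
echo count($seen), "\n";   // 4: one level deeper also reaches http://d/
```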

Finally, the log parameter indicates whether the application prints the crawled URLs to the console.

The config.ini file can also be edited through the web UI.

It’s very intuitive, and you can even start a crawl from it. You can also see everything crawled so far by clicking the “see what’s crawled” button.

Finally, here is a list of the PHP Web Crawler’s features:

– The crawler can be run as multiple simultaneous instances.

– It can be run by a cron job.

– All the crawled links are saved in a MySQL database. The crawler generates the table “urls” to store them.

– For each link it saves the source URL, the destination URL, and the anchor text.

– It validates URLs via a regular expression, which skips links to static content on the site, including unnecessary media files. Despite this, I can’t guarantee that the crawler avoids all media files; that would be more complex to validate.
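As an illustration of that last point, a filter of the kind described might look like the following. The pattern here is hypothetical; the crawler’s actual regular expression may differ.

```php
<?php
// Hypothetical sketch: accept only complete http(s) URLs and skip
// links that point at obvious static/media file extensions.
function isCrawlable(string $url): bool {
    // Must be a complete http:// or https:// URL...
    if (!preg_match('#^https?://#i', $url)) {
        return false;
    }
    // ...and must not end in a common static/media extension.
    return !preg_match('#\.(jpe?g|png|gif|css|js|pdf|zip|mp[34]|avi)(\?|$)#i', $url);
}

var_dump(isCrawlable('http://www.google.com/'));        // bool(true)
var_dump(isCrawlable('http://example.com/logo.png'));   // bool(false)
var_dump(isCrawlable('ftp://example.com/'));            // bool(false)
```

As the feature list notes, an extension blacklist like this can never catch every media file (for example, media served from extensionless URLs), which is why the crawler can’t guarantee it avoids them all.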

And here is a demo of six processes crawling at the same time.


The crawler is now hosted at: http://www.binpress.com/app/php-web-crawler/113


Regards, Juan Manuel


25 thoughts on “PHP Web Crawler”


  5. Hello Juan,
    I would like to know if your crawler can grab the physical image files from a source site and store them on the destination server. Actually, I don’t want backlinks to image files; I want to get the files onto my server.
    Please respond so that I can buy this package.
    Thanks.
    Tab

    • Hi Tabinda,

      The current version of the crawler can only grab links (to images or anything else) from pages. However, it’s easy to modify the code to fetch the physical images and store them in a database.
      If you want these modifications, let me know and I’ll make them for you.
      You can send me an email or contact me via LinkedIn to discuss this.

      Regards.

  6. Hey Juan,
    I am building a price comparison website and would like to use your crawler to collect all the relevant links and product information, including the price, and insert it into the database. The crawler has to get the details of all the products in a category, regardless of the number of pages. Can it be done?

    • Hey!
      Yeah, the crawler will help you, but you also need to write a scraper to extract the data from the websites.
      I’ve done a lot of data scraping work; maybe I could help you. Just send me an email at jmg.utn@gmail.com.
