Thursday, November 28, 2013

Build a Web Crawler Using BFS

Hello everyone, today I'll be running through how to build a simple web crawler using one of the most popular graph algorithms, Breadth First Search. If you're wondering about the picture of the spider, it simply is exactly what the post is intended to be about; a web-crawler =)



I encourage you to understand the BFS algorithm before continuing this article, as you'll understand the information presented here a lot better.

Essentially, what our web crawler will aim to do is load the HTML from a starting page, say www.example.com and use a regular expression match to find any valid URL's on that site and recursively visit each one of those links and perform the same task ; load the HTML and find any links to other websites. Most importantly, we will also implement a mechanism for keeping track of websites we've already visited, otherwise we may end up visiting the same websites over and over again and end up in an infinite loop of the same websites! A quick diagram should help clear up any confusion:

So, as is shown in the diagram, we have an initial website where we will start our crawl, and every URL within that website will be crawled if it hasn't been visited. Now the problem comes down to  3 factors: Choosing a mechanism for keeping track of visited sites, pulling the HTML from a given site, and how to recognize valid URL's within the HTML of a page.

The first factor, choosing how to keep track of visited sites, is easy to accomplish: A Map Data Structure! More specifically, for the best performance, a hash table where the keys to the table are the URL's and the values can simply be a boolean. Although you can use a BST or any other implementation of a Map if you'd like, I believe that a hash table will be the best option due to its benefits over a BST, and we don't necessarily need ordering of our data.

The second factor, reading the HTML from a website, can be easily done in C/C++ using an external library such as libcurl . Java also allows you to do this without any external library, and many other languages also allow you to extract the HTML from a webpage, so this problem can be taken care of on a language-to-language basis.

The third factor, how to recognize valid URL's from within the HTML of a page, can be done with a regular expression match. Most languages now have support for regular expressions, so this shouldn't be an issue. We'll cover the actual regex syntax for matching the URL later.

As mentioned previously, we will be crawling the web with the BFS algorithm, but as anyone whose studied graph algorithms in the past will ask, why not use Depth First Search? The answer is, you can implement the crawler with DFS as the primary searching algorithm, but this will continue to search deeply rather than fully so depending on how you want the crawling to behave, you can implement BFS to crawl the nearest websites first, or implement DFS to crawl as deep as possible

Due to the fact that the implementation of this in C++ will rely on an external library, I will present the code for the algorithm in Java, but I will have the C++ version available for download separately for those who wish to view the C++ version of the algorithm.

Before we get started, this algorithm needs a starting point; a website where it will initiate its crawl. For purposes of demonstration we will use http://example.com ( yes this is a real website). Finally to the coding! We will need the necessary boilerplate code that will setup the Queue needed in BFS.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler
{
  public static void main(String[] args)
  {

     LinkedList<String> links = new LinkedList<String>();  
                // this will be the BFS queue
     HashMap<String,Boolean> hasVisited = new HashMap<String,Boolean>();
    // this is the hash table for storing previously visited sites
     String startingString = "http://example.com";
     links.add(startingString); 
     hasVisited.put(startingSite,true); 
         // we've already been to the starting site

 
    }

}



With out boiler plate code ready, all we need to discuss now is the regular expression syntax for matching valid URL's. There are plenty tutorials scattered throughout the interwebs regarding regular expressions, but in essence our syntax will be the following: "http://(\\w+\\.)*(\\w+)" which will match "http://" followed by any sequence of 1 or more alphanumeric characters, followed by a . ( the period which occurs in .com , .net, .org, etc,). This sequence can occur multiple times ( such as in the case of websites of international domains). Then followed by one or more alphanumeric quantities ("com", "org", "net", etc). If this sounded foreign to you, please visit This Regular Expressions Tutorial Website to learn the regular expression syntax.

Now with everything clear, the complete code is below:
import java.io.BufferedReader; // necessary imports
import java.io.InputStreamReader;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler
{
  public static void main(String[] args)
  {

     LinkedList<String> links = new LinkedList<String>();  
     HashMap<String,Boolean> hasVisited = new HashMap<String,Boolean>();
     String startingString = "http://example.com";
     links.add(startingString); 
     hasVisited.put(startingString,true); 
     String regex = "http://(\\w+\\.)*(\\w+)"; 
     while(links.size() > 0)
     {

         try
         {
           String nxt = links.poll(); // next website to process
           System.out.println(nxt); // print the site
           URL url = new URL(nxt); // construct a URL object 
           HttpURLConnection http = null;
           http =(HttpURLConnection)url.openConnection();
           http.connect(); // open and connect to the website

           Pattern pat = Pattern.compile(regex); // regex matching
           BufferedReader br = null;
           InputStreamReader isr = null; 
           InputStream is = http.getInputStream();
           isr = new InputStreamReader(is);
           br = new BufferedReader(isr); // sets up
           String data = "";
           while((data = br.readLine()) != null)
           {
              Matcher mat = pat.matcher(data);
              while(mat.find())
              {
                  String w = mat.group();
                  if(hasVisited.get(w) == null)
                  {
                      links.add(w);
                      hasVisited.put(w,true);
                      
                  }
               }
           }
         }catch(Exception e)
         { }
     }

  }
    

}
Note: The reason we have a try/catch block is because some websites may be offline or not valid sites anymore, in which case, the Java Runtime will throw an exception which would otherwise halt program execution. In this case we silently "ignore" the exception by choosing to do nothing.

 The code presented here is pretty straightforward: we have a queue with which BFS feeds off of, a hash table storing the sites we've already been to, and we start the crawl on a sample website. The BFS algorithm will continue to poll websites off the queue and process them (in this case we print them to the console but you can store them to disk, or any other useful task) and read their HTML, match any URL's with a regular expression,then check to make sure we haven't already been to that site, en-queue it and mark as visited.

Although the example code is in Java, this can be ported over to other languages.

Here is some output produced by the code above after roughly 1 second of execution:

http://example.com
http://www.iana.org
http://gsa.icann.org
http://www.icann.org
http://icann.org
http://purl.org
http://xmlns.com
http://ogp.me
http://www.w3.org
http://rdfs.org
http://drupal.org
http://buenosaires48.icann.org
http://www.flickr.com
http://audio.icann.org
If you enjoyed this article, please Subscribe to get updated with my new blog posts as well as newsletters.

Thank you for your time :)

1 comment:

  1. Very informative, indeed. Thanks for taking the time to explain this!

    ReplyDelete