AskBob | Hacks from the past | Building your own search engine
Back in 2006, Sunil and i were just 2 days away from our code submission deadline at college. We had no code to submit. so, what do you do? hack of course! We decided to quick-hack a web search engine; you know, something that "looks like G", something that works (just), but definitely not industrial :) we had already zeroed in on lucene for an inverted index. so, the larger chunk of code was the crawler itself, which fed stuff to lucene from all over the web. If you like building crawlers that would like to stretch its legs, mercator is a good place to start reading. Anyhoo, this is how AskBob ended up looking after 2 nights of messing around in C#. i dusted off my old source tree today and let AskBob loose for about 20 minutes.
For a start, there is a seed list of links using which pages are fetched; a parser gets meaningful text and links within those pages and you know the drill: lather|rinse|repeat. view the web as one infinitely large graph and traverse it (forever?). you'd have to be careful from where you start, as you can heavily bias your search results. GDirectory is a good place to start though. Once you start building the crawler, you'll run into all sorts of problems. We had to take care of broken HTML (we did, sort of), filter out broken and invalid links, make sure not to run into an already visited page. We actually had multiple lucene indexes to do some of this stuff. And then there is the whole "do no evil" policy where you have to respect the robots exclusion protocol, not hit the same server hard and many more little things like these.
Some numbers:
When we talk about search engines, we usually talk really big numbers like billions of pages, millions of hits and a kazillion other things. But for two clueless guys with commodity hardware and a 256kbps "broadband" connection, we'll see much smaller numbers here. And, this is not the right way of doing any kind of benchmarking for sure :D. By limiting our threadpool to around 6 threads on a (now) 6 year old laptop, we were able to roughly index 70~100 pages per minute. As this is largely an IO bound operation, getting the threads to crawl at an optimum rate was quite a pain; and hence the limit of 6 threads. What was amazing was the number of URLs that were being parsed and queued for further crawling. In around 20 minutes of crawling, you see almost 100000 URLs extracted from the pages visited earlier. And, all of this text cannot be stored in memory. This is where your on-disk and in-memory data structures matter and can significantly affect how your crawler can perform. The sheer volume of data generated is a joy to watch though.
What else should have gone into AskBob, but didn't?
Here are a few things that immediately come to my mind. But there are many more things you'll need if you want to host your own search engine out for public consumption.
- Ranking - lucene does not implement a "page ranking" mechanism. i think it implements ranking largely based on the word frequency within the document rather than measuring things like the number of incoming links to a page, etc. looking back, implementing a basic backlink count based mechanism is fairly easy to implement. but we never got about doing it.
- Sitemaps - Sitemaps.org offers a fabulous solution for dealing with issues like the frequency of visiting a page for updates and the importance of certain pages over the others within a site.
- Document similarity - This is a tricky problem. There is no easy way of telling if two different URLs point to the same page, or if they have similar content. Sunil tells me that Shingling is a good way to measure some of this stuff.
Why call it "AskBob"
Back then, in our circle of friends, tHacker started using the word "Bob" instead of "Dude". It caught on pretty well i must say. So, "AskBob" seemed cool enough a name for a search engine :)
This was a post that i wanted to write for quite sometime now, but never got around doing it, until i stumbled upon this.
Disclaimer: This post is for entertainment purposes only and represents only my views and not that of my employer or anyone else.
Subscribe



