This is the personal blog of Neil Ang. Simple and interesting technology articles written by a developer for developers. Feel free to comment on posts or link to this site. Constructive feedback is always welcomed.

How to create a spell checking web spider

Posted: 24 October 2009

Recently I found out my employer was paying a third-party company to regularly check the spelling on their website and send them a monthly report. So I thought I would write a simple spider that could do the same thing and save them money.

The setup

My solution uses Ruby (plus some gems) and the awesome Aspell (an open source spell checker).

First of all you will need to download and install Aspell, and an Aspell dictionary for the language you want to use. In this example I will be using the English dictionary.

To install Aspell on OS X, download and unpack the latest version, and then using terminal:

cd path/to/aspell
./configure
make
sudo make install

To install a dictionary, download and unpack it, then repeat the steps above from inside the dictionary's directory.

You can verify that Aspell installed correctly and that the dictionaries are loaded by typing these commands in terminal:

aspell -v
aspell dicts

Next you will need to install raspell, which is a gem used to interact with Aspell. You will also need another gem called spider, which will take care of the web crawling work for us. And finally, hpricot or nokogiri to handle the HTML.

sudo gem install raspell
sudo gem install spider
sudo gem install hpricot
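
Before running the spider it can be handy to confirm that all three libraries actually load. The following is only a minimal sketch: `missing_libraries` is a hypothetical helper name of my own, not part of any of the gems, and it simply tries to `require` each name and reports the ones that fail.

```ruby
# Hypothetical helper (not part of any gem): returns the names from
# `libraries` that cannot be loaded with `require`.
def missing_libraries(libraries)
  libraries.reject do |name|
    begin
      require name
      true            # loaded (or was already loaded)
    rescue LoadError
      false           # could not be found, so keep it in the result
    end
  end
end

if __FILE__ == $PROGRAM_NAME
  missing = missing_libraries(%w[spider raspell hpricot])
  if missing.empty?
    puts 'All required libraries loaded.'
  else
    puts "Missing libraries: #{missing.join(', ')}"
  end
end
```

If anything is listed as missing, re-run the matching `gem install` command above.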

The script

With that done, you can use the spider gem to crawl all the HTML pages on a website, then use hpricot to extract the relevant words from each page and check them with raspell. For example:

#!/usr/bin/env ruby

require 'rubygems'
require 'spider'
require 'raspell'
require 'hpricot'

domain = 'http://www.example.com/'

speller = Aspell.new('en_GB')

Spider.start_at(domain) do |s|

  s.add_url_check do |a_url|
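    # Only follow links that start with our domain (escaped so "." is matched literally)
    a_url.match(/\A#{Regexp.escape(domain)}/)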
  end

  s.on :success do |a_url, resp, prior_url|
    # Only spell check HTML responses; a missing Content-Type header counts as non-HTML
    unless resp['content-type'].to_s.match('text/html')
      puts "Skipping #{a_url}"
      next
    end
    puts "On page #{a_url}"
    
    document = Hpricot(resp.body)
    document.search('head').remove
    document.search('script').remove
    document.search('link').remove
    document.search('meta').remove
    document.search('style').remove
    words = document.inner_text.gsub(/\s+/, ' ').strip.split(/\s/)
    
    speller.list_misspelled(words).each do |mistake|
      puts " * Found mistake \"#{mistake}\" perhaps you meant \"#{speller.suggest(mistake).first}\""
    end
  end

end
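
The URL check in the script is a simple prefix match on the domain. As an alternative, here is a rough sketch of a host comparison using Ruby's standard `uri` library. The `same_site?` helper is my own hypothetical name, and it assumes you want to treat http and https pages on the same host as part of the same site.

```ruby
require 'uri'

# Hypothetical helper: true when `url` is an http(s) URL on the same host as `base`.
def same_site?(url, base)
  target = URI.parse(url)
  origin = URI.parse(base)
  return false unless %w[http https].include?(target.scheme.to_s.downcase)
  return false if target.host.nil? || origin.host.nil?
  # Host names are case-insensitive, so compare them in lower case
  target.host.downcase == origin.host.downcase
rescue URI::InvalidURIError
  # Malformed links are simply not followed
  false
end
```

Inside the spider block this could be used as `s.add_url_check { |a_url| same_site?(a_url, domain) }`.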

A few things to note about this script:

As you can see, this is only a very basic implementation, but it demonstrates how easily you can create your own spelling spider!
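
One possible next step is to collect the mistakes for a page into a small report instead of printing them as they are found. This is only a sketch under a few assumptions: `spelling_report` and its `ignore` list (for brand names and the like) are hypothetical names of my own, and the speller is anything that responds to `list_misspelled` and `suggest`, the two calls the script above makes on the Aspell object. A stand-in speller is used here so the example runs without Aspell installed.

```ruby
# Hypothetical helper: returns { misspelled word => first suggestion or nil }.
# Words in `ignore` are skipped (compared case-insensitively).
def spelling_report(speller, words, ignore = [])
  ignored = ignore.map(&:downcase)
  report = {}
  speller.list_misspelled(words).uniq.each do |mistake|
    next if ignored.include?(mistake.downcase)
    report[mistake] = speller.suggest(mistake).first
  end
  report
end

# Stand-in speller for demonstration only; the real script would pass in
# the Aspell object instead.
class FakeSpeller
  KNOWN = %w[the quick brown fox].freeze

  def list_misspelled(words)
    words.reject { |word| KNOWN.include?(word.downcase) }
  end

  def suggest(word)
    word == 'quikc' ? ['quick'] : []
  end
end

report = spelling_report(FakeSpeller.new, %w[the quikc brown fox Hpricot quikc], ['hpricot'])
report.each do |word, suggestion|
  puts " * #{word} -> #{suggestion || 'no suggestion'}"
end
```

Each misspelled word appears once per report, which keeps the output short on pages that repeat the same mistake.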

Your comments

paige Moore 24 Oct 09 at 8:47pm

What you did really so amazing and I admire you for having concern with you employer. Thanks for this I will try to download it also.

Brad @ BitBot Software 4 May 10 at 7:36pm

That really is handy. I hadn't thought of building one to be honest, but it would be obviously useful, especially for my clients. Geez Ruby makes it easy :-)
