Neil Ang

Bersonal Plog

Super Nerd

How to create a spell checking web spider

Posted on .

Recently I found out my employer was paying a third-party company to regularly check the spelling on their website and send them a monthly report. So I thought I would write a simple spider that could do the same thing and save them money.

The setup

My solution uses Ruby (plus some gems) and the awesome Aspell (an open source spell checker).

First of all you will need to download and install Aspell, and an Aspell dictionary for the language you want to use. In this example I will be using the English dictionary.

To install Aspell on OS X, download and unpack the latest version, then run the following in Terminal:

cd path/to/aspell
./configure
make
sudo make install

To install a dictionary, repeat the steps above with a downloaded dictionary.

You can verify that Aspell installed correctly and that the dictionaries are loaded by typing these commands in terminal:

aspell -v
aspell dicts
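
If you would rather have the script itself complain when Aspell is missing, a check along these lines should work. This is only a sketch: find_executable is a helper name I made up for illustration, and it simply walks the directories on your PATH looking for an executable file.

```ruby
# Look for an executable called `name` in the given directories.
# Returns the full path of the first match, or nil if there is none.
# (find_executable is a hypothetical helper, not part of Aspell or raspell.)
def find_executable(name, dirs = ENV.fetch('PATH', '').split(File::PATH_SEPARATOR))
  dirs.each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end
```

Calling find_executable('aspell') at the top of the spider and aborting when it returns nil gives a friendlier error than a failure halfway through the crawl.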

Next you will need to install raspell, a gem used to interact with Aspell, and another gem called spider, which takes care of the web crawling work for us. Finally, you will need hpricot or nokogiri to handle the HTML; this example uses hpricot.

sudo gem install raspell
sudo gem install spider
sudo gem install hpricot
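
To confirm the gems can actually be loaded, something like the following sketch can help. It relies only on Ruby's require raising LoadError for a library it cannot find; missing_libraries is a name I invented for this example.

```ruby
# Try to require each library and return the names that could not be loaded.
# (missing_libraries is a hypothetical helper for illustration.)
def missing_libraries(names)
  names.reject do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end
```

For this post you would call missing_libraries(%w[spider raspell hpricot]); an empty array means everything is ready to go.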

The script

With that done, you can use the spider gem to crawl all the HTML pages on a website, then use hpricot to extract the relevant words from each page and check them with raspell. For example:

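#!/usr/bin/env ruby

require 'rubygems'
require 'spider'
require 'raspell'
require 'hpricot'

# Root of the site to check; the crawl never leaves this prefix.
domain = 'http://www.example.com/'

# Use whichever installed Aspell dictionary matches your content.
speller = Aspell.new('en_GB')

Spider.start_at(domain) do |s|

  # Only follow links that stay on the same site. A plain prefix test
  # avoids treating the dots in the domain as regex wildcards.
  s.add_url_check do |a_url|
    a_url.start_with?(domain)
  end

  s.on :success do |a_url, resp, prior_url|
    # Only HTML pages are checked; to_s guards against a missing header.
    unless resp['content-type'].to_s.include?('text/html')
      puts "Skipping #{a_url}"
      next
    end
    puts "On page #{a_url}"

    # Remove elements whose text is never shown to the reader.
    document = Hpricot(resp.body)
    document.search('head').remove
    document.search('script').remove
    document.search('link').remove
    document.search('meta').remove
    document.search('style').remove

    # Split the remaining text on runs of whitespace.
    words = document.inner_text.split

    speller.list_misspelled(words).each do |mistake|
      puts " * Found mistake \"#{mistake}\" perhaps you meant \"#{speller.suggest(mistake).first}\""
    end
  end

end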
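
Splitting on whitespace leaves punctuation attached to the words, and lets things like URLs, e-mail addresses and numbers through to the spell checker. Below is a rough sketch of a stricter filter you could run first. The candidate_words name and the filtering rules are my own assumptions rather than anything raspell requires, and I have not checked how much of this tokenising Aspell already does for you.

```ruby
# Split text into candidate words, dropping tokens that are unlikely to be
# dictionary words: URLs, e-mail addresses and anything containing a digit.
# (candidate_words is a hypothetical helper; the rules are assumptions.)
def candidate_words(text)
  words = text.split(/\s+/).map do |token|
    next if token =~ %r{://|@|\d}
    # Trim non-letters from both ends, keeping inner apostrophes and hyphens.
    token.gsub(/\A[^[:alpha:]]+|[^[:alpha:]]+\z/, '')
  end
  words.compact.reject(&:empty?)
end
```

In the script above you would pass candidate_words(document.inner_text) to the speller in place of the plain split.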

A few things to note about this script:

As you can see, this is only a very basic implementation, but it demonstrates how easily you can create your own spelling spider!
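
Since the original goal was to replace a monthly report, one possible next step is to collect the mistakes instead of printing them as you go. The class below is only a sketch of that idea: SpellingReport and its methods are names I made up, and the output format is just a suggestion.

```ruby
# Collects misspellings per page and renders them as a plain-text report.
# (SpellingReport is a hypothetical class for illustration.)
class SpellingReport
  def initialize
    @mistakes = Hash.new { |hash, url| hash[url] = [] }
  end

  # Record one misspelled word found on a page, with an optional suggestion.
  def add(url, word, suggestion = nil)
    @mistakes[url] << [word, suggestion]
  end

  # One heading line per page (sorted by URL), followed by its unique
  # mistakes in alphabetical order.
  def to_s
    lines = @mistakes.sort_by { |url, _| url }.flat_map do |url, entries|
      [url] + entries.uniq.sort_by(&:first).map do |word, suggestion|
        hint = suggestion ? " (perhaps \"#{suggestion}\")" : ''
        "  * \"#{word}\"#{hint}"
      end
    end
    lines.join("\n")
  end
end
```

Inside the :success handler you would call report.add(a_url, mistake, speller.suggest(mistake).first) for each mistake, then write report.to_s to a file or an e-mail once the crawl finishes.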