Neil Ang

Developer

How to create a spell checking web spider

Recently I thought about how I could make a web spider to crawl a site and report the spelling mistakes. Here is what I came up with.

The setup

My solution uses Ruby (plus some gems) and the awesome Aspell (an open-source spell checker).

First of all, you will need to download and install Aspell, along with an Aspell dictionary for the language you want to use. In this example I will be using the English dictionary.

To install Aspell on OS X, download and unpack the latest version, then run the following in Terminal:

cd path/to/aspell
./configure
make
sudo make install

To install a dictionary, download and unpack it, then repeat the same steps from inside the dictionary's directory.

You can verify that Aspell installed correctly and that the dictionaries are loaded by running these commands in Terminal:

aspell -v
aspell dicts

Next, you will need to install raspell, a gem used to interact with Aspell. You will also need another gem called spider, which takes care of the web crawling work for us. Finally, you will need either hpricot or nokogiri to handle the HTML.

sudo gem install raspell
sudo gem install spider
sudo gem install hpricot

The script

With that done, you can use the spider gem to crawl all the HTML pages on a website, use hpricot to extract the relevant words from each page, and check each word with raspell. For example:

#!/usr/bin/env ruby

require 'rubygems'
require 'spider'
require 'raspell'
require 'hpricot'

domain = 'http://www.example.com/'

speller = Aspell.new('en_GB')

Spider.start_at(domain) do |s|

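  # Only follow links that stay on the same site. A literal prefix test
  # avoids treating the dots in the domain as regex wildcards.
  s.add_url_check do |a_url|
    a_url.start_with?(domain)
  end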

  s.on :success do |a_url, resp, prior_url|
    # Skip anything that is not HTML (to_s guards against a missing header).
    unless resp['content-type'].to_s.include?('text/html')
      puts "Skipping #{a_url}"
      next
    end
    puts "On page #{a_url}"

    # Parse the page and drop elements that are not displayed to the reader.
    document = Hpricot(resp.body)
    document.search('head').remove
    document.search('script').remove
    document.search('link').remove
    document.search('meta').remove
    document.search('style').remove
    # Split the remaining text on any run of whitespace.
    words = document.inner_text.split

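    speller.list_misspelled(words).each do |mistake|
      # Aspell may have no suggestion at all, so only print one if it exists.
      suggestion = speller.suggest(mistake).first
      hint = suggestion ? " perhaps you meant \"#{suggestion}\"" : ''
      puts " * Found mistake \"#{mistake}\"#{hint}"
    end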
  end

end
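
The script splits the page text on whitespace only, so tokens such as "world!" are passed along with their punctuation attached, and the same word may be checked many times. The following is a minimal sketch of a stricter word extraction. It assumes a hypothetical helper called extract_words, which is not part of raspell or the original script, and it has not been tested against every kind of page text.

```ruby
# Hypothetical helper (not part of the original script or any gem):
# pull word-like tokens out of a string and drop exact duplicates,
# so that each distinct word only needs to be checked once.
def extract_words(text)
  # One or more letters, optionally joined by internal apostrophes ("don't").
  text.scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*/).uniq
end

words = extract_words("Hello, world! Don't panic; the world is fine.")
puts words.inspect
```

In the script, the result of a helper like this could be passed to speller.list_misspelled in place of the whitespace-split array. Note that uniq is case-sensitive here, so "Hello" and "hello" would still count as two words.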

A few things to note about this script:

  • I have set the spell checker to use the "en_GB" dictionary (as this is what we use in Australia), but you can set it to any dictionary you installed (e.g. "en_US").
  • If you want to use nokogiri instead of hpricot, replace the Hpricot call with document = Nokogiri::HTML(resp.body) and require 'nokogiri' at the start of the script.
  • Hpricot is used to strip out typically non-visible sections of the page, so that we only spellcheck the displayed words.
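  • The spider won't crawl outside the configured domain, because the URL check rejects any link that does not start with it.
  • The spider gem can also act as a link checker as it crawls, for example by adding an s.on :failure handler alongside the :success one.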
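
Restricting the crawl to one site comes down to a prefix test on each URL. The sketch below uses a hypothetical helper called same_site?, which is not part of the spider gem, to show why comparing the prefix literally is stricter than building a regular expression from the domain, where each dot matches any character.

```ruby
DOMAIN = 'http://www.example.com/'

# Hypothetical helper (not part of the spider gem): true only when the
# URL begins with the domain, compared character for character.
def same_site?(url, domain = DOMAIN)
  url.start_with?(domain)
end

lookalike = 'http://wwwXexample.com/page'

puts same_site?('http://www.example.com/about') # on the site
puts same_site?(lookalike)                      # different host

# Interpolating the domain into a regex leaves its dots unescaped,
# so the lookalike host slips through this looser test.
loose_match = !lookalike.match("^#{DOMAIN}").nil?
puts loose_match
```

If a regular expression is preferred, wrapping the domain in Regexp.escape before interpolating it should give the same strictness as the literal comparison.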

As you can see, this is only a very basic implementation, but it demonstrates how easily you can create your own spelling spider!