How to create a spell checking web spider
Posted: 24 October 2009
Recently I found out my employer was paying a third-party company to regularly check the spelling on their website and send them a monthly report. So I thought I would write a simple spider that could do the same thing and save them money.
The setup
My solution uses Ruby (plus some gems) and the awesome Aspell (an open source spell checker).
First of all you will need to download and install Aspell, and an Aspell dictionary for the language you want to use. In this example I will be using the English dictionary.
To install Aspell on OS X, download and unpack the latest version, and then using terminal:
cd path/to/aspell
./configure
make
sudo make install
To install a dictionary, repeat the steps above with a downloaded dictionary.
You can verify that Aspell installed correctly and that the dictionaries are loaded by typing these commands in terminal:
aspell -v
aspell dicts
Next you will need to install raspell, which is a gem used to interact with Aspell. You will also need another gem called spider, which will take care of the web crawling work for us. And finally, hpricot or nokogiri to handle the HTML.
sudo gem install raspell
sudo gem install spider
sudo gem install hpricot
The script
With that done, you can use the spider gem to crawl all the HTML pages on a website, then use hpricot to extract the relevant words from each page and check them with raspell. For example:
#!/usr/bin/env ruby
require 'rubygems'
require 'spider'
require 'raspell'
require 'hpricot'

domain = 'http://www.example.com/'
speller = Aspell.new('en_GB')

Spider.start_at(domain) do |s|
  # Only follow links that start with the domain above
  s.add_url_check do |a_url|
    a_url.match("^#{domain}")
  end

  s.on :success do |a_url, resp, prior_url|
    # Only spell check HTML pages (to_s guards against a missing header)
    unless resp['content-type'].to_s.match('text/html')
      puts "Skipping #{a_url}"
      next
    end
    puts "On page #{a_url}"
    document = Hpricot(resp.body)
    # Remove elements whose content is not displayed on the page
    document.search('head').remove
    document.search('script').remove
    document.search('link').remove
    document.search('meta').remove
    document.search('style').remove
    # Collapse whitespace and split the remaining text into words
    words = document.inner_text.gsub(/\s+/, ' ').strip.split(/\s/)
    speller.list_misspelled(words).each do |mistake|
      puts " * Found mistake \"#{mistake}\" perhaps you meant \"#{speller.suggest(mistake).first}\""
    end
  end
end
A few things to note about this script:
- I have set the speller to use the "en_GB" dictionary (as this is what we use in Australia), but you can set it to any dictionary you installed (e.g. "en_US").
- If you wanted to use nokogiri instead of hpricot, replace the Hpricot call with document = Nokogiri::HTML(resp.body) and change require 'hpricot' to require 'nokogiri' at the start of the script.
- Hpricot is used to strip out typically non-visible sections of the page, so that we only spellcheck the displayed words.
- The spider won't search outside of the set domain.
- The spider is also capable of performing a link check as it crawls.
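On the point about staying inside the set domain: the URL check in the script matches on a prefix built from the domain string. That works for the example, but it depends on details such as the trailing slash. With domain set to 'http://www.example.com', a link to http://www.example.com.evil.test/ would also pass. Below is a minimal sketch of a stricter check that compares scheme, host and port using Ruby's standard uri library. The helper name same_site? is hypothetical and not part of the spider gem; treat it as one possible approach rather than the way the gem expects it to be done.

```ruby
require 'uri'

# Hypothetical helper (not part of the spider gem): returns true when
# candidate_url has the same scheme, host and port as base_url.
def same_site?(candidate_url, base_url)
  candidate = URI.parse(candidate_url)
  base = URI.parse(base_url)
  candidate.scheme == base.scheme &&
    candidate.host.to_s.downcase == base.host.to_s.downcase &&
    candidate.port == base.port
rescue URI::InvalidURIError
  # Links that cannot be parsed are treated as off-site so they get skipped.
  false
end
```

Inside the script, the body of the add_url_check block could then be same_site?(a_url, domain) instead of the match call.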
As you can see, this is only a very basic implementation, but it demonstrates how easily you can create your own spelling spider!
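One place where the basic version could be tightened is the word list itself. Splitting the page text on whitespace leaves punctuation attached to words and lets through tokens such as numbers, URLs and email addresses, and the same word is passed along once for every time it appears. The following is a sketch of a cleanup step, assuming you would rather filter these out before spell checking. The helper name checkable_words and the filtering rules are illustrative choices, not part of raspell or the script above.

```ruby
# Hypothetical helper: turn page text into a unique list of candidate words.
def checkable_words(text)
  text.split(/\s+/)
      .reject { |token| token.match?(%r{://|@}) }                        # drop URLs and email addresses
      .map { |token| token.gsub(/\A[^[:alpha:]]+|[^[:alpha:]]+\z/, '') } # trim non-letters from both ends
      .reject { |token| token.empty? || token.match?(/\d/) }             # drop blanks and tokens containing digits
      .uniq                                                              # check each word only once
end
```

The returned array could be passed to speller.list_misspelled in place of words. Note that de-duplicating means each mistake is reported once per page, however often it occurs.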
Your comments (subscribe)
paige Moore 24 Oct 09 at 8:47pm
What you did is really amazing, and I admire you for having concern for your employer. Thanks for this; I will try to download it too.