Code Snippet: Ruby Image Scraper

Posted by Ryan Baxter Tue, 14 Aug 2007 03:46:00 GMT

I stumbled upon a screen scraping library for Ruby last week called scrAPI. It’s extremely flexible and can be seen in action on the co.mments blog post scraper. The scrAPI library can be installed by issuing the following command from your console:

gem install scrapi

Testing scrAPI was fairly easy once I figured out how to define a scraper. With that out of the way, I wrote a small script that saves the images found at a URL provided by the user. Like any scraper, it can be used for good or evil; which one is up to you.

#!/usr/bin/ruby

require 'fileutils'
require 'open-uri'
require 'pathname'
require 'rubygems'
require 'scrapi'

# Get the URL input.
puts 'Enter a URL:'
url = gets.chomp

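# Get the HTML source.
# Note: on Ruby 3.0 and later, open-uri no longer patches Kernel#open,
# so use URI.open for URLs here and below.
html = nil
open(url) {|source| html = source.read()}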

# Define the scraper.
scraper = Scraper.define do
  array :images
  process "img", :images => "@src"
  result :images
end

# Scrape the HTML for images.
images = scraper.scrape(html)

# Create a directory to save the images in.
directory = url.gsub(/http:\/\//, '')
FileUtils.mkdir directory

images.each do |image_path|
  # Resolve image_path against the page URL so that relative paths,
  # root-relative paths, and full URLs all end up as a full URL.
  image_url = URI.join(url, image_path).to_s

  # Write the image to disk.
  open(image_url) do |source|
    file_name = image_url.split('/').last
    open(directory + '/' + file_name, 'wb') {|file| file.write(source.read())}
  end
end

puts 'Finished...'
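
An img src can be a full URL, a path from the site root, or a path relative to the page, so it has to be resolved against the page URL before the image can be downloaded. Here is a minimal sketch of that one step using URI.join from Ruby's standard library. The helper name resolve_image_url and the example.com URLs are made up for illustration and are not part of scrAPI.

```ruby
require 'uri'

# Hypothetical helper (not part of scrAPI): resolve an <img> src value
# against the URL of the page it was found on.
def resolve_image_url(page_url, src)
  URI.join(page_url, src).to_s
end

# Example page URL, assumed for illustration only.
page = 'http://example.com/gallery/index.html'

puts resolve_image_url(page, 'photo.png')                    # relative to the page
puts resolve_image_url(page, '/images/photo.png')            # relative to the site root
puts resolve_image_url(page, 'http://cdn.example.com/a.png') # already a full URL
```

URI.join raises an error for src values that are not valid URIs (for example, ones containing unescaped spaces), so a real script would probably want to rescue that and skip the image.
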
Comments

Bill, 7 months later:
Wow. This is friggin awesome. Worked almost entirely on my first try. Many thanks!!!

Ryan Baxter, 7 months later:
That's great! I've gotta get back to writing more Ruby soon.

http://orlandodefrias.com, about 1 year later:
Great script. However, if you try scraping from a directory, you'll get an error on mkdir. I replaced:

directory = url.gsub(/http:\/\//, '')

with:

directory = url.gsub(/http:\/\//, '').gsub('/', '_')

That's one way of fixing it.

Ryan Baxter, about 1 year later:
Thanks for the tip Orlando! I've since written another scraper using the Hpricot HTML parser. It's a bit faster and has a nifty progress bar.
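
For anyone who wants to try the directory fix from the comments in isolation, it could be wrapped up roughly like this. This is only a sketch: the helper name directory_name_for is made up, and stripping https:// as well as http:// is an addition that neither the original script nor the comment makes.

```ruby
# Hypothetical helper: turn a URL into a single, flat directory name
# by dropping the scheme and replacing every slash with an underscore.
def directory_name_for(url)
  url.sub(%r{\Ahttps?://}, '').gsub('/', '_')
end

# Example URL, assumed for illustration only.
puts directory_name_for('http://example.com/photos/2007')
```

Because the result contains no slashes, FileUtils.mkdir only ever has to create one directory, which avoids the error on URLs that include a path.
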