Another Ruby Image Scraper

Posted by Ryan Baxter Thu, 08 Jan 2009 00:22:00 GMT

I’ve been poring over a lot of vintage Willys pictures since starting the restoration of my ’58 CJ-5, and anyone who has worked with me knows that I tend to obsess over detail. The scarcity of quality images has been driving me crazy, and I’m amazed at how much contradictory information I’ve found about a vehicle that is only 50 years old. Given my career in technology, I’m always surprised when a Google search returns little or nothing of value.

My hard drive is steadily filling with what I have found, and the old “Right-click, Save Image As…” has become tedious. Late last night I remembered a little image-scraping script I wrote back in August of 2007. I’ve since cleaned it up, added a nifty progress bar, and replaced scrAPI with the Hpricot HTML parser. Neat!

I plan on doing some web crawling with it soon. Stay tuned for that. Without further ado:

# RB

require 'rubygems'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'progressbar'

attributes = ['href', 'src']
file_extensions = ['jpg', 'jpeg', 'gif', 'png', 'tiff']

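# Return the lowercased file extension of a URL, ignoring any
# query string or fragment, so 'Jeep.JPG?size=large' yields 'jpg'.
def fetch_extension(url)
  File.extname(url.sub(/[?#].*\z/, '')).delete('.').downcase
end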

def fetch_file(uri)
  progress_bar = nil 
  open(uri, :proxy => nil,
    :content_length_proc => lambda { |length|
      if length && 0 < length
        progress_bar = ProgressBar.new(uri.to_s, length)
      end 
    },
    :progress_proc => lambda { |progress|
      progress_bar.set(progress) if progress_bar
    }) {|file| return file.read}        
end

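def save_file(file_uri)
  # Build a flat local filename by replacing '/' and ':' with '_'.
  # gsub is used rather than gsub!, which returns nil when nothing changes.
  open(file_uri.to_s.gsub(/[\/:]/, '_'), 'wb') { |file|
    file.write(fetch_file(file_uri))
    puts # start a new line after the progress bar
  }
end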

def scrape_urls(html, attributes)
  Hpricot.buffer_size = 262144
  doc = Hpricot(html) # parse the page once, not once per attribute
  attributes.each { |attribute|
    doc.search("[@#{attribute}]").each { |tag|
      yield tag[attribute]
    }
  }
end

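def to_absolute_uri(original_uri, url)
  # Do not downcase the whole URL: paths are case-sensitive on many servers.
  # URI#normalize already lowercases the scheme and host.
  url = URI.parse(url)
  url = original_uri + url if url.relative?
  url.normalize
end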

puts 'Enter a URL:'
original_uri = URI.parse(gets.chomp)

html = nil

begin
  open(original_uri, :proxy => nil) {|source| html = source.read()}

  scrape_urls(html, attributes) { |url|
    if file_extensions.include?(fetch_extension(url)) then
      save_file(to_absolute_uri(original_uri, url))
    end
  }
rescue => e
  puts e
end
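
If you want to poke at the URL handling without installing Hpricot or hitting the network, here is a minimal standalone sketch that uses only Ruby's standard uri library. The helper names (image_extension, image_url?, absolute_uri, local_filename), the IMAGE_EXTENSIONS constant, and the example.com URLs are all made up for illustration and are not part of the script above. It also leans on File.extname and URI.join instead of the split-based approach, so treat it as one possible variation rather than the way the script works.

```ruby
require 'uri'

# Hypothetical constant mirroring the extension list in the script above.
IMAGE_EXTENSIONS = %w[jpg jpeg gif png tiff].freeze

# Lowercased extension of the URL's path; query strings and fragments are ignored.
def image_extension(url)
  path = URI.parse(url).path.to_s
  File.extname(path).delete_prefix('.').downcase
end

# True when the URL appears to point at an image, judged by extension alone.
def image_url?(url)
  IMAGE_EXTENSIONS.include?(image_extension(url))
end

# Resolve a possibly relative URL against the page it was found on.
# The path's case is preserved; normalize only lowercases scheme and host.
def absolute_uri(page_url, url)
  URI.join(page_url, url).normalize
end

# Flatten a URI into a filename by replacing '/' and ':' with '_'.
def local_filename(uri)
  uri.to_s.gsub(/[\/:]/, '_')
end

page = 'http://example.com/gallery/index.html' # made-up page URL
['photos/Jeep.JPG?size=large', '/about.html'].each do |link|
  next unless image_url?(link)
  uri = absolute_uri(page, link)
  puts "#{uri} -> #{local_filename(uri)}"
end
```

Only the first link passes the extension check, so the loop prints a single line mapping the absolute image URL to its flattened filename. Malformed URLs (ones containing spaces, for example) make URI.parse raise URI::InvalidURIError, so a real crawler would want to rescue that per link.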
