Another Ruby Image Scraper

Posted by Ryan Baxter Thu, 08 Jan 2009 00:22:00 GMT

I’ve been poring over a lot of vintage Willys pictures since starting the restoration of my ’58 CJ-5, and anyone who has worked with me knows that I tend to obsess over detail. The scarcity of quality images has been driving me crazy, and I’m amazed at how much contradictory information I’ve found about a vehicle that is only 50 years old. Given my career in technology, I’m always surprised when a Google search returns little or nothing of value.

My hard drive is steadily filling with what I have found, and the old “Right-click, Save Image As…” routine has become tedious. Late last night I remembered a little image-scraping script I wrote back in August of 2007. I’ve since cleaned it up, added a nifty progress bar, and replaced scrAPI with the Hpricot HTML parser. Neat!

I plan on doing some web crawling with it soon. Stay tuned for that. Without further ado:

# RB

require 'rubygems'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'progressbar'

attributes = ['href', 'src']
file_extensions = ['jpg', 'jpeg', 'gif', 'png', 'tiff']

# Returns the lowercased text after the last dot, e.g. 'jpg' for 'photo.JPG'.
def fetch_extension(url)
  url.split('.').last.to_s.downcase
end

# Downloads the resource at uri and returns its body, drawing a progress bar
# when the server reports a Content-Length.
def fetch_file(uri)
  progress_bar = nil
  open(uri, :proxy => nil,
    :content_length_proc => lambda { |length|
      if length && length > 0
        progress_bar = ProgressBar.new(uri.to_s, length)
      end
    },
    :progress_proc => lambda { |progress|
      progress_bar.set(progress) if progress_bar
    }) { |file| file.read }
end

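# Saves the file under a name derived from its URI, with '/' and ':' replaced by '_'.
def save_file(file_uri)
  # gsub rather than gsub!, which returns nil when nothing is replaced
  filename = file_uri.to_s.gsub(/[\/:]/, '_')
  data = fetch_file(file_uri) # download first so a failure leaves no empty file
  open(filename, 'wb') { |file| file.write(data) }
  puts # finish the progress bar line
end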

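# Yields the value of each listed attribute found anywhere in the document.
def scrape_urls(html, attributes)
  Hpricot.buffer_size = 262144
  doc = Hpricot(html) # parse once rather than once per attribute
  attributes.each { |attribute|
    doc.search("[@#{attribute}]").each { |tag|
      yield tag[attribute]
    }
  }
end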

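# Resolves url against original_uri. The URL is not downcased because paths
# are case-sensitive on many servers; normalize lowercases the scheme and host.
def to_absolute_uri(original_uri, url)
  url = URI.parse(url)
  url = original_uri + url if url.relative?
  url.normalize
end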

puts 'Enter a URL:'
original_uri = URI.parse(gets.chomp) # chomp! returns nil when there is no newline to strip

begin
  html = open(original_uri, :proxy => nil) { |source| source.read }

  scrape_urls(html, attributes) { |url|
    if file_extensions.include?(fetch_extension(url)) then
      save_file(to_absolute_uri(original_uri, url))
    end
  }
rescue => e
  puts e
end
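
One thing the extension check in the script doesn’t handle is a URL that carries a query string or fragment, such as photo.jpg?size=large. Here is a minimal sketch of a stricter check that uses only Ruby’s standard library. The names image_extension and image_url? are hypothetical helpers, not part of the script above, and the extension list is assumed to be the same one the script uses.

```ruby
require 'uri'

# Hypothetical helper: returns the lowercased extension of the URL's path,
# ignoring any query string or fragment. Returns nil when there is no
# extension or the URL cannot be parsed.
def image_extension(url)
  path = URI.parse(url).path.to_s
  ext = File.extname(path).delete('.').downcase
  ext.empty? ? nil : ext
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: true when the URL's path ends in one of the extensions.
# The default list mirrors the one assumed from the script.
def image_url?(url, extensions = %w[jpg jpeg gif png tiff])
  extensions.include?(image_extension(url))
end
```

In the main loop, image_url?(url) could stand in for the file_extensions.include?(fetch_extension(url)) test, though I haven’t tried it against Hpricot’s output for every odd link a real page might contain.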

The CJ-5

Posted by Ryan Baxter Mon, 27 Oct 2008 17:12:00 GMT

A couple of weeks ago I became the owner of a 1958 Willys CJ-5. I’ve always wanted a Jeep but only started seriously looking for one about a month ago. Being the web-savvy guy that I am, I started my search with eBay and craigslist. Not having any luck online, I contacted a friend and fellow Jeepster for advice. Apparently I should have started my search a little closer to home. Sitting in a back lot of Triple A Motors in Williamsport, Pennsylvania, was the CJ-5.


I’ve begun busting my knuckles, and for a web developer it’s a very welcome diversion. There is something extremely gratifying in wrenching on a vehicle and hearing its engine roar to life for the first time. Well, sputter and die in my case, but it did run briefly. I’ve already replaced the distributor cap and rotor, plugs, wires, and fuel pump. Hopefully with some new vacuum lines she’ll be ready for a proper test drive.

Since this will be an ongoing project, I’ll have more pictures as progress is made.