Another Ruby Image Scraper
Posted by Ryan Baxter Thu, 08 Jan 2009 00:22:00 GMT
I’ve been poring over a lot of vintage Willys pictures since starting the restoration of my ’58 CJ-5, and anyone who has worked with me knows that I tend to obsess over detail. The scarcity of quality images has been driving me crazy, and I’m amazed at how much contradictory information I’ve found about a vehicle that is only 50 years old. Given my career in technology, I’m always surprised when a Google search returns little or nothing of value.
My hard drive is steadily filling with what I have found and the old “Right-click, Save Image As…” has become tedious. Late last night I remembered a little image scraping script I wrote back in August of 2007. I’ve since cleaned it up, added a nifty progress bar, and replaced scrAPI with the Hpricot HTML parser. Neat!
I plan on doing some web crawling with it soon. Stay tuned for that. Without further ado:
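require 'rubygems'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'progressbar'

attributes = ['href', 'src']
file_extensions = ['jpg', 'jpeg', 'gif', 'png', 'tiff']

# Returns the lowercased text after the last dot, so 'photo.JPG' yields 'jpg'.
def fetch_extension(url)
  url.split('.').last.to_s.downcase
end

# Downloads the file at uri and returns its contents, drawing a progress
# bar when the server reports a content length.
# On Ruby 3.0 and later, open-uri requires URI.open here instead of open.
def fetch_file(uri)
  progress_bar = nil
  open(uri, :proxy => nil,
    :content_length_proc => lambda { |length|
      if length && 0 < length
        progress_bar = ProgressBar.new(uri.to_s, length)
      end
    },
    :progress_proc => lambda { |progress|
      progress_bar.set(progress) if progress_bar
    }) { |file| return file.read }
end

# Saves the file under a name built from its URI, with '/' and ':' replaced by '_'.
def save_file(file_uri)
  # gsub rather than gsub!, which returns nil when nothing is replaced
  File.open(file_uri.to_s.gsub(/[\/:]/, '_'), 'wb') { |file|
    file.write(fetch_file(file_uri)); puts
  }
end

# Parses the HTML once and yields the value of each requested attribute.
def scrape_urls(html, attributes)
  Hpricot.buffer_size = 262144
  document = Hpricot(html)
  attributes.each { |attribute|
    document.search("[@#{attribute}]").each { |tag|
      yield tag[attribute]
    }
  }
end

# Resolves a relative URL against the page URI. The URL is not lowercased
# because paths can be case sensitive; normalize handles scheme and host.
def to_absolute_uri(original_uri, url)
  url = URI.parse(url)
  url = original_uri + url if url.relative?
  url.normalize
end

puts 'Enter a URL:'
# chomp rather than chomp!, which returns nil when there is nothing to strip
original_uri = URI.parse(gets.chomp)
begin
  html = open(original_uri, :proxy => nil) { |source| source.read }
  scrape_urls(html, attributes) { |url|
    if file_extensions.include?(fetch_extension(url))
      save_file(to_absolute_uri(original_uri, url))
    end
  }
rescue => e
  puts e
end
- Posted in Code Snippets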
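The URL handling in the scraper above (checking the extension, resolving relative links, and flattening the result into a file name) can be tried on its own without Hpricot or a network connection. What follows is only a rough sketch using Ruby's standard uri library: the helper names image_url?, absolute_uri, and local_name, and the example.com links, are made up for illustration and are not part of the script.

```ruby
require 'uri'

IMAGE_EXTENSIONS = %w[jpg jpeg gif png tiff]

# Hypothetical helper: true when the URL's path ends in a known image
# extension. Looking only at the path ignores any query string.
def image_url?(url)
  path = URI.parse(url).path.to_s
  IMAGE_EXTENSIONS.include?(File.extname(path).delete('.').downcase)
end

# Hypothetical helper: resolves a possibly relative link against the page URI.
def absolute_uri(page_uri, url)
  URI.join(page_uri.to_s, url).normalize
end

# Hypothetical helper: flattens a URI into a single file name by
# replacing '/' and ':' with '_'.
def local_name(uri)
  uri.to_s.gsub(/[\/:]/, '_')
end

# Made-up page and links, standing in for what a real page would contain.
page = URI.parse('http://example.com/gallery/index.html')
%w[photos/CJ5.JPG /img/grille.png about.html].each do |link|
  next unless image_url?(link)
  puts local_name(absolute_uri(page, link))
end
```

Links that are not valid URIs (ones containing spaces, for example) make URI.parse raise, so a real crawler would want a rescue around these calls.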
Fork Ruby?
Posted by Ryan Baxter Mon, 29 Dec 2008 15:49:00 GMT
Dave Thomas speaking at the 2008 RubyConf.
Ruby Fractal Library and RMagick?
Posted by Ryan Baxter Tue, 11 Nov 2008 20:06:00 GMT
Sure, why not! Thanks to Brian Jones, the Ruby Fractal Library now renders fractals with RMagick. Images can be saved in formats including PNG, GIF, JPEG, and many others. I was able to create an animated GIF displaying the magnification of the Feigenbaum point in fewer than 15 lines of code.
feigenbaum = Mandelbrot.new(Complex(-0.1528, 1.0397), 2, 100)
feigenbaum.width = 150
feigenbaum.height = 150
feigenbaum.m = 25
feigenbaum.algorithm = Algorithms::NormalizedIterationCount
feigenbaum.theme = Themes::Water
image_list = Magick::ImageList.new
image_list.delay = 100
image_list.iterations = 1
(0...50).each { |i|
  feigenbaum.m += (i**2)
  image_list << Magick::Image.from_blob(feigenbaum.draw('gif'))[0]
}
image_list.write('feigenbaum.gif')
The recent work introduces a couple of breaking changes. Now that RMagick is used, ImageMagick must be installed as a dependency. I was against this at first, but RMagick’s flexibility ultimately won me over (I couldn’t resist the lure of animated fractals). The Fractal.draw method now returns an image as a BLOB, but a save_as method has been provided to handle the writing of images to disk. The curious can find the latest source code in the repo’s trunk. Stay tuned for more updates.
svn checkout -r 10 http://svn.ryanbaxter.net/fractals/trunk fractals
As always, I’ll accept any comments, suggestions, or source code. Thanks again to Brian for the RMagick submission.
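For anyone wondering what the split between a BLOB-returning draw and a save_as convenience method looks like in practice, here is a rough, dependency-free sketch. It is not the library's code: TinyMandelbrot, its constructor arguments, and the plain PGM output are all assumptions made for illustration, whereas the real library renders through RMagick.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a fractal class; not the Ruby Fractal Library.
class TinyMandelbrot
  def initialize(width, height, max_iterations)
    @width = width
    @height = height
    @max_iterations = max_iterations
  end

  # Returns the rendered image as a binary string (the BLOB), here in the
  # simple grayscale PGM format so no image library is needed.
  def draw
    pixels = []
    @height.times do |y|
      @width.times do |x|
        # Map the pixel onto the region [-2, 1] x [-1, 1] of the complex plane.
        c = Complex(-2.0 + 3.0 * x / @width, -1.0 + 2.0 * y / @height)
        pixels << 255 - (255 * escape_time(c) / @max_iterations)
      end
    end
    "P5\n#{@width} #{@height}\n255\n".b + pixels.pack('C*')
  end

  # Writes the BLOB to disk in binary mode.
  def save_as(path)
    File.open(path, 'wb') { |file| file.write(draw) }
  end

  private

  # Counts iterations of z = z * z + c before |z| exceeds 2.
  def escape_time(c)
    z = Complex(0, 0)
    @max_iterations.times do |i|
      return i if z.abs > 2
      z = z * z + c
    end
    @max_iterations
  end
end

path = File.join(Dir.tmpdir, 'mandelbrot.pgm')
TinyMandelbrot.new(60, 40, 50).save_as(path)
puts "Wrote #{File.size(path)} bytes to #{path}"
```

Keeping draw free of file I/O is what lets a caller hand the BLOB straight to something else, as the animated GIF example does with Magick::Image.from_blob, while save_as covers the common case of writing to disk.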
- Posted in Code Snippets

