<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>crunchlife: Code Snippet: Ruby Image Scraper</title>
    <link>http://crunchlife.com/articles/2007/08/13/code-snippet-ruby-image-scraper</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description></description>
    <item>
      <title>Code Snippet: Ruby Image Scraper</title>
      <description>&lt;p&gt;Last week I stumbled upon &lt;a href="http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/" target="_blank"&gt;scrAPI&lt;/a&gt;, a &lt;a href="http://en.wikipedia.org/wiki/Screen_scraping" target="_blank"&gt;screen scraping&lt;/a&gt; library for Ruby. It&amp;#8217;s extremely flexible and can be seen in action in the &lt;a href="http://co.mments.com/" target="_blank"&gt;co.mments&lt;/a&gt; blog post scraper. The scrAPI library can be installed by running the following command from your console:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_default "&gt;gem install scrapi&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Testing scrAPI was fairly easy once I figured out how to define a scraper. With that out of the way, I wrote a small script that prompts for a URL and saves the images found on that page to a local directory. Like any scraper, scrAPI can be used for good or evil; only you can decide.&lt;/p&gt;
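
&lt;p&gt;One step worth isolating before the full script is turning an &lt;code&gt;img&lt;/code&gt; tag&amp;#8217;s &lt;code&gt;src&lt;/code&gt; value into a URL that can actually be fetched, since it may be absolute, root-relative, or relative to the page. The following is only a minimal sketch, assuming Ruby&amp;#8217;s standard &lt;code&gt;uri&lt;/code&gt; library; the helper name &lt;code&gt;resolve_image_url&lt;/code&gt; and the example.com addresses are hypothetical and are not part of scrAPI:&lt;/p&gt;

```ruby
require 'uri'

# Hypothetical helper (not part of scrAPI): resolve an img src value
# against the URL of the page it was found on.
def resolve_image_url(page_url, src)
  URI.join(page_url, src).to_s
end

# Hypothetical page address used only for illustration.
page = 'http://example.com/gallery/index.html'

puts resolve_image_url(page, 'http://cdn.example.com/a.png') # already absolute, returned unchanged
puts resolve_image_url(page, '/images/b.png')                # root-relative, resolved against the host
puts resolve_image_url(page, 'thumbs/c.png')                 # page-relative, resolved against the page directory
```

&lt;p&gt;Run on its own, this should print one fully qualified URL per line. The script below works with the same kind of src values, pulled out of the page by the scraper.&lt;/p&gt;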

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="comment"&gt;#!/usr/bin/ruby&lt;/span&gt;

&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;fileutils&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;open-uri&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;pathname&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;scrapi&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="comment"&gt;# Get the URL input.&lt;/span&gt;
&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;Enter a URL:&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;url&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;chomp&lt;/span&gt;

&lt;span class="comment"&gt;# Get the HTML source.&lt;/span&gt;
&lt;span class="ident"&gt;html&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;
&lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;source&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; &lt;span class="ident"&gt;html&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;source&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;()}&lt;/span&gt;

&lt;span class="comment"&gt;# Define the scraper.&lt;/span&gt;
&lt;span class="ident"&gt;scraper&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;Scraper&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;define&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt;
  &lt;span class="ident"&gt;array&lt;/span&gt; &lt;span class="symbol"&gt;:images&lt;/span&gt;
  &lt;span class="ident"&gt;process&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;img&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;,&lt;/span&gt; &lt;span class="symbol"&gt;:images&lt;/span&gt; &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;@src&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="ident"&gt;result&lt;/span&gt; &lt;span class="symbol"&gt;:images&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="comment"&gt;# Scrape the HTML for images.&lt;/span&gt;
&lt;span class="ident"&gt;images&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;scraper&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;scrape&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;html&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

&lt;span class="comment"&gt;# Create a directory named after the URL to save the images in (mkdir_p copes with nested paths and existing directories).&lt;/span&gt;
&lt;span class="ident"&gt;directory&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;https?:&lt;span class="escape"&gt;\/\/&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;/,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt;
&lt;span class="constant"&gt;FileUtils&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;mkdir_p&lt;/span&gt; &lt;span class="ident"&gt;directory&lt;/span&gt;

&lt;span class="ident"&gt;images&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;image_path&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
  &lt;span class="comment"&gt;# Resolve image_path against the page URL so absolute, root-relative, and page-relative paths all work.&lt;/span&gt;
  &lt;span class="ident"&gt;image_path&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;URI&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;image_path&lt;/span&gt;&lt;span class="punct"&gt;).&lt;/span&gt;&lt;span class="ident"&gt;to_s&lt;/span&gt;

  &lt;span class="comment"&gt;# Write the image to disk.&lt;/span&gt;
  &lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;image_path&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;source&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;file_name&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;image_path&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;split&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;/&lt;/span&gt;&lt;span class="punct"&gt;').&lt;/span&gt;&lt;span class="ident"&gt;last&lt;/span&gt;
    &lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;directory&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;/&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="ident"&gt;file_name&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;wb&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; &lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;write&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;source&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;())}&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;Finished...&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
      <pubDate>Mon, 13 Aug 2007 20:46:00 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:f2568aba-5b0b-4ca4-b4c7-f1f47c11d88d</guid>
      <author>Ryan Baxter</author>
      <link>http://crunchlife.com/articles/2007/08/13/code-snippet-ruby-image-scraper</link>
      <category>Code Snippets</category>
      <category>Ruby</category>
    </item>
    <item>
      <title>"Code Snippet: Ruby Image Scraper" by Ryan Baxter</title>
      <description>That's great! I've gotta get back to writing more Ruby soon.</description>
      <pubDate>Tue, 25 Mar 2008 20:08:54 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:96814c69-ce13-43f2-8b5c-900a5636dd9e</guid>
      <link>http://crunchlife.com/articles/2007/08/13/code-snippet-ruby-image-scraper#comment-3951</link>
    </item>
    <item>
      <title>"Code Snippet: Ruby Image Scraper" by Bill</title>
      <description>Wow.  This is friggin awesome.  Worked almost entirely on my first try.   

Many thanks!!!</description>
      <pubDate>Tue, 25 Mar 2008 19:49:46 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:d51bb7fa-5503-4fde-8633-e3a414adf851</guid>
      <link>http://crunchlife.com/articles/2007/08/13/code-snippet-ruby-image-scraper#comment-3949</link>
    </item>
  </channel>
</rss>
