matthew ephraim

A Quick Little Extension to the Spreadsheet Gem

The problem

One of the most tedious tasks I have to complete at work is importing excel spreadsheets into a database. Often, a client will give me a spreadsheet of data that needs to be imported into an existing database. And, if I’m really lucky, the data needs to be reformatted before it can be imported. I tried a few different methods of accomplishing this task before I settled on using the Ruby Spreadsheet gem to import and format the spreadsheet data. I then use the Ruby DBI library to import the data into a database.

Spreadsheet works great for reading data from a spreadsheet file, but one thing that always annoyed me was that I needed to know the index of a column before I could get the value from it. So, for example, if a spreadsheet stored the value for First Name in the 3rd column, I would need to know that the 3rd slot in each row array represented the the first name value. I thought it would be really nice if I could access that value for a row by saying something like row[:first_name]. I had a little bit of extra time over the holidays, so I decided to see if it would be possible to make this functionality happen.

My solution

The first thing I did was create my own row class for the Spreadsheet library. I decided that I would create a new class called HashRow that would allow you to access the values of each row by using symbols that represented each column header or by using the original method of accessing row values with an index number. For simplicity, I assumed that the first row of the spreadsheet was the header row. I also simplified values for each column header by stripping out any characters that couldn’t be represented by a symbol. So, a header value of First Name becomes :first_name.

I also added a few convenience methods to HashRow. The header? method returns true if the row is a header row and the empty? method returns true if the row is completely empty.

Ruby
# Wraps Spreadsheet::Excel::Row row array with extra functionality
class Spreadsheet::HashRow < Spreadsheet::Excel::Row
	attr_reader :index
	
	# Keeps the original row value array 
	# and also creates a hash of values 
	def initialize(row, col_hash, index)
		@val_array = row
		@val_hash  = get_val_hash(col_hash)
		@index 	   = index
	end
	
	# Is this row the first row in the spreadsheet?
	def header?
		@index === 0
	end
	
	# Checks if every cell in the row is set to nil
	def empty?
		@val_array.compact.length === 0
	end
			
	# Returns the value in the row based on the index 
	# or key passed in. Integer values returns the row value 
	# by index in the array and symbols return the value 
	# for the symbol or string
	def [](value)
		if value.is_a? Integer
			@val_array[value]
		else
			@val_hash[value.to_s.downcase]
		end
	end
	
	private 
	
	# Uses a hash columns to build another hash for the 
	# values in the array with keys for the column heads
	def get_val_hash(col_hash)
		col_hash.keys.inject({}) do |acc, key|
			acc.merge(key => @val_array[col_hash[key]])
		end
	end
end

Once I had my HashRow class, I needed to open the Spreadsheet::Excel::Worksheet class so that I could replace the row method with a new method that returned an instance of the HashRow class. I aliased the old row method and used it inside my new row method to pass in a reference to the original row array. I also added a private method to help determine the index of each of column and a private method to format the column names.

Ruby
# Extends Spreadsheet::Excel::Worksheet so that the Rows become HashRows
class Spreadsheet::Excel::Worksheet		
	# Override the original row method with a new method 
	# that returns the custom HashRow class instead of an array
	alias_method :old_row, :row
	def row(value)
	    Spreadsheet::HashRow.new(old_row(value), get_col_indexes, value)
	end
		
	private
		
	# Returns a hash that contains key/value pairs for the column 
	# headers and the the index of each header
	def get_col_indexes
	    @col_indexes ||= old_row(0).inject({}) do |hash, cell|
	        hash.merge(get_col_key(cell.to_s) => hash.length)
	    end
	end
			
	# Converts the name of a column header to a 
        # specially formatted string
	def get_col_key(col)
	    col.gsub(/[\(\)]+/, "").
	         gsub(/\s/, "_").
	         downcase
	end
end

Once I included my new spreadsheet extension file I could use the Spreadsheet library similar to the way I had used it before, only now I had access to the row values by the row and index and by the column header name.

Ruby
Spreadsheet.open(FILE).worksheet(0).each do |row|
	unless row.empty? || row.header?
		puts row[:first_name]
		puts row[:last_name]
	end
end