MODEL: Read in utf-8, only parse CSV once

Ran into an `Encoding::CompatibilityError` while consuming my corpus (tweets.csv) on Windows 7; this likely affects other environments as well. Fix: force the corpus file contents to be read as UTF-8. Also a quick clean-up of the CSV flow so the content is parsed once from the string already in memory, instead of reading and parsing the file twice.
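
For illustration, the read-once, parse-once pattern described above can be sketched in isolation. This is a minimal standalone sketch, not the actual `Ebooks::Model#consume` code: `read_text_column` is a hypothetical helper name, and it assumes the CSV has a header row that includes a `text` column.

```ruby
require 'csv'
require 'digest/md5'

# Hypothetical helper (not part of twitter_ebooks): read the file a single
# time as UTF-8, hash that raw content, then parse the CSV from the string
# already in memory rather than going back to disk.
def read_text_column(path)
  # Passing an explicit encoding avoids depending on the platform default
  # external encoding.
  content = File.read(path, :encoding => 'utf-8')
  digest = Digest::MD5.hexdigest(content)

  rows = CSV.parse(content)   # one parse of the whole file
  header = rows.shift         # first row is the header; the rest is data
  text_col = header.index('text')
  raise ArgumentError, "no 'text' column in #{path}" if text_col.nil?

  [digest, rows.map { |row| row[text_col] }]
end
```

Calling `digest, lines = read_text_column('tweets.csv')` would return the MD5 of the raw content and the values of the `text` column, with the file read from disk only once.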
2026-02-05 12:05:12 -05:00 · 2014-06-27 18:42:51 -04:00
parent 4b88d3326b
commit be6ac9127f
1 changed file with 4 additions and 3 deletions
--- a/lib/twitter_ebooks/model.rb
+++ b/lib/twitter_ebooks/model.rb
@@ -19,7 +19,7 @@ module Ebooks
    end

    def consume(path)
-      content = File.read(path)
+      content = File.read(path, :encoding => 'utf-8')
      @hash = Digest::MD5.hexdigest(content)

      if path.split('.')[-1] == "json"
@@ -29,9 +29,10 @@ module Ebooks
        end
      elsif path.split('.')[-1] == "csv"
        log "Reading CSV corpus from #{path}"
-        header = CSV.read(path).first
+        content = CSV.parse(content)
+        header = content.shift
        text_col = header.index('text')
-        lines = CSV.read(path).drop(1).map do |tweet|
+        lines = content.map do |tweet|
          tweet[text_col]
        end
      else