mirror of
https://github.com/thewesker/twitter_ebooks.git
synced 2025-12-21 12:51:15 -05:00
MODEL: Read in utf-8, only parse CSV once
Ran into `Encoding::CompatibilityError` issue trying to consume my corpus (tweets.csv) on Windows 7, but this likely affects other environments as well. Fix: force reading corpus file contents as utf-8. Also a quick clean-up of the CSV flow to only parse the content once instead of double-dipping.
This commit is contained in:
@@ -19,7 +19,7 @@ module Ebooks
|
||||
end
|
||||
|
||||
def consume(path)
|
||||
content = File.read(path)
|
||||
content = File.read(path, :encoding => 'utf-8')
|
||||
@hash = Digest::MD5.hexdigest(content)
|
||||
|
||||
if path.split('.')[-1] == "json"
|
||||
@@ -29,9 +29,10 @@ module Ebooks
|
||||
end
|
||||
elsif path.split('.')[-1] == "csv"
|
||||
log "Reading CSV corpus from #{path}"
|
||||
header = CSV.read(path).first
|
||||
content = CSV.parse(content)
|
||||
header = content.shift
|
||||
text_col = header.index('text')
|
||||
lines = CSV.read(path).drop(1).map do |tweet|
|
||||
lines = content.map do |tweet|
|
||||
tweet[text_col]
|
||||
end
|
||||
else
|
||||
|
||||
Reference in New Issue
Block a user