mirror of
https://github.com/thewesker/twitter_ebooks.git
synced 2025-12-21 12:51:15 -05:00
MODEL: Read in utf-8, only parse CSV once
Ran into an `Encoding::CompatibilityError` while consuming my corpus (tweets.csv) on Windows 7; this likely affects other environments as well. Fix: force the corpus file contents to be read as UTF-8. Also a quick clean-up of the CSV flow so the content is parsed only once instead of double-dipping.
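A minimal sketch of the encoding half of this change, for anyone who wants to try it outside the gem. The helper names `read_corpus` and `write_sample_corpus` are made up for illustration and are not part of twitter_ebooks; only the `File.read(path, :encoding => 'utf-8')` call comes from the commit itself.

```ruby
require 'tempfile'

# Hypothetical helper (not part of twitter_ebooks): read a corpus file as
# UTF-8 instead of relying on the platform's default external encoding.
def read_corpus(path)
  File.read(path, :encoding => 'utf-8')
end

# Hypothetical helper: write a tiny sample corpus with non-ASCII characters.
# Returns the Tempfile so the caller controls when it is deleted.
def write_sample_corpus
  file = Tempfile.new(['tweets', '.csv'])
  file.close
  # binwrite stores the UTF-8 bytes as-is, with no transcoding.
  File.binwrite(file.path, "text\ncaf\u00E9 na\u00EFve\n")
  file
end

sample = write_sample_corpus
content = read_corpus(sample.path)
puts content.encoding         # UTF-8
puts content.valid_encoding?  # true
sample.unlink
```

Without the explicit option, the string is tagged with `Encoding.default_external`, which varies by platform and locale. That would explain why the error showed up on one machine and not another.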
@@ -19,7 +19,7 @@ module Ebooks
     end
 
     def consume(path)
-      content = File.read(path)
+      content = File.read(path, :encoding => 'utf-8')
       @hash = Digest::MD5.hexdigest(content)
 
       if path.split('.')[-1] == "json"
@@ -29,9 +29,10 @@ module Ebooks
         end
       elsif path.split('.')[-1] == "csv"
         log "Reading CSV corpus from #{path}"
-        header = CSV.read(path).first
+        content = CSV.parse(content)
+        header = content.shift
         text_col = header.index('text')
-        lines = CSV.read(path).drop(1).map do |tweet|
+        lines = content.map do |tweet|
           tweet[text_col]
         end
       else
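The CSV half of the change, sketched as a standalone method. `tweet_texts` is a hypothetical name used only here; the body follows the new side of the second hunk (`CSV.parse` on the string already read, `shift` to peel off the header, then `map` over the remaining rows). It assumes the corpus has a header row containing a `text` column, as the original code does.

```ruby
require 'csv'

# Hypothetical helper (not part of twitter_ebooks): pull the 'text' column
# out of CSV content that is already in memory as a string.
def tweet_texts(content)
  rows = CSV.parse(content)        # parse the whole document exactly once
  header = rows.shift              # remove and keep the header row
  text_col = header.index('text')  # position of the 'text' column
  rows.map { |tweet| tweet[text_col] }
end

sample = "tweet_id,text\n1,hello world\n2,\"commas, inside quotes\"\n"
p tweet_texts(sample)  # ["hello world", "commas, inside quotes"]
```

Because the content is parsed from the string returned by `File.read`, the CSV rows inherit the UTF-8 encoding chosen there, and the file is no longer opened two extra times by `CSV.read`.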