Up until 2016, Excel for the Mac provided a handy CSV export format called “Windows CSV”, which used iso-8859-1/Windows-1252 encoding. It was reliable, handled simple extended characters like degree signs, and could be read in Ruby with code like:
CSV.foreach(file, encoding:'iso-8859-1', headers:true, skip_blanks:true) do |row|
....
end
Unfortunately, support for this format was dropped in Excel 365. Many RiskAssess data files were in this format, as the earlier versions of Excel did not properly export to valid UTF-8.
Excel 365 now has a Save As option of “CSV UTF-8 (Comma-delimited)”. This is UTF-8.. with a few twists! Sometimes it includes a UTF-8 format first character, and sometimes not. Sometimes it includes the UTF-8 format first character plus a BOM (byte order mark), another invisible character. According to Wikipedia “The Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use”. This makes it trickier to import. Code like this:
CSV.foreach(file, encoding:'UTF-8', headers:true, skip_blanks:true) do |row|
....
end
will handle the first 2 possibilities, but not the BOM. The BOM ends up as an invisible character in the data, causing trouble. It is possible to import with encoding ‘bom|utf-8’ which will handle this case. Another option is to run data through the dos2unix command (easily installed with brew) which does general tidying including removing the unnecessary BOM.
Also to watch out for, “Windows CSV” format previously used “\r” to denote new lines inside of cells. The new UTF-8 export uses “\n” for new lines inside of cells.
Leave a Reply