UTF-8 multibyte characters in Mac OS X filenames

Published on Sep 26, 2014

Sven Pachnit

I wrote a small CLI tool that lists the files in a directory, plus some extra information, in an ASCII table. To calculate the column width I need the length of the longest filename, and I ran into trouble when filenames contained UTF-8 multibyte characters: the special characters were counted as two characters.

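The issue here is that OS X uses a slightly different UTF-8 than you would think. Its HFS+ filesystem stores filenames in a decomposed Unicode form, so "ö" comes back as a plain "o" followed by a combining diaeresis (U+0308) instead of the single precomposed character U+00F6. Ruby knows this variant as UTF-8-MAC. Look at this: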

[1] pry(main)> str = File.basename(Dir["Desktop/*"][2])
=> "möp"

[2] pry(main)> str.length
=> 4

[3] pry(main)> "möp".length
=> 3

[4] pry(main)> str.encoding
=> #<Encoding:UTF-8>

[5] pry(main)> "möp".encoding
=> #<Encoding:UTF-8>

[6] pry(main)> str == "möp"
=> false
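
To see where the extra character comes from, it can help to look at the code points. The following is a minimal sketch that does not touch the filesystem: it assumes the filename was returned in decomposed form and rebuilds both variants with `\u` escapes. The helper `codepoints_hex` is a made-up name for illustration.

```ruby
# Decomposed form (assumed to be what the filesystem returned):
# "o" followed by U+0308, the combining diaeresis.
decomposed = "mo\u0308p"
# Precomposed form, as it is usually typed: a single U+00F6.
composed = "m\u00F6p"

# Hypothetical helper: format each code point as U+XXXX.
def codepoints_hex(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

decomposed_points = codepoints_hex(decomposed)
composed_points = codepoints_hex(composed)

puts decomposed_points.inspect  # four code points
puts composed_points.inspect    # three code points
puts decomposed.length          # 4
puts composed.length            # 3
puts decomposed == composed     # false
```

Both strings render the same and are valid UTF-8, but String#length counts code points, so the decomposed one is longer and the two are not equal.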

At this point I was confused. The two strings looked the same and had the same encoding, yet they were not equal (and one was longer). So what's the trick here? Tell Ruby that the path is really UTF-8-MAC and convert it to regular UTF-8, and everything is fine:

[7] pry(main)> str.encode('UTF-8', 'UTF-8-MAC').length
=> 3

[8] pry(main)> str.encode('UTF-8', 'UTF-8-MAC')
=> "möp"

[9] pry(main)> str.encode('UTF-8', 'UTF-8-MAC') == "möp"
=> true
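
For the original problem, the width of the longest filename, the normalization can be wrapped in a small helper. This is only a sketch: `longest_name_width` is a made-up name, and instead of the UTF-8-MAC conversion shown above it uses String#unicode_normalize (available since Ruby 2.2), which should give the same result for names like this one. The sample names are hard-coded rather than read from a directory.

```ruby
# Hypothetical helper: length of the longest name after NFC normalization.
# Returns 0 for an empty list.
def longest_name_width(names)
  names.map { |name| name.unicode_normalize(:nfc).length }.max || 0
end

# "mo\u0308p" is the decomposed spelling of "möp" (4 code points).
names = ["mo\u0308p", "abc"]

naive_width = names.map(&:length).max
width = longest_name_width(names)

puts naive_width  # 4, counts the combining mark separately
puts width        # 3, after normalization
```

With real data, `names` would come from something like `Dir["Desktop/*"].map { |path| File.basename(path) }`, as in the pry session.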