The situation with diacritics (or other non-ASCII characters) in domain names has been fixed by IDN: there’s a conventional encoding called Punycode, so say schön.de becomes xn--schn-7qa.de. This is transparent, as browsers and even Sparkle convert back and forth between the two, so you never notice.
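To make the domain part concrete, here is a minimal sketch in Python. It assumes Python's built-in "idna" codec (which follows the older IDNA 2003 rules) is an acceptable stand-in for what a browser does; it is not how Sparkle itself performs the conversion.

```python
# Convert a hostname between its Unicode form and its ASCII (Punycode) form.
# Assumption: Python's built-in "idna" codec approximates browser behavior.
unicode_host = "sch\u00f6n.de"  # schön.de

# Unicode -> ASCII, the form that actually travels in DNS
ascii_host = unicode_host.encode("idna").decode("ascii")

# ASCII -> Unicode, the form shown to the user
round_trip = ascii_host.encode("ascii").decode("idna")

print(ascii_host)
print(round_trip)
```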
Unfortunately no such convention is available for the rest of the URL. Browsers can submit that part in any encoding, and the server needs to deal with it. That’s why you see Amazon searches include “ÅMÅŽÕÑ” as part of the search URL: they most likely use it to detect what the browser thought the encoding was supposed to be, and correct for it. But Sparkle sites are plain, “static” sites, with files in the filesystem, so no auto-detection is possible.
ü can be represented in its ISO-8859-1 (also known as “Latin 1”) encoding, code point 0xFC (the hexadecimal for 252). Sparkle codes pages as UTF-8, but the URL can end up anywhere (an email client, a chat, a different web page not coded as UTF-8, etc.), and if the local operating system doesn’t know better it can put the 0xFC byte straight in the URL.
Another representation for ü is its Unicode equivalent, which is defined as:
U+00FC Latin Small Letter U with Diaeresis / U with trema / U with umlaut
While it still looks like the same 0xFC number, the common encoding for Unicode on the web is UTF-8, which turns any character from 0x80 upward into two or more bytes, so 0xFC is represented as 0xC3 followed by 0xBC.
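The two encodings discussed so far can be compared side by side. This is only an illustrative sketch using Python's standard library; the variable names are mine.

```python
from urllib.parse import quote

char = "\u00fc"  # ü, LATIN SMALL LETTER U WITH DIAERESIS

# The same character as raw bytes in each encoding
latin1_bytes = char.encode("iso-8859-1")  # one byte
utf8_bytes = char.encode("utf-8")         # two bytes

# The same character percent-encoded, as it would appear in a URL path
latin1_url = quote(char, encoding="iso-8859-1")
utf8_url = quote(char, encoding="utf-8")

print(latin1_bytes, utf8_bytes)
print(latin1_url, utf8_url)
```

A server that only has a file stored under one of these byte sequences will not match a request made with the other.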
But it’s not over: the 0xC3 0xBC representation corresponds to just one of several Unicode normalization forms. Unicode has the concept of “grapheme clusters”, where the letter and the diacritic (in this example) are separate code points, but are combined to produce a single visible “character”. That’s how a lot of emoji and skin tone variants are produced, for example.
In the case of the letter u with the umlaut, this can be achieved with:
U+0075 letter u
U+0308 COMBINING DIAERESIS
This would be encoded as the bytes 0x75 0xCC 0x88. You can try entering \x75\xCC\x88 in this tool, and it will show the decoded character as an (identical looking) ü.
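If you'd rather check this locally than in an online tool, a small sketch with Python's standard `unicodedata` module shows the decomposed form and its bytes. The variable names are just for illustration.

```python
import unicodedata

composed = "\u00fc"     # single code point U+00FC
decomposed = "u\u0308"  # U+0075 followed by U+0308 COMBINING DIAERESIS

# Decomposing the single code point yields the two-code-point sequence
nfd_of_composed = unicodedata.normalize("NFD", composed)

# Bytes of the decomposed form in UTF-8
decomposed_bytes = decomposed.encode("utf-8")

# The strings render identically but do not compare equal
same_string = composed == decomposed

print(nfd_of_composed == decomposed, decomposed_bytes, same_string)
```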
Now that’s three different byte sequences for the same character, but there are a few more depending on the Unicode Normalization form used.
As mentioned, the source operating system could use any of those forms, in addition to the pre-Unicode ISO-8859-1 encoding.
The most common Unicode Normalization is called “Normalization Form C (NFC)”, which is the least surprising in that the algorithm is to first decompose the grapheme clusters and then compose them back, producing the 0xFC Unicode code point for the ü.
One could think of always using Normalization Form C: a file named einführung on the server would then be found whenever the browser uses the common NFC form for einführung in the file path. That sounds like it would work most of the time.
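The "always use NFC" idea would look roughly like the sketch below. The helper name `to_nfc` is hypothetical, and this only demonstrates the normalization step, not anything Sparkle actually does.

```python
import unicodedata

def to_nfc(name):
    """Return the name in Normalization Form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)

nfc_name = "einf\u00fchrung"   # ü as a single code point
nfd_name = "einfu\u0308hrung"  # u followed by a combining diaeresis

# Both spellings collapse to the same string once normalized
print(to_nfc(nfd_name) == to_nfc(nfc_name))
print(len(nfc_name), len(nfd_name))
```

This only helps if every step between the author's disk and the visitor's browser applies the same normalization, which is where things go wrong.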
The final issue here is the macOS filesystem. The filesystem is involved because if you export the website to disk in Sparkle, the file names will end up there, and you need to have “normalization continuity” between filesystem<->FTP<->server filesystem<->web server<->HTTP(S)<->web browser.
This continuity breaks down right at the macOS filesystem level, because HFS+ used to convert the filename to Normalization Form D (that’s not a typo, it did not use NFC), which is a different stream of bytes. And for compatibility reasons, at some level Apple’s newer APFS does it as well.
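Here is a rough simulation of that break. It assumes a server that matches file names byte for byte, and uses a plain Python set as a stand-in for the server's directory listing; the file name `einführung.html` is only an example.

```python
import unicodedata
from urllib.parse import quote, unquote_to_bytes

# File name after passing through a filesystem that normalizes to NFD
stored_name = unicodedata.normalize("NFD", "einf\u00fchrung.html")

# Stand-in for the server's directory, matched byte for byte (assumption)
server_files = {stored_name.encode("utf-8")}

# A browser typically requests the NFC form, percent-encoded as UTF-8
requested_bytes = unquote_to_bytes("einf%C3%BChrung.html")

found = requested_bytes in server_files

print(quote(stored_name))  # what the URL would have to look like to match
print(found)
```

The two names look identical on screen, yet the lookup fails, which is what makes this so hard to diagnose.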
So the TL;DR of all this is that URLs with non-ASCII characters in them are inherently unreliable, in subtle, invisible and very hard to diagnose ways. While under very controlled conditions you could make them work, Sparkle favors better compatibility over having diacritics in page filenames. And it’s unfortunately even worse for CJK languages, where something like こんにちは becomes “kon-nichiha”, which is less readable to a Japanese reader.
I realize this is not what you wanted to hear. I hope you can see the issue has been very well analyzed, we just don’t believe there’s a good enough solution that would work the way we would like for Sparkle. At least for now.
The only thing I can suggest is to configure server side redirects from the names with diacritics to the plain ASCII file names produced by Sparkle. Sorry.
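If you go the redirect route, the forms discussed above suggest which source paths you might want to cover. The sketch below is only a starting point: `redirect_sources` is a hypothetical helper, and how you turn its output into actual redirect rules depends on your server.

```python
import unicodedata
from urllib.parse import quote

def redirect_sources(name):
    """List percent-encoded spellings of a name that a client might request.

    Hypothetical helper: covers NFC and NFD in UTF-8, plus ISO-8859-1
    when the name can be represented in that encoding.
    """
    nfc = unicodedata.normalize("NFC", name)
    nfd = unicodedata.normalize("NFD", name)
    variants = {quote(nfc, encoding="utf-8"), quote(nfd, encoding="utf-8")}
    try:
        variants.add(quote(nfc, encoding="iso-8859-1"))
    except UnicodeEncodeError:
        pass  # name contains characters outside Latin 1
    return sorted(variants)

for path in redirect_sources("einf\u00fchrung"):
    print(path)
```

Each printed path would be redirected to the plain ASCII file name that Sparkle produced for that page.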