German Umlauts in URL

I have an older website that I created years ago using Jimdo.
Now I’m about to recreate it using Sparkle and revise it at the same time.
Since there are numerous links to the individual pages out there (search engines, for example), I want to keep the structure of the website the same, i.e. the URLs of the individual pages should also remain exactly the same.
Unfortunately, there are URLs on the website that contain German umlauts (ä, ö, ü). An example is “chrb.ch/einführung”.
Now every time I enter “einführung.html” in the PAGE FILENAME field, Sparkle makes it “einfuhrung.html”, i.e. it replaces the umlaut “ü” with “u”.
Is there any way to use the original names (with umlaut) here?

Thanks in advance for your help

Greetings, Röbi

The situation with diacritics (or other non-ASCII characters) in domain names has been fixed by IDN: there’s a conventional encoding called Punycode, so schön.de becomes xn--schn-7qa.de. This is transparent, because browsers and even Sparkle convert back and forth between the two forms, so you never notice.
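For anyone who wants to check the domain conversion themselves, here is a minimal Python sketch. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules; browsers may use newer rules, but the result for this example is the same.

```python
# Convert an internationalized domain name to its ASCII (Punycode) form and back.
# "\u00f6" is the precomposed letter o with umlaut.
domain = "sch\u00f6n.de"

ascii_form = domain.encode("idna")       # what actually goes over the wire
round_trip = ascii_form.decode("idna")   # what the user sees again

print(ascii_form)   # b'xn--schn-7qa.de'
print(round_trip)   # schön.de
```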

Unfortunately no such convention is available for the rest of the URL. Browsers can submit that part in any encoding, and the server needs to deal with it. That’s why you see Amazon searches include “ÅMÅŽÕÑ” as part of the search URL: they most likely use that to detect what the browser thought the encoding was supposed to be and correct for it. But Sparkle sites are plain, “static” sites, with files in the filesystem, so no auto-detection is possible.

Specifically, ü can be represented in its ISO-8859-1 (also known as “Latin-1”) encoding as the single byte 0xFC (the hexadecimal for 252). Sparkle encodes pages as UTF-8, but the URL can end up anywhere (an email client, a chat, a different web page not encoded as UTF-8, etc.), and if the local operating system doesn’t know better it can put the 0xFC byte straight in the URL.

Another representation for ü is its Unicode equivalent, which is defined as:

U+00FC Latin Small Letter U with Diaeresis / U with trema / U with umlaut

While it still looks like the same 0xFC number, the common encoding for Unicode on the web is UTF-8, which turns any code point above 0x7F into two or more bytes, so U+00FC is represented as 0xC3 followed by 0xBC.
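To make the difference between the two encodings concrete, here is a small Python sketch using only the standard library. The page name is the one from the question; the percent-encoded strings are what each encoding would look like inside a URL path.

```python
from urllib.parse import quote

u_umlaut = "\u00fc"  # precomposed ü, code point U+00FC

latin1_bytes = u_umlaut.encode("latin-1")  # one byte
utf8_bytes = u_umlaut.encode("utf-8")      # two bytes

print(latin1_bytes)  # b'\xfc'
print(utf8_bytes)    # b'\xc3\xbc'

# The same page name, percent-encoded under each assumption.
page = "einf\u00fchrung"
as_latin1 = quote(page, encoding="latin-1")
as_utf8 = quote(page, encoding="utf-8")

print(as_latin1)  # einf%FChrung
print(as_utf8)    # einf%C3%BChrung
```

Both URLs look identical to the person who typed them, but a server matching bytes sees two different requests.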

But it’s not over: the 0xC3 0xBC representation is just one form of Unicode Normalization. Unicode has the concept of “grapheme clusters”, where the letter and the diacritic (in this example) are separate code points, but are combined to produce a single visible “character”. That’s how a lot of emoji and skin tone variants are produced, for example.

In the case of the letter u with the umlaut, this can be achieved with:

U+0075 letter u
U+0308 COMBINING DIAERESIS

This would be encoded as the bytes 0x75 0xCC 0x88 – you can try entering \x75\xCC\x88 in this tool, it will show the decoded character as an (identical looking) ü.

Now that’s three different byte sequences for the same character, but there are a few more depending on the Unicode Normalization form used.

As mentioned the source operating system could use any of those forms, in addition to pre-Unicode ISO-8859-1 encoding.

The most common Unicode Normalization is called “Normalization Form C (NFC)”, which is the least surprising in that the algorithm is to first decompose the grapheme clusters and then compose them back, producing the single U+00FC code point for the ü.
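The composed and decomposed forms described above can be checked with Python's standard unicodedata module. This is only an illustration of the normalization forms, not of anything Sparkle does internally.

```python
import unicodedata

precomposed = "\u00fc"  # ü as a single code point

# NFD splits the letter into base letter + combining mark.
nfd = unicodedata.normalize("NFD", precomposed)
# NFC composes it back into the single code point.
nfc = unicodedata.normalize("NFC", nfd)

print([hex(ord(c)) for c in nfd])  # ['0x75', '0x308']
print(nfd.encode("utf-8"))         # b'u\xcc\x88'
print(nfc.encode("utf-8"))         # b'\xc3\xbc'

# Visually identical, but not equal as strings or as bytes.
print(nfd == precomposed)  # False
print(nfc == precomposed)  # True
```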

One could think of always using Normalization Form C: a file named einführung on the server would then be found whenever the browser uses the common NFC form for einführung in the file path. That sounds like it would work most of the time.

The final issue here is the macOS filesystem. The filesystem is involved because if in Sparkle you export the website to disk, the file names will end up there, and you need to have “normalization continuity” between filesystem<->ftp<->server filesystem<->web server<->http(s)<->web browser.

This continuity falls down right at the macOS filesystem level, because HFS+ used to convert the filename to Normalization Form D (that’s not a typo, it did not use NFC), a different stream of bytes. And for compatibility reasons at some level Apple’s newer APFS does it as well.
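Here is a rough Python sketch of how that break in continuity shows up. It assumes a server that matches file names byte for byte, and a file whose name was stored in NFD somewhere along the way; find_file and the file list are hypothetical, and real servers and filesystems differ in how they compare names.

```python
import unicodedata

def find_file(requested_name, files_on_server):
    """Byte-exact lookup, as a plain static file server might do it (assumption)."""
    wanted = requested_name.encode("utf-8")
    return any(name.encode("utf-8") == wanted for name in files_on_server)

nfc_name = unicodedata.normalize("NFC", "einf\u00fchrung.html")
nfd_name = unicodedata.normalize("NFD", nfc_name)

# Hypothetical: the name was converted to NFD before it reached the server.
files = [nfd_name]

found_with_nfc = find_file(nfc_name, files)  # browser sends the common NFC form
found_with_nfd = find_file(nfd_name, files)

print(found_with_nfc)  # False
print(found_with_nfd)  # True
```

The request and the file name look the same on screen, yet the lookup fails, which is what makes this so hard to diagnose.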

So the TL;DR of all this is that URLs with non-ASCII characters in them are inherently unreliable, in subtle, invisible and very hard to diagnose ways. While under very controlled conditions you could make them work, Sparkle favors better compatibility over having diacritics in page filenames. And it’s unfortunately even worse for CJK languages, where something like こんにちは becomes “kon-nichiha”, which is less readable to a Japanese reader.

I realize this is not what you wanted to hear. I hope you can see the issue has been very well analyzed, we just don’t believe there’s a good enough solution that would work the way we would like for Sparkle. At least for now.

The only thing I can suggest is to configure server-side redirects from the names with diacritics to the plain ASCII file names produced by Sparkle. Sorry.
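As a starting point for such redirects, here is a Python sketch that lists the percent-encoded spellings of an old page name that a browser might plausibly send (NFC and NFD in UTF-8, plus Latin-1 where possible). The function name and the target path are assumptions for illustration; how the rules are actually written depends on your web server, and some servers match on the decoded path rather than the percent-encoded one.

```python
import unicodedata
from urllib.parse import quote

def redirect_sources(name):
    """Return the percent-encoded variants of `name` worth redirecting (hypothetical helper)."""
    nfc = unicodedata.normalize("NFC", name)
    nfd = unicodedata.normalize("NFD", name)
    variants = {quote(nfc, encoding="utf-8"), quote(nfd, encoding="utf-8")}
    try:
        # Latin-1 only covers a small set of characters, so this may fail.
        variants.add(quote(nfc, encoding="latin-1"))
    except UnicodeEncodeError:
        pass
    return sorted(variants)

sources = redirect_sources("einf\u00fchrung")
# Map every old spelling to the ASCII file name Sparkle produces.
redirects = {"/" + src: "/einfuhrung.html" for src in sources}

for old, new in redirects.items():
    print(old, "->", new)
```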

Thank you very much for your comprehensive answer. I certainly realize that the issue has been very well analyzed.
I thought that maybe I had missed a simple setting, so I asked. But it is not a big problem. It would have been nice to stick to it, but if it can’t be done without tricks, I can do without it.
By the way, I am pleasantly surprised how you get competent answers here immediately and are taken seriously.
Thanks again and keep up the good work!

Hello.

What about converting ä into ae, ö into oe and so on?
The user can then decide whether to leave it as it is or go back to the default conversion (a, o and so on).
Just an idea.

Mr. F.

We use system functionality to do that for multiple languages; I think it’s part of ICU, a massive library for Unicode support. It’s not something easily turned into a preference in a general way. If you want to customize that, you can set the filename to custom and edit it directly.
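If you want to work out the custom filenames in bulk, something like the following Python sketch could help. It is not how Sparkle or ICU do it; strip_diacritics only approximates the default behavior described in this thread (ü becomes u), and german_ascii applies the ae/oe/ue convention suggested above. Both function names and the mapping table are my own assumptions.

```python
import unicodedata

# German convention: umlauts become vowel + e, ß becomes ss (assumed mapping).
GERMAN = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
})

def strip_diacritics(text):
    """Drop combining marks after decomposing, e.g. ü -> u."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def german_ascii(text):
    """Apply the German mapping first, then strip any remaining diacritics."""
    composed = unicodedata.normalize("NFC", text)  # so the table keys match
    return strip_diacritics(composed.translate(GERMAN))

print(strip_diacritics("einf\u00fchrung"))  # einfuhrung
print(german_ascii("einf\u00fchrung"))      # einfuehrung
```

Normalizing to NFC before the lookup matters here: a name that arrives in decomposed form would otherwise skip the table and fall through to the plain u.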