What has case distinction but is neither uppercase nor lowercase?
Posted by Wor_king2000@reddit | programming | View on Reddit | 106 comments
medforddad@reddit
Did anyone else not understand this part:
How does that explain why there's a single unicode code-point for 'dz', but not for 'cs', or 'dzs'?
It sounds like, if not for Serbo-Croation, then there would be no 'dz' code-point even though it exists as a single letter in Hungarian. But I don't understand why existing in Serbian or Croatian means it does get a single code-point while existing in Hungarian doesn't.
Hacnar@reddit
Because of Cyrillic. Dz in Latin script corresponds to S in Cyrillic. Then there is also dž, which corresponds to џ in Cyrillic. They want a mapping that doesn't change the number of characters between Latin and Cyrillic.
toiletear@reddit
In Croatian, dž, lj and nj are a single letter, their grammar says so (it's a Slavic thing, my language used to have it too but at some point we decided that's just silly and we separated them like any other letter pair). I'm guessing that cs and dzs are usually written differently when appearing together, but do not otherwise represent a single letter, grammatically speaking.
medforddad@reddit
But in the article it says that they're a single letter in Hungarian as well:
So what's confusing me is why being a single letter in Croatian merits a single code-point, but being a single letter in Hungarian doesn't.
MatmaRex@reddit
Because pre-Unicode, some Croatians or perhaps Serbians invented and used text encodings where their digraphs were represented by single code points, and so Unicode included the same characters to allow lossless conversion to and from those encodings. Hungarians never came up with such a thing, so encoding their digraphs and trigraphs in Unicode was not required for compatibility.
catcint0s@reddit
I don't think anyone uses this special Unicode character to write madzag, or expects a search for "mad" to not match "madzag". We also have other alphabet members that are 2 letters, and they are also treated as 2 letters.
rzwitserloot@reddit
Same for Dutch IJ. It is a single letter, so, the proper capitalization of Iceland is IJsland. The big lake thing in the middle of the nation is the IJsselmeer. There is a special Unicode char but nobody uses it.
tav_stuff@reddit
Yes and no. IJ is not one letter — it is two letters as per the Dutch language standard. The confusion comes from how it used to be taught as one letter in schools until they stopped doing that because… it’s wrong.
It’s treated like 1 letter in basically every way: you capitalize it as IJ, you cannot break a word across lines between the I and J, and vertical text keeps the IJ next to each other. That said it’s not part of the Dutch alphabet, it’s not one letter, and if I search for ‘i’ I should get results with ‘ij’.
KyleG@reddit
Fight hard enough and you win: Germans finally got a capital ß in 2017. It was treated as a ligature (i.e., a pretty form of two letters) historically, and finally, after centuries of WAR, it was fully (rather than partially) re-classified as a LETTER with that move
tav_stuff@reddit
What are you talking about? The eszett (ß) has been considered part of the German alphabet for many, many years. It just didn't have a capital variant because it's never used in that position in normal prose. It hasn't been treated as a ligature for at least 100 years or so. In fact, we don't even know for certain what the origins of the eszett are
shevy-java@reddit
This is why I think Switzerland did the right thing. They don't use ß anymore.
aksdb@reddit
Darum trinken die dort Bier in Massen. ("That's why they drink beer there in Massen" — without the ß, "in Massen", "in huge quantities", is spelled the same as "in Maßen", "in moderation".)
Maxatar@reddit
You are replying as if you disagree with something, but then go on to repeat exactly what was said.
Wikipedia reinforces the fact that prior to 2017, there was no uppercase form of ß, and once again you don't dispute this but then go on to write two paragraphs as if you do.
mernen@reddit
He’s disagreeing with the statement that it was historically treated as a ligature.
Maxatar@reddit
According to Wikipedia the uppercase version was.
argh523@reddit
What you quote doesn't mean what you say it does. Wiki says that the Eszett wasn't considered a letter in early typesetting, and that this is the reason it didn't have an official capital version for so long.
Maxatar@reddit
But I didn't say anything other than what Wikipedia says. Can you quote what you think it is I said that is different from what Wikipedia says? Like actually quote which of the previous 4 sentences I wrote says something different?
Eurynom0s@reddit
I'm surprised there was any kind of push for a capital ß given the push prior to that with the spelling reform stuff was to stop using ß and to just use ss instead.
shevy-java@reddit
Agreed. It makes no sense. There is not a single word in German that would need a capital ß.
Ethesen@reddit
https://typography.guru/journal/germanys-new-character/
Eurynom0s@reddit
For all caps I guess? But the upper and lower case versions are so minorly different that it doesn't seem worth the effort at all.
zombiecalypse@reddit
A campaign about promoting "ss" in Germany might face an uphill battle
shevy-java@reddit
Switzerland already eliminated ß, which I think was the right solution. Instead Unicode now dreams up fake characters that are not used in any (!!!) real word.
shevy-java@reddit
Switzerland got rid of the ß many years ago, which I think was the better option. There is no distinct uppercase variant of ß in actual use, though, so I do not think this can be a correct claim.
dtechnology@reddit
ß has been in Unicode as a character since forever
lookmeat@reddit
Alphabets are weird. In Spanish they sought to have a phonetic language, which meant that letter combinations making sounds not formed by other letters ("ch" and "ll") were treated as their own letters and had their own sections in the dictionary (that way each letter maps to a unique sound). Nowadays they don't quite have that notion any more, and treat them as separate letters for the purposes of sorting; they're just seen as other cases of composed sounds (combinations of letters that make a sound different from what you'd expect, such as when the u gets muted, etc.)
Seref15@reddit
In the other direction we can see humans adjusting to computers instead of computers adjusting to humans. The official Spanish alphabet used to include a double-L (ll) as its own letter, but in 1994 Spain adjusted the alphabet to drop this letter, largely motivated by improved usability with computers.
OMG_A_CUPCAKE@reddit
This goes back to typewriters and even the printing press. English lost its Þ because of that.
shevy-java@reddit
I think that was good too. Simplicity has a strong point.
I like learning new languages, but I hate e.g. learning the Chinese or Japanese symbols. (The Korean alphabet is a bit different, but even then I'd much rather everyone just used a normal alphabet. It simplifies so many things, and people can still use their own local language anyway.)
jelly_cake@reddit
Lol, "why can't everyone just use a normal alphabet" - you realise that "normal" is subjective, right?
MiniGiantSpaceHams@reddit
Yes, but this person pretty clearly means "phonetic".
Liquid_Fire@reddit
Korean is phonetic, so they probably don't mean that.
recycled_ideas@reddit
It is, but it's not really written that way.
MiniGiantSpaceHams@reddit
Well fair enough, ignore me.
GimmickNG@reddit
and here we see the influence of English on the world in action.
There's a good reason different languages have different alphabets, and it's not because they hate the Latin alphabet.
shevy-java@reddit
I understand that; the German language also has the ß.
IMO, English wins that race, because it is used by more people, AND is simpler, too. They don't need all those fancy characters.
Own_Solution7820@reddit
If a word contains 'dz', is it guaranteed to be the digraph, or are there cases where it's d and z coincidentally being adjacent? Maybe something like podzol.
stahorn@reddit
It's interesting that a search for "mad" might not match "madzag" in some languages. I think any good search function or engine matches what makes sense rather than the technical encoding, though. Outside of code, that is — code is usually just the Latin alphabet, for sanity.
For example, I expect search engines to understand that if I type O it should still match Swedish names with Ö or Norwegian/Danish names with Ø. After all, they look very similar. Two examples:
I just hope those searches work the same in the rest of the world and show Göteborg and Tromsø. It might be depending on where you search!
An example from Vietnamese is if I search like this:
It shows Vũng Tàu, which I would have had trouble searching for otherwise.
I don't have a good way of searching for names in languages that have completely different alphabets though. Chinese, Mongolian, Georgian script, ..., the list is long. Not that I suffer that much in my daily life because of it.
MaleficentFig7578@reddit
This is why search indices should convert to NFKD and strip diacritics and other stuff.
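For what it's worth, that approach is only a few lines in Python's standard library. A quick sketch (note the caveat that characters like ø are single letters with no decomposition, so stripping combining marks alone doesn't catch everything):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFKD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("Göteborg"))  # -> Goteborg
print(strip_diacritics("böse"))      # -> bose
# Caveat: ø is not "o + diacritic" in Unicode, so Tromsø is unchanged:
print(strip_diacritics("Tromsø"))    # -> Tromsø
```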
A1oso@reddit
That would make the search results useless for many queries. For example, searching for 'böse' (German for 'evil') would only return documents about a company that sells sound technology.
Ok-Scheme-913@reddit
I guess a reasonable solution might be to add it to the indexer in both stripped and non-stripped variants. That way searching for bose would return both, while böse (the more specific) would only return the German word.
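A toy sketch of that dual-indexing idea in Python (the index structure and the `fold` helper are hypothetical illustrations, not any real search engine's API):

```python
import unicodedata
from collections import defaultdict

def fold(text: str) -> str:
    # NFKD-decompose and drop combining marks, e.g. "böse" -> "bose"
    d = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in d if not unicodedata.combining(c))

# Index each document under both its exact form and its folded form.
index = defaultdict(set)
for doc_id, word in enumerate(["böse", "Bose", "bose"]):
    index[word.lower()].add(doc_id)
    index[fold(word)].add(doc_id)

def search(query: str) -> set:
    # An accented query is more specific and only hits exact entries;
    # an unaccented query matches the folded entries too.
    q = query.lower()
    return index.get(q, set())

print(search("böse"))  # only the document containing "böse"
print(search("bose"))  # all three documents
```

The accented query stays precise because only the exact accented spelling was indexed under the accented key, while every variant was also indexed under its folded key.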
A1oso@reddit
I don't understand why you want that. 'o' and 'ö' are different letters, why do you want to treat them the same? What did the 'ö' do to you?
Your solution doesn't work either, because searching for 'bose' will still have lots of irrelevant search results.
We have the umlauts on our keyboard, there is no reason for a German to type 'bose' when they mean 'böse'. To me, your proposal sounds just as crazy as replacing every 'm' with an 'n'.
Ok-Scheme-913@reddit
Because umlauts and acutes get lost, German/Hungarian words are often used in English (contexts) as well, often input by someone who doesn't know how to type them, etc.
As a Hungarian, my experience is that being accepting in the direction of non-accented to accented is a must have, the other direction not so much. It's a bit of a Liskov substitution for languages.
qrrux@reddit
This is only odd to people who only know alphabetic languages. In Chinese, there is no alphabet.
Every word is a unique picture, like an emoji.
If you type “hand” into a search bar, it doesn’t pull up all the emoji that have a hand in them. If you type “finger” into a search bar, it doesn’t pull up all the emoji featuring a finger.
So, in Chinese, which sometimes uses other “sub-pictures” of one word in another word, if that “sub-picture” (often a “root”) is itself a word, there is no way to “grep” or otherwise regex or pattern match for roots-used. [I mean, there could be, and in Taiwan they are taught a phonetic alphabet for Chinese, but this isn’t true for mainland China, and presents other problems…]
That becomes semantic, and not lexical.
That you can romanize easily other western alphabets is an interesting fact of linguistic history. But trying to find all the lexical morphemes of ‘o’ when typing ‘o’ isn’t just lexical. At some point, someone has to put semantic—or other—information in the system which identifies these “isotopes” of various western letters.
It was never a lexical problem to begin with. A lexical (or semi-lexical) solution is a hack. But it’s not a simple problem, and connected to issues as far-ranging as physical keyboard layout.
In the “mad” example, imagine there’s a “dz” key on the keyboard. In that case, then “d” isn’t present in that word. But what about on systems which don’t use that keyboard? Then is “d” in the word?
cedear@reddit
Ideograph is the word.
qrrux@reddit
Ideogram is less ambiguous, b/c ideograph has other meanings.
And, with regard to Chinese, that’s not the term of art. The word you’re looking for is “logogram”.
But thanks!
stahorn@reddit
It sounds like a very deep rabbit hole to try to understand all of this! Even worse if we look at the keyboard layout that seems to be the one in use in Hungary:
https://en.wikipedia.org/wiki/QWERTZ#Hungary
No "dz" key. I was actually curious, as I've spent time in Poland and thought they also had letters written with several characters, like "dz". Turns out that in Polish, "dz" is not in the alphabet; it's a digraph. I'm not sure about the practical difference, actually. When typing, both languages use keyboards with separate "d" and "z" keys, as far as I know and have managed to google.
Ok-Scheme-913@reddit
This letter is absolutely not used in any way or form in Hungarian.
Source: am a Hungarian programmer who has had way too many problems with shitty encodings w.r.t. our extra letters like ő, ű, �. I hadn't even heard of this existing before this post, and our keyboards have no way of writing it (yeah, I guess one could learn the Unicode code point), so literally no Hungarian content contains it.
CherryLongjump1989@reddit
It's not so complicated. I have a standard Western keyboard and I would like to perform a search that is normalized to the characters my keyboard can easily produce. I'm already sending you an accepts-language header which should tell you how to normalize my search query and what kind of results to return. I don't consider usability to be a hack.
rzwitserloot@reddit
The asciification of ö in German is oe. But as far as I know, in Swedish it is o. For example, Sjögren.
The locale of the user or the app is immaterial. The asciified version of Sjögren is Sjogren regardless of the language setting. Names are not conveyed with the locale of where they etymologically come from.
So, what you want is impossible. Or, at least, Sjögren, Sjogren, and Sjoegren must all be treated as equal, and no hash algorithm is feasible.
In a simple world where every Unicode symbol had one asciification, what you want would be easy, but that's not how Unicode is used.
backelie@reddit
In our passports å/ä/ö gets internationalized as aa/ae/oe.
backelie@reddit
And as a Swede I'd like for a search for o not to match ö and vice versa (because they are not the same character). I learned to expect it from my English language browser, but it's never useful.
A1oso@reddit
Good search engines are typo tolerant, so they might find 'Göteborg' even when you incorrectly spell it with an 'o'.
But search engines cannot treat similar characters the same, because very often the difference is important. For example, in German, 'lauten' and 'läuten' are completely different words, and if search engines treated them the same, the results would be much less useful.
AquaWolfGuy@reddit
Swedish person here. I'd expect a Google search for Goteborg to find Göteborg because Google is an international service, and it can deal with misspelling in general. I wouldn't be surprised to see the same in an application specifically made for Swedish people, but I wouldn't expect it, and it feels wrong. It's also worth noting that the alphabet ends with ZÅÄÖ, which affects alphabetic sorting (collation).
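That ZÅÄÖ tail is why plain code-point sorting gets Swedish wrong: å and ä sort in the opposite order by code point. A toy sketch with a hand-rolled sort key (a real application would use a proper locale-aware collator rather than this):

```python
# The Swedish alphabet ends ... x y z å ä ö. Plain sorted() orders
# å/ä/ö by code point instead (ä < å there). A hand-rolled key:
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH)}

def swedish_key(word: str):
    # Unknown characters sort after the alphabet; fine for a toy.
    return [RANK.get(c, len(SWEDISH)) for c in word.lower()]

cities = ["Örebro", "Älmhult", "Åre", "Ystad"]
print(sorted(cities))                   # code point order: Älmhult before Åre
print(sorted(cities, key=swedish_key))  # Swedish order: Åre before Älmhult
```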
jhartikainen@reddit
I wonder how this actually works in practice. Is it impossible in Hungarian to have a word with d followed by a z, because they have the letter dz? Seems like it would be very confusing.
Also, browsing Wikipedia, their page on this suggests "Dz" is the uppercase version of "dz", not "DZ". Which is right? Any Hungarians with insight on how this works? :D
balazsbotond@reddit
Hungarian here.
It is absolutely possible, and it doesn't cause any confusion.
For example, vadzab (wild oat, vad = wild, zab = oat). It is a compound word, with the word boundary between d and z, which, when pronounced, becomes the "dz" phoneme, just like in other words where it naturally occurs.
And yes, Dz is the uppercase version, see this classroom poster:
https://sucika67.hu/termek/abeces-tablo/
And, while we treat dz as a single unit, and it has its own place in our alphabet, it is officially written as two separate letters / graphemes. It is not a single grapheme, and in my opinion, including it as such in the Unicode standard was a huge mistake.
Ok-Scheme-913@reddit
Wtf man, they are not pronounced the same. Dz is a different letter. The others in the group make it more obvious, like s followed by a z sounds way different than our sz letter.
But yeah, context gives it away. And this Unicode character is absolutely not used by any Hungarian. Our digitalization was always heavily based on ASCII; e.g. a no-longer-so-popular (thank God) encoding called Latin-2 included the few Hungarian-specific extra chars, but we only needed a couple, as cs, dz, dzs, sz, ty, ly, and zs can all be written as combinations of letters. Only ő and ű are not commonly found in other languages (we also have ö, ü, á, é, and ó, but I guess those may appear in other languages sometimes).
balazsbotond@reddit
You know, now that I think about it, I do pronounce the dz in vadzab differently than the one in, for example, edzés, I just never realized it.
jhartikainen@reddit
Interesting, thanks for the explanation.
Nah, not really, I was just curious how it works :) My native language is actually Finnish, which is similar to Hungarian in the phonetic aspect at least, and from what I understand has similar grammar, although your words and spelling look completely alien to me lol
-Y0-@reddit
Not Hungarian, but digraphs like lj and nj are treated as two letters that denote a single sound. They are a hack to get the sounds of Љ and Њ (from the Cyrillic alphabet) using mostly the Latin alphabet.
For Croatian, if you had a word beginning with L and J, due to the way phonetics works they would merge into a single sound (Lj, or Љ).
90% of the time you don't write all-uppercase but just want to capitalize the first letter of the word. That's why they show Titlecase versus the much rarer UPPERCASE.
TheMeteorShower@reddit
That sounds exactly like how we have "th" to make its own sound. The only difference is that "th" isn't in our alphabet.
It's two letters and has its own sound.
raleksandar@reddit
Just to add that (at least in Serbian, but I guess the same is true for Croatian as well) there are words where `nj` is one sound (`њ` in Cyrillic) but there are also words where it is two (`нј` in Cyrillic), for example:
- `njiva` (lat) / `њива` (cyr) - meaning "field"
- `injekcija` (lat) / `инјекција` (cyr) - meaning "injection"
The latter are not that common, and I believe all are examples of borrowed words, but they exist. And they complicate transliteration between Cyrillic and Latin alphabets (Serbian uses both, which is another can of worms).
jhartikainen@reddit
Thanks for the insights, I figured maybe there was some reason like the Cyrillic conversion there.
wintrmt3@reddit
In practice no one gives a fuck about Unicode digraphs and everyone expects "mad" to match "madzag".
PurepointDog@reddit
Well that was interesting
dvlsg@reddit
Raymond Chen usually has really interesting, bizarre things to blog about.
https://devblogs.microsoft.com/oldnewthing/author/oldnewthing
midir@reddit
Well tell the Hungarians to stop.
VRRifter@reddit
tldr: title case. Saved you a click
Thelmholtz@reddit
Which applies to characters that represent digraphs (such as the Serbo-Croatian one for "dz") that have three cases: DZ, Dz, and dz.
Note: I'm using normal ASCII to represent the digraph here, as I'm on mobile. They are just a single character if using the right Unicode symbol.
backelie@reddit
What about smallcaps?
Thelmholtz@reddit
Apparently they are not meant as a case, are supposed to be semantically unimportant for Unicode, and their recommended use is for IPA. https://en.m.wikipedia.org/wiki/Small_caps#Unicode
MaleficentFig7578@reddit
Why don't Serbs/Croats just use d and then z
Thelmholtz@reddit
Do I look like a linguist specialized in the Latin orthography of Slavic languages?
backelie@reddit
A little?
MrKapla@reddit
DZ, Dz and dz.
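For the curious, Python's str methods expose all three case mappings of the single-character digraph (U+01F1 through U+01F3), so you can see titlecase differ from uppercase directly:

```python
dz = "\u01F3"            # ǳ  LATIN SMALL LETTER DZ
print(dz.upper())        # Ǳ  (U+01F1, uppercase: both halves capitalized)
print(dz.title())        # ǲ  (U+01F2, titlecase: only the first half)
print("\u01F1".lower())  # ǳ  (U+01F3, back to lowercase)
```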
1639728813@reddit
What's title case? Do I now need to click?
chucker23n@reddit
It’s a special approach to casing that e.g. journalists use in headlines. https://titlecase.com
diegoasecas@reddit
This Is Title Case
PoolNoodleSamurai@reddit
It’s a Raymond Chen article. It’s worth a click.
CryZe92@reddit
Which apparently has nothing to do with the title case that people are familiar with.
shevy-java@reddit
I thought this was about Microsoft upcasing or downcasing files and not caring for the difference ...
Programmdude@reddit
In most European languages, case matters heaps, since the majority of characters written have lower/upper cases. So being able to correctly change the case of the most common characters is crucial.
Full-Spectral@reddit
I always find Unicode to be a weird beast. I was around when it came out, and I was writing the first version of the Xerces C++ XML parser, which of course had to deal with various language encodings. So we were all excited that Unicode was going to simplify all of that.
But, in the end, what was Unicode really supposed to do? Was it really supposed to simplify the ability of software applications to deal with multiple languages, or was it really an academic exercise to allow every single complexity of the random mutation of human languages to be encoded?
It clearly became the latter, whatever might have been its original intent. If the purpose was to make it practical for software applications to handle multiple languages, it would have forced various simplifications of the representation of human languages, which would have served everyone better, IMO.
Lonsdale1086@reddit
I mean, it's designed to allow transmission and storage of text in any language that exists.
Anything else is a side effect.
Full-Spectral@reddit
Text that is never actually displayed probably doesn't need to get transmitted or stored though. The purpose of text is for people to create it, edit it, and read it, for the most part, which means it has to be practical for any application that does those things.
Lonsdale1086@reddit
Tell me you've never tried to store a copy of an ancient Babylonian scroll in a Word Document without telling me you've never tried to store a copy of an ancient Babylonian scroll in a Word Document.
I'm being a dick about it, but it's an admirable goal to create a text encoding that can represent any character of any language ever written, thus allowing lossless transmission and storage of it.
It may only be practically useful for archivists, but it's also better than the solutions we had before in software development, even if we now get weird edgecases like this.
Full-Spectral@reddit
I imagine that very few of us have done that, and that's the point. It's the tail wagging the dog.
Lonsdale1086@reddit
We've co-opted a standard then complained it's not objectively worse in order to meet our needs.
Full-Spectral@reddit
But wait... Unicode was not created to deal with Babylonian texts that I remember. As I said, I was around at the time, and I remember it being pitched as a way to SIMPLIFY supporting multiple languages in software, partially of course by getting rid of the need to deal with multiple encodings and provide a single one. And it did do that. But it created vastly more complexity in the end since, as I said above, the point of text is to manipulate it, and that has become crazily complex now.
If the purpose was to make it practical for software to create, edit, and display text, then it could have pushed for simplifications and limitations. I think that the world of Babylonian scrolls, nothing personal, could have suffered in the process because the area under the curve improvements for those 20 million folks who don't deal with Babylonian scrolls for every one that does, could have software that is simpler, safer, faster, more robust, etc...
Tarquin_McBeard@reddit
You keep repeating this, even after multiple people have already corrected you: it wasn't.
The purpose of Unicode is and always was to enable the accurate storage and transmission of any text in any language. By your own admission, it has succeeded completely in that aim.
vytah@reddit
I want my customers' complaints about low quality copper properly digitized.
backelie@reddit
Control characters not needing to be stored and transmitted seems very unlikely to me.
Hussell@reddit
Technically, it's designed to allow text encoded in any other character encoding to be converted to and from Unicode without complications. That's why Unicode has all the weird characters from every other character encoding ever created, and why it has multiple ways to represent the same characters. If it didn't, then you couldn't convert from a character encoding that represented 'dz' as a single letter into Unicode and back without either special rules to decide when 'd' 'z' should be converted into 'dz' or not, or just accepting that all the 'dz' characters would come back as 'd' 'z' after the round-trip.
The Unicode standard lists a bunch of other goals it would like to adhere to, but the requirement for round-trip encoding to never alter the input always comes first at the expense of the other goals.
roelschroeven@reddit
The purpose was to make it possible for software applications to handle multiple languages. And emojis, as it turned out.
larsga@reddit
It's almost criminal to write about this without talking about Unicode normalization. A text that encodes "madzag" as "m-a-dz-a-g" is in Normalization Form C.
However, the Unicode character database has decomposition mappings for many of the characters, including the dz digraph, so if you normalize the string to Normalization Form D it becomes "madzag", one character per letter, exactly as expected.
That was the good part. The bad part is that in NFD something like "olé" becomes "o-l-e-combining ´". So it's not quite as straightforward as it might seem, but knowing about this stuff is very important for anyone who wants to do text processing.
vytah@reddit
C vs D does not affect the decomposition of dz, it gets decomposed when you enable compatibility normalization (KC or KD). It replaces letter-like things with normal letters, and you then choose KC vs KD depending on whether you want diacritics composed or not.
larsga@reddit
Thanks for correcting that. I didn't take time to look up the details of the different decompositions.
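The behavior described in this exchange is easy to verify with Python's unicodedata module. A quick check, using U+01F3 (the single-character dz):

```python
import unicodedata

madzag = "ma\u01F3ag"  # "madzag" spelled with the single dz character (5 chars)

# Canonical normalization (NFC/NFD) leaves the digraph alone:
print(unicodedata.normalize("NFD", madzag) == madzag)   # True

# Compatibility normalization (NFKC/NFKD) expands it to d + z:
print(unicodedata.normalize("NFKD", madzag))            # madzag (6 chars)

# Whereas é decomposes under plain NFD already:
print(unicodedata.normalize("NFD", "olé"))              # 'ole' + combining acute
```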
Zen13_@reddit
Life is already complicated enough as it is. This is something I didn't need to know. Can I unknow it?
audentis@reddit
Knowledge is power. Knowing something is inherently positive.
wintrmt3@reddit
On the other hand, in practice this information is false: no one is creating text in Hungarian with the digraph version, and everyone would be really surprised if "d" didn't match "dz".
Zen13_@reddit
He's a Microsoft developer, everything is important to him, as he never knows what will break next. Let's see if he developed a sense of humour since the last post...
Zen13_@reddit
Not everyone understands my jokes, only a few fortunate ones can.
balazsbotond@reddit
Programming is about deeply understanding the complexity of the world and modeling it in software in an elegant way. If you aren't interested in this, why are you reading the programming subreddit?
Zen13_@reddit
I deeply understand it and have modeled it for more than 35 years (professionally, over 40 as a hobby). I can't say the same about your understanding of jokes.
PurpleYoshiEgg@reddit
Jokes are meant to be funny. Start with that if you intend to write a joke.
omepiet@reddit
IJ enters the conversation. https://en.wikipedia.org/wiki/IJ_(digraph)