What has case distinction but is neither uppercase nor lowercase?
Posted by Wor_king2000@reddit | programming | View on Reddit | 106 comments
medforddad@reddit
Did anyone else not understand this part:
How does that explain why there's a single unicode code-point for 'dz', but not for 'cs', or 'dzs'?
It sounds like, if not for Serbo-Croation, then there would be no 'dz' code-point even though it exists as a single letter in Hungarian. But I don't understand why existing in Serbian or Croatian means it does get a single code-point while existing in Hungarian doesn't.
Hacnar@reddit
Because of Cyrillic. Dz in Latin script corresponds to S in Cyrillic. Then there is also dž, which corresponds to џ in Cyrillic. They want a mapping that doesn't change the number of characters between Latin and Cyrillic.
toiletear@reddit
In Croatian, dž, lj and nj are a single letter, their grammar says so (it's a Slavic thing, my language used to have it too but at some point we decided that's just silly and we separated them like any other letter pair). I'm guessing that cs and dzs are usually written differently when appearing together, but do not otherwise represent a single letter, grammatically speaking.
medforddad@reddit
But in the article it says that they're a single letter in Hungarian as well:
So what's confusing me is why being a single letter in Croatian merits a single code-point, but being a single letter in Hungarian doesn't.
MatmaRex@reddit
Because pre-Unicode, some Croatians or perhaps Serbians invented and used text encodings where their digraphs were represented by single code points, and so Unicode included the same characters to allow lossless conversion to and from those encodings. Hungarians never came up with such a thing, so encoding their digraphs and trigraphs in Unicode was not required for compatibility.
catcint0s@reddit
I don't think anyone uses this special Unicode character to write madzag, or expects a search for "mad" to not match "madzag". We also have other alphabet members that are 2 letters, and they are also treated as 2 letters.
rzwitserloot@reddit
Same for Dutch IJ. It is a single letter, so, the proper capitalization of Iceland is IJsland. The big lake thing in the middle of the nation is the IJsselmeer. There is a special Unicode char but nobody uses it.
tav_stuff@reddit
Yes and no. IJ is not one letter — it is two letters as per the Dutch language standard. The confusion comes from how it used to be taught as one letter in schools until they stopped doing that because… it’s wrong.
It’s treated like 1 letter in basically every way: you capitalize it as IJ, you cannot break a word across lines between the I and J, and vertical text keeps the IJ next to each other. That said it’s not part of the Dutch alphabet, it’s not one letter, and if I search for ‘i’ I should get results with ‘ij’.
KyleG@reddit
Fight hard enough and you win: Germans finally got a capital ß in 2017. It was treated as a ligature (i.e., a pretty form of two letters) historically, and finally, after centuries of WAR, it was fully (rather than partially) re-classified as a LETTER with that move
tav_stuff@reddit
What are you talking about? The eszett (ß) has been considered part of the German alphabet for many, many years. It just didn't have a capital variant because it's never used in that position in normal prose. It hasn't been treated as a ligature for at least 100 years or so. In fact, we don't even know for certain what the origins of the eszett are
shevy-java@reddit
This is why I think Switzerland did the right thing. They don't use ß anymore.
aksdb@reddit
Darum trinken die dort Bier in Massen. ("That's why they drink beer there in Massen" — without the ß, "in Massen", "in huge quantities", is spelled the same as "in Maßen", "in moderation".)
Maxatar@reddit
You are replying as if you disagree with something, but then go on to repeat exactly what was said.
Wikipedia reinforces the fact that prior to 2017, there was no uppercase form of ß, and once again you don't dispute this but then go on to write two paragraphs as if you do.
mernen@reddit
He’s disagreeing with the statement that it was historically treated as a ligature.
Maxatar@reddit
According to Wikipedia the uppercase version was.
argh523@reddit
What you quote doesn't mean what you say it does. Wiki says that the Eszett wasn't considered a letter in early typesetting, and that this is the reason it didn't have an official capital version for so long.
Maxatar@reddit
But I didn't say anything other than what Wikipedia says. Can you quote what you think it is I said that is different from what Wikipedia says? Like actually quote which of the previous 4 sentences I wrote says something different?
Eurynom0s@reddit
I'm surprised there was any kind of push for a capital ß given the push prior to that with the spelling reform stuff was to stop using ß and to just use ss instead.
shevy-java@reddit
Agreed. It makes no sense. There is not a single word in German that would need a capital ß.
Ethesen@reddit
https://typography.guru/journal/germanys-new-character/
Eurynom0s@reddit
For all caps I guess? But the upper and lower case versions are so minorly different that it doesn't seem worth the effort at all.
zombiecalypse@reddit
A campaign about promoting "ss" in Germany might face an uphill battle
shevy-java@reddit
Switzerland already eliminated ß, which I think was the right solution. Instead Unicode now dreams up fake characters that are not used in any (!!!) real word.
shevy-java@reddit
Switzerland got rid of the ß many years ago, which I think was the better option. There is no distinct uppercase variant of ß in actual use, though, so I do not think this can be a correct claim.
dtechnology@reddit
ß has been in Unicode as a character since forever
lookmeat@reddit
Alphabets are weird. In Spanish they sought to have a phonetic language, which meant that letter combinations making sounds not formed by other letters ("ch" and "ll") were treated as their own letters and had their own sections in the dictionary (that way each letter maps to a unique sound). Nowadays they don't quite have that notion any more, and treat them as separate letters for the purposes of sorting; they're just seen as other cases of composed sounds (combinations of letters that make a sound different from what you'd expect, such as when the u gets muted, etc.)
Seref15@reddit
In the other direction we can see humans adjusting to computers instead of computers adjusting to humans. The official Spanish alphabet used to include a double-L (ll) as its own letter, but in 1994 Spain adjusted the alphabet to drop this letter, largely motivated by improved usability with computers.
OMG_A_CUPCAKE@reddit
This goes back to typewriters and even the printing press. English lost its Þ because of that.
shevy-java@reddit
I think that was good too. Simplicity has a strong point.
I like learning new languages, but I hate e.g. learning the Chinese or Japanese symbols. (The Korean alphabet is a bit different, but even then I'd much rather everyone just used a normal alphabet. It simplifies so many things, and people can still use their own local language anyway.)
jelly_cake@reddit
Lol, "why can't everyone just use a normal alphabet" - you realise that "normal" is subjective, right?
MiniGiantSpaceHams@reddit
Yes, but this person pretty clearly means "phonetic".
Liquid_Fire@reddit
Korean is phonetic, so they probably don't mean that.
recycled_ideas@reddit
It is, but it's not really written that way.
MiniGiantSpaceHams@reddit
Well fair enough, ignore me.
GimmickNG@reddit
and here we see the influence of English on the world in action.
There's a good reason different languages have different alphabets, and it's not because they hate the Latin alphabet.
shevy-java@reddit
I understand that; the German language also has the ß.
IMO, English wins that race, because it is used by more people, AND is simpler, too. They don't need all those fancy characters.
Own_Solution7820@reddit
If a word contains 'dz', is it guaranteed to be the digraph, or are there cases where it's d and z coincidentally being adjacent? Maybe something like podzol.
stahorn@reddit
It's interesting that a search for "mad" might not match "madzag" in some languages. I think any good search function or engine matches what makes sense rather than the technical encoding, though. Outside of code, that is — code is usually just the Latin alphabet, for sanity.
For example, I expect search engines to understand that if I type O it should still match Swedish names with Ö or Norwegian/Danish names with Ø. After all, they look very similar. Two examples:
I just hope those searches work the same in the rest of the world and show Göteborg and Tromsø. It might be depending on where you search!
An example from Vietnamese is if I search like this:
It shows Vũng Tàu, which I would have had trouble searching for otherwise.
I don't have a good way of searching for names in languages that have completely different alphabets though. Chinese, Mongolian, Georgian script, ..., the list is long. Not that I suffer that much in my daily life because of it.
MaleficentFig7578@reddit
This is why search indices should convert to NFKD and strip diacritics and other stuff.
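For what it's worth, that approach is only a few lines in Python's standard library. A quick sketch (note the caveat that characters like ø are single letters with no decomposition, so stripping combining marks alone doesn't catch everything):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFKD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("Göteborg"))  # -> Goteborg
print(strip_diacritics("böse"))      # -> bose
# Caveat: ø is not "o + diacritic" in Unicode, so Tromsø is unchanged:
print(strip_diacritics("Tromsø"))    # -> Tromsø
```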
A1oso@reddit
That would make the search results useless for many queries. For example, searching for 'böse' (German for 'evil') would only return documents about a company that sells sound technology.
Ok-Scheme-913@reddit
I guess a reasonable solution might be to add it to the indexer in both stripped and non-stripped variants. That way searching for bose would return both, while böse (the more specific) would only return the German word.
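A toy sketch of that dual-indexing idea in Python (the index structure and the `fold` helper are hypothetical illustrations, not any real search engine's API):

```python
import unicodedata
from collections import defaultdict

def fold(text: str) -> str:
    # NFKD-decompose and drop combining marks, e.g. "böse" -> "bose"
    d = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in d if not unicodedata.combining(c))

# Index each document under both its exact form and its folded form.
index = defaultdict(set)
for doc_id, word in enumerate(["böse", "Bose", "bose"]):
    index[word.lower()].add(doc_id)
    index[fold(word)].add(doc_id)

def search(query: str) -> set:
    # An accented query is more specific and only hits exact entries;
    # an unaccented query matches the folded entries too.
    q = query.lower()
    return index.get(q, set())

print(search("böse"))  # only the document containing "böse"
print(search("bose"))  # all three documents
```

The accented query stays precise because only the exact accented spelling was indexed under the accented key, while every variant was also indexed under its folded key.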
A1oso@reddit
I don't understand why you want that. 'o' and 'ö' are different letters, why do you want to treat them the same? What did the 'ö' do to you?
Your solution doesn't work either, because searching for 'bose' will still have lots of irrelevant search results.
We have the umlauts on our keyboard, there is no reason for a German to type 'bose' when they mean 'böse'. To me, your proposal sounds just as crazy as replacing every 'm' with an 'n'.
Ok-Scheme-913@reddit
Because umlauts and acutes get lost, German/Hungarian words are often used in English (contexts) as well, often input by someone who doesn't know how to type them, etc.
As a Hungarian, my experience is that being accepting in the direction of non-accented to accented is a must have, the other direction not so much. It's a bit of a Liskov substitution for languages.
qrrux@reddit
This is only odd to people who only know alphabetic languages. In Chinese, there is no alphabet.
Every word is a unique picture, like an emoji.
If you type “hand” into a search bar, it doesn’t pull up all the emoji that have a hand in them. If you type “finger” into a search bar, it doesn’t pull up all the emoji featuring a finger.
So, in Chinese, which sometimes uses other “sub-pictures” of one word in another word, if that “sub-picture” (often a “root”) is itself a word, there is no way to “grep” or otherwise regex or pattern match for roots-used. [I mean, there could be, and in Taiwan they are taught a phonetic alphabet for Chinese, but this isn’t true for mainland China, and presents other problems…]
That becomes semantic, and not lexical.
That you can romanize easily other western alphabets is an interesting fact of linguistic history. But trying to find all the lexical morphemes of ‘o’ when typing ‘o’ isn’t just lexical. At some point, someone has to put semantic—or other—information in the system which identifies these “isotopes” of various western letters.
It was never a lexical problem to begin with. A lexical (or semi-lexical) solution is a hack. But it’s not a simple problem, and connected to issues as far-ranging as physical keyboard layout.
In the “mad” example, imagine there’s a “dz” key on the keyboard. In that case, then “d” isn’t present in that word. But what about on systems which don’t use that keyboard? Then is “d” in the word?
cedear@reddit
Ideograph is the word.
qrrux@reddit
Ideogram is less ambiguous, b/c ideograph has other meanings.
And, with regard to Chinese, that’s not the term of art. The word you’re looking for is “logogram”.
But thanks!
stahorn@reddit
It sounds like a very deep rabbit hole to try to understand all of this! Even worse if we look at the keyboard layout that seems to be the one in use in Hungary:
https://en.wikipedia.org/wiki/QWERTZ#Hungary
No "dz" key. I was actually curious, as I've spent time in Poland and thought they also had letters written with several characters, like "dz". Turns out that in Polish, "dz" is not in the alphabet; it's a digraph. I'm not sure about the practical difference, actually. When typing, both languages use keyboards with separate "d" and "z" keys, as far as I know and have managed to google.
Ok-Scheme-913@reddit
This letter is absolutely not used in any way or form in Hungarian.
Source: am a Hungarian programmer who has had way too many problems with shitty encodings w.r.t. our extra letters like ő, ű, �. I hadn't even heard of this existing before this post, and our keyboards have no way of writing it (yeah, I guess one could learn the Unicode code point), so literally no Hungarian content contains it.
CherryLongjump1989@reddit
It's not so complicated. I have a standard Western keyboard and I would like to perform a search that is normalized to the characters my keyboard can easily produce. I'm already sending you an accepts-language header which should tell you how to normalize my search query and what kind of results to return. I don't consider usability to be a hack.
rzwitserloot@reddit
The asciification of ö in German is oe. But as far as I know, in Swedish it is o. For example, Sjögren.
The locale of the user or the app is immaterial. The asciified version of Sjögren is Sjogren regardless of the language setting. Names are not conveyed with the locale of where they etymologically come from.
So, what you want is impossible. Or, at least, Sjögren, Sjogren, and Sjoegren must all be treated as equal, and no hash algorithm is feasible.
In a simple world where every Unicode symbol had one asciification, what you want would be easy, but that's not how Unicode is used.
backelie@reddit
In our passports å/ä/ö gets internationalized as aa/ae/oe.
backelie@reddit
And as a Swede I'd like for a search for o not to match ö and vice versa (because they are not the same character). I learned to expect it from my English language browser, but it's never useful.
A1oso@reddit
Good search engines are typo tolerant, so they might find 'Göteborg' even when you incorrectly spell it with an 'o'.
But search engines cannot treat similar characters the same, because very often the difference is important. For example, in German, 'lauten' and 'läuten' are completely different words, and if search engines treated them the same, the results would be much less useful.
AquaWolfGuy@reddit
Swedish person here. I'd expect a Google search for Goteborg to find Göteborg because Google is an international service, and it can deal with misspelling in general. I wouldn't be surprised to see the same in an application specifically made for Swedish people, but I wouldn't expect it, and it feels wrong. It's also worth noting that the alphabet ends with ZÅÄÖ, which affects alphabetic sorting (collation).
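That ZÅÄÖ tail is why plain code-point sorting gets Swedish wrong: å and ä sort in the opposite order by code point. A toy sketch with a hand-rolled sort key (a real application would use a proper locale-aware collator rather than this):

```python
# The Swedish alphabet ends ... x y z å ä ö. Plain sorted() orders
# å/ä/ö by code point instead (ä < å there). A hand-rolled key:
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH)}

def swedish_key(word: str):
    # Unknown characters sort after the alphabet; fine for a toy.
    return [RANK.get(c, len(SWEDISH)) for c in word.lower()]

cities = ["Örebro", "Älmhult", "Åre", "Ystad"]
print(sorted(cities))                   # code point order: Älmhult before Åre
print(sorted(cities, key=swedish_key))  # Swedish order: Åre before Älmhult
```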
jhartikainen@reddit
I wonder how this actually works in practice. Is it impossible in Hungarian to have a word with d followed by a z, because they have the letter dz? Seems like it would be very confusing.
Also, browsing Wikipedia, their page on this suggests "Dz" is the uppercase version of "dz", not "DZ". Which is right? Any Hungarians with insight on how this works? :D
balazsbotond@reddit
Hungarian here.
It is absolutely possible, and it doesn't cause any confusion.
For example, vadzab (wild oat, vad = wild, zab = oat). It is a compound word, with the word boundary between d and z, which, when pronounced, becomes the "dz" phoneme, just like in other words where it naturally occurs.
And yes, Dz is the uppercase version, see this classroom poster:
https://sucika67.hu/termek/abeces-tablo/
And, while we treat dz as a single unit, and it has its own place in our alphabet, it is officially written as two separate letters / graphemes. It is not a single grapheme, and in my opinion, including it as such in the Unicode standard was a huge mistake.
Ok-Scheme-913@reddit
Wtf man, they are not pronounced the same. Dz is a different letter. The others in the group make it more obvious, like s followed by a z sounds way different than our sz letter.
But yeah, context gives it away. And this Unicode character is absolutely not used by any Hungarian. Our digitalization was always heavily based on ASCII; e.g. a no-longer-so-popular (thank God) encoding called Latin-2 included the few Hungarian-specific extra chars, but we only needed a couple, as cs, dz, dzs, sz, ty, ly, and zs can all be written as combinations of letters. Only ő and ű are not commonly found in other languages (we also have ö, ü, á, é, and ó, but I guess those may appear in other languages sometimes).
balazsbotond@reddit
You know, now that I think about it, I do pronounce the dz in vadzab differently than the one in, for example, edzés, I just never realized it.
jhartikainen@reddit
Interesting, thanks for the explanation.
Nah, not really, I was just curious how it works :) My native language is actually Finnish, which is similar to Hungarian in the phonetic aspect at least, and from what I understand has similar grammar, although your words and spelling look completely alien to me lol
-Y0-@reddit
Not Hungarian, but digraphs like lj and nj are treated as two letters that denote a single sound. They are a hack to get the sounds of Љ and Њ (from the Cyrillic alphabet) using mostly the Latin alphabet.
For Croatian, if you had a word beginning with L and J, due to the way phonetics works they would merge into a single sound (Lj, or Љ).
90% of the time you don't write all-uppercase but just want to capitalize the first letter of the word. That's why they show Titlecase versus the much rarer UPPERCASE.
TheMeteorShower@reddit
That sounds exactly like how we have "th" to make its own sound. The only difference is that "th" isn't in our alphabet.
It's two letters and has its own sound.
raleksandar@reddit
Just to add that (at least in Serbian, but I guess the same is true for Croatian as well) there are words where `nj` is one sound (`њ` in Cyrillic) but there are also words where it is two (`нј` in Cyrillic), for example:
- `njiva` (lat) / `њива` (cyr) - meaning "field"
- `injekcija` (lat) / `инјекција` (cyr) - meaning "injection"
The latter are not that common, and I believe all are examples of borrowed words, but they exist. And they complicate transliteration between Cyrillic and Latin alphabets (Serbian uses both, which is another can of worms).
jhartikainen@reddit
Thanks for the insights, I figured maybe there was some reason like the Cyrillic conversion there.
wintrmt3@reddit
In practice no one gives a fuck about Unicode digraphs and everyone expects "mad" to match "madzag".
PurepointDog@reddit
Well that was interesting
dvlsg@reddit
Raymond Chen usually has really interesting, bizarre things to blog about.
https://devblogs.microsoft.com/oldnewthing/author/oldnewthing
midir@reddit
Well tell the Hungarians to stop.
VRRifter@reddit
tldr: title case. Saved you a click
Thelmholtz@reddit
Which applies to characters that represent digraphs (such as the Serbo-Croatian one for "dz") that have three cases: DZ, Dz, and dz.
Note: I'm using normal ASCII to represent the digraph here, as I'm on mobile. They are just a single character if using the right Unicode symbol.
backelie@reddit
What about smallcaps?
Thelmholtz@reddit
Apparently they are not meant as a case, are supposed to be semantically unimportant for Unicode, and their recommended use is for IPA. https://en.m.wikipedia.org/wiki/Small_caps#Unicode
MaleficentFig7578@reddit
Why don't Serbs/Croats just use d and then z
Thelmholtz@reddit
Do I look like a linguist specialized in the Latin orthography of Slavic languages?
backelie@reddit
A little?
MrKapla@reddit
DZ, Dz and dz.
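For the curious, Python's str methods expose all three case mappings of the single-character digraph (U+01F1 through U+01F3), so you can see titlecase differ from uppercase directly:

```python
dz = "\u01F3"            # ǳ  LATIN SMALL LETTER DZ
print(dz.upper())        # Ǳ  (U+01F1, uppercase: both halves capitalized)
print(dz.title())        # ǲ  (U+01F2, titlecase: only the first half)
print("\u01F1".lower())  # ǳ  (U+01F3, back to lowercase)
```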
1639728813@reddit
What's title case? Do I now need to click?
chucker23n@reddit
It’s a special approach to casing that e.g. journalists use in headlines. https://titlecase.com
diegoasecas@reddit
This Is Title Case
PoolNoodleSamurai@reddit
It’s a Raymond Chen article. It’s worth a click.
CryZe92@reddit
Which apparently has nothing to do with the title case that people are familiar with.
shevy-java@reddit
I thought this was about Microsoft upcasing or downcasing files and not caring for the difference ...
Programmdude@reddit
In most European languages, case matters heaps, since the majority of characters written have lower/upper cases. So being able to correctly change the case of the most common characters is crucial.
Full-Spectral@reddit
I always find Unicode to be a weird beast. I was around when it came out, and I was writing the first version of the Xerces C++ XML parser, which of course had to deal with various language encodings. So we were all excited that Unicode was going to simplify all of that.
But, in the end, what was Unicode really supposed to do? Was it really supposed to simplify the ability of software applications to deal with multiple languages, or was it really an academic exercise to allow every single complexity of the random mutation of human languages to be encoded?
It clearly became the latter, whatever might have been its original intent. If the purpose was to make it practical for software applications to handle multiple languages, it would have forced various simplifications of the representation of human languages, which would have served everyone better, IMO.
Lonsdale1086@reddit
I mean, it's designed to allow transmission and storage of text in any language that exists.
Anything else is a side effect.
Full-Spectral@reddit
Text that is never actually displayed probably doesn't need to get transmitted or stored though. The purpose of text is for people to create it, edit it, and read it, for the most part, which means it has to be practical for any application that does those things.
Lonsdale1086@reddit
Tell me you've never tried to store a copy of an ancient Babylonian scroll in a Word Document without telling me you've never tried to store a copy of an ancient Babylonian scroll in a Word Document.
I'm being a dick about it, but it's an admirable goal to create a text encoding that can represent any character of any language ever written, thus allowing lossless transmission and storage of it.
It may only be practically useful for archivists, but it's also better than the solutions we had before in software development, even if we now get weird edgecases like this.
Full-Spectral@reddit
I imagine that very few of us have done that, and that's the point. It's the tail wagging the dog.
Lonsdale1086@reddit
We've co-opted a standard then complained it's not objectively worse in order to meet our needs.
Full-Spectral@reddit
But wait... Unicode was not created to deal with Babylonian texts that I remember. As I said, I was around at the time, and I remember it being pitched as a way to SIMPLIFY supporting multiple languages in software, partially of course by getting rid of the need to deal with multiple encodings and provide a single one. And it did do that. But it created vastly more complexity in the end since, as I said above, the point of text is to manipulate it, and that has become crazily complex now.
If the purpose was to make it practical for software to create, edit, and display text, then it could have pushed for simplifications and limitations. I think that the world of Babylonian scrolls, nothing personal, could have suffered in the process because the area under the curve improvements for those 20 million folks who don't deal with Babylonian scrolls for every one that does, could have software that is simpler, safer, faster, more robust, etc...
Tarquin_McBeard@reddit
You keep repeating this, even after multiple people have already corrected you: it wasn't.
The purpose of Unicode is and always was to enable the accurate storage and transmission of any text in any language. By your own admission, it has succeeded completely in that aim.
vytah@reddit
I want my customers' complaints about low quality copper properly digitized.
backelie@reddit
Control characters not needing to be stored and transmitted seems very unlikely to me.
Hussell@reddit
Technically, it's designed to allow text encoded in any other character encoding to be converted to and from Unicode without complications. That's why Unicode has all the weird characters from every other character encoding ever created, and why it has multiple ways to represent the same characters. If it didn't, then you couldn't convert from a character encoding that represented 'dz' as a single letter into Unicode and back without either special rules to decide when 'd' 'z' should be converted into 'dz' or not, or just accepting that all the 'dz' characters would come back as 'd' 'z' after the round-trip.
The Unicode standard lists a bunch of other goals it would like to adhere to, but the requirement for round-trip encoding to never alter the input always comes first at the expense of the other goals.
roelschroeven@reddit
The purpose was to make it possible for software applications to handle multiple languages. And emojis, as it turned out.
larsga@reddit
It's almost criminal to write about this without talking about Unicode normalization. A text that encodes "madzag" as "m-a-dz-a-g" is in Normalization Form C.
However, the Unicode character database has decomposition mappings for many of the characters, including the dz digraph, so if you normalize the string to Normalization Form D it becomes "madzag", one character per letter, exactly as expected.
That was the good part. The bad part is that in NFD something like "olé" becomes "o-l-e-combining ´". So it's not quite as straightforward as it might seem, but knowing about this stuff is very important for anyone who wants to do text processing.
vytah@reddit
C vs D does not affect the decomposition of dz, it gets decomposed when you enable compatibility normalization (KC or KD). It replaces letter-like things with normal letters, and you then choose KC vs KD depending on whether you want diacritics composed or not.
larsga@reddit
Thanks for correcting that. I didn't take time to look up the details of the different decompositions.
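The behavior described in this exchange is easy to verify with Python's unicodedata module. A quick check, using U+01F3 (the single-character dz):

```python
import unicodedata

madzag = "ma\u01F3ag"  # "madzag" spelled with the single dz character (5 chars)

# Canonical normalization (NFC/NFD) leaves the digraph alone:
print(unicodedata.normalize("NFD", madzag) == madzag)   # True

# Compatibility normalization (NFKC/NFKD) expands it to d + z:
print(unicodedata.normalize("NFKD", madzag))            # madzag (6 chars)

# Whereas é decomposes under plain NFD already:
print(unicodedata.normalize("NFD", "olé"))              # 'ole' + combining acute
```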
Zen13_@reddit
Life is already complicated enough as it is. This is something I didn't need to know. Can I unknow it?
audentis@reddit
Knowledge is power. Knowing something is inherently positive.
wintrmt3@reddit
On the other hand, in practice this information is false: no one is creating text in Hungarian with the digraph version, and everyone would be really surprised if "d" didn't match "dz".
Zen13_@reddit
He's a Microsoft developer, everything is important to him, as he never knows what will break next. Let's see if he developed a sense of humour since the last post...
Zen13_@reddit
Not everyone understands my jokes, only a few fortunate ones can.
balazsbotond@reddit
Programming is about deeply understanding the complexity of the world and modeling it in software in an elegant way. If you aren't interested in this, why are you reading the programming subreddit?
Zen13_@reddit
I deeply understand it and have modeled it for more than 35 years (professionally, over 40 as a hobby). I can't say the same about your understanding of jokes.
PurpleYoshiEgg@reddit
Jokes are meant to be funny. Start with that if you intend to write a joke.
omepiet@reddit
IJ enters the conversation. https://en.wikipedia.org/wiki/IJ_(digraph)