We’re in the business of natural language processing with lots of different languages. So far we’ve worked on (big breath): English, Portuguese (Brazilian and from Portugal), Spanish, Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek, Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian, Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian, Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese, Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish, Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano, Danish, and Navajo.
Natural language processing (NLP) is about finding patterns in language: for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it’s very English-centric. English is far and away the language that linguists have worked on the most, and it’s also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). So one of the best ways to test an NLP system is to try languages other than English. The better a system can deal with diverse data, the more confident you can be in its ability to handle unseen data.
To this end, we might choose to define weirdness in terms of English. But that’s a pretty irritating definition. Let’s try to do something different.
The World Atlas of Language Structures (WALS) evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things: 192 different language features in total.
So rather than take an English-centric view of the world, WALS allows us to take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature. For example, English word order is subject-verb-object (SVO). There are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7% of languages start with a verb (like Welsh, Hawaiian and Majang), so cross-linguistically, starting with a verb is unusual. For what it’s worth, 41.0% of the coded languages are actually subject-object-verb (SOV) order. (Aside: I’ve done some work with Hawaiian and Majang and that’s how I learned that verbs are a big commitment for me. I’m just not ready for verbs when I open my mouth.)
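The "how unusual is this value" idea boils down to a relative frequency per feature value. Here is a minimal sketch, assuming the codings for one feature are held in a plain dictionary from language to value. The function name `value_shares` and the eight-language sample are made up for illustration; the sample is not the WALS distribution, so its shares will not match the percentages above.

```python
from collections import Counter

def value_shares(codings):
    """Map each feature value to the share of coded languages that have it."""
    counts = Counter(codings.values())
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Tiny hand-picked sample for illustration only, NOT the WALS distribution.
sample_word_order = {
    "English": "SVO", "Mandarin": "SVO",
    "Japanese": "SOV", "Turkish": "SOV", "Hindi": "SOV", "Korean": "SOV",
    "Welsh": "VSO", "Hawaiian": "VSO",
}

shares = value_shares(sample_word_order)
print(shares["SVO"])  # 2 of the 8 sample languages, i.e. 0.25
```

With the real WALS codings, the same calculation over the 1,377 languages coded for word order is what yields figures like 35.5% for SVO.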
The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these, dropping us down to 1,693 languages).
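The two-step sparsity filter can be sketched roughly like this, assuming the data is a nested dictionary of the form `{language: {feature: value}}`. The function name `filter_sparse` and its parameter names are hypothetical; the defaults mirror the thresholds mentioned above (100 languages per feature, 10 features per language).

```python
from collections import Counter

def filter_sparse(data, min_langs=100, min_feats=10):
    """Drop sparsely coded features first, then sparsely coded languages.

    data: {language: {feature: value}} (an assumed layout, not the WALS format).
    """
    # Step 1: count how many languages are coded for each feature.
    feature_counts = Counter(f for feats in data.values() for f in feats)
    kept_features = {f for f, n in feature_counts.items() if n >= min_langs}

    # Step 2: keep only languages with enough of the surviving features.
    filtered = {}
    for lang, feats in data.items():
        kept = {f: v for f, v in feats.items() if f in kept_features}
        if len(kept) >= min_feats:
            filtered[lang] = kept
    return filtered
```

Note that the order matters: a language is judged on how many of the surviving features it has, not on how many features it had to begin with.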
Now, one problem is that if you just stop there you have a huge amount of collinearity. Part of this is just the nature of the features listed in WALS: there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Ideally, we’d like to judge weirdness based on unrelated features. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). We end up with 21 features in total.
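One way this kind of pruning could look in code is below. To be clear about the assumptions: the post does not say which correlation measure or cutoff was used, so Cramér's V (a standard association measure for categorical data) and the 0.5 threshold are stand-ins, and the function names are hypothetical. The only part taken from the description above is the tie-break: between two correlated features, the one with more coded languages wins.

```python
import math
from collections import Counter

def cramers_v(pairs):
    """Cramér's V for a list of (value_a, value_b) pairs; 0.0 if undefined."""
    n = len(pairs)
    rows = Counter(a for a, _ in pairs)
    cols = Counter(b for _, b in pairs)
    k = min(len(rows), len(cols)) - 1
    if n == 0 or k == 0:
        return 0.0
    joint = Counter(pairs)
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))

def association(data, feat_a, feat_b):
    """Association between two features over languages coded for both."""
    pairs = [(feats[feat_a], feats[feat_b])
             for feats in data.values()
             if feat_a in feats and feat_b in feats]
    return cramers_v(pairs)

def prune_correlated(data, threshold=0.5):
    """Greedily keep features, best-covered first, skipping correlated ones.

    The threshold is an assumed cutoff, not a value from the post.
    """
    coverage = Counter(f for feats in data.values() for f in feats)
    kept = []
    for feat, _ in coverage.most_common():
        if all(association(data, feat, other) < threshold for other in kept):
            kept.append(feat)
    return kept
```

Visiting features in order of coverage means that whenever two features clash, the better-covered one is already in the kept list and the other is dropped.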
For each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. So if we had included the subject/object/verb order feature, then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overall entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). The Weirdness Index is then an average across the 21 unique structural features. But because different features have different numbers of values and we want to reduce skewing, we actually take the harmonic mean (and because we want bigger numbers = more weird, we actually subtract the mean from one). In this blog post, I’ll only report languages that have a value filled in for at least two-thirds of features (239 languages).
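A simplified sketch of the index calculation follows. It makes two simplifications that should be read as assumptions: it skips the entropy normalization (the post does not spell out how it was done), and it computes each frequency over all coded languages, including the language being scored. The function name and the `{language: {feature: value}}` layout are hypothetical.

```python
from collections import Counter
from statistics import harmonic_mean

def weirdness_index(data):
    """Return {language: 1 - harmonic mean of its value frequencies}.

    Simplified: no entropy normalization, and frequencies include the
    language being scored. Both are assumptions made for this sketch.
    """
    # Tally how often each value occurs for each feature.
    counts = {}
    for feats in data.values():
        for feat, value in feats.items():
            counts.setdefault(feat, Counter())[value] += 1

    scores = {}
    for lang, feats in data.items():
        freqs = [counts[feat][value] / sum(counts[feat].values())
                 for feat, value in feats.items()]
        # Rare values pull the harmonic mean down, so 1 - mean goes up.
        scores[lang] = 1 - harmonic_mean(freqs)
    return scores

# Made-up toy data: three languages agree on everything, one disagrees.
toy = {
    "A": {"order": "SOV", "question": "particle"},
    "B": {"order": "SOV", "question": "particle"},
    "C": {"order": "SOV", "question": "particle"},
    "D": {"order": "VSO", "question": "none"},
}
scores = weirdness_index(toy)
print(scores["A"], scores["D"])
```

In the toy data, the three languages that agree with each other score lower than the odd one out, which is the behavior the index is after.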
The language that is most different from the majority of all other languages in the world is a verb-initial tonal language spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets (that’s where we get the word “parka” from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.
But here’s the rub: some of the weirdest languages in the world are ones you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. And actually English is #33 in the Language Weirdness Index.
The 25 weirdest languages of the world. In North America: Chalcatongo Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, and Zoque; in South America: Paumarí and Trumai; in Australia/Oceania: Pitjantjatjara and Lavukaleve; in Africa: Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and Khoekhoe; in Asia: Nenets, Eastern Armenian, Abkhaz, Ladakhi, and Mandarin; and in Europe: German, Dutch, Norwegian, Czech, and Spanish.
By the way, how awesome of a name is “Pitjantjatjara”? (Also: can you guess which one of the internal syllables is silent?)
This is odd. Is this odd? One of the features that distinguishes languages is how they ask yes/no questions. The vast majority of languages have a special question particle that they tack on somewhere (like the “ka” at the end of a Japanese question). Of 954 languages coded for this in WALS, 584 of them have question particles. The word order switching that we do in English only happens in 1.4% of the languages. That’s 13 languages total and most of them come from Europe: German, Czech, Dutch, Swedish, Norwegian, Frisian, English, Danish, and Spanish.
But there is an even more unusual way to deal with yes/no questions and that’s what Chalcatongo Mixtec does: which is to do nothing at all. It is the only language surveyed that does not have a particle, a change of word order, a change of intonation… There is absolutely no difference between an interrogative yes/no question and a simple statement. I have spent part of the day imagining a game show in this language.
Another thing languages have to deal with is what to do with simple subjects like “I”, “they”, or “it”. These are called pronominal subjects (something like “The minister prevaricated” has a nominal subject). The most common way to do this is to just tack the information about the subject on to the verb: 437 out of 711 languages do this, like Spanish, Italian, and Portuguese. But Dutch, German, and Norwegian, like English, prefer having special subject pronouns that are normally/obligatorily present. This is only done by 82 of the 711 languages coded in WALS. Kutenai (100 speakers in British Columbia, Canada) and Mumuye (400,000 speakers in Nigeria) do something even more unusual: they have something like subject pronouns but these go in different positions in the syntax than where full noun phrases go. And even more unusual than this is Chalcatongo Mixtec again: they combine several strategies so they have both subject markers that they add to verbs and they have pronoun words, too. But these pronoun words appear in a different spot from where a full noun phrase would show up.
Now if I asked you to consider these languages, how weird would you say they were? Lithuanian, Indonesian, Turkish, Basque, and Cantonese. Surprise! They are really low on the Weirdness Index. They don’t seem typical to linguists and language learners but for these 21 features they stick with the crowd. Notice that we get isolates (like Basque) distributed throughout levels of Weirdness. Basque is “typical” but Kutenai, another isolate, is one of the weirdest of all languages. Even more surprising is that Mandarin Chinese is in the top 25 weirdest and Cantonese is in the bottom 10. This has to do with the fact that they have different sounds: Mandarin, unlike Cantonese, has uvular continuants and has some limits on “velar nasals” (like English, Mandarin can have a sound like the “ng” at the end of “song” but it can’t have that sound at the beginning of words; worldwide it’s rare to have that particular restriction).
At the very, very bottom of the Weirdness Index there are two languages you’ve heard of and three you may not have: Hungarian, normally renowned as a linguistic oddball, comes out as totally typical on these dimensions. (I got to live in Budapest last summer and I swear that Hungarian does have weirdnesses, it just hides them other places.) Chamorro (a language of Guam spoken by 95,000 people), Ainu (just a handful of speakers left in Japan, it is nearly extinct), and Purépecha (55,000 speakers, mostly in Mexico) are all very normal. But the very most super-typical, non-deviant language of them all, with a Weirdness Index of only 0.087, is Hindi, which has only a single weird feature.
Part of this is to say that some of the languages you take for granted as being normal (like English, Spanish, or German) consistently do things differently than most of the other languages in the world. It reminds me of one of the basic questions in psychology: to what extent can we generalize from research studies based on university students who are, as Joseph Henrich and his colleagues argue, Western, Educated, Industrialized, Rich, and Democratic? In other words: sometimes the input is WEIRD and you need to ask yourself how that changes things.
Even though the methods here don’t define things in terms of English, they still smuggle in some cultural-specificity. That is, the linguists who developed and annotated the features were mostly speakers of European languages. What features might a person from Papua New Guinea or Ethiopia or the Amazon have come up with instead? And of course, WALS doesn’t have any data at all on about 4,000 languages. And the languages that it has the most data for are not a truly random sample.
Despite this, English still ranks as highly unusual (it comes in as #33 with an index value of 0.756). That English-speaking brain you’ve been using to read this? It’s wired weird.
– Tyler Schnoebelen (@TSchnoebelen)
Here are the values for the top and bottom 10 languages.
Update: Here is the full list, with the 21 weirdness features and all of the languages that had values for at least one of them (don’t trust those values, of course).
Tyler finds the patterns in data that make it meaningful. He has ten years of experience in UX design/research in Silicon Valley and a PhD from Stanford. His work there included experimental psycholinguistics, fieldwork on endangered languages, and a dissertation on emotion (he got his BA at Yale studying playwriting and poetry). His insights on social media have been featured in The New York Times Magazine, The Boston Globe, The Atlantic, and NPR. He is incorrigible.