Language of web content

Every year since 2013, we have reviewed the language of web content associated with IDNs, to see whether it correlates with the script of the domain name. If there were no connection, we would expect to observe a random pattern in the evidence, ie no strong correlation between the script of a domain name and the language of its content. With a strong correlation, we would instead expect a Cyrillic script domain to lead to web content in Russian, Bulgarian or Ukrainian, an Arabic script domain to lead to web content in Arabic or Persian, a Han script domain to lead to content in Chinese, and so on.

As in previous years, our analysis of the web content of IDNs (gTLDs plus .eu) found that the relationship between language of web content and IDN script is not random. There is a near-perfect correlation between the language of web content and the script of the IDN associated with it. In other words, IDNs are in practice accurate predictors of the language in which their web content appears. Only English, which is commonly spoken around the world, and (to a lesser extent) Portuguese are associated with a large number of scripts (Latin, Arabic, Cyrillic, Han, Katakana, Hiragana, Hangul, Greek, and others), and display the more random pattern predicted by the “no connection” hypothesis.
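The kind of tabulation behind this finding can be sketched as follows. This is a minimal illustration, not the method used for the report: the function name script_share_by_language and the sample observations are invented for the example, and it assumes each analysed site has already been reduced to a (language, script) pair.

```python
from collections import Counter, defaultdict


def script_share_by_language(pairs):
    """Return {language: {script: share}} from (language, script) pairs."""
    counts = defaultdict(Counter)
    for language, script in pairs:
        counts[language][script] += 1

    shares = {}
    for language, by_script in counts.items():
        total = sum(by_script.values())
        shares[language] = {s: n / total for s, n in by_script.items()}
    return shares


# Invented toy observations, NOT figures from the report.
sample = [
    ("Russian", "Cyrillic"), ("Russian", "Cyrillic"), ("Russian", "Cyrillic"),
    ("Chinese", "Han"), ("Chinese", "Han"),
    ("English", "Latin"), ("English", "Han"),
    ("English", "Cyrillic"), ("English", "Arabic"),
]

shares = script_share_by_language(sample)
for language, by_script in shares.items():
    # The script most often seen with this language, and its share.
    top_script, top_share = max(by_script.items(), key=lambda kv: kv[1])
    print(f"{language}: most common script {top_script} ({top_share:.0%})")
```

In a table like this, a language that is tightly bound to one script shows a single share close to 100%, whereas a language such as English, spread across many scripts, shows several smaller shares.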

The use of automated translation tools can lead to some false positives, particularly with Norwegian and Greek. We found that Norwegian was wrongly identified as the language of web content in place of Chinese, Japanese and Korean. We also found that Greek was wrongly identified in place of Korean. Spot-checking and tightening of our algorithm eliminated most of these false positives. The results for Greek, although more than 80% are associated with Greek script domain names, may still overstate the instances of Greek appearing with other scripts.
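One plausible way to catch false positives of this kind is to compare the detected language with the characters that actually appear on the page. The sketch below is an assumption about how such a check could work, not a description of the algorithm used for the report: the EXPECTED_PREFIXES table, the function names and the 0.5 threshold are all hypothetical.

```python
import unicodedata

# Hypothetical mapping from a detected language to the Unicode character-name
# prefixes we would expect to dominate text written in that language.
EXPECTED_PREFIXES = {
    "Norwegian": ("LATIN",),
    "Greek": ("GREEK",),
    "Korean": ("HANGUL",),
    "Chinese": ("CJK",),
    "Japanese": ("HIRAGANA", "KATAKANA", "CJK"),
}


def expected_script_share(text, prefixes):
    """Share of alphabetic characters whose Unicode name starts with a prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(prefixes)
    )
    return hits / len(letters)


def looks_like_false_positive(text, detected_language, threshold=0.5):
    """Flag a detection when too little of the text is in the expected script."""
    prefixes = EXPECTED_PREFIXES.get(detected_language)
    if prefixes is None:
        return False  # no expectation recorded, so nothing to check
    return expected_script_share(text, prefixes) < threshold


print(looks_like_false_positive("中文网站内容", "Norwegian"))
print(looks_like_false_positive("Dette er norsk tekst", "Norwegian"))
```

A check like this only tests whether the script is compatible with the detected language. It cannot tell apart languages that share a script, so it complements manual spot-checking rather than replacing it.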

The analysis works from the language of web content up to the script, and it does not necessarily follow that the reverse is true, ie that IDN script will accurately predict the language of associated web content. However, the strength of the correlation between language of web content and script of IDN can help us infer the language of web content of IDNs in ccTLDs for which we do not have access to the zone files.
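The difference between the two directions of inference can be made concrete with a small worked example. The counts below are invented purely for illustration and are not figures from the report; the function names are likewise hypothetical.

```python
from collections import Counter

# Invented toy counts of (language, script) pairs, NOT figures from the report.
counts = Counter({
    ("Russian", "Cyrillic"): 60,
    ("Bulgarian", "Cyrillic"): 25,
    ("Ukrainian", "Cyrillic"): 10,
    ("English", "Cyrillic"): 5,
    ("English", "Latin"): 40,
})


def share_of_script_given_language(counts, language, script):
    """Of all sites in this language, what share sit under this script?"""
    total = sum(n for (lang, _), n in counts.items() if lang == language)
    return counts[(language, script)] / total if total else 0.0


def share_of_language_given_script(counts, language, script):
    """Of all sites under this script, what share are in this language?"""
    total = sum(n for (_, scr), n in counts.items() if scr == script)
    return counts[(language, script)] / total if total else 0.0


language_to_script = share_of_script_given_language(counts, "Bulgarian", "Cyrillic")
script_to_language = share_of_language_given_script(counts, "Bulgarian", "Cyrillic")
print(f"Bulgarian content under Cyrillic script: {language_to_script:.0%}")
print(f"Cyrillic script domains with Bulgarian content: {script_to_language:.0%}")
```

In this toy data every Bulgarian-language site sits under a Cyrillic domain, yet only a quarter of Cyrillic domains carry Bulgarian content. The script narrows the field to a small set of candidate languages without identifying a single one.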

ccTLDs are strongly anchored to individual countries and territories, and their IDN deployment (in almost every case) closely matches the requirements of the languages spoken in the ccTLD’s country or territory. For these ccTLDs we can therefore predict that, where we have estimated the percentage of active web content (see above), the content of those sites will be in those same languages. One exception is English, which accounts for approximately 10% of the web content in the IDNs we have analysed.