Language of web content

The chart above shows the language of web content (along the x axis) and, for each website, records the script of the IDN. The scripts are shown in different colours, as set out in the chart’s key.

Every year since 2013, we have reviewed the language of web content associated with IDNs to see whether it correlates with the script of the domain name. We might plausibly have observed a random pattern in the evidence, ie no strong correlation between domain name script/language and the language of content. With a strong correlation, however, we would expect a Cyrillic script domain to lead to web content in Russian, Bulgarian or Ukrainian, an Arabic script domain to lead to web content in Arabic or Persian, a Han script domain to lead to content in Chinese, and so on.
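As an illustration of what recording the script of a domain name can involve, the sketch below assigns a script to a single domain label from the Unicode names of its characters. It is a minimal heuristic under stated assumptions, not the method used for this analysis: the mapping table and the function name dominant_script are hypothetical, and the first word of a Unicode character name is only a rough stand-in for the formal Unicode Script property.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a script label. This is an assumption for illustration, not the
# classification used in the report.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
    "GREEK": "Greek",
}


def dominant_script(label: str) -> str:
    """Return the most frequent script among the letters of one domain label."""
    counts = Counter()
    for ch in label:
        if not ch.isalpha():
            continue  # ignore digits, hyphens and dots
        prefix = unicodedata.name(ch, "").split(" ")[0]
        counts[NAME_PREFIX_TO_SCRIPT.get(prefix, "Other")] += 1
    if not counts:
        return "Unknown"  # label contained no letters at all
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for label in ["example", "пример", "مثال", "例子"]:
        print(label, dominant_script(label))
```

A production classifier would need to handle mixed-script labels and characters shared between scripts, which this sketch reduces to a simple majority count.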

As in previous years, our analysis of the web content of IDNs (gTLDs plus .eu) found that the relationship between language of web content and IDN script is not random. There is a very high correlation between language of web content and the script of IDN associated with it. In other words, IDNs are accurate predictors of the language in which their web content appears. Only English, which is commonly spoken around the world, is associated with a large number of scripts (Latin, Arabic, Cyrillic, Han, Katakana, Hiragana, Hangul, Greek, and others), and displays the more random pattern predicted in the “no connection” hypothesis.

The use of automated translation tools, and poor rendering of Unicode content in some cases, can lead to some anomalies. The steps taken to deal with these are set out in more detail in the methodology.

The analysis works from the language of web content up to the script, and it does not necessarily follow that the reverse is true, ie that IDN script will accurately predict the language of associated web content. However, the strength of the correlation between language of web content and script of IDN can help us infer the language of web content of IDNs in ccTLDs for which we do not have access to the zone files.
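The difference between the two directions of inference can be made concrete with a small cross-tabulation. The sketch below computes the share of scripts for each language and, separately, the share of languages for each script. The observations are invented purely for illustration and are not figures from this analysis; the function name conditional_shares is likewise hypothetical.

```python
from collections import Counter

# Hypothetical (script, language) observations, one per website.
# Invented for illustration only; not data from the report.
observations = [
    ("Cyrillic", "Russian"), ("Cyrillic", "Russian"),
    ("Cyrillic", "Ukrainian"), ("Cyrillic", "English"),
    ("Arabic", "Arabic"), ("Arabic", "Persian"), ("Arabic", "English"),
    ("Han", "Chinese"), ("Han", "Chinese"), ("Han", "English"),
]


def conditional_shares(pairs, given_index):
    """Return {given: {other: share}}, conditioning on element given_index."""
    other_index = 1 - given_index
    totals = Counter(p[given_index] for p in pairs)
    joint = Counter((p[given_index], p[other_index]) for p in pairs)
    shares = {}
    for (given, other), n in joint.items():
        shares.setdefault(given, {})[other] = n / totals[given]
    return shares


# Direction used in the analysis: from language of content up to script.
script_given_language = conditional_shares(observations, given_index=1)
# Reverse direction: from script of the IDN to language of content.
language_given_script = conditional_shares(observations, given_index=0)

if __name__ == "__main__":
    print(script_given_language["Russian"])
    print(language_given_script["Cyrillic"])
    print(script_given_language["English"])
```

In this toy data every Russian-language site sits on a Cyrillic IDN, yet only half of the Cyrillic IDNs carry Russian content, which is why a strong result in one direction does not automatically carry over to the other.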

ccTLDs are strongly anchored to individual countries and territories, and IDN deployment (in almost every case) closely matches the requirements of the languages spoken in the ccTLD’s country or territory. Once we have estimated the percentage of active web content (see above), we can therefore predict that the language of those sites will reflect those same languages. One exception is the English language, which represents approximately 10% of the language of web content in the IDNs we have analysed.