IDNs are accurate predictors of the language in which their web content appears. There is a very high correlation between the language of web content and the script of the IDN associated with it. Only English, which is commonly spoken around the world, is associated with a large number of scripts (Latin, Arabic, Cyrillic, Han, Katakana, Hiragana, Hangul, Greek, and others) and displays the more random pattern predicted by the “no connection” hypothesis.
The analysis works from the language of web content up to the script, and it does not necessarily follow that the reverse is true, i.e. that the script of an IDN will accurately predict the language of its associated web content. However, the correlation between the language of web content and the script of the IDN is strong enough to help us infer the language of web content for IDNs in ccTLDs whose zone files we cannot access.
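The difference between the two directions of inference can be illustrated with a small cross-tabulation. The sketch below is only an illustration: the `SAMPLE` pairs and the `conditional` helper are hypothetical and are not data or tooling from this analysis. It computes the share of scripts for each language (the direction the analysis works in) and the share of languages for each script (the reverse direction).

```python
from collections import Counter, defaultdict

# Hypothetical (language of web content, script of IDN) observations.
# These pairs are invented for illustration; they are not study data.
SAMPLE = [
    ("Russian", "Cyrillic"), ("Russian", "Cyrillic"), ("Russian", "Cyrillic"),
    ("Chinese", "Han"), ("Chinese", "Han"),
    ("English", "Cyrillic"), ("English", "Han"), ("English", "Latin"),
]


def conditional(pairs, given_index):
    """Return the distribution of the other field, conditioned on the field
    at position given_index (0 = language, 1 = script)."""
    counts = defaultdict(Counter)
    for pair in pairs:
        given = pair[given_index]
        other = pair[1 - given_index]
        counts[given][other] += 1
    return {
        given: {other: n / sum(c.values()) for other, n in c.items()}
        for given, c in counts.items()
    }


# Direction used in the analysis: from language to script.
script_given_language = conditional(SAMPLE, 0)
# Reverse direction: from script to language.
language_given_script = conditional(SAMPLE, 1)

for language, scripts in script_given_language.items():
    print(language, scripts)
for script, languages in language_given_script.items():
    print(script, languages)
```

In this toy sample every Russian-language site sits under a Cyrillic IDN, yet not every Cyrillic IDN carries Russian-language content, because English appears across several scripts. That asymmetry is why the reverse inference has to be treated as an estimate.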
ccTLDs are strongly anchored to individual countries and territories, and in almost every case their IDN deployment closely matches the requirements of the languages spoken in the ccTLD’s country or territory. For these ccTLDs, once we have estimated the percentage of active web content, we can predict that the language of those sites will reflect the same languages. One exception is English, which represents approximately 10% of the language of web content in the IDNs we have analysed.
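A rough sketch of this inference for a single ccTLD is shown below. It rests on several assumptions: the function name `estimate_language_counts`, its parameters, and the example figures are hypothetical, and applying the approximate 10% English share uniformly to every ccTLD is a simplification made for illustration rather than a method stated in this analysis.

```python
def estimate_language_counts(idn_count, active_share, local_language_shares,
                             english_share=0.10):
    """Estimate how many active IDN sites use each language in one ccTLD.

    idn_count: number of IDNs registered under the ccTLD (hypothetical input).
    active_share: estimated fraction of those IDNs with active web content.
    local_language_shares: relative weights of the languages that the
        ccTLD's IDN deployment supports, e.g. {"Arabic": 1.0}.
    english_share: fraction of active content assumed to be in English;
        the 0.10 default is an assumed uniform application of the
        approximate figure quoted in the text.
    """
    if not 0 <= active_share <= 1:
        raise ValueError("active_share must be between 0 and 1")
    if not 0 <= english_share <= 1:
        raise ValueError("english_share must be between 0 and 1")
    total_weight = sum(local_language_shares.values())
    if total_weight <= 0:
        raise ValueError("local_language_shares must have a positive total")

    active = idn_count * active_share
    non_english = active * (1 - english_share)

    # Split the non-English portion across local languages by weight.
    estimate = {
        language: non_english * weight / total_weight
        for language, weight in local_language_shares.items()
    }
    # Add the English portion on top of any local English weight.
    estimate["English"] = estimate.get("English", 0.0) + active * english_share
    return estimate


# Hypothetical ccTLD: 10,000 IDNs, half of them with active web content,
# with deployment matching a single local language.
example = estimate_language_counts(10_000, 0.5, {"Arabic": 1.0})
print(example)
```

The estimate is only as good as its inputs: the active share has to come from a separate measurement, and the language weights stand in for the languages that the ccTLD's IDN deployment supports.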