Collecting the data
Each year the research team relies on a variety of data sources for IDNs: ccTLD data are collected directly from the ccTLD registries, usually through the regional ccTLD organisations; gTLD data are collected via the ICANN open zone file access programme. Results are cross-checked against publicly available information, for example data published by individual registries (such as the statdom.ru site), statistics sites provided by industry players (e.g. www.ntldstats.com) and the regular Domain Name Industry Brief published by Verisign.
The majority of IDNs are registered under country code Top Level Domains, both at the top and second level. Most ccTLDs do not publish their zone files, and most do not publish data for their IDNs. Working in close partnership with the regional ccTLD organisations (CENTR, LACTLD and APTLD), the research team circulates a questionnaire to the ccTLD registries that are members of those organisations. The questionnaire asks for the total number of registrations, and the number of IDNs, at two data points during each year (June and December). It also contains qualitative questions for our annual ‘industry opinions’ survey. The research team follows up with individual registries to resolve any queries or to fill gaps in the data.
We have a high level of confidence in the data provided by industry colleagues in the annual questionnaire. However, there are sometimes inconsistencies, or possible inaccuracies. While the research team makes every effort to follow up with individual registries to resolve suspected inaccuracies or queries, we do not guarantee that the data are 100% accurate.
EURid and the research team are grateful to our friends and colleagues at the regional ccTLD organisations and the individual ccTLD registries who patiently circulate and fill in our questionnaires each year.
Verisign is a key supporter of the IDN World Report, and each year provides the research team with its own in-house research into IDNs at the second level under .com and .net. Over the years, as the base of IDNs has extended into the gTLD space beyond .com and .net, the research team has increasingly relied on its own analysis of the public zone files provided through ICANN’s Central Zone Data Service (CZDS).
Each year, the research team scans the open gTLD zones to garner information on the number of IDNs registered in each period, as well as information such as the presence of active nameservers. Active IDN sites are then visited and analysed using algorithms developed by the research team to identify the usage (including the language of web content) of each site.
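The zone scan described above can be illustrated with a minimal sketch. It is not the research team's tooling. It assumes a zone file with one record per line and fully qualified owner names, treats any label beginning with the standard `xn--` prefix as an IDN, and counts a name as delegated if it has at least one NS record. The function name `count_idns` is hypothetical.

```python
def count_idns(zone_lines, tld):
    """Return (unique IDN names, IDN names with NS records) for one zone.

    Assumes one record per line and fully qualified owner names,
    e.g. 'xn--mnchen-3ya.com. 172800 IN NS ns1.example.net.'
    """
    suffix = "." + tld.lower().rstrip(".") + "."
    idns = set()
    delegated = set()
    for line in zone_lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        fields = line.split()
        owner = fields[0].lower()
        if not owner.endswith(suffix):
            continue
        # The label directly below the TLD is the registered name.
        label = owner[: -len(suffix)].split(".")[-1]
        if not label.startswith("xn--"):  # ACE prefix marks an IDN label
            continue
        idns.add(label)
        # TTL and class are optional, so the record type may sit in
        # any of the three fields after the owner name.
        if "NS" in (f.upper() for f in fields[1:4]):
            delegated.add(label)
    return len(idns), len(delegated)
```

A name with NS records is only a candidate for an active site; whether it serves web content still has to be established by visiting it, as the text describes.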
While the ICANN CZDS has the potential to provide researchers with invaluable access to data, our research team experiences similar challenges to those reported by other researchers who use the CZDS (see transcript at page 91 ff): access to each individual zone file has to be approved by the registry provider, and automatically expires after a given period. While the majority of providers do grant access in a timely way, some do not, and the timing of expiry dates means that a small number of zones may be missed in our analysis of the zone files at any given point.
Analysing the language of web content
Inferring language results for ccTLDs
Each year, the IDN World Report research team reports on the language of web content associated with IDNs. We can report results for gTLDs and .eu, as we have access to the individual zone files for research purposes. This is not possible for most ccTLDs, where the majority of IDNs are registered. Therefore we infer the numbers for ccTLDs from our gTLD analysis.
| Type of registry | Second or top level | Percentage with active website |
|---|---|---|
| ‘Legacy’ gTLDs | Second level | 39% |
| New gTLD | Second level | 45% |
| New gTLD | Top level | 36% |
| .рф (ccTLD) | Top level | 68%** |
The rate of active web content associated with IDNs at the second level ranges from 5% (.vn) to more than 60% (.eu and .es), and at the top level from 36% (IDN new gTLDs) to 68% (.рф, which was the first top level IDN on the market in 2009 and has a relatively high level of maturity compared with new gTLDs at the top level). The high rate of active web content for the Russian domains results from the methodology used by the statdom.ru team. The IDN World Report research team is unable to compare that methodology with our own, so the results may not be consistent with our approach.
In our automated analysis of the language of web content, the research team has observed that false positives and errors arise when there is too little text. We have therefore limited our language analysis to a smaller data sample. We have adapted our methodology in several ways to minimise errors. We sample individual keywords and rank them according to the frequency with which they appear on a given page. Individual keywords are then run through third party automated language translation tools, and this results in the most frequently occurring language being assigned to a group of keywords.
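The keyword approach described above can be sketched as follows. This is an illustration under stated assumptions rather than the team's implementation: `detect` is a hypothetical caller-supplied function standing in for the third party tool, the thresholds `n` and `min_keywords` are invented for the example, and the simple regular-expression tokeniser would not segment scripts that are written without spaces.

```python
import re
from collections import Counter


def top_keywords(text, n=10, min_len=2):
    """Return the n most frequent words on a page, most frequent first."""
    words = re.findall(r"\w+", text.lower())
    words = [w for w in words if len(w) >= min_len and not w.isdigit()]
    return [w for w, _ in Counter(words).most_common(n)]


def assign_language(text, detect, n=10, min_keywords=5):
    """Assign the language detected most often among the top keywords.

    `detect` is a hypothetical callable mapping one keyword to a
    language code. Pages with too few keywords return None, because
    very short texts are where false positives arise.
    """
    keywords = top_keywords(text, n)
    if len(keywords) < min_keywords:
        return None
    votes = Counter(detect(k) for k in keywords)
    return votes.most_common(1)[0][0]
```

For example, with a toy detector that labels `casa`, `perro` and `gato` as Spanish and `the`, `house` as English, a page containing all five keywords is assigned Spanish by three votes to two.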
We have also found that poor rendering of Unicode can lead to inaccuracies with the automated translation tools, often with European languages such as Corsican, Portuguese or Danish being reported in place of Chinese or Japanese. Results are spot-checked for accuracy, and anomalies are investigated further by rerunning the analysis with a fix for poor rendering of Unicode. This reduces the number of anomalies, but does not eliminate them. For example, we suspect that the Han script IDNs identified as having Ukrainian language content are examples of poor Unicode rendering.
Accurate analysis of Japanese language is challenging as the language uses a mixture of Han, Katakana and Hiragana scripts. Some words appear entirely in Han script (also associated with Chinese language), and this results in inaccurate attribution of Chinese language to Japanese language websites. In these cases, we group together the keywords for each site, and run them as a group against the automated translation tools. This all but eliminated the problem of Japanese language sites being identified as Chinese language.
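To illustrate why grouping helps, the sketch below uses a different and much simpler technique than the team's translation tools: it inspects Unicode character names to see which scripts are present. A single keyword written only in Han characters is ambiguous, whereas a group of keywords from the same site that also contains Hiragana or Katakana points to Japanese. The function names and the return labels (`"ja"`, `"han-only"`) are assumptions made for this example.

```python
import unicodedata


def scripts_in(text):
    """Return the subset of {Han, Hiragana, Katakana} present in text."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("Han")
        elif name.startswith("HIRAGANA"):
            found.add("Hiragana")
        elif name.startswith("KATAKANA"):
            found.add("Katakana")
    return found


def guess_cjk_language(keywords):
    """Classify a group of keywords by the scripts they contain."""
    scripts = scripts_in("".join(keywords))
    if scripts & {"Hiragana", "Katakana"}:
        return "ja"        # kana only occur in Japanese text
    if "Han" in scripts:
        return "han-only"  # could be Chinese or Japanese
    return None


# One keyword alone is ambiguous; the site's keywords together are not.
single = guess_cjk_language(["東京"])
grouped = guess_cjk_language(["東京", "大学", "ホテル"])
```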
The script of an IDN seems to affect usage rates, with Han and Arabic showing lower levels of active web content than Latin and the combined Han, Katakana and Hiragana scripts (associated with Japanese language). Therefore, when inferring usage rates for ccTLD IDNs, we applied the following rules:
• Use actual data where available (.eu, .es*, .vn, .рф). This accounts for 1.95 million IDNs, or 31% of the IDNs in ccTLDs (both at second and top level).
• Assume an active website rate of 40% for IDNs (whether at top or second level)
– Discount by 20% for Han script
– Discount by 25% for right to left scripts (Arabic, Hebrew)
* Data from the 2015 IDN World Report.
** data from the statdom.ru site. The methodology for assessing usage has changed this year, and it is difficult to make like-for-like comparisons. The figure shown is the total of the categories ‘website’, ‘web-app/one page site’, and ‘web redirect’.
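The inference rules listed above can be expressed as a short worked sketch. One point is an assumption: the report does not say whether the discounts are relative or in percentage points, and the sketch treats them as relative, so that Han gives 40% × (1 − 0.20) = 32% and right to left scripts give 40% × (1 − 0.25) = 30%. The function and constant names are hypothetical.

```python
BASE_RATE = 0.40  # assumed active website rate where no actual data exist

# Relative discounts by script (interpretation assumed, see lead-in).
DISCOUNTS = {"Han": 0.20, "Arabic": 0.25, "Hebrew": 0.25}


def inferred_active_rate(script, actual_rate=None):
    """Return the active website rate to apply to a ccTLD's IDNs.

    Actual data, where available, always take precedence over the
    inferred rate.
    """
    if actual_rate is not None:
        return actual_rate
    return BASE_RATE * (1 - DISCOUNTS.get(script, 0.0))


def estimated_active_sites(idn_count, script, actual_rate=None):
    """Estimate the number of IDNs with active web content."""
    return round(idn_count * inferred_active_rate(script, actual_rate))
```

Under these assumptions, a hypothetical ccTLD with 1,000 Han script IDNs and no published usage data would be estimated to have 320 active sites, while a registry with an actual measured rate of 68% would be counted at 680.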