Methodology for world report on IDNs

How we collect and analyse data for the IDN World Report

Collecting the data

Each year the research team relies on a variety of data sources for IDNs: ccTLD data are collected directly from the ccTLD registries, both through the regional ccTLD organisations and via direct contact with the individual registries; gTLD data are collected via the ICANN open zone file access programme; .eu IDN data are provided directly to the research team by EURid. Results are cross-checked against publicly available information, for example data published by individual registries, statistics sites provided by industry players, and the regular Domain Name Industry Brief published by Verisign.

ccTLD data

The majority of IDNs are registered under country code Top Level Domains, both at the top and second level. Most ccTLDs do not publish their zone files, and most do not publish data for their IDNs. Working in close partnership with the regional ccTLD organisations – CENTR, LACTLD and APTLD – the research team circulates a questionnaire to the ccTLD registries that are members of those regional organisations. The questionnaire asks for the total number of registrations and the number of IDNs at two points in each year (June and December). It also contains qualitative questions for our annual ‘industry opinions’ survey. The research team follows up with individual registries to resolve any queries or to fill gaps in the data.

We have a high level of confidence in the data provided by industry colleagues in the annual questionnaire. However, there are sometimes inconsistencies, or possible inaccuracies. While the research team makes every effort to follow up with individual registries to resolve suspected inaccuracies or queries, we do not guarantee that the data are 100% accurate.

EURid and the research team are grateful to our friends and colleagues at the regional ccTLD organisations and the individual ccTLD registries who patiently circulate and fill in our questionnaires each year.

gTLD data

Each year, the research team scans the open gTLD zones, using ICANN’s Centralized Zone Data Service (CZDS), to gather information on the number of IDNs registered in each period, as well as information such as the presence of active name servers. Active IDN sites are then visited and analysed using algorithms developed by the research team to identify the usage (including the language of web content) of each site.
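As an illustration only, the zone-scanning step could be sketched in Python as follows. This is not the research team's code. It assumes a simplified zone file in which every record sits on one line in the form "owner TTL class type rdata" with fully qualified owner names, and it relies on the fact that IDN labels are stored in their ASCII form with the prefix "xn--". The function name count_idns and the sample records are hypothetical.

```python
def count_idns(zone_lines):
    """Return (delegated names, delegated IDNs) found in simplified zone lines."""
    all_names = set()
    idn_names = set()
    for line in zone_lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        fields = line.split()
        # Assumed layout: owner TTL class type rdata
        if len(fields) < 5 or fields[3].upper() != "NS":
            continue  # only NS records show a name with name servers
        labels = fields[0].rstrip(".").lower().split(".")
        if len(labels) < 2:
            continue  # the TLD apex itself, not a registration
        name = ".".join(labels[-2:])  # second-level name under the TLD
        all_names.add(name)
        # An IDN has an "xn--" label at the second level or the top level
        if any(label.startswith("xn--") for label in labels[-2:]):
            idn_names.add(name)
    return len(all_names), len(idn_names)


# Hypothetical zone extract, for demonstration only
SAMPLE = """\
; sample records
example.com. 172800 IN NS ns1.example.net.
example.com. 172800 IN NS ns2.example.net.
xn--bcher-kva.com. 172800 IN NS ns1.example.net.
xn--caf-dma.com. 172800 IN NS ns1.example.org.
com. 900 IN SOA a.example. b.example. 1 2 3 4 5
""".splitlines()

total, idns = count_idns(SAMPLE)
print(f"{idns} IDNs out of {total} delegated names")
```

Counting distinct owner names rather than lines avoids double counting names that list several name servers. Real zone files can use relative names, omitted fields and multi-line records, which this sketch does not handle.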

While the ICANN CZDS service has the potential to provide researchers with invaluable access to data, our research team experiences similar challenges to those reported by other researchers who use the CZDS (see transcript at page 91 ff) – access to each individual zone file has to be approved by the registry provider, and automatically expires following a given period. While the majority of providers do grant access in a timely way, some do not, and the timing of expiry dates means that a small number of zones may be missed in our analysis of the zone files at any given point.

Identifying low quality content – parking pages

The research team followed the same methodology to identify low quality content as set out in our study for CENTR in 2019:

Check for parking hints:

• number of internal links (identifying single-page sites)

• content siblings (ie identical content)

• redirection siblings, where many domains redirect to the same domain name

• whether the number of content words is greater than 50

• whether the language of web content is Latin (‘lorem ipsum’… placeholder text)
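The parking hints above can be expressed as a set of simple checks. The Python sketch below is a hypothetical illustration, not the implementation used in the CENTR study. The Page fields, the hint labels and the redirect_sibling_min threshold are assumptions, and the sketch reads the 50-word criterion as flagging pages that do not exceed 50 content words.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    domain: str
    internal_links: int             # links pointing within the same site
    content_words: int              # number of words in the visible content
    content_hash: str               # fingerprint of the page body
    redirect_target: Optional[str]  # domain redirected to, if any
    text: str


MIN_WORDS = 50  # threshold named in the methodology


def parking_hints(pages, redirect_sibling_min=2):
    """Map each domain to the list of parking hints it triggers.

    redirect_sibling_min is a hypothetical cut-off for "many domains".
    """
    hash_counts = Counter(p.content_hash for p in pages)
    redirect_counts = Counter(p.redirect_target for p in pages if p.redirect_target)
    result = {}
    for p in pages:
        hints = []
        if p.internal_links == 0:
            hints.append("single_page")
        if hash_counts[p.content_hash] > 1:
            hints.append("content_sibling")
        if p.redirect_target and redirect_counts[p.redirect_target] >= redirect_sibling_min:
            hints.append("redirect_sibling")
        if p.content_words <= MIN_WORDS:  # assumed reading of the 50-word check
            hints.append("thin_content")
        if "lorem ipsum" in p.text.lower():
            hints.append("placeholder_text")
        result[p.domain] = hints
    return result
```

How many hints are needed before a site is treated as a parking page is a separate decision that the sketch leaves open.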

Analysing the language of web content

Inferring language results for ccTLDs

Each year, the IDN World Report research team reports on the language of web content associated with IDNs. We can report results for gTLDs and .eu, as we have access to the individual zone files for research purposes. This is not possible for most ccTLDs, where the majority of IDNs are registered. Where there is published ccTLD data (such as the CENTR study referred to earlier), we have relied on those published data. Otherwise, we infer the numbers for ccTLDs from our gTLD analysis.

In our automated analysis of the language of web content, the research team has observed that false positives and errors arise when there is too little text. We have adapted our methodology in several ways to minimise these errors. We first eliminate domains with no active services, and then those domains identified as having low quality content (parking pages). For the remaining sites, we sample individual keywords and rank them according to the frequency with which they appear on a given page. The keywords are then run through third party automated language translation tools, and the most frequently occurring language is assigned to the group of keywords.
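The keyword ranking and language assignment described above can be sketched as a frequency count followed by a majority vote. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, and detect is a placeholder for whichever third party tool is used, here replaced by a toy dictionary lookup.

```python
import re
from collections import Counter


def top_keywords(text, n=10, min_len=2):
    """Rank the words on a page by frequency and return the top n."""
    # \w+ splits on whitespace and punctuation; scripts written without
    # spaces (such as Chinese) would need a proper tokeniser instead.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_len and not w.isdigit())
    return [w for w, _ in counts.most_common(n)]


def assign_language(keywords, detect):
    """Assign the language that detect() returns most often for the keywords.

    detect is a hypothetical callable: keyword -> language code or None.
    """
    votes = Counter(detect(k) for k in keywords)
    votes.pop(None, None)  # ignore keywords the tool could not classify
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy stand-in for a third party detection tool
TOY_LEXICON = {"haus": "de", "garten": "de", "house": "en"}

page_text = "Haus Haus Haus Garten Garten house"
keywords = top_keywords(page_text, n=3)
print(keywords, assign_language(keywords, TOY_LEXICON.get))
```

Voting over a group of keywords makes the result less sensitive to a single misclassified word, which matters most on pages with little text.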

We have found that poor rendering of Unicode can lead to inaccuracies with the automated translation tools, often with European languages such as Corsican, Portuguese or Danish reported in place of Chinese or Japanese. Results are spot-checked for accuracy, and anomalies are investigated further by rerunning the analysis with a fix for poor rendering of Unicode. This reduced the number of anomalies, but did not eliminate them.

Accurate analysis of Japanese language is challenging as the language uses a mixture of Han, Katakana and Hiragana scripts. Some words appear entirely in Han script (also associated with Chinese language), and this results in inaccurate attribution of Chinese language to Japanese language websites. In these cases, we group together the keywords for each site, and run them as a group against the automated translation tools. This all but eliminated the problem of Japanese language sites being identified as Chinese language. Finally, we undertake manual spot checks and eliminate obviously incorrect results.
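A small Python example shows why grouping keywords helps. It uses a script heuristic based on Unicode character names, which is not the translation-tool approach described above and is offered only as an illustration. The function names and sample keywords are hypothetical.

```python
import unicodedata


def scripts_in(text):
    """Return which of Han, Hiragana and Katakana appear in the text."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("Han")
        elif name.startswith("HIRAGANA"):
            found.add("Hiragana")
        elif name.startswith("KATAKANA"):
            found.add("Katakana")
    return found


def looks_japanese(keywords):
    """Judge the keywords as one group: any kana points to Japanese."""
    grouped = " ".join(keywords)
    return bool(scripts_in(grouped) & {"Hiragana", "Katakana"})


site_keywords = ["東京", "大学", "ホテル", "について"]

# Taken alone, a Han-only keyword is ambiguous between Chinese and Japanese
print(scripts_in(site_keywords[0]))
# Taken together, the kana in the other keywords settle the question
print(looks_japanese(site_keywords))
```

A site whose keywords are all in Han script stays ambiguous under this heuristic, which is one reason manual spot checks remain useful.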

The script of an IDN seems to affect usage rates, with Han and Arabic showing lower levels of active web content than Latin or the combined Han, Katakana and Hiragana scripts (associated with Japanese language). However, ccTLD IDNs tend to have a higher rate of quality content than their gTLD counterparts. When inferring usage rates for ccTLD IDNs, we applied the following rules:

• Use published data where available.

• Assume an active website rate of 40% for IDNs (whether top or second level).

• Discount by 20% for Han script.

• Discount by 25% for right to left scripts (Arabic, Hebrew).
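The rules above can be applied as a short calculation. The Python sketch below is an illustration under one important assumption: the text does not say whether the discounts are relative or in percentage points, and the sketch treats them as relative reductions of the 40% base rate (so Han becomes 40% × 0.80 = 32%, and right to left scripts become 40% × 0.75 = 30%). The function names and the example count of 10,000 IDNs are hypothetical.

```python
BASE_ACTIVE_RATE = 0.40  # assumed active website rate for IDNs

# Discounts by script; applied here as relative reductions (an assumption)
SCRIPT_DISCOUNT = {"Han": 0.20, "Arabic": 0.25, "Hebrew": 0.25}


def inferred_active_rate(script):
    """Base rate reduced by the script discount, if one applies."""
    discount = SCRIPT_DISCOUNT.get(script, 0.0)
    return BASE_ACTIVE_RATE * (1.0 - discount)


def inferred_active_sites(idn_count, script, published_rate=None):
    """Estimate active sites, preferring a published rate where available."""
    if published_rate is not None:
        rate = published_rate
    else:
        rate = inferred_active_rate(script)
    return round(idn_count * rate)


# Hypothetical registry with 10,000 IDNs
for script in ("Latin", "Han", "Arabic"):
    print(script, inferred_active_sites(10_000, script))
```

If the discounts were instead meant as percentage points, the Han rate would be 20% rather than 32%, so the interpretation changes the estimates materially.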