Long range language comparison

As the language trees generated by the eLinguistics.net system show, using the set of 18 carefully selected basic words together with a subset of universal sound correspondence rules delivers very good results in short and medium range language classification. The results are in line with existing classifications obtained with older, more classical methods. A further point of interest is whether the system can be used to test existing long range language classification hypotheses.
Long range comparisons are always delicate: the phylogenetic signal is weaker, while the interference of chance is higher. The system identifies a series of possible long range relationships with a reasonable probability that the results are not due to chance. The most remarkable are:
- Linking the Uralic and Altaic families (Uralo-Altaic hypothesis): strong evidence is found in the various pairwise results and in the language trees. But a similar level of evidence links Uralic with Indo-European, suggesting that an ancient common family containing the Indo-European, Uralic and Altaic languages may have existed. This idea is found in both the Nostratic and the Eurasiatic hypotheses.
- Linking the Cushitic and Chadic languages with the Semitic group (Afro-Asiatic hypothesis): this link is now by far the most widely accepted classification. At the pairwise comparison level, this link is not always visible in eLinguistics, but comparisons such as Bole to Arabic and Amharic to Oromo give clear results, as does the phylogenetic classification.
Limits of the system:
In long range language comparisons, the results are statistically less robust. Identifying long range cognates with a probability as low as 1% or even 0.1% of being due to chance sounds encouraging, but we carry out thousands of comparisons and therefore have a very high probability of wrongly identifying unrelated languages as related. The most extreme cases in the system are:
- Georgian to Oroquen - genetic proximity: 65, with a probability of 0.5% of obtaining this or a lower result by chance.
- Korean to Tamil - genetic proximity: 73%, with a probability of 5.9% of obtaining this or a lower result by chance.
Both pairwise comparisons are misleading and identify only "chance cognates".
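The multiple-comparison effect described above can be illustrated with a short calculation. This is only a minimal sketch, not part of the eLinguistics system: it assumes that all pairwise comparisons are statistically independent (a simplification), and the figures of 1000 comparisons and 100 languages are hypothetical values chosen for illustration. The function names are likewise invented for this example.

```python
import math


def number_of_pairs(n_languages: int) -> int:
    """Number of distinct language pairs among n_languages."""
    return math.comb(n_languages, 2)


def expected_false_positives(p_chance: float, n_comparisons: int) -> float:
    """Expected number of unrelated pairs that pass the threshold by chance."""
    return p_chance * n_comparisons


def prob_at_least_one_false_positive(p_chance: float, n_comparisons: int) -> float:
    """Probability that at least one unrelated pair passes the threshold,
    assuming (as a simplification) independent comparisons."""
    return 1.0 - (1.0 - p_chance) ** n_comparisons


if __name__ == "__main__":
    # Hypothetical example: 100 languages give 4950 possible pairs.
    print("pairs among 100 languages:", number_of_pairs(100))

    # Hypothetical example: 1000 comparisons at the two thresholds from the text.
    for p in (0.01, 0.001):
        expected = expected_false_positives(p, 1000)
        at_least_one = prob_at_least_one_false_positive(p, 1000)
        print(f"p = {p}: expected false positives = {expected:.1f}, "
              f"P(at least one) = {at_least_one:.4f}")
```

Under these assumptions, even a 0.1% threshold is expected to yield about one chance match per thousand comparisons, which is why isolated results such as the two pairs above cannot be taken as evidence of relatedness on their own.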
Conversely, some hypotheses for which other methods deliver strong evidence of relatedness cannot be identified with eLinguistics (or are identified only with too high a level of chance interference). The most important one is:
- The Japanese - Korean relatedness, which does not appear in this study. A look at the lexical comparison between Japanese and Korean shows a few matches, but all of them are clearly due to chance.
A specific limit is reached in very short range comparisons: the signal is too coarse-grained for highly related languages. When languages within a group differ by only one or two slight sound changes, relatedness is clearly proven, but there are too few signals for a reliable quantitative differentiation; more signals would be needed.