Evolutionary Tree of languages

This language evolutionary tree is generated fully automatically on the basis of 18 basic vocabulary items and a simplified sound correspondence point system.

For interactive language trees with diachronic interpolation, click here: Indo-European language tree | Afro-Asiatic language tree

For the computer-generated tree (raw data) with all languages, read further:
The farther left two branches are linked together, the more remote the languages are from each other. The distances represented in the language evolutionary tree have to be counted in both directions, so on the scale, branches joined at level 35 reflect a genetic distance of 70. The topology of this language evolutionary tree shows the language families and sub-families that are known from other, more complex methods, which is a strong sign that the methodology is sound.

Language evolutionary tree

Notes on the tree (observations about languages the system doesn't place correctly):
(1) Since the system is designed to react to signals that link languages over thousands of years, the sub-classifications within the sub-families are sometimes not exact - especially if the languages are very close to each other. Examples on the tree are Slovene, Russian and Tahitian. These languages get classified in their right sub-family, but at the wrong place within this sub-family.
(2) Sanskrit can in some way be regarded as a near proto-language for the Indo-Aryan family. It gets classified within the Iranian family, in the same area as Avestan, itself one of the oldest Iranian languages. This is because Sanskrit is closer to its relative of thousands of years ago than to its descendants (Avestan to Vedic Sanskrit: distance = 37.6; Hindi to Vedic Sanskrit: distance = 54.4). This is a limitation of the tree representation.
(3) The Uralic branch is here clearly linked to the Altaic one, which used to be a widely accepted hypothesis. Today, many scholars reject this classification, while others argue for a broader grouping in which Uralic, Altaic, Indo-European and other families share a common origin (see source). In eLinguistics, the relatedness signals between the Uralic and the Altaic languages are quite strong: pairwise comparisons such as Finnish to Kazakh and Hungarian to Mongolian show very strong relatedness signals with a very low p-value (low probability that the results are due to chance).
The eLinguistics results alone will not restore the Uralo-Altaic hypothesis, but they contribute interesting facts to the discussion of this issue.
The Tungusic family is isolated (the branch linking it to the Altaic family is too far left, in the "statistical noise area"), although quite strong relationships appear in queries for the genetic proximity between Oroqen and both Kalmyk and Mongolian.
(4) Korean gets slight signals of relatedness to the Dravidian family. This hypothesis has been formulated by various scholars at different times. For more information, see the Dravido-Korean language article in Wikipedia.
(5) Basque and Sumerian, and to a lesser extent Basque and Chechen, receive distant signals in this system. A connection between Basque and the North Caucasian languages (e.g. Basque / North Caucasian lexical matches by G. Starostin), as well as with Sumerian (Dené-Caucasian hypothesis), is documented by some scholars. The results in this study are a possible clue but not a proof (p-value = 0.03).
(6) The Cushitic sub-family should be linked more clearly to the Semitic, Egyptian (ancient Egyptian and Coptic), Berber and Chadic ones (Afroasiatic macro family). Although pairwise results between these languages do show a very probable genetic relationship, chance results link Cushitic at a distant degree to other families so that the tree program receives contradictory signals and almost isolates this branch.
Other remarks:
- Latin, as the proto-language of the Romance sub-family, should be linked more centrally in the Romance sub-tree.
- Sub-classifications within the sub-families are not always exact when languages are very close to each other. Examples on the tree are Slovene and Russian.
- Armenian gets classified at the very edge of the Indo-European family in this system. Armenian stands as one of the most isolated IE languages, which is also reflected here.
- Khowar is also a language which doesn't find its right place in this system. Its classification should be the Dardic branch of Indo-Aryan.
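
Note (2) can be illustrated with a minimal sketch. It only uses the two distances quoted in that note and assumes, as a simplification, that a distance-based tree attaches a language next to its closest neighbour. The function name `nearest_language` is illustrative and not part of eLinguistics.

```python
# Distances to Vedic Sanskrit quoted in note (2); no other values are assumed.
distances_to_sanskrit = {
    "Avestan": 37.6,
    "Hindi": 54.4,
}

def nearest_language(distances):
    """Return the (language, distance) pair with the smallest distance."""
    return min(distances.items(), key=lambda item: item[1])

nearest, distance = nearest_language(distances_to_sanskrit)
print(f"Sanskrit is closest to {nearest} (distance {distance})")
```

Because the smaller distance wins, Sanskrit lands next to Avestan rather than next to Hindi, even though Hindi belongs to the Indo-Aryan family. A full clustering run compares averages over whole groups, so this is only a first approximation of what the tree program does.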

Seven languages in the study don't get classified within a macro-family: Ainu, Elamite, Burushaski, Japanese, Georgian, Fula and Inuit. These languages are not represented in the tree. Creoles (French Creoles and Sranan) and languages with too few available words (Etruscan, Mycenaean, Oscan and Umbrian) are also excluded from the tree as they bias the results, with pairwise comparisons based on only 1 to 5 words. For better visibility, languages like Provencal, Romansch, Ladin, Letzebuergesh, ... are not classified in the tree.

The farther left the branches are linked, the less reliable the classification is. A value of 40 on the tree corresponds to a genetic distance of 80 in the comparison query. In most cases, pairwise comparisons with a genetic distance of 80 have a p-value of 0.1 or more - this is a level where chance dominates the results.
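
The relation between the tree scale and the pairwise distance can be written down as a small sketch. It assumes that the height at which two branches join is half the distance between them, which is what the figures in the text imply (35 on the scale for 70, 40 for 80). The names `NOISE_HEIGHT`, `pairwise_distance` and `in_noise_zone` are illustrative, and using 40 as a cut-off is a reading of the paragraph above, not a documented setting of the system.

```python
NOISE_HEIGHT = 40.0  # tree level the text associates with chance-level results

def tree_height(distance):
    """Level on the tree scale at which two branches join: half the distance."""
    return distance / 2.0

def pairwise_distance(height):
    """Genetic distance implied by a join level (counted in both directions)."""
    return 2.0 * height

def in_noise_zone(height, threshold=NOISE_HEIGHT):
    """True if a join is at or beyond the level where chance dominates."""
    return height >= threshold

for height in (35.0, 40.0):
    print(height, pairwise_distance(height), in_noise_zone(height))
```

With these definitions a join at level 35 maps to a distance of 70 and is not flagged, while a join at level 40 maps to a distance of 80 and is flagged.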

All classifications are done automatically, without manual intervention. The position of a few languages inside the sub-families on the language evolutionary tree can be subject to discussion, but all languages get classified in the right sub-family.

The clustering technique used to generate the tree from the distance matrix is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The neighbor-joining technique gives very similar results.
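
As a rough illustration of how UPGMA turns a distance matrix into a tree, here is a minimal pure-Python sketch. It is not the MEGA7 implementation. The four labels A to D and all distances are invented for the example and do not come from the study; the function names are illustrative as well.

```python
from itertools import combinations

def average_distance(cluster_a, cluster_b, dist):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def upgma(names, dist):
    """Return the list of joins as (left_members, right_members, height)."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        # The join is drawn at half the distance between the two clusters.
        height = average_distance(left, right, dist) / 2.0
        merges.append((left, right, height))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges

# Hypothetical distances between four placeholder languages.
toy_distances = {
    frozenset(("A", "B")): 20, frozenset(("A", "C")): 60,
    frozenset(("B", "C")): 64, frozenset(("A", "D")): 90,
    frozenset(("B", "D")): 86, frozenset(("C", "D")): 80,
}

merges = upgma(["A", "B", "C", "D"], toy_distances)
for left, right, height in merges:
    print(left, right, round(height, 1))
```

In this toy run A and B join first, C joins that pair next, and D joins last at a level above 40, the level the text describes as dominated by chance. Recomputing every average from the original distances keeps the code short but is slow for large matrices; dedicated tools update the matrix incrementally.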

The program used to generate evolutionary trees from the distance matrix is MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for bigger datasets (submitted). Kumar S, Stecher G, and Tamura K (2015) - www.megasoftware.net