Evolutionary Tree of languages

This language evolutionary tree is generated fully automatically on the basis of 18 basic vocabulary items and a simplified sound correspondence point system. The farther to the left two branches join, the more distantly related the languages are.
 
Important note: all families and subfamilies with names in dark blue (top-level families) and light blue (subfamilies) are recognized by most specialists, so in these cases our automatic classification matches the mainstream view. The four macrofamilies with names in green, and their branches (also green), are long-range relationships inferred by our system. These names refer to existing, but not unanimously recognized, hypotheses: Eurasiatic[a], Austro-Tai[b], North Caucasian[c] and Mosan[d]. Another interesting result is the internal classification of the subfamilies within Indo-European, where the order in which the subfamilies split is still disputed today. Our Indo-European subgrouping results match most existing hypotheses.

Language evolutionary tree

The topology of the tree shows the language families and subfamilies which are known from other, more complex methods, which is a strong validation of our methodology. Perhaps most important, the system also infers long-range relationships between established macrofamilies; these relationships are not yet recognized by all specialists, but hypotheses for them already exist. These "super families" have been proposed by numerous scholars: Eurasiatic[a] for the link between Indo-European, Uralic, Turkic, Mongolic and Tungusic; Austro-Tai[b] for the link between Austronesian and Tai-Kadai; and North Caucasian[c] for the link between Northwest and Northeast Caucasian. In the Americas, our system detects a relationship between the Salishan and Wakashan macrofamilies, which is reflected in the disputed Mosan[d] hypothesis.

Interactive language trees with diachronic interpolation are available for: Indo-European language tree, Afro-Asiatic language tree


Notes from the tree:
 
(1) Since the system is designed to react to signals that link languages over thousands of years, the subclassifications within the subfamilies are sometimes not perfect, especially if the languages are very close to each other. In a few cases, languages are classified in the right subfamily but at the wrong place within it. Tree models are not perfect and cannot always represent the gradual evolution of expanding clusters of dialects. Alternative models like the wave model[e] may deliver better results in the future.
(2) We have excluded all ancient languages that have living descendants, such as Sanskrit, Avestan, Latin and Ancient Greek, as well as languages from the Middle Ages, because they bias the results in the tree: their recorded evolution stops at a certain date, yet they are compared with the other languages as if they were contemporary. One effect is that Sanskrit, which can to some extent be regarded as a near proto-language for the Indo-Aryan family, could almost get classified within the Iranian family, next to Avestan, itself one of the oldest Iranian languages. This is because Sanskrit is closer to its relative of thousands of years ago than to its own descendants (Avestan to Vedic Sanskrit: distance = 37.4; Hindi to Vedic Sanskrit: distance = 46.3). This is a limitation of the tree representation.
(3) The Uralic branch has strong links both to the Turkic/Mongolic branch and to the Indo-European one, but with low statistical confidence. The "Uralo-Altaic" hypothesis was widely accepted in the past. Today, many scholars reject this classification, while others argue for a broader grouping in which Uralic, Altaic, Indo-European and other families share a common origin (see source). In eLinguistics, the relatedness signals between the Uralic and the Altaic languages are as strong as those between Uralic and Indo-European. Pairwise comparisons such as Finnish to Kazakh and Hungarian to Mongolian show very strong relatedness signals with a very low p-value (a low probability that the results are due to chance). The eLinguistics results alone cannot confirm the Eurasiatic hypothesis[a], but they contribute relevant evidence to the discussion.
(4) The Altaic family - if considered as linking only the Turkic, Mongolic and Tungusic subfamilies - is reflected here with a very strong statistical significance.
(5) The Afroasiatic classification reflected in eLinguistics.net matches the widely accepted views. All branches (with the exception of Omotic) are very stable. The Cushitic and Omotic subfamilies are linked to the Semitic, Egyptian (ancient Egyptian and Coptic), Berber and Chadic ones (Afroasiatic macrofamily). However, this connection is not completely stable. Two Omotic languages (Wolaytta and Gamo) do not connect to any macrofamily, although they clearly have the Omotic-Dizoid languages (Dizi, Nayi and Sheko) as their nearest neighbours. There is no consensus among linguists regarding the broader status of the Omotic family, and Glottolog[f] does not classify it as a single group but as four separate ones. Our system identifies only the Dizoid branch of Omotic as having a clear genetic relationship with the Cushitic languages.
(6) In the Austronesian family, four languages do not get classified by our system; all are Southern Oceanic languages from New Caledonia: Ajie, Drehu, Nengone and Paicî. Four other Austronesian languages are placed correctly in the tree, but their position is unstable: it changes when other languages are added or removed, or when the UPGMA phylogenetic algorithm is used instead of Neighbor-Joining. These languages are Nauruan, Palauan, Chamorro (Guam) and Tetum. An analysis of their nearest neighbours in the distance values confirms their position in the present tree. Otherwise, the Austronesian classification inferred by our system matches current mainstream views. Separately, Catawba, an Eastern Siouan language, also shows up as isolated and is therefore misclassified.
(7) In the Niger-Congo macrofamily, 2 languages are identified as members but do not classify in the right subfamily in our system: Serer and Temne. We do not represent them in the tree.
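
The nearest-neighbour reasoning used in several of these notes can be illustrated with the two distances quoted in note (2). The following Python snippet is only a minimal sketch: the dictionary layout and the function name `nearest_neighbour` are illustrative assumptions, not part of the eLinguistics system. Only the two distance values come from the text.

```python
# Distances to Vedic Sanskrit as quoted in note (2); a lower value means
# a closer relationship. The data structure itself is an assumption.
distances_to_sanskrit = {"Avestan": 37.4, "Hindi": 46.3}


def nearest_neighbour(distances):
    """Return the label with the smallest distance (hypothetical helper)."""
    return min(distances, key=distances.get)


print(nearest_neighbour(distances_to_sanskrit))  # Avestan
```

Because the ancient relative scores lower than the modern descendant, a purely distance-based tree tends to pull Sanskrit towards Avestan, which is the bias the note describes.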
 
Other remarks:

Ten languages included in our system do not get classified within a macrofamily: Ainu, Basque, Burushaski, Elamite, Georgian, Japanese, Kanuri, Korean, Nivkh and Sumerian. These languages remain isolated in our results and are not represented in the tree.

We include 77 languages of the Americas which classify in seven major American macrofamilies in our trees: Siouan, Algic, Na-Dene, Uto-Aztecan, Mayan, Salishan and Wakashan. Ten isolated North and South American languages are not displayed on the tree list for the sake of clarity: Aymara, Cherokee, Guarani, Haida, Huave, Kiowa, Mixe, Mohawk, Quechua and Zoque. Within this group, Mixe and Zoque form a related pair, as do Quechua and Aymara (the relatedness of these last two languages is disputed). We do not represent the Chukotko-Kamchatkan language tree (Koryak, Chukchi and Itelmen): although our system links these three languages correctly, the group does not relate to any other macrofamily.

Languages with too few available words (Etruscan, Hurrian, Mycenaean, Phrygian, Safaitic and Urartian) are also excluded from the tree because they bias the results: their pairwise comparisons are based on only 1 to 5 words, and fewer words in a pairwise comparison means greater exposure to chance and more false positives throughout the tree. Ideally, the trees should be built only from results with the same level of exposure to chance. For Oscan, only about a dozen words are available, but it ranks well in the tree without a negative impact on tree stability. Lycian and Umbrian have only eight words available, but they have clear, exclusive nearest neighbours in the list (Lycian->Luwian (Anatolian) and Umbrian->Oscan (Sabellic)), so we place them manually in the tree.
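
Why short word lists are more exposed to chance can be shown with a simple binomial calculation. This Python sketch is not the p-value computation used by eLinguistics. It assumes, purely for illustration, that each compared word has the same independent probability `p_chance` of looking like a match by accident, and it asks how likely it is that at least half of the words match by chance. The function name and the value 0.1 are hypothetical.

```python
from math import ceil, comb


def chance_tail(n_words, min_matches, p_chance):
    """Probability of at least min_matches accidental matches among n_words
    independent comparisons (binomial upper tail)."""
    return sum(
        comb(n_words, k) * p_chance**k * (1 - p_chance) ** (n_words - k)
        for k in range(min_matches, n_words + 1)
    )


P_CHANCE = 0.1  # hypothetical per-word chance of an accidental match

for n in (2, 5, 12, 18):
    threshold = ceil(n / 2)  # "at least half of the words match"
    prob = chance_tail(n, threshold, P_CHANCE)
    print(f"{n:2d} words, >= {threshold} matches by chance: {prob:.2e}")
```

Under these assumptions the chance probability drops steeply as the number of compared words grows, which is consistent with excluding languages that have only a handful of words available.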

Creole languages are not placed in the tree and are available in the pairwise comparisons only.

The clustering technique used to generate the tree from the distance matrix is Neighbor-Joining. Another tree-building technique, UPGMA (Unweighted Pair Group Method with Arithmetic Mean), gives very similar results.
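
For readers unfamiliar with Neighbor-Joining, the following pure-Python sketch shows the standard algorithm on a toy distance matrix. It is not the eLinguistics implementation: the function name `neighbor_joining`, the input format and the four taxa A to D with their distances are all invented for illustration, and the final edge is split at its midpoint only to obtain a printable rooted Newick string.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build a Newick string with the standard Neighbor-Joining algorithm.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy so the caller's matrix is untouched

    def D(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(D(a, b) for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * D(*p) - r[p[0]] - r[p[1]],
        )
        # Branch lengths from i and j to the new internal node.
        li = D(i, j) / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = D(i, j) - li
        u = f"({i}:{li:g},{j}:{lj:g})"  # the Newick fragment doubles as node id
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (D(i, k) + D(j, k) - D(i, j)) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    a, b = nodes
    half = D(a, b) / 2  # arbitrary midpoint split of the last edge
    return f"({a}:{half:g},{b}:{half:g});"


# Toy additive distances (invented): A and B are neighbours, as are C and D.
toy = {
    frozenset("AB"): 5, frozenset("AC"): 7, frozenset("AD"): 11,
    frozenset("BC"): 8, frozenset("BD"): 12, frozenset("CD"): 6,
}
print(neighbor_joining(["A", "B", "C", "D"], toy))
# ((A:2,B:3):2,(C:1,D:5):2);
```

Unlike UPGMA, Neighbor-Joining does not assume that all lineages change at the same rate, which is why the two branches of a pair (for example A:2 and B:3) may have different lengths.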

The evolutionary trees are generated from the distance matrix in R, using the SigTree and ape packages.


  • [a] "Eurasiatic languages" in Wikipedia
  • [b] "Austro-Tai languages" in Wikipedia
  • [c] "North Caucasian languages" in Wikipedia
  • [d] "Mosan languages" in Wikipedia
  • [e] "Wave model" in Wikipedia
  • [f] Glottolog: comprehensive catalogue of the world's languages, language families and dialects.