Language evolution: a computerized model

Basic principles - "Linguistics & math, using methodologies like in biology...":
  -> A model of "language identity markers" to detect and quantify common origin.
  -> The right balance between signal strength and resistance to bias by chance.

The idea is to create a representation that is accessible and easily understandable to anyone, so that a broader public can query language comparisons, navigate language trees and so better understand the evolution of languages. The system uses the following software modules:

  1. The module for single comparisons of pairs of languages (available here as the interactive "Language comparisons").
  2. The module for mass comparison of languages and their evolution. It is built on the module from point 1, which is called thousands of times during a mass comparison. The output of the mass comparison is a distance matrix, written in a format (.mts) that can be analysed by the standard software used in biology and genetics to represent evolutionary trees.
  3. The standard software for constructing the evolutionary tree itself.
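
How module 2 builds on module 1 can be sketched in a few lines. The sketch is in Python for brevity and is not the project's own code. The names `placeholder_distance` and `distance_matrix`, the sample word lists and the "differing slots" measure are all assumptions made for illustration; the real pairwise module applies the consonant rules described further down, and the .mts output format is not reproduced here.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

WordList = List[str]


def placeholder_distance(a: WordList, b: WordList) -> float:
    """Stand-in for the pairwise module: percentage of word slots that differ (0..100).

    This is NOT the project's scoring; it only gives the mass comparison
    something to call.
    """
    if len(a) != len(b) or not a:
        raise ValueError("word lists must be non-empty and of equal length")
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * differing / len(a)


def distance_matrix(
    languages: Dict[str, WordList],
    compare: Callable[[WordList, WordList], float],
) -> Tuple[List[str], Dict[str, Dict[str, float]]]:
    """Call the pairwise comparison once per unordered pair and mirror the result."""
    names = sorted(languages)
    matrix = {n: {m: 0.0 for m in names} for n in names}  # diagonal stays 0
    for a, b in combinations(names, 2):
        d = compare(languages[a], languages[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return names, matrix


# Made-up encoded word lists for three hypothetical languages.
languages = {
    "A": ["ptr", "mtr", "ntk"],
    "B": ["ptr", "mtr", "nxt"],
    "C": ["klm", "mtr", "spr"],
}

names, matrix = distance_matrix(languages, placeholder_distance)
for n in names:
    print(n, [round(matrix[n][m], 1) for m in names])
```

Because the comparison function is passed in as a parameter, the matrix builder does not depend on how a single pair is scored, which matches the separation between the two modules described above.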

The major steps in the methodology are:

  1. Choosing and encoding the language material for the comparisons. The challenge is to encode this material so that it can be processed by a computer. The approach in this project is purely lexical: the language material consists of words, whose consonants are encoded in a form the program can process. More details about this language material here.
  2. Determining a set of rules used to identify cognates. These rules are consonant relationships as they are known from sound change. More details here. The scoring system is applied step by step: first from consonant to consonant, then from word to word, and finally for the whole language comparison. At each step the score ranges from 0 to 100, with 100 being the highest possible number of points, and it is averaged from one level to the next. The result is then reversed (100 minus "Result") to express a distance from 0 (same language) to 100 (completely different).
  3. Calculating the statistical context of all results. Along with the cognate scoring, a statistical expected value and its standard deviation are calculated. Exposure to chance is a big issue in language evolution analysis: from certain values upward (approx. 72-75 and above), the "cognates" detected by the system are due more to chance than to relatedness. For every comparison, the result is therefore checked against the statistical expectation: it has to be equal to or lower than the expected value.
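
The level-by-level averaging and the final check against the expected value can be sketched as follows. This is a minimal Python illustration and not the project's code. It assumes the lowest-level scores (0 to 100 per compared sound pair) have already been produced by the rule set, and that a plain arithmetic mean is used at each level. The function names, the sample scores and the expected value and standard deviation (74.0 and 2.0) are invented for the example.

```python
from typing import List, Tuple


def average(values: List[float]) -> float:
    """Plain arithmetic mean; the source only says scores are 'averaged'."""
    if not values:
        raise ValueError("cannot average an empty list")
    return sum(values) / len(values)


def word_score(pair_scores: List[float]) -> float:
    """Score of one word pair: mean of its sound-pair scores (each 0..100)."""
    return average(pair_scores)


def language_score(words: List[List[float]]) -> float:
    """Score of a language pair: mean of the word scores."""
    return average([word_score(w) for w in words])


def distance(score: float) -> float:
    """Reverse the score: 0 = same language, 100 = completely different."""
    return 100.0 - score


def check_against_expectation(
    dist: float, expected: float, std_dev: float
) -> Tuple[bool, float]:
    """Return whether the distance is at or below the chance expectation,
    and how many standard deviations below it lies (assumed interpretation)."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return dist <= expected, (expected - dist) / std_dev


# Hypothetical sound-pair scores for three compared word pairs.
scored_words = [[100, 50], [0, 100], [100, 100]]

score = language_score(scored_words)   # word scores 75, 50, 100 -> 75.0
dist = distance(score)                 # 25.0
ok, margin = check_against_expectation(dist, expected=74.0, std_dev=2.0)
print(score, dist, ok, margin)
```

Expressing the margin in standard deviations is one way to use the standard deviation mentioned in step 3; the source does not say how it enters the final judgement.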

Since the input material as described in 2) and 3) does not depend on the software itself, any other material and hypothesis can be processed by the system. Some of the blog's visitors have already contributed suggestions and their own hypotheses; see the discussion area for more details.

In this blog, you can query comparisons between over 160 languages. All calculations, representations and detail sheets are generated interactively, directly from the raw data. The distance matrix is processed in a desktop application: the computer tries out the different analyses and carries out millions of comparison tasks within seconds, without mistakes and without bias or loss of objectivity. Presently, the matrix is generated from 160 languages and holds the values of over 12,800 comparisons of pairs of languages. To produce it, the system performs approximately 1.4 million consonant comparisons within 35 seconds.
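
The figures in this paragraph can be cross-checked with a little arithmetic. The short Python calculation below is only a plausibility check: the source does not say whether a language's comparison with itself is counted, so both readings are shown, and the variable names are chosen for this example.

```python
n_languages = 160

# Unordered pairs of two different languages.
pairs_distinct = n_languages * (n_languages - 1) // 2      # 12720

# Unordered pairs if each language is also compared with itself (assumption).
pairs_with_self = n_languages * (n_languages + 1) // 2     # 12880

consonant_comparisons = 1_400_000   # approximate figure quoted in the text
seconds = 35

throughput = consonant_comparisons / seconds                # 40000.0 per second
per_pair = consonant_comparisons / pairs_with_self          # roughly 109 per pair

print(pairs_distinct, pairs_with_self, throughput, round(per_pair))
```

Only the second count exceeds 12,800, so the "over 12,800" figure is consistent with a matrix that includes the diagonal, although that remains an inference rather than something the text states.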

The sample detail page gives you a clear idea of how the value for the genetic distance between two languages is calculated. It shows the details of the calculation of the genetic distance between German and English, reflecting their evolution.

Here is a screenshot of the desktop application's interface:

The technology used is .NET, with C# as the programming language. The data are not stored in a database but in XML files, which makes them easy to move to any PC or web server. The data handling is written with LINQ; the advantage is that the data are encapsulated in objects (after mapping from the classical, relational representation stored in XML).
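
The idea of mapping XML records into objects and then querying those objects can be illustrated as follows. The project itself does this in C# with LINQ; this sketch uses Python's standard library instead, and the XML layout, the element and attribute names, the classes `Language` and `Word` and the encoded forms are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# Hypothetical XML layout; the project's real schema is not shown in the text.
SAMPLE_XML = """
<languages>
  <language name="English">
    <word meaning="water" consonants="WTR"/>
    <word meaning="night" consonants="NT"/>
  </language>
  <language name="German">
    <word meaning="water" consonants="WSR"/>
    <word meaning="night" consonants="NXT"/>
  </language>
</languages>
"""


@dataclass
class Word:
    meaning: str
    consonants: str


@dataclass
class Language:
    name: str
    words: List[Word] = field(default_factory=list)


def load_languages(xml_text: str) -> List[Language]:
    """Map the flat XML records into Language objects holding Word objects."""
    root = ET.fromstring(xml_text.strip())
    return [
        Language(
            name=lang.get("name", ""),
            words=[
                Word(w.get("meaning", ""), w.get("consonants", ""))
                for w in lang.findall("word")
            ],
        )
        for lang in root.findall("language")
    ]


languages = load_languages(SAMPLE_XML)

# A query over the objects, comparable in spirit to a LINQ expression.
german_forms = [
    w.consonants for lang in languages if lang.name == "German" for w in lang.words
]
print(german_forms)
```

Once the records are objects, the comparison code can work with languages and words directly and never needs to touch the XML layer.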
