Similarity index - distorted picture

Similarity index
Posted by: Gurao on Sep. 21, 2014, 06:06
The very idea of a similarity index calculated between a pair of languages on the basis of the identity of segments (consonants or vowels) is a misnomer in glottochronology or lexicostatistics. What those procedures actually record is the existence or nonexistence of the word, i.e. whether the word being compared is genetically shared or not. You cannot award fractions or marks on the basis of matched segments between the pair compared; doing so gives a distorted picture. For example, the word for 'two' is rentu in Tamil and remdu in Telugu. The pair differs by nt:md, which involves an orthographic variation in representation as well as a phonemic change: the n:m variation belongs purely to the level of orthographic representation, while t:d is a phonemic shift. If such distortion can be avoided, the real or factual relation might emerge. The same applies to other pairs of words.
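To make the contrast concrete, here is a minimal Python sketch of the two ways of scoring the 'two' pair above. It is only an illustration under stated assumptions: each letter of the romanized form is treated as one segment, segments are compared position by position, and the cognate judgement is supplied by the analyst. The function names are hypothetical and are not taken from the blog's system.

```python
def segment_similarity(a: str, b: str) -> float:
    """Fraction of positions where the two forms have the same segment.

    Assumption: one letter = one segment, aligned position by position.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cognate_score(is_cognate: bool) -> int:
    """Binary lexicostatistic score: 1 if the words are judged to share an origin, else 0."""
    return 1 if is_cognate else 0


def cognate_percentage(judgements: list[bool]) -> float:
    """Percentage of compared meanings whose words are judged cognate."""
    if not judgements:
        raise ValueError("no judgements to score")
    return 100.0 * sum(cognate_score(j) for j in judgements) / len(judgements)


# The 'two' pair from the post: r, e and u match; n:m and t:d do not.
partial = segment_similarity("rentu", "remdu")

# The analyst judges the pair to be genetically shared, so it counts in full.
binary = cognate_score(True)

print(f"segment-based score: {partial:.2f}")
print(f"binary cognate score: {binary}")
```

Under these assumptions the segment-based score penalizes the pair for the n:m and t:d differences, while the binary score counts the pair as fully shared, which is the distinction the post is drawing.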
 
Posted by: Vincent on Sep. 22, 2014, 10:02
Thank you very much for your reaction; I greatly appreciate your comments. I am aware that the system (and especially the data) has a certain amount of inconsistency. Since this is a fully computerized mass comparison that always uses the same 18 words, I depend on choosing the right stem for each word in each language and on the right phonetic transcription.
 
However, although the method and data contain some mistakes, I am confident the principle is right: as you can see, the system with the current data automatically generates a family tree of languages which (with some mistakes) closely retraces otherwise widely accepted language classifications. This means that on the basis of 18 words, with the phonetic transcription I use to make the data processable by the computer, we get results which are globally right. The more information I have access to (e.g. for Indo-European, Semitic, Uralo-Altaic), the better the classification is. But the system clearly has limits: all pairwise relations with a genetic distance above approximately 72-75 are greatly influenced by statistical noise, that is, correspondences which are due to chance. So the system doesn't recognize remote genetic relationships such as the link between the Uralo-Altaic families and Japanese or Korean...
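The limit described above can be sketched as a simple filter over pairwise distances. This is a hedged illustration only: the cut-off of 72 is taken as the lower end of the "approximately 72-75" range, the function name is hypothetical, and the language labels and distance values are placeholders, not data from the system.

```python
def split_by_noise_threshold(distances, threshold=72.0):
    """Separate language pairs into those below the noise limit and those above it.

    distances: mapping of (language, language) -> genetic distance.
    threshold: assumed cut-off; pairs above it are treated as dominated by chance.
    """
    reliable = {pair: d for pair, d in distances.items() if d <= threshold}
    noisy = {pair: d for pair, d in distances.items() if d > threshold}
    return reliable, noisy


# Placeholder values for illustration only.
example = {
    ("Language A", "Language B"): 40.0,
    ("Language A", "Language C"): 80.0,
}

reliable, noisy = split_by_noise_threshold(example)
print("usable for classification:", sorted(reliable))
print("likely chance correspondences:", sorted(noisy))
```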
 
If you have time and can give me more examples of inconsistencies or mistakes in my Tamil/Telugu data, and more generally in my Dravidian data, I would be really grateful if you could send me the details. I would then correct the data and reprocess the tree. I have received similar feedback in the past in many cases (though until now not for Dravidian languages), and it has helped me improve the data.
 