Discussing the Lexicostatistical Comparison
of the Main Indo-European Language Groups
The methodology and preliminary outcomes
How lexemes were counted
What matters is not how exactly the lexemes in the table were counted, but whether a standardized, uniform, unbiased, objective method of computation was maintained throughout the dataset. Initially, the plan was to feed the table into a program that would automate all calculations, but the Python program, prepared by a friend of mine, was never finished, so as of 2008, I simply had to do all calculations manually. Think of it as a quick-and-dirty, preliminary analysis.
Similar considerations apply to possible statistical errors. As in all statistical studies, occasional errors do not matter and cannot affect the final outcome (the law of large numbers). This is what makes statistical methods so robust. Even if one finds some false cognates or loses some of the right ones, that would not impact the result, as long as one consistently maintains the same method of counting throughout the study. It may not even be relevant whether one uses too strict or too lax a method of counting, because all that he or she will get in either case is an overall negative or positive local offset value that will be effectively canceled out when these values are compared globally to each other.
As a result, it's in fact irrelevant which method of counting was used. However, as a matter of digression, I can say that I have not relied solely on the commonly accepted method of counting cognates (as, for instance, Dyen et al. have). This classical method was enhanced by considering the actual phonological similarity of two sets of possible cognates.
The commonly accepted method of counting cognates in Swadesh lists is based on the presumption that the genetic relation of two sets of lexemes, and ultimately of two languages, can be shown by demonstrating that just two pairs are related to each other, and then counting up the number of such matches in a sufficiently long list. But what if we use a short list? That would make this method susceptible to the following error in reasoning. Suppose we compare English, German and Armenian:
The relation of the Armenian numerals is thought to have been demonstrated by the method of regular correspondences; therefore, in the classical method, the genetic relation of these three columns is exactly the same, and we should count all three pairs as cognates. However, it is quite obvious that the English/German pairs stand much closer to each other, whereas Armenian looks like something completely different. This happens because we put the theory before the facts, overestimating theoretical analysis instead of looking at the facts of life and phonology as they are. The facts demand that we take notice of the actual phonological similarity of lexical pairs and compare words phoneme by phoneme rather than word by word. Consequently, while /tu:/ and /tsvai/ look related, /yerku/ does not, and should not be regarded as an exact match for the former two lexemes.
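A phoneme-by-phoneme comparison of this kind can be roughly sketched in code. The snippet below is only an illustration, not the procedure actually used for the table: it assumes that a plain Levenshtein (edit) distance over the characters of a rough transcription is an acceptable stand-in for phonological similarity, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 means identical, 0 means no segment aligns."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Rough transcriptions from the text; each character is treated as one segment.
english, german, armenian = "tu:", "tsvai", "yerku"

sim_en_de = similarity(english, german)
sim_en_hy = similarity(english, armenian)
sim_de_hy = similarity(german, armenian)

print(f"English/German:   {sim_en_de:.2f}")
print(f"English/Armenian: {sim_en_hy:.2f}")
print(f"German/Armenian:  {sim_de_hy:.2f}")
```

Even this crude measure ranks the English/German pair above either pair involving Armenian. A serious implementation would compare phonemes by their features (so that t and ts count as close) rather than treating every mismatch as equally bad.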
There's also a good survey of computational methods in historical linguistics by Agarwal & Adams (2007).
Consequently, a lexicostatistical distance matrix was obtained that will be used in the further discussion.
Moreover, the following comments should be taken into consideration:
(1) In case of doubt, calculations were done several times until stable results were obtained, or the results were statistically averaged over several calculations.
(2) If calculation outcomes were much too uncertain, they were temporarily aborted (as in the case with positioning Balto-Slavic and Armenian).
(3) The outcomes are not just the result of dead reckoning, they are supported by additional considerations, such as showing which particular innovations were produced within a particular genetic group, or even adding some material from outside the table. So they should rather be thought of as a preliminary indicator where to look, not the final diagnosis.
(4) Averaging over many language groups makes a result particularly stable, as in the case of the 35-36% figure for the relatedness of the most distant main IE groups to each other (excluding Hittite). On the other hand, the cases of contrasting just two individual groups (e.g., just Irish and Welsh) are particularly prone to possible statistical errors, as long as no additional data are taken into account.
(5) The exact relatedness value is in many cases simply irrelevant, because this is just a lexicostatistical study meant to help regroup the IE groups, not necessarily to produce exact numbers for further glottochronological analysis. For instance, it might not matter herein whether you get 60 or 75 for the relation of Baltic to Slavic, as long as you do not plan to do any glottochronology; you just become convinced these groups are quite close, and that is that.
(6) The table was designed to be analyzed by a computer program using a special algorithm described elsewhere, so these manual results should be seen as preliminary (=just re-stating the obvious).
Again, since these are just manual calculations, the error margin for standalone pairs, not contrasted to other groups, can be rather high (+/-7%). This figure follows from statistical fluctuations in the distance matrix, which were obtained when the same calculations were repeated under different subjective biases or when different languages of the same group, which supposedly have the same phylogenetic depth, were compared to a particular language (for instance, Hindi or Persian to all of the Iranian languages).
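The contrast between standalone pairs and averaged figures can be put into numbers. The sketch below assumes that the errors of individual pairs are independent and of similar size, which is an idealization rather than something demonstrated in this study; under that assumption the error of a mean shrinks with the square root of the number of pairs averaged.

```python
import math

SINGLE_PAIR_ERROR = 7.0  # percentage points, the margin quoted in the text


def error_of_mean(n_pairs: int, single_error: float = SINGLE_PAIR_ERROR) -> float:
    """Approximate error of a mean over n independent pairs (assumed model)."""
    return single_error / math.sqrt(n_pairs)


for n in (1, 4, 9, 16):
    print(f"{n:2d} pairs averaged -> about +/-{error_of_mean(n):.1f}%")
```

If the errors are correlated (for instance, the same subjective bias applied to every pair), the improvement from averaging would be smaller than this.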
Hypothesis: Albanian related to Celtic
Note that the percentage of Irish to Welsh may be a little lower than the actual value because of the greater than usual number of dialectal (?) synonyms within the Welsh dataset, which are herein counted as 0.5-0.3 per lexeme, so we might expect the corrected figure for the Irish/Welsh relatedness to be a little higher (about 57%?).
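How fractional synonym counting pulls a percentage down can be sketched as follows. The equal-share rule (each of n synonyms counts 1/n) and the sample slots are assumptions made purely for illustration; the text only states that synonyms were counted as 0.5-0.3 per lexeme.

```python
def slot_score(synonym_matches):
    """Score for one meaning slot: every synonym carries an equal share,
    so 2 synonyms weigh 0.5 each and 3 synonyms about 0.33 each (assumed scheme)."""
    return sum(synonym_matches) / len(synonym_matches)


# Hypothetical slots: True means that synonym is cognate with the other language's word.
slots = [
    [True],                # single word, cognate
    [True, False],         # two synonyms, one cognate
    [False, False, True],  # three synonyms, one cognate
    [False],               # single word, not cognate
]

weighted = 100 * sum(slot_score(s) for s in slots) / len(slots)
any_match = 100 * sum(any(s) for s in slots) / len(slots)

print(f"fractional counting: {weighted:.0f}%")
print(f"any synonym counts as a full match: {any_match:.0f}%")
```

The more synonyms a dataset lists per meaning, the further the fractional figure falls below the "any match" figure, which is the direction of the correction suggested above.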
Accordingly, this predicts two waves of migration into the British Isles, with Proto-Goidelic being the first to enter, and the Brythonic subgroup being a result of relatively recent migration from the Continent. Proto-Brythonic and Proto-Goidelic must have separated a long time ago somewhere in northwestern or central continental Europe.
As to Italo-Celtic, the current study is not sufficiently detailed and elaborate either to completely exclude or to corroborate the possibility of an Italo-Celtic grouping; rather, we see it as a possible but unlikely, and in that case very short-lived, state within the European Centum branch. There seem to be no specific Italo-Celtic shared innovations in the 46-list, except for the typical Celt. ni : Lat. nos (we), which is also attested in other Indo-European groups and is not unique.
Hypothesis: Italic related to Hellenic
In all of the above instances we observe close phonological and semantical proximity that can be explained by assuming a genetic unity of Italic and Hellenic languages. This is easily explained from the geographical perspective by considering the fact that one of the few feasible passages to the Italian peninsula goes through the southern Balkans and northern Greece, therefore the only geographically realistic way for Proto-Italic to form was by its separation from Proto-Hellenic at some point in time.
An even more interesting finding may be a possible proximity of Proto-Germanic to Proto-Tocharian (Old. Eng : Toch B ~ 65%; German : Toch B ~ 59%). This observation deserves further investigation:
OE. wæter; Toch. A. wär; Toch. B. wer < *wat'er (?) (but Ir.uisce; W. dwr; Gr. hüdo:r; Lith. vanduõ)
Consider also the strong aspiration in t'- which led to a transformation t' > ts (not necessarily palatalization as normally explained):
Toch. B mácer (mother); Toch. B pacer (father); Toch. B tkácer (daughter); Toch. B. kuce (who)
Tocharian k- finds explanation as a strongly aspirated t' > tk' > k' > k (Apparently, the digraph <tk> as preserved in Toch. A tkam (earth); Toch. A ckácer; Toch. B tkácer marks the result of this aspiration.)
The former process is possible if Proto-Tocharian stops were heavily aspirated, hence *ta > *tha > *hha > *kha > *ka before an open /a/ when the dentals were undergoing allophonic lenition. The metathesis in *tankwo occurred precisely under the impact of aspiration, because both *th and *kh were pronounced in a rather similar way at some point, more or less like *hhanhhwo.
The Tocharian aspiration is reminiscent of Grimm's law and the aspiration in the West Germanic languages.
Part of Grimm's law seems to be already in progress in early Proto-Tocharian, since we have *k > *h > 0 in:
Got. dauhtar; Toch. B tkácer (but Gr. thügáte:r)
Other examples of Germano-Tocharian analogies might include:
A. kumn-äs'; B. känm-as's'äm; Germ. kommen (come) [cf. Skt. gamati "he goes," Avestan jamaiti "goes," Lith. gemu "to be born," Gk. bainein "to go, walk, step," L. venire "to come", which do not have the same semantic and phonological form as in Germ. and Toch.]
It should not be particularly surprising that the Proto-Tocharians wandered as far as the Taklamakan desert: remember the massive Gothic migration to the Crimean peninsula and the rest of Europe about two thousand years later. The Indo-Europeans used horses, and the vast Ponto-Caspian and Central Asian steppes allowed for distant migrations across Eurasia.
By no means do I insist on Proto-Tocharian/Proto-Germanic unity; at this level, it is just a tentative hypothesis which follows from the data under consideration but is rather poorly demonstrated herein.
The Balto-Slavic unity, well-proven
The close proximity of Baltic and Slavic (herein 65%) languages is well-supported by many other studies (Dyen (1991), Ringe (2005)), including some articles you can find at this site. You can also easily see a number of shared Balto-Slavic lexical innovations in the present 46-lexeme list:
Pruss. ranko; Lith ranká; Latv. roka; OCS ro~ka [nasal]; Russ. ruká (hand, arm)
The close genetic proximity of both groups is evident to anyone familiar with any two Baltic and Slavic languages. It doesn't really take any research. Some selected words and phrases may not even require translation, and some meanings can be figured out with some effort and the knowledge of regular correspondences. [Cf. as anecdotal evidence (phonetic transcription): "Kaip ash buváu ministru" (How I was a Minister) (a book by Zinkevicius), with a possible Russian translation Kak ya (also OCS azê and Bulgarian az) byl (also: byvál) minístrom; or Lith. Líye litús (Rains (pours) the rain) vs. Russ. Lyót líven' (Pours the shower/rain).] However, this close relationship should not be oversimplified or overestimated, nor does it mean that Lithuanian or Latvian is directly readable to the speakers of Slavic languages and vice versa.
On the other hand, the difference between modern Lithuanian and Latvian seems rather pronounced. According to a lexicostatistical study by Girdenis & Mazhiulis (1994), we have 68% for the Lithuanian-Latvian pair and 70% for the Russian-Macedonian pair, the two most lexicostatistically distant Slavic languages, whereas the average inter-Slavic lexicostatistical distance normally oscillates around 75%. We should also take into consideration possible historical contacts between Proto-Latgalian-Latvian and Proto-Lithuanian-Samogitian throughout their history, which would further decrease the figure for the Baltic languages to about 62% because of possible mutual borrowings. This leads us to the conclusion that the Baltic group has many internal differences and is generally a little older than the Slavic group.
As to the Balto-Slavic lexicostatistical relatedness, we have an average of 46% for Lithuanian vs. Slavic [Girdenis, Mazhiulis] and an average of 42% for Latvian vs. Slavic, or ~44% on average. This yields a 62–44 ~18% difference between the hypothetical lexicostatistical depth of the Baltic and Slavic groups.
I've also conducted my own lexicostatistical study using an unconventional local wild flora/fauna list (81 lexemes) which is supposed to be much less affected by loanwords due to the high stability of this type of basic lexis (see Balto-Slavic Lexicostatistics (in Russian)). This flora/fauna list yielded the following percentages:
(*The Prussian percentage should be decreased by a small amount because of the 600-year difference from a hypothetical "Modern Prussian", but that wouldn't affect the final outcome to any significant extent.)
Finally, the above figures partly corroborate the calculations of the present preliminary study (the 46-list):
Herein, we have [(78 + 76 + 68)/3] - [(67 + 66 + 64)/3] ~ 8% difference between Baltic and Slavic (Russian). The smaller difference should be attributed to the much higher stability of the 46-list.
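The arithmetic here is easy to verify. The two lists below simply hold the six percentages quoted in the formula; the variable names are illustrative and say nothing about which language pairs the figures belong to.

```python
from statistics import mean

# The six percentages quoted in the formula (46-list).
first_group = [78, 76, 68]
second_group = [67, 66, 64]

difference = mean(first_group) - mean(second_group)

print(f"mean of first group:  {mean(first_group):.1f}%")
print(f"mean of second group: {mean(second_group):.1f}%")
print(f"difference:           {difference:.1f} percentage points")
```

For comparison, the figures from Girdenis and Mazhiulis cited earlier give 62 - 44 = 18 points, so the gap in the 46-list is roughly half as large.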
Consequently, the difference within the Baltic languages is a little greater than normally assumed, whereas the difference between Slavic and Baltic is less than normally assumed, which makes Balto-Slavic a statistically reasonable grouping. It is true, however, that the Slavic languages cannot be directly included in the Baltic group as a subgroup; that would be going too far. Rather, they seem to have separated much earlier than most Baltic languages.
Hypothesis: Balto-Slavic related to Germanic?
First of all, it should be noted that the traditional four-corner scheme of Iranian languages (Northwest, Southwest, Northeast, Southeast Iranian) hardly holds true in the perspective of contemporary accurate lexicostatistical studies. The Iranian languages are an extremely complex branch of IE languages with a glottochronological and historical depth of at least 3000 years, similar in this respect to the Balto-Slavic branch, but more numerous and spread over a highly geographically differentiated territory.
Most West Iranian subgroups are closely related (cf. Modern Persian/Kurdish ~ 80%). The close proximity of the West Iranian languages can easily be explained by recalling that they are in many ways similar to Romance -- they result from the expansion of the Median and Persian Empires since c. 800-600 BC. Although the Median Empire and the unattested Median language are sometimes linked to Kurdish (without any clear arguments), the present research rather shows that Kurdish is much closer to Persian.
Consequently, the linguistic urban legend of Kurdish being related to a semi-legendary unattested language of Media cannot hold true, although this may be true of Zazaki, Mazandarani and some other Northwest Iranian languages, which evidently exhibit many differences from the languages that descend from the Persian Empire.
The group average lexicostatistical depth of about 60% indicates that the East Iranian languages have been hiding around the Pamir and Hindu-Kush mountains probably since about 1000-1500 BC, branching off into several subgroups shortly after the period of separation of the whole Indo-Iranian supergroup. As a result, they can be regarded as a complex and probably even rather independent taxon of Indo-Iranian languages.
Cf. Ossetic ærtæ, Shughni aráy (three)
Wakhi, a language located in the Hindu-Kush mountains, just across the ridge from Burushaski, is normally thought to be "Pamir", but differs from other Pamir languages in many respects. It exhibits less East Iranian lenition (cf. trui "three"; sha:d "six"; ðaGd "daughter"), and possesses certain archaic lexemes and innovative phonological formations (suk "we"; bu "two"; pazuv "heart"; naghd "night" cf. Av. xshap; naxtu), which demonstrates the archaic character of Wakhi. It probably separated early and has been isolated from the rest of the Iranian languages for a long time.
Yagnobi (Yaghnobi), or Neo-Sogdian, spoken only in a few villages in Tajikistan, is presently strongly contaminated by Tajik even in basic lexis, which creates many difficulties in lexicostatistical studies. However, it should be noted that there is no evidence that it is particularly close to Ossetic, as is assumed in the Northeast-to-Southeast textbook classification.
Ormuri and Parachi have been excluded from the calculations due to insufficient material, yet there are reasons to believe their separation from other Iranian subgroups is quite ancient.
Ossetic is one of the most famous offshoots of Proto-East-Iranian that must have separated quite early on (not later than 700 BC judging from historical assumptions). Among other features, it is characterized by an extensive metathesis:
ærtæ < *tere (three),
and further lenitive changes (*p > f):
fêd < *ped (father)
The lexicostatistical relatedness of Modern Persian to East Iranian (62%) is nearly the same or just slightly greater than among East Iranian languages to each other (58-59%), which means that all Iranian languages separated from the common Iranian stem almost simultaneously, and if the East Iranian languages constituted a genetic unity, it was only for a relatively short time.
Khotanese is a historically important and well-attested Iranian language of the Tarim Basin (Taklamakan Desert) c. 500-700 AD, but it is almost completely forgotten in most classical Indo-European studies. It probably has nothing to do with the ancient Sakas, but the name stuck and is unlikely to change. For all practical purposes, we could think of Khotanese (in the south) and Tumshuquese (in the north) as the "Taklamakan" Iranian languages, not "Sakan"; at least this name would be more self-explanatory. There also existed several other languages of this branch, although they are poorly attested.
The Khotanese/Avestan lexicostatistical relatedness of ~ 73% corresponds to a glottochronological separation of about 2000 years prior to the mean dating of Khotanese (600 AD) and Avestan (600 BC), that is c. 2000 BC. This separation depth matches the average relatedness of Khotanese / Modern Iranian languages = (66 + 64 + 59 + 60 + 56 + 62 + 54) / 7 ~ 60%, which should be adjusted by a coefficient of about 0.9 to correct for the early dating of Khotanese (500-700 AD), thus yielding ~54%, or again about 1800 BC.
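The averaging and the correction for the early dating of Khotanese can be checked directly. The sketch uses the seven percentages and the 0.9 coefficient quoted in the text; nothing else is assumed.

```python
from statistics import mean

# Khotanese vs. seven modern Iranian languages, percentages as quoted in the text.
khotanese_vs_modern = [66, 64, 59, 60, 56, 62, 54]
EARLY_DATING_COEFFICIENT = 0.9  # correction factor given in the text

raw_mean = mean(khotanese_vs_modern)
corrected = raw_mean * EARLY_DATING_COEFFICIENT

print(f"raw mean over {len(khotanese_vs_modern)} languages: {raw_mean:.1f}%")
print(f"after the 0.9 correction: {corrected:.1f}%")
```

Note that the quoted sum only yields about 60% when divided by the number of values listed, which is seven.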
This means that Proto-Saka could have separated from other Iranian languages at a very early stage, probably as early as the period of existence of the Oxus civilization; therefore it should be regarded as a separate Iranian group, which is also phonologically corroborated by the lack of East Iranian lenition, and geographically, by the great distance from the West Iranian languages.
As to the Iranian languages in general, they do share many often unique lexical, semantical and phonological innovations which prove the existence of a rather long historical period of common Proto-Iranian state.
These lexical items can be seen as typically Iranian, indicating that Proto-Iranian has existed as a single unity for a time long enough to produce local innovations even in a short 46-list.
The position of Armenian is seen herein as highly controversial, and its discussion has been excluded from the present notes. It may very well be related to Indo-Iranian languages, not Proto-Greek, as many people assume.
Nuristani, Dardic and Indo-Aryan
Kashmiri does not seem to be Dardic, and was herein included into the mainstream Indo-Aryan subgroup.
Kati kor; Kalasha ka; Skr. karNa; Hindi kan; Sinh. kana (ear) (probably akin to Av. karana 'side, flank') (as opposed to Av. gaosha; Pers. gush, Pashto Gvazh; Yaghnobi Gu:sh; Shugni ghox) (semantic innovation).
Kati radur; Kalasha rat; Skr. ra:tra; Hindi ra:t; Sinh. <raeya> (night) (akin to Lith. /ri:ta/ 'morning'; Pers. ruz 'day') (as opposed to Av. xshap; Pers. shab; Pashto shpa; Yaghnobi xishap; Wakhi naGd). (semantic innovation)
Again, as in the case of Proto-Iranian, from the great number of shared features within a short word list, we can deduce that Proto-Nuristani-Dardo-Indo-Aryan (Proto-Indic) had a prolonged period of separate existence, at least 2000 years long.
(1) Av. âf-sh, ap; Pr. âb; Pashto obê; Yaghnobi op; Wakhi yupk; Kami oa, op; Skr. a:paH > paniya (?); (akin to Lith. /upe/ 'river') (a semantic innovation, which probably arose because water was closely associated with rivers in desert Central Asian regions); it's more likely, however, that this lexeme is only present in Iranian, whereas its appearance in Indic is recent, also cf. Sinh. <vatura>.
As we have seen, the changes in Proto-Indic and Proto-Iranian are pronounced and they share few common innovations, which indicates that both languages have existed separately from each other for a considerable amount of time and no longer have much in common (~40% in the present study). Glottochronologically, from the considerations of the present study, they could have separated c. 3000-3500 BC, which is about 1000-1500 years earlier than usually assumed. That would mean that the Proto-Indo-Iranians entered Central Asia soon after 4000 BC (see Map of Indo-Iranian Migrations), quickly migrated along the Oxus valley, and reached the Hindu-Kush and Pamir mountains, where early Proto-Indic completely separated by 3000 BC, getting lost among the mountain ridges, and then stayed there with some internal differentiation until c. 1700 BC, when the Indo-Aryan languages descended from this stock finally began to migrate to northern India. Although this is not shown in this study, it's also plausible to assume that the Indo-Aryan languages were essentially a few Dardic languages that expanded into the Indian subcontinent.
On the other hand, in the present study, only real languages of the same period are contrasted, which helps to uncover the lack of common lexical background and shifts the Indo-Iranian separation further back in time. Similar difficulties in finding the common Indo-Iranian proto-state were also noted in other lexicostatistical studies of modern languages, first by Dyen, Kruskal and Black (1992), who complained about the "absence in the present classification of an Indoiranian group", and then by Ringe et al. (2005).
The current study confirms the early separation of Hittite. This is evident from the following consideration. The comparison of Hittite to Latin (52%) (attested c. 100 BC), Attic Greek (51%) (400 BC), Avestan (50%) (600 BC), and Sanskrit (~50%) (400 BC) renders nearly equal results, which means that Hittite seems to be equidistant from the other Indo-European groups.
Since Hittite is dated to c. 1600 BC, there would be even fewer matches if it had lived 1300 years longer to see Latin and Avestan; therefore, this average figure of ~50% should be further reduced to about 45% of relatedness to most Indo-European groups. This is a greater difference than Latin/Greek (67%), Greek/Avestan (57%), or Latin/Sanskrit (~57%). Glottochronologically, that figure would translate to about 5000 years before Latin/Greek/Avestan/Sanskrit (c. 300 BC), or circa 5300 BC (see below).
Therefore, we repeat the conclusion that the Anatolian group should be regarded separately from the mainstream Indo-European languages, and support the hypothesis of Indo-Hittite (Indo-Anatolian).
Attempting to date Proto-Indo-European
One of the common reasons for the criticism of glottochronology is the alleged insufficient lexicostatistical distance between Modern Icelandic, Modern Armenian, and Modern Georgian and their respective old languages. However, the critics of glottochronology seem to ignore the law of large numbers, which states that even if some of the languages deviate considerably from the mean in their phono- and lexicostatistical behavior, the arithmetic average over a large number of languages would be relatively stable and reliably correct. On the other hand, the probability of running into languages with considerable deviation from the mean would be rather low, while, in many cases, the non-conformity of such deviant languages may be explained and even consistently predicted using various ad-hoc assumptions, such as geographical and linguistic isolation on a distant island (as in the case of Icelandic) or in the mountains (as with Armenian and Georgian).
However, we will try not to use any ad-hoc assumptions herein. The law of large numbers should be enough to establish the temporal position of PIE. Here is what we'll do. We will (1) calculate the mean over the Indo-Aryan languages, (2) calibrate the rest of the list using the obtained lexicostatistical depth set to 1600 BC, the archaeologically attested date of the Indo-Aryan invasion into India, (3) calculate the mean for all of the Indo-European groups, and (4) finally convert that number into an approximate date in years using the aforementioned calibration date.
The mean percentage of separation among Nuristani-Dardic (excluding the unreasonably deviating Khowar), Sinhalese, and Hindi-Kashmiri seems to converge at an average depth of about 56% [(54 + 57 + 50 + 60 + 62 + 50 + 64 + 59 + 52) / 9 = 56], which should correspond to circa 1600 BC judging from the archaeological and historical record (also see Map of Indo-European Migrations).
Hence, from the logarithmic glottochronological formula, we have:
After some calculations, that would produce the following calibrated glottochronological row:
This glottochronological row seems to be more or less consistent with the following historically attested facts and plausible assumptions:
(1) (very approximately) with the dating of Proto-Celtic (Irish/Welsh ~ 52%, or 57% as corrected for synonyms -- see above), that is c. 1500 BC, to the Early Bronze Urnfield culture (1300-750 BC); note that the latter figure is a single value, not a mean, and therefore may contain statistical errors; nor do we know the exact dating of Proto-Celtic;
(3) with the Baltic (~75%) expanding around 200 AD, just a little earlier than late Proto-Slavic (c. 450-500 AD), because the lexicostatistical distance between Lithuanian and Latvian in a more accurate lexicostatistical study by Mazhiulis (1994) using Swadesh-200 is just slightly greater than among the Slavic languages to each other, whereas the period of the Proto-Slavic split is probably historically datable to 400-500 AD; therefore, 0-200 AD seems plausible for the separation date of Lithuanian and Latvian-Latgalian.
(4) with the likely separation of Proto-Zazaki from Persian (70-75%) soon after the end of the Old Persian period (300-500 BC);
(5) with Ossetic separating from other East Iranian at 60% (~1100 BC). This dating looks right, because the Scythian languages are attested in the Caucasus just circa 800 BC, and their migration from the Pamir had to be relatively fast (occurring within centuries or even decades) because they used horse-based technology.
(6) with the existence of the BMAC civilization along the Oxus river during 1800-2300 BC, which should probably be attributed to the Proto-Iranian state, apparently located along the Oxus as well (see Map of Indo-Iranian Migrations). According to the present calculations, the era of Proto-Iranian would roughly correspond to 60-40% thus embracing the period from 1100 to 3000 BC, which includes the period of the BMAC as a subset.
(7) with the diversification of West Iranian languages (70%, 200 BC) after the fall of the Persian Empire by 330 BC.
Now that we have calibrated the glottochronological row, we can use the figures for the Greek/Avestan (57%), and Latin/Sanskrit (~57%) relatedness to place the upper limit for Proto-Indo-European at 3500 years before the average dating of Avestan (600 BC), Sanskrit (400 BC), Latin (100 BC), and Greek (400 BC), or circa 3900 BC.
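The step from 57% to a calendar date can be sketched as follows. As before, the one-parameter logarithmic model and the 3600-year calibration span (1600 BC counted back from about 2000 AD) are assumptions of the sketch; the attestation dates and percentages are those quoted in the text.

```python
import math
from statistics import mean

CALIB_PERCENT = 56  # Indo-Aryan mean from the text
CALIB_YEARS = 3600  # assumed: 1600 BC counted back from about 2000 AD


def years_before(percent):
    """Separation depth in years under the assumed logarithmic model."""
    return CALIB_YEARS * math.log(percent / 100) / math.log(CALIB_PERCENT / 100)


# Attestation dates quoted in the text (negative = BC).
attested = {"Avestan": -600, "Sanskrit": -400, "Latin": -100, "Greek": -400}
reference = mean(list(attested.values()))

split_year = reference - years_before(57)

print(f"mean attestation date: c. {abs(reference):.0f} BC")
print(f"57% -> about {years_before(57):.0f} years earlier, i.e. c. {abs(split_year):.0f} BC")
```

Rounded to the nearest century, this gives a depth of about 3500 years and a date close to 3900 BC.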
The conclusion is that PIE must have separated into early Indo-European dialects by circa 4100 BC, which is in rather good correspondence with Gimbutas' theory.
Does any of this agree with other models?
See Vaclav Blazhek, On the internal classification of Indo-European languages: survey (2005)