Ancient history of Indo-Europeans

Eastern Iranic people

Discussing the Lexicostatistical Comparison
of the Main Indo-European Language Groups

The methodology and preliminary outcomes
of the manual lexicostatistical analysis
performed on the Indo-European 46-wordlist (draft notes).



How lexemes were counted

The question is not so much how the lexemes in the table were counted as whether a standardized, uniform and unbiased method of computation was maintained throughout the dataset. Initially, the plan was to feed the table into a program that would automate all calculations, but the Python program prepared by a friend of mine was never finished, so as of 2008 I simply had to do all the calculations manually. Think of it as a quick-and-dirty, preliminary analysis.

[In 04.2009, the ASJP group published the results of a theoretically similar but much better-resourced study using a 40-item list that covered most languages of the world. They say they began to move along the same lines c. 01.2008, but this page was published online in 10.2008, so it was an independent move. (added 06.2009)]

Again, it should be emphasized that uniformity is what mattered here, because the percentages have no meaning as absolute values; they only have meaning when compared to each other within this particular 46-list. So if someone repeats some of the counting and gets a different value for a pair of languages, keep in mind that it is only the relation between any two values that counts.

Similar considerations apply to possible statistical errors. As in all statistical studies, occasional random errors tend to average out and should not affect the final outcome (the law of large numbers). This is what makes statistical methods so robust. Even if one finds some false cognates or loses some of the right ones, that would not impact the result, as long as one consistently maintains the same method of counting throughout the study. It may not even be relevant whether one uses too strict or too lax a method of counting, because all one gets in either case is an overall negative or positive offset that is effectively canceled out when the values are compared to each other globally.
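The cancellation argument can be illustrated with a small Python sketch. It takes the six 46-list percentages for Baltic and Russian quoted further down this page and applies the same bias to every pair. The constant offset of 5 points is a purely hypothetical assumption (a real counting bias need not be uniform), and the function names are illustrative only.

```python
# 46-list percentages quoted on this page for Baltic and Russian.
scores = {
    ("Lithuanian", "Latvian"): 78,
    ("Lithuanian", "Old Prussian"): 76,
    ("Latvian", "Old Prussian"): 68,
    ("Lithuanian", "Russian"): 67,
    ("Latvian", "Russian"): 66,
    ("Old Prussian", "Russian"): 64,
}


def ranking(table):
    """Return the language pairs ordered from most to least similar."""
    return sorted(table, key=table.get, reverse=True)


def shifted(table, offset):
    """Apply one and the same counting bias (in percentage points) to every pair."""
    return {pair: value + offset for pair, value in table.items()}


strict = shifted(scores, -5)  # hypothetical: a uniformly stricter counter
lax = shifted(scores, +5)     # hypothetical: a uniformly laxer counter

# The order of the pairs, which is what the grouping relies on, is unchanged.
print(ranking(strict) == ranking(scores) == ranking(lax))
```

The sketch only covers a bias that is the same for every pair; a bias that varies from one language pair to another would not cancel out in this way.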

As a result, it is largely irrelevant which method of counting was used. However, as a digression, I can say that I have not relied solely on the commonly accepted method of cognate counting (as, for instance, Dyen et al. have). This classical method was enhanced by considering the actual phonological similarity of each pair of possible cognates.

The commonly accepted method of counting cognates in Swadesh lists is based on the presumption that the genetic relation of two lexemes, and ultimately of two languages, can be shown by demonstrating that individual pairs of words are related to each other, and then counting up the number of such matches in a sufficiently long list. But what if we use a short list? That would make this method susceptible to the following error in reasoning. Suppose we compare English, German and Armenian:

English           German            Armenian
<two>, /tu:/      <zwei>, /tsvai/   yerku
<three>, /θri:/   <drei>, /drai/    yerek'
<four>, /fo:r/    <vier>, /fi:r/    chors

The relation of the Armenian numerals is thought to have been demonstrated by the method of regular correspondences; therefore, in the classical method the genetic relation across these three columns is exactly the same, and we should count all three rows as cognates. However, it is quite obvious that the English/German pairs stand much closer to each other, whereas Armenian looks like something completely different. This happens because we put the theory before the facts, overestimating theoretical analysis instead of looking at the facts of phonology as they are. The facts demand that we take notice of the actual phonological similarity of lexical pairs and compare words phoneme by phoneme rather than word by word. Consequently, while /tu:/ and /tsvai/ look related, /yerku/ does not, and should not be regarded as an exact match for the former two lexemes.

Curiously, the mathematicians Serva & Petroni already attempted a letter-by-letter comparison in 2007-2008 (no cognates at all!) using Dyen's database, and obtained fairly consistent (though not completely watertight) results, despite the obvious straightforwardness and simplicity of their method.
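To give an idea of what a letter-by-letter comparison looks like in practice, here is a minimal Python sketch based on the edit (Levenshtein) distance divided by the length of the longer word. This is an assumption about the general kind of measure involved, not a reproduction of Serva & Petroni's actual procedure, and the simplified transcriptions ("for", "fir", "chors" for the words for 'four' above) are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to the range 0..1 by the length of the longer word."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


# Simplified transcriptions of 'four': English, German, Armenian.
for left, right in [("for", "fir"), ("for", "chors"), ("fir", "chors")]:
    print(left, right, round(normalized_distance(left, right), 2))
```

On these three words the English/German pair comes out much closer than either of them does to Armenian, which is the intuition described above. Such a crude measure knows nothing about regular correspondences, so it should be read as a rough illustration rather than as a substitute for the manual procedure.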

There's also a good survey of computational methods in historical linguistics by Agarwal & Adams (2007).
This point is so important, it deserves a separate article, but we'll now have to leave it aside.

Briefly speaking, the counting in the table was done according to the following manual procedure of cognate comparison:

1          Phonologically similar cognates with at least one very similar stable phoneme (mostly, an initial consonant), as in /t/:/ts/ in <two>:<zwei>.
0.5        Obviously phonologically dissimilar cognates, with no clearly stable phoneme in common, as in Eng. /θri:/ : Shugni /ari/ -- here the initial *th is lacking, whereas /r/ is not enough to make the two lexemes look "sufficiently similar".
0.5-0.2    Probable unproven cognates with some phonological similarity, as in Pashto /xul/ : Shugni /Gaiv/ 'mouth', or in other uncertain cases. These cases were rather rare, and could not statistically affect the results.
0.5        If one of the contrasted languages has a second synonym with a different root, which is not explained in quotes as having a different meaning, as in Old Eng. <wind> : Welsh <gwynt>; <avel>.
0.3-0.2    Highly dissimilar cognates or complex systems of possible cognates (more than 3 lexemes) with several synonyms that had to be cross-compared, as in Sanskrit <putraH>; <sut> : Latin <filius>; <puer>, unless these possible cognates seemed to clearly coincide phonologically or have previously established cognates in all cases.
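The procedure above can be sketched in Python as follows. The sketch assumes that the relatedness percentage of two languages is simply the sum of the per-lexeme scores divided by the number of compared items; this page does not spell that step out, so treat it as an assumption. The function name and the toy scores are hypothetical and are not taken from the actual table.

```python
from typing import Iterable


def relatedness_percent(item_scores: Iterable[float]) -> float:
    """Turn per-lexeme scores (0 = no match ... 1 = clear cognate) into a percentage."""
    scores = list(item_scores)
    if not scores:
        raise ValueError("at least one compared item is required")
    if any(not 0.0 <= score <= 1.0 for score in scores):
        raise ValueError("each score must lie between 0 and 1")
    return 100.0 * sum(scores) / len(scores)


# Hypothetical six-item comparison: two clear matches, one dissimilar cognate (0.5),
# one doubtful multi-synonym case (0.3) and two non-matches.
toy_scores = [1, 1, 0.5, 0.3, 0, 0]
print(round(relatedness_percent(toy_scores), 1))
```

Fractional scores are what distinguish this from a plain yes-or-no cognate count: a doubtful pair contributes something, but less than a clear match.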

Consequently, a lexicostatistical distance matrix was obtained that will be used in the further discussion.

At this point, I can hear someone cry out, "Aha! Mass comparison!". You can name it as you please, although "mass comparison" should normally refer to churning out uncontrolled data, whereas I attempted to maintain strict control throughout the study. If you do not believe me, you can do your own calculations using your own set of rules -- it doesn't take much time, actually (a matter of days), because the list is so short. Basically, all I did here was use probabilistic logic, adding "maybe" to the much too inflexible true-or-false classical logical approach, as well as developing a very simple system to count synonyms.

Moreover, the following comments should be taken into consideration:

(1) In case of doubt, calculations were done several times until stable results were obtained, or the results were statistically averaged over several calculations.

(2) If calculation outcomes were much too uncertain, they were temporarily aborted (as in the case with positioning Balto-Slavic and Armenian).

(3) The outcomes are not just the result of dead reckoning, they are supported by additional considerations, such as showing which particular innovations were produced within a particular genetic group, or even adding some material from outside the table. So they should rather be thought of as a preliminary indicator where to look, not the final diagnosis.

(4) Averaging over many language groups makes a result particularly stable, as in the case of the 35-36% figure for the relatedness of the most distant main IE groups to each other (excluding Hittite). On the other hand, comparisons of just two individual groups (e.g., just Irish and Welsh) are particularly prone to statistical errors, as long as no additional data are taken into account.

(5) The exact relatedness value in many cases is simply irrelevant, because this is just lexicostatistical research meant to help regroup the IE groups, not necessarily to produce exact numbers for further glottochronological analysis. For instance, it might not matter here whether you get 60 or 75 for the relation of Baltic to Slavic, as long as you do not plan to do any glottochronology: you simply become convinced these groups are quite close, and that is that.

(6) The table was designed to be analyzed by a computer program using a special algorithm described elsewhere, so these manual results should be seen as preliminary (=just re-stating the obvious).

From distance matrix to dendrogram

There are many mathematical methods of cluster analysis and computational phylogenetics, developed mostly for biological and financial applications. These methods were tested on an IE dataset by Ringe, Warnow, et al. in a series of articles (2002-2005), and... well, none of them seems to work properly in their papers. Actually, the problem is not in the cluster analysis; the problem is in how you count your words. If you have built the correct distance matrix, everything works almost automatically, but no clustering method can save a flawed analysis if one compares languages in the wrong way. Besides, building a tree for a relatively small number of language groups is hardly a computational problem, and can be done manually in many cases.
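As an illustration of how little computation a small tree requires, here is a minimal Python sketch of average-linkage clustering (UPGMA) working directly on similarity percentages. UPGMA is chosen here only as one simple, well-known method; it is not claimed to be the algorithm intended for this study. The input figures are the 46-list percentages for Baltic and Russian quoted further down this page, and the function name is hypothetical.

```python
from itertools import combinations


def average_linkage_tree(names, similarity):
    """Repeatedly merge the two most similar clusters; return the tree as nested tuples."""
    sims = dict(similarity)               # frozenset({x, y}) -> similarity in percent
    sizes = {name: 1 for name in names}   # cluster -> number of languages inside it
    while len(sizes) > 1:
        a, b = max(combinations(sizes, 2), key=lambda pair: sims[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Similarity to the merged cluster is the size-weighted average.
            sims[frozenset((merged, other))] = (
                size_a * sims[frozenset((a, other))]
                + size_b * sims[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


names = ["Lithuanian", "Latvian", "Old Prussian", "Russian"]
similarity = {
    frozenset(("Lithuanian", "Latvian")): 78,
    frozenset(("Lithuanian", "Old Prussian")): 76,
    frozenset(("Latvian", "Old Prussian")): 68,
    frozenset(("Lithuanian", "Russian")): 67,
    frozenset(("Latvian", "Russian")): 66,
    frozenset(("Old Prussian", "Russian")): 64,
}
print(average_linkage_tree(names, similarity))
```

With these figures the sketch joins Lithuanian and Latvian first, then adds Old Prussian, and attaches Russian last. The result depends entirely on the input matrix, which is the point made above.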

Error margin

Again, since these are just manual calculations, the error margin for standalone pairs, not contrasted to other groups, can be rather high (+/-7%). This figure follows from the statistical fluctuations in the distance matrix that were observed when the same calculations were repeated under different subjective biases, or when different languages of the same group, which supposedly have the same phylogenetic depth, were compared to a particular language (for instance, Hindi or Persian to all of the Iranian languages).

But whenever a comparison value is averaged over a large number of languages with the same phylogenetic depth, we obtain much more statistically stable results. Hence, the value of 35-36% for all of the IE languages (outside Anatolian) may have a much lower statistical error margin (no more than +/- 1-2%).
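The drop from +/-7% to +/-1-2% can be checked against the usual rule that the error of a mean over n independent measurements shrinks as 1/sqrt(n). The Python sketch below assumes that the individual comparisons are independent and share the same +/-7% error. That assumption is optimistic, since related languages are not independent samples, and the function names are hypothetical.

```python
import math


def error_of_mean(single_pair_error: float, n_pairs: int) -> float:
    """Error of an average over n independent comparisons with equal individual error."""
    if n_pairs < 1:
        raise ValueError("n_pairs must be at least 1")
    return single_pair_error / math.sqrt(n_pairs)


def pairs_needed(single_pair_error: float, target_error: float) -> int:
    """Smallest number of comparisons whose average reaches the target error."""
    return math.ceil((single_pair_error / target_error) ** 2)


print(pairs_needed(7, 2))   # comparisons needed for +/-2%
print(pairs_needed(7, 1))   # comparisons needed for +/-1%
print(round(error_of_mean(7, 25), 2))
```

Under these assumptions about a dozen comparisons are enough for +/-2%, and roughly fifty for +/-1%, which is the right order of magnitude for an average taken over all the main IE groups.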

More rigorous computations using a computer algorithm and extended lists should reduce the error margin even further.

A lexicostatistical tree of the Indo-European languages


Hypothesis: Albanian related to Celtic

Evidently, there is a profound dissimilarity between the Goidelic and Brythonic languages, which makes "Celtic" a rather deep, archaic grouping, similar in this respect to the Balto-Slavic or Iranian branches. In fact, the lexicostatistical figure of 52% assumed here implies a greater depth than that of the Balto-Slavic group (65%). Consequently, the great depth of the Celtic branch may allow us to add Albanian, with its similar lexicostatistical separation of about 50%, to the Celtic languages.

Is Albanian really Celtic? The similarities between Albanian and the Celtic languages can be directly exemplified by the following lexemes, some of which may turn out to be unique shared innovations:

OIr. uisce, Alb. uje, ujt (water) (obvious phonological similarity), rather opposed to Lat. unda 'wave' with a modified meaning.
OIr. ainm; OW. anu; Gheg emën (name) (apparently, a metathesis from *namen > *anmen, rather unique in Europe)
OIr. <súil> /su:il/; Alb. <sy> /sy/ (eye) (the relatedness to IE "sun" seems doubtful because of semantic differences)
OIr. duille, W. deilen; Alb. <gjethe> /dJeθe/ (leaf), as opposed to L. folium, Gr. fyllon
OIr. carric 'rock'; W. carreg; Alb. gur (stone) (probably akin to Eng. hard, and Toch. B kärweñe 'stone')
W. gwraig /gura'ig/, Alb. grua (wife, woman)
W. w^y; Br. vi; Alb. vezë; ve (egg) (phonologically similar)
OIr. gin; Welsh or Cornish genau, Br. quen, Alb. goje (mouth), as opposed to Latin gena 'cheek', Eng. chin with a different meaning
OIr. athir, Alb. at (father) (a similar development, which may not be coincidental)
Ir. féar, W. gwair; gwellt; Alb. bar (grass)
[Also cf. the regular correspondence of Irish /f/ : W /g/ : Alb. /b/ in Ir. <fear>, OIr. <fer>; OW. <gur>; Alb. burrë (man) (as well as Latin vir; Anglo-Saxon wer, Lith. viras), and OIr. <find>, W. <gwyn>, Alb. <bardh> /barð/ (white)]
OIr. tech; W. <ty^>; Br. ti; Alb. shtëpi (house) (also Gr. stegn 'house', but Eng. thatch; Sanskrit stagati 'to cover' mostly with different meanings) (?)
Mx. shimmey, Alb. shumë (many)
W. <gwddf>, Alb. qaf (neck)
Ir. ur, Alb. ri (new)
Ir. maith, W. mad, Alb. mirë (good)

On the other hand, there are a few unique Goidelic/Brythonic innovations (as long as these matches do not result from subsequent mutual borrowings).

OIr. macc, OW. map (son) (IE root, semantically innovative)
OIr. tene, W. tân (fire) (IE root, semantically innovative)
OIr. lám, OW. lau (hand) (akin to Latin palma, Anglo-Saxon folm 'palm', typically Celtic, but not unique)
OIr. carric, OW. carrec (stone) (IE root, rather semantically innovative, but also cf. Alb. gur)

Note that the percentage of Irish to Welsh may be a little lower than the actual one, because of the greater than usual number of dialectal (?) synonyms within the Welsh dataset, which are herein calculated as 0.5-0.3 per lexeme, so we might expect the corrected figure for the Irish/Welsh relatedness to be a little higher (about 57%?).

Accordingly, this predicts two waves of migration into the British Isles, with Proto-Goidelic being the first to enter, and the Brythonic subgroup being a result of relatively recent migration from the Continent. Proto-Brythonic and Proto-Goidelic must have separated a long time ago somewhere in northwestern or central continental Europe.

As to Italo-Celtic, the current study is not sufficiently detailed either to completely exclude or to corroborate the possibility of an Italo-Celtic grouping; rather, we see it as a possible but unlikely, and in any case very short-lived, state within the European Centum branch. There seem to be no specific Italo-Celtic shared innovations in the 46-list, except for the typical Celt. ni : Lat. nos (we), which is also attested in other Indo-European groups, and is not unique.


Hypothesis: Italic related to Hellenic

These two groups seem to have very much in common (herein ~69%), which should not be surprising, since the close proximity of Attic Greek to Latin has been well known since antiquity. Consider the following phonological and semantic similarities from the 46-table:

(Latin and Greek are transcribed phonetically):

L. duo; Gr. dyo (but W. dau; OE. tva; Pruss. dwai; Lith. du)
L. kwattuor; Myc. Gr. kwetoro (but OIr. cethir; Alb. katër; OE. feower; Lith. keturi)
L. ego; Gr. ego: (but OE. ik; Toch. s; W. i; Alb. unë)
L. pes; Gr. pu:s (foot)
L. noks; Gr. nyks (but W. nos; Alb. nat; OE. niht)
L. humus (ground); Gr. *xamos, xamai 'on the ground'
L. folium; Gr. fyllon (leaf)
L. frater; Gr. phra:ter (brother) (but br- in most other IE languages, Sanskrit "bhra:taH")

L. lupus; Gr. lykos (wolf) (a similar loss of initial v-, which was rather unique among other IE groups)
L. petra; Gr. petros (stone) (as opposed to Ir. cloch; Alb. gur; Toch. B kärweñe; OE. sta:n)
L. domus; Gr. domos (home)
L. rivus; Gr. rheos (river)

In all of the above instances we observe close phonological and semantic proximity that can be explained by assuming a genetic unity of the Italic and Hellenic languages. This also makes sense from the geographical perspective: one of the few feasible passages to the Italian peninsula goes through the southern Balkans and northern Greece, so a geographically realistic way for Proto-Italic to form was by separation from Proto-Hellenic at some point in time.

However, the lexicostatistical proximity of only about 45% between Modern Greek and the modern Romance languages (such as Spanish), as compared to an average of about 40% among other modern European Centum languages, indicates that the Italo-Hellenic proto-state was rather short-lived and unstable.


Hypothesis: Germanic related to Tocharian

An even more interesting finding may be a possible proximity of Proto-Germanic to Proto-Tocharian (Old Eng. : Toch. B ~ 65%; German : Toch. B ~ 59%). This observation deserves further investigation:

OE. wæter; Toch. A. wär; Toch. B. wer < *wat'er (?) (but Ir. uisce; W. dwr; Gr. hydo:r; Lith. vanduo)
Goth. swistar; Toch. A. s'ar; Toch. B s'er <*set'er (sister) (the same loss of aspirated intervocalic -t'-)
Goth. weis; Toch. B. wes (we) (also, at least Lith. vedu 'we two' and OCS ve 'we two', but not as 'we' in the phonological form of *weis, and not in the European Centum languages)
Goth. hairto; Toch. B. arañce <*harnte (?) (heart)
Goth. waurts; German Wurzel; Toch. A witsako (root)
German Blatt; Toch. A pält; Toch. B pilta (leaf, blade) (this root is also persistent in the Indo-Iranian branch)
German Stamm 'stem'; Toch. A, B stām (tree)
Goth. waurms; Toch. A wal (worm) (but also L. vermis (with a full ending); Gr. rhomos; Ir. cruimh, Alb. krimb, Pruss. <Girmis>)

Consider also the strong aspiration in t'-, which led to a transformation t' > ts (not necessarily palatalization, as normally explained):

Toch. B mācer (mother); Toch. B pācer (father); Toch. B tkācer (daughter); Toch. B. kuce (who)

Tocharian k- can be explained as a strongly aspirated t' > tk' > k' > k. (Apparently, the digraph <tk>, as preserved in Toch. A tkam (earth), Toch. A ckācar and Toch. B tkācer, marks the result of this aspiration.)

Here's a short lemma that attempts to prove a regular correspondence between Proto-Tocharian *ka- and *tV- in the European Centum languages:

Toch. A kam; Toch. B keme, hence Proto-Toch. *kam < *tham (tooth)
Toch. A kantu; Toch. B kantwo, hence Proto-Toch. *kantwo <*thank'wo (with a metathesis) (tongue)
Toch. A kom.; Toch. B kaum., hence Proto-Toch. *kaum < *thaum (day, sun)
Toch. A tkam.; Toch. B kem., hence Proto-Toch. *kam (tkam) <*tham (earth, cf. Lat. tellus, OIr. ti:r)
Toch. A karke; Toch B kara:k, hence Proto-Toch. *karak < *tharak, *tharakh, *tharah (tree branch)
Toch. A kayurs'; Toch. B kaurs'e, hence Proto-Toch. *kaurs'e < *thaurse (bull, cf. Taurus)
Toch. A klyt-r;Toch. B kalt-r, hence Proto-Toch. *kalt-ar < *thalt-ar (s-tand) (?)

[Yet, in some other cases we have k < *k:

Toch. A känt; Toch. B kante, hence Proto-Toch. *kente (hundred)
Toch. A kanwem.; Toch. B keni, hence Proto-Toch. *ken- (knees (du.))]

The former process is possible if the Proto-Tocharian stops were heavily aspirated, hence *ta > *tha > *hha > *kha > *ka before an open /a/, when the dentals were undergoing allophonic lenition. The metathesis in *tankwo occurred precisely under the impact of aspiration, because both *th and *kh were pronounced in a rather similar way at some point, more or less like *hhanhhwo.

The Tocharian aspiration is reminiscent of Grimm's law and the aspiration in the West Germanic languages.

Part of Grimm's law seems to be already in progress in early Proto-Tocharian, since we have *k > *h > 0 in:

Got. dauhtar; Toch. B tkācer (but Gr. thygate:r)
Got. hairto; Toch. B. arañce <*harnte (?)

Other examples of Germano-Tocharian analogies might include:

A. kumn-s'; B. knm-as's'm; Germ. kommen (come) [cf. Skt. gamati "he goes", Avestan jamaiti "goes", Lith. gemu "to be born", Gk. bainein "to go, walk, step", L. venire "to come", which do not have the same semantic and phonological form as in Germ. and Toch.]
B. s'ayye; Germ. Schaf (sheep) [no known cognates outside Gmc.; the more usual IE word for the animal is the one continued by Eng. ewe.]

It should not be particularly surprising that the Proto-Tocharians wandered as far as the Takla-Makan desert: remember that there was a massive Gothic migration to the Crimean peninsula and the rest of Europe about two thousand years later. The Indo-Europeans used horses, and the vast Ponto-Caspian and Central Asian steppes allowed for distant migrations across Eurasia.

By no means do I insist on a Proto-Tocharian/Proto-Germanic unity; at this level it is just a tentative hypothesis, which follows from the data under consideration, but which is rather poorly demonstrated herein.


The Balto-Slavic unity, well-proven

The close proximity of the Baltic and Slavic languages (herein 65%) is well supported by many other studies (Dyen (1991), Ringe (2005)), including some articles you can find at this site. You can also easily see a number of shared Balto-Slavic lexical innovations in the present 46-lexeme list:

Pruss. ranko; Lith. ranka; Latv. roka; OCS ro~ka [nasal]; Russ. ruka (hand, arm)
Pruss. nage; OCS noga (foot, leg)
Pruss. zwaigstan (or rather: swaigstan) 'the shining'; Lith. zhvaigzhde; Latv. zvaigzne; OCS zvezda (star)
Pruss. zirgis 'stallion'; Lith. zhirgas 'horse'; Latv. zirgs 'horse'; Russ. zhere-bets 'stallion'

The close genetic proximity of both groups is evident to anyone familiar with any one Baltic and any one Slavic language. It doesn't really take any research. Some selected words and phrases may not even require translation, and some meanings can be figured out with some effort and a knowledge of the regular correspondences. [Cf. as anecdotal evidence (in phonetic transcription): Lith. "Kaip ash buvu ministru" ("How I Was a Minister", a book by Zinkevicius) and a possible Russian translation, Kak ya (also OCS az and Bulgarian az) byl (also: byval) ministrom; or Lith. Lyja lietus ("Rains (pours) the rain") vs. Russ. Lyot liven' ("Pours the shower/rain").] However, this close relationship should not be oversimplified or overestimated, nor does it mean that Lithuanian or Latvian is directly readable for speakers of Slavic languages, or vice versa.

In my humble opinion, arguments against the Balto-Slavic genetic grouping can only come either from western researchers unfamiliar with these languages or from nationalistically minded Balts who view any relation to Slavic as insulting. This long-standing dispute should finally be closed.

On the other hand, the difference between modern Lithuanian and Latvian seems rather pronounced. According to a lexicostatistical study by Girdenis & Mazhiulis (1994), we have 68% for the Lithuanian-Latvian pair and 70% for the Russian-Macedonian pair, the two most lexicostatistically distant Slavic languages, whereas the average inter-Slavic figure normally hovers around 75%. We should also take into consideration possible historical contacts between Proto-Latgalian-Latvian and Proto-Lithuanian-Samogitian throughout their history, which would further decrease the figure for the Baltic languages to about 62% once possible mutual borrowings are discounted. This leads us to the conclusion that the Baltic group has many internal differences and is generally a little older than the Slavic group.

(Note again that this was the classical Swadesh-200 list used by Mazhiulis! Do not confuse these figures with percentages from other lists!)

As to the Balto-Slavic lexicostatistical relatedness, we have an average of 46% for Lithuanian vs. Slavic [Girdenis, Mazhiulis] and an average of 42% for Latvian vs. Slavic, or ~44% on average. This yields a difference of 62 - 44 = ~18% between the hypothetical lexicostatistical depth of the Baltic and Slavic groups.

Girdenis, Mazhiulis (Swadesh-200, cognates) (1994)
               Lithuanian   Latvian   Old Prussian*   Russian
Lithuanian     -            68%       (49%)           47%
Latvian        -            -         (44%)           45%
Old Prussian   -            -         -               (41%)
(*The data for Prussian are probably unreliable, because there's not enough attested material to fill in a Swadesh-200)

I've also conducted my own lexicostatistical study using an unconventional local wild flora/fauna list (81 lexemes) which is supposed to be much less affected by loanwords due to the high stability of this type of basic lexis (see Balto-Slavic Lexicostatistics (in Russian)). This flora/fauna list yielded the following percentages:

Wild fauna/flora, 81 lexemes, cognates (2008)
               Lithuanian   Latvian   Old Prussian   Russian
Lithuanian     -            64%       67%            48%
Latvian        -            -         58%            46%
Old Prussian   -            -         -              *51%

(*The Prussian percentage should be decreased by a small number, because of the 600-year difference with a hypothetical "Modern Prussian", but that wouldn't affect the final outcome to any sufficient extent.)

Incidentally, that nearly coincides with Mazhiulis' data (again, different lexical lists normally coincide in absolute figures only by accident); hence we have (64 + 67 + 58)/3 = 63% for the average relatedness among the Baltic languages, and (48 + 46 + 51)/3 = ~48% for the average relatedness of Russian to Baltic. Again, we have a ~15% difference between the hypothetical glottochronological ages of the Baltic and Slavic groups in this study.

Finally, the above figures partly corroborate the calculations of the present preliminary study (the 46-list):

The 46-list; cognates with phonological similarity (2008)
               Lithuanian   Latvian   Old Prussian   Russian
Lithuanian     -            ~78%      ~76%           ~67%
Latvian        -            -         ~68%           ~66%
Old Prussian   -            -         -              ~64%

Herein, we have (78 + 76 + 68)/3 - (67 + 66 + 64)/3 = ~8% difference between Baltic and Slavic (Russian). The smaller difference should be attributed to the much higher stability of the 46-list.
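The arithmetic for the three lists can be re-checked with a few lines of Python. The figures are the ones quoted in the text and tables above; the function name depth_gap and the dictionary layout are only illustrative.

```python
from statistics import mean


def depth_gap(within_baltic, baltic_vs_slavic):
    """Average Baltic-internal relatedness minus average Baltic/Slavic relatedness."""
    return mean(within_baltic) - mean(baltic_vs_slavic)


lists = {
    # name: (Baltic-internal figures, Baltic vs. Slavic figures), all in percent
    "Swadesh-200 (Girdenis, Mazhiulis)": ([62], [46, 42]),
    "Wild fauna/flora, 81 lexemes": ([64, 67, 58], [48, 46, 51]),
    "46-list": ([78, 76, 68], [67, 66, 64]),
}

for name, (baltic, slavic) in lists.items():
    print(f"{name}: {mean(baltic):.1f} - {mean(slavic):.1f} = {depth_gap(baltic, slavic):.1f}")
```

The three gaps come out at roughly 18, 15 and 8 percentage points, in line with the figures given in the text.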

Consequently, the difference within the Baltic languages is a little greater than normally assumed, whereas the difference between Slavic and Baltic is less than normally assumed, which makes Balto-Slavic a statistically reasonable grouping. It is true, however, that the Slavic languages cannot be directly included in the Baltic group as a subgroup; that would be going too far. Rather, they seem to have separated much earlier than most Baltic languages.

Hypothesis: Balto-Slavic related to Germanic?

The present lexicostatistical result, that modern Baltic and Slavic are related to modern Germanic at about 50%, sits uneasily with the pronounced satemization in Balto-Slavic. Herein, we have BS/Germanic ~ 50% and BS/Indo-Iranian ~ 35%, which could be due to a lexicostatistical error. The close match may also be attributed to the archaic character of both groups. Nor are there any clear-cut innovations shared by Balto-Slavic and Proto-Germanic in the 46-list. More extensive research on the subject is needed to support or discard this hypothesis.


West Iranian

First of all, it should be noted that the traditional four-corner scheme of Iranian languages (Northwest, Southwest, Northeast, Southeast Iranian) hardly holds true in the perspective of contemporary accurate lexicostatistical studies. The Iranian languages are an extremely complex branch of IE languages with a glottochronological and historical depth of at least 3000 years, similar in this respect to the Balto-Slavic branch, but more numerous and spread over a highly geographically differentiated territory.

Most West Iranian subgroups are closely related (cf. Modern Persian/Kurdish ~ 80%). The close proximity of the West Iranian languages can easily be explained by recalling that they are in many ways similar to Romance: they result from the expansion of the Median and Persian Empires since c. 800-600 BC. Although the Median Empire and the unattested Median language are sometimes linked to Kurdish (without any clear arguments), the present research rather shows that Kurdish is much closer to Persian.

However, there is a longer lexicostatistical distance between Modern Persian and the "Northwest Iranian" languages, such as Zazaki (Dimli), the lesser Northwest Iranian languages (such as Harzani, Semnani, Gorani, Kermanshahi, Sangisari), probably Parthian and Mazandarani (Zazaki/M. Persian ~70-75%). [The results for the lesser languages have been obtained from the consideration of phonological transitions in the numbers 1-10; Mazandarani has its own word list, its numbers 1-10 having apparently been borrowed from Persian.]

Proof: Kurdish is closely related to Persian, Zazaki is not

(1) wolf: Kurdish gur, Pahlavi gurg, Balochi gurkh, Persian gorg, but Avestan varkha, Old Persian varka, Zazaki verk. Herein, we have a very typical post-Old-Persian innovation with an initial g-.
(2) three: Avestan thri > se in most West Iranian, chi in Old Persian, but Zazaki hire.
(3) I: the loss of the historical pronoun azem in many West Iranian languages with its substitution by man, but Zazaki ez
(4) year: Kurdish sal, Balochi so:l, Persian sa:l, but Avestan sared, Old Persian θard, and Zazaki serre with -r-.
(5) heart: Kurdish dil, Balochi dil, Persian del, but Zazaki zerre.

Consequently, the linguistic urban legend of Kurdish being related to a semi-legendary unattested language of Media cannot hold true, although this may be true of Zazaki, Mazandarani and some other Northwest Iranian languages, which evidently exhibit many differences from the languages that descend from the Persian Empire.

East Iranian

This is probably the most complex and controversial group among the Indo-European languages. Having been studied only as late as the 19th-20th centuries, it remains largely unknown to many Indo-Europeanists in the west. For years, researchers have tacitly assumed that there should be nothing in Iranian which can't be found in Avestan, ignoring the many bizarre peculiarities of this family. It was, for instance, poorly represented in Dyen's lexicostatistical research. The group's textbook classification (Northeast to Southeast Iranian) is completely unacceptable and is hardly supported by any linguistic arguments at all. In fact, a closer look reveals a complicated branch with many different sprouts. The present lexicostatistical study, for instance, shows that the actual difference between Russian and Lithuanian might, in fact, be less than that between Wakhi and Shughni, both of which are believed to be "Pamir", or sometimes even called "Pamir dialects".

The group's average lexicostatistical depth of about 60% indicates that the East Iranian languages have been hiding around the Pamir and Hindu-Kush mountains probably since about 1000-1500 BC, branching off into several subgroups shortly after the period of separation of the whole Indo-Iranian supergroup. As a result, they can be regarded as a complex and probably even a rather independent taxon of Indo-Iranian languages.

Cf. Ossetic ärtä, Shughni ary (three)
Pashto lur, Yidga lughdoh; Ishkashimi udoGd (daughter)
Yd. uxsho; Sanglechi khoar; Shughni xo:gh (six)
Pashto le:w; Shughni urj (wolf)

The Pamir languages may form an internal genetic unity with the following three subbranches: (1) Yidgha-Munji; (2) Ishkashimi-Zebaki-Sanglechi; (3) Shughni-Rushan-Sarikoli-Yazgulami. The first two are rather closely related ((1)/(2) ~ 80%), while the third is a little more differentiated ((1)/(3); (2)/(3) ~ 70% on average), with Yazgulami being particularly different. There has also been some speculation relating ancient Bactrian (the language of the Kushan Kingdom) to Yidgha-Munji, but a precise lexicostatistical study of Bactrian is not available due to the lack of lexical material.

Wakhi, a language located in the Hindu-Kush mountains, just across the ridge from Burushaski, is normally thought to be "Pamir", but differs from the other Pamir languages in many respects. It exhibits less East Iranian lenition (cf. trui "three"; sha:d "six"; aGd "daughter") and possesses certain archaic lexemes and innovative phonological formations (suk "we"; bu "two"; pazuv "heart"; naghd "night", cf. Av. xshap; naxtu), which demonstrates the archaic character of Wakhi. It probably separated early and has been isolated from the rest of the Iranian languages for a long time.

Yagnobi (Yaghnobi), or Neo-Sogdian, spoken only in a few villages in Tajikistan, is presently strongly contaminated by Tajik even in its basic lexis, which creates many difficulties in lexicostatistical studies. However, it should be noted that there is no evidence that it is particularly close to Ossetic, as is assumed in the Northeast-to-Southeast textbook classification.

Ormuri and Parachi have been excluded from the calculations due to insufficient material, yet there are reasons to believe their separation from other Iranian subgroups is quite ancient.

Ossetic is one of the most famous offshoots of Proto-East-Iranian that must have separated quite early on (not later than 700 BC judging from historical assumptions). Among other features, it is characterized by an extensive metathesis:

rt < *tere (three),
vzhag < *zevag (tongue),
rvad < *verad (brother),
art < *at(e)r (fire),

and further lenitive changes (*p > f):

fd < *ped (father)
frt < *putr (son)

The lexicostatistical relatedness of Modern Persian to East Iranian (62%) is nearly the same as, or just slightly greater than, that of the East Iranian languages to each other (58-59%), which means that all Iranian languages separated from the common Iranian stem almost simultaneously; if the East Iranian languages ever constituted a genetic unity, it was only for a relatively short time.



Khotanese is a historically important and well-attested Iranian language of the Tarim Basin (Taklamakan Desert) c. 500-700 AD, but it is almost completely forgotten in most classical Indo-European studies. It probably has nothing to do with the ancient Sakas, but the name stuck and is unlikely to change. For all practical purposes, we could think of Khotanese (in the south) and Tumshuquese (in the north) as the "Taklamakan" Iranian languages rather than "Sakan"; at least this name would be more self-explanatory. Several other languages of this branch also existed, although they are poorly attested.

The Khotanese/Avestan lexicostatistical relatedness of ~ 73% corresponds to a glottochronological separation of about 2000 years prior to the mean dating of Khotanese (600 AD) and Avestan (600 BC), that is c. 2000 BC. This separation depth matches the average relatedness of Khotanese / Modern Iranian languages = (66 + 64 + 59 + 60 + 56 + 62 + 54) / 7 ~ 60%, which should be adjusted by a coefficient of about 0.9 to correct for the early dating of Khotanese (500-700 AD), thus yielding ~54%, or again c. 1800 BC.
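The arithmetic in the paragraph above can be checked with a short sketch. The seven percentages and the 0.9 coefficient come from the text; the per-millennium retention of 0.85 is borrowed from the calibration worked out in the dating section of these notes, and the function name is purely illustrative.

```python
import math

# Khotanese vs. modern Iranian languages, percentages as listed in the text.
khotanese_vs_modern = [66, 64, 59, 60, 56, 62, 54]

# Plain arithmetic mean of the seven listed values.
mean_pct = sum(khotanese_vs_modern) / len(khotanese_vs_modern)

# Correction for the early attestation of Khotanese (coefficient from the text).
adjusted_pct = mean_pct * 0.9


def millennia_of_separation(share, retention=0.85):
    """Millennia needed for relatedness to drop to `share`, assuming a
    constant retention per 1000 years (0.85 is an assumption taken from
    the calibration in the dating section)."""
    return math.log(share) / math.log(retention)


print(round(mean_pct, 1))       # 60.1
print(round(adjusted_pct, 1))   # 54.1
print(round(millennia_of_separation(0.73), 2))                # Khotanese/Avestan
print(round(millennia_of_separation(adjusted_pct / 100), 2))  # Khotanese/modern
```

Under these assumptions the two depths come out at a little under 2 and a little under 4 millennia, counted from the turn of the era and from 2000 AD respectively, which is in line with the dates of c. 2000 BC and c. 1800 BC quoted above.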

This means that Proto-Saka could have separated from other Iranian languages at a very early stage, probably as early as the period of existence of the Oxus civilization; therefore it should be regarded as a separate Iranian group, which is also phonologically corroborated by the lack of East Iranian lenition, and geographically, by the great distance from the West Iranian languages.

Iranian in general

As to the Iranian languages in general, they share many, often unique, lexical, semantic and phonological innovations, which prove the existence of a rather long historical period of a common Proto-Iranian state.

Av. chasman; Khotanese ceima; Pers. cheshm; Wakhi czm; Yidgha cam; Shugni cem; Ossetic csht (also Sanskrit chakshus.h) (eye)

. xshap, Av. xshap; Pers. shab; Pashto shpa; Yaghnobi xishap; Ishkashimi sab, sxab; Shugni shab; Ossetian xshv < *xeshev
(but also archaic Av. naxturu 'nocturnal'; Wakhi naGd) (night)

Av. raocah 'daylight'; Pers. ruz; Zazaki roje; Pashto wradz; Wakhi rwor; Sanglechi rusht (day)

Av. asanga 'stone'; Khotanese samga; Pers. sang; Yaghnobi sank; Sanglechi song (stone)

Av. gaosha; Pers. gush; Pashto Gvazh; Yaghnobi Gu:sh; Wakhi ghish; Shugni ghox; Ossetic x"ush (ear)

Av. taoxma; Pers. tokhm; Wakhi tuxm murG; Shugni tarmurx (egg); Pashto Agey (egg)

These lexical items can be seen as typically Iranian, indicating that Proto-Iranian existed as a single unity for a time long enough to produce local innovations even in a short 46-list.

Also see Starostin's similar dendrogram of Iranian languages, which tends to confirm the conclusions of the present study as far as the tree structure and lexicostatistical percentages are concerned. However, it should be noted that Starostin's "recalibrated" glottochronology often yields datings that are too recent and should be treated with caution. For instance, he gives -620 for the separation of West Iranian, whereas we know well from history that the Median Kingdom was first mentioned in 836 BC, and its language is normally believed to be West Iranian, so we have an obvious contradiction.


The position of Armenian is seen herein as highly controversial, and its discussion has been excluded from the present notes. It may very well be related to Indo-Iranian languages, not Proto-Greek, as many people assume.


Nuristani, Dardic and Indo-Aryan

The Nuristani-Dardic branch (~65%) seems to be as internally close as Balto-Slavic [although Khowar shows many dissimilarities]. The same is true for the mainstream Indo-Aryan languages (~65%).

Kashmiri does not seem to be Dardic, and is herein included in the mainstream Indo-Aryan subgroup.

Note the considerable difference between Sinhalese and Hindi-Kashmiri (~55%), which indicates that the Sinhalese-Maldivian subgroup could have been a very early offshoot.

The separation of Proto-Nuristani-Dardic from the main Indo-Aryan branch seems to have occurred at a depth of 54%, which must correspond to roughly 1800-1600 BC (the archaeological and historical date normally associated with the Aryan invasion).

To appreciate the shared Nuristani-Dardo-Indo-Aryan (or Indic, for short) phonological transformations and lexical innovations, consider the following examples from the 46-list:

dits; Kalasha Jhiph; Skr. jihvha:; Hindi jibh; Sinh. diva (tongue) (as opposed to Av. *hizva:s; Pers. zabn, Bactrian ezbago; Pashto zhba; Yaghnobi zivok; Shugni zev). [jh >d : z]

Kati su; Kalasha sri; Skr. su:ryaH; Hindi su:rey (sun) (as opposed to Av. hvar-z; Pers. khurshed, Yaghnobi khur; Shugni xer; Wakhi yir) [s : x]

Kalasha ha; Skr. h'Rdaya; Hindi hridey; Sinh. <hardaya>; Dhivehi hi-iy (heart) (as opposed to Zazaki <zerr>; Pashto zrr; zaru; Shugni zra; Ossetic zhrd); (but Kati ziri < Iranian loanword?) [h : z]

ango; Khowar angar; Kalasha angr; Skr. agniH, Ng'ara; Hindi a:g (fire) (as opposed to Av. tar-sh; Pers. atesh, Ossetic art < *atr; Pashto or; Yaghnobi ol; Yidgha yur; Shugni yc). [-ng- : -t-/0 ?]

Kati kor; Kalasha ka; Skr. karNa; Hindi kan; Sinh. kana (ear) (probably akin to Av. karana 'side, flank') (as opposed to Av. gaosha; Pers. gush, Pashto Gvazh; Yaghnobi Gu:sh; Shugni ghox) (semantic innovation).

Kati radur; Kalasha rat; Skr. ra:tra; Hindi ra:t; Sinh. <raeya> (night) (akin to Lith. /ri:ta/ 'morning'; Pers. ruz 'day') (as opposed to Av. xshap; Pers. shab; Pashto shpa; Yaghnobi xishap; Wakhi naGd). (semantic innovation)

Again, as in the case of Proto-Iranian, from the great number of shared features within a short word list we can deduce that Proto-Nuristani-Dardo-Indo-Aryan (Proto-Indic) had a prolonged period of separate existence, at least 2000 years long.


Hypothesis: Nuristani are part of (or close to) Dardic

You can see from the examples above that the Nuristani languages (such as Kati (Kata-viri), Kami (Kam-viri), Wasi) are clearly related to Indic, since they share the same transformations and innovations, and thus cannot be seen as "intermediate" between Indic and Iranian, as is sometimes claimed. They also seem to share some common phonological and semantic formations with the Dardic languages, and can hardly be viewed as radically separate:

(1) Kati g'u; Khowar g'oG; Kalasha gok (worm);

(2) Kati uts; Khowar awa; Kalasha a (I) (as opposed to Skr. asmad; Kashmiri bu, boh'; Hindi me; Lahnda mae; Bengali ami; Sinh. mama) (probably, an early loss of the second part of *as-mad);

(3) Kati nu; Khowar nan ; Kalasha ya (mother) (curiously, probably akin to Eng. "nanny" as also to a similar word in Eastern Iranian languages) (as opposed to Skr. ma:tar, ma:ta:; Kashmiri moju; Hindi ma:, ma:ta:ji; Sinh. <mava>) (the much too overused objection to children's words is herein seen as exaggerated: 'mother, father, nanny' are quite normal words, they are not easily re-created from scratch each time in each language);

(4) Kati sh't; Khowar chuti; Kalasha chom (earth) (as opposed to Skr. mahi:; Kashmiri metsu, boh'; Hindi mitti 'clay'; Bengali mati; Gujarati mati; Sinh. <pas>, <poloova> ) (apparently, akin to East Iranian: Ishkashimi shit; Sarikoli sit; Yazgulami shat; Ossetic sdJt)

However, the shared innovations in question are few and may have formed independently because of an Iranian adstratum, borrowings or by other means.


The Proto-Indo-Iranian language existed a long time ago and/or was rather short-lived. This conclusion may be drawn from the fact that relatively few traces demonstrating its existence remain in the 46-list for the modern languages. These could be:

(1) Av. f-sh, ap; Pers. āb; Pashto ob; Yaghnobi op; Wakhi yupk; Kami oa, op; Skr. a:paH > paniya (?) (water); akin to Lith. /upe/ 'river' (a semantic innovation, which probably arose because water was closely associated with rivers in the desert regions of Central Asia). It is more likely, however, that this lexeme is present only in Iranian, whereas its appearance in Indic is recent; also cf. Sinh. <vatura>.

(2) Av. bu:m; Old Pers. bu:mis; Kurdish bin; Ormuri (Logar) bouma; Kami b'm; Khowar b'um; Skr. bhu:miH; Hindi bhu:mi; Sinh. bin (earth)

(3) Av. masya; Pers. māhi; Skr. ma:tsya; Hindi machhi; Marathi masa; Gujarati macheli; Sinh. malu (fish); probably akin to Lith. mesa; Eng. meat (not in the 46-list); the introduction of this word may indicate that fishing was an important component of Indo-Iranian subsistence.

As we have seen, the changes in Proto-Indic and Proto-Iranian are pronounced, and the two share few common innovations, which indicates that both languages existed separately from each other for a considerable amount of time and no longer have much in common (~40% in the present study). Glottochronologically, from the considerations of the present study, they could have separated c. 3500-3000 BC, which is about 1000-1500 years earlier than usually assumed. That would mean that the Proto-Indo-Iranians entered Central Asia soon after 4000 BC (see Map of Indo-Iranian Migrations), quickly migrated along the Oxus valley, and reached the Hindu-Kush and Pamir mountains, where early Proto-Indic completely separated by 3000 BC, getting lost among the mountain ridges. It then stayed there, with some internal differentiation, until c. 1700 BC, when the Indo-Aryan languages descended from this stock finally began to migrate to northern India. Although this is not shown in this study, it is also plausible to assume that the Indo-Aryan languages were essentially a few Dardic languages that expanded into the Indian subcontinent.

Then why do we often hear about a close proximity of Sanskrit to Avestan? The probable explanation is that classical "dictionary" Sanskrit is not a real language; it is rather a quasi-etymological collection of lexemes which belong to different Indo-Aryan dialects from different periods, whereas the earliest Vedic Sanskrit, which was passed down orally for many generations, is even more confusing and sometimes not even entirely decipherable. (Note that classical Sanskrit cannot be seen as something of a Proto-Indo-Aryan, since it was basically an artificial conlang created by Panini, and then lexically expanded over the course of centuries.) Consequently, a casual comparison with Avestan turns up many synonyms and many obscure passages in Vedic Sanskrit texts, which may give a superficial impression of a close relationship.

On the other hand, in the present study, only real languages of the same period are contrasted, which helps to uncover the lack of a common lexical background and pushes the Indo-Iranian separation further back in time. Similar difficulties in finding the common Indo-Iranian proto-state were also noted in other lexicostatistical studies of modern languages, first by Dyen, Kruskal and Black (1992), who complained about the "absence in the present classification of an Indoiranian group", and then by Ringe et al. (2005).

This rather important question, however, remains to be investigated further by more accurate research.



The current study confirms the early separation of Hittite. This is evident from the following consideration. The comparison of Hittite to Latin (52%) (attested c. 100 BC), Attic Greek (51%) (400 BC), Avestan (50%) (600 BC), and Sanskrit (~50%) (400 BC) yields nearly equal results, which means that Hittite seems to be equidistant from the other Indo-European groups.

Since Hittite is dated to c. 1600 BC, there would be even fewer matches if it had lived 1300 years longer to see Latin and Avestan; therefore this average figure of ~50% should be further reduced to about 45% relatedness to most Indo-European groups. This is a greater distance than Latin/Greek (67%), Greek/Avestan (57%), or Latin/Sanskrit (~57%). Glottochronologically, that figure would translate to about 5000 years before Latin/Greek/Avestan/Sanskrit (c. 300 BC), or circa 5300 BC (see below).

Therefore, we repeat the conclusion that the Anatolian group should be regarded separately from the mainstream Indo-European languages, and support the Indo-Hittite (Indo-Anatolian) hypothesis.


Attempting to date Proto-Indo-European

One of the common reasons for the criticism of glottochronology is the allegedly insufficient lexicostatistical distance between Modern Icelandic, Modern Armenian, and Modern Georgian and their respective older stages. However, the critics of glottochronology seem to ignore the law of large numbers, which states that even if some of the languages deviate considerably from the mean in their phono- and lexicostatistical behavior, the arithmetic average over a large number of languages will be relatively stable and reliably correct. Moreover, the probability of running into languages with a considerable deviation from the mean would be rather low, while, in many cases, the non-conformity of such deviant languages may be explained and even consistently predicted using various ad-hoc assumptions, such as geographical and linguistic isolation on a distant island (as in the case of Icelandic) or in the mountains (as with Armenian and Georgian).

However, we will try not to use any ad-hoc assumptions herein. The law of large numbers should be enough to establish the temporal position of PIE. Here is what we'll do. We will (1) calculate the mean over the Indo-Aryan languages; (2) calibrate the rest of the list using the obtained lexicostatistical depth set to 1600 BC, the archaeologically attested date of the Indo-Aryan invasion into India; (3) calculate the mean for all of the Indo-European groups; and (4) finally convert that number into an approximate date in years using the aforementioned calibration date.

The mean percentage of separation among Nuristani-Dardic (excluding the unreasonably deviating Khowar), Sinhalese, and Hindi-Kashmiri seems to converge at an average depth of about 56% [(54 + 57 + 50 + 60 + 62 + 50 + 64 + 59 + 52) / 9 = 56], which should correspond to circa 1600 BC judging from the archaeological and historical record (also see Map of Indo-European Migrations).

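Hence, from the logarithmic glottochronological formula, where x is the share of the list retained per 1000 years and 3.6 is the number of millennia between 1600 BC and 2000 AD, we have:
0.56 = x ^ 3.6
x = 0.56 ^ (1 / 3.6) = 0.56 ^ 0.28 = 0.85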

After some calculations, that would produce the following calibrated glottochronological row:

Date:         2000 AD   1000 AD   0     1000 BC   2000 BC   3000 BC   4000 BC   5000 BC
Relatedness:  100%      85%       72%   61%       52%       44%       38%       32%
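The calibrated row can be reproduced with a few lines of code. This is only a sketch of the calculation described above: the variable names are illustrative, and it assumes, as the figures in the row suggest, that the retention rate is rounded to 0.85 before the powers are taken.

```python
# Calibration point from the text: 56% relatedness at 1600 BC,
# i.e. 3.6 millennia before the reference year 2000 AD.
calibration_share = 0.56
millennia_elapsed = 3.6

# Solve calibration_share = x ** millennia_elapsed for the retention x.
retention = calibration_share ** (1 / millennia_elapsed)

# The text works with the rounded value, so the row is built from it.
rate = round(retention, 2)

# Expected relatedness (in %) for 0..7 millennia before 2000 AD.
row = [round(100 * rate ** k) for k in range(8)]

# Column labels: positive years are AD, negative are BC, 0 is the turn of the era.
years = [2000 - 1000 * k for k in range(8)]

for year, pct in zip(years, row):
    print(year, f"{pct}%")
```

Using the unrounded rate (about 0.851) instead would shift a couple of entries by one percentage point, which is well within the roughness of the method.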

This glottochronological row seems to be more or less consistent with the following historically attested facts and plausible assumptions:

(1) (very approximately) with the dating of Proto-Celtic: Irish/Welsh ~ 52% (or 57%, as corrected for synonyms, see above), that is c. 1500 BC, close to the Late Bronze Age Urnfield culture (1300-750 BC). Note that this figure is a single value, not a mean, and therefore may contain statistical errors; nor do we know the exact dating of Proto-Celtic;

(2) with the separation of Hellenic from Italic occurring before 1900-1600 BC, when the Proto-Greek tribes must have first entered Greece. Glottochronologically, the Attic Greek/Latin relatedness (69%) corresponds to about 2300 years before 200 BC (a mean value between the approximate datings of the Greek and Latin languages), thus yielding c. 2500 BC for the late Helleno-Italic proto-state;

(3) with Baltic (~75%) expanding around 200 AD, just a little earlier than late Proto-Slavic (c. 450-500 AD). In a more accurate lexicostatistical study by Mazhulis (1994) using the Swadesh 200-list, the distance between Lithuanian and Latvian is just slightly greater than that among the Slavic languages, whereas the Proto-Slavic split is probably historically datable to 400-500 AD; therefore, 0-200 AD seems plausible for the separation of Lithuanian and Latvian-Latgalian;

(4) with the likely separation of Proto-Zazaki from Persian (70-75%) soon after the end of the Old Persian period (500-300 BC);

(5) with Ossetic separating from the other East Iranian languages at 60% (~1100 BC). This dating looks right, because the Scythian languages are attested in the Caucasus by c. 800 BC, and their migration from the Pamir had to be relatively fast (occurring within centuries or even decades) because they used horse-based technology;

(6) with the existence of the BMAC civilization along the Oxus river c. 2300-1800 BC, which should probably be attributed to the Proto-Iranian state, apparently located along the Oxus as well (see Map of Indo-Iranian Migrations). According to the present calculations, the era of Proto-Iranian would roughly correspond to 60-40%, thus embracing the period from 3000 to 1100 BC, which includes the period of the BMAC as a subset;

(7) with the diversification of the West Iranian languages (70%, 200 BC) after the fall of the Persian Empire by 330 BC.

Calculating the approximate upper date for late PIE:

Now that we have calibrated the glottochronological row, we can use the figures for the Greek/Avestan (57%), and Latin/Sanskrit (~57%) relatedness to place the upper limit for Proto-Indo-European at 3500 years before the average dating of Avestan (600 BC), Sanskrit (400 BC), Latin (100 BC), and Greek (400 BC), or circa 3900 BC.
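The conversion from a percentage to a date used in this paragraph can be sketched as follows. The retention of 0.85 per millennium and the four attestation dates are taken from the text; the function and variable names are illustrative only.

```python
import math

RETENTION = 0.85  # share retained per millennium, from the calibrated row


def years_of_separation(share, retention=RETENTION):
    """Years needed for relatedness to fall to `share` under a constant
    per-millennium retention (inverse of share = retention ** millennia)."""
    return 1000 * math.log(share) / math.log(retention)


# Approximate attestation dates from the text (negative = BC).
attested = {"Avestan": -600, "Sanskrit": -400, "Latin": -100, "Greek": -400}
mean_attestation = sum(attested.values()) / len(attested)

# Depth corresponding to the ~57% Greek/Avestan and Latin/Sanskrit figures.
depth = years_of_separation(0.57)

# Upper limit for PIE: the mean attestation date pushed back by that depth.
pie_upper = mean_attestation - depth

print(mean_attestation)
print(round(depth), round(pie_upper))
```

With these inputs the depth comes out slightly below 3500 years and the resulting date slightly after 3900 BC, so the rounded figures in the text are a fair summary.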

We can also obtain a similar number starting from modern languages:

Celtic / Indo-Aryan = (38 + 39 + 42) / 3 ~ 40%
Celtic / Balto-Slavic = (32 + 32 + 38 + 37 + 44 + 35) / 6 ~ 36%
Balto-Slavic / Indo-Aryan = (38 + 36 + 35 + 32) / 4 ~ 35%
Balto-Slavic / Iranian = (39 + 38 + 28 + 35) / 4 ~ 35%
European / Iranian = (39 + 37 + 30 + 34) / 4 ~ 35%
European / Indo-Aryan = (29 + 34 + 39) / 3 ~ 34%

PIE ~ 36-35% or circa 4400 BC
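The group averages and the resulting date can be recomputed with the sketch below. The percentages are those listed above; the retention of 0.85 per millennium and the reference year 2000 AD come from the calibrated row, while the names `pairs` and `split_year` are illustrative.

```python
import math

RETENTION = 0.85  # assumed per-millennium retention, from the calibrated row

# Pairwise percentages as listed in the text.
pairs = {
    "Celtic / Indo-Aryan": [38, 39, 42],
    "Celtic / Balto-Slavic": [32, 32, 38, 37, 44, 35],
    "Balto-Slavic / Indo-Aryan": [38, 36, 35, 32],
    "Balto-Slavic / Iranian": [39, 38, 28, 35],
    "European / Iranian": [39, 37, 30, 34],
    "European / Indo-Aryan": [29, 34, 39],
}

# Mean of each group pair, then the mean of those means.
means = {name: sum(values) / len(values) for name, values in pairs.items()}
overall = sum(means.values()) / len(means)


def split_year(share, reference_year=2000, retention=RETENTION):
    """Year (negative = BC) at which two modern languages with the given
    relatedness would have split, under constant retention."""
    return reference_year - 1000 * math.log(share) / math.log(retention)


for name, value in means.items():
    print(f"{name}: {value:.1f}%")
print(f"overall: {overall:.1f}%")
print(round(split_year(0.36)), round(split_year(0.35)))
```

Under these assumptions, 36% and 35% bracket a split somewhere between roughly 4300 and 4450 BC, which is consistent with the "circa 4400 BC" given above.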

The conclusion is that PIE must have separated into early Indo-European dialects by circa 4100 BC (a date lying between the two estimates above), which is in rather good correspondence with Gimbutas' theory.



Does any of this agree with other models?

Does any of this agree with other researchers' models? Sometimes, it does.

See Vaclav Blazhek, On the internal classification of Indo-European languages: survey (2005)

(1) Eric Hamp (1990)
We are in essential agreement with the non-lexicostatistical model of Eric Hamp (1990), who based his classification on specific isoglosses in phonology, morphology and lexicon. For instance, he also tends to place Balto-Slavic in the same group with Centum, a possible conclusion reached in this work on purely lexicostatistical grounds. He also agrees that Thracian is an early Balto-Slavic offshoot. He seems to misplace Greek, though, because of its alleged proximity to Armenian (a question I have not addressed herein). Otherwise, his conclusions are rather classical.

(2) Starostin (2004)
There's some interesting agreement with Starostin's glottochronological study (2004). Note that the counting and calibration methods in that study were completely different from those used here. Starostin has:
-4600 [my -5300] for the Anatolian separation;
-3800-3300 [my -4100] for the mainstream Indo-European languages separation;
-1200 [my -700] for Balto-Slavic;
-80 [my -200] for Latvian-Lithuanian;
-1000 [my -1900] for Brythonic-Goidelic;
-250 [my -700] for late Proto-Indo-Aryan;
-1200-700 [my -1100] for late Proto-Iranian;
+180 [my -200] for Shugni-Munji(Yidgha)-Ishkashimi (he also found an early separation of Wakhi (-500), which surprised me too; and correctly identified a long separation of Ormuri-Parachi, etc, see The dendrogram of Iranian languages);
+300 [my -100] for late West Iranian;
-1100 [my -1700] for late Proto-Dardic (he also noticed the early separation of Khowar and Kalasha);
etc. None of these is too far from the figures in the present study.
At least, we have some basic, fundamental agreement here. You can also notice that Starostin has a smaller offset value, so it's basically a matter of calibration (my calibration method was very rough and approximate in this work, so I don't insist on it; it's not even the aim of this work to elaborate a correct glottochronological calibration, because I was mostly interested in percentage values).

The rest of his cladistics seems to be screwed up, probably because Starostin relied too much on statistical calculations over short word lists, which are not always accurate enough to produce an error margin small enough for building a correct dendrogram when the separation times are much too close (a common problem in statistical phylogeny). To avoid this common error, I simply put an honest I-don't-know and relied on classical conclusions and rough approximations whenever I felt there might be something wrong with the statistical side. On the other hand, some of his dates may in fact be more accurate, because I used a very small lexical base for just a small number of languages.