<HTML>
<HEAD><TITLE>How likely are chance resemblances between languages?
</TITLE></HEAD>

<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<IMG  Align=Top SRC="verddrop.gif"> 

<h2><font color="#000060">How likely are chance resemblances between languages?</font></h2>

<A HREF="default.html">[ Home ]</A>
<A HREF="chance.htm">[ Top of paper ]</A>
<A HREF="chance.htm#better">[ Remainder of paper ]</A>

<h3><font color="#000060">Taking phoneme frequencies into account</font></h3>

As a worked example, this page considers random matches between Quechua and Chinese, 
taking into account the fact that phonemes don't occur with equal probability.


<p>First, we need to <b>decide what constitutes a phonetic match</b> between the two languages.  One way of doing this is to decide, for each Quechua phoneme, which Chinese phonemes we'll accept as matches.  (Think of it this way: is Qu. <i>runa</i> a match for Ch. <i>r&eacute;n</i>?  Is <i>chinchi</i> a match for <i>chong</i>?  Is <i>chay</i> a match for <i>zh&egrave;</i>?)  

<p>We might decide as follows.  The criterion here is obviously phonetic similarity.  We could certainly improve on this by requiring a particular phonological distance; e.g. a difference of no more than two phonetic features, such as voicing or place of articulation.  The important point, as we will see, is to be clear about what we count or do not count as a match; or, if we are evaluating someone else's work, to use the same phonetic criteria they do.

<table>
<tr><td><b>Qu. <td><b>Ch.
<tr><td>p  <td>p, b
<tr><td>t  <td>t, d
<tr><td>ch <td>ch, zh, j, q, c, z
<tr><td>k <td>k, g
<tr><td>s <td>s, sh, c, z, x, zh
<tr><td>h <td>h
<tr><td>q <td>h, k
<tr><td>m <td>m, n
<tr><td>n <td>m, n, ng
<tr><td>&ntilde; <td>m, n, ng, y
<tr><td>l  <td>l, r
<tr><td>ll <td>l, r, y
<tr><td>r <td>l, r
<tr><td>w <td>w, u
<tr><td>y <td>y, i
<tr><td>a <td>a, e, o
<tr><td>i <td>i, e, y
<tr><td>u <td>u, o, w
</table>
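<p>For readers who want to experiment, the table above can be written down as a simple lookup.  This is only a sketch: the names <code>MATCHES</code>, <code>is_match</code> and <code>word_matches</code> are made up for illustration, and treating a word as an (initial, medial, final) triple is an assumption about how words are segmented.

```python
# The match table from the text: each Quechua phoneme maps to the
# Chinese (pinyin) phonemes accepted as matches for it.
MATCHES = {
    "p": {"p", "b"},
    "t": {"t", "d"},
    "ch": {"ch", "zh", "j", "q", "c", "z"},
    "k": {"k", "g"},
    "s": {"s", "sh", "c", "z", "x", "zh"},
    "h": {"h"},
    "q": {"h", "k"},
    "m": {"m", "n"},
    "n": {"m", "n", "ng"},
    "ñ": {"m", "n", "ng", "y"},
    "l": {"l", "r"},
    "ll": {"l", "r", "y"},
    "r": {"l", "r"},
    "w": {"w", "u"},
    "y": {"y", "i"},
    "a": {"a", "e", "o"},
    "i": {"i", "e", "y"},
    "u": {"u", "o", "w"},
}


def is_match(quechua_phoneme, chinese_phoneme):
    """True if the Chinese phoneme is an accepted match for the Quechua one."""
    return chinese_phoneme in MATCHES.get(quechua_phoneme, set())


def word_matches(quechua, chinese):
    """Compare two words given as (initial, medial, final) triples.

    The triple segmentation is an assumption made for this sketch.
    """
    return all(is_match(q, c) for q, c in zip(quechua, chinese))


print(is_match("p", "b"))   # True
print(is_match("u", "e"))   # False
```

<p>Writing the criteria down this explicitly makes it easy to apply exactly the same standard to every candidate pair.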

<p>We will next need to know the frequency with which each phoneme occurs in each language.  This can be calculated using a simple program operating on sample texts.  For Quechua we find:

<table>
<tr><td> <td>initial <td>medial <td>final
<tr><td>a <td>5.291005 <td>25.906736 <td>40.211640  
<tr><td>b <td>2.645503 <td>0 <td> 0   
<tr><td>d <td>0 <td> 0.310881 <td>0   
<tr><td>g <td>0.529101 <td>0.103627 <td>0   
<tr><td>h <td>5.820106 <td>0 <td> 0   
<tr><td>i <td>2.645503 <td>8.808290 <td>5.291005  
<tr><td>k <td>14.814815 <td>5.595855 <td>3.174603  
<tr><td>l <td>0.529101 <td>0.414508 <td>0   
<tr><td>m <td>7.407407 <td>4.145078 <td>3.703704  
<tr><td>n <td>1.587302 <td>6.528497 <td>25.396825  
<tr><td>p <td>7.936508 <td>6.010363 <td>0   
<tr><td>q <td>4.232804 <td>3.108808 <td>8.465608  
<tr><td>r <td>4.232804 <td>5.077720 <td>0   
<tr><td>s <td>6.349206 <td>4.145078 <td>2.645503  
<tr><td>t <td>7.407407 <td>6.424870 <td>0   
<tr><td>u <td>3.703704 <td>11.398964 <td>2.645503
<tr><td>w <td>11.111111 <td>1.450777 <td>0.529101  
<tr><td>y <td>3.174603 <td>4.145078 <td>7.936508  
<tr><td>ch <td>6.878307 <td>3.108808 <td>0   
<tr><td>&ntilde; <td>1.058201 <td>1.243523 <td>0   
<tr><td>rr <td>0.529101 <td>0 <td> 0   
<tr><td>ll <td>2.116402 <td>1.865285 <td>0   
</table>

And for Chinese we get:

<table>
<tr><td> <td>initial <td>medial <td>final
<tr><td>a <td>1.400000 <td>21.494371 <td>7.739308  
<tr><td>b <td>7.000000 <td>1.432958 <td>0   
<tr><td>c <td>0.600000 <td>0.102354 <td>0   
<tr><td>d <td>12.800000 <td>1.228250 <td>0  
<tr><td>e <td>0.200000 <td>8.904811 <td>15.885947  
<tr><td>f <td>2.000000 <td>0.614125 <td>0   
<tr><td>g <td>3.200000 <td>1.842375 <td>0  
<tr><td>h <td>3.400000 <td>2.149437 <td>0   
<tr><td>i <td>0 <td> 17.195496 <td>29.327902  
<tr><td>j <td>4.600000 <td>1.944729 <td>0   
<tr><td>k <td>2.200000 <td>0.204708 <td>0   
<tr><td>l <td>6.000000 <td>2.149437 <td>0   
<tr><td>m <td>2.600000 <td>1.330604 <td>0 
<tr><td>n <td>3.800000 <td>6.038895 <td>11.608961  
<tr><td>o <td>0.400000 <td>7.881269 <td>9.368635  
<tr><td>p <td>1.000000 <td>0.102354 <td>0   
<tr><td>q <td>2.000000 <td>1.842375 <td>0   
<tr><td>r <td>0.800000 <td>0.307062 <td>1.629328  
<tr><td>s <td>0.800000 <td>1.023541 <td>0   
<tr><td>t <td>3.800000 <td>1.228250 <td>0   
<tr><td>u <td>0 <td> 8.495394 <td>12.016293  
<tr><td>w <td>7.800000 <td>0.716479 <td>0   
<tr><td>x <td>4.200000 <td>0.614125 <td>0   
<tr><td>y <td>9.600000 <td>0.511771 <td>0   
<tr><td>z <td>4.200000 <td>1.023541 <td>0   
<tr><td>ch <td>2.200000 <td>0.716479 <td>0   
<tr><td>ng <td>0 <td> 5.834186 <td>12.016293  
<tr><td>sh <td>7.800000 <td>1.330604 <td>0   
<tr><td>zh <td>5.600000 <td>1.740020 <td>0   
</table>

<p>(The reader who knows Chinese may wonder how we can have medial consonants at all.  The answer is that I am using Chinese lexemes, not single characters (<i>z&igrave;</i>), so that, for instance, <i>Zhonggu&oacute;</i> 'China' is one word, not two.) 
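<p>A frequency count of this kind might be sketched as follows.  This is not the program actually used for the tables above; the digraph list, the function names, and the treatment of every non-edge phoneme as "medial" are all assumptions made for illustration, and the input is assumed to be romanized words without tone marks.

```python
from collections import Counter

# Two-letter phonemes must be tried before single letters.
# This list is an assumption for illustration.
DIGRAPHS = ("ch", "ll", "rr", "ng", "sh", "zh")


def split_phonemes(word):
    """Split a romanized word into phonemes, preferring digraphs."""
    phonemes = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phonemes.append(word[i:i + 2])
            i += 2
        else:
            phonemes.append(word[i])
            i += 1
    return phonemes


def positional_frequencies(words):
    """Percentage frequency of each phoneme in initial, medial, final position."""
    counts = {"initial": Counter(), "medial": Counter(), "final": Counter()}
    for word in words:
        phonemes = split_phonemes(word.lower())
        if not phonemes:
            continue
        counts["initial"][phonemes[0]] += 1
        if len(phonemes) > 1:
            counts["final"][phonemes[-1]] += 1
        for p in phonemes[1:-1]:
            counts["medial"][p] += 1
    # Convert raw counts to percentages within each position.
    return {
        pos: {p: 100.0 * n / sum(c.values()) for p, n in c.items()}
        for pos, c in counts.items()
    }


freqs = positional_frequencies(["runa", "chay", "wasi"])
print(freqs["medial"]["a"])  # 40.0
```

<p>A real count would need a much larger sample, and a decision about ambiguous letter sequences (for instance an <i>n</i> followed by a <i>g</i> across a syllable boundary).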

<p>Now we're in a position to calculate the probability for a match.  Let's start by assuming that there must be a match (within the phonetic categories established above) in all three positions: initial, medial, and final. 

<p>To calculate the probability <i>p<font size=2>i</font></i> for a match in the initial, we go down the list of Quechua initials, multiplying the probability of each by the probability of finding the matching sound(s) in that same position in Chinese, and then add up the results.  For instance, the probability of a match on initial <b>p</b> is the probability of initial <b>p</b> in Quechua (.0794) times the probability of a match on initial <b>p</b> <i>or</i> <b>b</b> (.07 + .01 = .08), or .00635.  
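<p>The per-position calculation can be sketched in a few lines.  The function name and the dictionary layout are hypothetical; only the initial <b>p</b> row is filled in here, using the figures just quoted, so the output is that single row rather than the full total.

```python
# Initial-position frequencies as fractions, from the example in the text.
# Only the "p" row is included; a full run would list every phoneme.
QUECHUA_INITIAL = {"p": 0.0794}
CHINESE_INITIAL = {"p": 0.01, "b": 0.07}
MATCHES = {"p": ["p", "b"]}


def position_match_probability(source_freq, target_freq, matches):
    """Sum over source phonemes of P(source) * P(any matching target)."""
    total = 0.0
    for phoneme, freq in source_freq.items():
        target_sum = sum(target_freq.get(t, 0.0) for t in matches.get(phoneme, []))
        total += freq * target_sum
    return total


p_row = position_match_probability(QUECHUA_INITIAL, CHINESE_INITIAL, MATCHES)
print(round(p_row, 5))  # 0.00635
```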

<p>I show the entire calculations below, because some of them are quite eloquent, and show the value of taking a frequency approach.  If you're looking for a match for a Quechua word in <b>s</b>-, for instance, you have a <i>23% chance</i> of matching any of the sounds we've judged as similar in Chinese.  You're likely to match medial -<b>a</b>- <i>38%</i> of the time; final -<b>a</b> <i>33%</i> of the time, final -<b>n</b> <i>24%</i> of the time.

<p>(The boldface letter is the Quechua sound; it's followed by the Chinese sounds we said would be a match.  The first number is the probability of the Quechua phoneme; the second is the sum of the probabilities of the matching Chinese sounds; the third is the multiplication of the first two.)

<p><b><i>Initials</i></b>

<table>
<tr><td><b>a</b> aeo <td>.05291 * .020  =<td>.00106
<tr><td><b>h</b> h<td>.05820 * .034 =<td>.00198
<tr><td><b>i</b> iey<td>.02646 * .098 =<td>.00259
<tr><td><b>k</b> kg<td>.14815 * .054 =<td>.00800
<tr><td><b>l</b> lr<td>.00529 * .068 =<td>.00036
<tr><td><b>m</b> mn<td>.07407 * .064 =<td>.00474
<tr><td><b>n</b> mn ng<td>.01587 * .160 =<td>.00254
<tr><td><b>p </b>pb<td>.07937 * .080 =<td>.00635
<tr><td><b>q </b>hk<td>.04228 * .056 =<td>.00237
<tr><td><b>r </b>lr<td>.04228 * .068 =<td>.00288
<tr><td><b>s</b> s sh c z x zh<td>.06349 * .232 =<td>.01473
<tr><td><b>t </b>td<td>.07407 * .166 =<td>.01230
<tr><td><b>u </b>uow<td>.03704 * .082 =<td>.00304
<tr><td><b>w </b>wu<td>.11111 * .078  =<td>.00867
<tr><td><b>y </b>yi<td>.03174 * .096 =<td>.00305
<tr><td><b>ch</b> ch zh jqcz<td>.06883 * .192 =<td>.01322
<tr><td><b>&ntilde;</b> mn ng y<td>.01058 * .160 =<td>.00169
<tr><td><b>ll </b>lry<td>.02121 * .164 =<td>.00348
</table>

Probability for an initial match = .09305 = 9.3%

<p><b><i>Medials</i></b>

<table>
<tr><td><b>a</b> aeo<td>.25907 * .3828 =<td>.09917
<tr><td><b>i</b> iey<td>.08808 * .2661 =<td>.02344
<tr><td><b>k</b> kg <td>.05596 *.0205 = <td>.00114
<tr><td><b>l</b> lr<td>.00415 * .0246 =<td>.00010
<tr><td><b>m</b> mn<td>.04145 * .0737 =<td>.00305
<tr><td><b>n</b> mn ng <td>.06528 * .1320 =<td>.00862
<tr><td><b>p</b> pb<td>.06010 * .0153 =<td>.00092
<tr><td><b>q</b> hk<td>.03109 * .0235 =<td>.00073
<tr><td><b>r</b> lr<td>.05078 * .0246 =<td>.00125
<tr><td><b>s</b> s sh c z x zh<td>.04145 * .0582 =<td>.00241
<tr><td><b>t</b> td<td>.06425 * .0246 =<td>.00158
<tr><td><b>u</b> uow<td>.11399 * .1710 =<td>.01949
<tr><td><b>w</b> wu<td>.01451 * .0921 =<td>.00134
<tr><td><b>y</b> yi<td>.04145 * .1771 =<td>.00734
<tr><td><b>ch</b> ch zh jqcz<td>.03109 * .0736 =<td>.00229
<tr><td><b>&ntilde;</b> mn ng y<td>.01244 * .1371 =<td>.00170
<tr><td><b>ll</b> lry<td>.01865 * .0297 =<td>.00055
</table>

<p>Probability for a medial match = .17514 = 17.5%

<p><b><i>Finals</i></b>

<table>
<tr><td><b>a</b> aeo<td>.40212   * .3299 =<td>.13266
<tr><td><b>i</b> iey<td>.05291   * .4522 =<td>.02393
<tr><td><b>k</b> kg<td>.03175   * 0  = <td>0
<tr><td><b>m</b> mn <td>.03704   *.116  =<td>.00430
<tr><td><b>n</b> mn ng <td>.25397   *.236  =<td>.05994
<tr><td><b>q</b> hk <td>.08466   * 0  = <td>0
<tr><td><b>s</b> s sh c z x zh<td>.02646   * 0  = <td>0
<tr><td><b>u</b> uow<td>.02646 * .2139 =<td>.00566
<tr><td><b>w</b> wu<td>.00529   * .1202 =<td>.00064
<tr><td><b>y</b> yi<td>.07937   * .2933 =<td>.02328
</table>

<p>Probability for a final match = .25039 = 25.0%

<p>So, the probability of finding a random match on a single word (with no semantic leeway) is .0931 * .1751 * .2504 = <b>0.0041</b>, or 1 in 244.  
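<p>As a quick check of the arithmetic, the three position probabilities can be multiplied directly.  Note that multiplying them treats the three positions as independent of one another; the function name here is just for illustration.

```python
import math


def word_match_probability(*position_probabilities):
    """Multiply per-position match probabilities (assumes independence)."""
    return math.prod(position_probabilities)


p_word = word_match_probability(0.0931, 0.1751, 0.2504)
print(round(p_word, 4))  # 0.0041
```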

<h5>Was all that worth it?</h5>

It's worthwhile comparing this to the original seat-of-the-pants estimate (based on 14 equiprobable consonants and 5 equiprobable vowels, and allowing 3 phonetic matches per sound) of 27 in 980, or 0.027, which is about 6.5 times the figure calculated above.
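<p>The rough figure can be reconstructed as follows, on the assumption (which is how the numbers work out) that it refers to consonant-vowel-consonant words: 14 &times; 5 &times; 14 possible words, with 3 acceptable matches for each of the three sounds.

```python
consonants = 14
vowels = 5
matches_per_sound = 3

# Assumption: words are consonant-vowel-consonant, three sounds in all.
possible_words = consonants * vowels * consonants
acceptable_words = matches_per_sound ** 3

print(f"{acceptable_words} in {possible_words}")  # 27 in 980
```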

<p>Two lessons may be drawn.  First, phoneme <b>frequency matters</b>.   Both Quechua and Chinese have very many medial <b>a</b> sounds, and final nasals, and initial affricates.  That makes random matches involving those sounds much more likely.

<p>Second, seemingly <b>minor points of procedure have a huge impact</b> on our results.  We are used to situations where rough calculations do not lead us far astray.  But in this area differing assumptions or methodologies lead to very different results.  Very careful attention to both is warranted.


<h4><font color="#000060">Additional types of match</font></h4>

We can also answer the question posed above: with the phonetic criteria as given, neither <i>runa/r&eacute;n</i>, nor <i>chinchi/chong</i>, nor <i>chay/zh&egrave;</i> is a match.  Yet a comparer would probably set great store by each of them.  

<p>Obviously the initial-medial-final calculation is still a simplification.  Quechua, for instance, can have both initial and final consonant clusters; both languages have some two-phoneme roots; and of course a vague &quot;medial&quot; category is not a good way of handling multisyllabic words.

<p>We might decide to allow a Quechua medial to match either a Chinese medial or final, to catch resemblances like <i>runa/r&eacute;n</i> and <i>chinchi/chong</i>.  To do this we need to compute the chance that a Quechua medial matches a Chinese final, as follows.  (We can skip Quechua medials for which none of the corresponding Chinese sounds can end a word.)

<p><b><i>Medial-to-final</i></b>
<table>
<tr><td><b>a</b> aeo<td>.25907 * .3300 =<td>.08549
<tr><td><b>i</b> iey<td>.08808 * .4521 =<td>.03982
<tr><td><b>l</b> lr<td>.00415 * .0163  =<td>.00007
<tr><td><b>m</b> mn <td>.04145 *.1161  = <td>.00481
<tr><td><b>n</b> mn ng<td>.06528 * .2363 =<td>.01543
<tr><td><b>r</b> lr<td>.05078 * .0163  =<td>.00083
<tr><td><b>u</b> uow<td>.11399 * .2138 =<td>.02437
<tr><td><b>w</b> wu<td>.01451 * .1202 =<td>.00174
<tr><td><b>y</b> yi<td>.04145 * .2933 =<td>.01216
<tr><td><b>&ntilde;</b> mn ng y<td>.01244 * .2363 =<td>.00294
<tr><td><b>ll</b> lry<td>.01865 * .0163  =<td>.00030
</table>

Probability for a medial-to-final match = .18796 = 18.8%

<p>This can be added to the previous medial-to-medial estimate, on the grounds that when a medial doesn't match another medial, we're giving it another chance to match a final.  However, the additional chance should be discounted by the probability (30% in my sample Chinese text) that the initial and final are the same (that is, that the word is just two phonemes long).  So the medial-to-medial-or-final probability is .1751 + (.1880 * .70) = .3067.   

<p>The probability of finding a random match on a single word (no semantic leeway) can now be given as .0931 * .3067 * .2504 = <b>0.0071</b>.  
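<p>The discounted sum and the revised word probability can be checked with a short sketch.  The variable names are made up; the figures are the ones given above.

```python
p_initial = 0.0931
p_medial_to_medial = 0.1751
p_medial_to_final = 0.1880
p_final = 0.2504
two_phoneme_share = 0.30  # share of two-phoneme words in the sample Chinese text

# The second chance (medial-to-final) only applies to words that have a
# separate medial, so it is scaled by the share of longer words.
p_medial_any = p_medial_to_medial + p_medial_to_final * (1 - two_phoneme_share)
p_word = p_initial * p_medial_any * p_final

print(round(p_medial_any, 4))  # 0.3067
print(round(p_word, 4))        # 0.0071
```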

<p>This estimate could be revised still further to take account of such things as metathesis (switched consonants), or Quechua's initial consonant clusters.  Note that both refinements allow additional matches, and thus will increase <i>p</i> even more.


<h4><font color="#000060">Matching just two phonemes</font></h4>

We still haven't really taken account of <i>chay/zh&egrave;</i> (nor of <i>runa/r&eacute;n</i>, since we decided above that <i>u</i> and <i>e</i> don't match).  The probabilities calculated so far require three phonemes to match.  It might be interesting to know the probability that just two phonemes match.

<p>Since this probability is obviously going to be much higher, I don't recommend trying to combine both types of match into a single <i>p</i>, which would understate the difficulty of finding 3-phoneme matches and overstate that of 2-phoneme matches.  

<p>We can estimate the probability of a 2-phoneme match by using the probability of a match on initials times that of a Quechua medial matching a Chinese medial or final: .0931 * .3066 = <b>.0285</b> or about 1 in 35.

<p>This could be refined by adding the probability that a Quechua final matches a Chinese medial or final, this time discounted by the probability that the Quechua medial is also the final.

<h4><font color="#000060">An alternative approach</font></h4>

<p>If you want to avoid phonetic calculations entirely, there's an alternative approach: We pick a word <i>a</i> in A, then pick the word <i>b</i> in B which most closely resembles it phonetically.  To handle phonetic looseness, we pick the <i>n</i> words in B which most closely resemble it phonetically. 

<p>The advantage is that we don't have to mess with phonetic details or how to match the phonologies of different languages.  We can proceed quickly to an estimate of how many matches we can expect to find in general between two languages.

<p>The disadvantage is that this approach doesn't lend itself to evaluating other people's claims.  You can picture (say) Greenberg &amp; Ruhlen examining the <i>n</i> words in Tfaltik that most closely resemble <i>maliq'a</i>.  But what is their <i>n</i>?  To give a reasonable estimate we have to dive back into phonetic details and probabilities.
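<p>As a rough illustration of picking the <i>n</i> closest words, here is a sketch using Python's standard <code>difflib</code> module.  Be warned that this swaps in plain spelling similarity as a stand-in for phonetic resemblance, which is a considerable simplification; the word list and the function name are invented for the example.

```python
import difflib


def closest_words(word, lexicon, n=3):
    """Return up to n lexicon entries most similar to `word` by spelling.

    Spelling similarity is only a crude proxy for phonetic resemblance.
    cutoff=0.0 means no candidate is excluded for being too dissimilar.
    """
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=0.0)


# A toy lexicon of toneless pinyin forms, chosen only for illustration.
lexicon = ["ren", "chong", "zhe", "shan", "mao"]
print(closest_words("runa", lexicon, n=1))  # ['ren']
```

<p>The larger <i>n</i> is, the looser the notion of resemblance, which is exactly the quantity that is hard to pin down when evaluating someone else's comparisons.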

<hr>



</BODY></HTML>