Checking the p-value obtained from a hypothesis test

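I have now read through Chapter 19 of Pythonプログラミングイントロダクション. Chapter 19 covers statistical hypothesis testing.

The chapter takes the results of a randomized trial comparing PED-X and PED-Y, two drugs meant to improve cycling race times, and examines the mean times obtained. It explains how to test whether the difference in mean times is really significant or merely arose by chance.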

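Fisher's statistical hypothesis testing

The book gives the following steps for Fisher's statistical hypothesis testing. Following them, we assess the probability that the observed result occurred by chance.

1. State a null hypothesis and an alternative hypothesis.
2. Understand the statistical assumptions about the sample being evaluated.
3. Compute an appropriate test statistic.
4. Obtain the probability of that test statistic under the null hypothesis.
5. Decide whether that probability is small enough to infer that the null hypothesis is false, that is, whether to reject the null hypothesis.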

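1. State a null hypothesis and an alternative hypothesis.

Taking the mean times for PED-X and PED-Y as the example, the hypotheses are as follows.

Null hypothesis: the difference between the PED-X and PED-Y means is 0

Alternative hypothesis: the difference between the PED-X and PED-Y means is not 0

2. Understand the statistical assumptions about the sample being evaluated.

In the example, the finishing times of PED-X and PED-Y users were assumed to be normally distributed, and the samples were drawn from an infinite population.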

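3. Compute an appropriate test statistic.

We compute the t-value. The t-value expresses how far the difference between the two means is from 0, measured in units of the standard error. In the book it came out to -2.13165598142.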
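To make the definition concrete, here is a minimal sketch that computes a t-value by hand and compares it with SciPy. It assumes Welch's form of the two-sample statistic (no equal-variance assumption); the helper name `welch_t`, the seed, and the simulated data parameters are illustrative choices of mine, not values taken from the book.

```python
import math
import random
import statistics

from scipy import stats


def welch_t(a, b):
    """Difference of sample means divided by its estimated standard error."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    se_diff = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se_diff


random.seed(0)  # arbitrary seed, only so the run is repeatable
# Hypothetical finishing times in minutes for two groups of 100 riders
treatment = [random.gauss(119.5, 5.0) for _ in range(100)]
control = [random.gauss(120.0, 4.0) for _ in range(100)]

manual_t = welch_t(treatment, control)
scipy_t = stats.ttest_ind(treatment, control, equal_var=False)[0]
print('manual t =', manual_t)
print('scipy  t =', scipy_t)
```

The two printed values should agree up to floating-point rounding, which shows that the t-value is simply the observed difference rescaled by its standard error.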

The standard error was introduced in Chapter 17.

\( \sigma^{2} : \text{population variance} \\ \sigma : \text{population standard deviation} \\ n : \text{sample size} \)

\[ SE = \frac{\sigma}{\sqrt{n}} \]
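As a quick check of the formula, the sketch below evaluates SE for one hypothetical case and compares it with the spread of many simulated sample means. The function name `standard_error`, the values sigma = 5.0 and n = 100, the mean 120.0, the seed, and the number of trials are all assumptions made for illustration.

```python
import math
import random
import statistics


def standard_error(sigma, n):
    """Standard error of the mean: SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


sigma, n = 5.0, 100  # hypothetical population standard deviation and sample size
se = standard_error(sigma, n)

# Empirical check: the standard deviation of many sample means
# should be close to the value given by the formula.
rng = random.Random(0)  # arbitrary seed for repeatability
sample_means = [
    statistics.mean(rng.gauss(120.0, sigma) for _ in range(n))
    for _ in range(2000)
]
empirical_se = statistics.stdev(sample_means)

print('formula SE   =', se)
print('empirical SE =', empirical_se)
```

The empirical figure will not match the formula exactly, but it should land close to it, and it gets closer as the number of simulated samples grows.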

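4. Obtain the probability of that test statistic under the null hypothesis.

We compute the p-value. The p-value is the probability, assuming the null hypothesis holds, of obtaining a statistic at least as extreme as the one observed. In other words, it is the probability of seeing a difference this extreme if the true difference between the PED-X and PED-Y means were 0. In the book it was 0.0343720799815, or about 3.4%.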
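The link between the t-value and the p-value can be sketched as follows. This assumes a two-sided Welch test, with the degrees of freedom approximated by the Welch-Satterthwaite formula; the helper names `welch_df` and `two_sided_p`, the seed, and the simulated data are mine and are not taken from the book.

```python
import random
import statistics

from scipy import stats


def welch_df(a, b):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))


def two_sided_p(t_value, df):
    """Probability of a t statistic at least as far from 0 as t_value."""
    # sf is the survival function, 1 - cdf; doubling covers both tails
    return 2 * stats.t.sf(abs(t_value), df)


random.seed(1)  # arbitrary seed for repeatability
# Hypothetical finishing times in minutes
treatment = [random.gauss(119.5, 5.0) for _ in range(100)]
control = [random.gauss(120.0, 4.0) for _ in range(100)]

result = stats.ttest_ind(treatment, control, equal_var=False)
manual_p = two_sided_p(result[0], welch_df(treatment, control))
scipy_p = result[1]
print('manual p =', manual_p)
print('scipy  p =', scipy_p)
```

Both tails are counted because the alternative hypothesis here is only that the difference is not 0, without specifying a direction.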

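5. Decide whether that probability is small enough to infer that the null hypothesis is false, that is, whether to reject the null hypothesis.

In the book the p-value is 3.4%, which almost leads to the conclusion that PED-X is the better drug, but the discussion then turns to why the p-value cannot be trusted that easily.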

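Computing the p-value many times

Reading further, the book reveals that the data in this example were generated by a program. The values were randomly sampled from one distribution with mean 119.5 and standard deviation 5.0 and another with mean 120.0 and standard deviation 4.0, using the random.gauss function.

Running a program like the following several times shows the p-value swinging wildly.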

import random
from scipy import stats


def get_t_value():
    # (mean, standard deviation) of each simulated population, in minutes
    treatmentDist = (119.5, 5.0)
    controlDist = (120, 4.0)
    sampleSize = 100  # per group, so 200 samples in total

    treatmentTimes, controlTimes = [], []

    for s in range(sampleSize):
        treatmentTimes.append(random.gauss(treatmentDist[0], treatmentDist[1]))
        controlTimes.append(random.gauss(controlDist[0], controlDist[1]))

    controlMean = sum(controlTimes) / len(controlTimes)
    treatmentMean = sum(treatmentTimes) / len(treatmentTimes)

    # Two-sample t-test without assuming equal variances (Welch's t-test)
    twoSampleTest = stats.ttest_ind(treatmentTimes, controlTimes, equal_var=False)
    print('Treatmeant mean - control mean = ',
          treatmentMean - controlMean, ' minutes')
    print('The t-statistic from two-sample test is ', twoSampleTest[0])
    print('The p-value from two-sample test is ', twoSampleTest[1])
    return twoSampleTest


# Repeat the experiment 10 times to see how much the p-value varies
for i in range(10):
    twoSampleTest = get_t_value()

The results look like this. The p-value is sometimes just under 2% and sometimes over 60%.

Treatmeant mean - control mean =  -0.6843325129559474  minutes
The t-statistic from two-sample test is  -1.0716961219246917
The p-value from two-sample test is  0.285226164284344
Treatmeant mean - control mean =  -0.39645994742086543  minutes
The t-statistic from two-sample test is  -0.6751950331252392
The p-value from two-sample test is  0.5004085432884583
Treatmeant mean - control mean =  0.25269386654427706  minutes
The t-statistic from two-sample test is  0.3950053957541872
The p-value from two-sample test is  0.6932658871131978
Treatmeant mean - control mean =  0.50100925426743  minutes
The t-statistic from two-sample test is  0.7660580410192988
The p-value from two-sample test is  0.4445624847162438
Treatmeant mean - control mean =  -0.9383152222142996  minutes
The t-statistic from two-sample test is  -1.3925614858916715
The p-value from two-sample test is  0.16555904568337207
Treatmeant mean - control mean =  -0.4504161747373985  minutes
The t-statistic from two-sample test is  -0.7134525526996994
The p-value from two-sample test is  0.4764536661236095
Treatmeant mean - control mean =  -0.517887936147531  minutes
The t-statistic from two-sample test is  -0.818800036206608
The p-value from two-sample test is  0.41390883875821693
Treatmeant mean - control mean =  -0.7535810558372305  minutes
The t-statistic from two-sample test is  -1.2462075073793941
The p-value from two-sample test is  0.21417712992121332
Treatmeant mean - control mean =  -1.5033637291589912  minutes
The t-statistic from two-sample test is  -2.347673847276862
The p-value from two-sample test is  0.019992408614120653
Treatmeant mean - control mean =  -0.3150896341041829  minutes
The t-statistic from two-sample test is  -0.5046312441817462
The p-value from two-sample test is  0.6144229896604907

So, with a sample size of around 200, the p-value fluctuates this much.
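To see the fluctuation beyond ten runs, the experiment can be repeated many times and the p-values summarized. The sketch below assumes 100 samples per group (200 in total), the two distributions described above, and a 0.05 threshold; the threshold, the number of trials, the seed, and the function name `simulate_p_values` are my own choices for illustration.

```python
import random

from scipy import stats


def simulate_p_values(num_trials, sample_size, seed=0):
    """Run the two-sample experiment num_trials times and collect p-values."""
    rng = random.Random(seed)  # arbitrary seed for repeatability
    p_values = []
    for _ in range(num_trials):
        treatment = [rng.gauss(119.5, 5.0) for _ in range(sample_size)]
        control = [rng.gauss(120.0, 4.0) for _ in range(sample_size)]
        result = stats.ttest_ind(treatment, control, equal_var=False)
        p_values.append(float(result[1]))
    return p_values


p_values = simulate_p_values(num_trials=1000, sample_size=100)
# Share of experiments that would be called significant at the assumed 0.05 level
fraction_below = sum(p < 0.05 for p in p_values) / len(p_values)

print('smallest p-value     =', min(p_values))
print('largest p-value      =', max(p_values))
print('fraction with p<0.05 =', fraction_below)
```

Even though the two populations really do differ by 0.5 minutes, only a minority of experiments of this size should fall below the threshold, which is the instability the ten runs above hint at.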

Summary

I followed the flow of a hypothesis test. The procedure covered here is a two-sample, two-tailed test. The book also covers one-tailed tests and one-sample tests.
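For reference, here is a rough sketch of how the other two variants might look with SciPy. It is not taken from the book: it assumes SciPy 1.6 or later (for the `alternative` keyword), and the data, the seed, and the reference mean of 120.0 used in the one-sample test are hypothetical.

```python
import random

from scipy import stats

random.seed(0)  # arbitrary seed for repeatability
# Hypothetical finishing times in minutes
treatment = [random.gauss(119.5, 5.0) for _ in range(100)]
control = [random.gauss(120.0, 4.0) for _ in range(100)]

# Two-tailed: is the treatment mean different from the control mean?
two_tailed = stats.ttest_ind(treatment, control, equal_var=False)

# One-tailed: is the treatment mean smaller than the control mean?
one_tailed = stats.ttest_ind(treatment, control, equal_var=False,
                             alternative='less')

# One-sample: is the treatment mean different from a fixed value of 120.0?
one_sample = stats.ttest_1samp(treatment, 120.0)

print('two-tailed p =', two_tailed.pvalue)
print('one-tailed p =', one_tailed.pvalue)
print('one-sample p =', one_sample.pvalue)
```

The t-statistic is the same for the one-tailed and two-tailed versions; only the tail area counted as "extreme" changes, so when the observed difference points in the hypothesized direction the one-tailed p-value is half the two-tailed one.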

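The book points out that a low p-value may mean the null hypothesis really is false, but it may also mean the sample is not representative of the population. As a way of thinking that addresses this problem with p-values, the book then moves on to Bayesian statistics in Chapter 20.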