Tuesday, May 22, 2012

Note on mathematical statistics: 5.7 Chi-Square Test

1. Mathematical basis of Chi-Square test
  • If X_1, ..., X_n are independent normal variables with means mu_i and variances sigma_i^2, then Y = Sum( (X_i-mu_i)^2 / sigma_i^2 ) has a Chi-Square(n) distribution.
  • By the CLT, the joint distribution of the cell counts of a large sample is treated as approximately multivariate normal.
  • The statistic Q_k-1 = Sum( Y_i^2 ) = Sum [i=1..k] ( (X_i-E_i)^2/E_i ), with Y_i = (X_i-E_i)/sqrt(E_i), has approximately a Chi-Square(k-1) distribution; one df is lost because the X_i must sum to n.
  • With the idea of interval frequency approximation, every distribution can be treated as multinomial.
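The Chi-Square(k-1) approximation can be checked by simulation. The following is a minimal sketch, assuming a hypothetical 4-cell multinomial whose probabilities are chosen only for illustration; the function names are not from the text. A Chi-Square(k-1) variable has mean k-1, so the simulated mean of Q should land near 3.

```python
import random

def pearson_q(counts, probs):
    """Q = Sum (X_i - E_i)^2 / E_i with E_i = n * p_i."""
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

def simulate_counts(n, probs, rng):
    """Draw n results and count how many fall in each cell."""
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts

def mean_q(n, probs, reps, seed=0):
    """Average Q over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    total = sum(pearson_q(simulate_counts(n, probs, rng), probs)
                for _ in range(reps))
    return total / reps

probs = [0.1, 0.2, 0.3, 0.4]  # hypothetical cell probabilities, k = 4
print(mean_q(n=200, probs=probs, reps=2000))  # should be near k - 1 = 3
```

Comparing only the mean is a weak check; a histogram of the simulated Q values against the Chi-Square(3) density would be a fuller one.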
2. Procedure (the idea of interval frequency approximation)
  • Partition the domain of the experiment's results into finitely many mutually disjoint sets A_1, A_2 ... A_k
  • Count the number of results falling in A_i as the frequency X_i
  • Determine the df (how depends on the test)
  • Assign the probability p_i of a result falling in A_i (how depends on the test)
  • Evaluate the statistic Q_k-1
  • Test: reject H0 when the statistic exceeds the Chi-Square critical value for the chosen level
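The steps above can be sketched for the goodness-of-fit case as follows. This is a hedged illustration, not the book's code: the die-roll data are hypothetical, `goodness_of_fit` and `chi2_sf` are names introduced here, and the p-value comes from a standard recurrence for the Chi-Square tail probability using only the standard library.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """P(Chi-Square(df) > x) for integer df >= 1, using the recurrence
    sf(df + 2) = sf(df) + (x/2)^(df/2) * exp(-x/2) / Gamma(df/2 + 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        sf, d = math.exp(-x / 2), 2              # exact tail for df = 2
    else:
        sf, d = math.erfc(math.sqrt(x / 2)), 1   # exact tail for df = 1
    while d < df:
        sf += (x / 2) ** (d / 2) * math.exp(-x / 2) / math.gamma(d / 2 + 1)
        d += 2
    return sf

def goodness_of_fit(results, cells, probs, n_estimated=0):
    """Count X_i per cell, compute Q, and return (Q, df, p-value)."""
    tally = Counter(results)
    counts = [tally[c] for c in cells]        # X_i
    n = sum(counts)
    expected = [n * p for p in probs]         # E_i = n * p_i
    q = sum((x - e) ** 2 / e for x, e in zip(counts, expected))
    df = len(counts) - 1 - n_estimated        # one df lost per estimated parameter
    return q, df, chi2_sf(q, df)

# Hypothetical data: 60 rolls of a die, H0: every face has probability 1/6.
rolls = [1] * 8 + [2] * 9 + [3] * 12 + [4] * 11 + [5] * 6 + [6] * 14
q, df, p_value = goodness_of_fit(rolls, cells=range(1, 7), probs=[1 / 6] * 6)
print(q, df, p_value)
print("reject H0 at 5%" if p_value < 0.05 else "do not reject H0 at 5%")
```

Here each cell A_i is a single face of the die; for continuous data the cells would be intervals and the counting step would bin each result first.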
3. Three tests

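| Test | Goodness of fit | Homogeneity | Independence |
|---|---|---|---|
| Example | 5.7.1, 5.7.2 | 5.7.3 | 5.7.4 |
| H0 | The results follow the theoretical distribution | The two samples have the same distribution | The two attributes of the subjects are independent |
| Key fact | X_i/n ≈ p_0i | p_1i = p_2i = p_0i, estimated by E_i | P_ij = P_i. * P_.j |
| Source of E_i | multinomial model | MLE | MLE |
| Formula of E_i (1<=i<=k) | E_i = n*p_0i | E_i = (X_i1+X_i2)/(n1+n2) | E_ij = (X_i./n) * (X_.j/n) |
| df | k-1 | k-1 = 2(k-1)-(k-1) | (a-1)(b-1) = (a*b-1)-(a+b-2) |
| Statistic | Sum [i=1..k] ( (X_i-E_i)^2 / E_i ) | Sum [j=1,2] Sum [i=1..k] ( (X_ij-n_j*E_i)^2 / (n_j*E_i) ) | Sum [i=1..a] Sum [j=1..b] ( (X_ij-n*E_ij)^2 / (n*E_ij) ) |
| Dataset | x_1, x_2 ... x_n | x_1, x_2 ... x_n1 and y_1, y_2 ... y_n2 | contingency table X_ij, 1<=i<=a, 1<=j<=b, k=a*b |

Note on notation: in the goodness-of-fit column E_i is an expected count. In the homogeneity and independence columns E_i and E_ij are estimated cell probabilities, so the expected counts are n_j*E_i and n*E_ij.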
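The homogeneity and independence statistics can be computed directly from the formulas in the table. The sketch below assumes hypothetical counts and introduces the function names `independence_q` and `homogeneity_q` for illustration; it also assumes every row and column total is positive, so no expected count is zero.

```python
def independence_q(table):
    """table[i][j] = X_ij for an a-by-b contingency table. Returns (Q, df)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]          # X_i.
    col_tot = [sum(col) for col in zip(*table)]    # X_.j
    q = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            e_ij = (row_tot[i] / n) * (col_tot[j] / n)   # estimated P_ij under H0
            q += (x - n * e_ij) ** 2 / (n * e_ij)
    return q, (len(table) - 1) * (len(table[0]) - 1)

def homogeneity_q(sample1, sample2):
    """sample_j[i] = X_ij, the count of category i in sample j. Returns (Q, df)."""
    n1, n2 = sum(sample1), sum(sample2)
    q = 0.0
    for x1, x2 in zip(sample1, sample2):
        e_i = (x1 + x2) / (n1 + n2)                # pooled estimate of p_i
        q += (x1 - n1 * e_i) ** 2 / (n1 * e_i)
        q += (x2 - n2 * e_i) ** 2 / (n2 * e_i)
    return q, len(sample1) - 1

# Hypothetical 2-by-2 counts, read either as a contingency table
# or as two samples over two categories.
print(independence_q([[20, 30], [30, 20]]))
print(homogeneity_q([20, 30], [30, 20]))
```

Both calls give the same Q and df: with two samples, n_j times the pooled estimate equals (row total * column total)/n, so the homogeneity statistic coincides with the independence statistic on the k-by-2 table.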

4. Remarks
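  • Chi-Square tests are approximate tests, not exact tests: the Chi-Square distribution of the statistic holds only in the large-sample limit
  • The statistic is based on the frequencies of results in the intervals, not on the results themselves
  • Make sure every expected count exceeds 5 (E_i > 5 in the goodness-of-fit notation); otherwise use Fisher's exact test
  • Minimum Chi-Square estimation: unknown parameters can be estimated by the values that minimize the statistic Q
  • MLE-based Chi-Square tests reject at least as often as tests based on the minimum Chi-Square estimator, because the minimum Chi-Square estimate gives the smallest possible Q by definition
  • Every independently estimated parameter (such as a p_0i) costs one df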
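The relation between the MLE and the minimum Chi-Square estimator can be illustrated numerically. The sketch below assumes a hypothetical one-parameter model with three cells, p(theta) = (theta^2, 2*theta*(1-theta), (1-theta)^2), and hypothetical counts; the minimum Chi-Square estimate is found by a plain grid search, which is only accurate to the grid spacing.

```python
def cell_probs(theta):
    """Hypothetical model: three cell probabilities driven by one parameter."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]

def q_stat(counts, theta):
    """Q = Sum (X_i - n*p_i(theta))^2 / (n*p_i(theta))."""
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p)
               for x, p in zip(counts, cell_probs(theta)))

def mle_theta(counts):
    """Closed-form MLE for this model: (2*X_1 + X_2) / (2n)."""
    x1, x2, _ = counts
    return (2 * x1 + x2) / (2 * sum(counts))

def min_chi2_theta(counts, steps=10000):
    """Grid search over (0, 1) for the theta that minimizes Q."""
    grid = (i / steps for i in range(1, steps))
    return min(grid, key=lambda t: q_stat(counts, t))

counts = [38, 44, 18]                       # hypothetical observed counts
q_mle = q_stat(counts, mle_theta(counts))
q_min = q_stat(counts, min_chi2_theta(counts))
df = len(counts) - 1 - 1                    # k - 1, minus one estimated parameter
print(q_mle, q_min, df)
```

Since q_min is never larger than q_mle, a test that compares both against the same Chi-Square(df) critical value rejects with the MLE whenever it rejects with the minimum Chi-Square estimate.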
