Sean's Data Analysis Note

Wednesday, May 23, 2012

Formula Sheet for mathematical statistics

1. Probability Cheat Sheet
one page (2 sides) by Peleg Michaeli
peleg.yogiley.org/math/probability/probabilitycs.pdf

2. Probability and Statistics Cookbook
detailed list by Matthias Vallentin
http://matthias.vallentin.net/probability-and-statistics-cookbook/cookbook-en.pdf

Tuesday, May 22, 2012

Note on mathematical statistics: 5.7 Chi-Square Test

1. Mathematical basis of Chi-Square test

If X_1, ..., X_n are independent normal variables with X_i ~ N(mu_i, sigma_i^2), then Y = Sum [i=1..n] ( (X_i-mu_i)^2 / sigma_i^2 ) has a Chi-Square(n) distribution.
By the CLT, the joint distribution of the group frequencies of a large sample is approximately multivariate normal.
The statistic Q_k-1 = Sum( Y_i^2 ) = Sum [i=1..k] ( (X_i-E_i)^2/E_i ) therefore has approximately a Chi-Square(k-1) distribution; one df is lost because the frequencies must sum to n.
With the idea of interval frequency approximation, every distribution can be treated as multinomial.
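
The Chi-Square(k-1) approximation can be checked numerically. The following is only a sketch: the cell probabilities, sample size and number of replications are hypothetical choices of mine, and it checks just one property, namely that the average of Q over many multinomial samples should be close to k-1, the mean of a Chi-Square(k-1) variable.

```python
import random


def pearson_q(counts, probs):
    """Q = Sum over cells of (X_i - E_i)^2 / E_i, with E_i = n * p_i."""
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))


def simulate_mean_q(probs, n, reps, seed=0):
    """Average Q over `reps` simulated multinomial samples of size n."""
    rng = random.Random(seed)
    cells = range(len(probs))
    total = 0.0
    for _ in range(reps):
        counts = [0] * len(probs)
        # Draw n results and count how many fall in each cell.
        for cell in rng.choices(cells, weights=probs, k=n):
            counts[cell] += 1
        total += pearson_q(counts, probs)
    return total / reps


probs = [0.1, 0.2, 0.3, 0.4]  # hypothetical cell probabilities, k = 4
mean_q = simulate_mean_q(probs, n=200, reps=2000)
print(mean_q)  # expected to be close to k - 1 = 3
```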

2. Procedure (the idea of interval frequency approximation)

Partition the domain of the experiment's results into finitely many mutually disjoint sets A_1, A_2, ..., A_k
Count the number of results in A_i as the frequency X_i
Assign the df (how depends on the test)
Assign the probability p_i of a result falling in A_i (how depends on the test)
Evaluate the statistic Q_k-1
Test: reject H0 if Q_k-1 exceeds the Chi-Square critical value for that df
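
The steps above can be sketched in Python for the goodness-of-fit case. This is a minimal illustration under my own assumptions: the sample values, the four equal intervals on [0, 1) and the uniform null hypothesis are all hypothetical, and chi2_sf is a small helper written here (not a library call) that evaluates the Chi-Square upper tail through the standard recursion on the degrees of freedom.

```python
import math
from bisect import bisect_right


def chi2_sf(x, df):
    """Upper tail P(Chi-Square(df) > x) for an integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1  # exact tail for df = 1
    else:
        q, k = math.exp(-half), 2             # exact tail for df = 2
    while k < df:
        # Tail(df = k + 2) = Tail(df = k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def interval_counts(sample, edges):
    """Steps 1-2: count results in [edges[i], edges[i+1]).

    Assumes every value lies in [edges[0], edges[-1]).
    """
    counts = [0] * (len(edges) - 1)
    for x in sample:
        counts[bisect_right(edges, x) - 1] += 1
    return counts


def goodness_of_fit(counts, probs):
    """Steps 3-6: df, expected counts, Q and its approximate p-value."""
    n = sum(counts)
    expected = [n * p for p in probs]
    q = sum((x - e) ** 2 / e for x, e in zip(counts, expected))
    df = len(counts) - 1
    return q, df, chi2_sf(q, df)


# Hypothetical sample, tested against a uniform distribution on [0, 1).
sample = [0.05, 0.12, 0.20, 0.22, 0.31, 0.35, 0.41, 0.44, 0.47, 0.49,
          0.52, 0.58, 0.63, 0.66, 0.71, 0.77, 0.81, 0.86, 0.93, 0.98]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
counts = interval_counts(sample, edges)
q, df, p_value = goodness_of_fit(counts, [0.25] * 4)
print(counts, q, df, p_value)
```

H0 is rejected at level alpha when p_value < alpha.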

3. Three tests

Test	Goodness of fit	Homogeneity	Independence
Example	5.7.1, 5.7.2	5.7.3	5.7.4
H0	The results follow the theoretical distribution	The two samples come from the same distribution	The two attributes of the subjects are independent
Key fact under H0	E(X_i)/n = p_0i	p_i1 = p_i2 = p_i	p_ij = p_i. * p_.j
Source of E	multinomial model	MLE	MLE
Formula of E	E_i = n*p_0i (1<=i<=k)	E_ij = n_j*(X_i1+X_i2)/(n1+n2)	E_ij = n*(X_i./n)*(X_.j/n)
df	k-1	k-1 = 2(k-1)-(k-1)	(a-1)(b-1) = (a*b-1)-(a+b-2)
Statistic	Sum [i=1..k] ( (X_i-E_i)^2/E_i )	Sum [j=1,2] Sum [i=1..k] ( (X_ij-E_ij)^2/E_ij )	Sum [i=1..a] Sum [j=1..b] ( (X_ij-E_ij)^2/E_ij )
Dataset	x_1, x_2, ..., x_n	x_1, x_2, ..., x_n1 and y_1, y_2, ..., y_n2	contingency table X_ij, 1<=i<=a, 1<=j<=b, k=a*b
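
As a rough illustration of the independence column, the sketch below computes the expected counts (row total * column total / n), the statistic and the df for a contingency table. The 2x2 table is hypothetical data of mine. The homogeneity statistic has the same arithmetic form if the two samples are entered as the two columns of the table.

```python
def independence_statistic(table):
    """Return (Q, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    q = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            # Expected count under H0: n * (X_i./n) * (X_.j/n)
            e = row_totals[i] * col_totals[j] / n
            q += (x - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return q, df


table = [[10, 20],
         [30, 40]]  # hypothetical counts
q, df = independence_statistic(table)
print(q, df)  # compare q with the Chi-Square(df) critical value (3.841 for df = 1 at the 5% level)
```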

4. Remarks

Chi-Square tests are not exact tests, but approximate tests
The statistic is based on the frequencies of results in intervals, not on the results themselves
Make sure every E_i >= 5, or use the Fisher exact test instead
Minimum Chi-Square estimation: choose the parameter estimates that minimize the statistic
MLE-based Chi-Square tests reject at least as often as tests based on the minimum Chi-Square estimator, because the latter gives the smallest possible statistic
Every estimated parameter p_0i costs one df
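
For the case where the expected counts are too small, here is a minimal sketch of a two-sided Fisher exact test for a 2x2 table, using the usual convention of summing the hypergeometric probabilities of all tables that are no more likely than the observed one. The function name and the example table are my own, hypothetical choices.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the fixed row and column totals.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Small relative tolerance so that ties with p_obs are included.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


p = fisher_exact_two_sided(3, 1, 1, 3)  # hypothetical table [[3, 1], [1, 3]]
print(p)
```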

Notes on mathematical statistics: textbook

This is a series of notes on my study of mathematical statistics, specifically on the textbook
Introduction to Mathematical Statistics,
R. V. Hogg, J. W. McKean and A. T. Craig,
6th edition, Pearson.

There is an official solution manual (a PDF file) with answers to some of the exercises.
However, it can be located and downloaded only after an intensive Google search.

I keep my homework and exercises from Chapter 4 to Chapter 7, which can be used as an "AS IS" manual for the book (covering not only the even-numbered questions in the official solutions, but also some odd-numbered questions).

Contact me with the page number and exercise number.
I will see what I can do for you.

Tuesday, January 31, 2012

Embedding data in workflow

Source: QnA in Knime forum

Contributor ID: jfalgout

Goal
To embed data in a workflow, so that it can be distributed with the workflow package zip file.

Strategy

Create a directory, with the data files in it, underneath the directory that contains the node's artifacts in the workflow. The "knime.node" flow variable will be populated the next time you edit the node.

The folder you create within the node's folder has to be named "drop". Otherwise, whatever folders and files you add are deleted when you save the workspace.
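
The file-system side of this strategy can be sketched in Python. This is only an assumption-laden illustration: embed_in_drop_folder is a hypothetical helper of mine, the node folder name in the demo is made up, and the script does nothing but create the "drop" folder and copy a file into it. It does not touch Knime itself, and like the strategy above it is untested against a real workflow.

```python
import shutil
import tempfile
from pathlib import Path


def embed_in_drop_folder(node_dir, data_file):
    """Copy data_file into <node_dir>/drop, creating the folder if needed."""
    drop = Path(node_dir) / "drop"
    drop.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(data_file, drop))


# Demo in a temporary directory; the workflow and node folder names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    node_dir = Path(tmp) / "MyWorkflow" / "File Reader (#1)"
    node_dir.mkdir(parents=True)
    data = Path(tmp) / "input.csv"
    data.write_text("a,b\n1,2\n")
    embedded = embed_in_drop_folder(node_dir, data)
    print(embedded.relative_to(tmp))
    demo_ok = (embedded.read_text() == "a,b\n1,2\n"
               and embedded.parent.name == "drop")
```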

Test

Not yet

Sunday, January 15, 2012

Example to create new Node

Example to create new Node: http://tech.knime.org/developer/example

Monday, January 9, 2012

Typical workflow for a test on a specific Column

Source: QnA in Knime forum

Contributor ID: James Davidson

Goal

Derive a new column from a test on an existing column

Strategy
1. Use [Row Filter] to do the test and split the original table (with Include & Exclude)
2. Use [Java snippet] to generate the value for the new column in each part
3. Use [Concatenate] to merge the tables back
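
The logic of the three steps, outside Knime, looks roughly like the Python sketch below. It is not Knime code: the function, the column names and the rows are hypothetical, and it only mirrors the split / derive / concatenate idea.

```python
def derive_column(rows, test, value_if_true, value_if_false, new_column):
    """Split rows on a test, add a new column to each part, then merge."""
    # Step 1: split the table (Include / Exclude).
    included = [r for r in rows if test(r)]
    excluded = [r for r in rows if not test(r)]
    # Step 2: generate the new column's value in each part.
    included = [{**r, new_column: value_if_true(r)} for r in included]
    excluded = [{**r, new_column: value_if_false(r)} for r in excluded]
    # Step 3: concatenate the parts (the original row order is not preserved).
    return included + excluded


rows = [  # hypothetical table
    {"name": "a", "score": 72},
    {"name": "b", "score": 41},
    {"name": "c", "score": 90},
]
result = derive_column(rows, lambda r: r["score"] >= 60,
                       lambda r: "pass", lambda r: "fail", "grade")
print(result)
```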

Comment
1. In some cases, [Joiner] may be a shortcut to the goal

2. I do not know whether this is the best solution, since I am just a beginner

Resource
Download James's Package

Friday, January 6, 2012

Remove empty columns in Knime

Goal
Delete the columns that consist entirely of blank values (0, NaN, NA, NULL or a fixed string)

Strategy
1. If the target is a numeric column (double or int), use the [Low Variance Filter] node
2. Use [Transpose] + [Row Filter] + [Transpose], as suggested by
http://tech.knime.org/forum/knime-general/removing-columns-where-every-value-is-empty
3. Use a code snippet in a scripting node (R, JPython or Java) to process the table

Comment
None of the three works for me.
1. My target columns are of string type. The [String to Number] node needs to be configured for each source column manually.
That is impractical when there are many empty columns.
2. The Missing Value option in the [Row Filter] settings works only on one specified column, so it cannot be automated.
3. I could equally write an R or Python script outside Knime to do the filtering.

Final solution
1. Pre-process the CSV file outside Knime
2. Manually skip the columns in the [File Reader] settings.
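
For the first option, a pre-processing script could look like the sketch below. It is a minimal example under my own assumptions: the set of values treated as blank follows the list in the goal above, the CSV is assumed to be rectangular with a header row, and the sample data is hypothetical.

```python
import csv
import io

BLANKS = {"", "0", "NaN", "NA", "NULL"}  # values treated as blank


def drop_blank_columns(src, dst, blanks=BLANKS):
    """Copy CSV from src to dst, dropping columns whose values are all blank.

    Assumes a header row and the same number of fields in every row.
    Returns the names of the columns that were kept.
    """
    rows = list(csv.reader(src))
    header, body = rows[0], rows[1:]
    keep = [j for j in range(len(header))
            if any(row[j].strip() not in blanks for row in body)]
    writer = csv.writer(dst)
    for row in rows:
        writer.writerow([row[j] for j in keep])
    return [header[j] for j in keep]


# Hypothetical data; with real files, pass handles opened with newline="".
src = io.StringIO("id,empty,note,zero\n1,,x,0\n2,NA,,0\n")
dst = io.StringIO()
kept = drop_blank_columns(src, dst)
print(kept)
print(dst.getvalue())
```

Note that treating 0 as blank also removes a numeric column that is legitimately all zeros, so the BLANKS set should be adjusted to the data.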