Generate random permutation in R
start <- 1
end <- 10
x <- seq(start, end)  # note: c(start, end) gives only the two endpoints, not the sequence
sample(x, length(x), replace = FALSE)
Equivalent MATLAB: X = randperm(End - Begin + 1) + Begin - 1
example: [1] 5 1 2 8 6 4 3 7 9 10
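For comparison, the same permutation can be produced with the Python standard library (a minimal sketch; the `start`/`end` names mirror the R variables above):

```python
import random

start, end = 1, 10
values = list(range(start, end + 1))  # the inclusive sequence start..end
random.shuffle(values)                # in-place Fisher-Yates shuffle
print(values)                         # e.g. [5, 1, 2, 8, 6, 4, 3, 7, 9, 10]
```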
Wednesday, November 7, 2012
Thursday, September 20, 2012
Remove duplicated rows in Knime
Goal
Delete the duplicated rows in a table.
Strategy
1. Use the GroupBy node; refer to the forum thread
"Detect and delete duplicate files (rows) based on two variable identifiers (date and lot number) and not based on row ID"
2. Use a database query: SELECT DISTINCT
3. Use a code snippet node (R, Jython, or Java) to process the table
Comment
1. Deleting empty rows is a special case of this topic.
2. Two rows may be duplicated across all columns, or only on several columns (a combination key).
3. Strategies 2 and 3 are not tested.
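Strategy 3 can be sketched in plain Python (untested in KNIME itself). The column names "date" and "lot_number" come from the forum thread quoted above; the data is illustrative:

```python
# Drop rows duplicated on a combination key (date, lot_number),
# keeping the first row seen for each key.
rows = [
    {"date": "2012-01-05", "lot_number": "A1", "value": 10},
    {"date": "2012-01-05", "lot_number": "A1", "value": 12},  # duplicate key
    {"date": "2012-01-06", "lot_number": "B2", "value": 7},
]

seen = set()
unique_rows = []
for row in rows:
    key = (row["date"], row["lot_number"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(len(unique_rows))  # 2
```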
Saturday, July 21, 2012
Regression Analysis of Airline’s Incidents
Term Research Project in SAS and Statistic Consulting
Title: Regression Analysis of Airline’s Incidents
Outline
============================================
Download slide: https://docs.google.com/open?id=0Bw64rMSoJR_ZUjl6Q2NkODNmRFE
Introduction
Data
Analysis
Simple linear regression
Variance stabilizing transformation
Non-linear Regression (Poisson )
Conclusion
Wednesday, July 18, 2012
SAS PROC SQL enhancement to ANSI
The differences between PROC SQL and ANSI SQL are documented in "PROC SQL and the ANSI Standard" (SAS 9.2 Help), which covers:
- Compliance
- SQL Procedure Enhancements
- SQL Procedure Omissions
Useful enhancements include:
- the FORMAT, INFORMAT, and LABEL column modifiers
- the CONTAINS operator
- the CALCULATED keyword
Monday, July 9, 2012
Pricing scheme of Group-buy Websites in China
Term Project for Applied Statistics
Outline
============================================
Background: Group Buying/ C2C
Data set: Public production data
Method: Descriptive/ Inference/ Diagnostics
Analysis: Preparing/Visualization/2 Models
Conclusion: Competition/Market
Friday, June 15, 2012
Python code for Gaussian Elimination Algorithm
This is a homework for Math 630: Linear Algebra
Textbook : Linear Algebra and Its Applications, G. Strang, Thomson Brooks ISBN: 0030105676
Code Brief
- Written in Python 2 (uses the list-returning map and filter)
- All matrix indices (row and column) are zero-based
- lines 1-92: algorithm
- lines 96-336: tests
=======================Begin of Code=============================
001 import unittest
002
003 ######################################
004 # Three elementary transformations
005 ######################################
006 def SwapRow(aMatrix, aRow1, aRow2):
007     if aRow1!=aRow2:
008         aMatrix[aRow1],aMatrix[aRow2] = aMatrix[aRow2],aMatrix[aRow1];
009
010 def ScaleRow(aMatrix, aRow, aScalar):
011     aMatrix[aRow] = map(lambda x:x*aScalar,aMatrix[aRow]);
012
013 def CombineRow(aMatrix, aAddTo, aAddBy, aScalar):
014     aMatrix[aAddTo] = map(lambda x,y:x+y*aScalar,
015                           aMatrix[aAddTo],aMatrix[aAddBy]);
016
017 def DoZeroCorrection(aValue): #not tested
018     if IsZero(aValue):
019         return 0;
020     else:
021         return aValue;
022     # unreachable here; intended for use inside CombineRow: aMatrix[aAddTo] = map(DoZeroCorrection, aMatrix[aAddTo]);
023
024 ######################################
025 # Helpers
026 ######################################
027 def GetPivotCoeff(aPivot, aTarget):
028     return -1.0*(aTarget+0.0)/(aPivot+0.0);
029
030 EPSION = 0.000001; # 1 over a million
031 def IsZero(aValue):
032     return abs(aValue)<EPSION;
033
034 def IsZeroRow(aRow):
035     return len(filter(IsZero,aRow))== len(aRow);
036
037
038 ######################################
039 # Operations
040 ######################################
041 def PivotDown(aMatrix, aLastNonZeroRow, aRow, aCol):
042     for i in range(aRow+1,aLastNonZeroRow+1):
043         if IsZeroRow(aMatrix[i]):
044             continue;
045
046         CombineRow(aMatrix,i,aRow,
047             GetPivotCoeff(aMatrix[aRow][aCol],aMatrix[i][aCol]))
048
049 def MakeColumnPivot(aMatrix, aRow, aLastNonZeroRow, aCol):
050     for c in range(aCol,len(aMatrix[0])):
051         for r in range(aRow,aLastNonZeroRow+1):
052             if not IsZero(aMatrix[r][c]):
053                 SwapRow(aMatrix,aRow,r);
054                 return c;
055
056     # This line will never be hit, because after doing MakeRowPivot,
057     # at least the current row has a non-zero entry, so we return via the
058     # shortcut in SwapRow when Row1 == Row2
059     return len(aMatrix[0])-1;
060
061 def MakeRowPivot(aMatrix, aCurrRow, aLastNonZeroRow):
062     for r in range(aLastNonZeroRow,aCurrRow,-1):
063         if IsZeroRow(aMatrix[aCurrRow]):
064             SwapRow(aMatrix,aCurrRow,r);
065         else:
066             return r;
067
068     return aCurrRow;
069
070
071
072 ######################################
073 # Gaussian Elimination Algorithm
074 ######################################
075 def DoGaussianElimination(aMatrix):
076     M = len(aMatrix);
077     N = len(aMatrix[0]);
078
079     pivotCol = 0;
080     lastRow = M-1;
081     for pivotRow in range(0,M):
082         lastRow = MakeRowPivot(aMatrix, pivotRow,lastRow)
083         if pivotRow >= lastRow: #In case all rows below are zero
084             break;
085
086         pivotCol = MakeColumnPivot(aMatrix,pivotRow,lastRow,pivotCol);
087         if pivotCol >= (N-1): #In case all rows below are zero
88             break;
089
090         PivotDown(aMatrix,lastRow,pivotRow,pivotCol);
091
092         pivotCol +=1;
093
094
095
096 ######################################
097 # Test fixtures
098 ######################################
099 class TestEGAlgorithm(unittest.TestCase):
100     def testDoGaussianElimination_EmptyRow(self):
101         B = [ [ 0, 0]];
102         A = [ [ 0, 0]];
103         DoGaussianElimination(A);
104         self.assertEqual(A,B);
105         B = [ [ 0, 0],
106               [ 0, 0]];
107         A = [ [ 0, 0],
108               [ 0, 0]];
109         DoGaussianElimination(A);
110         self.assertEqual(A,B);
111         B = [ [ 0, 0],
112               [ 0, 0],
113               [ 0, 0]];
114         A = [ [ 0, 0],
115               [ 0, 0],
116               [ 0, 0]];
117         DoGaussianElimination(A);
118         self.assertEqual(A,B);
119
120         B = [ [11,12],
121               [ 0,-1],
122               [ 0, 0]];
123         A = [ [11,12],
124               [22,23],
125               [ 0, 0]];
126         DoGaussianElimination(A);
127         self.assertEqual(A,B);
128         A = [ [ 0, 0],
129               [22,23],
130               [11,12]];
131         DoGaussianElimination(A);
132         self.assertEqual(A,B);
133         A = [ [11,12],
134               [ 0, 0],
135               [22,23]];
136         DoGaussianElimination(A);
137         self.assertEqual(A,B);
138
139         B = [ [11,12],
140               [ 0, 0],
141               [ 0, 0]];
142         A = [ [ 0, 0],
143               [ 0, 0],
144               [11,12]];
145         DoGaussianElimination(A);
146         self.assertEqual(A,B);
147         A = [ [11,12],
148               [ 0, 0],
149               [ 0, 0]];
150         DoGaussianElimination(A);
151         self.assertEqual(A,B);
152
153     def testDoGaussianElimination_Rank(self):
154         # M > N case
155         A = [ [0,0],
156               [0,1],
157               [0,0],
158               [1,0]];
159         B = [ [1,0],
160               [0,1],
161               [0,0],
162               [0,0]];
163         DoGaussianElimination(A);
164         self.assertEqual(A,B);
165
166         # N > M case
167         A = [ [0,0,0,0,0,1],
168               [0,0,0,1,0,0],
169               [0,0,0,0,0,0],
170               [1,0,0,0,0,0]];
171         B = [ [1,0,0,0,0,0],
172               [0,0,0,1,0,0],
173               [0,0,0,0,0,1],
174               [0,0,0,0,0,0]];
175         DoGaussianElimination(A);
176         self.assertEqual(A,B);
177
178     def testDoGaussianElimination_Column(self):
179         A = [ [0,0,0,0,0],
180               [0,0,0,0,0],
181               [0,0,0,0,1],
182               [0,0,1,0,0],
183               [1,0,0,0,0]];
184         B = [ [1,0,0,0,0],
185               [0,0,1,0,0],
186               [0,0,0,0,1],
187               [0,0,0,0,0],
188               [0,0,0,0,0]];
189         DoGaussianElimination(A);
190         self.assertEqual(A,B);
191
192         A = [ [0,1, 3],
193               [0,2, 4]];
194         B = [ [0,1, 3],
195               [0,0,-2]];
196         DoGaussianElimination(A);
197         self.assertEqual(A,B);
198
199         A = [ [0,12,0],
200               [0,22,0],
201               [0,32,0]];
202         B = [ [0,12,0],
203               [0, 0,0],
204               [0, 0,0]];
205         DoGaussianElimination(A);
206         self.assertEqual(A,B);
207
208         A = [[0, 0, 0],
209              [21,22,23],
210              [0, 0, 0]];
211         B = [[21,22,23],
212              [0, 0, 0],
213              [0, 0, 0]];
214         DoGaussianElimination(A);
215         self.assertEqual(A,B);
216
217     def testDoGaussianElimination_Textbook(self):
218         # Simple elimination example (Textbook p.12)
219         A = [ [ 2, 1, 1, 5],
220               [ 4,-6, 0,-2],
221               [-2, 7, 2, 9]];
222         B = [ [ 2, 1, 1, 5],
223               [ 0,-8,-2,-12],
224               [ 0, 0, 1, 2]];
225         DoGaussianElimination(A);
226         self.assertEqual(A,B);
227
228         # Roundoff error test (Textbook p.62)
229         A = [ [ 0.0001, 1.0, 1],
230               [ 1.0, 1.0, 2]];
231         B = [ [ 0.0001, 1.0, 1],
232               [ 0,-9999,-9998]];
233         DoGaussianElimination(A);
234         self.assertEqual(A,B);
235
236
237     def testIsEqualToZero(self):
238         self.assertTrue(IsZero(EPSION/2.0));
239         self.assertFalse(IsZero(EPSION));
240         self.assertFalse(IsZero(EPSION+EPSION/10.0));
241
242     def testGetPivotCoeff(self):
243         self.assertEquals(GetPivotCoeff(1,2),-2);
244         self.assertEquals(GetPivotCoeff(3,1),-1.0/3.0);
245
246     def testIsZeroRow(self):
247         r = [0];
248         self.assertTrue(IsZeroRow(r));
249         r = [0,0];
250         self.assertTrue(IsZeroRow(r));
251         r = [0,0,0];
252         self.assertTrue(IsZeroRow(r));
253
254         r = [0,1];
255         self.assertFalse(IsZeroRow(r));
256         r = [0,0,1];
257         self.assertFalse(IsZeroRow(r));
258
259         r = [1,0];
260         self.assertFalse(IsZeroRow(r));
261         r = [1,0,0];
262         self.assertFalse(IsZeroRow(r));
263         r = [1,0,1];
264         self.assertFalse(IsZeroRow(r));
265
266
267     def testSwap(self):
268         a = [[11,12],[21,22]];
269         b = [[21,22],[11,12]];
270         SwapRow(a,0,1)
271         self.assertEquals(a,b);
272
273         a = [[11,12]];
274         b = [[11,12]];
275         SwapRow(a,0,0)
276         self.assertEquals(a,b);
277
278         a = [[11,12]];
279         b = [[11,12]];
280         SwapRow(a,0,0)
281         self.assertEquals(a,b);
282
283         a = [[11],[21]];
284         b = [[21],[11]];
285         SwapRow(a,0,1)
286         self.assertEquals(a,b);
287
288         a = [[11],[21]];
289         b = [[21],[11]];
290         SwapRow(a,1,0)
291         self.assertEquals(a,b);
292
293     def testScalaRow(self):
294         a = [[11,12]];
295         b = [[110,120]];
296         ScaleRow(a,0,10.0);
297         self.assertEquals(a,b);
298
299         a = [[11],[12]];
300         b = [[110],[12]];
301         ScaleRow(a,0,10.0);
302         self.assertEquals(a,b);
303
304         a = [[11,12],[21,22]];
305         b = [[110,120],[21,22]];
306         ScaleRow(a,0,10.0);
307         self.assertEquals(a,b);
308         b = [[110,120],[210,220]];
309         ScaleRow(a,1,10.0);
310         self.assertEquals(a,b);
311
312     def testCombineRow(self):
313         a = [[11],
314              [21]];
315         b = [[2111],
316              [ 21]];
317         CombineRow(a,0,1,100);
318         self.assertEquals(a,b);
319
320         a = [[11,12],
321              [21,22]];
322         b = [[2111,2212],
323              [ 21, 22]];
324         CombineRow(a,0,1,100);
325         self.assertEquals(a,b);
326
327         a = [[11,12],
328              [21,22]];
329         b = [[ 11, 12],
330              [1121,1222]];
331         CombineRow(a,1,0,100);
332         self.assertEquals(a,b);
333
334
335 if __name__ == "__main__":
336     unittest.main();
337
==========================End of Code=============================
Monday, June 11, 2012
Confusing SAS statements
Here are some weird (different from other programming languages) pieces of SAS syntax:
1. Subsetting IF statement in DATA steps.
e.g. IF expression ;
If the expression is false, the DATA step skips the remaining statements and starts over with the next observation.
2. RETAIN and the sum statement
e.g. sumVar+ValueExpression
The sum variable is initialized to 0, instead of a missing value like other variables. When you need a non-zero initial value for sumVar, you have to use the RETAIN statement. This is not a consistent grammar design.
3. Left-side SUBSTR
e.g. SUBSTR(TargetStr, insertPos, length)=InsertString
The function is overloaded to allow assignment on the left side.
Monday, June 4, 2012
Note on mathematical statistics: 6.2 Rao-Cramer lower bound and efficiency
Mindmap for Section 6.2 Rao-Cramer lower bound and efficiency
Download MindJet file: https://docs.google.com/open?id=0Bw64rMSoJR_Za3FFMWM1LS1QZWs
Research Project on Sampling Rare Population
Term Research Project in Sampling Theory
Title: Sampling Rare Population
Outline
============================================
* Introduction
* Methods of SRP
Disproportional Stratified Sampling
Multiple Frames
Network (Multiplicity) Sampling
Snowball Sampling
Staged methods & Screening
* Example
* Research Note
* Concluding Remarks
Download slide in PDF: https://docs.google.com/open?id=0Bw64rMSoJR_ZM3hnRWNCMVN6THc
Download Reference: https://docs.google.com/document/d/1UsYYLU66HI0PBruoTE8tu6qbwFqVWtGJwdBe4C2OWB4/edit
Wednesday, May 23, 2012
Formula Sheet for mathematical statistics
1. Probability Cheat Sheet
one page (2 sides) by Peleg Michaeli
peleg.yogiley.org/math/probability/probabilitycs.pdf
2. Probability and Statistics Cookbook
detailed list by Matthias Vallentin
http://matthias.vallentin.net/probability-and-statistics-cookbook/cookbook-en.pdf
Tuesday, May 22, 2012
Note on mathematical statistics: 5.7 Chi-Square Test
1. Mathematical basis of the Chi-Square test
- A multivariate normal distribution implies a Chi-Square(n) distribution for Y = Sum( (X_i-mu_i)^2 / sigma_i^2 ).
- By the CLT, the joint pdf of the different groups of the sample can be treated as multivariate normal.
- The statistic Q_{k-1} = Sum( Y_i^2 ) = Sum[i=1..k] ( (X_i-E_i)^2 / E_i ) has a Chi-Square(k-1) distribution.
- With the idea of interval frequency approximation, every distribution can be treated as multinomial.
2. Procedure
- Partition the domain of the experiment result into finitely many mutually disjoint sets A_1, A_2, ..., A_k
- Count the number of results in A_i as the frequency X_i
- Assign df (in different ways)
- Assign the probability of a result falling in A_i as p_i (in different ways)
- Evaluate the statistic Q_{k-1}
- Test
3. The three Chi-Square tests
Test          | Goodness of fit                         | Homogeneity                                       | Independence
Example       | 5.7.1, 5.7.2                            | 5.7.3                                             | 5.7.4
H0            | Result has the theoretical distribution | Two sets of samples have the same distribution    | Two attributes of subjects are independent
Key fact      | X_i/n = p_0i                            | p_1i = p_2i = p_0i = E_i/n                        | p_ij = p_i * p_j
Source of E_i | multinomial model                       | MLE                                               | MLE
Formula of E_i| E_i = n*p_0i (1<=i<=k)                  | E_i = (X_i1+X_i2)/(n1+n2)                         | E_ij = (X_i./n)*(X_.j/n)
df            | k-1                                     | k-1 = 2(k-1)-(k-1)                                | (a-1)(b-1) = (a*b-1)-(a+b-2)
Statistic     | Sum[i=1..k] (X_i-E_i)^2/E_i             | Sum[j=1,2] Sum[i=1..k] (X_ij-n_j*E_i)^2/(n_j*E_i) | Sum[i=1..a] Sum[j=1..b] (X_ij-n*E_ij)^2/(n*E_ij)
Dataset       | x_1, x_2, ..., x_n                      | x_1, ..., x_n1 and y_1, ..., y_n2                 | contingency table X_ij, 1<=i<=a, 1<=j<=b, k=a*b
4. Remarks
- Chi-Square tests are not exact tests but approximate tests
- The statistic is based on the frequency of results in intervals, instead of the results themselves
- Make sure every E_i > 5, or use Fisher's exact test
- Minimum Chi-Square estimation: MLE-based Chi-Square tests have a greater rejection rate than tests based on the minimum Chi-Square estimator
- Every estimated parameter p_0i costs one df
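The goodness-of-fit statistic from the table above can be computed directly with the standard library. This is a minimal sketch with made-up counts (a die rolled 60 times, H0: the die is fair, so every E_i = 10):

```python
# Goodness-of-fit statistic Q_{k-1} = Sum[i=1..k] (X_i - E_i)^2 / E_i
observed = [8, 9, 10, 13, 9, 11]   # hypothetical cell counts X_i
n = sum(observed)                  # total sample size, 60
p0 = [1.0 / 6] * 6                 # cell probabilities p_0i under H0
expected = [n * p for p in p0]     # E_i = n * p_0i
Q = sum((x - e) ** 2 / e for x, e in zip(observed, expected))
df = len(observed) - 1             # k - 1 degrees of freedom
print(round(Q, 6), df)             # 1.6 5
```

Since Q = 1.6 is well below the Chi-Square(5) critical value at alpha = 0.05 (about 11.07), H0 would not be rejected here.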
Notes on mathematical statistics: textbook
Here is a series of notes from my study of mathematical statistics, specifically on the textbook
Introduction to Mathematical Statistics,
R.V. Hogg, A. Craig and J. W. McKean,
6th edition, Pearson.
There is an official solution PDF including partial answers to the exercises;
however, it can be located and downloaded only after an intensive Google search.
I keep my homework and exercises from Chapter 4 to Chapter 7, which can be used as an "as is" manual for the book (covering not only the even-numbered questions in the official solution, but also some odd-numbered questions).
Contact me with the page number and exercise index,
and I will see what I can do for you.
Tuesday, January 31, 2012
Embedding data in workflow
Source: QnA in Knime forum
Goal
To embed data in a workflow, so it can be distributed with the workflow package zip file.
Contributor ID: jfalgout
Strategy
Create a directory with files underneath the directory containing the node artifacts for a workflow; the "knime.node" flow variable will be populated the next time you edit the node.
The folder you create within the node's folder has to be named "drop"; otherwise, whatever folders/files you add get deleted when you save the workspace.
Test
Not yet
Sunday, January 15, 2012
Example to create new Node
Example to create new Node: http://tech.knime.org/developer/example
Monday, January 9, 2012
Typical workflow for a test on a specific column
Source: QnA in Knime forum
Contributor ID: James Davidson
Derive a new column from a test on an existing column
Strategy
1. Use [Row Filter] to do the test and split the original table (with include & exclude)
2. Use [Java Snippet] to generate the value for the new column
3. Use [Concatenate] to merge the tables back
Comment
1. In some cases, [Joiner] may be a shortcut to the goal
2. I do not know whether this is the best solution, since I am just a beginner
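The three-step workflow can be sketched in plain Python with illustrative data (the "score"/"grade" column names and the pass threshold are made up for the example):

```python
# Split on a test, derive a new column for each part, then concatenate.
rows = [{"score": 40}, {"score": 75}, {"score": 90}]

include = [r for r in rows if r["score"] >= 60]   # [Row Filter] include port
exclude = [r for r in rows if r["score"] < 60]    # [Row Filter] exclude port

for r in include:
    r["grade"] = "pass"                           # [Java Snippet] on include
for r in exclude:
    r["grade"] = "fail"                           # [Java Snippet] on exclude

merged = include + exclude                        # [Concatenate]
print([r["grade"] for r in merged])               # ['pass', 'pass', 'fail']
```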
Resource
Download James's Package
Friday, January 6, 2012
Remove empty columns in Knime
Goal
Delete the columns that consist entirely of blank values (0, NaN, NA, NULL, or a fixed string)
Strategy
1. If the target is a number column (double or int), use the [Low Variance Filter] node
2. Use [Transpose] + [Row Filter] + [Transpose], as suggested by
http://tech.knime.org/forum/knime-general/removing-columns-where-every-value-is-empty
3. Use a code snippet node (R, Jython, or Java) to process the table
Comment
None of the three works for me.
1. My target columns are string type, and the [String to Number] node needs to be wired to the source column manually,
which does not make sense if I have tons of empty columns.
2. The Missing Value option in [Row Filter]'s settings only works on a specific column, which means it cannot be automated.
3. I could write R or Python scripts outside KNIME to do the filtering.
Final solution
1. Pre-process the CSV file outside Knime
2. Manually skip the column in [File Reader] setting.
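Strategy 3 (pre-processing outside KNIME) can be sketched with the standard library. The column names and the set of "blank" markers here are illustrative:

```python
# Find CSV columns whose values are all "blank"
# (empty string, "0", "NaN", "NA", or "NULL").
import csv
import io

BLANKS = {"", "0", "NaN", "NA", "NULL"}

data = io.StringIO("a,b,c\n1,NA,x\n2,,y\n3,NULL,z\n")  # toy CSV in place of a file
reader = csv.DictReader(data)
rows = list(reader)
empty_cols = [col for col in reader.fieldnames
              if all(row[col] in BLANKS for row in rows)]
print(empty_cols)  # ['b']
```

The resulting list could then be used to drop those columns before loading the file into KNIME.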