Generate random permutation in R
start <- 1
end <- 10
x <- seq(start, end)  # note: c(start, end) gives only the two endpoints, not the sequence
sample(x, length(x), replace = FALSE)
Equivalent MATLAB: X = randperm(End - Begin + 1) + Begin - 1
example: [1] 5 1 2 8 6 4 3 7 9 10
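For comparison, the same permutation can be produced with the Python standard library (a minimal sketch; the `start`/`end` names mirror the R variables above):

```python
import random

start, end = 1, 10
values = list(range(start, end + 1))  # the inclusive sequence start..end
random.shuffle(values)                # in-place Fisher-Yates shuffle
print(values)                         # e.g. [5, 1, 2, 8, 6, 4, 3, 7, 9, 10]
```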
Wednesday, November 7, 2012
Thursday, September 20, 2012
Remove duplicated rows in Knime
Goal
Delete the duplicated rows in a table.
Strategy
1. Use the GroupBy node; refer to the forum thread
"Detect and delete duplicate files (rows) based on two variable identifiers (date and lot number) and not based on row ID"
2. Use a database query: SELECT DISTINCT
3. Use a code snippet node (R, Jython, or Java) to process the table
Comment
1. Deleting empty rows is a special case of this topic.
2. Two rows may be duplicated across all columns, or only on several columns (a combination key).
3. Strategies 2 and 3 are not tested.
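Strategy 3 can be sketched in plain Python (untested in KNIME itself). The column names "date" and "lot_number" come from the forum thread quoted above; the data is illustrative:

```python
# Drop rows duplicated on a combination key (date, lot_number),
# keeping the first row seen for each key.
rows = [
    {"date": "2012-01-05", "lot_number": "A1", "value": 10},
    {"date": "2012-01-05", "lot_number": "A1", "value": 12},  # duplicate key
    {"date": "2012-01-06", "lot_number": "B2", "value": 7},
]

seen = set()
unique_rows = []
for row in rows:
    key = (row["date"], row["lot_number"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(len(unique_rows))  # 2
```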
Saturday, July 21, 2012
Regression Analysis of Airline’s Incidents
Term Research Project in SAS and Statistic Consulting
Title: Regression Analysis of Airline’s Incidents
Outline
============================================
Download slide: https://docs.google.com/open?id=0Bw64rMSoJR_ZUjl6Q2NkODNmRFE
Introduction
Data
Analysis
Simple linear regression
Variance stabilizing transformation
Non-linear Regression (Poisson )
Conclusion
Wednesday, July 18, 2012
SAS PROC SQL enhancement to ANSI
The differences between PROC SQL and ANSI SQL are documented in "PROC SQL and the ANSI Standard" (SAS 9.2 Help), which covers:
- Compliance
- SQL Procedure Enhancements
- SQL Procedure Omissions
Useful enhancements include:
- the FORMAT, INFORMAT, and LABEL column modifiers
- the CONTAINS operator
- the CALCULATED keyword
Monday, July 9, 2012
Pricing scheme of Group-buy Websites in China
Term Project for Applied Statistics
Outline
============================================
Background: Group Buying/ C2C
Data set: Public production data
Method: Descriptive/ Inference/ Diagnostics
Analysis: Preparing/Visualization/2 Models
Conclusion: Competition/Market
Friday, June 15, 2012
Python code for Gaussian Elimination Algorithm
This is a homework for Math 630: Linear Algebra
Textbook : Linear Algebra and Its Applications, G. Strang, Thomson Brooks ISBN: 0030105676
Code Brief
- Written in Python 2 (uses the list-returning map and filter)
- All matrix indices (row and column) are zero-based
- lines 1-92: algorithm
- lines 96-336: tests
=======================Begin of Code=============================
001 import unittest
002
003 ######################################
004 # Three elementary transformations
005 ######################################
006 def SwapRow(aMatrix, aRow1, aRow2):
007     if aRow1!=aRow2:
008         aMatrix[aRow1],aMatrix[aRow2] = aMatrix[aRow2],aMatrix[aRow1];
009
010 def ScaleRow(aMatrix, aRow, aScalar):
011     aMatrix[aRow] = map(lambda x:x*aScalar,aMatrix[aRow]);
012
013 def CombineRow(aMatrix, aAddTo, aAddBy, aScalar):
014     aMatrix[aAddTo] = map(lambda x,y:x+y*aScalar,
015                           aMatrix[aAddTo],aMatrix[aAddBy]);
016
017 def DoZeroCorrection(aValue): #not tested
018     if IsZero(aValue):
019         return 0;
020     else:
021         return aValue;
022     # unreachable here; intended for use inside CombineRow: aMatrix[aAddTo] = map(DoZeroCorrection, aMatrix[aAddTo]);
023
024 ######################################
025 # Helpers
026 ######################################
027 def GetPivotCoeff(aPivot, aTarget):
028     return -1.0*(aTarget+0.0)/(aPivot+0.0);
029
030 EPSION = 0.000001; # 1 over a million
031 def IsZero(aValue):
032     return abs(aValue)<EPSION;
033
034 def IsZeroRow(aRow):
035     return len(filter(IsZero,aRow))== len(aRow);
036
037
038 ######################################
039 # Operations
040 ######################################
041 def PivotDown(aMatrix, aLastNonZeroRow, aRow, aCol):
042     for i in range(aRow+1,aLastNonZeroRow+1):
043         if IsZeroRow(aMatrix[i]):
044             continue;
045
046         CombineRow(aMatrix,i,aRow,
047             GetPivotCoeff(aMatrix[aRow][aCol],aMatrix[i][aCol]))
048
049 def MakeColumnPivot(aMatrix, aRow, aLastNonZeroRow, aCol):
050     for c in range(aCol,len(aMatrix[0])):
051         for r in range(aRow,aLastNonZeroRow+1):
052             if not IsZero(aMatrix[r][c]):
053                 SwapRow(aMatrix,aRow,r);
054                 return c;
055
056     # This line will never be hit, because after doing MakeRowPivot,
057     # at least the current row has a non-zero entry, so we return via the
058     # shortcut in SwapRow when Row1 == Row2
059     return len(aMatrix[0])-1;
060
061 def MakeRowPivot(aMatrix, aCurrRow, aLastNonZeroRow):
062     for r in range(aLastNonZeroRow,aCurrRow,-1):
063         if IsZeroRow(aMatrix[aCurrRow]):
064             SwapRow(aMatrix,aCurrRow,r);
065         else:
066             return r;
067
068     return aCurrRow;
069
070
071
072 ######################################
073 # Gaussian Elimination Algorithm
074 ######################################
075 def DoGaussianElimination(aMatrix):
076     M = len(aMatrix);
077     N = len(aMatrix[0]);
078
079     pivotCol = 0;
080     lastRow = M-1;
081     for pivotRow in range(0,M):
082         lastRow = MakeRowPivot(aMatrix, pivotRow,lastRow)
083         if pivotRow >= lastRow: #In case all rows below are zero
084             break;
085
086         pivotCol = MakeColumnPivot(aMatrix,pivotRow,lastRow,pivotCol);
087         if pivotCol >= (N-1): #In case all rows below are zero
88             break;
089
090         PivotDown(aMatrix,lastRow,pivotRow,pivotCol);
091
092         pivotCol +=1;
093
094
095
096 ######################################
097 # Test fixtures
098 ######################################
099 class TestEGAlgorithm(unittest.TestCase):
100     def testDoGaussianElimination_EmptyRow(self):
101         B = [ [ 0, 0]];
102         A = [ [ 0, 0]];
103         DoGaussianElimination(A);
104         self.assertEqual(A,B);
105         B = [ [ 0, 0],
106               [ 0, 0]];
107         A = [ [ 0, 0],
108               [ 0, 0]];
109         DoGaussianElimination(A);
110         self.assertEqual(A,B);
111         B = [ [ 0, 0],
112               [ 0, 0],
113               [ 0, 0]];
114         A = [ [ 0, 0],
115               [ 0, 0],
116               [ 0, 0]];
117         DoGaussianElimination(A);
118         self.assertEqual(A,B);
119
120         B = [ [11,12],
121               [ 0,-1],
122               [ 0, 0]];
123         A = [ [11,12],
124               [22,23],
125               [ 0, 0]];
126         DoGaussianElimination(A);
127         self.assertEqual(A,B);
128         A = [ [ 0, 0],
129               [22,23],
130               [11,12]];
131         DoGaussianElimination(A);
132         self.assertEqual(A,B);
133         A = [ [11,12],
134               [ 0, 0],
135               [22,23]];
136         DoGaussianElimination(A);
137         self.assertEqual(A,B);
138
139         B = [ [11,12],
140               [ 0, 0],
141               [ 0, 0]];
142         A = [ [ 0, 0],
143               [ 0, 0],
144               [11,12]];
145         DoGaussianElimination(A);
146         self.assertEqual(A,B);
147         A = [ [11,12],
148               [ 0, 0],
149               [ 0, 0]];
150         DoGaussianElimination(A);
151         self.assertEqual(A,B);
152
153     def testDoGaussianElimination_Rank(self):
154         # M > N case
155         A = [ [0,0],
156               [0,1],
157               [0,0],
158               [1,0]];
159         B = [ [1,0],
160               [0,1],
161               [0,0],
162               [0,0]];
163         DoGaussianElimination(A);
164         self.assertEqual(A,B);
165
166         # N > M case
167         A = [ [0,0,0,0,0,1],
168               [0,0,0,1,0,0],
169               [0,0,0,0,0,0],
170               [1,0,0,0,0,0]];
171         B = [ [1,0,0,0,0,0],
172               [0,0,0,1,0,0],
173               [0,0,0,0,0,1],
174               [0,0,0,0,0,0]];
175         DoGaussianElimination(A);
176         self.assertEqual(A,B);
177
178     def testDoGaussianElimination_Column(self):
179         A = [ [0,0,0,0,0],
180               [0,0,0,0,0],
181               [0,0,0,0,1],
182               [0,0,1,0,0],
183               [1,0,0,0,0]];
184         B = [ [1,0,0,0,0],
185               [0,0,1,0,0],
186               [0,0,0,0,1],
187               [0,0,0,0,0],
188               [0,0,0,0,0]];
189         DoGaussianElimination(A);
190         self.assertEqual(A,B);
191
192         A = [ [0,1, 3],
193               [0,2, 4]];
194         B = [ [0,1, 3],
195               [0,0,-2]];
196         DoGaussianElimination(A);
197         self.assertEqual(A,B);
198
199         A = [ [0,12,0],
200               [0,22,0],
201               [0,32,0]];
202         B = [ [0,12,0],
203               [0, 0,0],
204               [0, 0,0]];
205         DoGaussianElimination(A);
206         self.assertEqual(A,B);
207
208         A = [[0, 0, 0],
209              [21,22,23],
210              [0, 0, 0]];
211         B = [[21,22,23],
212              [0, 0, 0],
213              [0, 0, 0]];
214         DoGaussianElimination(A);
215         self.assertEqual(A,B);
216
217     def testDoGaussianElimination_Textbook(self):
218         # Simple elimination example (Textbook p.12)
219         A = [ [ 2, 1, 1, 5],
220               [ 4,-6, 0,-2],
221               [-2, 7, 2, 9]];
222         B = [ [ 2, 1, 1, 5],
223               [ 0,-8,-2,-12],
224               [ 0, 0, 1, 2]];
225         DoGaussianElimination(A);
226         self.assertEqual(A,B);
227
228         # Roundoff error test (Textbook p.62)
229         A = [ [ 0.0001, 1.0, 1],
230               [ 1.0, 1.0, 2]];
231         B = [ [ 0.0001, 1.0, 1],
232               [ 0,-9999,-9998]];
233         DoGaussianElimination(A);
234         self.assertEqual(A,B);
235
236
237     def testIsEqualToZero(self):
238         self.assertTrue(IsZero(EPSION/2.0));
239         self.assertFalse(IsZero(EPSION));
240         self.assertFalse(IsZero(EPSION+EPSION/10.0));
241
242     def testGetPivotCoeff(self):
243         self.assertEquals(GetPivotCoeff(1,2),-2);
244         self.assertEquals(GetPivotCoeff(3,1),-1.0/3.0);
245
246     def testIsZeroRow(self):
247         r = [0];
248         self.assertTrue(IsZeroRow(r));
249         r = [0,0];
250         self.assertTrue(IsZeroRow(r));
251         r = [0,0,0];
252         self.assertTrue(IsZeroRow(r));
253
254         r = [0,1];
255         self.assertFalse(IsZeroRow(r));
256         r = [0,0,1];
257         self.assertFalse(IsZeroRow(r));
258
259         r = [1,0];
260         self.assertFalse(IsZeroRow(r));
261         r = [1,0,0];
262         self.assertFalse(IsZeroRow(r));
263         r = [1,0,1];
264         self.assertFalse(IsZeroRow(r));
265
266
267     def testSwap(self):
268         a = [[11,12],[21,22]];
269         b = [[21,22],[11,12]];
270         SwapRow(a,0,1)
271         self.assertEquals(a,b);
272
273         a = [[11,12]];
274         b = [[11,12]];
275         SwapRow(a,0,0)
276         self.assertEquals(a,b);
277
278         a = [[11,12]];
279         b = [[11,12]];
280         SwapRow(a,0,0)
281         self.assertEquals(a,b);
282
283         a = [[11],[21]];
284         b = [[21],[11]];
285         SwapRow(a,0,1)
286         self.assertEquals(a,b);
287
288         a = [[11],[21]];
289         b = [[21],[11]];
290         SwapRow(a,1,0)
291         self.assertEquals(a,b);
292
293     def testScalaRow(self):
294         a = [[11,12]];
295         b = [[110,120]];
296         ScaleRow(a,0,10.0);
297         self.assertEquals(a,b);
298
299         a = [[11],[12]];
300         b = [[110],[12]];
301         ScaleRow(a,0,10.0);
302         self.assertEquals(a,b);
303
304         a = [[11,12],[21,22]];
305         b = [[110,120],[21,22]];
306         ScaleRow(a,0,10.0);
307         self.assertEquals(a,b);
308         b = [[110,120],[210,220]];
309         ScaleRow(a,1,10.0);
310         self.assertEquals(a,b);
311
312     def testCombineRow(self):
313         a = [[11],
314              [21]];
315         b = [[2111],
316              [ 21]];
317         CombineRow(a,0,1,100);
318         self.assertEquals(a,b);
319
320         a = [[11,12],
321              [21,22]];
322         b = [[2111,2212],
323              [ 21, 22]];
324         CombineRow(a,0,1,100);
325         self.assertEquals(a,b);
326
327         a = [[11,12],
328              [21,22]];
329         b = [[ 11, 12],
330              [1121,1222]];
331         CombineRow(a,1,0,100);
332         self.assertEquals(a,b);
333
334
335 if __name__ == "__main__":
336     unittest.main();
337
==========================End of Code=============================
Monday, June 11, 2012
Confusing SAS statements
Here are some weird (different from other programming languages) pieces of SAS syntax:
1. Subsetting IF statement in DATA steps.
e.g. IF expression ;
If the expression is false, the DATA step skips the remaining statements and starts over with the next observation.
2. RETAIN and the sum statement
e.g. sumVar+ValueExpression
The sum variable is initialized to 0, instead of a missing value like other variables. When you need a non-zero initial value for sumVar, you have to use the RETAIN statement. This is not a consistent grammar design.
3. Left-side SUBSTR
e.g. SUBSTR(TargetStr, insertPos, length)=InsertString
The function is overloaded to allow assignment on the left side.
Monday, June 4, 2012
Note on mathematical statistics: 6.2 Rao-Cramer lower bound and efficiency
Mindmap for Section 6.2 Rao-Cramer lower bound and efficiency
Download MindJet file: https://docs.google.com/open?id=0Bw64rMSoJR_Za3FFMWM1LS1QZWs
Research Project on Sampling Rare Population
Term Research Project in Sampling Theory
Title: Sampling Rare Population
Outline
============================================
* Introduction
* Methods of SRP
Disproportional Stratified Sampling
Multiple Frames
Network (Multiplicity) Sampling
Snowball Sampling
Staged methods & Screening
* Example
* Research Note
* Concluding Remarks
Download slide in PDF: https://docs.google.com/open?id=0Bw64rMSoJR_ZM3hnRWNCMVN6THc
Download Reference: https://docs.google.com/document/d/1UsYYLU66HI0PBruoTE8tu6qbwFqVWtGJwdBe4C2OWB4/edit
Wednesday, May 23, 2012
Formula Sheet for mathematical statistics
1. Probability Cheat Sheet
one page (2 sides) by Peleg Michaeli
peleg.yogiley.org/math/probability/probabilitycs.pdf
2. Probability and Statistics Cookbook
detailed list by Matthias Vallentin
http://matthias.vallentin.net/probability-and-statistics-cookbook/cookbook-en.pdf
Tuesday, May 22, 2012
Note on mathematical statistics: 5.7 Chi-Square Test
1. Mathematical basis of the Chi-Square test
- A multivariate normal distribution implies a Chi-Square(n) distribution for Y = Sum( (X_i-mu_i)^2 / sigma_i^2 ).
- By the CLT, the joint pdf of the different groups of the sample can be treated as multivariate normal.
- The statistic Q_{k-1} = Sum( Y_i^2 ) = Sum[i=1..k] ( (X_i-E_i)^2 / E_i ) has a Chi-Square(k-1) distribution.
- With the idea of interval frequency approximation, every distribution can be treated as multinomial.
2. Procedure
- Partition the domain of the experiment result into finitely many mutually disjoint sets A_1, A_2, ..., A_k
- Count the number of results in A_i as the frequency X_i
- Assign df (in different ways)
- Assign the probability of a result falling in A_i as p_i (in different ways)
- Evaluate the statistic Q_{k-1}
- Test
3. The three Chi-Square tests
Test          | Goodness of fit                         | Homogeneity                                       | Independence
Example       | 5.7.1, 5.7.2                            | 5.7.3                                             | 5.7.4
H0            | Result has the theoretical distribution | Two sets of samples have the same distribution    | Two attributes of subjects are independent
Key fact      | X_i/n = p_0i                            | p_1i = p_2i = p_0i = E_i/n                        | p_ij = p_i * p_j
Source of E_i | multinomial model                       | MLE                                               | MLE
Formula of E_i| E_i = n*p_0i (1<=i<=k)                  | E_i = (X_i1+X_i2)/(n1+n2)                         | E_ij = (X_i./n)*(X_.j/n)
df            | k-1                                     | k-1 = 2(k-1)-(k-1)                                | (a-1)(b-1) = (a*b-1)-(a+b-2)
Statistic     | Sum[i=1..k] (X_i-E_i)^2/E_i             | Sum[j=1,2] Sum[i=1..k] (X_ij-n_j*E_i)^2/(n_j*E_i) | Sum[i=1..a] Sum[j=1..b] (X_ij-n*E_ij)^2/(n*E_ij)
Dataset       | x_1, x_2, ..., x_n                      | x_1, ..., x_n1 and y_1, ..., y_n2                 | contingency table X_ij, 1<=i<=a, 1<=j<=b, k=a*b
4. Remarks
- Chi-Square tests are not exact tests but approximate tests
- The statistic is based on the frequency of results in intervals, instead of the results themselves
- Make sure every E_i > 5, or use Fisher's exact test
- Minimum Chi-Square estimation: MLE-based Chi-Square tests have a greater rejection rate than tests based on the minimum Chi-Square estimator
- Every estimated parameter p_0i costs one df
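The goodness-of-fit statistic from the table above can be computed directly with the standard library. This is a minimal sketch with made-up counts (a die rolled 60 times, H0: the die is fair, so every E_i = 10):

```python
# Goodness-of-fit statistic Q_{k-1} = Sum[i=1..k] (X_i - E_i)^2 / E_i
observed = [8, 9, 10, 13, 9, 11]   # hypothetical cell counts X_i
n = sum(observed)                  # total sample size, 60
p0 = [1.0 / 6] * 6                 # cell probabilities p_0i under H0
expected = [n * p for p in p0]     # E_i = n * p_0i
Q = sum((x - e) ** 2 / e for x, e in zip(observed, expected))
df = len(observed) - 1             # k - 1 degrees of freedom
print(round(Q, 6), df)             # 1.6 5
```

Since Q = 1.6 is well below the Chi-Square(5) critical value at alpha = 0.05 (about 11.07), H0 would not be rejected here.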
Notes on mathematical statistics: textbook
Here is a series of notes from my study of mathematical statistics, specifically on the textbook
Introduction to Mathematical Statistics,
R.V. Hogg, A. Craig and J. W. McKean,
6th edition, Pearson.
There is an official solution PDF including partial answers to the exercises;
however, it can be located and downloaded only after an intensive Google search.
I keep my homework and exercises from Chapter 4 to Chapter 7, which can be used as an "as is" manual for the book (covering not only the even-numbered questions in the official solution, but also some odd-numbered questions).
Contact me with the page number and exercise index,
and I will see what I can do for you.
Tuesday, January 31, 2012
Embedding data in workflow
Source: QnA in Knime forum
Goal
To embed data in a workflow, so it can be distributed with the workflow package zip file.
Contributor ID: jfalgout
Strategy
Create a directory with files underneath the directory containing the node artifacts for a workflow; the "knime.node" flow variable will be populated the next time you edit the node.
The folder you create within the node's folder has to be named "drop"; otherwise, whatever folders/files you add get deleted when you save the workspace.
Test
Not yet
Sunday, January 15, 2012
Example to create new Node
Example to create new Node: http://tech.knime.org/developer/example
Monday, January 9, 2012
Typical workflow for a test on a specific column
Source: QnA in Knime forum
Contributor ID: James Davidson
Derive a new column from a test on an existing column
Strategy
1. Use [Row Filter] to do the test and split the original table (with include & exclude)
2. Use [Java Snippet] to generate the value for the new column
3. Use [Concatenate] to merge the tables back
Comment
1. In some cases, [Joiner] may be a shortcut to the goal
2. I do not know whether this is the best solution, since I am just a beginner
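The three-step workflow can be sketched in plain Python with illustrative data (the "score"/"grade" column names and the pass threshold are made up for the example):

```python
# Split on a test, derive a new column for each part, then concatenate.
rows = [{"score": 40}, {"score": 75}, {"score": 90}]

include = [r for r in rows if r["score"] >= 60]   # [Row Filter] include port
exclude = [r for r in rows if r["score"] < 60]    # [Row Filter] exclude port

for r in include:
    r["grade"] = "pass"                           # [Java Snippet] on include
for r in exclude:
    r["grade"] = "fail"                           # [Java Snippet] on exclude

merged = include + exclude                        # [Concatenate]
print([r["grade"] for r in merged])               # ['pass', 'pass', 'fail']
```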
Resource
Download James's Package
Friday, January 6, 2012
Remove empty columns in Knime
Goal
Delete the columns that consist entirely of blank values (0, NaN, NA, NULL, or a fixed string)
Strategy
1. If the target is a number column (double or int), use the [Low Variance Filter] node
2. Use [Transpose] + [Row Filter] + [Transpose], as suggested by
http://tech.knime.org/forum/knime-general/removing-columns-where-every-value-is-empty
3. Use a code snippet node (R, Jython, or Java) to process the table
Comment
None of the three works for me.
1. My target columns are string type, and the [String to Number] node needs to be wired to the source column manually,
which does not make sense if I have tons of empty columns.
2. The Missing Value option in [Row Filter]'s settings only works on a specific column, which means it cannot be automated.
3. I could write R or Python scripts outside KNIME to do the filtering.
Final solution
1. Pre-process the CSV file outside Knime
2. Manually skip the column in [File Reader] setting.
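Strategy 3 (pre-processing outside KNIME) can be sketched with the standard library. The column names and the set of "blank" markers here are illustrative:

```python
# Find CSV columns whose values are all "blank"
# (empty string, "0", "NaN", "NA", or "NULL").
import csv
import io

BLANKS = {"", "0", "NaN", "NA", "NULL"}

data = io.StringIO("a,b,c\n1,NA,x\n2,,y\n3,NULL,z\n")  # toy CSV in place of a file
reader = csv.DictReader(data)
rows = list(reader)
empty_cols = [col for col in reader.fieldnames
              if all(row[col] in BLANKS for row in rows)]
print(empty_cols)  # ['b']
```

The resulting list could then be used to drop those columns before loading the file into KNIME.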