Wednesday, November 7, 2012

Generate random permutation in R

start = 1
end = 10
values = seq(start, end)
sample(values, length(values), replace=FALSE)

Equivalent MATLAB: X = randperm(End-Begin+1)+Begin-1

example: [1]  5  1  2  8  6  4  3  7  9 10
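
For comparison, the same idea can be sketched in Python. This is only an illustration, not part of the original R note; the function name random_permutation and its rng parameter are hypothetical.

```python
import random

def random_permutation(start, end, rng=random):
    """Return the integers start..end (inclusive) in random order."""
    values = list(range(start, end + 1))
    # Sampling every element without replacement yields a permutation.
    return rng.sample(values, len(values))

print(random_permutation(1, 10))
```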

Thursday, September 20, 2012

Remove duplicated rows in Knime


Goal
Delete duplicated rows in a table.

Strategy
1. Use the GroupBy node; refer to
Detect and delete duplicate files (rows) based on two variable identifiers (date and lot number) and not based on row ID
2. Use a database query with SELECT DISTINCT
3. Use a code snippet in a script (R, JPython, or Java) to process the table

Comment
1. Deleting empty rows is a special case of this topic.
2. Two rows may be duplicated on all columns, or only on several columns (a combination key).
3. Strategies 2 and 3 are not tested.
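
As a rough illustration of the third strategy, a script could remove duplicates along the lines below. This is a minimal sketch outside Knime, assuming each row is a dict; the function name drop_duplicates and the key_columns parameter are hypothetical.

```python
def drop_duplicates(rows, key_columns=None):
    """Keep the first occurrence of each row.

    With key_columns=None, two rows are duplicates only if all columns match;
    otherwise only the listed columns (the combination key) are compared.
    """
    seen = set()
    unique = []
    for row in rows:
        if key_columns:
            key = tuple(row[c] for c in key_columns)
        else:
            key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"date": "2012-09-20", "lot": 1, "value": 5},
    {"date": "2012-09-20", "lot": 1, "value": 7},
    {"date": "2012-09-20", "lot": 1, "value": 5},
]
print(len(drop_duplicates(rows)))                    # duplicates on all columns
print(len(drop_duplicates(rows, ["date", "lot"])))   # duplicates on the key only
```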



Saturday, July 21, 2012

Regression Analysis of Airline’s Incidents

Term Research Project in SAS and Statistical Consulting

Title: Regression Analysis of Airline’s Incidents

Outline
============================================

Introduction

Data
Analysis
    Simple linear regression
    Variance stabilizing transformation
    Non-linear Regression (Poisson)
Conclusion



Download slide: https://docs.google.com/open?id=0Bw64rMSoJR_ZUjl6Q2NkODNmRFE

Wednesday, July 18, 2012

SAS PROC SQL enhancement to ANSI

The difference between PROC SQL and ANSI SQL is documented in "PROC SQL and the ANSI Standard" (9.2  Help link), with content on:


  • Compliance
  • SQL Procedure Enhancements
  • SQL Procedure Omissions
Useful enhancements include:
  • FORMAT=, INFORMAT=, and LABEL= column modifiers
  • CONTAINS condition
  • CALCULATED keyword

Monday, July 9, 2012

Pricing scheme of Group-buy Websites in China

Term Project for Applied Statistics

Outline
============================================
Background: Group Buying/ C2C
Data set: Public production data
Method: Descriptive/ Inference/ Diagnostics
Analysis: Preparing/Visualization/2 Models
Conclusion: Competition/Market




Download slide: https://docs.google.com/file/d/0Bw64rMSoJR_ZNkhJRWVNUG5tUG8/edit

Friday, June 15, 2012

Python code for Gaussian Elimination Algorithm


This is homework for Math 630: Linear Algebra
Textbook : Linear Algebra and Its Applications, G. Strang, Thomson Brooks ISBN: 0030105676  

Code Brief

  • Written in Python
  • All matrix indices (row and column) are Zero-based
  • Lines 1-92: algorithm
  • Lines 96-336: tests

=======================Begin of Code=============================
001  import unittest
002  
003  ######################################
004  # Three elementary transformation 
005  ######################################
006  def SwapRow(aMatrix, aRow1, aRow2):
007      if aRow1!=aRow2:     
008          aMatrix[aRow1],aMatrix[aRow2] = aMatrix[aRow2],aMatrix[aRow1];       
009  
010  def ScaleRow(aMatrix, aRow, aScalar):
011      aMatrix[aRow] = [x*aScalar for x in aMatrix[aRow]];
012              
013  def CombineRow(aMatrix, aAddTo, aAddBy, aScalar):   
014      aMatrix[aAddTo] = [x+y*aScalar for x,y in
015                         zip(aMatrix[aAddTo],aMatrix[aAddBy])];
016                                                       
017      def DoZeroCorrection(aValue): #not tested
018          if IsZero(aValue): 
019              return 0;
020          else: 
021              return aValue;        
022      aMatrix[aAddTo] = [DoZeroCorrection(x) for x in aMatrix[aAddTo]];
023  
024  ######################################
025  # Helpers  
026  ######################################
027  def GetPivotCoeff(aPivot, aTarget):
028      return -1.0*(aTarget+0.0)/(aPivot+0.0);  
029  
030  EPSION = 0.000001; # 1 over million                       
031  def IsZero(aValue):
032      return  abs(aValue)<EPSION;           
033  
034  def IsZeroRow(aRow):
035      return all(IsZero(x) for x in aRow);
036  
037  
038  ######################################
039  # Operations
040  ######################################
041  def PivotDown(aMatrix, aLastNonZeroRow, aRow, aCol):
042      for i in range(aRow+1,aLastNonZeroRow+1):
043          if IsZeroRow(aMatrix[i]):
044              continue;
045              
046          CombineRow(aMatrix,i,aRow,
047                     GetPivotCoeff(aMatrix[aRow][aCol],aMatrix[i][aCol]))       
048          
049  def MakeColumnPivot(aMatrix, aRow, aLastNonZeroRow, aCol):      
050      for c in range(aCol,len(aMatrix[0])):
051          for r in range(aRow,aLastNonZeroRow+1):
052              if not IsZero(aMatrix[r][c]):
053                  SwapRow(aMatrix,aRow,r);
054                  return c;
055                  
056      # This line will never be hit, because after doing MakeRowPivot,
057      #at least, current row has non-zero entry, thus return with the 
058      # shortcut in SwapRow when Row1=Row2                 
059      return len(aMatrix[0])-1;                
060  
061  def MakeRowPivot(aMatrix, aCurrRow, aLastNonZeroRow):
062      for r in range(aLastNonZeroRow,aCurrRow,-1):
063          if IsZeroRow(aMatrix[aCurrRow]):
064              SwapRow(aMatrix,aCurrRow,r);
065          else:
066              return r;
067   
068      return aCurrRow;
069          
070  
071  
072  ######################################
073  # Gaussian Elimination Algorithm
074  ######################################
075  def DoGaussianElimination(aMatrix):
076      M = len(aMatrix);
077      N = len(aMatrix[0]);
078      
079      pivotCol = 0;                    
080      lastRow = M-1;
081      for pivotRow in range(0,M): 
082          lastRow = MakeRowPivot(aMatrix, pivotRow,lastRow)                              
083          if pivotRow >= lastRow:  #In case that all rows down are zero
084              break;   
085                       
086          pivotCol = MakeColumnPivot(aMatrix,pivotRow,lastRow,pivotCol);
087          if pivotCol >= (N-1):    #In case that all rows down are zero
088              break;
089              
090          PivotDown(aMatrix,lastRow,pivotRow,pivotCol);
091             
092          pivotCol +=1; 
093         
094   
095                  
096  ######################################
097  # Test fixtures
098  ######################################                
099  class TestEGAlgorithm(unittest.TestCase):         
100      def testDoGaussianElimination_EmptyRow(self):        
101          B = [ [ 0, 0]];
102          A = [ [ 0, 0]];
103          DoGaussianElimination(A);              
104          self.assertEqual(A,B);  
105          B = [ [ 0, 0],
106                [ 0, 0]];
107          A = [ [ 0, 0],
108                [ 0, 0]];
109          DoGaussianElimination(A);              
110          self.assertEqual(A,B);                  
111          B = [ [ 0, 0],
112                [ 0, 0],
113                [ 0, 0]];
114          A = [ [ 0, 0],
115                [ 0, 0],
116                [ 0, 0]];
117          DoGaussianElimination(A);
118          self.assertEqual(A,B);
119          
120          B = [ [11,12],
121                [ 0,-1],
122                [ 0, 0]];
123          A = [ [11,12],
124                [22,23],
125                [ 0, 0]];
126          DoGaussianElimination(A);
127          self.assertEqual(A,B);                      
128          A = [ [0,  0],
129                [22,23],
130                [11,12]];
131          DoGaussianElimination(A);
132          self.assertEqual(A,B);
133          A = [ [11,12],
134                [ 0, 0],
135                [22,23]];
136          DoGaussianElimination(A);
137          self.assertEqual(A,B);         
138          
139          B = [ [11,12],
140                [ 0, 0],
141                [ 0, 0]];
142          A = [ [ 0, 0],
143                [ 0, 0],
144                [11,12]];
145          DoGaussianElimination(A);
146          self.assertEqual(A,B);
147          A = [ [11,12],
148                [ 0, 0],
149                [ 0, 0]];
150          DoGaussianElimination(A);              
151          self.assertEqual(A,B);  
152          
153      def testDoGaussianElimination_Rank(self):   
154          # M > N case
155          A = [ [0,0],     
156                [0,1],
157                [0,0],
158                [1,0]];
159          B = [ [1,0],    
160                [0,1],
161                [0,0],
162                [0,0]];
163          DoGaussianElimination(A);
164          self.assertEqual(A,B);     
165         
166          # N > M case 
167          A = [ [0,0,0,0,0,1],     
168                [0,0,0,1,0,0],
169                [0,0,0,0,0,0],
170                [1,0,0,0,0,0]];
171          B = [ [1,0,0,0,0,0],     
172                [0,0,0,1,0,0],
173                [0,0,0,0,0,1],
174                [0,0,0,0,0,0]];
175          DoGaussianElimination(A);
176          self.assertEqual(A,B);                                               
177                              
178      def testDoGaussianElimination_Column(self):    
179          A = [ [0,0,0,0,0],     
180                [0,0,0,0,0],
181                [0,0,0,0,1],
182                [0,0,1,0,0],
183                [1,0,0,0,0]];
184          B = [ [1,0,0,0,0],    
185                [0,0,1,0,0],
186                [0,0,0,0,1],
187                [0,0,0,0,0],
188                [0,0,0,0,0]];
189          DoGaussianElimination(A);
190          self.assertEqual(A,B);                         
191          
192          A = [ [0,1, 3],
193                [0,2, 4]];
194          B = [ [0,1, 3],
195                [0,0,-2]];
196          DoGaussianElimination(A);
197          self.assertEqual(A,B); 
198          
199          A = [ [0,12,0],
200                [0,22,0],
201                [0,32,0]];
202          B = [ [0,12,0],
203                [0, 0,0],
204                [0, 0,0]];              
205          DoGaussianElimination(A);
206          self.assertEqual(A,B); 
207                        
208          A = [[0,  0, 0],
209               [21,22,23],
210               [0,  0, 0]];
211          B = [[21,22,23],
212               [0,  0, 0],
213               [0,  0, 0]];              
214          DoGaussianElimination(A);
215          self.assertEqual(A,B); 
216                        
217      def testDoGaussianElimination_Textbook(self):
218          # Simple elimination example(Textbook p.12)
219          A = [ [ 2, 1, 1, 5],
220                [ 4,-6, 0,-2],
221                [-2, 7, 2, 9]];
222          B = [ [ 2, 1, 1,  5],
223                [ 0,-8,-2,-12],
224                [ 0, 0, 1,  2]]; 
225          DoGaussianElimination(A);
226          self.assertEqual(A,B);   
227          
228          # Roundoff error test(Textbook p.62)
229          A = [ [ 0.0001, 1.0, 1],
230                [    1.0, 1.0, 2]];
231          B = [ [ 0.0001,  1.0,    1],
232                [      0,-9999,-9998]];
233          DoGaussianElimination(A);
234          self.assertEqual(A,B); 
235  
236          
237      def testIsEqualToZero(self):
238          self.assertTrue(IsZero(EPSION/2.0));
239          self.assertFalse(IsZero(EPSION));
240          self.assertFalse(IsZero(EPSION+EPSION/10.0));
241                
242      def testGetPivotCoeff(self):
243          self.assertEqual(GetPivotCoeff(1,2),-2);
244          self.assertEqual(GetPivotCoeff(3,1),-1.0/3.0);
245   
246      def testIsZeroRow(self):
247          r = [0];
248          self.assertTrue(IsZeroRow(r));        
249          r = [0,0];
250          self.assertTrue(IsZeroRow(r));        
251          r = [0,0,0];
252          self.assertTrue(IsZeroRow(r));                
253  
254          r = [0,1];
255          self.assertFalse(IsZeroRow(r));
256          r = [0,0,1];
257          self.assertFalse(IsZeroRow(r));
258  
259          r = [1,0];
260          self.assertFalse(IsZeroRow(r));
261          r = [1,0,0];
262          self.assertFalse(IsZeroRow(r));
263          r = [1,0,1];
264          self.assertFalse(IsZeroRow(r));
265          
266                              
267      def testSwap(self):
268          a = [[11,12],[21,22]];
269          b = [[21,22],[11,12]];
270          SwapRow(a,0,1)
271          self.assertEqual(a,b);
272          
273          a = [[11,12]];
274          b = [[11,12]];
275          SwapRow(a,0,0)
276          self.assertEqual(a,b);
277          
278          a = [[11,12]];
279          b = [[11,12]];
280          SwapRow(a,0,0)
281          self.assertEqual(a,b);
282                          
283          a = [[11],[21]];
284          b = [[21],[11]];
285          SwapRow(a,0,1)
286          self.assertEqual(a,b);
287  
288          a = [[11],[21]];
289          b = [[21],[11]];
290          SwapRow(a,1,0)
291          self.assertEqual(a,b);
292                 
293      def testScaleRow(self):
294          a = [[11,12]];
295          b = [[110,120]];        
296          ScaleRow(a,0,10.0);
297          self.assertEqual(a,b); 
298          
299          a = [[11],[12]];
300          b = [[110],[12]];        
301          ScaleRow(a,0,10.0);
302          self.assertEqual(a,b);  
303          
304          a = [[11,12],[21,22]];
305          b = [[110,120],[21,22]];        
306          ScaleRow(a,0,10.0);
307          self.assertEqual(a,b);        
308          b = [[110,120],[210,220]];        
309          ScaleRow(a,1,10.0);
310          self.assertEqual(a,b);        
311  
312      def testCombineRow(self):    
313          a = [[11],
314               [21]];
315          b = [[2111],
316               [  21]];        
317          CombineRow(a,0,1,100);
318          self.assertEqual(a,b);         
319             
320          a = [[11,12],
321               [21,22]];
322          b = [[2111,2212],
323               [  21,  22]];        
324          CombineRow(a,0,1,100);
325          self.assertEqual(a,b);       
326          
327          a = [[11,12],
328               [21,22]];
329          b = [[  11,  12],
330               [1121,1222]];        
331          CombineRow(a,1,0,100);
332          self.assertEqual(a,b);           
333          
334                  
335  if __name__ == "__main__":
336      unittest.main();
337       

==========================End of Code=============================
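
The listing above guards against round-off with a small tolerance. As an alternative illustration, the same forward elimination can be sketched with exact rational arithmetic from the standard fractions module, so that no tolerance is needed. This is not the homework code: the function name row_echelon is hypothetical, and it returns a new matrix instead of modifying its argument.

```python
from fractions import Fraction

def row_echelon(matrix):
    """Return a row-echelon form of `matrix` using exact Fraction arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        if pivot_row >= n_rows:
            break
        # First row at or below pivot_row with a non-zero entry in this column.
        candidates = [r for r in range(pivot_row, n_rows) if rows[r][col] != 0]
        if not candidates:
            continue  # no pivot in this column, move to the next one
        r = candidates[0]
        rows[pivot_row], rows[r] = rows[r], rows[pivot_row]
        # Eliminate the entries below the pivot.
        for below in range(pivot_row + 1, n_rows):
            factor = rows[below][col] / rows[pivot_row][col]
            rows[below] = [b - factor * p
                           for b, p in zip(rows[below], rows[pivot_row])]
        pivot_row += 1
    return rows

# The simple elimination example also used in the tests above.
A = [[2, 1, 1, 5],
     [4, -6, 0, -2],
     [-2, 7, 2, 9]]
for row in row_echelon(A):
    print([str(x) for x in row])
```

Exact fractions trade speed for accuracy; for large floating-point systems a tolerance or partial pivoting is the more usual choice.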

Monday, June 11, 2012

Confusing SAS statements

Here are some weird SAS syntax rules (weird compared with other programming languages):
1. Subsetting IF statement in DATA steps.
e.g. IF expression ;
If the expression is false, SAS skips the remaining statements for the current observation and starts over with the next observation in the DATA step.
2. RETAIN and the SUM statement
e.g. sumVar+ValueExpression
The sum variable is initialized to 0, instead of a missing value like other variables. When you need an initial value other than zero for sumVar, you have to use the RETAIN statement. This is not a consistent grammar design.
3. SUBSTR on the left side of an assignment
e.g. SUBSTR(TargetStr, insertPos, length)=InsertString
The function is overloaded: on the left side of an assignment it replaces characters in TargetStr instead of extracting them.
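
To make the three behaviors concrete, here is a rough Python analogy. It is a sketch, not SAS: the function names, the dict keys "keep" and "value", and the blank-padding of a short replacement string are assumptions made for the illustration.

```python
def data_step(observations, initial_total=0):
    """Mimic a DATA step with a subsetting IF and a sum statement."""
    total = initial_total  # sum variable: starts at 0 and keeps its value across rows
    output = []
    for obs in observations:
        if not obs["keep"]:
            continue  # subsetting IF: skip the rest and go to the next observation
        total += obs["value"]  # sum statement: sumVar + ValueExpression
        output.append({**obs, "total": total})
    return output

def substr_assign(target, position, length, replacement):
    """Mimic SUBSTR(target, position, length) = replacement (1-based position)."""
    start = position - 1
    # Assumption: the replacement is truncated or blank-padded to `length`.
    padded = replacement[:length].ljust(length)
    return target[:start] + padded + target[start + length:]

rows = [{"keep": True, "value": 1},
        {"keep": False, "value": 5},
        {"keep": True, "value": 2}]
print([row["total"] for row in data_step(rows)])
print(substr_assign("ABCDEF", 2, 3, "xyz"))
```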

Monday, June 4, 2012

Note on mathematical statistics: 6.2 Rao-Cramer lower bound and efficiency

Mindmap for Section 6.2 Rao-Cramer lower bound and efficiency
Download MindJet file: https://docs.google.com/open?id=0Bw64rMSoJR_Za3FFMWM1LS1QZWs

Research Project on Sampling Rare Population

Term Research Project in Sampling Theory

Title: Sampling Rare Population


Outline
============================================
* Introduction
* Methods of SRP
    Disproportional Stratified Sampling
    Multiple Frames
    Network (Multiplicity) Sampling
    Snowball Sampling
    Staged methods & Screening
* Example
* Research Note
* Concluding Remarks



Download slide in PDF: https://docs.google.com/open?id=0Bw64rMSoJR_ZM3hnRWNCMVN6THc
Download Reference: https://docs.google.com/document/d/1UsYYLU66HI0PBruoTE8tu6qbwFqVWtGJwdBe4C2OWB4/edit


Tuesday, May 22, 2012

Note on mathematical statistics: 5.7 Chi-Square Test

1. Mathematical basis of Chi-Square test
  • A multivariate normal distribution with independent components implies a Chi-Square(n) distribution of Y=Sum( (X_i-mu_i)^2 / sigma_i^2 ).
  • By the CLT, the joint distribution of the counts in different groups of the sample is treated as approximately multivariate normal.
  • The statistic Q_k-1 = Sum( Y_i^2 ) = Sum [i=1..k] ( (X_i-E_i)^2/E_i ) has an approximate Chi-Square(k-1) distribution.
  • With the idea of interval frequency approximation, every distribution can be treated as multinomial.
2. Procedure (the idea of interval frequency approximation)
  • Partition the domain of the experiment result into finite mutually disjoint sets A_1, A_2 ... A_k
  • Count the number of results in A_i as the frequency X_i
  • Assign df (in different ways, depending on the test)
  • Assign the probability of a result in A_i as p_i (in different ways, depending on the test)
  • Evaluate the statistic Q_k-1
  • Test
3. Three tests

Test | Goodness of fit | Homogeneity | Independence
Example | 5.7.1, 5.7.2 | 5.7.3 | 5.7.4
H0 | Result has the theoretical distribution | Two sets of samples have the same distribution | Two attributes of subjects are independent
Key fact | X_i/n = p_0i | p_1i=p_2i=p_0i=E_i/n | P_ij=P_i*P_j
Source of E_i | multinomial model | MLE | MLE
Formula of E_i (1<=i<=k) | E_i=n*p_0i | E_i=(X_i1+X_i2)/(n1+n2) | E_ij=(X_i./n) * (X_.j/n)
df | k-1 | k-1=2(k-1)-(k-1) | (a-1)(b-1)=(a*b-1)-(a+b-2)
Statistic | Sum [i=1..k] ( (X_i-E_i)^2/E_i ) | Sum [j=1,2] [i=1..k] ( (X_ij-n_j*E_i)^2/(n_j*E_i) ) | Sum [i=1..a] [j=1..b] ( (X_ij-n*E_ij)^2/(n*E_ij) )
Dataset | x_1, x_2 ... x_n | x_1, x_2 ... x_n1 and y_1, y_2 ... y_n2 | contingency table X_ij, 1<=i<=a, 1<=j<=b, k=a*b

4. Remarks
  • Chi-Square tests are not exact tests, but approximate tests
  • The statistic is based on the frequency of results in each interval, instead of the results themselves
  • Make sure every E_i > 5; otherwise use Fisher's exact test
  • Minimum Chi-Square estimation
  • MLE-based Chi-Square tests have a greater rejection rate than tests based on the Minimum Chi-Square estimator
  • Every estimated parameter p_0i costs one df
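
The goodness-of-fit and independence statistics from the table can be sketched in a few lines of Python. This is only an illustration with made-up counts; the function names are hypothetical, and the p-value lookup is left out because it needs a Chi-Square distribution table or a statistics library.

```python
def goodness_of_fit(observed, probabilities):
    """Return (Q, df) for H0: cell i has probability probabilities[i]."""
    n = sum(observed)
    expected = [n * p for p in probabilities]  # E_i = n * p_0i
    q = sum((x - e) ** 2 / e for x, e in zip(observed, expected))
    return q, len(observed) - 1

def independence(table):
    """Return (Q, df) for an a-by-b contingency table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    q = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # n * (X_i./n) * (X_.j/n)
            q += (x - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return q, df

print(goodness_of_fit([50, 30, 20], [0.5, 0.25, 0.25]))
print(independence([[10, 20], [20, 10]]))
```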

Notes on mathematical statistics: textbook

Here is a series of notes from my study of mathematical statistics, specifically on the textbook
Introduction to Mathematical Statistics,
R. V. Hogg, A. Craig and J. W. McKean
6th edition, Pearson.


There was an official solution book PDF file including partial answers to the exercises.
However, it can be located and downloaded only after an intensive Google search.

I keep my homework and exercises from Chapter 4 to Chapter 7, which can be used as an "AS IS" manual for the book (not only the even-numbered questions in the official solution, but also some odd-numbered questions).

Contact me with the page number and exercise index.
I will see what I can do for you.

Tuesday, January 31, 2012

Embedding data in workflow

Source: Q&A in Knime forum

Contributor ID: 


Goal 
To embed data in a workflow, so it can be distributed with the workflow package zip file.

Strategy
Create a directory with the data files underneath the directory containing the node artifacts for the workflow; the "knime.node" flow variable will be populated the next time you edit the node.
The folder you create within the node's folder has to be named "drop". Otherwise, whatever folders and files you add are deleted when you save the workspace.

Test
Not yet

Monday, January 9, 2012

Typical workflow for a test on a specific column

Contributor ID: James Davidson


Goal

Derive a new column from a test on an existing column

Strategy
1. Use [Row Filter] to do the test and split the original table (with include and exclude)
2. Use [Java snippet] to generate the value for the new column
3. Use [Concatenate] to merge the tables back

Comment
1. In some cases, [Joiner] may be a shortcut for the goal
2. I do not know whether it is the best solution, since I am just a beginner

Resource
Download James's Package
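
The split, derive, and merge steps of the strategy can be mimicked in plain Python to show the data flow. This is only an analogy of the Knime nodes, not the content of James's package; the function name derive_column and its parameters are hypothetical, and rows are assumed to be dicts.

```python
def derive_column(rows, test, value_if_true, value_if_false, new_column):
    """Split rows by a test, add a derived column per branch, then merge."""
    included = [row for row in rows if test(row)]        # [Row Filter], include
    excluded = [row for row in rows if not test(row)]    # [Row Filter], exclude
    # [Java snippet]: compute the new value separately in each branch.
    included = [{**row, new_column: value_if_true(row)} for row in included]
    excluded = [{**row, new_column: value_if_false(row)} for row in excluded]
    return included + excluded                           # [Concatenate]

rows = [{"x": 1}, {"x": -2}, {"x": 3}]
result = derive_column(rows, lambda r: r["x"] > 0,
                       lambda r: "pos", lambda r: "neg", "sign")
print(result)
```

As with concatenating two filtered tables, the original row order is not preserved: all rows that pass the test come first.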

Friday, January 6, 2012

Remove empty columns in Knime

Goal
Delete the columns that consist entirely of blank values (0, NaN, NA, NULL, or a fixed string)

Strategy
1. If the target is a numeric column (double or int), use the [Low Variance Filter] node
2. Use [Transpose] + [Row Filter] + [Transpose], suggested by
http://tech.knime.org/forum/knime-general/removing-columns-where-every-value-is-empty
3. Use a code snippet in a script (R, JPython, or Java) to process the table

Comment
None of the three works for me.
1. My target columns are string type. The [String to Number] node needs to be wired to each source column manually.
That does not make sense if I have tons of empty columns.
2. The Missing Value setting in [Row Filter] only works on a specific column, which means it cannot be automated.
3. I can write an R or Python script outside Knime to do the filtering.

Final solution
1. Pre-process the CSV file outside Knime
2. Manually skip the columns in the [File Reader] settings.
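
The pre-processing step could look roughly like the Python sketch below. It is an assumption about how such a script might be written, not the script actually used: the function name drop_empty_columns and the set of blank markers are hypothetical, and the first row is assumed to be a header.

```python
import csv
import io

# Assumed markers for "blank"; adjust to the data at hand.
BLANK_VALUES = {"", "0", "NaN", "NA", "NULL"}

def drop_empty_columns(rows, blank_values=BLANK_VALUES):
    """Remove columns whose data cells are all blank; rows[0] is the header."""
    rows = list(rows)
    if not rows:
        return rows
    header, body = rows[0], rows[1:]
    keep = [
        i for i in range(len(header))
        if any(i < len(row) and row[i].strip() not in blank_values
               for row in body)
    ]
    return [[row[i] if i < len(row) else "" for i in keep] for row in rows]

# Usage with an in-memory CSV; a real script would open files instead.
source = io.StringIO("id,empty,flag\n1,,NA\n2,,NULL\n")
cleaned = drop_empty_columns(csv.reader(source))
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(cleaned)
print(out.getvalue())
```

Note that with no data rows every column counts as empty, so a header-only file comes back with no columns.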