Wednesday, June 19, 2013

Methods to save a data frame to a file in R

When you want to save a data set held in a data frame object in R, you have several options:

Method: Save an image of the object in binary format
Pro: Fast; keeps the object name and other environment information
Con: R specific
Functions:
save(df, file = "filename")
rm(df)
load("filename", .GlobalEnv)

Method: Save as coded text
Pro: Full information is kept, e.g. the data mode
Con: Large file size; cannot be exchanged with other software
Functions:
dump(c("df"), "filename")
newDf = source("filename")$value

Method: Export to plain text
Pro: Human readable and exchangeable with other software
Con: R types may need to be recast when the file is read back in
Functions:
write.table()
read.table()

Method: Export to another format
Pro: Software specific
Con: Software specific
Functions:
write.X()
read.X(), where X names the format (e.g. spss, sas, csv, excel); the exact function names vary by format and package
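
A minimal sketch of the first three methods as round trips, to show what each one preserves. The data frame contents, the column names, and the temporary file names are assumptions made up for illustration; only base R functions are used.

```r
# Hypothetical example data: an integer, a numeric and a factor column.
df <- data.frame(
  id    = 1:3,
  score = c(1.5, 2, 3.25),
  group = factor(c("x", "y", "x"))
)
original <- df

# 1. Binary image: the object comes back under its own name.
rda_file <- tempfile(fileext = ".RData")
save(df, file = rda_file)
rm(df)
load(rda_file)  # recreates `df` in the calling environment
from_image <- df

# 2. Coded text: dump() writes R code that rebuilds the object.
dump_file <- tempfile(fileext = ".R")
dump("df", file = dump_file)
from_dump <- source(dump_file, local = TRUE)$value

# 3. Plain text: types are inferred again when reading.
txt_file <- tempfile(fileext = ".txt")
write.table(original, file = txt_file, sep = "\t", row.names = FALSE)
from_txt <- read.table(txt_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

# The factor column returns as character and has to be recast by hand.
group_class_after_read <- class(from_txt$group)
from_txt$group <- factor(from_txt$group)
```

After the recast in the last line, all three copies should compare equal to the original with all.equal(). The recast is the extra step that the plain-text route asks for.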

Friday, January 18, 2013

Data Mining: Best Buy mobile web log

Ongoing project

Outline: finished part
===================================

Objective of the project
Data features: Query/Product Data
Data exploring
    Descriptive statistics
    Survival analysis
Models
    Collaborative Filtering
Summary



Download slide: https://docs.google.com/presentation/d/1i37NWAxcqnsETR4a9jNJaqvhLn90FkpbFKJDjxE70zU/edit


Wednesday, November 7, 2012

Generate random permutation in R

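start = 1
end = 10
# start:end builds the full sequence; c(start, end) would hold only the two endpoints
values = start:end
# Draw every element exactly once, i.e. a random permutation
sample(values, length(values), replace=FALSE)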

Equivalent MATLAB: X = randperm(End-Begin+1) + Begin - 1

Example output (the order varies from run to run): [1]  5  1  2  8  6  4  3  7  9 10
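
A short sketch of two variations on the same idea, with a check that the result really is a permutation. The variable names and the seed value are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only so that a run can be repeated

first <- 1
last <- 10
values <- first:last

# Shorthand: with no size argument, sample() permutes its input.
perm <- sample(values)

# Index-based variant. It also behaves when `values` has length one,
# where sample(values) would instead draw from 1:values.
perm_by_index <- values[sample.int(length(values))]

# A permutation contains exactly the original elements, each once.
is_permutation <- identical(sort(perm), values)
```

The index-based form is the safer habit when the vector being shuffled might shrink to a single number.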

Thursday, September 20, 2012

Remove duplicate rows in KNIME


Goal
Delete the duplicate rows in a table.

Strategy
1. Use the GroupBy node; refer to
"Detect and delete duplicate files (rows) based on two variable identifiers (date and lot number) and not based on row ID"
2. Use a Database Query node with SELECT DISTINCT
3. Use a code snippet in a scripting node (R, JPython, or Java) to process the table

Comment
1. Deleting empty rows is a special case of this topic.
2. Two rows may be duplicates across all columns, or only on several columns (a combination key).
3. Strategies 2 and 3 have not been tested.
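
A rough sketch of what the R variant of strategy 3 could look like, covering both cases from comment 2. It is plain R run outside KNIME; the table contents and the column names (date, lot, value) are assumptions for illustration, and inside a KNIME scripting node the table would come from the node's input instead of being built by hand.

```r
# Hypothetical table: rows 1 and 2 match on every column,
# row 3 matches them only on the key (date, lot).
rows <- data.frame(
  date  = c("2012-09-01", "2012-09-01", "2012-09-01", "2012-09-02"),
  lot   = c("A1", "A1", "A1", "B7"),
  value = c(10, 10, 11, 12),
  stringsAsFactors = FALSE
)

# Case 1: drop rows that are duplicated on all columns.
dedup_all <- unique(rows)

# Case 2: drop rows that repeat the combination key (date, lot),
# keeping the first occurrence of each key.
key_cols <- c("date", "lot")
dedup_key <- rows[!duplicated(rows[, key_cols]), ]
```

Here dedup_all keeps three rows and dedup_key keeps two, which shows how the choice of key changes what counts as a duplicate.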



Saturday, July 21, 2012

Regression Analysis of Airline’s Incidents

Term Research Project in SAS and Statistical Consulting

Title: Regression Analysis of Airline’s Incidents

Outline
============================================

Introduction

Data
Analysis
    Simple linear regression
    Variance stabilizing transformation
    Non-linear Regression (Poisson)
Conclusion



Download slide: https://docs.google.com/open?id=0Bw64rMSoJR_ZUjl6Q2NkODNmRFE

Wednesday, July 18, 2012

SAS PROC SQL enhancements to ANSI SQL

The differences between PROC SQL and ANSI SQL are documented in "PROC SQL and the ANSI Standard" (SAS 9.2 Help), which covers:

  • Compliance
  • SQL Procedure Enhancements
  • SQL Procedure Omissions

Useful enhancements include:
  • The FORMAT=, INFORMAT=, and LABEL= column modifiers
  • The CONTAINS operator
  • The CALCULATED keyword

Monday, July 9, 2012

Pricing scheme of Group-buy Websites in China

Term Project for Applied Statistics

Outline
============================================
Background: Group Buying/C2C
Data set: Public production data
Method: Descriptive/Inference/Diagnostics
Analysis: Preparing/Visualization/2 Models
Conclusion: Competition/Market



Download slide: https://docs.google.com/file/d/0Bw64rMSoJR_ZNkhJRWVNUG5tUG8/edit