Sean's Data Analysis Note

Wednesday, June 19, 2013

Methods to save a data frame to a file in R

To save a data set held in a data frame object in R, you have several options:

Method	Pro	Con	Functions
Save the object as a binary image	Fast; keeps the object name and other environment information	R-specific	save(df, file = "filename"); rm(df); load("filename", .GlobalEnv)
Save as R code text	Keeps full information, e.g. data mode	Large file; cannot be exchanged with other software	dump(c("df"), "filename"); newDf = source("filename")$value
Export to plain text	Human-readable and exchangeable with other software	May need to recast R types when reading back in	write.table(); read.table()
Export to another format	Software-specific	Software-specific	write.X() and read.X() style functions, where X is the format (SPSS, SAS, CSV, Excel), e.g. write.csv() and read.csv()
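
A minimal sketch of the approaches in the table, assuming a small made-up data frame and temporary files (the data and file names are illustrative only). The binary image is loaded into a separate environment here, rather than .GlobalEnv, so the restored copy can be compared with the original.

```r
# Small example data frame (hypothetical data, for illustration only)
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
original <- df

# 1. Binary image: save() writes the object, load() restores it under its own name
rda_file <- tempfile(fileext = ".RData")
save(df, file = rda_file)
env <- new.env()
load(rda_file, envir = env)
from_image <- env$df

# 2. R code text: dump() writes an assignment, source() evaluates it again
dump_file <- tempfile(fileext = ".R")
dump("df", file = dump_file)
from_dump <- source(dump_file, local = TRUE)$value

# 3. Plain text: types are inferred again by read.table()
txt_file <- tempfile(fileext = ".txt")
write.table(df, file = txt_file, sep = "\t", row.names = FALSE)
from_text <- read.table(txt_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)

# 4. Another format, here CSV
csv_file <- tempfile(fileext = ".csv")
write.csv(df, file = csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
```

With richer column types (factors, dates) the plain-text and CSV round trips would likely need explicit recasting, for example through the colClasses argument of read.table().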

Friday, January 18, 2013

Data Mining: Best Buy mobile web log

Ongoing project

Outline: finished part
===================================

Objective of the project
Data features: Query/Product Data
Data exploring
Descriptive statistics
Survival analysis
Models
Collaborative Filtering
Summary

Download slide: https://docs.google.com/presentation/d/1i37NWAxcqnsETR4a9jNJaqvhLn90FkpbFKJDjxE70zU/edit

Wednesday, November 7, 2012

Generate random permutation in R

start = 1
end = 10
values = start:end   # every integer from start to end; c(start, end) would hold only the two endpoints
sample(values, length(values), replace = FALSE)

Equivalent MATLAB: X = randperm(End-Begin+1)+Begin-1

Example output: [1] 5 1 2 8 6 4 3 7 9 10
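
The same idea can be wrapped in a small helper. This is only a sketch: the function name random_permutation and the seed are made up for illustration. It permutes positions with sample.int(), which sidesteps the special case where sample() treats a single number x as the range 1:x.

```r
# Hypothetical helper: random permutation of the integers start..end
random_permutation <- function(start, end) {
  values <- start:end
  # Permute the positions, then index; safe even when start == end
  values[sample.int(length(values))]
}

set.seed(42)  # arbitrary seed, only to make the run repeatable
perm <- random_permutation(1, 10)
single <- random_permutation(5, 5)
```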

Thursday, September 20, 2012

Remove duplicated rows in KNIME

Goal
Delete the duplicated rows in a table.

Strategy
1. Use the GroupBy node; refer to "Detect and delete duplicate files (rows) based on two variable identifiers (date and lot number) and not based on row ID"
2. Use a database query with SELECT DISTINCT
3. Use a code snippet in a script (R, JPython, or Java) to process the table

Comment
1. Deleting empty rows is a special case of this topic.
2. Two rows may be duplicated on all columns, or only on several columns (a combination key).
3. Strategies 2 and 3 have not been tested.
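
For the R variant of strategy 3, a minimal sketch in plain R is shown here. It assumes the table is already available as a data frame; the column names date, lot and qty and the values are made up, and how the table is handed to the script inside KNIME depends on the node and is not shown.

```r
# Hypothetical table: rows 1-2 are identical on all columns,
# rows 3-4 share only the combination key (date, lot)
tbl <- data.frame(
  date = c("2012-09-01", "2012-09-01", "2012-09-02", "2012-09-02"),
  lot  = c("A1", "A1", "B7", "B7"),
  qty  = c(10, 10, 5, 8),
  stringsAsFactors = FALSE
)

# Duplicated on all columns: keep the first occurrence of each row
dedup_all <- tbl[!duplicated(tbl), ]

# Duplicated on the combination key only: keep the first row per (date, lot)
dedup_key <- tbl[!duplicated(tbl[, c("date", "lot")]), ]
```

duplicated() marks the second and later occurrences, so the first row of each group is the one that survives. Which row to keep for a key-only duplicate is a design choice the original note does not specify.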

Saturday, July 21, 2012

Regression Analysis of Airline’s Incidents

Term Research Project in SAS and Statistical Consulting

Title: Regression Analysis of Airline’s Incidents

Outline
============================================

Introduction

Data

Analysis

Simple linear regression

Variance stabilizing transformation

Non-linear Regression (Poisson)

Conclusion

Download slide: https://docs.google.com/open?id=0Bw64rMSoJR_ZUjl6Q2NkODNmRFE

Wednesday, July 18, 2012

SAS PROC SQL enhancements to ANSI SQL

The differences between PROC SQL and ANSI SQL are documented in "PROC SQL and the ANSI Standard" (SAS 9.2 Help), with content on:

Compliance
SQL Procedure Enhancements
SQL Procedure Omissions

Useful enhancements include:

FORMAT=, INFORMAT=, and LABEL= column modifiers
CONTAINS operator
CALCULATED keyword

Monday, July 9, 2012

Pricing scheme of Group-buy Websites in China

Term Project for Applied Statistics

Outline
============================================

Background: Group Buying/C2C
Data set: Public production data
Method: Descriptive/Inference/Diagnostics
Analysis: Preparing/Visualization/2 Models
Conclusion: Competition/Market

Download slide: https://docs.google.com/file/d/0Bw64rMSoJR_ZNkhJRWVNUG5tUG8/edit