## October 04, 2013

Although parallel techniques in R has been prevailing, I will only focus on Loading the complete data into RAM in R, that is to say, no Hadoop or similar. What other more I won’t mention in this post is about manipulating and saving big data in R, and parallel computing.

• load csv file and using ff package (Rtools)

  bigdata <- read.csv.ffdf(file = ”bigdata.csv”, first.rows=5000, colClasses = NA)


Notice that ff package should be in Rtools on Windows.

• using sqldf() from SQLite

this is a method from StackOverflow: using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R

  library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))


it includes data.frame, but some of the syntax is different. Luckily, the documentation (and the FAQ) are excellent.

Notice that fread() cannot directly read gzipped files and it comes with a big warning sign “not for production use yet”. One trick it uses is to read the first, middle, and last 5 rows to determine column types.

This option takes a vector whose length is equal to the number of columns in year table. Specifying this option instead of using the default can make ‘read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the of each column in your data frame. - See more at hear.

  read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer"))

• load a portion using nrows

Also you can read in only a portion of your file, to get a feel of the dataset.

  data_first_100 <- read.table("file", header=T, sep="\t", stringsAsFactors=F, nrows=100)

• in summary

Here is a great comparison summary for the method above with their system time. I just copy the summary table below:

  ##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf


See more in 11 Tips on How to Handle Big Data in .

### Highway Networks and Deep Residual Networks

Recently, a breakthrough news spread over social networks. In this post, I will explain this ResNet as a special case of Highway Networks, which has been proposed before. Both of the work is amazing and thought-provoking. Continue reading

#### NIPS 2015 Deep Learning Symposium Part II

Published on January 09, 2016

#### NIPS 2015 Deep Learning Symposium Part I

Published on December 11, 2015