Although parallel techniques in R have become prevalent, this post focuses only on loading a complete dataset into RAM in R, that is to say, no Hadoop or similar frameworks. What I also won't cover here is manipulating and saving big data in R, and parallel computing.
Let's start with the different implementations:
-
load the CSV file using the ff package
library(ff)  # ff keeps the data on disk and maps chunks into RAM as needed
bigdata <- read.csv.ffdf(file = "bigdata.csv", first.rows = 5000, colClasses = NA)
Note that on Windows, installing the ff package may require Rtools.
-
using sqldf() with SQLite
This is a method from Stack Overflow: use sqldf() to import the data into SQLite as a staging area, and then pull it from SQLite into R.
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(),
                           file.format = list(header = TRUE, row.names = FALSE)))
-
the magic data.table and fread()
data.table inherits from data.frame, but some of its syntax is different. Luckily, the documentation (and the FAQ) are excellent.
Read CSV files with the fread() function instead of read.csv() (read.table()). It is much faster at reading files in table format and gives you feedback on its progress.
Notice that fread() cannot directly read gzipped files, and it comes with a big warning sign "not for production use yet". One trick it uses is to read the first, middle, and last 5 rows to determine column types.
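A minimal sketch of reading the same bigdata.csv used above with fread() (assuming the data.table package is installed):
library(data.table)
# fread() auto-detects the separator and column types and reports progress on large files
bigdata <- fread("bigdata.csv")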
-
optimized read.table() with colClasses
The colClasses option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make read.table() run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. See more here.
read.table("test.csv", header = TRUE, sep = ",", quote = "",
           stringsAsFactors = FALSE, comment.char = "", nrows = n,
           colClasses = c("integer", "integer", "numeric",
                          "character", "numeric", "integer"))
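If you do not know the classes in advance, a common trick (a sketch, not part of the original comparison) is to read a small sample first, let R guess the classes, and reuse them for the full read:
# Read only the first 100 rows so R can guess the column classes cheaply
initial <- read.table("test.csv", header = TRUE, sep = ",", nrows = 100)
classes <- sapply(initial, class)
# Re-read the whole file with the classes fixed up front
bigdata <- read.table("test.csv", header = TRUE, sep = ",", quote = "",
                      stringsAsFactors = FALSE, comment.char = "",
                      colClasses = classes)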
-
load a portion using nrows
You can also read in only a portion of your file to get a feel for the dataset.
data_first_100 <- read.table("file", header=T, sep="\t", stringsAsFactors=F, nrows=100)
-
in summary
Here is a great comparison summary for the methods above, with their system times. I just copy the summary table below:
##  user  system  elapsed  Method
## 24.71    0.15    25.42  read.csv (first time)
## 17.85    0.07    17.98  read.csv (second time)
## 10.20    0.03    10.32  Optimized read.table
##  3.12    0.01     3.22  fread
## 12.49    0.09    12.69  sqldf
## 10.21    0.47    10.73  sqldf on SO
## 10.85    0.10    10.99  ffdf
See more in 11 Tips on How to Handle Big Data in R.