Loading Data in R

June 27, 2013

There are several types of problems you may run into when loading data in R. I have solved some of them and wrote down the notes below.

garbled Chinese characters in R

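Garbled Chinese text shows up in many situations. The underlying cause is that R interprets Unicode text using the native encoding (usually GBK). As far as I can tell, there are several general ways to fix it: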

  • Encodings

    This approach has only worked for me once.

      source(file, encoding = "utf-8")
  • Change R's environment

    Oddly enough, under an English environment the text is sometimes not garbled at all.

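  • The operating system's encoding

    Windows (on a Chinese-language system) uses GBK, and this cannot be changed! (So the only option there is to go through Encodings.) Linux uses UTF-8. You can call sessionInfo() to check the encoding of the locale and then change it. This works well when, for example, MySQL output is garbled too, so it is probably a widely applicable method.

    Since Windows generally uses GBK, reading a UTF-8 file only requires declaring the encoding when reading: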

      source(file,encoding="utf-8") 
    

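    Linux is a bit more complicated:

    • Set the locale to zh_CN
    • Install the Chinese character sets, or copy them over from Windows
    • When reading in R, use UTF-8 consistently

    The most complicated case is a database connection:

    • The database character set may be gb2312, gbk, utf8, and so on
    • When reading from the database through the DBI package, set the database character encoding
    • Once the data is read into R, its encoding must match the encoding of the R environment
    • For the two environments, Linux and Windows, the encoding-related code has to be written separately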

    Ref

  • The powerful iconv()

    Usage

      iconv(x, from = "", to = "", sub = NA, mark = TRUE) 
      iconvlist()  
    

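    Beyond converting encodings, it can also be used to strip out garbage, for example removing non-ASCII characters. Ref

  • When even the powerful iconv() fails

    • Get a better understanding of web page encodings. Ref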

        doc <- htmlParse(url, encoding = "UTF-8")  # htmlParse() comes from the XML package
      
    • embedded null characters (‘\0’) in strings

    This also seems to be one of the devils written up in the R Inferno book. I will dig into it another time. Ref
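
The encoding fixes above can be tried out in a few lines of base R. This is only a minimal sketch: the sample strings and the temporary file are made up for illustration, and whether GBK is available to iconv() depends on the platform, so that step is wrapped in tryCatch().

```r
# Check which locale and encoding the session uses (output differs per machine).
Sys.getlocale("LC_CTYPE")
l10n_info()

# Build a Chinese string from Unicode escapes so the script itself stays ASCII.
txt <- "\u4e2d\u6587"

# Write it to a temporary file as raw UTF-8 bytes, independent of the locale.
path <- tempfile(fileext = ".txt")
writeBin(charToRaw(enc2utf8(txt)), path)

# Read it back and declare the encoding explicitly.
back <- readLines(path, encoding = "UTF-8", warn = FALSE)

# Round trip through GBK; iconv() may fail if this platform lacks GBK support.
gbk <- tryCatch(iconv(txt, from = "UTF-8", to = "GBK"),
                error = function(e) NA_character_)
if (!is.na(gbk)) {
  restored <- iconv(gbk, from = "GBK", to = "UTF-8")
}

# Strip non-ASCII characters: bytes that cannot be converted are replaced by "".
mixed <- "R \u4e2d\u6587 data"
ascii_only <- iconv(mixed, from = "UTF-8", to = "ASCII", sub = "")
```

Declaring the encoding at read time is usually less fragile than changing the locale, because it does not depend on what the operating system happens to use.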

missing values

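Missing values need to be handled carefully.

Roughly four things are related to missing values:

  • How to fill in missing values
  • Misquoting and similar problems can produce missing values
  • Whitespace may be lost
  • Extraneous fields can be handled with fill, or diagnosed with count.fields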

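      # Count the fields on every line of the file.
      x <- count.fields("UserProfile.tsv", sep = '\t')
      table(x)
      # legal.length is the expected number of fields per line; set it yourself.
      which(x != legal.length)  # check where the illegal lines are

      # fill = TRUE pads short lines with blank fields instead of failing.
      userlist <- read.table("UserProfile.tsv", sep = '\t', header = FALSE, stringsAsFactors = FALSE, fill = TRUE)  # "file" matters.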
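
To see what count.fields reports, here is a small self-contained sketch. The file contents and the expected field count of 3 are invented for the example; with real data, take the expected count from table(x).

```r
# A tiny tab-separated file with one short line and one long line (made-up data).
path <- tempfile(fileext = ".tsv")
writeLines(c("1\talice\tparis",
             "2\tbob",
             "3\tcarol\toslo\textra"),
           path)

# Number of fields found on each line.
n_fields <- count.fields(path, sep = "\t")

# Lines whose field count differs from the expected one (assumed to be 3 here).
expected <- 3L
bad_lines <- which(n_fields != expected)

# fill = TRUE pads the short line so the file can still be read.
users <- read.table(path, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE, fill = TRUE)
```

Here bad_lines points at lines 2 and 3, which are the ones to inspect by hand before trusting the filled data frame.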
    

Filling in missing values involves the na.strings argument. One catch: if a string value really is the literal NA, take care to quote it. Ref
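
As a rough illustration of na.strings, the sketch below reads a small CSV in which missing values are written in several ways. The file and the sentinel values -999 and n/a are hypothetical, picked only for the example.

```r
# A small CSV where "missing" is spelled three different ways (made-up data).
path <- tempfile(fileext = ".csv")
writeLines(c("id,score,note",
             "1,10,ok",
             "2,-999,missing",
             "3,,n/a"),
           path)

# Tell read.csv which strings should become NA.
df <- read.csv(path, na.strings = c("NA", "-999", "n/a"),
               stringsAsFactors = FALSE)

# Which cells ended up missing in each column.
missing_score <- is.na(df$score)
missing_note <- is.na(df$note)
```

The blank score on the last line becomes NA as well, because blank fields in numeric columns are treated as missing.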

On top of that, the NA problem leads on to na.action.
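
A quick way to see what na.action changes is to fit the same model with different settings. This is just a sketch with an invented data frame; lm() is used only because it accepts an na.action argument.

```r
# Five observations, one of them with a missing response (made-up data).
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, NA, 6.2, 7.9, 10.1))

# na.omit drops the incomplete row; residuals cover only the rows that were used.
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
n_omit <- length(residuals(fit_omit))

# na.exclude also drops it for fitting, but pads the residuals with NA.
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)
res_excl <- residuals(fit_excl)

# na.fail refuses to fit when any value is missing.
failed <- tryCatch({
  lm(y ~ x, data = d, na.action = na.fail)
  FALSE
}, error = function(e) TRUE)
```

na.exclude is handy when the residuals need to line up with the rows of the original data frame.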

group and summarize

  • The ddply() function. It is the easiest to use, though it requires the plyr package. This is probably what you want to use.
  • The summaryBy() function. It is easier to use, though it requires the doBy package.
  • The aggregate() function. It is more difficult to use but is included in the base install of R.
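
Since aggregate() needs no extra package, here is a minimal sketch of a grouped mean with it. The sales data frame and its column names are invented for the example; the plyr call in the comment is an equivalent that is not run here.

```r
# Made-up data: one amount per sale, grouped by region.
sales <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  amount = c(10, 20, 6, 15, 27),
  stringsAsFactors = FALSE
)

# Mean amount per region using the formula interface of aggregate().
by_region <- aggregate(amount ~ region, data = sales, FUN = mean)

# The same summary with plyr (requires the plyr package):
# plyr::ddply(sales, "region", plyr::summarise, amount = mean(amount))
```

The result has one row per region, sorted by the grouping variable.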
