Factor Tips Compendium

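Preface: Someone in the beginner Q&A was struggling with factor variables in a data frame, so I have collected the relevant knowledge here. Factors are indeed a confusing concept, in that the underlying object differs from its displayed representation. In short, like a table, a factor groups cases with equal values into categories, while the representative value of each group is stored and displayed separately in the level set.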

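An R factor can be thought of as a kind of integer vector whose true values are represented indirectly through a corresponding vector of levels (a character vector). As a result, cases with the same value are grouped together, and large character variables can be stored in less memory. In R versions before 4.0.0, character variables in a data frame are converted to factors by default (since R 4.0.0 the default is stringsAsFactors = FALSE). Factors are especially important in statistical modeling functions (see help(contrasts)).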
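
The "integer codes plus level vector" structure described above can be inspected directly. The following is a minimal sketch; the vector f and its values are made up for illustration.

```r
# A small factor built from a character vector (illustrative data)
f <- factor(c("low", "high", "low", "mid"))

typeof(f)      # storage type of the underlying codes
levels(f)      # the level set: sorted unique values (sort order is locale-dependent)
as.integer(f)  # the integer codes, i.e. positions in levels(f)
unclass(f)     # codes together with the "levels" attribute

# Indexing the level vector by the codes rebuilds the original strings
levels(f)[f]
```

Printing f shows the strings, but what is stored is the code vector; the strings live only once, in levels(f).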

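Turning a vector into a factor: factor()

The function factor() encodes a vector as a factor. For compatibility with S there is also the function ordered(), which creates an ordered factor. factor() is used inside functions that manipulate data frames and similar objects to convert variables into factors.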

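Usage: factor(x = character(), levels = sort(unique.default(x), na.last = TRUE),
              labels = levels, exclude = NA, ordered = is.ordered(x))
       is.factor(x), is.ordered(x), as.factor(x), as.ordered(x)
Arguments:
        x a vector of data, usually taking a small number of distinct values
   levels optional; a vector of the values that x might take. By default, the values in x sorted in ascending order
   labels optional labels for the levels, or a character string of length 1
  exclude a vector of values to be excluded when forming the levels; same type as x (or coerced to it)
  ordered logical flag indicating whether the levels should be regarded as ordered in the order given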

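If ordered = TRUE, the factor levels are regarded as ordered. This differs only in the class attribute, but model-fitting functions and the like treat ordered factors quite differently. Basically, if x[i] equals levels[j], the code of x[i] is j. Values given in exclude are not included among the levels. By default the level set levels is determined from the data; if levels is supplied explicitly, any value of x not found in it becomes NA. For a numeric x, setting exclude = NULL makes NA an extra level (printed as <NA>), which is placed last.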

The return value is an object of class "factor": a set of integer codes of the same length as x, with a "levels" attribute of mode character. An ordered factor has the class attribute c("ordered", "factor").
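
The difference between a plain and an ordered factor can be seen in a small sketch like the one below. The data (sizes) are invented for illustration; the point is that the codes are identical and only the class, and therefore the behaviour of comparisons and contrasts, differs.

```r
# Illustrative data: three size categories with a natural order
sizes <- c("small", "large", "medium", "small")
lv    <- c("small", "medium", "large")

f <- factor(sizes, levels = lv)                  # unordered factor
o <- factor(sizes, levels = lv, ordered = TRUE)  # ordered factor

class(f)
class(o)

# The integer codes are the same; only the class attribute differs
identical(as.integer(f), as.integer(o))

# Order comparisons are defined for ordered factors
o < "large"

# Model functions pick different default contrasts for the two kinds
colnames(contrasts(f))  # treatment contrasts, named after the levels
colnames(contrasts(o))  # polynomial contrasts
```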

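The interpretation of a factor depends on both the codes and the level set, so only factors with the same level set (in the same order) should be compared. Applying as.numeric() directly to a factor returns the integer codes, not the original values; to recover the original numeric values of a factor f, as.numeric(levels(f))[f] is recommended. Levels are sorted by default, and the sort order depends on the locale. Character data in which the same values repeat, even in a small proportion, needs less memory when stored as a factor. On a 32-bit machine, storing a string of n bytes takes 28 + 8*ceiling((n+1)/8) bytes, whereas a factor code takes 4 bytes. On a 64-bit machine the 28 becomes 56 or more.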

> (ff <- factor(substring("statistics",1:10,1:10), levels=letters)) # specify the level set
 [1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> as.integer(ff)
 [1] 19 20  1 20  9 19 20  9  3 19
> factor(ff)                                 # drop unused levels
 [1] s t a t i s t i c s
Levels: a c i s t
> factor(letters[1:20], label = "letter")    # give a prefix for the level names
 [1] letter1  letter2  letter3  letter4  letter5  letter6  letter7  letter8 
 [9] letter9  letter10 letter11 letter12 letter13 letter14 letter15 letter16
[17] letter17 letter18 letter19 letter20
20 Levels: letter1 letter2 letter3 letter4 letter5 letter6 letter7 ... letter20
> (x <- factor(c(1, 2, "NA"), exclude = "")) # make the string "NA" a level
[1] 1  2  NA
Levels: 1 2 NA
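
The advice about as.numeric(levels(f))[f] is easiest to appreciate with numbers stored as a factor. A minimal sketch follows, with made-up values.

```r
# Numbers that ended up as a factor (illustrative values)
f <- factor(c("10", "2.5", "10", "7"))

levels(f)                  # levels are sorted as strings, not as numbers

as.numeric(f)              # the integer codes: NOT the original values
as.numeric(levels(f))[f]   # the original numeric values
as.numeric(as.character(f))  # same result, converting every element
```

The second form converts only the level set and then indexes it, which is why it is the recommended idiom.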

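Generating a factor from a level pattern: gl()

The function gl() generates a factor by specifying the pattern of its levels.

Usage: gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)
Arguments:
        n number of levels
        k number of replications
   length length of the result
   labels optional labels for the levels
  ordered logical; should the levels be regarded as ordered?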

> gl(2, 8, label = c("Control", "Treat"))
 [1] Control Control Control Control Control Control Control Control Treat  
[10] Treat   Treat   Treat   Treat   Treat   Treat   Treat  
Levels: Control Treat
> gl(2, 1, 20)
 [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
Levels: 1 2
> gl(2, 2, 20)
 [1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2
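
A typical use of gl() is to lay out a balanced design. The sketch below is only an illustration; the names design, block and treat are hypothetical.

```r
# Hypothetical balanced layout: 3 blocks x 2 treatments x 2 replicates
design <- data.frame(
  block = gl(3, 4),                                       # 1 1 1 1 2 2 2 2 3 3 3 3
  treat = gl(2, 2, 12, labels = c("Control", "Treat"))    # pattern repeats to length 12
)

# Every block/treatment combination should occur the same number of times
tab <- table(design$block, design$treat)
tab
```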

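Creating a combined factor: interaction()

The function interaction() creates a factor whose levels are the combinations of the levels of several factors.

Usage: interaction(..., drop = FALSE, sep = ".")
Arguments:
   ... several factors, or a list of them
  drop if TRUE, unused levels are dropped. By default all are kept
   sep the string used to join the original labels into new level labels

The result is always unordered. The level labels are the original level labels joined by sep (a period by default).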

> a <- gl(2, 4, 8)
> b <- gl(2, 2, 8, label = c("ctrl", "treat"))
> s <- gl(2, 1, 8, label = c("M", "F"))
> interaction(a, b)
[1] 1.ctrl  1.ctrl  1.treat 1.treat 2.ctrl  2.ctrl  2.treat 2.treat
Levels: 1.ctrl 2.ctrl 1.treat 2.treat
> interaction(a, b, s, sep = ":")      # change the separator
[1] 1:ctrl:M  1:ctrl:F  1:treat:M 1:treat:F 2:ctrl:M  2:ctrl:F  2:treat:M
[8] 2:treat:F
8 Levels: 1:ctrl:M 2:ctrl:M 1:treat:M 2:treat:M 1:ctrl:F ... 2:treat:F
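
The effect of drop is only visible when some combinations do not occur in the data. A small sketch with invented factors a2 and b2:

```r
# Two illustrative factors in which the combination y.q never occurs
a2 <- factor(c("x", "x", "y"))
b2 <- factor(c("p", "q", "p"))

all_comb  <- interaction(a2, b2)               # keeps every possible combination
used_comb <- interaction(a2, b2, drop = TRUE)  # keeps only combinations that occur

nlevels(all_comb)
nlevels(used_comb)
levels(used_comb)
```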

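The levels attribute of a factor: levels()

The function levels() returns the levels attribute of a factor and can also be used to change it.

Usage: levels(x), levels(x) <- value
Arguments:
      x a factor object
  value a valid value for levels(x)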

> x <- gl(2, 4, 8)
> levels(x)[1] <- "low"
> levels(x)[2] <- "high"
> x
[1] low  low  low  low  high high high high
Levels: low high
> y <- gl(2, 4, 8)
> levels(y) <- c("low", "high")
> y
[1] low  low  low  low  high high high high
Levels: low high
> z <- gl(3, 2, 12)
> levels(z) <- c("A", "B", "A")
> z
 [1] A A B B A A A A B B A A
Levels: A B
> z <- gl(3, 2, 12)
> levels(z) <- list(A = c(1, 3), B = 2)
> z
 [1] A A B B A A A A B B A A
Levels: A B

Levels can also be changed, and merged, by subscripted replacement of levels():

> w <- gl(3, 2, 12)
> levels(w) <- c("A", "B", "C")
> w
[1] A A B B C C A A B B C C
Levels: A B C
> levels(w)[2:3] <- "B"
> w
[1] A A B B B B A A B B B B
Levels: A B

The number of levels of a factor: nlevels()

nlevels(x) returns the number of levels of the factor x.

> nlevels(gl(3, 7))
[1] 3

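Subscripting factors

With the bracket operators you can extract or replace parts of a factor.

Usage: x[..., drop = FALSE], x[[i]], x[...] <- value
Arguments:
       x a factor
  ..., i subscripts specifying the range
    drop if TRUE, unused levels are dropped
   value character strings naming levels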

> (ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))
 [1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff[1:3]                        # single-bracket operator
[1] s t a
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff[[3]]                        # the double-bracket operator extracts a single element only
[1] a
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff[1:3] <- c("f","a","n")      # replace elements
> ff
 [1] f a n t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff[, drop = TRUE]              # drop unused levels
 [1] f a n t i s t i c s
Levels: a c f i n s t
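
One pitfall of replacement is that the new value must already be a level. The following sketch (with made-up factors f and g) shows what happens otherwise and one way around it; suppressWarnings() is used here only to keep the example quiet.

```r
f <- factor(c("a", "b", "a"))

# Assigning a value that is not in levels(f) gives NA (normally with a warning)
g <- f
suppressWarnings(g[2] <- "c")
g

# Extending the level set first makes the assignment valid
levels(f) <- c(levels(f), "c")
f[2] <- "c"
f
```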

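Reordering factor levels: reorder(), relevel(), factor(x, levels = )

reorder() is a generic function; its method for factors reorders the levels according to the values of a second argument (usually numeric).

Usage: reorder(x, ...)
       ## S3 method for class factor
       reorder(x, X, FUN = mean, ..., order = is.ordered(x))
Arguments:
      x a factor (possibly ordered) whose levels are to be reordered
      X a vector of the same length as x; its subsets corresponding to each level of x determine the final order of the levels
    FUN a function whose first argument is a vector and which returns a scalar; applied to each subset of X determined by the levels of x
    ... optional arguments passed to FUN
  order logical; should the result be an ordered factor rather than a plain factor?

The levels of x are reordered so that the values obtained by applying FUN to the subsets of X, grouped by the values of x, are in increasing order. The values returned by FUN are attached as the "scores" attribute.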

> str(InsectSprays)                             # built-in data frame
'data.frame':   72 obs. of  2 variables:
 $ count: num  10 7 20 14 14 12 10 23 17 20 ... 
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
> InsectSprays$spray                            # it contains the factor spray
 [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
[39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
Levels: A B C D E F
## compute the median of count for each level A, B, C, D, E, F and reorder the levels so that the medians are ascending
> reorder(InsectSprays$spray, InsectSprays$count ,median)  
 [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
[39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
attr(,"scores")
   A    B    C    D    E    F 
14.0 16.5  1.5  5.0  3.0 15.0                   # the median computed for each factor group
Levels: C E D A F B                             # the reordered levels

relevel() moves the level chosen as ref to the front. This is useful when you want an arbitrary level to be the baseline for contrasts.

> relevel(InsectSprays$spray, ref="F")          # move level F to the front
 [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C
[35] C C D D D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F
[69] F F F F
Levels: F A B C D E

The levels argument of factor() lets you specify the order of the levels freely.

> factor(InsectSprays$spray, levels=c("A", "E", "B", "C", "D", "F")) # re-create the factor with an explicit levels argument
 [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C
[35] C C D D D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F
[69] F F F F
Levels: A E B C D F
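
Why the first level matters can be seen in a model fit: with the default treatment contrasts, the intercept is the mean of the first (baseline) level. The sketch below uses lm() on InsectSprays; the object names spray_ref_f, fit_default and fit_ref_f are illustrative only.

```r
# Same data, two baselines: default (A) and F moved to the front
spray_ref_f <- relevel(InsectSprays$spray, ref = "F")

fit_default <- lm(count ~ spray,       data = InsectSprays)
fit_ref_f   <- lm(count ~ spray_ref_f, data = InsectSprays)

# With treatment contrasts the intercept is the mean of the baseline level
coef(fit_default)[1]
coef(fit_ref_f)[1]

mean_a <- mean(InsectSprays$count[InsectSprays$spray == "A"])
mean_f <- mean(InsectSprays$count[InsectSprays$spray == "F"])
```

The fitted values of the two models are the same; only the parameterization, and hence which comparisons the coefficients express, changes.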

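Dropping empty factor levels

Taking part of a data set with subset() leaves empty factor levels behind, as shown below. Use droplevels() to remove them. (This page used to say "use factor()". For a factor, droplevels() essentially just re-applies factor(); in older R versions its factor method was defined as function (x, ...) factor(x). So factor() also works.)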

> IS2<-subset(InsectSprays,InsectSprays$spray!="D")
> IS2$spray
 [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C
[31] C C C C C C E E E E E E E E E E E E F F F F F F F F F F F F
Levels: A B C D E F
> summary(IS2$spray)
 A  B  C  D  E  F 
12 12 12  0 12 12 
> IS2$spray<-droplevels(IS2$spray)
> summary(IS2$spray)
 A  B  C  E  F 
12 12 12 12 12
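
droplevels() also has a method for data frames, which drops unused levels from every factor column at once. A minimal sketch (IS3 and IS4 are illustrative names):

```r
IS2 <- subset(InsectSprays, spray != "D")

# Data-frame method: cleans all factor columns in one call
IS3 <- droplevels(IS2)

levels(IS2$spray)   # still contains "D"
levels(IS3$spray)   # "D" is gone

# Equivalent for a single factor: subscript with drop = TRUE
IS4 <- IS2$spray[, drop = TRUE]
```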

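Tips 1. Operations on factors can be slow

A factor represents a numeric or character vector indirectly, as subscripts into a character vector called the level set. This automatically groups equal values. Because of this indirect reference, processing may take longer than with the original vector, so in some cases it is preferable to work with the original vector as it is.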

> y <- sample(1:10,1e4, replace=TRUE)
> x <- as.factor(y)                     # convert to a factor
## the timing ratio was 108:1
> for (i in 1:1e4) x[i]
> for (i in 1:1e4) y[i]

Conversely, operations such as order() and sort() that can work group by group are comparable in speed or may even be faster.

## timings were 3.6 vs 3.7 (seconds)
> for (n in 1:1000) order(x)
> for (n in 1:1000) order(y)

Data frames often contain factor variables, and this can slow processing down further.

> z <- data.frame(x=runif(1e3), y=sample(letters[1:10], 1e3, replace=TRUE), stringsAsFactors=TRUE)
> str(z$y)                                        # y is a factor (stringsAsFactors=TRUE was the default before R 4.0.0)
 Factor w/ 10 levels "a","b","c","d",..: 8 6 2 8 8 10 5 4 6 10 ...
> zz <- transform(z, y=levels(z$y)[z$y])          # convert y back to a non-factor
> z4 <- levels(z$y)[z$y]                          # extract y as a character vector
## timing ratio was 2.4:1.1:1
> for(n in 1:1e3) which(z$y == "a")
> for(n in 1:1e3) which(zz$y == "a")
> for(n in 1:1e3) which(z4 == "a")
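
The ratios quoted above depend on the R version and the machine, so it is worth measuring them yourself. The following is a rough sketch using system.time(); the variable names are illustrative, and no particular ratio should be expected.

```r
set.seed(42)
y <- sample(1:10, 1e4, replace = TRUE)  # plain integer vector
x <- as.factor(y)                       # the same data as a factor

# Element-wise subscripting: the factor method of "[" is called each time
t_factor <- system.time(for (i in seq_along(x)) x[i])["elapsed"]
t_vector <- system.time(for (i in seq_along(y)) y[i])["elapsed"]

c(factor = unname(t_factor), vector = unname(t_vector))
```

For more stable numbers, repeat the measurement several times or wrap the loops in a larger outer loop.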

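Tips 2. In R before 4.0.0, string variables in a data frame are converted to factors by default (since R 4.0.0 the default is stringsAsFactors = FALSE). When in doubt, check with str().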

> x <- data.frame(a=1:10, b=letters[1:10])
> str(x$b)      # the string variable b has become a factor (default behaviour before R 4.0.0)
 Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
> xx <- data.frame(a=1:10, b=I(letters[1:10])) # use the "as is" function I() to prevent conversion to a factor
> str(xx$b)
Class 'AsIs'  chr [1:10] "a" "b" "c" "d" ...
> x <- transform(x, b=levels(b)[b])  # to undo the factor conversion afterwards
## for numeric values, use as.numeric(levels(b))[b] instead
> str(x$b)
 chr [1:10] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
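
Besides I(), data.frame() has a stringsAsFactors argument that states the intent explicitly and makes the result independent of the R version's default. A minimal sketch with made-up data:

```r
# Explicitly keep strings as character
d_chr <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

# Explicitly convert strings to factors
d_fac <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = TRUE)

str(d_chr$b)
str(d_fac$b)
```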

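Tips 3. Using the as.is argument of read.table()

When read.table() reads a file into a data frame, character columns are converted to factors by default in R versions before 4.0.0. To control this, pass as.is a logical vector with TRUE for string variables that should not become factors and FALSE for those that should. The vector is matched to the columns of the file, including the row-names column if there is one, and is recycled as necessary.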

> xx <- data.frame(a=1:10, b=I(letters[1:10]), c=I(LETTERS[1:10]))
> write.table(xx, "test.table") # test data file
> str(read.table("test.table")) # by default (R < 4.0.0) b and c become factors
'data.frame':   10 obs. of  3 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
 $ c: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
> str(read.table("test.table", as.is=c(TRUE,TRUE))) # b and c are not converted to factors
'data.frame':   10 obs. of  3 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b: chr  "a" "b" "c" "d" ...
 $ c: chr  "A" "B" "C" "D" ...
> str(read.table("test.table", as.is=c(TRUE,FALSE))) # only c becomes a factor
'data.frame':   10 obs. of  3 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b: chr  "a" "b" "c" "d" ...
 $ c: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
> str(read.table("test.table", as.is=c(FALSE,TRUE))) # only b becomes a factor
'data.frame':   10 obs. of  3 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
 $ c: chr  "A" "B" "C" "D" ...
> str(read.table("test.table", as.is=c(FALSE,FALSE))) # same as omitting as.is (in R < 4.0.0)
'data.frame':   10 obs. of  3 variables:
 $ a: int  1 2 3 4 5 6 7 8 9 10
 $ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
 $ c: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
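
An alternative that does not rely on recycling is to name the columns in colClasses. The sketch below writes a small temporary file (the data and the name tmp are illustrative) and reads it back with per-column classes.

```r
# Write a small test file to a temporary location (no row names)
tmp <- tempfile(fileext = ".txt")
xx  <- data.frame(a = 1:3, b = c("a", "b", "c"), c = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
write.table(xx, tmp, row.names = FALSE)

# Named colClasses: each column gets the class given under its name
d <- read.table(tmp, header = TRUE,
                colClasses = c(a = "integer", b = "character", c = "factor"))
unlink(tmp)   # clean up the temporary file

str(d)
```

Naming the columns keeps the specification valid even if the column order in the file changes.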

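Tips 4. Some tricky examples

The actual values of a factor variable are determined by both the level set and the factor codes. Changing the level set changes the values of the variable itself.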

> set.seed(1); x <- sample(letters[1:5], 10, replace=TRUE)
> x
 [1] "b" "b" "c" "e" "b" "e" "e" "d" "d" "a"
> y <- as.factor(x)
> y
 [1] b b c e b e e d d a
Levels: a b c d e
> x == y            # compare: all match
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> levels(y) <- rev(levels(y))
> y                 # the levels have been reversed
 [1] d d c a d a a b b e
Levels: e d c b a
> x == y            # compare again: mismatch
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> levels(y) <- 1:5  # change the levels once more
> y
 [1] 2 2 3 5 2 5 5 4 4 1
Levels: 1 2 3 4 5

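A level set may contain redundant elements (levels that do not occur in the data). Comparing two factor variables can cause problems even when they look identical. In particular, two factors with different level sets cannot be compared.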

> x <- y <- as.factor(c("a","b","a","a","b","a","b"))
> x
[1] a b a a b a b
Levels: a b
> levels(y) <- c("a","b","c") # add an unused level "c"
> y
[1] a b a a b a b
Levels: a b c
> x == y
Error in Ops.factor(x, y) : level sets of factors are different
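
When two factors with different level sets really do need to be compared, two common workarounds are to compare them as character vectors or to rebuild both on a shared level set. A minimal sketch with illustrative factors fx and fy:

```r
fx <- factor(c("a", "b", "a"))
fy <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))  # extra unused level

# Workaround 1: compare the labels as plain strings
same_chr <- as.character(fx) == as.character(fy)

# Workaround 2: give both factors the same level set, then compare
lv <- union(levels(fx), levels(fy))
same_fac <- factor(fx, levels = lv) == factor(fy, levels = lv)

same_chr
same_fac
```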

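When read.xls() (in the gdata package, not foreign) turns strings into factors

This is a slightly different kind of problem, but I ran into it when importing data from Excel with read.xls, so I am adding it here.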

# when the variable ID of a data frame named df has been turned into a factor
df$ID<- as.character(df$ID)

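read.xls() has a ... argument, and its documentation shows that you can choose whether strings are read as character or as factor when importing.
For most tasks an argument already exists to do what you need, so it pays to read the online help carefully. The relevant arguments of read.table() are documented as follows: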
stringsAsFactors --- logical: should character vectors be converted to factors? Note that this is overridden by as.is and colClasses, both of which allow finer control.

as.is --- the default behavior of read.table is to convert character variables (which are not converted to logical, numeric or complex) to factors. The variable as.is controls the conversion of columns not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors. Note: to suppress all conversions including those of numeric columns, set colClasses = "character".

Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.

colClasses --- character. A vector of classes to be assumed for the columns. Recycled as necessary, or if the character vector is named, unspecified values are taken to be NA.
Possible values are NA (the default, when type.convert is used), "NULL" (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.

Note that colClasses is specified per column (not per variable) and so includes the column of row names (if any).