COLOR(red){SIZE(25){Translation of Paul Johnson's R Tips Collection}}

As of 2003/08/09~

Note: this is a collection of useful material that P. Johnson extracted from r-help articles. For the original, see his Tips collection. It covers a wide range of topics and makes a handy reference. For now we want to translate at least each heading, to make things easier to find. As always, help is welcome; if there is an article you recommend, we would like to add a translation of it too, and help with formatting alone is also appreciated. (Since the source itself is a digest of r-help articles, we assume the copyright issues are cleared, yes?)
----

(I thought it would be clever to name this "StatsRus", but the toy store's legal counsel telephoned and said "as you know ...")

If you need a collection of R tips, this is it. It is not a substitute for the R documentation; it is just what I use to remember things I have read in email.

Take note of these words from Brian D. Ripley:
"One enquiry came to me yesterday which suggested that some users of binary
distributions do not know that R comes with two Guides in the doc/manual
directory plus an FAQ and the help pages in book form.  I hope those are
distributed with all the binary distributions (they are not made nor
installed by default).  Windows-specific versions are available in PDF in
rw1001d1.zip and rw1001d2.zip"


Thanks to Stas Bekman, the author of the FaqManager software.

#contents

* 1 Data Input/Output

**1.1 Read raw numbers into R
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

This is truly easy. Suppose you've got numbers in a space-separated file "myData", with column names in the first row. R's

 myDataFrame<-read.table("myData",header=TRUE)

command works great.

If you type "?read.table" it tells about importing files with other delimiters.

Suppose you have tab delimited data with blank spaces to indicate "missing" values. Do this:

 > myDataFrame<-read.table("myData",sep="\t",na.strings=" ",header=TRUE)

(new R1.3 feature) Suppose your data is in a gzipped file, myData.gz. Do this:

 myDataFrame <- read.table(gzfile("myData.gz"), header=T)

If you read columns in from separate files, combine into a data frame as:

 variable1 <- scan("file1")
 variable2 <- scan("file2")
 mydata <- cbind(variable1, variable2)

or use the equivalent:
 mydata <- data.frame(variable1, variable2)

Optionally, save the data frame to a text file with:
 write.table(mydata, file="filename3")


**1.2 Basic notes on data access
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

To access a column of a data frame "x" by column number, use x[,1] to get the first column. If you know the column name, say "pjVar1", then x$pjVar1 is equivalent.

If you attach that data frame with attach(x) then you can just refer to the variable pjVar1 and R will find it in the data frame. I have had some trouble with this.

In a regression, include the option data=x and then you can use pjVar1 as a variable name. That makes the cleanest code.
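
For instance, a minimal sketch (the data frame x and the columns pjVar1/pjVar2 are the hypothetical names used above):

 x <- data.frame(pjVar1 = rnorm(10), pjVar2 = rnorm(10))
 fit <- lm(pjVar2 ~ pjVar1, data = x)  # data=x lets the formula use bare column names
 summary(fit)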

Grab an element in a list as x[[1]]

For instance, if a data frame is "nebdata" then

   nebdata[1, 2]

or

   nebdata[nebdata$Volt=="ABal", 2]

or by attaching the data frame

   attach(nebdata)
   nebdata[Volt=="ABal", 2]
   detach(nebdata)

(from Diego Kuonen)


**1.3 Check the new Data Import/Export manual
(13/08/2001 Paul Johnson <pauljohn@ukans.edu>)

With R-1.2, the R team released a manual about how to get data into R and out of R. That's the first place to look if you need help. Right now it is at http://lib.stat.cmu.edu/R/CRAN/doc/manuals/R-data.pdf


**1.4 Exchange data between R and other programs (Excel, etc.)
(04/05/2001 Paul Johnson <pauljohn@ukans.edu>)

I made these notes before the new R Data Import/Export manual existed. You probably ought to read that. (see previous item). For now:

Excel -> R

Use 'save as space-delimited text (.prn)' in Excel to create a text file, say 'table.txt', then use data.matrix(read.table("table.txt")) in R to get a two-dimensional array with row and column labels.

R -> Excel

Use sink("table.txt") to divert printed output to a text file, print the R object, call sink() to restore normal output, then open the file in Excel and follow the import wizard steps. (from Griffith Feeney)

Did you try the R-Excel interface? See http://lib.stat.cmu.edu/R/CRAN/contrib/extra/index.html (from Charles Raux)

Here's what I'm planning to teach my students (Import only, to export use write.table):

Currently the easiest way is to ... export data as a text file and use one of |read.table|, |read.csv|, and |read.csv2| to read the data as an R data frame.

The current techniques for the three most common PC programs are as follows. Slight variations of the tab-separated format seem to be the common denominator.

SPSS: Use |File, Export| menu in the data editor and export as a tab-separated file (for simplicity, say |mydata.txt| in the top directory on the |C:| drive). Make sure that it contains a header line with the variable names and use |read.csv2("C:/mydata.txt",sep="\t")| to read it in. This refers to locales where the comma is used as decimal separator, otherwise use |read.csv|.

Excel: Use |File,Save as...| and choose tab-sep (|.txt|) or use the clipboard. Subsequent data handling is similar to SPSS. Note that a header line will only be present if it was in the spreadsheet to begin with.

SAS: It is possible to export in a comma-separated or tab-separated format. Notice that SAS will use "." for decimal point irrespective of locale. The files can be read using read.csv("C:/mydata.csv")  or  read.csv("C:/mydata.csv",sep="\t").

(from Peter Dalgaard)


**1.5 Merging data frames
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

 experiment<-data.frame(times=c(0,0,10,10,20,20,30,30),expval=c(1,1,2,2,3,3,4,4))
 simul<-data.frame(times=c(0,10,20,30),simul=c(3,4,5,6))

I want a merged datatset like:

 >   times expval simul
 > 1     0      1     3
 > 2     0      1     3
 > 3    10      2     4
 > 4    10      2     4
 > 5    20      3     5
 > 6    20      3     5
 > 7    30      4     6
 > 8    30      4     6

Answers

 merge(experiment, simul)

does all the work for you. (from Brian D. Ripley)

Or consider:

  exp.sim<-data.frame(experiment,simul=
                simul$simul[match(experiment$times,simul$times)]) 
(from Jim Lemon)

I have dataframes like this:

 state count1 percent1
 CA    19    0.34
 TX    22    0.35
 FL    11    0.24
 OR    34    0.42
 GA    52    0.62
 MN    12    0.17
 NC    19    0.34
 
 state count2 percent2
 FL    22     0.35
 MN    22     0.35
 CA    11     0.24 
 TX    52     0.62

And I want

 state count1 percent1 count2 percent2
 CA    19    0.34      11     0.24
 TX    22    0.35      52     0.62
 FL    11    0.24      22     0.35
 OR    34    0.42       0     0
 GA    52    0.62       0     0
 MN    12    0.17      22     0.35
 NC    19    0.34       0     0

( from Yu-Ling Wu )

In response, Ben Bolker said

 state1 <- c("CA","TX","FL","OR","GA","MN","NC")
 count1 <- c(19,22,11,34,52,12,19)
 percent1 <- c(0.34,0.35,0.24,0.42,0.62,0.17,0.34)

 state2 <- c("FL","MN","CA","TX") 
 count2 <- c(22,22,11,52) 
 percent2 <-   c(0.35,0.35,0.24,0.62)
 data1 <- data.frame(state1,count1,percent1) 
 data2 <-  data.frame(state2,count2,percent2)
 datac <- data1
 m <- match(data1$state1,data2$state2,0) 
 datac$count2 <- ifelse(m==0,0,data2$count2[m]) 
 datac$percent2 <- ifelse(m==0,0,data2$percent2[m])

(Translator's note!) The result of using the match() code above is wrong; it comes out as shown below (2004/1/30). Presumably match() compared the factors by their internal integer codes in the R version of the day; the level sets of state1 and state2 differ, so MN gets paired with TX. Matching as.character(data1$state1) against as.character(data2$state2) avoids the problem.

 state count1 percent1 count2 percent2
 CA    19    0.34      11     0.24
 TX    22    0.35      52     0.62
 FL    11    0.24      22     0.35
 OR    34    0.42       0     0
 GA    52    0.62       0     0
 MN    12    0.17      52 <-- 0.62 <-- these two entries are wrong
 NC    19    0.34       0     0


If you didn't want to keep all the rows in both data sets (but just the shared rows) you could use

 merge(data1,data2,by=1)
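
If you instead want to keep all rows of data1 and fill the gaps with 0, as in the desired output above, a merge-based sketch (using the column names from Ben Bolker's code) would be:

 datac <- merge(data1, data2, by.x="state1", by.y="state2", all.x=TRUE)
 datac$count2[is.na(datac$count2)] <- 0     # unmatched states get 0
 datac$percent2[is.na(datac$percent2)] <- 0
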

**1.6 Add one row at a time
(14/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Question: I would like to create an (empty) data frame with "headings" for every column (column titles) and then put data row-by-row into this data frame (one row for every computation I will be doing), i.e.

 no. time temp pressure <---the headings
 1   0     100   80     <---first result
 2   10    110   87     <---2nd result .....
 

Answer: It depends on whether the columns are all numeric: if they are, a matrix would be better.

If you know the number of results in advance, say, N, do this

 df <- data.frame(time=numeric(N), temp=numeric(N), pressure=numeric(N))
 df[1, ] <- c(0, 100, 80)
 df[2, ] <- c(10, 110, 87)
 ...

or

 m <- matrix(nrow=N, ncol=3)
 colnames(m) <- c("time", "temp", "pressure")
 m[1, ]  <- c(0, 100, 80)
 m[2, ] <- c(10, 110, 87)

The matrix form is better since it only needs to access one vector (a matrix is a vector with attributes), not three.

If you don't know the final size you can use rbind to add a row at a time, but that is substantially less efficient, as lots of re-allocation is needed. It's better to guess the size, fill in, and then rbind on a lot more rows if the guess was too small. (from Brian Ripley)
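
For example, appending one row with rbind might look like this (a sketch reusing the column names above):

 df <- rbind(df, data.frame(time=40, temp=120, pressure=95))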

**1.7 Another way to merge data frames
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

(from Stephen Arthur)

Example data:

 File1:
  C A T

 File2:
  1 2 34 56
  2 3 45 67
  3 4 56 78

To Yield:

 
 > C A T 1 2 34 56
 > C A T 2 3 45 67
 > C A T 3 4 56 78

This works:

 
 repcbind <- function(x,y) {
  nx <- nrow(x)
  ny <- nrow(y)
  if (nx<ny)
    x <- apply(x,2,rep,length=ny)
  else if (ny<nx)
    y <- apply(y,2,rep,length=nx)
  cbind(x,y)
 }

(from Ben Bolker)


**1.8 Check whether an object is NULL
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

If you load things, R does not warn you when they are not found; it records them as NULL. You have the responsibility of checking them. Use

 is.null(list$component)

to check a thing named component in a thing named list.

Accessing non-existent dataframe columns with "[" does give an error, so you could do that instead.

 >data(trees)
 >trees$aardvark
 NULL
 >trees[,"aardvark"]
 Error in [.data.frame(trees, , "aardvark") : subscript out of bounds 

(from Thomas Lumley)

**1.9 Generate random numbers
(22/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 runif() --uniform random on [0,1) 
 rnorm() --normal random

Multivariate normal: given the covariance matrix

 0.25,    0.20
 0.20,    0.25

Brian Ripley said: "mvrnorm in package MASS (part of the VR bundle)."

 mvrnorm(2, c(0,0), matrix(c(0.25, 0.20, 0.20, 0.25), 2,2))

and Peter Dalgaard observed "a less general solution for this particular case would be

 rnorm(1,sd=sqrt(0.20)) + rnorm(2,sd=sqrt(0.05))

(the single shared draw contributes variance 0.20 to both coordinates, which is exactly the covariance, while the two independent draws add 0.05 each, giving variances of 0.25)

Wait. You want randomly drawn integers? Here:

If you mean sampling without replacement:

 >sample(1:10,3,replace=FALSE)

If you mean with replacement:

 >sample(1:10,3,replace=TRUE)

(from Bill Simpson)

**1.10 Generate random numbers with a fixed mean/variance
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

If you generate random numbers with a given tool, you don't get a sample with the exact mean you specify. A generator with a mean of 0 will create samples with varying means, right?

So, how to force the sample to have a mean of 0? Take a 2 step approach:

 > x <- rnorm(100, mean = 5, sd = 2)
 > x <- (x - mean(x))/sqrt(var(x))
 > mean(x)
 [1] 1.385177e-16
 > var(x)
 [1] 1

and now create your sample with mean 5 and sd 2:

 > x <- x*2 + 5
 > mean(x)
 [1] 5
 > var(x)
 [1] 4

(from Torsten.Hothorn)

**1.11 Weighted replication
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

 x<-c(10,40,50,100)  # income vector for instance
 w<-c(2,1,3,2)       # the weight for each observation in x (same length as x)
 rep(x,w)
 [1]  10  10  40  50  50  50 100 100

(from P. Malewski)

**1.12 Expand a dataset to represent weighted data
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Also, most of the functions that have weights have frequency weights rather than probability weights: that is, setting a weight equal to 2 has exactly the same effect as replicating the observation.

If you have frequency weights you may have to expand your dataset.

  expanded.data<-as.data.frame(lapply(compressed.data,
                   function(x) rep(x,compressed.data$weights)))

should do it.

(from Thomas Lumley)

**1.13 Convert a contingency table to a data frame
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

Given an 8-dimensional crosstab, you want a data frame with 8 factors and 1 column for the frequencies of the cells in the table.

R 1.2 introduces a function as.data.frame.table() to handle this.

This can also be done manually. Here's a function (it's a simple wrapper around expand.grid):

 dfify <- function(arr, value.name = "value",
                   dn.names = names(dimnames(arr))) {
   Version <- "$Id: dfify.sfun,v 1.1 1995/10/09 16:06:12 d3a061 Exp $"
   dn <- dimnames(arr <- as.array(arr))
   if(is.null(dn)) stop("Can't data-frame-ify an array without dimnames")
   names(dn) <- dn.names
   ans <- cbind(expand.grid(dn), as.vector(arr))
   names(ans)[ncol(ans)] <- value.name
   ans
 }

The name is short for "data-frame-i-fy".

For your example, assuming your multi-way array has proper dimnames, you'd just do:

 my.data.frame <- dfify(my.array, value.name="frequency")

(from Todd Taylor)

**1.14 Write data to an ASCII file
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

Say I have a command that produced a 28 x 28 data matrix. How can I output the matrix into a txt file (rather than copy/paste the matrix)?

 write.table(mat,file="filename")

Pesl Thomas said, "I want to output a file to my disc with the contents of a data frame I called "tab"

 > tab
  or    1   2   3   4   5   6
  1    57  80  25  23  46  23
  2   106  35  59   8  51   4

You can use

 write.table(unclass(tab))

You could also do

  write(t(tab),ncol=NCOL(tab))

since write() transposes rows and columns. (from Thomas Lumley)

Note: the MASS library has a function write.matrix which is faster if you need to write a numerical matrix, not a data frame. Good for big jobs.

*2 Working with data frames: recoding, combining

**2.1 Add a variable to a data frame (or list)
 (02/06/2003 Paul Johnson <pauljohn@ku.edu>)

In r-help, I asked: "I keep finding myself in a situation where I want to calculate a variable name and then use it on the left hand side of an assignment. For example

 > iteration <- 1
 > varName <- paste("run",iteration,sep="")
 > myList$parse(text=varName) <- aColumn

I got many useful answers that opened my eyes!

Brian Ripley said: For a data frame you could use

 mydf[paste("run",iteration,sep="")] <- aColumn

and for a list or a data frame

 Robject[[paste("run",iteration,sep="")]] <- aColumn

And Thomas Lumley added: "If you wanted to do something of this sort for which the above didn't work, you could also learn about substitute():

  eval(substitute(myList$newColumn<-aColumn),
         list(newColumn=as.name(varName)))

as an alternative to parse().

-(Thomas Lumley)

**2.2 Create variable names "on the fly"
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Paste together a variable name, set it to a value. Use assign. As in

 > assign(paste("file", 1, "max", sep=""), 1)
 > ls()
 [1] "file1max"

(Brian Ripley, June 18, 2001)
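
To read such a variable back by its constructed name, get() is the counterpart (a small sketch):

 get(paste("file", 1, "max", sep=""))   # returns 1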

**2.3 Recode one column, output values into another column
(20/06/2001 Paul Johnson <pauljohn@ukans.edu>)

Aleksey Naumov asked this question. For example, here is a simple data frame:

  var1
 1    1
 2    2
 3    3

How do I add a new column to data - 'var2' in which I group values in 'var1', for example:

   var1 var2
 1    1    4
 2    2    4
 3    3    5
 
 data$var2 <- c(4,4,5)[data$var1]

or

 data <- transform(data, var2=c(4,4,5)[var1])

(from Peter Dalgaard)

I admit the previous seems complicated, and I'm inclined to take the easy road. Some cases are particularly simple because of the way arrays are processed. Suppose you create a variable, and then want to reset some values to missing. Go like this:

  x <- rnorm(10000) 
  x[x > 1.5] <- NA

And if you don't want to replace the original variable, create a new one first ( xnew <- x ) and then do that same thing to xnew.

You can put other variables inside the brackets, so if you want x to equal 234 if y equals 1, then

 x[y == 1] <- 234

So, to do the v1 v2 example above, I think it is clearest to do:

 > v1 <- c(1,2,3)
 
 #now initialize v2
 > v2 <- rep( -9, length(v1))  
 #now recode v2
 > v2[v1==1] <- 4
 > v2[v1==2]<-4
 > v2[v1==3]<-5
 > v2
 [1] 4 4 5

Note that R's "ifelse" command can work too:

 x<-ifelse(x>1.5,NA,x)

I asked Mark Myatt (author of the ReX guide mentioned below) for more examples:

For example, suppose I get a bunch of variables coded on a scale

   1= no
   6  = yes
   8 = tied
   9 = missing
  10 = not applicable.

Recode that into a new variable name with 0=no, 1=yes, and all else NA, for example.

It seems like the replace() function would do it for single values, but you end up with empty levels in factors; that can be fixed by re-factoring the variable. Here is a basic recode() function:

 recode <- function(var, old, new)
 {
   x <- replace(var, var == old, new)
   if(is.factor(x)) factor(x) else x
 }

For the above example:

 test <- c(1,1,2,1,1,8,1,2,1,10,1,8,2,1,9,1,2,9,10,1)
 test
 test <- recode(test, 1, 0)
 test <- recode(test, 2, 1)
 test <- recode(test, 8, NA)
 test <- recode(test, 9, NA)
 test <- recode(test, 10, NA)
 test

Although it is probably easier to use replace():

 test <- c(1,1,2,1,1,8,1,2,1,10,1,8,2,1,9,1,2,9,10,1)
 test
 test <- replace(test, test == 8 | test == 9 | test == 10, NA)
 test <- replace(test, test == 1, 0)
 test <- replace(test, test == 2, 1)
 test

I suppose a better function would take from and to lists as arguments:

 recode <- function(var, from, to)
 {
   x <- as.vector(var)
   for (i in 1:length(from))
   {
     # values of from beyond length(to) recode to NA
     x <- replace(x, x == from[i], to[i])
   }
   if(is.factor(var)) factor(x) else x
 }

For the example:

 test <- c(1,1,2,1,1,8,1,2,1,10,1,8,2,1,9,1,2,9,10,1)
 test
 test <- recode(test, c(1,2,8:10), c(0,1))
 test

and it still works with single values.

Suppose somebody gives me a scale from 1 to 100, and I want to collapse it into 10 groups, how do I go about it?

Mark says: Use cut() for this. This cuts into 10 groups:

 test <- trunc(runif(1000,1,100))
 groups <- cut(test,seq(0,100,10))
 table(test, groups)

To get ten groups without knowing the minimum and maximum value you can use pretty():

 groups <- cut(test,pretty(test,10))
 table(test, groups)

You can specify the cut-points:

 groups <- cut(test,c(0,20,40,60,80,100))
 table(test, groups)

And they don't need to be even groups:

 groups <- cut(test,c(0,30,50,75,100))
 table(test, groups)

Mark added, "I think I will add this sort of thing to the REX pack."


**2.4 Create indicator (dummy) variables
(20/06/2001 Paul Johnson <pauljohn@ukans.edu>)

2 examples:

Suppose x is a column and you want one dummy variable for each valid value. First, make it a factor, then use model.matrix():

 > x <- c(2,2,5,3,6,5,NA)
 > xf <- factor(x,levels=2:6)
 > model.matrix(~xf-1)
   xf2 xf3 xf4 xf5 xf6
 1   1   0   0   0   0
 2   1   0   0   0   0
 3   0   0   0   1   0
 4   0   1   0   0   0
 5   0   0   0   0   1
 6   0   0   0   1   0
 attr(,"assign")
 [1] 1 1 1 1 1

(from Peter Dalgaard )

Question: I have a variable with 5 categories and I want to create dummy variables for each category.

Answer: Use row indexing or model.matrix.

 ff <- factor(sample(letters[1:5], 25, replace=TRUE))
 diag(nlevels(ff))[ff,]

or

 model.matrix(~ ff - 1)

(from Brian D. Ripley)

**2.5 Create lagged values of variables for time series regression
(02/06/2003 Paul Johnson <pauljohn@ku.edu>)

Peter Dalgaard explained, "the simple way is to create a new variable which shifts the response, i.e.

 yshft <- c(y[-1], NA) # pad with missing
 summary(lm(yshft ~ x + y))

Alternatively, lag the regressors:

 N <- length(x)
 xlag <- c(NA, x[1:(N-1)])
 ylag <- c(NA, y[1:(N-1)])
 summary(lm(y ~ xlag + ylag))


**2.6 How to drop factor levels for datasets that don't have observations with those values?
(08/01/2002 Paul Johnson <pauljohn@ukans.edu>)

The best way to drop levels, BTW, is

 problem.factor <- problem.factor[,drop=TRUE]

(from Brian D. Ripley)
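
A small sketch of the effect, with a hypothetical factor f:

 f <- factor(c("a","b","a"), levels=c("a","b","c"))
 table(f)            # level "c" still present, with count 0
 f <- f[, drop=TRUE]
 table(f)            # level "c" is gone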

**2.7 Select/subset observations out of a data frame
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Want to drop observations that are missing on a variable? John Fox says:

 x2 <- na.omit(x)

produces a copy of the data frame x with all rows that contain missing data removed. The function na.exclude could be used also. For more information on missing values, see ?na.exclude.

For exclusion of missing, Peter Dalgaard likes

 subset(x,complete.cases(x))
or
 x[complete.cases(x),]

adding that is.na(x) is preferable to x != "NA".

Suppose you want observations with c=1 in df1. This makes a new data frame you want.

 df2 <- df1[df1$c == 1,]

and note that indexing is pretty central to using S (the language), so it is worth learning all the ways to use it. (from Brian Ripley)

Or use match() to select values from one column by matching them against the values of another column, as in

 > d <- t(array(1:20,dim=c(2,10)))
 > i <- c(13,5,19)
 > d[match(i,d[,1]), 2]
 [1] 14  6 20

(from Peter Wolf)

This gets lines from the dataframe that meet the test like so:

 > d[d[,1] %in% i,]
      [,1] [,2]
 [1,]    5    6
 [2,]   13   14
 [3,]   19   20

(from R. Woodrow Setzer, Jr.)

Till Baumgaertel wanted to select observations for men over age 40, and sex was coded either m or M. Here are two working commands:

 maleOver40 <- subset(d, sex %in% c("m","M") & age > 40)
 maleOver40 <- d[(d$sex == "m" | d$sex == "M") & d$age > 40,]

**2.8 Delete the first observation for each group
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Given data like:

 1  ABK     19910711     11.1867461      0.0000000       108
 2  ABK     19910712     11.5298979     11.1867461      111
 6 CSCO   19910102      0.1553819      0.0000000         106
 7 CSCO   19910103      0.1527778      0.1458333         166

remove the first observation for each value of the "sym" variable (the one coded ABK, CSCO, etc.). If you just need to remove rows 1, 6, and 13, do:

 newhilodata <- hilodata[-c(1,6,13),]

To solve the more general problem of omitting the first in each group, assuming "sym" is a factor, try something like

 newhilodata <- subset(hilodata, diff(c(0,as.integer(sym))) != 0)

(actually, the as.integer is unnecessary because the c() will unclass the factor automagically)

(from Peter Dalgaard)

Alternatively, you could use the match function because it returns the first match. Suppose jm is the data set. Then:

 > match(unique(jm$sym), jm$sym)
  [1]  1  6 13
 > jm <- jm[ -match(unique(jm$sym), jm$sym), ]

(from Douglas Bates)

As Robert pointed out to me privately: duplicated() does the trick

  subset(hilodata, duplicated(sym))

has got to be the simplest variant.

(from Peter Dalgaard)


**2.9 Select a random sample of data
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 sample(N, n, replace=F)

and

 seq(N)[rank(runif(N)) <= n]

is another general solution.

(from Brian D. Ripley)

**2.10 Selecting variables for models: don't forget the subset function
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

You can manage data directly by deleting lines and so forth, but subset() can be used to achieve the same effect without editing the data at all. See ?subset to find out more. Subset is also an option in many statistical functions like lm.

Peter Dalgaard gave this example, using the "builtin" dataset airquality.

 data(airquality)
 names(airquality)
 lm(Ozone~.,data=subset(airquality,select=Ozone:Month))
 lm(Ozone~.,data=subset(airquality,select=c(Ozone:Wind,Month)))
 lm(Ozone~.-Temp,data=subset(airquality,select=Ozone:Month))

The "." on the RHS of lm means "all variables" and the subset command on the rhs picks out different variables from the dataset. "x1:x2" means variables between x1 and x2, inclusive.

**2.11 Process all numeric variables, ignore character variables?
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 data <- read.table("lastline.txt", header=TRUE, as.is=TRUE)
 indices <- 1:dim(data)[2]
 # keep the index of each numeric column (NA otherwise), then drop the NAs
 indices <- na.omit(ifelse(indices*sapply(data, is.numeric), indices, NA))
 mean <- sapply(data[,indices], mean)
 sd <- sapply(data[,indices], sd)

**2.12 Sorting by more than one variable
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

Can someone tell me how I can sort a list that contains duplicates (name), keeping the duplicates together when sorting the values?

 name  M
 1234  8
 1234  8.3
 4321  9
 4321  8.1

I also would like to set a cut-off, so that anything below a certain value will not be sorted. (from Kong, Chuang Fong)

I take it that the cutoff is on the value of M. OK, suppose it is the value of `coff'.

 sort.ind <- order(name, pmax(coff, M))  #  sorting index
 name <- name[sort.ind]
 M <- M[sort.ind]

Notice how using pmax() for "parallel maximum" you can implement the cutoff by raising all values below the mark up to the mark thus putting them all into the same bin as far as sorting is concerned.

If your two variables are in a data frame you can combine the last two steps into one, of course.

 sort.ind <- order(dat$name, pmax(coff, dat$M))
 dat <- dat[sort.ind, ]

In fact it's not long before you are doing it all in one step:

 dat <- dat[order(dat$name, pmax(coff, dat$M)), ]

(from Bill Venables)

I want the ability to sort a data frame lexicographically according to several variables

Here's how:

 spsheet[order(name,age,zip),] 

(from Peter Dalgaard)

**2.13 Rank within subgroups defined by a factor
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

Read the help for by()

 > by(x[2], x$group, rank)
 x$group: A
 [1] 4.0 1.5 1.5 3.0
 ------------------------------------------------------------ 
 x$group: B
 [1] 3 2 1
 
 > c(by(x[2], x$group, rank), recursive=T) 
  A1  A2  A3  A4  B1  B2  B3 
 4.0 1.5 1.5 3.0 3.0 2.0 1.0

(from Brian D. Ripley)


**2.14 Use na.omit to screen missings from a data frame
(22/08/2000 Paul Johnson <pauljohn@ukans.edu>)

To make sure missings are omitted from a dataset called "insure":

 thisdf <- na.omit(insure[,c(1, 19:39)])
 body.m <- lm(BI.PPrem ~ ., data=thisdf, na.action=na.fail)


**2.15 Aggregate values, one for each line
(16/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Question: I want to read values from a text file - 200 lines, 32 floats per line - and calculate a mean for each of the 32 values, so I would end up with an "x" vector of 1-200 and a "y" vector of the 200 means.

Peter Dalgaard says do this:

 y <- apply(as.matrix(read.table("myfile")), 1, mean)
 x <- seq(along=y)

(possibly adding a couple of options to read.table, depending on the file format)

**2.16 Create a new data frame to hold aggregate values for each factor
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

How about this: 
 Z<-aggregate(X, f, sum)

(assumes all X variables can be summed)

Or: [if] X also contains factors, I have to select the variables for which summary statistics are to be computed. So I used:

 Z <- data.frame(f=levels(f),x1=as.vector(tapply(x1,f,sum)))

(from Wolfgang Koller)

**2.17 Selectively aggregate a data frame
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Given

 10 20  23 44 33
 10 20  33 23 67

and you want

 10 20   56 67 100

try this:

 dat<-read.table("data.dat",header=TRUE)
 aggregate(dat[,-(1:2)], by=list(std=dat$std, cf=dat$cf), sum)

Note that the first two columns are excluded by [,-(1:2)] and the by option preserves those values in the output.

**2.18 Rip digits out of real numbers one at a time
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

I want to "take out" the first decimal place of each output, plot them based on their appearence frequencies. Then take the second decimal place, do the same thing.

 a<- log(1:1000)
 d1<-floor(10*(a-floor(a)))  # first decimal
 par(mfrow=c(2,2))
 hist(d1,breaks=c(-1:9))
 table(d1)
 d2<-floor(10*(10*a-floor(10*a))) # second decimal
 hist(d2,breaks=c(-1:9))
 table(d2)

(from Yudi Pawitan)

 x <- 1:1000
 ndig <- 6
 (ii <- as.integer(10^(ndig-1) * log(x)))[1:7] 
 (ci <- formatC(ii, flag="0", wid= ndig))[1:7] 
 cm <- t(sapply(ci, function(cc) strsplit(cc,NULL)[[1]]))
 cm[1:7,]
 apply(cm, 2, table) #--> Nice tables

# The plots : 
 par(mfrow= c(3,2), lab = c(10,10,7)) 
 for(i in 1:ndig) 
  hist(as.integer(cm[,i]), breaks = -.5 + 0:10, 
          main = paste("Distribution of ", i,"-th digit"))

(from Martin Maechler)

**2.19 Grab an item from each of several matrices in a list
(14/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Let Z denote the list of matrices. All matrices have the same order. Suppose you need to take element [1,2] from each.

 lapply(Z, function(x) x[1,2])

should do this, giving a list. Use sapply if you want a vector. (Brian Ripley)

**2.20 Get a vector showing the values in a dataset
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

 xlevels<-sort(unique(x))

Which you can use in a contour plot, as in

 contour(xlevels,ylevels,zvals)

**2.21 Calculate the value of a string representing an R command
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Use the eval(parse()) pairing:

 String2Eval <- "A valid R statement"
 eval(parse(text = String2Eval))

(from Mark Myatt)

Also check out substitute(), as.name() et al. for other methods of manipulating expressions and function calls (from Peter Dalgaard)

Or

 eval(parse(text="ls()"))

Or

 eval(parse(text = "x[3] <- 5"))

**2.22 "which" can grab the index values of cases satisfying a test
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

To analyse large vectors of data using boxplot to find outliers, try:

 which(x==boxplot(x,range=1)$out)

**2.23 Find unique rows in a matrix/data frame
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

Jason Liao writes:

"I have 10000 integer triplets stored in A[1:10000, 1:3]. I would like to find the unique triplets among the 10000 ones with possible duplications."

Peter Dalgaard answers, "As of 1.4.0 (!):

 unique(as.data.frame(A))"

**2.24 Select only numerical variables
(12/11/2002 Paul Johnson <pauljohn@ku.edu>)

 sapply(dataframe, is.numeric)

or
 which(sapply(dataframe, is.numeric))

Thomas Lumley 11/11/2002
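
To actually extract just those columns, the logical vector can index the data frame directly (sketch):

 num.only <- dataframe[, sapply(dataframe, is.numeric), drop=FALSE]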


* 3 Matrix and vector operations

**3.1 How to create an identity matrix?
(16/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 diag(n)

Or, go the long way:

 n <- 5
 I <- matrix(0,nrow=n,ncol=n)
 I[row(I)==col(I)] <- 1

(from E.D. Isaia)


**3.2 Convert matrix "m" to one long vector
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 dim(m) <- NULL
or
 c(m)

(from Peter Dalgaard)
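
A quick check (sketch):

 m <- matrix(1:6, nrow=2)
 c(m)             # 1 2 3 4 5 6, in column-major order
 dim(m) <- NULL   # m itself becomes that vector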

**3.3 Create an unusual sequence (1 2 3 4 1 2 3 1 2 1)
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

I don't know why this is useful, but it shows some row and matrix things:

I ended up using Brian Ripley's method, as I got it first and it worked, i.e.

 A <- matrix(1, n-1, n-1) 
 rA <- row(A) 
 rA[rA + col(A) <= n] 

However, thanks to Andy Royle I have since discovered that there is a much more simple and sublime solution:

 sequence((n-1):1) 
 [1] 1 2 3 4 1 2 3 1 2 1 

(Note the parentheses: because ":" binds tighter than "-", sequence(n-1:1) would be parsed as sequence(n - (1:1)).)

Karen Kotschy

**3.4 Select every n'th item
(14/08/2000 Paul E Johnson <pauljohn@ukans.edu>)

How do you extract every nth element from a very long vector, vec?

 vec[seq(n, length(vec), n)]  #(from Brian Ripley)

seq(1,11628,length=1000) will give 1000 evenly spaced numbers from 1:11628 that you can then index with. (from Thomas Lumley)

My example:

 vec <- rnorm(1999)
 newvec <- vec[seq(1, length(vec), 200)]
 > newvec
  [1]  0.2685562  1.8336569  0.1371113  0.2204333 -1.2798172  0.3337282
  [7] -0.2366385  0.5060078  0.9680530  1.2725744

This shows the items in vec at indexes (1, 201, 401, ..., 1801)


**3.5 Find the index of the value nearest to 1.5 in a vector
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 n <- 1000
 x <- sort(rnorm(n))
 x0 <- 1.5
 dx <- abs(x-x0)
 which(dx==min(dx))

(from Jan Schelling)

 which(abs(x - 1.5) == min(abs(x - 1.5)))

(from Uwe Ligges)

**3.6 Find the index of nonzero items in a vector
(18/06/2001 Paul Johnson <pauljohn@ukans.edu>)

 which(x!=0) (from Uwe Ligges)

 rfind <- function(x) seq(along=x)[x != 0]  #(from Brian D. Ripley)

Concerning speed and memory efficiency I find

 as.logical(x)

is better than

 x!=0

and

 seq(along=x)[as.logical(x)]

is better than

 which(as.logical(x))

thus

 which(x!=0)

is shortest and

 rfind <- function(x)seq(along=x)[as.logical(x)]

seems to be computationally most efficient

(from Jens Oehlschlägel-Akiyoshi)

**3.7 Find the index of missing values
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Suppose the vector "Pes" has 600 observations. Don't do this:

 (1:600)[is.na(Pes)]

The `approved' method is

 seq(along=Pes)[is.na(Pes)]

In this case it does not matter as the subscript is of length 0, but it has floored enough library/package writers to be worth thinking about.

(from Brian Ripley)

However, the solution I gave

 which(is.na(Pes))

is the one I still really recommend; it does deal with 0-length objects, and it keeps names when there are some, and it has an `arr.ind = FALSE' argument to return array indices instead of vector indices when so desired. (from Martin Maechler)

**3.8 Find the index of the largest item in a vector
(16/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 T[which(A==max(A,na.rm=TRUE))]

(pj: note that na.rm=TRUE is in fact the correct argument for max(); which.max(A) is a more direct way to get the index of the maximum)

**3.9 Replace values in a matrix
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

 > tmat <- matrix(rep(0,3*3),ncol=3)
 > tmat
      [,1] [,2] [,3]
 [1,]    0    0    0
 [2,]    0    0    0
 [3,]    0    0    0
 > tmat[tmat==0] <- 1
 > tmat
      [,1] [,2] [,3]
 [1,]    1    1    1
 [2,]    1    1    1
 [3,]    1    1    1

(from Jan Goebel)

If la is a data frame, you have to coerce the data into matrix form, with:

 la <- as.matrix(la)
 la[la==0] <- 1

Try this:

 la <- ifelse(la == 0, 1, la)

(from Martyn Plummer)


**3.10 Delete particular rows from a matrix
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

 > x <- matrix(1:10,,2)
 > x[x[,1]%in%c(2,3),]
 > x[!x[,1]%in%c(2,3),]

(from Peter Malewski)

 mat[!(mat$first %in% 713:715),]

(from Peter Dalgaard )

**3.11 Count the number of items meeting a criterion
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Apply length() to the results of which(), described in the previous question, as in

 length(which(v<7))
or
 sum( v<7,na.rm=TRUE)

**3.12 Compute partial correlation coefficients from a correlation matrix
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

I need to compute partial correlation coefficients between multiple variables (correlation between two paired samples with the "effects of all other variables partialled out")? (from Kaspar Pflugshaupt)

Actually, this is quite straightforward. Suppose that R is the correlation matrix among the variables. Then,

        Rinv<-solve(R)
        D<-diag(1/sqrt(diag(Rinv)))
        P<- -D %*% Rinv %*% D

The off-diagonal elements of P are then the partial correlations between each pair of variables "partialed" for the others. (Why one would want to do this is another question.)

(from John Fox )

In general you invert the variance-covariance matrix and then rescale it so the diagonal is one. The off-diagonal elements are the negative partial correlation coefficients given all other variables.

 pcor2 <- function(x){
   conc <- solve(var(x))
   resid.sd <- 1/sqrt(diag(conc))
   pcc <- - sweep(sweep(conc, 1, resid.sd, "*"), 2, resid.sd, "*")
   return(pcc)
 }
 pcor2(cbind(x1,x2,x3))

see J. Whittaker's book "Graphical models in applied multivariate statistics". (from Martyn Plummer)

This is the version I'm using now, together with a test for significance of each coefficient (H0: coeff=0):

 f.parcor <-  function (x, test = F, p = 0.05)
 {
    nvar <- ncol(x)
    ndata <- nrow(x)
    conc <- solve(cor(x))
    resid.sd <- 1/sqrt(diag(conc))
    pcc <- -sweep(sweep(conc, 1, resid.sd, "*"), 2, resid.sd,
        "*")
    colnames(pcc) <- rownames(pcc) <- colnames(x)
    if (test) {
        t.df <- ndata - nvar
        t <- pcc/sqrt((1 - pcc^2)/t.df)
        pcc <- list(coefs = pcc, significant = t > qt(1 - (p/2),
            df = t.df))
    }
    return(pcc)
 }

(from Kaspar Pflugshaupt)

**3.13 Compute canonical variates
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

Suppose you have data matrices A and B on the same observations. Then

 cancor <- function(A, B) 
 {
   Ap <- prcomp(scale(A, T, F), retx=T)
   Apc <- Ap$x %*% diag(1/Ap$sdev)
   Bp <- prcomp(scale(B, T, F), retx=T)
   Bpc <- Bp$x %*% diag(1/Bp$sdev)
   Sigma <- crossprod(Apc, Bpc)/(nrow(A) - 1)
   s <- svd(Sigma, ncol(A), ncol(B))
   return(list(cor=s$d, canvar.x=Apc %*% s$u, canvar.y=Bpc %*% s$v))
 }

should do the trick. The canonical variates are zero-mean, unit-variance (unlike S-PLUS). (from Brian D. Ripley)

**3.14 Create a multidimensional matrix (R array)
(20/06/2001 ? <?>)

Brian Ripley said:

 my.array<-array(0,dim=c(10,5,6,8))

will give you a 4-dimensional 10 x 5 x 6 x 8 array.

Or

 array.test <- array(1:64,c(4,4,4))
 array.test[1,1,1]
 1
 array.test[4,4,4]
 64

**3.15 Combine a lot of matrices
(20/06/2001 ? <?>)

Given a list with 100s of (nx2) matrices, how do you combine them into a giant (Nx2) matrix?

list is the list of matrices.

 nr <- sapply(list, nrow)
 cs <- cumsum(nr)
 st <- c(0, cs[-length(cs)]) + 1
 res <- matrix(, sum(nr), 2)
 for(i in seq(along=nr)) res[st[i]:cs[i],] <- list[[i]]

(from Brian Ripley)

**3.16 Create "neighbor" matrices according to specific logics
(20/06/2001 ? <?>)

Want a matrix of 0s and 1s indicating whether a cell has a neighbor at a location:

 N <- 3
 x <- matrix(1:(N^2),nrow=N,ncol=N)
 rowdiff <- function(y,z,mat)abs(row(mat)[y]-row(mat)[z])
 coldiff <- function(y,z,mat)abs(col(mat)[y]-col(mat)[z])
 rook.case <- function(y,z,mat){coldiff(y,z,mat)+rowdiff(y,z,mat)==1}
 bishop.case <- function(y,z,mat){coldiff(y,z,mat)==1 & rowdiff(y,z,mat)==1}
 queen.case <- function(y,z,mat){rook.case(y,z,mat) | bishop.case(y,z,mat)}
 matrix(as.numeric(
             sapply(x,function(y)sapply(x,rook.case,y,mat=x))),ncol=N^2,nrow=N^2)
 matrix(as.numeric(
             sapply(x,function(y)sapply(x,bishop.case,y,mat=x))),ncol=N^2,nrow=N^2)
 matrix(as.numeric(
             sapply(x,function(y)sapply(x,queen.case,y,mat=x))),ncol=N^2,nrow=N^2)

(from Ben Bolker)


**3.17 "Matching" two columns of numbers by a "key"
(20/06/2001 ? <?>)

The question was:

I have a matrix of predictions from a proportional odds model (using the polr function in MASS), so the columns are the probabilities of the responses, and the rows are the data points. I have another column with the observed responses, and I want to extract the probabilities for the observed responses.

As a toy example, if I have

 > x <- matrix(c(1,2,3,4,5,6),2,3)
 > y <- c(1,3)

and I want to extract the numbers in x[1,1] and x[2,3] (the columns being indexed from y), what do I do?

Is

 x[cbind(seq(along=y), y)]

what you had in mind? The key is definitely matrix indexing. (from Brian Ripley)


**3.18 Create an upper or lower triangular matrix
(20/06/2001 ? <?>)

There are many ways to do this. Whenever it comes up in r-help, there is a different answer.

Suppose you want a matrix like

 a^0  0    0
 a^1  a^0  0
 a^2  a^1  a^0

I don't know why you'd want that, but look at the differences among answers this elicited.

  ma <- matrix(rep(c(a^(0:n), 0), n+1), nrow=n+1, ncol=n+1)
  ma[upper.tri(ma)] <- 0

(from Uwe Ligges)

 > n <- 3
 > a <- 2
 > out <- diag(n+1)
 > out[lower.tri(out)] <-
        a^apply(matrix(1:n,ncol=1),1,function(x)c(rep(0,x),1:(n-x+1)))[lower.tri(out)]
 > out

(from Woodrow Setzer)

 tmpmat <- function(a,n) {
   x <- matrix(a,nrow=n,ncol=n)
   x <- x^pmax(0,row(x)-col(x))
   x[row(x) < col(x)] <- 0   # zero out the upper triangle
   x
 }

(from Ben Bolker)

If you want an upper triangular matrix, this use of "ifelse" looks great to me:

 fun <- function(x,y) { ifelse(y>x, x+y, 0) }

(then use this with outer()) or

 m <- outer(x,y,FUN="+")
 m[lower.tri(m, diag=T)] <- 0

(from Kaspar Pflugshaupt)

**3.19 Calculate the inverse of X
(20/06/2001 ? <?>)

solve(A) yields the inverse of A.
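
For example (a small sketch; note that solve(A, b) solves the linear system A x = b without forming the inverse, which is usually preferable numerically):

 A <- matrix(c(2, 0, 0, 4), 2, 2)
 Ainv <- solve(A)
 A %*% Ainv    # the identity matrix, up to rounding error
 b <- c(1, 2)
 solve(A, b)   # solves A x = b directly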

**3.20 An interesting use of matrix indices
(20/06/2001 ? <?>)

Here's a similar problem.

Mehdi Ghafariyan said "I have two vectors A=c(5,2,2,3,3,2) and B=c(2,3,4,5,6,1,3,2,4,3,1,5,1,4,6,1,4) and I want to make the following matrix using the information I have from the above vectors.

 
      0 1 1 1 1 1 
      1 0 1 0 0 0 
      0 1 0 1 0 0 
      1 0 1 0 1 0
      1 0 0 1 0 1 
      1 0 0 1 0 0 

so the first vector says that I have 6 elements, therefore I have to make a 6 by 6 matrix; then I have to read 5 elements from the second vector, and put 1 in [1,j] where j=2,3,4,5,6 and zero elsewhere (i.e. in [1,1]), and so on. Any idea how this can be done in R?

Use matrix indices:

 a<-c(5,2,2,3,3,2) 
 b<-c(2,3,4,5,6,1,3,2,4,3,1,5,1,4,6,1,4)
 n<-length(a)
 M<-matrix(0,n,n)
 M[cbind(rep(1:n,a),b)]<-1

(from Peter Dalgaard )

**3.21 Eigenvalues example
(20/06/2001 ? <?>)

Tapio Nummi asked about double-checking results of spectral decomposition

In what follows, m is this matrix:

  0.4015427  0.08903581 -0.2304132
  0.08903581 1.60844812  0.9061157
 -0.23041322 0.9061157   2.9692562

Brian Ripley posted:

 > sm <- eigen(m, sym=TRUE)
 > sm
 $values
 [1] 3.4311626 1.1970027 0.3510817
 $vectors 
 [,1] [,2] [,3]
 [1,] -0.05508142 -0.2204659 0.9738382 
 [2,] 0.44231784  -0.8797867 -0.1741557 
 [3,] 0.89516533 0.4211533 0.1459759
 
 > V <- sm$vectors 
 > t(V) %*% V 
 [,1] [,2] [,3] 
 [1,] 1.000000e+00 -1.665335e-16 -5.551115e-17 
 [2,] -1.665335e-16 1.000000e+00 2.428613e-16 
 [3,] -5.551115e-17 2.428613e-16 1.000000e+00
  
 > V %*% diag(sm$values) %*% t(V) 
 [,1] [,2] [,3]
  [1,] 0.40154270 0.08903581 -0.2304132 
 [2,] 0.08903581 1.60844812 0.9061157 
 [3,] -0.23041320 0.90611570 2.9692562

*4 Applying functions: tapply and the like

**4.1 Return multiple values from a function
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

Below is a summary on returning variables from a subfunction. Thanks to Brian Ripley and Douglas Bates:

To access the variables within a function, return a list or data structure assigned to a variable in the calling function. For more than one variable, assign them to a list and extract the components in the calling function using listname$variablename.

 "test" <- function() {
    "hello" <-
      function()    
        {
          x <- 1:3
          y <- 23:34
          z <- cbind(x,y)
          list(x=x, y=y, z=z)   # return(x,y,z) is no longer valid R; return a named list
        }
    c <- hello()
    list(z=c$z, x=c$x, y=c$y)
  }

Sam McClatchie


**4.2 Grab "p" values out of a list of significance tests
(22/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Suppose chisq1M11 is a list of htest objects, and you want a vector of p values. Kjetil Kjernsmo observed this works:

 > apply(cbind(1:1000), 1, function(i) chisq1M11[[i]]$p.value)

And Peter Dalgaard showed a more elegant approach:

 
 sapply(chisq1M11,function(x)x$p.value)

In each of these, a small anonymous function is created and applied to each item.

**4.3 Usage of ifelse
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

If you want to calculate something that is undefined at a point (for example y*log(y), which is NaN at y == 0), ifelse() can substitute the limiting value there:

 ifelse(y==0,0,y*log(y))

(from Ben Bolker)

**4.4 Use apply to create a matrix of probabilities, one for each "cell"
(14/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Suppose you want to calculate the gamma density for various values of "scale" and "shape". So you create vectors "scales" and "shapes", then create a grid of them with expand.grid, then write a function, then apply it (this example courtesy of Jim Robison-Cox):

 gridS <- expand.grid(scales, shapes)
 survLilel <- function(ss) sum(dgamma(Survival,ss[1],ss[2]))
 Likel <- apply(gridS,1,survLilel)

Actually, I would use

 sc <- 8:12; sh <- 7:12
 args <- expand.grid(scale=sc, shape=sh)
 matrix(apply(args, 1, function(x) sum(dgamma(Survival, scale=x[1], shape=x[2], 
    log=T))), length(sc), dimnames=list(scale=sc, shape=sh))

(from Brian Ripley)

**4.5 The outer function
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

outer(x,y,f) does just one call to f with arguments created by stacking x and y together in the right way, so f has to be vectorised. (from Thomas Lumley)
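
A sketch of what that implies: a function that collapses its input gives an error inside outer(), and a vectorised counterpart must be used instead:

 outer(1:3, 1:4, "+")            # fine: "+" is vectorised
 f <- function(x, y) max(x, y)   # not vectorised: max() returns a single value
 ## outer(1:3, 1:4, f)           # so this fails (result has length 1, not 12)
 outer(1:3, 1:4, pmax)           # pmax is the vectorised counterpart of max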

**4.6 Check whether something is a formula/function
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Formulae have class "formula", so inherits(obj, "formula") looks best. (from Prof Brian D Ripley)

And if you want to ensure that it is ~X+Z+W rather than Y~X+Z+W you can use

 inherits(obj,"formula") && (length(obj)==2)

(from Thomas Lumley)

**4.7 Optimize with a vector of variables
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

The function being optimized has to be a function of a single argument. If alf and bet are both scalars you can combine them into a vector and use

 opt.func <- function(arg) 
   t(Y-(X[,1] * arg[1] + X[,2] * arg[2])^delta) %*% covariance.matrix.inverse %*%
            (Y-(X[,1] * arg[1] + X[,2] * arg[2])^delta)

(from Douglas Bates)
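
It can then be handed to optim() with a length-2 starting vector (a sketch; Y, X, delta, and covariance.matrix.inverse are assumed to exist as in the question):

 fit <- optim(c(0, 0), opt.func)   # arg[1] plays alf, arg[2] plays bet
 fit$par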

**4.8 An S+-like slice.index
(14/08/2000 Paul E Johnson <pauljohn@ukans.edu>)

 slice.index<-function(x,i){k<-dim(x)[i];sweep(x,i,1:k,function(x,y)y)}

(from Peter Dalgaard)
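
For example (sketch):

 a <- array(0, c(2, 3, 2))
 slice.index(a, 2)   # each element holds its own index along dimension 2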

*5 Graphics

**5.1 Change characteristics with par before plotting
(18/06/2001 Paul Johnson <pauljohn@ukans.edu>)

par() is a multipurpose command to affect qualities of graphs. Examples include:

***1. Combine subsequent plots into one figure:

 par(mfrow=c(3,2))

creates a 3x2 matrix of plots.

***2. Change the margins, as in:

 par(mfrow=c(2,1))
 par(mar=c(10, 5, 4, 2))
 x <- 1:10
 plot(x)
 box("figure", lty="dashed")
 par(mar=c(5, 5, 10, 2))
 plot(x)
 box("figure", lty="dashed")

**5.2 Save graph output
(15/08/2001 Paul Johnson <pauljohn@ukans.edu>)

There are two ways to save and/or print graphs. Before you create your graphs, you can decide on a format for output. If you create a plot, it shows on the screen. On Unix, that is using the default device, x11, which refers to the monitor or "screen". On MS Windows, the default screen device is called "windows". R on Windows will use that same device if you use x11() as the device.

There are other devices representing kinds of file output. You can choose among device types, postscript, png, jpeg, bitmap, etc. (windows users can choose windows metafile). Type

 >?device

to see a list of devices R finds on your system.

For each device, you can find out more about it.

 ?x11
or
 ?png
or, in MS Windows
 ?windows
(and so forth)

Before you start using a device, you can configure it. The png, jpeg, and bitmap devices are "picture" (raster) formats: they take a "snapshot", and if you want to rescale them, it can get ugly with image-processing software. Postscript, pdf, xfig, (and windows metafile) are scalable vector graphics and can be edited somewhat more gracefully. But you can't put them on a web page (or, at least, they won't show anything :)).

You start a device with a configuration command, like

 png()

for the default png (a gif-like) format.

After you create a device, check what you have on:

 > dev.list()

And, if you have more than one device active, you tell which one to use with dev.set(n), for the n device. Suppose we want the third device:

 > dev.set(3)

Now here is the big thing. The devices are configured differently. The x11, postscript, and pdf devices are configured in inches, that's inches of "screen space" or the equivalent. The other devices, like png and jpeg, are configured in pixels. The number of pixels per inch as shown on your monitor will depend on your computer. You can also set the point size. Peter Dalgaard told me that on postscript devices, a point is 1/72nd of an inch. If you expect to shrink an image for final presentation, set a big pointsize, like 20 or so, in order to avoid having really small type.

Suppose we want png output. We create an instance of a png device and adjust features for that device, as in:

 > png(filename="mypicture.png", width=480, height=640, pointsize=12)

Note: if no filename is given, the device falls back on its default (for png, a default file name; see the device's help page). You can then run your graph, and when it is finished you have to close the device:

 > dev.off()

Some print devices can accept multiple plots, and dev.off() is needed to tell them when to stop recording. On a single-plot device, if you try to run another graph before closing this one, you'll get an error.

The postscript device has options to output multiple graphs, one after the other. If you want to set options to the postscript device, but don't actually want to open it at the moment, use ps.options.

Here's an example of how to make a jpeg:

 jpeg(filename="plot.jpg")
 plot(sin, 0, 2*pi)
 dev.off()

For me, the problem with this approach is that I don't usually know what I want to print until I see it. If I am using the jpeg device, there is no screen output, no picture to look at. So I have to make the plot, see what I want, turn on an output device and run the plot all over in order to save a file. It seems complicated, anyway.

So what is the alternative?

If you already have a graph on the screen, and did not prepare a device, use dev.copy() or dev.print(). This may not work great if the size of your display does not match the target device.

dev.copy() opens the device you specify. It is as if (but not exactly like) you ran the png() command before you made the graph:

 dev.copy(device=png, file="foo", width=500, height=300)
 dev.off()

The dev.copy() function takes a device type as the first argument, and then it can take any arguments that would ordinarily be intended for that device, and then it can take other options as well. Be careful, some devices want measurements in inches, while others want them in pixels. Note that, if you do not specify a device, dev.copy() expects a which= option to be specified telling it which pre-existing device to use.

If you forget the last line, it won't work. It's like a write command.

If you use dev.copy or dev.print, you may run into the problem that your graph elements have to be resized and they don't fit together the way you expect. The default x11 size is 7in x 7in, but postscript size is 1/2 inch smaller than the usable papersize. That mismatch means that either you should set the size of the graphics window on the monitor display device to match the eventual output device, or you fiddle the dev.copy() or dev.print() command to make sure the sizes are correct. Recently I was puzzled over this and Peter Dalgaard said you can force the sizes to match, as in

1. Force the pdf output to match the size of the monitor:

 dev.copy(pdf, file="foo", height=7, width=7)

or~
2. change the display size before creating the graphs so that its size matches the intended device you might use for output:

 x11(height=6, width=6)

There are some specialized functions that can do this in a single step, as in dev.copy2eps() or dev2bitmap(). dev.copy2eps creates an eps file, and I'm not sure how it is different from the eps-compatible output created by the postscript() device. dev.copy2eps *is* an indirect way of calling dev.copy(device=postscript,...). It takes the dimensions from the current plot, and sets a couple of options. The same is true of dev.print().

The dev.print() function, as far as I can see, is basically the same as dev.copy(), except it has two appealing features. First, the graph is rescaled according to paper dimensions, and the fonts are rescaled accordingly. Due to the problem mentioned above, not everything gets rescaled perfectly, however, so take care. Second, it automatically turns off the device after it has printed/saved its result.

 dev.print(device=postscript,file="yourFileName.ps")

Before you run the graphs, you can use the postscript() command to start a PostScript device that dumps the graphs you make onto disk.

If you want pdf output, as of R 1.3 there is a pdf device. Just use pdf() like postscript() or whatever. If you have an old R (say it ain't so...), look into the bitmap device: bitmap() does the postscript + epstopdf bit for you if you select type = "pdfwrite". (from Brian Ripley)

If you just need to test your computer setup, Bill Simpson offered this. Here is a plot printing example

 x<-1:10
 y<-x
 plot(x,y)
 dev.print(width=5,height=5, horizontal=FALSE)

A default jpeg is 480x480, but you can change that:

 jpeg(filename="plot.jpg", width = 460, height = 480, pointsize = 12, quality = 85)

The format for png is the same.

As of R1.1, the dev.print() and dev.copy2eps() will work when called from a function, for example:

 > ps <- function(file="Rplot.eps", width=7, height=7, ...) {
        dev.copy2eps(file=file, width=width, height=height, ...)
   }
 > data(cars)
 > plot(cars)
 > ps()

That didn't work before, but it does work now.

In r-help's list, I've been one of the countless new users to be stumped by the problem of copying a screen device to a file format. At the current time, on Unix, all I can say is that either you resize the display device before doing the graphs so it matches (or is smaller than) the output device, or be careful that the output command specifies width and height (and probably pointsize) to match the screen device. Even then it may not be perfect.

The mismatch of "size" between devices even comes up when you want to print out a plot. This command will print to a printer:

 > dev.print(height=6, width=6, horizontal=FALSE)

You might want to include pointsize=20 or whatever so the text is in proper proportion to the rest of your plot.

One user observed, "Unfortunately this will also make the hash marks too big and put a big gap between the axis labels and the axis title...", and in response Brian Ripley observed: "The problem is your use of dev.print here: the ticks change but not the text size. dev.copy does not use the new pointsize: try

 x11(width=3, height=3, pointsize=8)
 x11(width=6, height=6, pointsize=16)
 dev.set(2)
 plot(1:10)
 dev.copy()

Re-scaling works as expected for new plots but not for re-played plots. Plotting directly on a bigger device is the answer: plots then scale exactly, except perhaps for default line widths and other things where rasterization effects come into play. In short, if you want postscript, use postscript() directly.

The rescaling of a windows() device works differently, and does rescale the fonts. (I don't have anything more to say on that now)

**5.3 How to save plot output in separate files
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

 postscript(file="test3.%d.ps",onefile=FALSE) 
gives files called test3.1.ps, test3.2.ps and so on. (from Thomas Lumley)

**5.4 Control the paper size
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 options(papersize="letter")

sets the paper size globally, whereas ps.options() applies only to the postscript device.
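
For the postscript device specifically, a sketch:

 ps.options(paper="letter", pointsize=12)
 postscript("myplot.ps")
 plot(1:10)
 dev.off()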

**5.5 Integrate R graphs into documents: TeX and EPS
(20/06/2001 Paul Johnson <pauljohn@ukans.edu>)

Advice seems to be to output R graphs in a scalable vector format: EPS on Linux, EPS or metafile on Windows. If you are using TeX for document preparation, on Linux use "tetex"; on Windows there is an implementation called fpTeX (http://www.tile.net/listserv/fptex.html) and another called MiKTeX (www.miktex.org). I've installed MiKTeX and it's pretty good.

Keith Reckdahl's Using EPS in LaTeX documents is freely available on the CTAN sites. (Brian Ripley)

For a description of EPSF and various other vector and bitmap formats, see the Encyclopedia of Graphics File Formats from O'Reilly at http://www.oreilly.com/catalog/gffcd/ (from Joel West)

I personally swear by the LaTeX Companion and the LaTeX Graphics companion, both published by Addison Wesley. (from James Marca)

Thanks to Prof. Ripley for the hint; maybe someone else is interested, so I will shortly outline what "to use pstricks with LaTeX" means:

 - get PSfrag (e.g. www.dante.de)
 - get graphics (same URL)
 - create your ps-graphic in R and define "tags", e.g.

        
        postscript("myfile.ps")
        plot(..., xlab="rho", ylab="wrho")
        dev.off()
        system("ps2epsi myfile.ps")

 add

 
        \usepackage{graphics}
        \usepackage{psfrag}

to your LaTeX-File

and now we reach the climax:

        \begin{figure}
        \begin{center}
        \psfrag{rho}{$\varrho$}         <-- this replaces rho with \varrho
        \psfrag{wrho}{$W(\varrho)$}     <-- this replaces wrho with W(\varrho)
        \includegraphics{myfile.epsi}
        \caption{wonderful R plotting \label{fig}}
        \end{center}
        \end{figure}

run latex and dvips and enjoy :-) (from Torsten Hothorn)

If you need to create a preview inside an eps file, you can do it with GSView (available free).

**5.6 "Snapshot" graphs and viewing them
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

Ordinarily, the graphics device throws away the old graph to make room for the new one.

Have a look at ?recordPlot to see how to keep snapshots. (Do myGraph <- recordPlot() to save a snapshot, then replayPlot(myGraph) to see it again.)
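
Spelled out (sketch):

 plot(1:10)
 myGraph <- recordPlot()   # take the snapshot
 plot(rnorm(10))           # draw something else
 replayPlot(myGraph)       # bring the first plot back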

On Windows you can choose "history"-"recording" in the menu and scroll through the graphs using the PageUp/Down keys. (from Uwe Ligges)

**5.7 Select the upper triangle of a matrix for a scatterplot
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

You only need to specify the points to be plotted:

 ## upper triangle of the correct size:
 temp <- upper.tri(mmtop94.2, diag = TRUE) 
 z <- mmtop94.2[temp]  # values
 y <- nrow(mmtop94.2) + 1 - row(mmtop94.2)[temp]  # row indices
 x <- col(mmtop94.2)[temp]  # col indices
 scatterplot3d(x, y, z, type="h") 

(from Uwe Ligges)

**5.8 Plotting density functions (e.g. the normal)
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

 x <- seq(-3, 3, .1)
 plot(x, dnorm(x, 0, 1), type="l")

(Bill Simpson)

**5.9 Plotting with error bars
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Here is how to do it.

 x<-c(1,2,3,4,5)
 y<-c(1.1, 2.3, 3.0, 3.9, 5.1)
 ucl<-c(1.3, 2.4, 3.5, 4.1, 5.3)
 lcl<-c(.9, 1.8, 2.7, 3.8, 5.0)
 plot(x,y, ylim=range(c(lcl,ucl)))
 arrows(x,ucl,x,lcl,length=.05,angle=90,code=3) 

or
 segments(x,ucl,x,lcl)

(from Bill Simpson)


**5.10 Histogram with an estimated density function
(14/08/2000 Paul E Johnson <pauljohn@ukans.edu>)

If you want a density estimate and a histogram on the same scale, I suggest you try something like this:

 > IQR <- diff(summary(data)[c(5,2)])
 > dest <- density(data, width = 2*IQR)  # or some smaller width, maybe,
 > hist(data, xlim = range(dest$x), xlab = "x", ylab = "density",
          probability = TRUE)    # <<<--- this is the vital argument
 > lines(dest, lty=2)

(from Bill Venables)

Note: diff() returns lagged differences, so in the code above IQR comes out negative and an error results. It should be
 IQR <- diff(summary(data)[c(2,5)])
Also, an error occurs when the data contain missing values (NA), so exclude them with
 dest <- density(data, width = 2*IQR, na.rm=TRUE)

**5.11 How to overlay several line plots?
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Use par("new"=TRUE) to stop the previous output from being erased, as in

 tmp1 <- plot(acquaint~T, type='l', ylim=c(0,1), ylab="average proportion",
    xlab="PERIOD", lty=1, pch=1, main="")
 par("new"=TRUE)
 tmp2 <- plot(harmony~T, type='l', ylim=c(0,1), ylab="average proportion",
    xlab="PERIOD", lty=2, pch=1, main="")
 par("new"=TRUE)
 tmp3 <- plot(identical~T, type='l', ylim=c(0,1), ylab="average proportion",
    xlab="PERIOD", lty=3, pch=1, main="")

Note that par("new"=TRUE) is needed only to overlay "high-level" plotting commands; "low-level" commands like abline() automatically add to an existing graph, so par() is not needed for them.

Here's another idea: form a new data.frame and pass it through as y:

 plot(cbind(ts1.ts,ts2.ts),xlab="time", ylab="", plot.type="single")

or better something like:

 plot(cbind(ts1.ts,ts2.ts),plot.type="single",
       col=c("yellow","red"),lty=c("solid","dotted")) #colours and patterns

It can also be helpful to contrast 'c(ts1.ts,ts2.ts)' with 'cbind(ts1.ts,ts2.ts)'. (from Guido Masarotto)

For time series data, there is a special function for this:

 ts.plot(ts1.ts, ts2.ts) # same as above

See help on plot.ts and ts.plot (from Brian D. Ripley)

Here's an example for histograms:

 > hist(rnorm(100, mean = 20, sd =12), xlim=range(0,100), ylim=range(0,50))
 > par(new = TRUE)
 > hist(rnorm(100, mean = 88, sd = 2), xlim=range(0,100), ylim=range(0,50))

The labels at the bottom get all messed up. You can add the option xlab="" to get rid of them.

Here's another example that works great for scatterplots

 > x1<-1:10
 > x2<-2:12
 > y1<-x1+rnorm(length(x1))
 > y2<-.1*x2+rnorm(length(x2))
 > plot(x1,y1,xlim=range(x1,x2), ylim=range(y1,y2), pch=1)
 > points(x2,y2,pch=2)

(from Bill Simpson)


**5.12 Making a matrix of graphs
(18/06/2001 Paul Johnson <pauljohn@ukans.edu>)

The standard way to array several images across a page is the par(mfrow=) or par(mfcol=) option, as in:

 par(mfrow=c(2,3))

And then the next 6 plot/image commands will be laid out in a 2 row x 3 column arrangement.

The layout option can give a powerful set of controls for that kind of thing as well, e.g. (from Paul Murrell):

 layout(matrix(c(1, 0, 2), ncol=1),
           heights=c(1, lcm(5), 1))
 plot(x)
 box("figure", lty="dashed")
 plot(x)
 box("figure", lty="dashed")

I think you have data with a structure like this:
 xx <- data.frame(y=rnorm(100), x1=as.factor(round(runif(100,1,4))),
           x2=as.factor(round(runif(100,1,4))))
 attach(xx)
 by(y, list(x1,x2), plot)
 by(y, list(x1,x2), function(x) print(x))
 xn <- as.integer(x1)
 ## either use the coplot approach:
 coplot(y ~ xn | x2)
 ## or use a loop with par(mfrow=):
 par(mfrow=c(2,2))
 for (i in unique(as.integer(x1)))
   plot.default(x2[as.integer(x1) == i], y[as.integer(x1) == i],
     main=paste("Code:", i))
 ## of course there might be ways to use tapply, lapply etc.

(from Peter Malewski. Per Peter Dalgaard's advice, I've replaced usage of codes with as.integer. pj)

And to label the whole page, use the "mtext" function
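
A small sketch of that (the outer-margin settings are my own illustration):

 par(mfrow=c(2,2), oma=c(0,0,3,0))  # reserve an outer margin at the top
 for (i in 1:4) plot(rnorm(10))
 mtext("A title for the whole page", side=3, line=1, outer=TRUE)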

**5.13 Combining line and bar plots
(07/12/2000 Paul Johnson <pauljohn@ukans.edu>)

David James was kind enough to help me out and enlighten me to the fact that the x-scales used by barplot are independent of those from my data. He also sent me a function which I include below (with a minor modification). This does exactly what I was looking for. Thanks!

 xbarplot <- function(y, col=par("col"), border = par("fg"), gap=0.2, ...)
 {
   ny <- length(y)
   x <- seq(0,ny)+0.5
   n <- length(x)
   space <- gap * mean(diff(x))
   old <- par(xaxt = "n")
   on.exit(par(old))
   plot(range(x, na.rm=T), range(y, na.rm=T), bty="n",
      xlab="", ylab = "", type = "n", ...)
   rect(x[-n]+space/2, rep(0,length(y)), 
       x[-1]-space/2, y, col = col, border = border)
 }

(from Dirk Eddelbuettel)


**5.14 Regression scatterplot: adding a fitted regression line to the graph
(17/08/2001 Paul Johnson <pauljohn@ukans.edu>)

Suppose you have

 reg <-lm( y ~ x)
 plot(x,y)

then do

 abline(reg)

**5.15 Controlling plotting characters in scatterplots
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

To plot with dots:

 plot(rnorm(100),rnorm(100),pch=".") (from Peter Malewski)

Plots with numbers indicating values of a third variable:

 text(tx, ty, label = ch, col = "red", bg = "yellow", cex = 3)

If you specify pch, only the first character is taken as the symbol for your point.

Matt Wiener suggested creating a color vector and then using the value of a third variable to control both the label and the color:

 > col.vec <- c("black", "red", "blue")
 > text(chem.predict[,1:2], labels = km$cluster, col = col.vec[km$cluster])

**5.16 Scatterplots: controlling plotting characters (men vs women, etc.)
(11/11/2002 Paul Johnson <pauljohn@ku.edu>)

Remember pch uses integers to select plotting characters.

 x <- 1:10
 y <- 1:10
 res <- -4:5
 plot(x, y, pch = ifelse(res < 0, 20, 1))

If true, use character 20. Else use character 1. Experiment with different numbers there.

Then note you can create a vector of integers to control symbols, as in

 g <- 15:24
 plot(x, y, pch=g)

(From a post by Uwe Ligges, 11/7/2002)

Here's another cool example from Roger Bivand (11/7/2002),

 > x <- rnorm(25)
 > y <- rnorm(25)
 > z <- rnorm(25)
 > pchs <- c(1, 20)
 > pchs[(z < 0)+1]
 [1] 20 20 20 20 1 1 20 20 1 20 20 20 20 20 1 20 1 ...
 > plot(x, y, pch=pchs[(z < 0)+1])
 > text(x, y, labels=round(z, 2), pos=1, offset=0.5)

Roger adds, "This "cuts" the z vector at zero, using the convenient sleight-of-hand that TRUE and FALSE map to integers 1 and 0, and thus gives the pch argument in plot() or points() a vector of values of indices of the pchs vector. More generally, use cut() to break a numeric vector into class intervals (possibly within ordered())."

**5.17 Scatterplots with modified size and color
(12/11/2002 Paul Johnson <pauljohn@ku.edu>)

 test <- c(2,6,4,7,5,6,8,3,7,2)
 plot(test, type="n", main="Color/size test plot", ylab="Size scale",
   xlab="Colorscale")
 colsiz <- function(yvec) {
   ymin <- min(yvec)
   cexdiv <- max(yvec)/2
   for (i in 1:length(yvec)) {
     nextcex <- (yvec[i] - ymin)/cexdiv + 1
     par(cex = nextcex)
     points(i, yvec[i], type = "p", col = i)
   }
 }
 colsiz(test)


Note that the size here ranges from 1 to 2. You can change that by fiddling with the calculation of 'cexdiv'.
(from Jim Lemon)

Paul Murrell posted this nice example (11/7/2002):

 x <- rnorm(50)
 y <- rnorm(50)
 z <- rnorm(50)
 pch <- rep(1, 50)
 pch[z < 0] <- 20
 cex <- (abs(z)/max(abs(z))) * 4.9 + 0.1
 plot(x, y, pch=pch, cex=cex)

**5.18 Scatterplots: adjusting point size according to a third variable
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Have a look at:

 ?symbols

e.g.:

 symbols(1:10,1:10,circles=1:10, inches=0.1)

(from Uwe Ligges)

**5.19 Scatterplots: smoothing the line connecting points
(02/06/2003 Paul Johnson <pauljohn@ku.edu>)

Rado Bonk asked how to smooth a line connecting some points. One answer was, "You may be interested in spline(). For example:

 x <- 1:5
 y <- c(1,3,4, 2.5,2)
 plot(x, y)
 sp <- spline(x, y, n = 50)
 lines(sp)

(from Roger Peng)


**5.20 Regression scatterplot: adding estimates to the plot
(18/06/2001 Paul Johnson <pauljohn@ukans.edu>)

I don't know how to get the estimate of the line printed, but I have seen one example of how to get the R-square:

I used this to indicate the R-squared value on a regression plot:

 text(5, 0.6, as.expression(substitute(R^2 == r, list(r=round(R2.the,3)))))

where R2.the was computed a few lines above. (from Emmanuel Paradis)

Or

 that <- 1
 plot(1:10)
 title(substitute(hat(theta) == that, list(that=that)))

(from Brian Ripley)


**5.21 Controlling axes: tick marks, removing tick marks, number of ticks, etc.
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Ticks, but no numbers:

  plot(1:10, xaxt="n")
  axis(1, labels=FALSE)

This leaves off the x and y titles. You still have the numerical labels next to each tick.

 plot(x,y,ann=FALSE)

If you want a lot of control you can do

 plot(x,y,ann=FALSE,axes=FALSE)
 box()  
 axis(side=1,labels=TRUE)
 axis(side=2,labels=TRUE)
 mtext("Y", side=2, line=2)
 mtext("X", side=1, line=2)

Do ?axis for details. (from Bill Simpson)

Control range of axes through the graph command itself. This shows the "image" function:

 > x <- 1:50  
 > y <- 1:50
 > z <- outer(x,y)
 > image(z)
 > image(z, xlim=c(0.2,0.5))
 > image(z, xlim=c(0.2,0.5), ylim=c(0,0.5))

The same works with contour(). (from Kaspar Pflugshaupt)


**5.22 Axes: rotating labels
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

At one time, apparently par(las=) would do this, as in

 par(las=2) or par(las=3)

to make the labels perpendicular to the axes.

More recently, Paul Murrell has observed:

"The drawing of xlab and ylab ignores the par(las) setting.

...You can override this "sensible" default behaviour if it annoys you. For example, you can stop plot() drawing xlab and ylab by something like plot(ann=F) or plot(xlab="", ylab="") AND you can draw them yourself using mtext(), which does listen to par(las).

A couple of examples of what I mean ...

    par(mfrow=c(2,2), mar=c(5.1, 4.1, 4.1, 2.1), las=0)
    plot(0:10, 0:10, type="n")
    text(5, 5, "The default axes")
    box("figure", lty="dashed")
    par(las=1)
    plot(0:10, 0:10, type="n")
    text(5, 5, "The user says horizontal text please\n\nbut R knows better !")
    box("figure", lty="dashed")
    par(las=1, mar=c(5.1, 6.1, 4.1, 2.1))
    plot(0:10, 0:10, type="n", ann=F)
    mtext("0:10", side=2, line=3)
    mtext("0:10", side=1, line=3)
    text(5, 5, "The user fights back !")

(posted 6/11/2001)


**5.23 Axes: adding formatted dates to an axis
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Try package "date", available at CRAN.

Example for plotting:

 library(date)
 plot.date(.....) 
 ## if x-values are dates, you can also use plot(...) after
 library(date)

(from Uwe Ligges)

**5.24 Axes: reversing the axes in a plot
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

In R1.1.1, there is a bug, but in later versions Ross Ihaka says do the following:

 x <- rnorm(100)
 y <- rnorm(100)
 plot(x, y, xlim=rev(range(x)), ylim=rev(range(y)))

and get the result you expect.

Ben Bolker said, at this time (R1.1.1), you have to do something like

 x <- rnorm(100)
 y <- rnorm(100)
 plot(max(x)-x,max(y)-y,axes=FALSE)
 xlabvals <- pretty(x,5)
 ylabvals <- pretty(y,5)
 axis(side=1,at=max(x)-xlabvals,labels=xlabvals)
 axis(side=2,at=max(y)-ylabvals,labels=ylabvals)

I think it's time to put this one in the FAQ ... (although, again, I don't know if I would have found it just searching the archives if I hadn't remembered that it was there in the first place).

Another person said try (if you don't want axes):

 plot(x, -y, axes=FALSE)
 axis(1)
 axis(2)

**5.25 Axes: labeling axes with dates
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

I have to plot the data set like the following. 
 > x <- c("05/27/00","06/03/00","08/22/00",...)
 > y <- c(10,20,15,...)
You should try the package "date", available at CRAN. Example:

 library(date)
 x <- as.date(c("05/27/2000","06/03/2000","08/22/2000"))
 y <- c(10,20,15)
 plot(x,y)

Uwe Ligges

Here is one using chron.

 library(chron)
 x <- dates(c("05/27/00","06/03/00","08/22/00"))
 y <- c(10,20,15)
 plot(x, y)

which will work if you were in the USA and your first string means 2000 May 27, 2000-05-27 in ISO 8601 date format. (R 1.2.0 will have facilities to read those as well as US and European shorthands.) (from Brian D. Ripley)

**5.26 Axes: superscripts in axis labels
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 plot(c(1:10), ylab= expression(x^"++")) (from Peter Dalgaard)

**5.27 Axes: adjusting the position
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

Question: When I plot, the default x axis is placed at the bottom of the chart and the y axis at the left of the chart. Is it possible to make the axes pass through the origin?

Maybe this helps:

 plot(...., xaxt="n", yaxt="n")
 axis(1, pos=0)
 axis(2, pos=0)

Uwe Ligges

**5.28 Adding "error arrows" to a scatterplot
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

I think using arrows(..., code=3, angle=90, ...) is quite simple, e.g.:

        x <- rnorm(10)
        y <- rnorm(10)
        se.x <- runif(10)
        se.y <- runif(10)
        plot(x, y, pch=22)
        arrows(x, y-se.y, x, y+se.y, code=3, angle=90, length=0.1)
        arrows(x-se.x, y, x+se.x, y, code=3, angle=90, length=0.1)


The first arrows() draws the error bars for y, and the second one for x, 'code=3' draws a head at both ends of the arrow, 'angle=' is the angle of the head with the main axis of the arrow, and 'length=' is the length of the head. You can also add usual graphic parameters (col, lwd, ...).

Emmanuel Paradis

**5.29 Time series: plotting several lines in one graph
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

To do the two in the same graph:

 plot(x)
 lines(y)

or

 points(y)

or

 matplot(cbind(x,y),type="l")

To do separate graphs one above the other:

 par(mfrow=c(2,1))
 plot(x)
 plot(y)

You can do other things like set colors, ranges, axis labels, etc.

(from Jon Baron 12/29/2001)

**5.30 Time series: plotting fitted values and actual data
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 > library(ts)
 > data(LakeHuron)
 > md <- arima0(LakeHuron, order=c(2,0,0), xreg=1:98)
 > plot(LakeHuron)
 > lines(LakeHuron-md$resid, col="red")

(from Adrian Trapletti)

**5.31 Inserting text into a plot
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

I want this in a plot: "The average temperature is 23.1 degrees."

 text(x,y,paste("The average temperature is ", variable, "degrees"))

(from Thomas Lumley)

**5.32 Plotting variables that blow up
(07/12/2000 Paul Johnson <pauljohn@ukans.edu>)

Graphs blow up if a variable is unbounded, so if you are plotting tan(x), for example:

You probably want to plot something like 
 pmax(-10,pmin(10,tan(x))) 
or 
 ifelse(abs(tan(x))>10,NA,tan(x))

(from Thomas Lumley)

**5.33 Labels with dynamically generated content/math markup
(16/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Don Wingate wrote "More generally, what I need to do is dynamically create a vector of math expressions (using the powers of string manipulation), which can then be used as the textual values for a plot function."

Uwe Ligges answered: What about something like:

   numbers <- 1:2  # the dynamical part
   my.names <- NULL
   for(i in numbers) 
      my.names <- c(my.names, 
         eval(as.expression(substitute(expression(x[i]^2), list(i=i)))))
   dotplot(1:2, labels = my.names)


**5.34 Using complicated things like math expressions in plot titles
(11/11/2002 Paul Johnson <pauljohn@ku.edu>)

Use substitute(). Something like

 e<-substitute(expression(paste("Power plot of ", x ^ alpha, " for ",
         alpha == ch.a)), list(ch.a=formatC(alpha,wid=1)))
 plot(x, x^alpha, main=e)

(from Peter Dalgaard)

Suppose you have

 > that <- 1
 > plot(1:10)

and you need to specify "that" in the title. A naive attempt like

 title(expression(paste(hat(theta),'= ',that)))

prints the literal name "that" rather than its value.

Generally, to get an expression either use parse on your pasted character string or substitute on an expression. The neatest is

 title(substitute(hat(theta) == that, list(that=that)))

(note it is == not =)

(from Roger Kroenker, from Brian Ripley)

Want a variable value in the title, as in "Data at 45° North":

 lat<-45
 plot(1,main=substitute("Data at "*lat*degree*" North",list(lat=lat)))

(from Thomas Lumley)

You can use a different way: first export the plot from R as an (e)ps file and then use the LaTeX psfrag package to add text and formulae. (from Jan Krupa)

I want a Greek letter subscripted in the title, as in

[Greek] gamma [subscript]1 = [Roman] Threshold 1

Try:

  plot(1,1,main=expression(gamma[1]=="Threshold 1"))

(from Thomas Lumley)

Try 
 help(plotmath)

**5.35 Using color-coded points in a scatterplot to show missing values of a third variable
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 plot(x, y, col=ifelse(is.na(z), "red", "black"))  # (from Peter Dalgaard)
 plot(x[!is.na(z)], y[!is.na(z)], xlim=range(x), ylim=range(y))
 points(x[is.na(z)], y[is.na(z)], col=2)  # (from Ross Darnell)

**5.36 lattice: miscellaneous examples
(12/11/2002 Paul Johnson <pauljohn@ku.edu>)

The lattice library came along recently and I've not explored it much. But here is a usage example from Paul Murrell, emphasizing the aspect setting:

 library(lattice)
 # 10 rows
 x <- matrix(rnorm(100), ncol=10)
 lp1 <- levelplot(x, colorkey=list(space="bottom"),  aspect=10/10)
 print(lp1)
 # 5 rows -- each row taller than in previous plot
 x <- matrix(rnorm(50), ncol=5)
 lp2 <- levelplot(x, colorkey=list(space="bottom"),aspect=5/10)
 print(lp2)

This one from Deepayan Sarkar (deepayan@stat.wisc.edu) is especially neat.

 library(lattice)
 xyplot(rnorm(100) ~ rnorm(100) | gl(2, 50),
    strip = function(factor.levels, ...) {
      strip.default(factor.levels = expression(1 * degree, 2 * degree), ...)
    })

**5.37 Three-dimensional scatterplots
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Type

 ?persp
 ?contour
to see about features in the base package.

There is a scatterplot3d package on CRAN.

See also:

 http://www.statistik.uni-dortmund.de/leute/ligges.htm (from Uwe Ligges)

See also:

Get Xgobi (an X windows graphics library) and the R contrib package "xgobi". The web page for Xgobi is http://lib.stat.cmu.edu/general/XGobi/ This does 3-D graphs, rotations, etc. Very nice. Mainly for Unix/X users.

**5.38 Contour plots with line types reflecting the values
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Suppose you want a contour plot with solid contour lines for positive values and dotted contour lines for negative values.

The trick is to specify the levels and to use the add= argument.

        x <- seq(-1, 1, length=21)
        f <- function(x,y) (x^2 + y^2) / 2 - 0.5
        z <- outer(x, x, f)
        contour(z, levels = seq(-0.5, -0.1, by = 0.1), lty = "dotted")
        contour(z, levels = 0, lty = "dashed", add = TRUE)
        contour(z, levels = seq(0.1, 1, by = 0.1), add = TRUE)

(from Ross Ihaka)

**5.39 Animating graphs
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

It is quite easy to do with ImageMagick (www.imagemagick.org), which can be installed on most OSes. I tried this simple sequence and it worked beautifully.

In R, create 5 histograms:

 > for(i in 1:5) {
 + jpeg(paste("fig", i, ".jpg", sep = ""))
 + hist(rnorm(100))
 + dev.off()
 + }

Then from the command line (I tried it using Linux, though it should be the same on any platform):

 % convert -delay 50 -loop 50 fig*.jpg animated.gif

This created animated.gif, a nice animation of my sequence of files. You can control the timing of the animation by playing with -delay and -loop. (from Matthew R. Nelson, Ph.D.)


**5.40 Making the background color of the graph's user region differ from the margins
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

Use "par" to find out where the boundaries/axes are, and "rect" to draw the picture ... the 6th plot in demo(graphics) does the same thing.

For example:

 ## code for coloring background
 x <- runif(10) 
 y <- runif(10)
 plot(x,y,type="n") 
 u <- par("usr") 
 rect(u[1],u[3],u[2],u[4],col="magenta") 
 points(x,y)

(from Ben Bolker)

(pj: see next because it includes a usage of do.call)

I think one is stuck with things like

 plot(2:10,type="n")
 do.call("rect",c(as.list(par("usr")[c(1,3,2,4)]),list(col="pink")))
 points(2:10)

or, for the middle line, somewhat simpler:

 bb <- par("usr")
 rect(bb[1], bb[3], bb[2], bb[4], col="pink")

(from Peter Dalgaard )

**5.41 Nifty graph tricks that are good to know (misc)
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

1. example(rug). 'nuff said.
2. jitter

**5.42 Some graph code that actually works (misc)
(18/06/2001 Paul Johnson <pauljohn@ukans.edu>)

*** Point plot, overlaid lines, legend, PS output
 x <- c(1,2,3,4)
 y1 <- c(1,2,3,4)
 y2 <- c(2,2,2,2)
 Fred <- c(1,2)
 postscript(file="d:/Bob/Papers/IFM/try2.ps") 
 plot(x,y1, type="l") 
 lines(x,y2,lty="33")  
 legend(1,4, c("y1","y2"), lty=Fred) 
 graphics.off()

(from Anon.)

*** Graphs of the sine and cosine functions; an example using a layout function

 layout.my <- function (m, n) {
     par(mfrow = c(m, n))
 }
 x <- 0:12566 / 1000 # Range from 0 to 4*pi
 layout.my( 1, 2 ) 
 plot( sin(x), type = "l", xaxs = "i", yaxs = "i", axes = F, 
          xlab = "x", ylab = "sin(x)", main = "Y = sin(x), x = [ 0, 720 ]" ) 
 axis( 2, at = seq( -1, 1, by=1 ), las = 2 )
 box(lty="dotted")
 abline( h = 0, lwd = 1 )
 plot( cos(x), type = "l", xaxs = "i", yaxs = "i", axes = F, 
          xlab = "x", ylab = "cos(x)", main = "Y = cos(x), x = [ 0, 720 ]" ) 
 axis( 2, at = seq( -1, 1, by=1 ),las = 2 ) 
 box(lty="dotted") 
 abline( h = 0, lwd = 1 )

(from Ko-Kang Wang)

*** Plotting regular polygons

Below is a function that draws regular polygons, filled or outlined, with n sides (n>8 => circle) and "diameter" prescribed in mm, for use alone or with apply.

 ngon <- function (xydc, n=4, type=1)
   # draw or fill regular polygon
   # xydc a four element vector with
   # x and y of center, d diameter in mm, and c color
   # n number of sides of polygon, n>8 => circle
   # if n odd, vertex at (0,y), else midpoint of side
   # type=1 => interior filled, type=2 => edge
   # type=3 => both
   {
     u <- par("usr")
     p <- par("pin")
     d <- as.numeric(xydc[3])
     inch <- d/25.4
     rad <- inch*((u[2]-u[1])/p[1])/2
     ys <- inch*((u[4]-u[3])/p[2])/2/rad
     if (n > 8) n <- d*4 + 1
     th <- pi*2/n
     costh <- cos(th)
     sinth <- sin(th)
     x <- y <- rep(0, n+1)
     if (n %% 2) {
       x[1] <- 0
       y[1] <- rad
     } else {
       x[1] <- -rad*sin(th/2)
       y[1] <- rad*cos(th/2)
     }
     for (i in 2:(n+1)) {
       xl <- x[i-1]
       yl <- y[i-1]
       x[i] <- xl*costh - yl*sinth
       y[i] <- xl*sinth + yl*costh
     }
     x <- x + as.numeric(xydc[1])
     y <- y*ys + as.numeric(xydc[2])
     if (type %% 2) polygon(x, y, col=xydc[4], border=0)
     if (type %/% 2) lines(x, y, col=xydc[4])
     invisible()
   }

(from Denis White)

*** Do-it-yourself axes
 plot(1:1000, rnorm(1000), axes=FALSE) # plot with no axes 
 axis(1, at = seq(0, 1000, by = 100)) # custom x axis 
 axis(2) # default y axis 
 box() # box around the plot

(from Ross Ihaka)


*** Drawing a cubic with its axes at the center of the plot

  #  Draw the basic curve (limits were established by trial and error).
  #  The axes are turned off and blank axis labels are used.
  curve(x * (2 * x - 3) * (2 * x + 3), from = -1.85, to = 1.85, 
    xlim = c(-2, 2), ylim = c(-10, 10), axes = FALSE, xlab = "", ylab = "")
  # Draw the axes at the specified positions (crossing at (0, 0)). 
  # The ticks positions are specified (those at (0,0) are omitted). 
  # Note the use of las=1 to produce horizontal tick labels on the 
  # vertical axis.
  axis(1, pos = 0, at = c(-2, -1, 1, 2)) 
  axis(2, pos = 0, at = c(-10, -5, 5, 10), las = 1)
  # Use the mathematical annotation facility to label the plot.
  title(main = expression(italic(y==x * (2 * x - 3) * (2 * x + 3))))

(from Ross Ihaka)

*** Coloring a 3D surface like a contour plot, e.g. black for large z, white for small z

 # Create a simple surface f(x,y) = x^2 - y^2
  nx <- 21
  ny <- 21
  x <- seq(-1, 1, length = nx)
  y <- seq(-1, 1, length = ny)
  z <- outer(x, y, function(x,y) x^2 - y^2)


*** Average the values at the corners of each facet and rescale to [0,1]
 # We will use this to select a gray for colouring the facet.
 hgt <- 0.25 * (z[-nx,-ny] + z[-1,-ny] + z[-nx,-1] + z[-1,-1]) 
 hgt <- (hgt - min(hgt))/ (max(hgt) - min(hgt))

*** Draw the surface with the facet colors specified
 persp(x, y, z, col = gray(1 - hgt), theta = 35) 
 persp(x, y, z, col = cm.colors(10)[floor(9 * hgt + 1)], theta = 35)

(from Ross Ihaka)

*** Example using the polygon function
 y <- c(1.84147098480790, 1.90929742682568, 1.14112000805987,
          0.243197504692072, -0.958924274663133,0.720584501801074,
         1.65698659871879, 1.98935824662338, 1.41211848524176,
         0.45597888911063)
 plot(y,ylim=c(0.0000000001,2),type="l",log="y")
 x <- 1:10
 temp <- which(y<0)
 polygon(x[-temp], y[-temp], col="red")

Uwe Ligges

*** Side-by-side output with matplot
 (Peter Dalgaard, June 11, 2001)

 p <- 1:3
 u <- matrix(c(1,1,1,2,2,2,3,3,3), 3)
 r <- p*u
 X11(height=3.5, width=7)
 par(mfcol=c(1,3), cex=1.5, mex=0.6, mar=c(5,4,4,1)+.1)
 matplot(p, r, type="b", main="A", col="black")
 matplot(log(p), log(r), type="b", main="B", col="black")
 r <- p+u
 matplot(p, r, type="b", main="C", col="black")

//////////////////////////////////////
*6 Routine statistical chores
//////////////////////////////////////

**6.1 Crosstabulations
(12/02/2002 Paul Johnson <pauljohn@ku.edu>)

A crosstab facility xtabs() was introduced in R1.2. Read all about it in the help. The table() function is there too.

These give counts, not percentages. There are special functions for that. Here's code that creates a table and then calls "prop.table" to get percents on the columns:
 > x <- c(1,3,1,3,1,3,1,3,4,4)
 > y <- c(2,4,1,4,2,4,1,4,2,4)
 > hmm <- table(x,y)
 > prop.table(hmm,2) * 100

If you want to sum the column, there is a function "margin.table()" for that, but it is just the same as doing a sum manually, as in:

 apply(hmm, 2, sum)

which gives the counts of the columns.
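
Using the built-in function instead:

 margin.table(hmm, 2)  # column totals, same as apply(hmm, 2, sum)
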
If you have an old R, here is a "do it yourself" crosstab

 count <- scan()
 65 130 67 54 76 48 100 111 62 34 141 130 47 116 105 100 191 104
 s <- c("Lo", "Med", "Hi")
 satisfaction <- factor( rep(c(s,s), rep(3,6)),  levels = c("Lo","Med","Hi"))
 contact <- factor( rep(c("Low","High"), rep(9,2)), levels = c("Low","High"))
 r <- c("Tower", "Flat", "House")
 residence <- factor(rep(r,6))
 tapply(count,list(satisfaction,contact,residence),sum)


**6.2 t tests
(18/07/2001 Paul Johnson <pauljohn@ukans.edu>)

Index the vector by group:

        t.test(x[sex == 1], x[sex == 2])

should do the trick. You could try:

        summary(aov(x ~ sex))

which is equivalent but assumes equal variances. (from Mark Myatt)

The power of the test can be calculated:

 power.t.test()

see help(power.t.test)
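
For example (the group size and effect size here are arbitrary):

 power.t.test(n=20, delta=1, sd=2, sig.level=0.05)  # power for n=20 per group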

**6.3 Testing for normality
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

Use shapiro.test(x) to test whether x comes from a normal distribution; x must be a numeric vector.
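
A quick sketch:

 x <- rnorm(100)
 shapiro.test(x)  # large p-value: no evidence against normality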

**6.4 Analyzing subsets of data (summarizing or modeling subgroups)
(06/04/2001 ? <?>)

Question: I want summary information by subgroup.

Answer: This can be done with tapply, or the by() function, which is a "wrapper" (that means "simplifying enhancement") of tapply.

Use it like so. If you have groups defined by a factor A, then to get summaries of variables x, y, and z in dataframe "fred":

 > attach(fred)
 > by(fred[, c("x","y","z")], A, summary)

You can make a list of factors for a multidimensional set of groups, and many different functions can be put in place of "summary". See the by documentation.

Before by, we did it like this, and you may need it for some reason.

Get the average of a variable "rt" for each subgroup created by combinations of values for 3 factors.

 tapply(rt,list(zf, xf, yf), mean, na.rm=T)

alternatively,

 ave(rt, zf, xf, yf, FUN=function(x)mean(x,na.rm=T))

(from Peter Dalgaard)


/////////////////////////////////////////////////////////
*7 Model fitting (regression-type things)
/////////////////////////////////////////////////////////

**7.1 Tips for specifying regression models
(12/02/2002 Paul Johnson <pauljohn@ku.edu>)

1. Rather than

 lm(dataframe$varY~dataframe$varX)
, do

 lm(varY~varX,data=dataframe)

2. To fit a model with a "lagged" variable, note there is no built-in lag function for ordinary vectors; lag in the ts package does a different thing. So consider:

 mylag<-function(x,d=1) {
    n<-length(x)
    c(rep(NA,d),x)[1:n]
 }
 lm(y~mylag(x))
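
A quick check of mylag's behavior:

 mylag(1:5)     # NA  1  2  3  4
 mylag(1:5, 2)  # NA NA  1  2  3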

**7.2 Summary methods: extracting results from output objects
(17/02/2002 Paul Johnson <pauljohn@ku.edu>)

Brian Ripley: By convention summary methods create a new object of class summary.foo which is then auto-printed (by a print.summary.foo method).

Basically, you find out what's in a result, then grab it. Suppose a fitted model is "fm3DNase1":

 > names(summary(fm3DNase1))
 [1] "formula"      "residuals"    "sigma"        "df"           "cov.unscaled"
 [6] "correlation"  "parameters"  
 > summary(fm3DNase1)$sigma
 [1] 0.01919449
 > summary(fm3DNase1)$parameters
 Estimate Std. Error t value Pr(>|t|)
 Asym 2.345179 0.07815378 30.00724 2.164935e-13
 xmid 1.483089 0.08135307 18.23027 1.218523e-10
 scal 1.041454 0.03227078 32.27237 8.504308e-14

# for the standard errors:

 > summary(fm3DNase1)$parameters[,2]
 Asym xmid scal
 0.07815378 0.08135307 0.03227078

(from Rashid Nassar)

The t-values are

 summary(fm3DNase1)$coefficients[, "t value"]

and the standard errors are

 summary(fm3DNase1)$coefficients[, "Std. Error"]

You can also use

 vcov(fm3DNase1)

to get the estimated covariance matrix of the estimates.

**7.3 Computing a coefficient for each level of a factor
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

grp is a factor. Then:

 > lm(y ~ grp:x, data=foo)
 Call: lm(formula = y ~ grp:x, data = foo)
 Coefficients:
 (Intercept)      grpa.x      grpb.x      grpc.x
    -0.80977     0.01847     0.13124     0.14259

Martyn

**7.4 Comparing regression model fits (F test of whether some coefficients are zero)
(14/08/2000 Paul E Johnson <pauljohn@ukans.edu>)

  summary(lm.sml <- lm(y ~ x1))
  summary(lm.big <- lm(y ~ x1 + x2 + x3 + x4))
  anova(lm.sml, lm.big)

now will do a test of b2 = b3 = b4 = 0

Look at

 demo(lm.glm)

where models "l1" and "l0" are compared that way. (from Martin Maechler)

Here's another one along the same lines. Bill Venables (Feb. 2, 2000) said the philosophy is: "You fit a larger model that contains the given model as a special case and test one within the other." Suppose you have only one predictor with repeated values, say x, and you are testing a simple linear regression model. Then you can do the test using

 inner.mod <- lm(y ~ x, dat)
 outer.mod <- lm(y ~ factor(x), dat)
 anova(inner.mod, outer.mod)

That does the test for lack of fit. If you have several predictors defining the repeated combinations, all you need do is paste them together, for example, and make a factor from that. (from Bill Venables)

Suppose your data frame df has some column, say, ID, which identifies the various cases, and you fitted

 fit1 <- lm(y ~ rhs, data=df)

Now do

 fit2 <- lm(y ~ factor(ID), data=df)
 anova(fit1, fit2, test="F")

e.g.

 set.seed(123)
 df <- data.frame(x = rnorm(10), ID=1:10)[rep(1:10, 1+rpois(10, 3)), ]
 df$y <- 3*df$x+rnorm(nrow(df))
 fit1 <- lm(y ~ x, data=df)
 fit2 <- lm(y ~ factor(ID), data=df)
 anova(fit1, fit2, test="F")

there is an R package lmtest on CRAN, which is full of tests for linear models. (from Brian Ripley)

The models with factor() as the independent variable are the most general because they individually fit each category, whereas using the variable X assumes linearity. (from me!)

Here's another example. "I want to know how much variance of pre/post-changes in relationship satisfaction can be explained by changes in iv1 and iv2 (dv ~ iv1 + iv2), by changes in iv3 and iv4 (dv ~ iv3 + iv4), and I want to know whether a model including all of the iv's explains a significant amount of variance over these two models (dv ~ iv1 + iv2 + iv3 + iv4)."

 model.0  <- lm(dv ~ 1)
 model.A  <- lm(dv ~ iv1 + iv2)
 model.B  <- lm(dv ~ iv3 + iv4)
 model.AB <- lm(dv ~ iv1 + iv2 + iv3 + iv4)
 anova(model.AB, model.A, model.0) 
 anova(model.AB, model.B, model.0)

(from Ragnar Beer)

**7.5 Getting predicted values from a model with predict()
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

 > lot <- c(30,20,60,80,40,50,60,30,70,60)
 > hours <- c(73,50,128,170,87,108,135,69,148,132)
 > z1 <- lm(hours~lot)
 > new <- data.frame(lot=80)
 > predict(z1,new,interval="confidence",level=0.90)
      fit      lwr      upr
 [1,] 170 166.9245 173.0755

Please note, there is a difference between "confidence" and "prediction" in this command: "confidence" gives a confidence interval for the mean response at the new value, while "prediction" gives a (wider) interval for an individual new case.
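
For the individual-case version of the example above:

 > predict(z1, new, interval="prediction", level=0.90)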

**7.6 Polynomial regression
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Use the I() function to protect the ^2 and ^3 from being evaluated as part of the formula, i.e.

 formula = Response ~ Var1 + I(Var1^2) + I(Var1^3)

(from Douglas Bates)

I was doing this in a practical class yesterday. There is another way,

 Response ~ poly(Var1, 3)

which fits the same cubic model by using orthogonal polynomials. The fitted values will be the same, but the coefficients are those of the orthogonal polynomials, not the terms in the polynomial. You might like to compare the two approaches. (from Brian D. Ripley)

**7.7 Computing the "p" value of the F statistic from a regression
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

a$fstatistic returns the actual f-test and df, but not the p-value. Get a linear model, "grab" the f part of the output, use the f distribution (pf).

 foo <- lm(y~x)
 a <- summary(foo)
 f.stat <- a$fstatistic
 p.value <- 1-pf(f.stat["value"],f.stat["numdf"],f.stat["dendf"])

(from Guido Masarotto)

**7.8 Comparing fits in stepwise regression/ANOVA (F tests)
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Begin with:

 data(stackloss)
 fm1 <- lm(stack.loss ~ ., data=stackloss)
 anova(fm1)

Add more models:

 fm2 <- update(fm1, . ~ . - Water.Temp)
 fm3 <- update(fm2, . ~ . - Air.Flow)
 anova(fm2, fm1)
 anova(fm3, fm2)

Note that the SSqs are all the same, but the sequential table compares them to the residual MSq, not to the next-larger model (from Brian D. Ripley)

**7.9 Testing the significance of slope and intercept shifts (Chow test?)
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Mark M. Span asked this: I have a four-parameter, biexponential model. What I want to evaluate is whether the parameter values differ between (groups of) tasks.

 
 myfunc    <-formula(x ~ a*exp(-b*age) + (c*exp(d*age)) )

both x (representing response time) and age are variables in the current scope. At the moment I fit the model for each task separately. But now I cannot test the parameters for equality.

Answer: Use a model parametrizing a,b,c by groups and one without, and use anova to compare the models. A worked example is in V&R3, page 249. (from Brian Ripley)

**7.10 Want to estimate a nonlinear model?
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Use the nls library:

        library(nls)
        ## ... get data; nls needs named parameters with starting values
        model.fit <- nls(y ~ exp(a * x), start = list(a = 1))
        summary(model.fit)


**7.11 Quasi families and how to pass arguments to them
(12/11/2002 Paul Johnson <pauljohn@ku.edu>)

(from Halvorsen) I have the following simple function:

 testcar <- function(pow) {      
   ob <-glm(Pound~CG+Age+Vage, 
                 data=car,weights=No,subset=No>0,family=quasi(link=power(pow),
                 var=mu^2))           
   deviance(ob) 
 }

But trying to run it gives: 
 > testcar(1/2)
 Error in power(pow) : Object "pow" not found

A more reliable version (replacement) is

 eval(substitute(glm(Pound~CG+Age+Vage, data=car,
   weights=No, subset=No>0, family=quasi(link=power(pow), var=mu^2)),
   list(pow=pow)))

(from Thomas Lumley)

**7.12 Computing the covariance matrix
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

 var(X)

or, if you want the covariance matrix of a fitted model, don't forget you can take the results from summary() as described above.

Also note package MASS has a generic function vcov to pull out variance-covariance estimates of coefficient estimates, so you can do

 > vcov(fm3DNase1)
          Asym      xmid      scal
 Asym 0.006108 0.0062740 0.0022720
 xmid 0.006274 0.0066183 0.0023794
 scal 0.002272 0.0023794 0.0010414

too. As it's generic, it works for most fitted model objects.

**7.13 Estimating confidence intervals for a binomial variable
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

In R, I can use

 library(ctest)
 binom.test(15, 20)

to get a confidence interval, but under the assumption of an infinite population. (from RINNER Heinrich)

To get an estimate corrected for a finite population, the tools must come from the hypergeometric distribution (phyper):

 uniroot(function(x)phyper(15,x,100-x,20)-0.025,,0,100)$root
 uniroot(function(x)phyper(14,x,100-x,20)-0.975,,0,100)$root

(from Peter Dalgaard)

**7.14 Saving the standard errors from an estimated model
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

If you need variance-covariance matrices, the vcov function in the MASS package extracts them from fitted models. Or, to save just the "standard errors" of the coefficients (the square roots of the diagonal), do

 library(MASS) 
 stdErrVector<-sqrt(diag(vcov(lm.D9)))

Or, the hard way,

 > example(lm)
 > summary(lm.D9)
 > names(.Last.value)
 > summary(lm.D9)$coeff
 > summary(lm.D9)$coeff[, "Std. Error"]


**7.15 Controlling the number of significant digits in output
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

 > summary(c(123456,1,2,3), digits=7)
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
      1.00      1.75      2.50  30865.50  30866.25 123456.00

and you get what you want. In most cases the number of "significant digits" printed out by summary methods in R is truncated to 3 or 4. (from Robert Gentleman)

**7.16 ANOVA with "unbalanced" or missing data
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Question: In a split-plot design if there is a missing subunit summary gives me a table with two rows for the same factor, one in the error within section and one in the section using error between units. With no data missing the table is "normal". How does one interpret the table when data is missing?, or is it that aov cannot cope with missing values in this case?

Answer (from Brian Ripley): aov() does cope, but the conventional analysis is indeed as you describe. Factors do have effects in more than one stratum. For example, consider the oats example in Venables & Ripley (1999). If I omit observation 70, I get

 > summary(oats.aov)
 Error: B
            Df Sum of Sq  Mean Sq   F Value     Pr(F)
 Nf          1    841.89  841.890 0.2242563 0.6605042
 Residuals   4  15016.57 3754.142
 Error: V %in% B
            Df Sum of Sq  Mean Sq  F Value     Pr(F)
 Nf          1  1100.458 1100.458 1.747539 0.2187982
 V           2  1156.821  578.410 0.918521 0.4335036
 Residuals   9  5667.471  629.719
 Error: Within
            Df Sum of Sq  Mean Sq  F Value     Pr(F)
 Nf          3  19921.29 6640.431 37.23252 0.0000000
 Nf:V        6    408.96   68.160  0.38217 0.8864622
 Residuals  44   7847.41  178.350

so Nf appears in all three strata, not just the last one. The `recovery of intra-block information is needed'.

If you have an unbalanced layout (e.g. with a missing value) use lme to fit a model. V&R do this example in lme too.

**7.17 Multivariate analysis of variance
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

 > aov(cbind(y1,y2,y3)~x)

**7.18 Testing homogeneity of variances
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Bartlett's test, I like (personally). SPSS now uses Levene's test, some say incorrectly and nonrobustly. R offers the Fligner-Killeen (median) test, fligner.test() in ctest.

The test "does an anova on a modified response variable that is the absolute value of the difference between an observation and the median of its group (more robust than Levene's original choice, the mean)". (from Brian Ripley)

Incidentally, if you insist on having Levene's calculation, Brian Ripley showed how: "

 > Levene <- function(y, group)
   {
      group <- as.factor(group)  # precautionary
      meds <- tapply(y, group, median)
      resp <- abs(y - meds[group])
      anova(lm(resp ~ group))[1, 4:5]
   }
 > data(warpbreaks)
 > attach(warpbreaks)
 > Levene(breaks, tension)
       F value  Pr(>F)
 group   2.818 0.06905

I could (and probably would) dress it up with a formula interface, but that would obscure the simplicity of the calculation."

**7.19 Using nls to estimate nonlinear models
(14/08/2000 Paul E Johnson <pauljohn@ukans.edu>)

 x <- runif(100)
 w <- runif(100)
 y <- x^2.4*w^3.2+rnorm(100,0,0.01)
 plot(x,y)
 plot(w,y)
 library(nls)
 fit <- nls(y~x^a*w^b,start=list(a=2,b=3))
 l.est <- lm(log(y) ~ log(x)+log(w)-1)
 fit <-   nls(y~x^a*w^b,start=list(a=coef(l.est)[1],b=coef(l.est)[2]))

Error is a fairly common result of starting with parameter values that are far away from the true values, or have very different scales, so that the numeric-derivative part of nls fails. Various solutions are to do an approximate least-squares regression to start things off (also the "self-starting" models in the nls package), or to provide an analytic derivative. (all from Ben Bolker)

**7.20 Using nls and graphing the results
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

If you use the nls function in R to fit the model in the parameterization that Guido described, you can graphically examine the profile likelihood with

 pr <- profile(nlsObject)
 plot(pr)

For an example, try

 library(nls)
 example(plot.profile.nls)

(from Douglas Bates)

**7.21 -2 log(likelihood) and hypothesis tests
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

This is about polr (in package MASS), but the general point is that to compare likelihood models, you have to use your head and specify the comparison. Brian Ripley said:

As in the example on the help page

     options(contrasts=c("contr.treatment", "contr.poly"))
     data(housing)
     house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
     house.plr0 <- polr(Sat ~ 1, weights = Freq, data = housing)

Note that -2 LOG L is not actually well-defined as it depends on the grouping of the data assumed, and so is not a statistic. I assume that you want differences of that quantity, as in

 > house.plr0$deviance - house.plr$deviance
 [1] 169.7283

**7.22 Logistic regression with repeated measures
(02/06/2003 Paul Johnson <pauljohn@ku.edu>)

Hiroto Miyoshi described an experiment with 2 groups and pre/post measurement of a dichotomous dependent variable. Frank Harrell had this excellent response:

"There are several ways to go. GEE is one, random effects models another. One other approach is to install the Hmisc and Design packages (http://hesweb1.med.virginia.edu/biostat/s) and do (assume id is the unique subject identifier):

 f <- lrm(y ~ x1 + x2*x3 + ..., x=T, y=T) # working independence model
 g <- robcov(f, id)         # cluster sandwich variance adjustment
 h <- bootcov(f, id, B=100) # cluster bootstrap adjustment
 summary(g)  # etc.

**7.23 Logit
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Compute the odds ratio and its confidence interval from a logistic model (glm(..., family=binomial, ...)).

I show a simple function to do this in my introductory notes available from: http://www.myatt.demon.co.uk

Basically it is:

        lreg.or <- function(model)
          {
          lreg.coeffs <- coef(summary(model))
          lci <- exp(lreg.coeffs[ ,1] - 1.96 * lreg.coeffs[ ,2])
          or <- exp(lreg.coeffs[ ,1])
          uci <- exp(lreg.coeffs[ ,1] + 1.96 * lreg.coeffs[ ,2])
          lreg.or <- cbind(lci, or, uci)
          lreg.or
          }

(from Mark Myatt)

**7.24 Time series: basics
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

You have to load the ts library! try

 library(ts) 
 example(acf)

There are other packages with time series components. Try

 help.search("time")

to see whatever you have installed. Under my packages section of this document, I keep a running tally of packages that have time series stuff in them

Many time series procedures want to operate on a "time series" object, not raw numbers or a data frame. You can create such an object for "somedataframe" by:

 myTS <- ts(as.matrix(somedataframe), start=c(1992,1), frequency=12)

Note that start and frequency designate the time attributes.
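
Once the time attributes are set, other ts tools understand them; for instance, with the monthly series above:

 window(myTS, start=c(1993,1), end=c(1993,12))  # extract one calendar year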

**7.25 Time series: miscellaneous examples
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

This uses the ts package's filter(): A simple exponential weighted moving average procedure would be

 ewma <- function (x, lambda = 1, init = 0)
 {
   y <- filter(lambda*x, filter=1-lambda, method="recursive", init=init)
   return (y)
 }

Using ewma as a forecast function is known as exponential smoothing. Try, e.g.:

 x <- ts(diffinv(rnorm(100)))
 y <- ewma(x,lambda=0.2,init=x[1])
 plot(x)
 lines(y,col="red")

(from Adrian Trapletti)

**7.26 SAS-equivalent procedures
(14/08/2000 Paul E Johnson <pauljohn@ukans.edu>)

I need a better list here, but:

1. glm in R is not PROC GLM in SAS. In R the "G" is for "generalized", cf. GLIM or GENMOD in SAS. The SAS GLM (General Linear Model) functionality comes via lm() and friends in R. (from Peter Dalgaard)

**7.27 Misc: practical glm code
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

 x1 <- seq(from=-1,to=2, length=20)
 x2 <- (rep(1:5,4))/4
 x <- cbind(1,x1,x2)
 y <- c(0,1,1,1,1,1,0,0,1,1,1,0,1,0,1,1,1,1,1,1)
 glm.fit(x,y,fam=binomial(link=logit))$coeff

(from Vipul Devas) Leave off the $coeff to see all output.

/////////////////////////////////////////////////
*8 Packages: where to look
/////////////////////////////////////////////////

**8.1 Categorical data and multivariate models
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

See Jim Lindsey's package "ordinal": http://alpha.luc.ac.be/~plindsey/publications.html

And also Proportional Odds Logistic Regression in the MASS package

**8.2 Plotting a smooth curve through a scatterplot (Lowess, etc.)
(14/08/2001 Paul Johnson <pauljohn@ukans.edu>)

 library(modreg)

Users report confusion about the difference between lowess() in S and loess in R. Martin Maechler advised us, "loess() by default is not resistant/robust whereas lowess() is. ... I would recommend using loess(....., family = "sym")"

**8.3 Hierarchical/mixed linear models
(18/06/2001 Paul Johnson <pauljohn@ukans.edu>)

A user asked how to estimate what we call a "contextual regression": a model of individuals, many of whom are in the same subsets (organizations, groups, etc). This is a natural case for a mixed regression model.

José C. Pinheiro and Douglas M. Bates, Mixed-Effects Models in S and S-PLUS (Springer, 2000).

The nlme package for R has a procedure lme that estimates these models.

See also Jim Lindsey's repeated measurements package (available at http://alpha.luc.ac.be/~lucp0753/rcode.html).

**8.4 Time series tools?
(06/03/2003 Paul Johnson <pauljohn@ku.edu>)

Some time series tools are distributed with R itself, in the ts library. Type

 library(ts)
 ?ts
 ?arima0

DSE: Paul Gilbert said "My dse package handles multiple endogenous variables (y's) and multiple input or conditioning variables (which are often called exogenous variables, but sometimes are not truly exogenous). Integrating models are handled but you might need to consider what estimation technique to use. I have not implemented any integration tests (volunteers?). DSE also handles state space models. It does not implement time varying models or non-linear models, although it would provide a valuable framework for anyone who would like to do that." Look on CRAN! Or check at Paul's site, http://www.bank-banque-canada.ca/pgilbert.

For unevenly spaced time series data, take a look at Lindsey's growth and repeated libraries: www.luc.ac.be/~jlindsey/rcode.html (from Jim Lindsey)

In the "nlme" package, there is a continuous AR(1) class, corCAR1, whichshould be used in the case when the times are unequally spaced andmeasured on a continuous scale. (from Jose Pinheiro). The usage of both lme and nlme is described inJose C. Pinheiro and Douglas M. Bates, Mixed-Effects Models in S and S-PLUS,Springer,2000.

**8.5 Durbin-Watson tests
(02/06/2003 Paul Johnson <pauljohn@ku.edu>)

The car and lmtest packages both have functions for Durbin-Watson tests. (John Fox)

**8.6 Censored regression
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

You can do a censored regression with the survival5 library

 library(survival5)
 survreg(Surv(y,c), dist="normal")

where y is the censored dep-var and c is 1 if observed and 0 otherwise. (from Dan Powers)

**8.7 Estimate differential equation models?
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

- You can write your models using your favourite ODE solver library (in C or Fortran) and then dyn.load() the model, so you can use it inside R. It is not too difficult, and if you are used to a certain library where you have your models, it is probably the easiest way.
- There is a preliminary port to R of the nls2 library, which does nonlinear regression from differential equation models. It is a bit difficult to install, and is also not easygoing, but the nonlinear regression works well. In that library there is a way to actually compile a differential equation model and integrate it using the lsode routine from ODEPACK. So if you want to see how to plug an ODE solver into R in a more general way, you can start by having a look at http://www-bia.inra.fr/J/AB/nls2/welcome.html (from Jesus Maria Frias Celayeta)

**8.8 Tools to access SQL databases from R
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

Consider

 http://rpgsql.sourceforge.net/

Brian Ripley said "RODBC works well with MySQL, in my experience better than RMySQL."

There is a package RMySQL.

and other packages in CRAN's contrib/Devel.

**8.9 Discriminant analysis
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

The VR bundle of packages has linear and quadratic DA (lda, qda) in the MASS package, neural nets in nnet, and several other methods including k nearest neighbour in the class package. You find it in the contrib section of CRAN.

Classification trees can be fitted using both the tree and rpart packages. (Friedrich Leisch)

**8.10 Tools for log linear models
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

loglin() in the base package and loglm() in package MASS (from Thomas Lumley)

**8.11 Tools to estimate parameters of distributions
(13/08/2000 Paul Johnson <pauljohn@ukans.edu>)

The gnlr and gnlr3 functions in my gnlm library at www.luc.ac.be/~jlindsey/rcode.html will do this for a wide selection of distributions.... The gnlr function in my gnlm library will fit quite a variety of different overdispersed Poisson- and binomial-based distributions (i.e. phi different from one) using their exact likelihoods. (from Jim Lindsey)

Maximize the likelihood directly. There are worked examples in the Venables & Ripley on-line exercises and answers. Now for the gamma there are some shortcuts, as the MLE of the mean is the sample mean, and gamma.shape.glm in package MASS will fit the shape parameter by MLE (from Brian D. Ripley)

The log-likelihood function for the parameters alpha and beta given the data x is sum(dgamma(x, shape = alpha, scale = beta, log = TRUE)). To determine the maximum likelihood estimates, simply put the negative of the log-likelihood into the optim function.
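
A minimal sketch of that advice (the data here are simulated and the names are mine):

 x <- rgamma(200, shape=2, scale=3)  # pretend these are the observed data
 negll <- function(p)                # p = c(alpha, beta)
   -sum(dgamma(x, shape=p[1], scale=p[2], log=TRUE))
 optim(c(1,1), negll, method="L-BFGS-B", lower=c(0.01,0.01))$par  # near (2, 3)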

A worked-out example is given as xmp06.12 in the Devore5 library (available in the contrib section at CRAN). Use

 library(Devore5)
 help(xmp06.12)      # shows the description with comments
 example(xmp06.12)   # run the example to see the result

(from Douglas Bates)

**8.12 Routines for bootstrapping
(06/04/2001 Paul Johnson <pauljohn@ukans.edu>)

CRAN should have bootstrap and boot, two packages worth looking at. Most people say that "boot" is better, more actively maintained.

**8.13 Tools for robust regression
(14/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Robust regression is available in S; its name is rreg.

There's better robust regression in the VR bundle of packages.

 >library(MASS)
 >help(rlm)
 >library(lqs)
 >help(lqs)

(from Thomas Lumley)

You could try function rlm in package MASS (in the VR bundle). That has an MM option, and was originally written to work around problems with the (then) version of lmRobMM in S-PLUS. It uses lqs in package lqs, and that is substantially faster than the ROBETH-based code (as I started with that route). (from Brian D. Ripley)

**8.14 Testing for heteroskedasticity
(07/12/2000 Paul Johnson <pauljohn@ukans.edu>)

hmctest() in package lmtest performs the Harrison-McCabe test against heteroskedasticity. (from Torsten Hothorn)

**8.15 Tools for modeling Garch/heteroskedastic time series
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

The contrib package tseries (by Adrian Trapletti) contains Garch tools and also:

 NelPlo                   Nelson-Plosser Macroeconomic Time Series
 garch                    Fit GARCH Models to Time Series
 get.hist.quote           Download Historical Finance Data
 jarque.bera.test         Jarque-Bera Test
 na.remove                NA Handling Routines for Time Series

It can also do White's test, not the general heteroskedasticity test, but a more specific White statistic which tests for neglected nonlinearity in the conditional mean, essentially for a regression equation or linear time series model.

Adrian Trapletti, author of the tseries package, also added, "One very simple way to test conditional heteroskedasticity (time series setup) is to use Box.test (package ts) on the squared values of a time series, as

 Box.test(x^2, lag = 1, type="Ljung")


**8.16 Lowess   
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

R-base has "lowess"

The package "modreg" has "loess". The latter is a linear modeling-alike tool,and it can be used with other R-base tools for dealing with linear models, like the residuals function:

 > l <- loess(y1~x1)
 > plot(l)
 > r <- residuals(l)

r is now a variable holding residuals.

**8.17 Local regression for curve/surface fitting
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

http://cm.bell-labs.com/cm/ms/departments/sia/project/locfit/index.html

**8.18 Tools for integration
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

The int function in my rmutil library will handle this. www.luc.ac.be/~jlindsey/rcode.html (from Jim Lindsey)

**8.19 Box-Cox models
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 boxcox() in MASS

**8.20 Updated V&R packages (polynom, etc.)
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

If you want the current version of polynom, it is in the scripts for V&R's `S Programming', at http://www.stats.ox.ac.uk/pub/MASS3/Sprog/ and mirrors. The version discussed in that book (and in those scripts) is considerably later than that on CRAN (or the S versions on which that is based). (from Brian D. Ripley)

Note, you download "Sprog.tar.gz" and unzip it to find directories for the chapters; in there you find R packages you can install in the regular way.

**8.21 GIS packages for R
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

Try the GIS package GRASS:

 http://www.baylor.edu/~grass/statsgrasslist.html

from which the package described in Computers & Geosciences (2000), 26, pp. 1043-1052, may be accessed. There is a mailing-list too, and more progress is described in work with Markus Neteler, who is coordinating GRASS 5.0 development. The current package is GRASS_0.1-4, and is at:

 http://reclus.nhh.no/gc00/GRASS/GRASS_0.1-4.tar.gz

Please contact me if this may be helpful - the package is under development, and needs user feedback! (from Roger Bivand Roger.Bivand@nhh.no)

See also a recent paper in Computers & Geosciences:

R.S. Bivand, "Using the R statistical data analysis language on GRASS 5.0 GIS data base files," Computers & Geosciences, v.26, no.9/10, pp. 1043-1052 (2000).

Paulo Justiniano pointed out there is also:

- the package "spatial" is included with MASS
- the package "sgeostats" is available at the CRAN web site (for geostatistical analysis only)
- "splancs" is available at ftp://reclus.nhh.no/pub/R/splancs-2.01-1.tar.gz (for point process only)
- "geoR" is available at: http://www.maths.lancs.ac.uk/~ribeiro/geoR.html (for geostatistical analysis only)

Rainer Hurling added, have a look at

 http://reclus.nhh.no/gc00/gc009.htm

The GIS software 'GRASS' (http://www.geog.uni-hannover.de/grass/index2.html) works fine with database system 'PostgreSQL' (http://www.postgresql.org) and 'R' (http://www.R-project.org). For linking between GRASS and PostgreSQL see http://www.geog.uni-hannover.de/grass/projects/postgresqlgrass and between GRASS and R see ftp://reclus.nhh.no/pub/R .

**8.22 Multicategory dependent variable models
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

Brian Ripley said:

Depends on what you mean (multinomial?). There is what I understand by multinomial logistic in function multinom in package nnet in the VR bundle, and nnet there can do several other variants on this theme.

Most such models are not hard to program and fit by a general-purpose optimizer like optim().

One alternative I've not tried:

David Clayton said:

You can also fit these models using the gee() package. The method is not ML, but there are certain other advantages to this method.

See

 http://www.mrc-bsu.cam.ac.uk/pub/publications/submitted/david_clayton/ordcat2.ps

for details.

**8.23 The NLME package: fills the role of lme in S+
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

See "nlme", available at CRAN.

**8.24 Interval-censored data
(08/12/2000 Paul Johnson <pauljohn@ukans.edu>)

If you want parametric models, the gnlr and gnlr3 functions in my gnlm library will do all this: www.luc.ac.be/~jlindsey/rcode.html

**8.25 Multidimensional scaling
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

See cmdscale in the package "mva" and sammon and isoMDS in "MASS". There are some more in multiv, or at least more of the same. (from Brian D. Ripley)
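
A minimal classical-MDS sketch with the built-in eurodist distances (in R of this vintage you may need library(mva) first):

 loc <- cmdscale(eurodist, k = 2)          # 2-dimensional configuration
 plot(loc[, 1], loc[, 2], type = "n")
 text(loc[, 1], loc[, 2], labels = rownames(loc))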

/////////////////////////////////////////////////////
*9 Web resources outside the main R distribution
/////////////////////////////////////////////////////

**9.1 R for psychology
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Notes on the use of R for psychology experiments and questionnaires by Jonathan Baron and Yuelin Li at

 http://www.psych.upenn.edu/~baron/rpsych.htm
 http://www.psych.upenn.edu/~baron/rpsych.pdf
 http://www.psych.upenn.edu/~baron/rpsych.tex

It is intended for students and others who are doing research in psychology. What makes it "for psychology" is that it tries to cover the major statistical methods that psychologists use (particularly those who study humans).

Prof. Baron made a nice "reference card" as well: 
  http://www.psych.upenn.edu/~baron/refcard.pdf

**9.2 R for biology
(11/08/2000 ? <?>)

 http://www.stat.Berkeley.EDU/users/terry/zarray/Html/Rintro.html

**9.3 Learning materials for S and more
(14/02/2001 Paul Johnson <pauljohn@ukans.edu>)

"S Poetry" is available online at http://www.seanet.com/%7Epburns/Spoetry/

http://www.insightful.com/resources/doc/default.html Splus guides - quite useful, I think, for general approaches especially (free downloads)

http://www.insightful.com/resources/biblio.html list of Splus books - too many!

http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf guide to Splus for SAS users (or anyone, really)

and the other documents in CRAN contributed documents

Also, I'd periodically check the R site for contributed docs: http://cran.r-project.org/other-docs.html

**9.4 Pat Altham's misc. S+ worksheets
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 http://www.statslab.cam.ac.uk/~pat

**9.5 Keith Worsley's R scripts for his Generalized Linear Models course
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

These are useful!

 http://euclid.math.mcgill.ca/keith/

**9.6 Mark Myatt's R notes
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Mark Myatt offers some notes and example datasets that I found useful, at http://www.myatt.demon.co.uk/ (everything comes in one big bundle called "Rex.zip").

**9.7 Online econometrics materials
(15/08/2000 Paul Johnson <pauljohn@ukans.edu>)

Hans Ehrbar's online course http://www.econ.utah.edu/ehrbar/ecmet.pdf

Jim Lindsey's R code for Introductory Statistics: A Modelling Approach. Oxford University Press (1995) and Models for Repeated Measurements. Oxford University Press (1999, Second Edition) http://alpha.luc.ac.be/~jlindsey/books.html See also the instructor's manual: www.luc.ac.be/~jlindsey/publications.html

**9.8 R mailing list archives
(06/09/2000 Paul Johnson <pauljohn@ukans.edu>)

OK, you can probably find this via CRAN, but I kept forgetting:

 http://www.ens.gu.edu.au/robertk/R/help/00b/subject.html

//////////////////////////////////////
*10 The R workspace
//////////////////////////////////////

**10.1 Writing, saving, and running R code
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

The easiest thing is to use an editor such as Emacs or Winedit: write your R code, run it in R, and if it fails, change it in the editor and run it again. Emacs with ESS installed makes this work fine. I've heard of plugins for other editors, but have not tried them.

If you don't like that, you can type commands directly at the R prompt, which has some handy features of its own.

Up and down arrows cycle through the history of commands.

If you type in a function and want to revise it:

fix(f) or g <- edit(f) should bring up an editor.
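
A tiny sketch of the two:

 f <- function(x) x + 1
 fix(f)          # opens f in an editor; saving the buffer replaces f
 g <- edit(f)    # edits a copy; the result is assigned to g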

I personally don't see any point in this, though, since you should just use an editor.

**10.2 .RData, .Rhistory: help or confusion?
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

When you quit R, since version 1 or so, it asks if you want to save the workspace. If you say yes, it creates a file .RData. The next time you start R in this working directory, it loads all those objects from .RData again. Type "objects()" to see them.

If you want that, OK. If you don't, say NO at quit time, or erase the .RData file before starting R.

**10.3 Saving and loading R objects
(31/12/2001 Paul Johnson <pauljohn@ukans.edu>)

Try:

 > a <- c(1,2,3)
 > save(a, file="test.RData")
 > rm(a)
 > load("test.RData")
 > a
 [1] 1 2 3

(from Peter Dalgaard, 12/29/2001)

**10.4 Reminders for inspecting and using objects
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

 objects() #lists all objects
 attributes(anObject)
 str(anObject) #-- diagnostic info on anObject
 page(anObject) #-- pages through anObject
 rm(anObject) #-- removes an object
 rm(list=ls()) #-- removes all objects

**10.5 Removing objects whose names match a pattern
(31/12/2001 ? <?>)

Suppose you want to remove objects beginning with lm (lm1, lm2, lm3 etc).

 rm(list=objects(pattern="^lm.*")) (from Kjetil Kjernsmo)

Don't use rm(list=ls(pat="g*")) to remove all objects with names like g01, g02, etc.: in a regular expression, "g*" means zero or more occurrences of "g", so it matches every name and removes everything!

To remove objects whose names contain "g" anywhere in the string, use rm(list=ls(pat="g")); for names starting with "g", use rm(list=ls(pat="^g")).

The "^" and "$" can be used to delimit the beginning and ending of a string, respectively. (from Rashid Nassar)

**10.6 Creating and keeping a daily work log
(31/12/2001 ? <?>)

Consider sink("filename").
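
Note that sink() captures printed output only, not the commands you type. A sketch (the filename is hypothetical):

 sink("worklog.txt")     # divert printed output to a file
 summary(rnorm(100))     # ...work as usual; output goes to the file...
 sink()                  # restore output to the console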

The file .Rhistory contains commands typed on Unix systems.

Emacs users can run R in a shell and save everything in that shell to a file.

Ripley added: But if you are on Windows you can save the whole console record to a file, or cut-and-paste sections. (And that's what I do on Unix, too.)

**10.7 Customizing your Rprofile
(31/12/2001 ? <?>)

Put your settings in the file ~/.Rprofile in your HOME directory. For example, if you think the stars on significance tests are stupid, follow Brian Ripley's example:

 options(show.signif.stars=FALSE)
 ps.options(horizontal=FALSE)

There is a system-wide Rprofile in the distribution that you could edit as well.
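
The profile can also define a .First() function that runs at startup (see ?Startup); a sketch:

 .First <- function() {
     options(show.signif.stars = FALSE)
     cat("R session started:", date(), "\n")
 }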


///////////////////////////////////////////////////
*11 Interfacing with the OS
////////////////////////////////////////////////////

**11.1 Commands to the system for changing the working directory
(22/11/2000 Paul Johnson <pauljohn@ukans.edu>)

Commands to the operating system: system("insert command here between quotes").

For the particular case of changing directories use getwd/setwd [I think system() spawns its own shell, so changing directories with system() isn't persistent]. For example:

 > getwd()   ## start in my home directory
 [1] "/home/ben"
 > system("cd ..")  ## this executes OK but ...
 > getwd()
 [1] "/home/ben"    ## I'm still in the same working directory
 > setwd("..")      ## this is the way to change working directory
 NULL
 > getwd()          ## now I've moved up one
 [1] "/home"

(from Ben Bolker)

**11.2 Getting the system time
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

 > date()
 [1] "Tue Jan 30 10:22:39 2001"
 > Sys.time()
 [1] "2001-01-30 10:21:29"
 > unclass(Sys.time())
 [1] 980871696

This is the number of seconds since the epoch (Jan 1, 1970). I use it as a "time stamp" on files sometimes.
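
For instance, to build such a stamp into a file name (the name itself is just illustrative):

 stamp <- unclass(Sys.time())                     # seconds since the epoch
 fname <- paste("run-", stamp, ".txt", sep = "")
 fname                                            # e.g. "run-980871696.txt"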


**11.3 Checking whether a file exists
(11/08/2000 Paul Johnson <pauljohn@ukans.edu>)

The function file.exists() will do this.
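
A tiny sketch (the filename is hypothetical):

 if (file.exists("myData.txt")) {
     dat <- read.table("myData.txt", header = TRUE)
 } else {
     cat("myData.txt not found\n")
 }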

**11.4 Finding files by name or partial name (regular expression matching)
(14/08/2001 Paul Johnson <pauljohn@ukans.edu>)

Do ?list.files().

list.files() gives back a vector of character strings, one for each file matching the pattern. Want file names ending in .dat? Try:

 myDat <- list.files(pattern="\\.dat$")

Then iterate over that myDat vector, using the elements myDat[[i]], if you want to do something with each file (see the sketch below).
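
A sketch of such a loop, assuming each .dat file is a plain table with a header row:

 myDat <- list.files(pattern = "\\.dat$")
 for (f in myDat) {
     dat <- read.table(f, header = TRUE)
     print(summary(dat))
 }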

Please note patterns in R are supposed to be regular expressions, not ordinary shell wildcards. Regular expressions are fairly standard in unix, but each version has its wrinkles. To see about the grep command, or about apropos, where regexp is used, do:

 ?grep
 # or
 ?apropos

Martin Maechler spelled out a few of the basics for me:

 "."  means  ``any arbitrary character''
 "*"  means  ``the previous [thing] arbitrary many (0,1,...) times''
              where [thing] is a single character (or something more general
                                                   for advanced usage)


**11.5 Installing R packages
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

 install.packages()

will download a package and install it!

It will get it from the network location you specify, as in

 install.packages("VR",lib="c:/r/library",CRAN="http://lib.stat.cmu.edu/R/CRAN")

If you are in the Operating System, not in an R session, you can install a package you've downloaded with

 R INSTALL packageName

This works even if packageName is zipped, i.e., packageName.tar.gz will be accepted as an argument.

If you are not "root", tell R where to install the package directly from CRAN:

 install.packages('tree', '~/.Rlibs')

Omitting the 2nd argument causes the routine to issue a warning and try to install in the system package location, which is often what you want, provided you have the required write access.

And, from the operating system shell, there is an equivalent. If you want to install the library in a private collection you can add the -l flag to specify the location of the library. You should then establish an environment variable R_LIBS as a colon-separated list of directories in which to look for libraries.

The call to R INSTALL would look like

 R INSTALL /usr/apps/tree -l ~/.Rlibs

(from Douglas Bates)
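
Putting R_LIBS to work afterwards (BASH syntax; the package name is just an example):

 $ export R_LIBS=~/.Rlibs     # colon-separated list of library trees
 $ R
 > library(tree)              # R now finds packages installed in ~/.Rlibs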

**11.6 Memory
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

R-1.2 implements a new memory approach that automatically resizes the workspace. To read more about it, type

 ?Memory

or

 help(Memory)

You can also look at section 7.1 of the R FAQ (http://cran.r-project.org/doc/FAQ/R-FAQ.html).

I'm deleting all info about how to manually allocate memory. It is no longer recommended by the R team.

**11.7 Passing environment variables as arguments to a scripted R run
(30/01/2001 Paul Johnson <pauljohn@ukans.edu>)

Suppose you want an R program to be called in a script, and you absolutely must pass an argument to the R program to tell it what data to use or what variable name to use, etc.

Use Sys.getenv() to retrieve a value you have set in your shell.

For a C shell,

 $ setenv R_INFILE foobar

or BASH

 $ export R_INFILE=foobar

Then, in your R script, grab that value from the environment:

 > Sys.getenv("R_INFILE")
 R_INFILE 
 "foobar" 
 > sink(Sys.getenv("R_INFILE"))

(from Thomas Lumley)

**11.8 Remember "como" on minitab? Use script for that!
(13/08/2001 ? <?>)

Most unix systems have a program called script, which can keep a snapshot of text input and output. If you type

 $ script filename
 $ R
 > # ...do whatever R you want...
 > quit()
 $ exit

The $ is the shell prompt; > is the R prompt.

///////////////////////////////////////////////////////////////////////////
*12 Outrageous R tricks: can't live without them
//////////////////////////////////////////////////////////////////////////

**12.1 Commenting out sections of an R file
(15/08/2000 ? <?>)

Put # at the beginning of the line, or:

"With a good code editor none of these are any problem. There is no reason to write many lines of comments in R code: that's what the .Rd file is for. And if you want to temporarily delete a block, if(0) { ... } will work, as indeed will temporarily deleting in an editor." (from Brian D. Ripley)

Don't forget ESS has the ability to comment-out a highlighted region:

 M-x comment-region

**12.2 Using example() to run the examples from R's help pages
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

See "example" at end of output in help?

Try

 >?ls
 >example(ls)

Graphs go by too fast? Do

 par(ask=TRUE)

and try example(ls) again.

**12.3 Using help
(13/08/2001 Paul Johnson <pauljohn@ukans.edu>)

 help(function)   # same as ?function
 help.start()     # fires up a browser (e.g., Netscape) on the HTML help pages
 help.search()    # searches all available functions from all loaded packages

Don't forget that help.search takes regular expressions (as its help page describes):

 > help.search(".*trans.*")
 Help files with name or title matching `.*trans.*':

 boxcox(MASS)     Box-Cox Transformations for Linear Models
 gilgais(MASS)    Line Transect of Soil in Gilgai Territory
 heart(MASS)      The Stanford heart transplant data

(from Thomas Lumley)

There is also an "apropos" function, similar in nature to emacs. If you have a vague idea of what you want, try it. Want to find how to list files? Do:

 apropos("files")

Often you don't know the keyword to look for.

**12.4 Detaching libraries you no longer need
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

Use detach(package:...), for instance:

 > search()
 [1] ".GlobalEnv"    "package:ctest" "Autoloads"     "package:base" 
 > detach(package:ctest)
 > search()
 [1] ".GlobalEnv"   "Autoloads"    "package:base"

(from Emmanuel Paradis)


////////////////////////////////////////
*13 Other things I found interesting
////////////////////////////////////////////

**13.1 Listing the names of variables inside an expression
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

 > all.names(expression(sin(x+y)))
 [1] "sin" "+"   "x"   "y"
 > all.names(functions=FALSE,expression(sin(x+y)))
 [1] "x" "y"
 > all.vars(functions=FALSE,expression(sin(x+y)))
 [1] "x" "y"
 > all.vars(expression(sin(x+y)))
 [1] "x" "y"

all.vars() works on a formula, terms, or expression object.

 > all.vars((~j(pmin(3,g(h(x^3*y))))))
 [1] "x" "y"

**13.2 The R environment inside scripts
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

This function does not remove objects correctly:

 testLoadSeveralHDF <- function(numFiles)
 {
     for (i in 0:(numFiles-1)) {
         filename <- paste("trial", i, ".hdf", sep="")
         print(filename)
         hdf5load(filename)
         rm(list = ls(pat="^g"))
     }
 }

Doug Bates said:

Add envir = .GlobalEnv to both the ls and the rm calls. That is

 rm(list = ls(pat = "^g", envir = .GlobalEnv), envir = .GlobalEnv)
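
So the corrected function, following that advice, looks like this (hdf5load comes from the hdf5 package):

 testLoadSeveralHDF <- function(numFiles)
 {
     for (i in 0:(numFiles-1)) {
         filename <- paste("trial", i, ".hdf", sep="")
         print(filename)
         hdf5load(filename)
         rm(list = ls(pat = "^g", envir = .GlobalEnv), envir = .GlobalEnv)
     }
 }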

**13.3 Differentiation
(10/04/2001 Paul Johnson <pauljohn@ukans.edu>)

First and second derivatives of an expression with respect to x:

 > D(expression(x^2),"x")
 > D(D(expression(x^2),"x"),"x")
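
These should return the expression 2 * x and, for the second derivative, the constant 2:

 > D(expression(x^2), "x")
 2 * x
 > D(D(expression(x^2), "x"), "x")
 [1] 2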

The real R FAQ 7.6 explains "How can I get eval() and D() to work?"

