Basic R: rows that contain the maximum value of a variable

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”

Let’s create a data frame containing a grouping variable, vars, with five values A to E, each of which appears twice, along with some measurements:

    df.orig <- data.frame(vars = rep(LETTERS[1:5], 2),
                          obs1 = c(1:10),
                          obs2 = c(11:20))
    df.orig
    #    vars obs1 obs2
    # 1     A    1   11
    # 2     B    2   12
    # 3     C    3   13
    # 4     D    4   14
    # 5     E    5   15
    # 6     A    6   16
    # 7     B    7   17
    # 8     C    8   18
    # 9     D    9   19
    # 10    E   10   20

Now, let’s say we want only the rows that contain the maximum value of obs1 for each of A to E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene.

The answer is obvious in this trivial example (rows 6 to 10), but one procedure looks like this:

    # use aggregate to create a new data frame with the maximum of obs1 per value of vars
    df.agg <- aggregate(obs1 ~ vars, df.orig, max)
    # then simply merge with the original; merge() joins on the shared columns, vars and obs1
    df.max <- merge(df.agg, df.orig)
    df.max
    #   vars obs1 obs2
    # 1    A    6   16
    # 2    B    7   17
    # 3    C    8   18
    # 4    D    9   19
    # 5    E   10   20

This also works using min() and, I guess, using any function that returns a single value per group, provided that value actually occurs in the original data frame (otherwise the merge has nothing to match).

With thanks to this mailing list thread.
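One thing worth checking with the aggregate-and-merge approach is what happens when a group's maximum occurs more than once. The sketch below is not from the mailing list thread; it rebuilds the same toy data, adds one hypothetical tied row for A (the obs2 value of 99 is made up for illustration), and also shows a base R alternative using ave(), which avoids the intermediate data frame.

```r
# Same toy data as in the post
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2),
                      obs1 = c(1:10),
                      obs2 = c(11:20))

# Alternative: ave() returns, for every row, the maximum of obs1 within that row's group
group.max <- ave(df.orig$obs1, df.orig$vars, FUN = max)
# keep only the rows whose obs1 equals the maximum for their group
df.max2 <- df.orig[df.orig$obs1 == group.max, ]

# Hypothetical tie: a second A row that shares the maximum obs1 of 6
df.tie <- rbind(df.orig, data.frame(vars = "A", obs1 = 6L, obs2 = 99L))
df.tie.agg <- aggregate(obs1 ~ vars, df.tie, max)
# merge() matches on vars and obs1, so both tied A rows come back
df.tie.max <- merge(df.tie.agg, df.tie)
```

Here df.max2 holds the same five rows as the merged result (original rows 6 to 10, keeping their original row names), while df.tie.max has six rows because both A rows with obs1 equal to 6 are returned. If exactly one row per group is needed, the ties have to be resolved in a further step.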
Source: What You're Doing Is Rather Desperate. Category: Bioinformaticians. Tags: programming, R, research diary, statistics. Source type: blogs.