Basic R: rows that contain the maximum value of a variable
File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.”
Let’s create a data frame with a grouping column, vars, that takes five values, A – E, each of which appears twice, along with two columns of measurements, obs1 and obs2:
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = c(1:10), obs2 = c(11:20))
df.orig
#    vars obs1 obs2
# 1     A    1   11
# 2     B    2   12
# 3     C    3   13
# 4     D    4   14
# 5     E    5   15
# 6     A    6   16
# 7     B    7   17
# 8     C    8   18
# 9     D    9   19
# 10    E   10   20
Now, let’s say we want only the rows that contain the maximum value of obs1 for each of A – E. In bioinformatics, for example, we might be interested in selecting the microarray probeset with the highest sample variance from multiple probesets per gene. The answer is obvious in this trivial example (rows 6 – 10), but one procedure looks like this:
# use aggregate to create new data frame with the maxima
df.agg <- aggregate(obs1 ~ vars, df.orig, max)
# then merge with the original; by default merge() joins on the shared columns (vars, obs1)
df.max <- merge(df.agg, df.orig)
df.max
#   vars obs1 obs2
# 1    A    6   16
# 2    B    7   17
# 3    C    8   18
# 4    D    9   19
# 5    E   10   20
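For comparison, here is a minimal sketch of a one-step alternative in base R using ave(), which returns the group maximum aligned to each row of the original. It is not from the mailing list thread, and the names is.max and df.max2 are my own:

```r
# recreate the example data so this block runs on its own
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = 1:10, obs2 = 11:20)
# ave() returns, for every row, the maximum of obs1 within that row's vars group
is.max <- df.orig$obs1 == ave(df.orig$obs1, df.orig$vars, FUN = max)
# keep only the rows where obs1 equals its group maximum
df.max2 <- df.orig[is.max, ]
df.max2
```

Unlike the merge() result, the rows keep their original row names (6 to 10), which can be handy if you need to trace them back to the source data.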
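This also works with min() and, more generally, with any function that returns a single value per group, provided that value actually occurs in the original data frame. A function such as mean() usually returns a value that matches no row, in which case the merge drops that group.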
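A quick sketch of the min() variant, plus a check of what happens with ties. Because merge() joins on all the shared columns, every row that ties for the extreme value comes back, not just one per group. The data frame df.tie is a made-up example for illustration:

```r
# recreate the example data so this block runs on its own
df.orig <- data.frame(vars = rep(LETTERS[1:5], 2), obs1 = 1:10, obs2 = 11:20)
# same procedure with min(): picks out rows 1 - 5 of the original
df.min <- merge(aggregate(obs1 ~ vars, df.orig, min), df.orig)
df.min

# hypothetical data with a tie: group A reaches its maximum (5) twice
df.tie <- data.frame(vars = c("A", "A", "B"), obs1 = c(5, 5, 3), obs2 = c(1, 2, 3))
df.tie.max <- merge(aggregate(obs1 ~ vars, df.tie, max), df.tie)
nrow(df.tie.max)  # 3: both tied rows for A are kept, plus the single row for B
```

If you need exactly one row per group, you would have to break the ties yourself afterwards, for example by dropping duplicates on vars.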
With thanks to this mailing list thread.
Source: What You're Doing Is Rather Desperate - Category: Bioinformaticians Authors: nsaunders Tags: programming research diary statistics Source Type: blogs