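You can use the microbenchmark package to perform "sub-millisecond accurate timing of expression evaluation".
In this example we compare the speed of six alternative data.table expressions for updating elements within a group, based on a certain condition.
More specifically:
A data.table with 3 columns: id, time and status. For each id, I want to find the record with the maximum time. Then, if the status of that record is TRUE and its time is > 7, I want to set the status to FALSE.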
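To make the task concrete, here is a minimal sketch on a tiny, made-up table. The name `toy`, its values and the helper variables `last_rows` and `to_flip` are illustrative assumptions and are not part of the benchmark data. It finds the max-time row of each id and sets status to FALSE only where that row has status TRUE and time > 7.

```r
library(data.table)

# Hypothetical toy data: two ids with three records each
toy <- data.table(
  id     = c(1, 1, 1, 2, 2, 2),
  time   = c(1, 5, 9, 2, 4, 6),
  status = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)
setkey(toy, id, time)  # sorts by id, then by time

# Row number (.I) of the record with the largest time within each id
last_rows <- toy[, .I[which.max(time)], by = id]$V1

# Of those rows, keep only the ones with status TRUE and time > 7
to_flip <- last_rows[toy$status[last_rows] & toy$time[last_rows] > 7]

# Update by reference
toy[to_flip, status := FALSE]

toy
```

Here only the last record of id 1 (time 9) is changed; the last record of id 2 has time 6 and keeps its status. The benchmark itself runs on a much larger table.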
    library(microbenchmark)
    library(data.table)

    set.seed(20160723)
    dt <- data.table(id = c(rep(seq(1:10000), each = 10)),
                     time = c(rep(seq(1:10000), 10)),
                     status = c(sample(c(TRUE, FALSE), 10000*10, replace = TRUE)))
    setkey(dt, id, time)

    ## create copies of the data so the 'updates-by-reference' don't affect other expressions
    dt1 <- copy(dt)
    dt2 <- copy(dt)
    dt3 <- copy(dt)
    dt4 <- copy(dt)
    dt5 <- copy(dt)
    dt6 <- copy(dt)

    microbenchmark(

      ## note: expression_1 compares with `< 7`, the others with `> 7`
      expression_1 = {
        dt1[ dt1[order(time), .I[.N], by = id]$V1, status := status * time < 7 ]
      },

      expression_2 = {
        dt2[,status := c(.SD[-.N, status], .SD[.N, status * time > 7]), by = id]
      },

      expression_3 = {
        dt3[dt3[,.N, by = id][,cumsum(N)], status := status * time > 7]
      },

      expression_4 = {
        y <- dt4[,.SD[.N],by=id]
        dt4[y, status := status & time > 7]
      },

      expression_5 = {
        y <- dt5[, .SD[.N, .(time, status)], by = id][time > 7 & status]
        dt5[y, status := FALSE]
      },

      expression_6 = {
        dt6[ dt6[, .I == .I[which.max(time)], by = id]$V1 & time > 7, status := FALSE]
      },

      times = 10L ## specify the number of times each expression is evaluated
    )

    # Unit: milliseconds
    #         expr         min          lq        mean      median         uq          max neval
    # expression_1   11.646149   13.201670   16.808399   15.643384   18.78640    26.321346    10
    # expression_2 8051.898126 8777.016935 9238.323459 8979.553856 9281.93377 12610.869058    10
    # expression_3    3.208773    3.385841    4.207903    4.089515    4.70146     5.654702    10
    # expression_4   15.758441   16.247833   20.677038   19.028982   21.04170    36.373153    10
    # expression_5 7552.970295 8051.080753 8702.064620 8861.608629 9308.62842  9722.234921    10
    # expression_6   18.403105   18.812785   22.427984   21.966764   24.66930    28.607064    10
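The output shows that expression_3 is the fastest in this test, with a median of about 4.1 ms. expression_2 and expression_5 are by far the slowest, each with a median of roughly 9 seconds per evaluation.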
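When many expressions are compared, it can help to store the benchmark result and rank it programmatically instead of reading the printed table. The following is a minimal sketch on a small stand-in benchmark. The expressions `loop_sum` and `vector_sum` and the variables `x`, `mb` and `res` are hypothetical and unrelated to the data.table expressions above.

```r
library(microbenchmark)

x <- runif(1e4)

# Store the result instead of only printing it
mb <- microbenchmark(
  loop_sum   = { s <- 0; for (v in x) s <- s + v; s },
  vector_sum = sum(x),
  times = 10L
)

# summary() returns a data.frame with one row per expression
res <- summary(mb, unit = "ms")

# Rank the expressions by median time, fastest first
res[order(res$median), c("expr", "median")]

# Print the timings relative to the fastest expression
print(mb, unit = "relative")
```

Ranking by the median rather than the mean makes the comparison less sensitive to a single slow run, such as the first evaluation of an expression.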
References
data.table - Adding and modifying columns
data.table - Special grouping symbols in data.table