
speed can be improved #12

@alexyfyf

Description

Especially in DoCo counting, the aggregate() call could perhaps be replaced with something faster.

aggregated_df <- aggregate(

Using the data.table package would be faster, but it would add a dependency.
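
For comparison, base R's rowsum() also sums numeric columns by group and would avoid the new dependency entirely. A minimal, untested sketch, assuming the same merged_df, sample_columns, and DoCo column used in the benchmarks below:

# sketch using base R rowsum(), which sums data.frame rows by group
# (assumes merged_df and sample_columns as in the benchmarks below)
aggregated_rs <- rowsum(merged_df[, sample_columns], group = merged_df$DoCo)
# group labels come back as row names; move them into a DoCo column
aggregated_rs <- data.frame(DoCo = rownames(aggregated_rs), aggregated_rs,
                            row.names = NULL)

rowsum() is implemented in C and is typically much faster than aggregate() for plain column sums, though it has not been timed on this table.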

Here are some benchmarks on a count table with 174k transcripts and 44k unique DoCo groups.

dim(merged_df)
# [1] 174278   3853
length(unique(merged_df$DoCo))
# [1] 44099

# use aggregate (base R)
system.time(aggregated_df <- aggregate(
  merged_df[, sample_columns],
  by = list(DoCo = merged_df$DoCo),
  FUN = sum
))
# user  system elapsed 
# 938.484  21.817 950.460 

# use dplyr
library(dplyr)
system.time(aggregated_df2 <- merged_df %>%
              group_by(DoCo) %>%
              summarise(across(all_of(sample_columns), sum)))
# user  system elapsed 
# 457.294  25.740 482.254 

# use data.table
library(data.table)
system.time(dt <- as.data.table(merged_df))
# user  system elapsed 
# 1.238   0.028   1.234 
system.time(aggregated_dt <- dt[, lapply(.SD, sum), by = DoCo, .SDcols = sample_columns])
# user  system elapsed 
# 17.612   0.084   2.258
system.time(aggregated_df3 <- data.frame(aggregated_dt))
# user  system elapsed 
# 0.471   0.000   0.472
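
If data.table is adopted, the as.data.table() copy can likely be avoided as well. A minimal sketch, assuming the same merged_df and sample_columns, using setDT()/setDF() to convert by reference instead of copying:

library(data.table)
# convert merged_df to a data.table in place (no copy)
setDT(merged_df)
aggregated_dt <- merged_df[, lapply(.SD, sum), by = DoCo, .SDcols = sample_columns]
# convert both objects back to plain data.frames, again by reference
setDF(merged_df)
aggregated_df3 <- setDF(aggregated_dt)

This should keep peak memory lower on large count tables, since as.data.table() and data.frame() each make a full copy.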
