The ana package provides a comprehensive toolkit for exploratory data analysis in R, featuring automatic variable analysis, visualization, and survival time calculations. The package includes both Chinese interface functions and optimized survival analysis utilities.
Many users encounter the error Error in library(ana) : there is no package called 'ana' after installing from GitHub. This typically occurs because:
- The package lacks proper R package structure
- Missing DESCRIPTION file
- No proper namespace configuration
# Download the repository
# Then source the main file directly:
source("path/to/ana.R")
# For survival time functions, source individually as needed:
source("path/to/a_survival_time.R")
source("path/to/a_survival_time_table.R")
# etc.Create a proper package structure first:
ana/
โโโ DESCRIPTION
โโโ NAMESPACE
โโโ R/
โ โโโ ana.R
โ โโโ a_survival_time.R
โ โโโ a_survival_time_table.R
โ โโโ a_survival_time_C++.R
โ โโโ a_survival_time_large.R
โ โโโ a_survival_time_ram.R
โโโ man/Create DESCRIPTION file:
Package: ana
Type: Package
Title: Data Analysis and Visualization Tools
Version: 1.0.0
Author: Your Name
Maintainer: Your Name <your.email@example.com>
Description: Comprehensive toolkit for exploratory data analysis with automatic
variable analysis, visualization, and survival time calculations.
License: GPL-3
Encoding: UTF-8
LazyData: true
Imports:
haven,
ggplot2,
dplyr,
scales,
knitr,
rmarkdown,
data.table,
parallel,
foreach,
doParallel,
Rcpp
RoxygenNote: 7.2.3
Create NAMESPACE file:
export(ana)
export(alook)
export(avar)
export(a_survival_time_improved)
export(a_survival_time_dt)
export(a_survival_time_rcpp)
export(a_survival_time_parallel)
export(a_survival_time_memory)
Then install:
# Install from local directory
install.packages("path/to/ana", repos = NULL, type = "source")
# Or use devtools
devtools::install("path/to/ana")The package requires the following R packages:
haven: For reading SPSS/Stata/SAS filesggplot2: For visualizationdplyr: For data manipulationscales: For plot scalesknitr: For reportingrmarkdown: For document generation
data.table: For optimized data.table versionRcpp: For C++ optimized versionparallel,foreach,doParallel: For parallel processing version
Install all dependencies:
install.packages(c("haven", "ggplot2", "dplyr", "scales",
"knitr", "rmarkdown", "data.table",
"Rcpp", "parallel", "foreach", "doParallel"))Performs comprehensive descriptive statistics on specified variables.
# Analyze specific variables
ana(mydata, "var1", "var2", "var3")
# Analyze all suitable variables (auto-selects up to 10)
ana(mydata)Features:
- Automatic type detection (numeric, factor, character, logical, date)
- Missing value analysis
- Descriptive statistics for numeric variables
- Frequency tables for categorical variables
- Handles special values (Inf, NaN)
Creates automatic visualizations based on variable types.
# Visualize specific variables
alook(mydata, "var1", "var2")
# Visualize all suitable variables
alook(mydata)Visualizations:
- Missing value bar charts
- Histograms for numeric variables
- Box plots with jitter for small datasets
- Bar charts for categorical variables
Provides a comprehensive overview of all variables in the dataset.
avar(mydata)Output includes:
- Variable type classification
- Missing value summary
- Data quality score
- Single-value variable detection
- Data quality recommendations
The package includes multiple optimized versions of survival time calculation:
The main survival time calculation function with comprehensive error handling.
# Basic usage
result <- a_survival_time_improved(data,
id_var = "patient_id",
time = "visit_time",
event = "event_occurred",
check_order = TRUE,
verbose = TRUE)Parameters:
data: Data frame containing the survival dataid_var: Column name for individual/subject IDtime: Column name for time variableevent: Column name for event indicator (0/1)check_order: Check temporal ordering of observationsverbose: Print warnings and messageshandle_negative_time: Handle negative time values
5-10x faster for large datasets using data.table.
# Requires data.table package
library(data.table)
result <- a_survival_time_dt(data,
id_var = "patient_id",
time = "visit_time",
event = "event_occurred")10-20x faster using Rcpp for maximum performance.
# Requires Rcpp package
library(Rcpp)
result <- a_survival_time_rcpp(data,
id_var = "patient_id",
time = "visit_time",
event = "event_occurred")For very large datasets, uses multiple CPU cores.
# Automatically detects cores
result <- a_survival_time_parallel(data,
id_var = "patient_id",
time = "visit_time",
event = "event_occurred",
n_cores = 4) # OptionalFor memory-constrained environments, uses optimized data types.
result <- a_survival_time_memory(data,
id_var = "patient_id",
time = "visit_time",
event = "event_occurred")# Load sample data
data(iris)
# Basic analysis of specific variables
ana(iris, "Sepal.Length", "Species")
# Visual analysis
alook(iris, "Sepal.Length", "Species")
# Analyze all variables
avar(iris)# Create sample survival data
survival_data <- data.frame(
patient_id = rep(1:100, each = 5),
visit_time = rep(0:4, 100) + runif(500, -0.1, 0.1),
event = rbinom(500, 1, 0.1)
)
# Calculate survival times
result <- a_survival_time_improved(survival_data,
id_var = "patient_id",
time = "visit_time",
event = "event")
# For large datasets, use optimized versions
result_fast <- a_survival_time_dt(survival_data,
id_var = "patient_id",
time = "visit_time",
event = "event")Solution 1: Use source() instead of library()
source("ana.R")
# Now functions are available directly
ana(mydata)Solution 2: Create proper package structure (see Installation section)
Solution: Manually install all dependencies
# Check and install missing packages
required_packages <- c("haven", "ggplot2", "dplyr", "scales",
"knitr", "rmarkdown", "data.table", "Rcpp")
missing_packages <- required_packages[!required_packages %in% installed.packages()[,"Package"]]
if(length(missing_packages) > 0) {
install.packages(missing_packages)
}Solution: Ensure Rcpp is properly installed and configured
# Test Rcpp
install.packages("Rcpp")
library(Rcpp)
evalCpp("2 + 2") # Should return 4Solution: Windows requires special handling for parallel processing
# On Windows, use:
library(parallel)
cl <- makeCluster(detectCores() - 1)
# ... your code ...
stopCluster(cl)| Function | Relative Speed | Best Use Case |
|---|---|---|
a_survival_time_improved |
1x (baseline) | Small datasets, full features |
a_survival_time_dt |
5-10x | Medium to large datasets |
a_survival_time_rcpp |
10-20x | Large datasets, speed critical |
a_survival_time_parallel |
Varies | Very large datasets, multi-core systems |
a_survival_time_memory |
0.8x | Memory-constrained environments |
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
GPL-3
- The package automatically loads required dependencies on attach
- Chinese characters in function output may require proper encoding settings
- For best performance with large datasets, use the optimized survival time functions
- The visualization functions (alook) work best with datasets under 10,000 rows
If you use this package in your research, please cite:
@Manual{,
title = {ana: Data Analysis and Visualization Tools for R},
author = {Author Name},
year = {2024},
note = {R package version 1.0.0},
url = {https://github.com/username/ana},
}