scripts: Add R scripts for interview logs analysis #801

18 changes: 18 additions & 0 deletions scripts/rAnalyses/README.md
@@ -0,0 +1,18 @@
This directory contains `R` scripts that can be run on various evolution data.

## How to use

These scripts are not standalone scripts that can simply be run to get some data. They are provided to support data and metadata analyses of evolution-based survey results. They may not apply to every dataset, and they may become outdated.

They are commented `Rmd` documents, where each code block explains what it does. Each script can be downloaded independently, or copied and fine-tuned to one's needs.
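For example, a script can be rendered to an HTML report from an R console. This is a minimal sketch, assuming the `rmarkdown` package is installed and the data paths inside the script have been adjusted first:

```r
# Render one of the analysis scripts to an HTML report
# (assumes the rmarkdown package is installed and the script
# is in the working directory)
rmarkdown::render("logAnalysis.Rmd")
```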

## Prerequisites

These scripts require a working `R` installation on your system. You may [install `R` in vscode](https://code.visualstudio.com/docs/languages/r) or use [`RStudio`](https://posit.co/download/rstudio-desktop/).

Some `R` packages are also required by the scripts (an install command is shown after this list):

* dplyr: A Grammar of Data Manipulation
* ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics
* tidyr: Tidy Messy Data
* stringr: Simple, Consistent Wrappers for Common String Operations
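
They can all be installed from CRAN in one call:

```r
# Install the packages required by the scripts from CRAN
install.packages(c("dplyr", "ggplot2", "tidyr", "stringr"))
```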
187 changes: 187 additions & 0 deletions scripts/rAnalyses/logAnalysis.Rmd
@@ -0,0 +1,187 @@
---
title: "Response analyses"
output: html_document
date: "2024-10-17"
---

# Log analyses

This R script uses the interview logs export to see where participants abandon the survey and which questions take the most time to answer.

```{r}
# Clear the environment
rm(list = ls())
```

```{r}
# Import required libraries
library(dplyr) # A Grammar of Data Manipulation
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(tidyr) # Tidy Messy Data
library(stringr) # Simple, Consistent Wrappers for Common String Operations
```

## Load data

```{r}
folder <- "/path/to/data/folder/"
logFileName <- "interviewLogs_responses_20241024.csv"
```

Load the interview log data. The file should be the one containing the participant responses only, without the values, as generated by the task `exportInterviewLogs.task.js --participantResponsesOnly`.
```{r}
filePath <- file.path(folder, logFileName)
interviewLogs <- read.csv(filePath, sep = ",", header = TRUE)
summary(interviewLogs)
```

Replace the UUIDs in the field names with the string 'any' so that fields differing only by UUID can be grouped together.
```{r}
# Define the regex pattern for UUIDs
uuid_pattern <- "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
# Define the replacement string
replacement <- "any"
# Perform the replacement
interviewLogs$modifiedFields <- str_replace_all(interviewLogs$modifiedFields, uuid_pattern, replacement)

```

Remove the non-participant responses from `modifiedFields`; those are the fields that contain an underscore immediately after a dot (`._`). Rows left with no fields are dropped.
```{r}
logsWithoutMetadata <- interviewLogs %>%
rowwise() %>%
mutate(
fields = str_split(modifiedFields, "\\|") %>%
unlist() %>%
.[!str_detect(., "\\._")] %>%
paste(collapse = "|")
) %>%
filter(fields != "")
```

Group the rows so that there is only one row per id/timestamp pair. Their fields are concatenated, then only unique values are kept and sorted, to make sure the order is always the same.
```{r}
oneRowPerTs <- logsWithoutMetadata %>%
group_by(id, uuid, timestamp) %>%
summarise(
fields = paste(fields, collapse = "|"),
.groups = 'drop'
) %>%
  ungroup() %>%
rowwise() %>%
mutate(
fields = str_split(fields, "\\|") %>%
unlist() %>%
unique() %>%
sort() %>%
paste(collapse = "|")
)
```

## Abandon analysis

Get the rows with the maximum timestamp for each interview ID (keeping all of them if several share the maximum), then split the concatenated fields into one row per field. The fields were sorted alphabetically in the grouping step above, so that unordered fields are recognized as the same.

The resulting table will contain the last fields that were answered by the participant, with the timestamp.
```{r}
lastAnsweredResponse <- oneRowPerTs %>%
group_by(id) %>%
filter(timestamp == max(timestamp)) %>%
ungroup() %>%
separate_rows(fields, sep = "\\|")
```

Count the number of times each field was the last one answered by a participant, and order by descending count.
```{r}
countFieldsLastAnswered <- lastAnsweredResponse %>%
group_by(fields) %>%
summarise(count = n()) %>%
arrange(desc(count))

# Print the resulting data frame
print(countFieldsLastAnswered)
```
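
The counts can also be visualized with the `ggplot2` package loaded above. A minimal sketch, not part of the original pipeline; the top-20 cutoff is an arbitrary choice:

```{r}
# Plot the 20 fields that were most often the last ones answered.
# Assumes countFieldsLastAnswered from the previous chunk.
countFieldsLastAnswered %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(fields, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Field", y = "Times answered last")
```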

Save this data to file

```{r}
filePath <- file.path(folder, "lastResponses.csv")
write.csv(countFieldsLastAnswered, file = filePath, row.names = FALSE)
```

## Answer time analysis

Add a column containing the time difference between each row's timestamp and the previous one within the same interview.
```{r}

# Ensure the timestamp column is in a proper date-time format
oneRowPerTs$timestamp <- as.POSIXct(oneRowPerTs$timestamp)

# Process the data frame
timesToAnswer <- oneRowPerTs %>%
group_by(id) %>%
arrange(id, timestamp) %>%
mutate(timestamp_diff = timestamp - lag(timestamp)) %>%
ungroup() %>%
filter(!is.na(timestamp_diff)) %>%
separate_rows(fields, sep = "\\|")

```


```{r}
# Calculate statistics for each value of the 'fields' column
statistics <- timesToAnswer %>%
group_by(fields) %>%
summarise(
count = n(),
mean_diff = mean(timestamp_diff, na.rm = TRUE),
median_diff = median(timestamp_diff, na.rm = TRUE),
sd_diff = sd(timestamp_diff, na.rm = TRUE),
min_diff = min(timestamp_diff, na.rm = TRUE),
max_diff = max(timestamp_diff, na.rm = TRUE)
)

# Print the resulting statistics data frame
print(statistics)
```
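
To spot the slowest questions at a glance, the median times can be plotted. A minimal sketch assuming the `statistics` frame from the previous chunk, with the `difftime` values converted to seconds:

```{r}
# Plot the 20 fields with the highest median time to answer.
# median_diff is a difftime; convert it to seconds for plotting.
statistics %>%
  arrange(desc(median_diff)) %>%
  slice_head(n = 20) %>%
  mutate(median_secs = as.numeric(median_diff, units = "secs")) %>%
  ggplot(aes(x = reorder(fields, median_secs), y = median_secs)) +
  geom_col() +
  coord_flip() +
  labs(x = "Field", y = "Median time to answer (seconds)")
```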

Save those stats to file

```{r}
filePath <- file.path(folder, "statisticsPerResponse.csv")
write.csv(statistics, file = filePath, row.names = FALSE)
```

## Interview duration analysis

For old interviews (pre-2025), the log timestamps are the best data from which to get the interview duration: the actual startedAt and completedAt timestamps may come either from the browser or from the server, while the logs' timestamps always come from the server. We therefore calculate each interview's duration from its first and last log entries.
```{r}
# Calculate the interview duration
interviewDurations <- oneRowPerTs %>%
group_by(id, uuid) %>%
summarise(
duration = max(timestamp) - min(timestamp),
.groups = 'drop'
)

# Print the resulting data frame
summary(interviewDurations)
```
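
A histogram gives a quick view of the duration distribution. A minimal sketch, assuming `duration` is a `difftime` as computed above; the 5-minute bin width is an arbitrary choice:

```{r}
# Histogram of interview durations, converted to minutes.
interviewDurations %>%
  mutate(duration_min = as.numeric(duration, units = "mins")) %>%
  ggplot(aes(x = duration_min)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Interview duration (minutes)", y = "Number of interviews")
```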

```{r}

# Keep the record with the maximum timestamp_diff value for each interview id
maxDiffPerInterview <- timesToAnswer %>%
group_by(id) %>%
filter(timestamp_diff == max(timestamp_diff)) %>%
ungroup()

# Merge interviewDurations and maxDiffPerInterview on the uuid field
mergedData <- merge(interviewDurations, maxDiffPerInterview, by = "uuid")

# Print the resulting data frame
summary(mergedData)

```
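
One way to read the merged table is to compute what share of each interview's total duration is taken by its single longest pause. A minimal sketch; the `max_pause_share` column is illustrative and not part of the original data:

```{r}
# Share of the total interview duration taken by the longest pause.
# Both columns are difftime values, converted to seconds; interviews
# with a zero duration are dropped to avoid dividing by zero.
pauseShare <- mergedData %>%
  filter(as.numeric(duration, units = "secs") > 0) %>%
  mutate(max_pause_share = as.numeric(timestamp_diff, units = "secs") /
           as.numeric(duration, units = "secs"))

summary(pauseShare$max_pause_share)
```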
