scripts: Add R scripts for interview logs analysis #801

18 changes: 18 additions & 0 deletions scripts/rAnalyses/README.md
@@ -0,0 +1,18 @@
This directory contains `R` scripts that can be run on various evolution data.

## How to use

These scripts are not standalone scripts that can simply be run to get some data. They are provided to support data and metadata analyses of evolution-based survey results. They may not apply to every dataset, and they may become outdated.

They are commented `Rmd` documents, where each code block explains what it does. Each script can be downloaded independently, or copied and fine-tuned to one's needs.
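For example, a script can be rendered to an HTML report from an R console. This is a minimal sketch, assuming the `rmarkdown` package is installed and the data paths inside the script have been adjusted first:

```r
# Render one of the analysis scripts to an HTML report
# (assumes the rmarkdown package is installed and the script
# is in the working directory)
rmarkdown::render("logAnalysis.Rmd")
```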

## Prerequisites

These scripts require a working `R` installation on your system. You may [install `R` in vscode](https://code.visualstudio.com/docs/languages/r) or use [`RStudio`](https://posit.co/download/rstudio-desktop/).

Some `R` packages are also required by the scripts (an install command is shown after this list):

* dplyr: A Grammar of Data Manipulation
* ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics
* tidyr: Tidy Messy Data
* stringr: Simple, Consistent Wrappers for Common String Operations
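
They can all be installed from CRAN in one call:

```r
# Install the packages required by the scripts from CRAN
install.packages(c("dplyr", "ggplot2", "tidyr", "stringr"))
```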
187 changes: 187 additions & 0 deletions scripts/rAnalyses/logAnalysis.Rmd
@@ -0,0 +1,187 @@
---
title: "Response analyses"
output: html_document
date: "2024-10-17"
---

# Log analyses

This R script uses the interview logs export to see where participants abandon the survey and which questions take the most time to answer.

```{r}
# Clear the environment
rm(list = ls())
```

```{r}
# Import required libraries
library(dplyr) # A Grammar of Data Manipulation
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(tidyr) # Tidy Messy Data
library(stringr) # Simple, Consistent Wrappers for Common String Operations
```

## Load data

```{r}
folder <- "/path/to/data/folder/"
logFileName <- "interviewLogs_responses_20241024.csv"
```

Load the interview log data. The file should be the one containing the participant responses only, without the values, as generated by the task `exportInterviewLogs.task.js --participantResponsesOnly`.
```{r}
filePath <- file.path(folder, logFileName)
interviewLogs <- read.csv(filePath, sep = ",", header = TRUE)
summary(interviewLogs)
```

Replace the UUIDs in the field names with the string 'any' so that fields differing only by UUID can be grouped together.
```{r}
# Define the regex pattern for UUIDs
uuid_pattern <- "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
# Define the replacement string
replacement <- "any"
# Perform the replacement
interviewLogs$modifiedFields <- str_replace_all(interviewLogs$modifiedFields, uuid_pattern, replacement)

```

Remove the non-participant responses from `modifiedFields`; those are the fields that contain an underscore immediately after a dot (`._`). Rows left with no fields are dropped.
```{r}
logsWithoutMetadata <- interviewLogs %>%
rowwise() %>%
mutate(
fields = str_split(modifiedFields, "\\|") %>%
unlist() %>%
.[!str_detect(., "\\._")] %>%
paste(collapse = "|")
) %>%
filter(fields != "")
```

Group the rows so that there is only one row per id/timestamp pair. Their fields are concatenated, then only unique values are kept and sorted, to make sure the order is always the same.
```{r}
oneRowPerTs <- logsWithoutMetadata %>%
group_by(id, uuid, timestamp) %>%
summarise(
fields = paste(fields, collapse = "|"),
.groups = 'drop'
) %>%
  ungroup() %>%
rowwise() %>%
mutate(
fields = str_split(fields, "\\|") %>%
unlist() %>%
unique() %>%
sort() %>%
paste(collapse = "|")
)
```

## Abandon analysis

Get the rows with the maximum timestamp for each interview ID (keeping all of them if several share the maximum), then split the concatenated fields into one row per field. The fields were sorted alphabetically in the grouping step above, so that unordered fields are recognized as the same.

The resulting table will contain the last fields that were answered by the participant, with the timestamp.
```{r}
lastAnsweredResponse <- oneRowPerTs %>%
group_by(id) %>%
filter(timestamp == max(timestamp)) %>%
ungroup() %>%
separate_rows(fields, sep = "\\|")
```

Count the number of times each field was the last one answered by a participant, and order by descending count.
```{r}
countFieldsLastAnswered <- lastAnsweredResponse %>%
group_by(fields) %>%
summarise(count = n()) %>%
arrange(desc(count))

# Print the resulting data frame
print(countFieldsLastAnswered)
```
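
The counts can also be visualized with the `ggplot2` package loaded above. A minimal sketch, not part of the original pipeline; the top-20 cutoff is an arbitrary choice:

```{r}
# Plot the 20 fields that were most often the last ones answered.
# Assumes countFieldsLastAnswered from the previous chunk.
countFieldsLastAnswered %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(fields, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Field", y = "Times answered last")
```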

Save this data to file

```{r}
filePath <- file.path(folder, "lastResponses.csv")
write.csv(countFieldsLastAnswered, file = filePath, row.names = FALSE)
```

## Answer time analysis

Add a column containing the time difference between each row's timestamp and the previous one within the same interview.
```{r}

# Ensure the timestamp column is in a proper date-time format
oneRowPerTs$timestamp <- as.POSIXct(oneRowPerTs$timestamp)

# Process the data frame
timesToAnswer <- oneRowPerTs %>%
group_by(id) %>%
arrange(id, timestamp) %>%
mutate(timestamp_diff = timestamp - lag(timestamp)) %>%
ungroup() %>%
filter(!is.na(timestamp_diff)) %>%
separate_rows(fields, sep = "\\|")

```


```{r}
# Calculate statistics for each value of the 'fields' column
statistics <- timesToAnswer %>%
group_by(fields) %>%
summarise(
count = n(),
mean_diff = mean(timestamp_diff, na.rm = TRUE),
median_diff = median(timestamp_diff, na.rm = TRUE),
sd_diff = sd(timestamp_diff, na.rm = TRUE),
min_diff = min(timestamp_diff, na.rm = TRUE),
max_diff = max(timestamp_diff, na.rm = TRUE)
)

# Print the resulting statistics data frame
print(statistics)
```
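
To spot the slowest questions at a glance, the median times can be plotted. A minimal sketch assuming the `statistics` frame from the previous chunk, with the `difftime` values converted to seconds:

```{r}
# Plot the 20 fields with the highest median time to answer.
# median_diff is a difftime; convert it to seconds for plotting.
statistics %>%
  arrange(desc(median_diff)) %>%
  slice_head(n = 20) %>%
  mutate(median_secs = as.numeric(median_diff, units = "secs")) %>%
  ggplot(aes(x = reorder(fields, median_secs), y = median_secs)) +
  geom_col() +
  coord_flip() +
  labs(x = "Field", y = "Median time to answer (seconds)")
```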

Save those stats to file

```{r}
filePath <- file.path(folder, "statisticsPerResponse.csv")
write.csv(statistics, file = filePath, row.names = FALSE)
```

## Interview duration analysis

For old interviews (pre-2025), the log timestamps are the best data from which to get the interview duration: the actual startedAt and completedAt timestamps may come either from the browser or from the server, while the logs' timestamps always come from the server. We therefore calculate each interview's duration from its first and last log entries.
```{r}
# Calculate the interview duration
interviewDurations <- oneRowPerTs %>%
group_by(id, uuid) %>%
summarise(
duration = max(timestamp) - min(timestamp),
.groups = 'drop'
)

# Print the resulting data frame
summary(interviewDurations)
```
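
A histogram gives a quick view of the duration distribution. A minimal sketch, assuming `duration` is a `difftime` as computed above; the 5-minute bin width is an arbitrary choice:

```{r}
# Histogram of interview durations, converted to minutes.
interviewDurations %>%
  mutate(duration_min = as.numeric(duration, units = "mins")) %>%
  ggplot(aes(x = duration_min)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Interview duration (minutes)", y = "Number of interviews")
```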

```{r}

# Keep the record with the maximum timestamp_diff value for each interview id
maxDiffPerInterview <- timesToAnswer %>%
group_by(id) %>%
filter(timestamp_diff == max(timestamp_diff)) %>%
ungroup()

# Merge interviewDurations and maxDiffPerInterview on the uuid field
mergedData <- merge(interviewDurations, maxDiffPerInterview, by = "uuid")

# Print the resulting data frame
summary(mergedData)

```
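
One way to read the merged table is to compute what share of each interview's total duration is taken by its single longest pause. A minimal sketch; the `max_pause_share` column is illustrative and not part of the original data:

```{r}
# Share of the total interview duration taken by the longest pause.
# Both columns are difftime values, converted to seconds; interviews
# with a zero duration are dropped to avoid dividing by zero.
pauseShare <- mergedData %>%
  filter(as.numeric(duration, units = "secs") > 0) %>%
  mutate(max_pause_share = as.numeric(timestamp_diff, units = "secs") /
           as.numeric(duration, units = "secs"))

summary(pauseShare$max_pause_share)
```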
