\documentclass[12pt]{article}
\usepackage{times}
\RequirePackage{natbib}
\usepackage{amsmath, amssymb, amsthm, array, graphicx,asa}
\graphicspath{{images/}}
\usepackage{setspace, url}
\doublespacing
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\setlength{\textwidth}{6.5in}
\setlength{\topmargin}{-0.4in}
\setlength{\textheight}{9in}
\evensidemargin
\oddsidemargin
\newtheorem{thm}{Theorem}[section]
\newtheorem{dfn}{Definition}[section]
\newtheorem{cor}{Corollary}[thm]
\newtheorem{con}{Conjecture}[thm]
\newtheorem{lemma}[thm]{Lemma}
\pdfminorversion=4 % as instructed by JASA file upload
\begin{document}
% Article top matter
\title{Validation of Visual Statistical Inference, Applied to Linear Models}
\author{{Mahbubul Majumder, Heike Hofmann, Dianne Cook}
\thanks{Mahbubul Majumder is a PhD student (e-mail: [email protected]), Heike Hofmann is an Associate Professor, and Dianne Cook is a Professor in the Department of Statistics and Statistical Laboratory, Iowa State University, Ames, IA 50011-1210. This research is supported in part by the National Science Foundation Grant \# DMS 1007697.}}
\date{\vspace{-.7in}}
%\date{\today} %\today is replaced with the current date
\maketitle
\begin{abstract}
Statistical graphics play a crucial role in exploratory data analysis, model checking and diagnosis. The lineup protocol enables statistical significance testing of visual findings, bridging the gulf between exploratory and inferential statistics. In this paper inferential methods for statistical graphics are developed further by refining the terminology of visual inference, and by framing the lineup protocol in a context that allows direct comparison with conventional tests in scenarios where a conventional test exists. This framework is used to compare the performance of the lineup protocol against conventional statistical testing in the scenario of fitting linear models. A human subjects experiment is conducted using simulated data to provide controlled conditions. Results suggest that the lineup protocol performs comparably with the conventional tests and, as expected, outperforms them when the data is contaminated, a scenario in which the assumptions required for a conventional test are violated. Surprisingly, visual tests have higher power than the conventional tests when the effect size is large. And, interestingly, there may be some super-visual individuals who yield better performance and power than the conventional test even in the most difficult tasks.
{\bf Keywords: \sf statistical graphics, lineup, non-parametric test, data mining, visualization, exploratory data analysis, practical significance, effect size}
\end{abstract}
\section{Introduction}
Statistical graphics nourish the discovery process in data analysis by revealing the unexpected, uncovering structure that was not previously anticipated, or contradicting prevailing hypotheses. The area of graphics is often associated with exploratory data analysis, which was pioneered by \cite{tukey:eda} and is particularly pertinent in today's data-rich world, where discovery during data mining has become an important activity. Graphics are also used in many places where numerical summaries simply do not suffice: model checking, diagnosis, and the communication of findings.
Several new developments in graphics research have been achieved in recent years. Early studies on evaluating how well statistical plots are perceived and read by the human eye \citep{cleveland:1984} have been repeated and expanded \citep{simkin:1987,spence:1991, heer:2010}, with findings supporting the original results. The research by \cite{heer:2010} used subjects recruited from Amazon's Mechanical Turk \citep{turk} for their studies. This body of work provides a contemporary framework for evaluating new statistical graphics. In a complementary direction, new research on formalizing statistical graphics with language characteristics makes it easier to abstractly define, compare and contrast data plots. \cite{wilkinson:1999} developed a grammar of graphics, which was enhanced by \cite{hadley:2009}. These methods provide a mechanism for abstracting the way data is mapped to graphical form. Finally, technological advances make it easy for everyone to draw plots of data; in particular, software systems such as R \citep{R} enable beautiful data graphics that can be tightly coupled with statistical modeling.
However, measuring the strength of patterns seen in plots, and differences in individual perceptual ability, is difficult, and this perhaps handicaps the use of graphics among statisticians, for whom measuring probabilities is of primary importance. This has also been addressed in recent research. \citet{buja:2009} propose protocols that allow the testing of discoveries made from statistical graphics. This work represents a major advance for graphics, because it bridges the gulf between conventional statistical inference procedures and exploratory data analysis. One of the protocols, the lineup, places the actual data plot among a page of plots of null data, and asks a human judge to pick the plot that is different. Figure \ref{fig:test_category} shows an example lineup. Which plot do you think is the most different from the others? (The position of the actual data plot is provided in Section \ref{sec:category}.) Wrapped in a process that mirrors conventional inference, where there is an explicit, a priori, null hypothesis, picking the plot of the data from among the null plots represents a rejection of that null hypothesis. The null hypothesis typically derives from the task at hand, or the type of plot being made. The alternative encompasses all possible antitheses: all types of patterns that might be detected in the actual data plot, accounting for all possible deviations from the null without the requirement to specify these ahead of time. The probability of rejection can be quantified, along with Type I and Type II errors, and the $p$-value and power can be defined and estimated.
\begin{figure}[htp]
\centering
\includegraphics[width=0.95\textwidth]{plot_turk1_300_10_12_1.pdf}
\caption{Lineup plot ($m=20$) using side-by-side boxplots for testing $H_0: \beta_k=0$. One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes $H_0$ is true. Which plot is the most different from the others, in the sense that there is the largest shift or location difference between the boxplots? (The position of the actual data plot is provided in Section \ref{sec:category}.)}
\label{fig:test_category}
\end{figure}
The protocol has only been informally tested until now. In the work described in this paper, the lineup protocol is compared head-to-head with the equivalent conventional test. Specifically, the lineup is examined in the context of a linear model setting, where we are determining the importance of including a variable in the model. This is not the envisioned environment for the use of the lineup -- it is, in fact, likely the worst-case scenario for visual inference. The intended use of lineups is in situations where there is no existing test, and likely never will be a numerical test. The thinking, though, is that the conventional setting provides a benchmark for how well the lineup protocol works under controlled conditions, and will provide some assurance that it will work in scenarios where there is no benchmark. Testing is done with a human-subjects experiment using Amazon's Mechanical Turk \citep{turk}, using simulation to provide controlled conditions for assessing lineups. The results are compared with those of the conventional test.
The paper is organized as follows. Section \ref{sec:visual_test} defines terms as used in visual inference, and describes how to estimate the important quantities from experimental data. The effect of the lineup size and number of observers on the power of the test is discussed in Section \ref{sec:size}. Section \ref{sec:regression} focuses on the application of visual inference to linear models. Section \ref{sec:simulation} describes three user studies based on simulation experiments conducted to compare the power of the lineup protocol with the equivalent conventional test and Section \ref{sec:results} presents an analysis of the resulting data.
\section{Definitions and Explanations for Visual Statistical Inference} \label{sec:visual_test}
An illustration of the lineup protocol in relation to conventional hypothesis testing is presented in Table \ref{tbl:compare}. Both methods start from the same place: the same set of hypotheses. The conventional test statistic is the $t$-statistic, where the parameter estimate is divided by its standard error. In the lineup protocol, the test statistic is a plot of the data. Here, side-by-side boxplots are used, because the variable of interest is categorical and takes just two values. In conventional hypothesis testing the value of the test statistic is compared with all possible values of the sampling distribution, the distribution of the statistic if the null hypothesis is true. If it is extreme on this scale, then the null hypothesis is rejected. In contrast, in visual inference, the plot of the data is compared with a set of plots of samples drawn from the null distribution. If the actual data plot is selected as the most different, then this results in rejection of the null hypothesis.
\begin{table*}[hbtp]
\caption{Comparison of visual inference with conventional inference.}
\centering
\begin{tabular}{llll}
\hline % \hline
& Conventional Inference & Lineup Protocol \\ %[0.5ex] % inserts table %heading
\hline
Hypothesis & $H_0: \beta=0$ vs $H_1: \beta > 0$& $H_0: \beta=0$ vs $H_1: \beta > 0$\\
& \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} & \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} \\
Test statistic & $T(y)=\frac{\hat{\beta}}{se(\hat{\beta})}$ & $T(y)=$ \begin{minipage}[h]{1cm} \begin{center} \scalebox{0.45}{\includegraphics{stat_category.pdf}} \end{center} \end{minipage} \\
& \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} & \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} \\
Sampling Distribution & $f_{T(y)}(t); $\begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.55}{\includegraphics{stat_mathematical_test.pdf}} \end{center} \end{minipage} & $f_{T(y)}(t); $ \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.32}{\includegraphics{lineup_category_small.pdf}} \end{center} \end{minipage} \\
& \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} & \begin{minipage}[h]{2.5cm} \begin{center} \scalebox{0.2}{\includegraphics{down_arrow.pdf}} \end{center} \end{minipage} \\
Reject $H_0$ if & actual $T$ is extreme & actual plot is identifiable \\
\hline
\end{tabular}
\label{tbl:compare}
\end{table*}
In general, we define $\theta$ to be a population parameter of interest, with $\theta \in \Theta$, the parameter space. Any null hypothesis $H_0$ then partitions the parameter space into $\Theta_0$ and $\Theta_0^c$, with $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$. A test statistic, $T(y)$, is a function that maps the sample into a numerical summary that can be used to test the null hypothesis. The hypothesis test maps the test statistic into \{0, 1\}, based on whether $T(y)$ falls into the acceptance region or the rejection region, respectively. $T(y)$ is assessed relative to null values of this statistic, $T(y_0)$, the possible values of $T$ if $\theta \in \Theta_0$.
For visual inference, unlike in the conventional hypothesis test, the statistic is not a single value, but a graphical representation of the data chosen to display the strength of the parameter of interest, $\theta$. When the alternative hypothesis is true, it is expected that the plot of the actual data, the test statistic, will have visible feature(s) consistent with $\theta \in \Theta_0^c$, and that visual artifacts will not distinguish the test statistic as different when $H_1$ is not true. We will call a plot with this property a {\it visual statistic} for $\theta$. More formally,
\begin{dfn} \label{dfn:test}
A \textbf{visual test statistic}, $T(.)$, is a function of a sample that produces a plot. $T(y)$ maps the actual data to the plot, and we call this the \textbf{(actual) data plot}; $T(y_0)$ maps a sample drawn from the null distribution into the same plot form. Plots of this type are called \textbf{null plots}.
\end{dfn}
\noindent Ideally, the visual test statistic is defined and constructed using the grammar of graphics \citep{wilkinson:1999,hadley:2009}, consisting of type and specification of aesthetics, necessary for complete reproducibility. The visual test statistic is compared with values $T(y_0)$ using a lineup, which is defined as:
\begin{dfn}\label{dfn:lplot}
A \textbf{lineup} is a layout of $m$ randomly placed visual statistics, consisting of
\begin{itemize}\itemsep-10pt
\item $m-1$ statistics, $T(y_0)$, simulated from the model specified by $H_0$ (null plots) and
\item the test statistic, $T(y)$, produced by plotting the actual data, possibly arising from $H_1$.
\end{itemize}
\end{dfn}
\noindent The $(m-1)$ null plots are members of the sampling distribution of the test statistic assuming that the null hypothesis is true. If $H_1$ is true, we expect this to be reflected as a feature in the test statistic, i.e. the plot of the data, that makes it visually distinguishable from the null plots. A careful visual inspection of the lineup by independent observers follows; observers are asked to point out the plot most different from the others in the lineup. If the test statistic is identified in the lineup, this is considered evidence against the null hypothesis. This leads us to a definition of the $p$-value of a lineup: under the null hypothesis, each observer has a $1/m$ chance of picking the test statistic from the lineup. For $K$ independent observers, let $X$ be the number of observers picking the test statistic from the lineup. Under the null hypothesis $X \sim \text{Binom}_{K, 1/m}$, therefore:
\begin{dfn}\label{dfn:pvalue}
The $p$-value of a lineup of size $m$ evaluated by $K$ observers is given as
$$
P(X \ge x) = 1 - \text{Binom}_{{K, 1/m}} (x-1) = \sum_{i=x}^{K} {K \choose i} \left(\frac{1}{m}\right)^{i} \left(\frac{m-1}{m}\right)^{K-i}
$$
\noindent with $X$ defined as above, and $x$ is the number of observers selecting the actual data plot.
\end{dfn}
\noindent Note that for $x=0$ the $p$-value is, mathematically, equal to 1. From a practical point of view, it might make more sense to think of the $p$-value as being larger than $P(X \ge 1)$ in this situation. By increasing either $m$ or $K$, the value can be determined at a higher precision. Table \ref{pvalue} shows $p$-values for different numbers of observers for lineups of size $m = 20$; a computational sketch follows the table.
\begin{table}[htp]
\caption{Possible $p$-values for different numbers of observers, $K$, for fixed size $m = 20$ lineups.}
\begin{center}
\scalebox{.9}{
\begin{tabular}{|rrr|rrr|rrr|rrr|rrr|}\hline
$K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value & $K$ & $x$ & $p$-value\\ \hline
1 & 1 & 0.0500 & 2 & 1 & 0.0975 & 3 & 1 & 0.1426 & 4 & 1 & 0.1855 & 5 & 1 & 0.2262 \\%\cline{1-3}
&&& 2 & 2 & 0.0025 & 3 & 2 & 0.0073 & 4 & 2 & 0.0140 & 5 & 2 & 0.0226 \\
&&& &&& 3 & 3 & 0.0001 & 4 & 3 & 0.0005 & 5 & 3 & 0.0012 \\%\cline{1-6}
&&& & & & & && 4 & 4 & $< 0.0001$ & 5 & 4 & $< 0.0001$ \\\hline
\end{tabular}
}
\end{center}
\label{pvalue}
\end{table}
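The entries of Table \ref{pvalue} are simple binomial tail probabilities. The following minimal sketch, written in R (the software used for our computations), computes the $p$-value of Definition \ref{dfn:pvalue} directly from $x$, $K$ and $m$:
\begin{verbatim}
# p-value of a lineup: P(X >= x) for X ~ Binom(K, 1/m)
lineup_pvalue <- function(x, K, m = 20) {
  1 - pbinom(x - 1, size = K, prob = 1/m)
}
lineup_pvalue(x = 2, K = 5)  # 0.0226, the K = 5, x = 2 entry above
\end{verbatim}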
\begin{dfn}\label{dfn:visualtest}
The \textbf{visual test}, $V_{\theta}$, of size $m$ and significance level $\alpha$, is defined to
\begin{itemize}\itemsep-10pt
\item \textbf{Reject} $H_0$ if out of $K$ observers at least $x_{\alpha}$ correctly identify the actual data plot, and
\item \textbf{Fail to reject} $H_0$ otherwise.
\end{itemize}
where $x_{\alpha}$ is such that $P(X \ge x_{\alpha}|H_0) \le \alpha$.
\end{dfn}
\noindent
Associated with any test there is the risk of Type I or II errors, which for visual inference are defined as follows:
\begin{dfn}\label{dfn:error}
The \textbf{Type I error} associated with visual test $V_{\theta}$ is the probability of rejecting $H_0$ when it is true; the probability for that is $P(X \ge x_{\alpha})$, which is controlled by $\alpha$. The \textbf{Type II error} is the probability of failing to identify the actual data plot, when $H_0$ is not true, $P( X < x_{\alpha})$.
\end{dfn}
\noindent Because $X$ takes only discrete values, we cannot always control for $\alpha$ exactly.
For example, when there is only one observer, $1/m$ is the minimal value at which we can set $\alpha$. It can be made smaller, even arbitrarily small, by increasing $K$, the number of observers.
Type II error is harder to calculate, as is usually the case. In visual inference, individual abilities need to be incorporated to calculate the Type II error.
Here, we need to estimate the probability that an observer sees the actual data plot as different, when it really is different. This involves understanding the individual's visual skills. Thus, let $X_i$ be a binary random variable with $X_i = 1$, if individual $i \ (=1, \dots , K)$ identifies the actual data plot from the lineup, and $X_i = 0$ otherwise. Let $p_i$ be the probability that individual~$i$ picks out the actual data plot. If all individuals have the same ability, with the probability, $p$, for picking out the actual data plot, then $X = \sum_i X_i$ has distribution $\text{Binom}_{K, p}$, and we can estimate $p$ by $\hat{p} = x/K$, where $x$ is the number of observers (out of $K$), who pick out the actual data plot.
If there is evidence for individual skills influencing the probability $p_i$, then $X_i \sim Binom_{1, p_i}$ and $X$ is a sum of independent Bernoulli random variables with different success rates $p_i$. This makes the distribution of $X$ a Poisson-Binomial by definition (see \citet{butler93} for details). Ways to estimate $p_i$ will be discussed in the following sections.
\begin{dfn} \label{dfn:power}
The \textbf{power} of a visual test, $V_{\theta}$, is defined as the probability of rejecting the null hypothesis for a given parameter value $\theta$:
\begin{equation*}
\text{Power}_V(\theta)= Pr(\text{Reject } H_0 \mid \theta)
\end{equation*}
\end{dfn}
\noindent An important difference between conventional and visual testing is that lineups depend on observers' evaluations. Thus $X$, the number of observers who identify the actual data plot from the lineup, enters the estimation of power, and the power is estimated by
\[
\widehat{Power}_{V} (\theta) = {Power}_{V, K} (\theta) = 1 - F_{X, \theta} (x_{\alpha} - 1).
\]
Here $F_{X, \theta}$ is the distribution of $X$ and $x_\alpha$ is such that $P(X \ge x_{\alpha}) \le \alpha$. Note that the distribution $F_X$ depends on which hypothesis is true:
under the null hypothesis, $X \sim Binom_{K, 1/m}$, leading to:
\[
Power_V(\theta, K)= 1 - Binom_{K, 1/m} (x_\alpha - 1).
\]
If the alternative hypothesis is true, with a fixed parameter value $\theta$, we can assume that an individual's probability of identifying the data plot depends on the parameter value, and $X_i \sim Binom_{1, p_i(\theta)}$. Assessing an individual's skill in identifying the actual data plot requires that the individual evaluate multiple lineups.
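To make these calculations concrete, the following R sketch finds $x_\alpha$ from the null distribution $Binom_{K, 1/m}$ and evaluates $P(X \ge x_\alpha)$ under the alternative, using the Poisson-Binomial distribution of $X$ when the $p_i$ differ across observers. The success probabilities in the example call are illustrative assumptions, not estimates from our data.
\begin{verbatim}
# pmf of X = sum of independent Bernoulli(p_i) (Poisson-Binomial),
# built up one observer at a time; pmf[x + 1] = P(X = x).
poibin_pmf <- function(p) {
  pmf <- 1
  for (p_i in p) pmf <- c(pmf * (1 - p_i), 0) + c(0, pmf * p_i)
  pmf
}
# Power of the visual test: smallest x_alpha with
# P(X >= x_alpha | H0) <= alpha, then P(X >= x_alpha | p_1, ..., p_K).
visual_power <- function(p, m = 20, alpha = 0.05) {
  K <- length(p)
  tail_H0 <- 1 - pbinom((0:K) - 1, K, 1/m)  # P(X >= x | H0), x = 0..K
  x_alpha <- (0:K)[min(which(tail_H0 <= alpha))]
  sum(poibin_pmf(p)[(x_alpha + 1):(K + 1)])
}
visual_power(p = c(0.9, 0.7, 0.8, 0.6, 0.9))  # illustrative p_i, K = 5
\end{verbatim}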
Power is an important consideration in deciding which test to use for solving a problem. Here we use it to compare the performance of the visual test with the conventional test, but in practice for visual inference it will mostly be important in choosing plots to use. Analysts typically have a choice of plots to make, and a myriad of possible options such as reference grids, for any particular purpose. This is akin to different choices of statistics in conventional hypothesis testing, for example, mean, median, or trimmed mean. One is typically better than another. For two different visual test statistics of the same actual data, one is considered to be better, if $T(y)$ is more easily distinguishable to the observer. Power is typically used to measure this characteristic of a test.
\section{Effect of Observer Skills and Lineup Size}\label{sec:size}
\subsection{Subject-specific abilities}\label{sec:model}
Suppose each of $K$ independent observers evaluates multiple lineups, and responses are considered to be binary random variables, $X_{\ell i} \sim Binom_{1, p_{\ell i}}$, where $X_{\ell i} = 1$ if subject $i$ correctly identifies the actual data plot on lineup $\ell$, $1 \le \ell \le L$, and 0 otherwise. A mixed effects logistic regression model is used for $P(X_{\ell i} = 1) = p_{\ell i} = E(X_{\ell i})$, accommodating both the different abilities of observers and differences in the difficulty of lineups.
The model can be fit as:
\begin{equation} \label{eqn:mixed}
g( p_{\ell i} )= W_{\ell i} \delta + Z_{\ell i} \tau_{\ell i},
\end{equation}
\noindent where $g(.)$ denotes the {\it logit} link function $g(\pi)=\log(\pi) - \log(1-\pi); 0 \le \pi \le 1$.
$W$ is a design matrix of covariates corresponding to specifics of lineup $\ell$ and subject $i$, and $\delta$ is the vector of corresponding parameters. Covariates could include demographic information about individuals, such as age, gender, or education level, as well as lineup-specific elements, e.g. effect size or difficulty level.
%
$Z_{\ell i}$, $1 \le i \le K$, $1 \le \ell \le L$, is a design matrix corresponding to random effects specific to individual $i$ and lineup $\ell$; and
$\tau$ is a vector of independent normally distributed random variables $\tau_{\ell i}$ with variance matrix $\sigma^2_\tau I_{KL \times KL}$. $\tau$ will usually include a component incorporating an individual's ability or skill to evaluate lineups.
Note that $\tau_{\ell i }$ usually only includes a partial interaction; for a full interaction of subjects' skills and lineup-specific difficulty we would need replicates of the same subject evaluating the same lineup, which in practice is not feasible without losing independence.
\noindent The inverse {\it logit} link function, $g^{-1}(.)$, from Equation \ref{eqn:mixed} leads to the estimate of the subject- and lineup-specific probability of a successful evaluation by a single observer as
\begin{equation} \label{eqn:mixed_power}
\hat p_{\ell i} = g^{-1}(W_{\ell i} \hat {\delta} + Z_{\ell i} \hat {\tau}_{\ell i}).
\end{equation}
\subsection{Lineup size, $m$}
The finite number, $m-1$, of representatives of the null distribution used for comparison with the test statistic is a major difference between visual inference and conventional testing. The choice of $m$ has an obvious impact on the test.
The following properties can only be derived for the situation of a fully parameterized simulation study, as conducted in this paper. They allow for a direct comparison of lineup tests against their conventional counterparts, and also allow us to identify properties relevant for a quality assessment of lineups when they are used in practical settings. Two assumptions are critical:
\begin{enumerate} \itemsep 0in
\item the plot setup is structured in a way that makes it possible for an observer to identify a deviation from the null hypothesis,
\item an observer is able to identify the plot with the strongest `signal' (or deviation from $H_0$) from a lineup.
\end{enumerate}
Evidence in support of the second assumption will be seen in the data from the study discussed in Section~\ref{sec:simulation}; the degree to which the first assumption is fulfilled is reflected in the power of a lineup. The better suited a design is for a particular task, the higher its power will be.
In order to compare the power of conventional and visual tests side-by-side, it is necessary to assume that we are in the controlled environment of a simulation, with tests corresponding to a known parameter value $\theta \in \mathbb{R}$ and associated distribution function $F_t$ of the test statistic.
\begin{lemma}~\label{lemma}
Suppose $F_{|t|}(.)$ is the distribution function of the absolute value of $t$, the conventional test statistic, and suppose the test statistic is observed as $t_{obs}$ with $p$-value $p_D$.
The probability of picking the data plot from a lineup depends on the size $m$ of the lineup and on the strength of the signal in the data plot.
Under the above assumptions, this probability can be expressed as:
\[
P(p_D < p_0) = E\left[ (1 - p_D)^{m-1}\right]
\]
where $p_D$ is the $p$-value associated with the actual data plot, and $p_0$ is the minimum of the $p$-values of the $m-1$ null plots.
\end{lemma}
\begin{proof} The proof and further details of the lemma are given in the supplementary material.
\end{proof}
The above lemma allows two immediate conclusions for the use of lineups.
First, the probability that an observer correctly identifies the data plot is closely connected to the size $m$ of the lineup: since the right-hand side of the above equation decreases for larger $m$, the probability of correctly identifying the actual data plot decreases as $m$ grows. Second, the rate of this decrease depends strongly on the distribution of $p_D$: if the density of $p_D$ is very right-skewed, the expectation on the right-hand side will be large and less affected by an increase in $m$. This can also be seen in Figure \ref{fig:pval_power}, which illustrates Lemma \ref{lemma} by showing the probability of picking the actual data plot for lineups of different sizes: as $m$ increases, we have an increased probability of observing a more highly structured null plot by chance. It can also be seen that for a $p$-value, $p_D$, of about 0.15 for the data plot, the signal in the plot is so weak that it cannot be distinguished from null plots in a lineup of size $m=20$.
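Lemma \ref{lemma} can also be verified by simulation: under $H_0$, the $p$-values of the null plots are independent Uniform$(0,1)$, so both sides of the identity can be estimated by Monte Carlo. The R sketch below assumes, purely for illustration, a balanced two-group design like that of Figure \ref{fig:test_category}; the parameter values are arbitrary.
\begin{verbatim}
# Monte Carlo check of the lemma: estimate P(p_D < p_0) directly and
# compare with E[(1 - p_D)^(m - 1)].  Illustrative two-group design.
set.seed(1)
m <- 20; n <- 100; beta <- 5; sigma <- 12; reps <- 2000
p_D <- replicate(reps, {
  x <- rep(0:1, each = n / 2)
  y <- beta * x + rnorm(n, sd = sigma)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
p_0 <- apply(matrix(runif(reps * (m - 1)), nrow = reps), 1, min)
mean(p_D < p_0)          # empirical probability of picking the data plot
mean((1 - p_D)^(m - 1))  # the expectation from the lemma
\end{verbatim}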
\begin{figure}[htbp]
\centering
\includegraphics[width=3in]{images/powerplot.pdf}
\caption{Probability that the data plot has the smallest $p$-value in a lineup of size $m$. With increasing $p$-value the probability drops -- when it reaches $1/m$ a horizontal line is drawn to emphasize insufficient sensitivity of the test due to the lineup size. }
\label{fig:pval_power}
\end{figure}
\section{Application to Linear Models} \label{sec:regression}
To make these concepts more concrete, consider how this would operate in the linear model setting. Consider a linear regression model
\begin{equation}\label{multi} Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + \dots + \epsilon_i
\end{equation}
where $\epsilon_i \stackrel{iid}{ \sim } N(0,\sigma^2)$, $i=1,2,\dots, n$. The covariates ($X_j, j=1,\dots,p$) can be continuous or discrete.
In this setting, there are many established graphics that are used to evaluate and diagnose the fit of a regression model (e.g. \citealt{cook:99}). Table \ref{tbl:stat_multiple} lists several common hypotheses related to the regression setting, and commonly used statistical plots that might serve as the corresponding visual test statistics. For example, to examine the effect of variable $X_j$ on $Y$, we would plot residuals obtained from fitting the model without $X_j$ against $X_j$, or for a single covariate we may plot $Y$ against $X_j$ (cases 1--4 in Table~\ref{tbl:stat_multiple}). To assess whether the assumption of linearity is appropriate, we would draw a plot of residuals against fitted values (case 5 in Table~\ref{tbl:stat_multiple}). For the purpose of comparing visual against conventional inference, we focus on cases 2 and 3, with a continuous and a categorical explanatory variable, respectively.
\begin{table*}[hbtp]
\caption{Visual test statistics for testing hypotheses related to the model $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}X_{i2} + \dots + \epsilon_i $ }
\centering
\begin{tabular}{m{0.5cm}m{3cm}m{2cm}m{3cm}m{5.5cm}}
\hline\hline
Case & Null Hypothesis & Statistic & Test Statistic & Description \\ [0.5ex] % inserts table %heading
\hline
1 & $H_0: \beta_0=0$ & Scatter plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_intercept.pdf}}\end{minipage} & Scatter plot with least squares line overlaid. For null plots we simulate data from the fitted null model. \\
2 & $H_0: \beta_k=0$ & Residual plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_beta_k.pdf}}\end{minipage} & Residuals vs $X_k$ plots. For null plots we simulate data from a normal distribution with mean 0 and variance $\hat{\sigma}^2$. \\
3 & $H_0: \beta_k=0$ (for binary $X_k$) & Box plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_category.pdf}} \end{minipage} & Box plot of residuals grouped by category of $X_k$. For null plots we simulate data from a normal distribution with mean 0 and variance $\hat{\sigma}^2$. \\
4 & $H_0: \beta_k=0$ (interaction of continuous and binary $X_k$) & Scatter plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_interection.pdf}} \end{minipage} & Scatter plot with least squares lines of each category overlaid. For null plots we simulate data from the fitted null model.\\[1ex]
5 & $H_0: X$ Linear & Residual Plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_nonlinear.pdf}} \end{minipage} & Residuals vs predictor plots with loess smoother overlaid. For null plots we simulate residual data from a normal distribution with mean 0 and variance $\hat{\sigma}^2$. \\
6 & $H_0: \sigma^2=\sigma^2_0$ & Box plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_sigma_box.pdf}}\end{minipage} & Box plot of standardized residuals divided by $\sigma^2_0$. For null plots we simulate data from a standard normal distribution. \\
7 & $H_0: \rho_{X,Y|Z}=\rho$ & Scatter Plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_intercept.pdf}} \end{minipage} & Scatter plot of residuals obtained by fitting a partial regression. For null plots we simulate data (mean 0 and variance 1) with specified correlation $\rho$. \\
8 & $H_0:$ Model Fits & Histogram & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_goodness_simple.pdf}} \end{minipage} & Histogram of the response data. For null plots we simulate data from the fitted model. \\[1ex]
9 & For $p=1$ only: $H_0: \rho_{X,Y} =\rho$ & Scatter plot & \begin{minipage}[t]{3cm} \scalebox{0.4}{\includegraphics{stat_intercept.pdf}} \end{minipage} & Scatter plot with least squares line overlaid. For null plots we simulate data with correlation $\rho$.\\
\hline
\end{tabular}
\label{tbl:stat_multiple}
\end{table*}
Suppose $X_k$ is a categorical variable with two levels, and we test the hypothesis $H_0:\beta_k=0$ vs $H_1: \beta_k \ne 0$. If the responses for the two levels of the categorical variable $X_k$ in the model are different, the residuals from fitting the null model should show a significant difference between the two groups. For a visual test, we draw boxplots of the residuals conditioned on the two levels of $X_k$. If $\beta_k\ne 0$ the boxplots should show a vertical displacement.
The conventional test in this scenario uses $T= \hat{\beta}_k/ se(\hat{\beta}_k)$ and rejects the null hypothesis if $T$ is extreme on the scale of a $t$ distribution with $n-p$ degrees of freedom. It forms the benchmark against which we evaluate the visual test. To calculate what we might expect for the power of the visual test under perfect conditions, first assume that the observer is able to pick the plot with the smallest $p$-value from a lineup. This leads to the decision to reject $H_0$ when $p_{D} < p_0$, where $p_{D}$ is the conventional $p$-value, as detailed in Lemma \ref{lemma}. Thus the expected probability of rejection by a single observer ($K=1$) in this scenario is
\begin{equation}\label{power_exp}
p(\beta)=Pr(p_{D} < p_0) \quad \text{for} \quad \beta \ne 0
\end{equation}
Figure \ref{fig:power_expected} shows the power of the conventional test in comparison to the expected power of the visual test for different $K$ (number of observers), obtained using $p(\beta)$ from Equation \ref{power_exp}. Notice that the expected power of the visual test exceeds the power of the conventional test as $K$ increases and $\beta$ gets larger. Conversely, visual power is below conventional power for parameter values close to the null hypothesis. This is even more pronounced for a large number of observers. At the same time, the point of intersection between visual and conventional power approaches the value of the null hypothesis as the number of observers approaches infinity, leading to an asymptotically perfect power curve: zero at the null hypothesis and one at any alternative value. We observe this dichotomy of visual power in the power estimates based on the data collected from the user experiments, too. It features prominently in Figure \ref{fig:power_loess_effect}.
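The conventional power curve itself follows from the non-central $t$ distribution. A minimal R sketch, assuming for illustration a continuous covariate with $X_1 \sim N(0,1)$, so that $se(\hat\beta) \approx \sigma/\sqrt{n}$, is:
\begin{verbatim}
# Two-sided t-test power for the slope, assuming X ~ N(0,1) so that the
# noncentrality parameter is approximately sqrt(n) * beta / sigma.
conventional_power <- function(beta, n = 100, sigma = 12, alpha = 0.05) {
  ncp <- sqrt(n) * beta / sigma
  tc  <- qt(1 - alpha / 2, df = n - 2)
  pt(-tc, df = n - 2, ncp = ncp) + 1 - pt(tc, df = n - 2, ncp = ncp)
}
conventional_power(beta = seq(0, 16, by = 2))
\end{verbatim}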
\begin{figure}[hbtp]
\centering
\scalebox{.6}{\includegraphics{power_expected_k.pdf}}
\caption{Comparison of the expected power of a visual test of size $m=20$ for different $K$ (number of observers) with the power of the conventional test, for $n =100$ and $\sigma = 12$.
}
\label{fig:power_expected}
\end{figure}
\section{Human Subjects Experiments with Simulated Data} \label{sec:simulation}
Three experiments were conducted to evaluate the effectiveness of the lineup protocol relative to the equivalent test statistic used in the regression setting. The first two experiments have ideal scenarios for conventional testing, where we would not expect the lineup protocol to do better than the conventional test. The third experiment is a scenario where assumptions required for the conventional test are violated, and we would expect the lineup protocol to outperform the conventional test. (Data and lineups used in the experiments are available in the supplementary material.)
After many small pilot studies with local personnel, it was clear that some care was needed to set up the human subjects experiments. It was best for an observer (subject) to see a block of 10 lineups of varying difficulty, with a reasonable number of ``easy'' lineups. The description of each experiment (below) includes an explanation of how the lineups were sampled and presented to the subjects.
Participants for all the experiments were recruited through Amazon's online web service, Mechanical Turk \citep{turk}. A summary of the data obtained for all three experiments is shown in Table \ref{tbl:summary}. Participants were asked to select the plot they thought best matched the question given, provide a reason for their choice, and say how confident they were in their choice. Gender, age, education and geographic location of each participant were also collected.
For each of the experiments one of the lineups was used as a test plot (an easy plot) which everyone should get correct, so that a measure of the quality of each subject's effort could be made. Note that no participant was shown the same lineup twice.
\subsection{Discrete covariate}\label{sec:category}
The experiment is designed to study the ability of human subjects to detect the effect of a single categorical variable $X_2$ (corresponding to parameter $\beta_2$) in a two-variable ($p=2$) regression model (Equation \ref{multi}). Data is simulated using a range of slope values $\beta_2$, as shown in Table \ref{tbl:experiment_params}, two different sample sizes ($n=100, 300$) and two standard deviations of the error ($\sigma=5, 12$). The range of $\beta_2$ values was chosen so that estimates of the power would produce reasonably continuous power curves, comparable to those calculated for the theoretical conventional test. Values were fixed for the other regression parameters, $\beta_0 = 5$ and $\beta_1=15$, and the values for $X_1$ were randomly generated from a Poisson $(\lambda=30)$ distribution, which is almost Gaussian. Three data sets were generated for each of the parameter combinations shown in Table \ref{tbl:experiment_params}, resulting in 60 different ``actual data sets'', and thus, 60 different lineups. For each lineup, the null model was fit to the actual data set to obtain residuals and parameter estimates. The actual data plot was drawn as side-by-side boxplots of the residuals (Table \ref{tbl:stat_multiple}, case 3). The 19 null data sets were generated by simulating from $N(0, {\hat{\sigma}}^2)$, and plotted in the same way. The actual data plot was randomly placed among these null data plots to produce the lineup. Figure \ref{fig:test_category} is an example of one of these lineups. It was generated for $n=300$, $\beta_2=10$ and $\sigma=12$. The actual data plot location is ($4^2-1$). For this lineup, 15 out of 16 observers picked the actual data plot.
The number of evaluations required for each lineup to provide reasonable estimates of the proportion correct ($\hat p$) is determined by the variance of the number of correct evaluations. Suppose $\gamma$ denotes the conventional test power for each parameter combination shown in Table \ref{tbl:experiment_params}. Since the expected power of visual inference is very close to the power of the conventional test (Figure \ref{fig:power_expected} with $K=1$), we set $\gamma = p$.
For a given proportion $\gamma$ it is desired to have a margin of error (ME) of at most 0.05. Thus we have $ME =1.96 \sqrt{ \gamma(1-\gamma) / n_{\gamma} } \le 0.05$, which gives the minimum number of evaluations $$n_{\gamma} \geq \frac{\gamma(1-\gamma)}{(0.05/1.96)^2}.$$
Each subject viewed at least 10 lineups, with the option to evaluate more. Depending on the parameter combinations, we grouped the lineups into difficulty levels: easy, medium, hard and mixed (actual numbers are given in the supplementary material). For each difficulty level a specific number of lineups was randomly picked for evaluation; this number was chosen so that the total number of evaluations of each lineup in that group exceeded the threshold $n_{\gamma}$, which evaluates as shown below. To satisfy this plan we needed to recruit at least 300 subjects.
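The threshold is a direct transcription of the margin-of-error inequality above; for example, in R:
\begin{verbatim}
# Minimum evaluations per lineup for a margin of error <= 0.05:
n_gamma <- function(gamma) ceiling(gamma * (1 - gamma) / (0.05 / 1.96)^2)
n_gamma(c(0.1, 0.3, 0.5))  # 139 323 385
\end{verbatim}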
\begin{table}[hbtp]
\caption{Combination of parameter values, $\beta_2$, $n$ and $\sigma$, used for the simulation experiments.}
\centering
\scalebox{.9}{
\begin{tabular}{c c c c c}
% creating 10 columns
\hline\hline
% inserting double-line
& & \multicolumn{3}{c}{Slope ($\beta$)} \\
\cline{3-5}
Sample size & Error SD & \multicolumn{1}{c} {Experiment 1} & \multicolumn{1}{c} {Experiment 2} & \multicolumn{1}{c} {Experiment 3} \\
($n$) & ($\sigma$) & Discrete covariate & Continuous covariate & Contaminated data
\\ [0.5ex]
\hline
% inserts single-line
% Entering 1st row
& \phantom{0}5 & 0, 1, 3, 5, 8 & 0.25, 0.75, 1.25, 1.75, 2.75 & 0.1, 0.4, 0.75, 1.25, 1.5, 2.25\\[-1ex]
\raisebox{1.5ex}{100} &12
& 1, 3, 8, 10, 16 & 0.5, 1.5, 3.5, 4.5, 6 &\\[1ex]
% Entering 2rd row
& \phantom{0}5 & 0, 1, 2, 3, 5 & 0.1, 0.4, 0.7, 1, 1.5&\\[-1ex]
\raisebox{1.5ex}{300} & 12
& 1, 3, 5, 7, 10 & 0, 0.8, 1.75, 2.3, 3.5 &\\[1ex]
% [1ex] adds vertical space
\hline
% inserts single-line
\end{tabular}
}
\label{tbl:experiment_params}
\end{table}
\subsection{Continuous covariate} \label{sec:continuous}
This experiment is very similar to the previous one, except that there is a single continuous covariate and no second covariate (Equation \ref{multi} with $p=1$), following the test in Table \ref{tbl:stat_multiple}, case~2. Data is simulated with two sample sizes ($n=100, 300$), two standard deviations of the error ($\sigma=5, 12$), a variety of slopes ($\beta$), as given in Table \ref{tbl:experiment_params}. We arbitrarily set $\beta_0 = 6$ and values for $X_1$ are simulated from $N(0,1)$. For each combination of parameters, at least three different actual data sets are produced, yielding a total of 70 lineups.
The actual data plot is generated by making a scatterplot of $Y$ vs $X_1$ with the least squares regression line overlaid. To produce the null plots in the lineup, null data was simulated from $N(X \hat{\beta}, {\hat{\sigma}}^2)$ and plotted using the same scatterplot method as the actual data.
To select 10 lineups for a subject, each combination of sample size ($n$) and error SD ($\sigma$) is given a difficulty value based on the slope ($\beta$) parameters. For the smallest slopes the difficulty is 4 (hardest) and for the largest slopes the difficulty is 0 (easiest).
Figure \ref{fig:test_continuous} shows an example lineup for this experiment from difficulty level 4. This lineup is generated using a sample size ($n$) of 100, slope ($\beta$) of 1.25 and error SD ($\sigma$) of 5. The actual data plot location is $(2^2+1)$. None of the 65 observers picked the actual plot, while 46 observers picked plot 18, which has the lowest $p$-value among all the plots in this lineup.
For each combination of sample size and standard deviation, each participant is given five randomly selected lineups, one of each difficulty level. Another set of four lineups is chosen from a second tier of selected combinations of sample size and standard deviation, with difficulty levels 0 to 3. A last lineup was randomly selected from a set of lineups with difficulty level 0. The order in which the lineups are shown to participants is randomized.
\begin{figure}[p]
\centering
\includegraphics[width=0.95\textwidth]{plot_turk2_100_25_5_3.pdf}
\caption{Lineup plot ($m=20$) using scatter plots for testing $H_0: \beta_k=0$ where covariate $X_k$ is continuous. One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes $H_0$ is true. Which plot is the most different from the others, in the sense that there is the steepest slope? (The position of the actual data plot is provided in Section \ref{sec:continuous}.)}
\label{fig:test_continuous}
\end{figure}
\subsection{Contaminated data} \label{sec:contamination}
The first two simulation experiments use data generated under a normal error model, satisfying the conditions for conventional test procedures. In these situations a conventional test exists, and there would, in general, be no need to use visual inference. The simulations were conducted in the hope that the visual test procedure would at least compare favorably with the conventional test, without any ambition of performing equally well.
This third simulation is closer to the mark for the purpose of visual inference. The assumptions for the conventional test are violated by contaminating the data. The contamination makes the estimated slopes effectively 0, even though the true value of the slope parameter is not. The data is generated from the following model:
\[
Y_i = \left\{
\begin{array}{l l}
\alpha+\beta X_i + \epsilon_i & \quad X_i \sim N(0,1) \quad i =1,...,n\\
\lambda+ \eta_i & \quad X_i \sim N(\mu,1/3) \quad i=1, ...,n_c\\
\end{array} \right.
\]
where $\epsilon_i \stackrel{iid}\sim N(0,\sigma)$, $\eta_i \stackrel{iid}\sim N(0,\sigma/3)$ and $\mu = -1.75$. $n_c$ is the size of the contaminated portion of the data. For the experiment we consider $n=100$ and $n_c=15$, producing actual data with $115$ points. Further, $\alpha=0$, $\lambda=10$, and $\sigma$ is chosen to be approximately 3.5, so that the error standard deviation across both groups of the data is $5$. A linear model (Equation \ref{multi} with $p=1$ and intercept $\beta_0=0$) is fit to the contaminated data. This experiment follows the test in Table \ref{tbl:stat_multiple}, case 2. The actual data plot shows a scatterplot of the residuals vs $X_1$, and the null plots are scatterplots of null data generated by plotting simulated residuals from $N(0, {\hat{\sigma}}^2)$ against $X_1$.
Experiment three consists of a total of 30 lineups, made up of five replicates for each of the six slopes as shown in Table \ref{tbl:experiment_params}. We use the slope directly as a measure of difficulty, with difficulty = 0 for the largest slope and difficulty = 5 for the smallest slope. Subjects were exposed to a total of ten lineups, with two lineups from each of the difficulty levels 0 through 3, and one lineup each from levels 4 and 5.
An example lineup for slope $\beta=0.4$ is shown in Figure \ref{fig:test_contaminated}. Can you pick which plot is different? The actual data plot location is $(3^2-2^3)$ and 13 out of 31 observers picked the actual plot.
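A minimal R sketch of this data-generating process follows. The notation $N(\cdot,\cdot)$ above does not state whether the second argument is a variance or a standard deviation; here we read the error parameters as standard deviations and the $1/3$ for $X_i$ as a variance, which is our assumption.
\begin{verbatim}
# Sketch of the contaminated-data generation: n regular points plus
# n_c contaminated points; alpha = 0, lambda = 10, mu = -1.75,
# sigma ~ 3.5 as in the text; beta = 0.4 is one of the design slopes.
set.seed(1)
n <- 100; n_c <- 15; beta <- 0.4; sigma <- 3.5
x1 <- rnorm(n);  y1 <- 0 + beta * x1 + rnorm(n, sd = sigma)
x2 <- rnorm(n_c, mean = -1.75, sd = sqrt(1/3))
y2 <- 10 + rnorm(n_c, sd = sigma / 3)
x <- c(x1, x2);  y <- c(y1, y2)
fit <- lm(y ~ x)          # contamination pulls the slope estimate to 0
plot(x, residuals(fit))   # the actual data plot for the lineup
\end{verbatim}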
\begin{figure}[p]
\centering
\includegraphics[width=0.95\textwidth]{plot_turk3_100_40_5_5.png}
\caption{Lineup plot ($m=20$) using scatter plots for testing $H_0: \beta_k=0$ where covariate $X_k$ is continuous, but some contamination in the data violates the normality assumption of the error structure. One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes $H_0$ is true. Which plot is the most different from the others, in the sense that there is the steepest slope? (The position of the actual data plot is provided in Section \ref{sec:contamination}.)}
\label{fig:test_contaminated}
\end{figure}
\section{Results} \label{sec:results}
\subsection{Data Cleaning} \label{sec:data_cleaning}
Amazon Mechanical Turk workers are paid for their efforts, not substantially, but on the scale of the minimum wage in the USA. Some workers will try to maximize their earnings for minimum effort, which can affect the results from the data. For example, some workers may simply randomly pick a plot, without actively examining the plots in the lineup. For the purpose of identifying these participants and cleaning the data, we use one of the very easy lineups that everybody was exposed to as a {\it reference lineup} and take action based on a subject's answer to this reference:
if the subject failed to identify the actual data plot on the reference lineup, we remove all of this subject's data from the analysis. If the answer on the reference lineup is correct, we remove the answer for this lineup from the analysis, but keep all of the remaining answers. Table~\ref{tbl:summary} tabulates the number of subjects, genders and lineups evaluated after applying the data screening procedure.
\begin{table}[hbtp]
\caption{Number of subjects, gender, total lineups seen and distinct lineups for all three experimental data sets. Note that in some of the experiments the number of male and female participants does not add up to the total number of participants due to missing demographic information. }
\begin{center}
\begin{tabular}{ccrrcc}
\hline \hline
Experiment & Subjects & Male & Female & Responses & Lineups\\
\hline
1 & 239 & 121 & 107 & 2249 & 60 \\
2 & 351 & 185 & 164 & 3636 & 70 \\
3 & 155 & 103 & 52 & 1511 & 29 \\
\hline
\end{tabular}
\end{center}
\label{tbl:summary}
\end{table}
\subsection{Model fitting}
For each parameter combination, the {\it effect} $E$ is defined as
\[
E=\sqrt {n} \cdot \beta/\sigma.
\]
\noindent
The model in Equation \ref{eqn:mixed} is fit using $E$ as the only fixed-effect covariate, without an intercept, i.e.
$W_{\ell i} = E_{\ell i}$. Instead of fitting an intercept, we make use of a fixed offset of $\log(0.05/0.95)$, so that the estimated power has a fixed lower limit of 0.05 (Type-I error) when $E=0$. Different skill levels of subjects are accounted for by allowing subject-specific random slopes for the effect ($E$).
For experiment 3 we do fit intercepts, both a fixed intercept and subject-specific random intercepts, since forcing the power to be fixed at 0.05 for $E=0$ is not required by the experimental design.
For computation we use package {\tt lme4} \citep{lme4:2011} and software R 2.15.0 \citep{R}. $p$-value calculations are based on asymptotic normality.
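A sketch of this model fit with {\tt lme4} is given below. It is a minimal illustration, not our analysis script: the data frame {\tt responses} and its column names ({\tt correct}, {\tt E}, {\tt subject}) are hypothetical stand-ins for the experimental data.
\begin{verbatim}
# Mixed-effects logistic regression (the model above): effect E as the
# only fixed-effect covariate, no intercept, fixed offset log(0.05/0.95),
# and subject-specific random slopes for E.  The data frame `responses`
# and its columns (correct, E, subject) are illustrative assumptions.
library(lme4)
fit <- glmer(correct ~ E - 1 + (E - 1 | subject),
             offset = rep(log(0.05 / 0.95), nrow(responses)),
             family = binomial(), data = responses)
summary(fit)
p_hat <- fitted(fit)  # subject- and lineup-specific estimates of p
\end{verbatim}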
Table \ref{tbl:model_par} shows the parameter estimates of the mixed effects model of the subject-specific variation. The fixed effects estimates indicate that for all experiments the proportion of correct responses increases as the effect increases. This effect is less pronounced for experiment 3. The subject-specific variability is smaller for experiment 1, and relatively large for experiment 3.
\begin{table}[hbtp]
\caption{Parameter estimates for the model in Equation \ref{eqn:mixed}. Estimates are highly significant, with $p$-values $<$ 0.0001, for all three experiments.}
\begin{center}
\begin{tabular}{cr@{.}lcc}
\hline \hline
& \multicolumn{3}{c} {Fixed effect} & Random effect\\
\cline{2-4}
Experiment & \multicolumn{2}{l}{Estimate} &Std. error & Variance\\
\hline
1 & 0&39 & 0.0094 & 0.0080 \\
2 & 1&21 & 0.0197 & 0.0443 \\
3 & 0&59 (Intercept) & 0.1668 & 1.9917\\
& 0&21 (Slope) & 0.0511 & 0.0245\\
&-0&78 (correlation) & & \\
\hline
\end{tabular}
\end{center}
\label{tbl:model_par}
\end{table}
\subsection{Power comparison}
Figure \ref{fig:power_loess_effect} shows an overview of estimated power against effect for the three experiments. Responses from each experiment are summarized by effect size and represented as dots, with size indicating the number of responses. A loess fit to the data gives an estimate of the observed proportion correct, $\hat p(E)$, for different effect sizes, with grey bands indicating simultaneous bootstrap confidence bands \citep{buja}. $\hat p(E)$ is considered to be the power for $K=1$ and is used to obtain the power for $K=5$. For comparison, the dashed lines show the corresponding power curves of the conventional tests. It is encouraging to see that visual inference mirrors the power vs effect relationship of conventional testing in experiments 1 and 2. In experiment 3 the power of the visual test exceeds that of the conventional test, as expected. For larger values of $K$ the estimated power exceeds the power of the conventional test.
Note that for effect $E=0$, the power is close to 0.05 (Type-I error) for both experiments 1 and 2, making the fixed offset a reasonable assumption.
\begin{figure}[hbtp]
\centering
\scalebox{0.55}{\includegraphics{power_loess_effect_k.pdf}}
\caption{Power in comparison to effect for the three experiments. Points indicate subject responses, with size indicating count. Responses are 1 or 0 according to success or failure, respectively, in identifying the actual plot in the lineup. The loess curve (continuous line) estimates the observed proportion correct (power for $K=1$), and the surrounding bands show simultaneous bootstrap confidence bands. The observed proportion is used to obtain power for $K=5$. Conventional test power is drawn as a dashed line. For experiment 3, conventional power is based on the slopes of the non-contaminated part of the data. Power of the conventional test for contaminated data is shown by cross marks.}
\label{fig:power_loess_effect}
\end{figure}
Results for experiment 3 are quite different. This is the situation where we expect to see the potential of visual inference, and indeed we do: the power of visual inference is always high, and much higher than the conventional test at small effect sizes. There is no actual conventional power in this situation, because assumptions are violated.
The dashed line shows conventional power based on uncontaminated data, whereas the cross marks show effective power based on the coefficient estimated from the contaminated data.
Results of experiment 3 are curious insofar as the power of the visual test is largely independent of the effect size. However, these results are based on correct identification of the actual data plot, regardless of the reason. Although subjects were asked to select the plot that exhibited the highest association between the two variables, they might have cued in on the cluster of contaminated data. This will be explored further in Section \ref{sec:TypeIII}.
\subsection{Subject-specific variation}
Subject-specific proportion correct $\hat p_i(E)$ is obtained using Equation \ref{eqn:mixed_power} and is used to obtain the power for $K=5$. Figure \ref{fig:power_mixed_subject} shows power curves for both the overall experiment and subject-specific variations. The thick continuous line shows the overall estimated power; the thinner lines correspond to subject-specific power curves. For comparison, the dashed lines show power curves of the conventional test. Subject-specific power is quite different between the three experiments. In experiment 2 subjects performed similarly, and substantially better than the conventional test. In experiment 1 there is more variability between subjects, with some doing better than the conventional test on large effects. In experiment 3 there is the most subject-specific variation. Some subjects performed substantially better than the conventional test, and on average the visual test was better.
\begin{figure}[hbtp]
\centering
\scalebox{0.60}{\includegraphics{power_mixed_subject_k.pdf}}
\caption{Subject-specific power for $K=5$ obtained using the subject-specific proportion correct estimated from model \ref{eqn:mixed}. The corresponding power curve for conventional test (dashed line) is shown for comparison. The overall estimated average power curve is shown (light blue).}
\label{fig:power_mixed_subject}
\end{figure}
\subsection{Estimating the $p$-value in the real world}
In the real setting, where visual inference is intended to be useful, there will be no conventional test $p$-values. Assessing the strength of perceived structure is a critical component of visual inference. In experiments 1 and 2, there is a $p$-value associated with the actual data plot in each lineup. As this $p$-value increases, the proportion of correct responses falls (Figure \ref{fig:pval_pcorrect}), which is evidence of a direct association between the proportion of correct responses and the conventional test $p$-value. For $p$-values larger than 0.15 it is very uncommon for subjects to correctly identify the actual data plot in the lineup.
\begin{figure*}[hbtp]
\centering
\scalebox{0.6}{\includegraphics{p_val_prop_correct.pdf}}
\caption{Proportion of correct responses decreases rapidly with increasing $p$-values. For $p$-values above 0.15 it becomes very unlikely that observers identify the actual plot. The theoretical justification of this is shown in Figure \ref{fig:pval_power}. }
\label{fig:pval_pcorrect}
\end{figure*}
From the experimental data, visual $p$-values are estimated based on Definition \ref{dfn:pvalue}. Figure \ref{fig:pval_definition} displays the resulting estimates for each lineup against the conventional $p$-value. The pattern of visual $p$-values is interesting: for small conventional $p$-values the visual estimates tend to be very small, while lineups with larger conventional $p$-values result in very large visual estimates, giving a clear indication of whether or not to reject $H_0$. This is why we do not see a lot of visual $p$-values between 0.05 and 0.8, especially for experiment~2. This guides the researcher to make decisions confidently, whereas conventional tests with marginal $p$-values make the decision of whether or not to reject harder. For visual tests such marginal values are uncommon.
\begin{figure*}[hbtp]
\centering
\scalebox{0.6}{\includegraphics{p_val_definition.pdf}}
\caption{Conventional test $p$-value ($p_D$) vs visual $p$-value obtained from Definition \ref{dfn:pvalue}. Values are shown on a square root scale.}
\label{fig:pval_definition}
\end{figure*}
For experiment 3, we see that the visual $p$-values are very small no matter what the conventional $p$-values are. This is expected: the conventional test loses its power to reject $H_0$ even when the alternative is true, whereas the visual test performs well.
\subsection{Do people tend to pick the lowest $p$-value?}
One assumption made in evaluating the effect of lineup size in the calculations of visual $p$-value and signal strength was that subjects tend to pick the plot in the lineup with the strongest signal. In experiments 1 and 2, this corresponds to the plot with the smallest $p$-value. We examine the data collected from the first two experiments to see if this assumption is indeed reasonable.
Figure \ref{fig:P-val_log2} gives an overview of all selections in all lineups of experiments 1 and 2. Each panel of the figure corresponds to a single lineup. Each `pin' -- a short line topped by a dot -- corresponds to one plot in the lineup. The $x$-location of the pin shows the plot's $p$-value on a log scale; its height is given by the number of observers choosing this plot.
Columns are ordered according to effect size, as defined in Section 6.2; rows show replicates for the same combination of parameters.
Red indicates the plot with the lowest $p$-value in the lineup. Blue indicates the plot of the actual data when it differs from the plot with the lowest $p$-value. In both experiments people tended to select the plot with the lowest $p$-value. The results are clearer for experiment 2, which used a continuous covariate. Even when subjects did not pick the plot with the lowest $p$-value, their choices tended to alternate among the few plots with the lowest $p$-values. So for most subjects the assumption that they pick the plot with the smallest $p$-value appears reasonable, and the actual power of the visual test should be close to the expected power.
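Under this assumption, the expected proportion correct can be approximated by simulation: generate a lineup, and count the pick as correct when the actual-data panel attains the smallest $p$-value among all $m$ panels. The sketch below does this for a simple linear regression; the model, parameter values, and function names are illustrative assumptions, not the exact simulation design of our experiments.
\begin{verbatim}
# Hypothetical sketch: expected proportion correct if an observer
# always picks the panel with the smallest slope-test p-value. A pick
# is "correct" when the actual-data panel attains the minimum p-value
# among the m panels. Model and parameters are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

def slope_p_value(x, y):
    return stats.linregress(x, y).pvalue

def expected_prop_correct(beta=3.0, n=100, sigma=12.0, m=20, reps=500):
    correct = 0
    for _ in range(reps):
        x = rng.uniform(-1, 1, n)
        y = beta * x + rng.normal(0, sigma, n)      # actual data panel
        p_data = slope_p_value(x, y)
        # m - 1 null panels: same design, but generated with beta = 0
        p_nulls = [slope_p_value(x, rng.normal(0, sigma, n))
                   for _ in range(m - 1)]
        correct += p_data < min(p_nulls)
    return correct / reps

print(expected_prop_correct())
\end{verbatim}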
\begin{figure*}[hbtp]
\centering
\includegraphics[width=0.8\textwidth]{p_val_log_counts-a.pdf}
\includegraphics[width=0.8\textwidth]{p_val_log_counts-b.pdf}
\includegraphics[width=0.8\textwidth]{p_val_log_counts2-a.pdf}
\includegraphics[width=0.8\textwidth]{p_val_log_counts2-b.pdf}
\caption{Relative frequency with which each plot in a lineup was picked, plotted against the $p$-value (on log$_{10}$ scale) of that plot, for all individual lineups of experiments 1 and 2. Red indicates the plot with the lowest $p$-value, and blue indicates the actual data plot when it differs from the plot with the lowest $p$-value.
Columns are ordered according to effect size, with rows showing replicates of the same parameter combination on top of each other. Empty cells indicate combinations of parameters that were not tested. The highest counts tend to occur for the plot in the lineup with the lowest $p$-value, more so for experiment 2 than for experiment 1.
}
\label{fig:P-val_log2}
\end{figure*}
There are some noticeable exceptions to this rule. In experiment 1, when $\beta=0$, $n=100$, $\sigma=5$, $rep=1$, people overwhelmingly chose a plot with a much larger $p$-value; similarly, for parameters $\beta=5$, $n=300$, $\sigma=12$, $rep=3$, people tended to pick the plot with the second smallest $p$-value.
For several of these exceptions, along with several easy lineups, a follow-up experiment was conducted using an eye-tracker to examine which patterns or features participants cue on in making their choices \citep{zhao:2012}.
\subsection{How much do null plots affect the choice?}\label{sec:null_choice}
Visual inference falls into the same framework as randomization tests, where statistics from the data are compared with those from null data. Unlike randomization tests, visual inference is constrained to making the comparison with just a few draws $(m-1)$ from the null distribution. How this small set of null plots influences the subjects' choice is important for understanding the reliability of visual inference. If the actual data plot is very different from all of the null plots, then the null plots should not have much influence on the choice. Measuring the difference between plots in general is almost impossible. However, in this controlled setting we can use the $p$-values of the test statistic calculated on the data underlying each plot as a proxy for the similarity of structure between plots. If there is a null plot with a small $p$-value, or one close to that of the actual data plot, we would expect subjects to have a harder time detecting the actual data plot.
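As a rough illustration of this proxy, one could score the difficulty of a lineup by the gap between the actual data plot's $p$-value and the smallest null-plot $p$-value; the measure and names below are hypothetical, introduced only to make the idea concrete.
\begin{verbatim}
# Hypothetical sketch: p-values of the panels in a lineup as a proxy
# for similarity of structure. A lineup should be harder when some
# null panel has a p-value as small as, or close to, that of the
# actual data panel. This gap measure is illustrative only, not a
# quantity we estimate in the paper.
def lineup_difficulty_gap(p_data, p_nulls):
    # Small or negative gaps flag hard lineups: at least one null
    # panel shows structure comparable to, or stronger than, the
    # actual data panel.
    return min(p_nulls) - p_data

# Actual data panel with p = 0.01 against 19 null panels:
p_nulls = [0.04, 0.30, 0.52, 0.08] + [0.60] * 15
print(lineup_difficulty_gap(0.01, p_nulls))  # about 0.03: hard lineup
\end{verbatim}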
\subsection{Type III error}\label{sec:TypeIII}\label{sec:typeIII_error}
An error little known amongst statisticians is what \citet{mosteller:48} coined the Type III error: correctly rejecting the null hypothesis, but for the wrong reason. Experiment 3 is prone to this type of error. Participants were asked to identify the plot with the largest absolute slope, but the actual data plot featured a cluster of points, the contamination that made the conventional test fail to detect any trend. For the human eye this cluster of points is as visible as the association between the remaining points, enabling the observer to identify the actual data plot by looking for the cluster instead of the slope. This would be considered a Type III error because it leads to a correct rejection of the null hypothesis, but for a reason unrelated to the value of the slope parameter.
For visual inference, making a Type III error is not actually a problem. It is only an issue in this experiment because we are working with known structure. In the real setting, we are excited to see observers detecting the actual data plot, and curious about how they detect it, with all possible reasons encapsulated in the alternative hypothesis. This highlights, however, the importance of collecting qualitative reasoning from observers about their choices.
\section{Conclusions}
This paper has demonstrated that statistical graphics can be used in statistical inference, validating the lineup protocol proposed by \citet{buja:2009}. Specific terminology was defined, and methods for obtaining the $p$-value and estimating the power of visual tests were introduced. In order to calculate the theoretical power, it was assumed that observers select the plot with the strongest signal in the lineup, and the experimental data suggest that for most observers this assumption holds. Results from visual inference in the controlled setting of the simulation study are comparable to those obtained by conventional inference. Visual inference is intended to provide valid tests where no conventional test exists, and our experiments in a controlled scenario suggest that it will perform as expected in the intended applications. The power of a visual test increases with the number of observers, which, interestingly, means that the theoretical power of a visual test can exceed that of a conventional test.
The lineup protocol operates similarly to statistical tests that have broad alternative hypotheses. If the null hypothesis is rejected, generally we can say that ``there is something there'' but not specifically what it is in the data that triggers the rejection. Follow-up questions on the reasons provide qualitative insight. In conventional testing, multiple comparisons are often done to refine and understand the test results, and perhaps some similar approaches might be developed for visual inference.
The performance of subjects varied considerably, but individual subjects performed consistently. No restrictions were placed on the abilities of the Turk workers. There were clearly some subjects who performed very badly, but it was very interesting to see that there were some super-observers: people who detected the actual data plot at a rate better than the power of the best conventional test.
It would be interesting to see how well trained subjects might perform. Prior to the Turk experiments, we conducted pilot studies using local graphics experts and obtained good results, suggesting that training in data visualization might be helpful for visual inference. Future work might explore this.
Visual inference has been used successfully in two practical applications: to evaluate the power of competing graphical designs \citep{heike:2012}, and to detect the presence of signal in large $p$, small $n$ data \citep{niladri:2011}. It is hoped that the lineup protocol will prove valuable in data mining applications and exploratory analyses, where there are no existing gauges of statistical significance.
\paragraph{Supplementary Material:} Proof of Lemma \ref{lemma}, details of data collection and cleaning, and a longer discussion of the effect of null plots and of Type III error.
\bibliographystyle{asa}
\bibliography{references}
\end{document}