An R package containing a collection of data sets and R functions to accompany and complement:
Abdi, H., & Beaton, D. (to appear, 2025). Principal Component and Correspondence Analyses Using R. New York: Springer Verlag.
In time, data4PCCAR will also contain the code used in the book and the resulting figures.
data4PCCAR is a collection of data sets and R functions that are used (or could have been used, should have been used, or might have been used) to illustrate Principal Component Analysis (PCA), Correspondence Analysis (CA), and related methods such as Multiple Correspondence Analysis (MCA) and Multi-Dimensional Scaling (MDS).
To install data4PCCAR from Github, use the function install_github() from the package remotes:

# install.packages("remotes")
# uncomment the line above if the package `remotes` is not yet installed
remotes::install_github("HerveAbdi/data4PCCAR") # install data4PCCAR
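Once installed, the package can be loaded and its contents browsed with base R (a minimal sketch):

library(data4PCCAR)              # load the package
data(package = "data4PCCAR")     # list the data sets shipped with the package
ls("package:data4PCCAR")         # list the exported functions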
The average consumption of 4 types of alcoholic beverages in 22 European and neighboring countries. To be used to illustrate the use of CA and Hellinger analysis.
Wine experts and wine novices (blind) sorted red, rosé, and white wines. Can they tell the difference? To be used for multidimensional scaling methods (MDS, Procrustean, or DISTATIS).
A set of 12 children and 10 adults picked the color that best described each of 9 pieces of music. To be used to illustrate CA and Partial Triadic CA (a variant of CA, a.k.a. PTCA).
A set of 133 (French) participants associated 6 colors to 6 vowels (data from Chastaing, 1961). To be used to illustrate CA.
A contingency table storing the causes of death as a function of age in the USA in 2001. To be used to illustrate CA.
The similarity (measured on a scale from 0 to 1000, with bigger numbers meaning more similar) between 15 psychometric instruments (i.e., scales) measuring depression.
Five tasters evaluated five beers on dimensions that they had previously chosen. These data can be used to illustrate (among others) Multiple Factor Analysis (MFA), STATIS, or DiSTATIS.
Five tasters evaluated (using a 10-point rating scale) how much they liked five beers. To be used to illustrate how to use MCA with "fuzzy coding" of rating scales.
Five (rather fictitious) wines are described by their Sensory and Chemistry variables. These wines are used to illustrate Varimax rotation and partial least square regression (PLSR).
Data from a flash profile where 5 different beers are evaluated by 6 different assessors each using their own descriptors. To be used to illustrate (among others) Multiple Factor Analysis, STATIS, or DiSTATIS.
How much 12 typical French families spent on different types of food (including Wine!) in the 1950s. To be used to illustrate non-normalized (a.k.a. covariance-based) PCA.
Twenty-six world cuisines are described by their (82) cooking ingredients. To be used to illustrate CA.
A set of contingency tables showing the number of university graduates in the USA from 1966 to 2001 by fields. The grand contingency table gives the 34 years by 8 fields results. The partial contingency tables subset the grand contingency table by gender (women vs men) and by levels (Bachelor, Master, and Ph.D.). To be used to illustrate CA, PTCA, and variations of these methods.
A data set storing the (simulated) results of 2,100 participants answering a Likert-scale-like questionnaire comprising 30 questions about mental imagery (called the Object-Spatial Imagery Questionnaire: OSIQ). Half of the questions concern mental imagery for objects and the other half of the questions concern mental imagery for spatial locations. To be used to illustrate PCA, Varimax rotation, and also MCA.
The punctuation used by 6 classical French writers (Rousseau, Chateaubriand, Hugo, Zola, Proust, Giraudoux) plus the punctuation of one chapter of Abdi's Les Réseaux de Neurones. A classic example for CA.
The average data obtained from ten tasters who evaluated the intensity of twelve descriptors for six wheat-based beers. These data can be used to illustrate: un-normed (or normed) Principal Component Analysis (PCA) or Correspondence Analysis (CA).
The pictures (230 by 240 = 55,200 pixels each) of six faces (3 men and 3 women). To be used to illustrate PCA or PLS on images.
The average data obtained from ten tasters who evaluated the intensity of four descriptors for sixteen chewing-gums. These data can be used to illustrate: normed Principal Component Analysis (PCA) or Multiple Correspondence Analysis (MCA).
A baby (quite fictitious) example: 20 red wines (10 French, 10 American) are rated by an expert on multiple dimensions. To be used to illustrate PCA.
A set of 36 wines (from Argentina, Canada, and the USA) described by a variety of variables (descriptors, chemistry, and sensory). To be used to illustrate PCA with supplementary variables (could also be used for two-table methods such as CCA, PLSC, PLSCA, and PLSR).
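As a minimal sketch of how these data sets can be analyzed, here is a plain CA computed with ExPosition::epCA (the object name myContingencyTable is a placeholder, not an actual data set name from the package):

library(ExPosition)
library(data4PCCAR)
X <- myContingencyTable            # placeholder for any of the contingency tables above
resCA <- epCA(X, graphs = FALSE)   # plain correspondence analysis
resCA$ExPosition.Data$eigs         # eigenvalues
resCA$ExPosition.Data$fi           # row factor scores
resCA$ExPosition.Data$fj           # column factor scores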
Adds arrows to a factorial map (best with a circle of correlation as created, e.g., by addCircleOfCor). Used for loadings for PCA maps.
Computes the biplot coordinates of variables for a PCA computed with ExPosition::epPCA.
Adds a Circle of Correlation to a PCA-like map of correlation produced by createFactorMap et alia. Used for loadings (when seen as correlations between factors and variables) or cosines for PCA maps.
A ggplot2-based function that adds lines to join the levels of the qualitative variables in a factorial map of an MCA (e.g., computed with ExPosition::epMCA and created with createFactorMap). Used for MCA maps.
Creates (roughly) balanced factors from quantitative variables. Used to create nominal variables from quantitative variables for MCA.
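The general idea, sketched below with base R only (this is not the package function's interface), is to cut each quantitative variable at its quantiles so that the resulting levels have roughly equal counts:

x <- rnorm(100)                                     # a quantitative variable
breaks <- quantile(x, probs = seq(0, 1, length.out = 4))
xFactor <- cut(x, breaks = breaks, include.lowest = TRUE,
               labels = c("low", "medium", "high")) # 3 roughly balanced levels
table(xFactor)                                      # check the counts per level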
Generates a bootstrap distribution of eigenvalues for a given (rectangular) matrix.
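The general idea behind such a bootstrap, sketched here with base R only (this is not the package function itself), is to resample the rows of the matrix with replacement and recompute the eigenvalues for each resample:

bootEigenvalues <- function(X, nIter = 100) {
  sapply(seq_len(nIter), function(i) {
    Xboot <- X[sample(nrow(X), replace = TRUE), , drop = FALSE]
    eigen(cov(Xboot), only.values = TRUE)$values  # eigenvalues of a covariance-based PCA
  })
}
# each column of the result stores the eigenvalues of one bootstrapped sample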
Creates a Bootstrap Cube for CCA (Canonical Correlation Analysis): creates Bootstrap Cubes for the I and J sets of a CCA obtained from bootstrapping the rows of the two data-tables used for CCA. Companion function of tepCCA. Note: this function is still in the beta stage.
Creates a Bootstrap Cube for PLSC (partial least square correlation): creates Bootstrap Cubes for the I and J sets of a PLSC obtained from bootstrapping the rows of the two data-tables used for PLSC.
Computes pseudo Bootstrap Ratios (BR) from pseudo-F for variables in MCA. Used to create BR-like histograms comparable to the BRs computed in PCA.
Makes a random image (with similar PCA loadings) of a matrix of integer data such as Likert scale data. Uses two helper functions: getRandomFS and recode2Integer.
Creates a vector of color names for the levels of an MCA from the color names of the variables. Used for creating maps in MCA.
A helper function for the functions perm4PLSC and Boot4PLSC that computes the cross-product of two conformable matrices using various types of normalization and centering.
An efficient routine to compute squared correlation and RV coefficient between 2 conformable matrices.
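For reference, the RV coefficient between two column-centered matrices X and Y can be sketched in base R as follows (an illustration of the quantity computed, not the package function's interface):

rv <- function(X, Y) {
  X <- scale(X, scale = FALSE)  # center the columns of X
  Y <- scale(Y, scale = FALSE)  # center the columns of Y
  Sxy <- crossprod(X, Y)        # X'Y
  # RV = trace(X'Y Y'X) / sqrt( trace((X'X)^2) * trace((Y'Y)^2) )
  sum(Sxy^2) / sqrt(sum(crossprod(X)^2) * sum(crossprod(Y)^2))
}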
Computes contributions (or squared cosines) for (qualitative) variables from the contributions of the levels of the qualitative variables. Used to get contributions in MCA.
A very fast and bare-bone function that computes the eigenvalues (and possibly the row and column factor scores) of the Correspondence Analysis (CA) of a data matrix suitable for CA (i.e., a matrix whose entries are all non-negative).
A Varimax rotation on loadings and factor scores. Also gives the pseudo-eigenvalues and percentage of inertia. Used post-PCA to get nice clean components. epVari requires an a priori number of dimensions.
Used for Correspondence Analysis (CA): creates a Bootstrap Cube obtained from bootstrapping the observations from a true contingency table. fastBoot4CA is based on the base R function rmultinom.
Used for Correspondence Analysis (CA): computes a permutation test for CA when CA is performed on a true contingency table. fastPerm4CA is based on the base R function rmultinom.
Changes the sign of the values of the columns in one matrix or two matrices according to the sign of the elements in the first line of the first matrix. Used to standardize an eigenvector matrix or two matched singular vector matrices.
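The underlying operation is simple; here is a base-R sketch of the idea (Q is a placeholder matrix of eigenvectors; this is not the package function's interface):

s <- sign(Q[1, ])            # signs of the elements in the first row
s[s == 0] <- 1               # leave columns whose first element is 0 unchanged
Qpos <- sweep(Q, 2, s, "*")  # flip the columns whose first element is negative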
Makes positive the first value of factor scores and singular vectors from the output of functions from the ExPosition package such as epCA (see also firstpos and renormInertiaExPo).
Computes the biplot coordinates of variables for a PCA computed with ExPosition::epPCA.
Gets the items with contributions important for a factorial plane. getImportantCtr.12 gives the contributions of the items to the plane, the items important for the plane, and the items important for the plane or for at least one dimension of the plane.
Generates multivariate random observations (i.e., factor scores) that match the variance/eigenvalues of the dimensions (as defined by their loadings).
Computes the Malinvaud / Saporta test for the omnibus and dimensions of a correspondence analysis (CA). malinvaudQ4CA.perm gives the asymptotic Chi2 values and their associated p-values under the usual assumptions and from a permutation test.
A Monte Carlo generation of (random) eigenvalues to match a data matrix. Used to implement the parallel test for the number of reliable eigenvalues.
A very fast function that computes the permuted and bootstrapped eigenvalues of the correspondence analysis (CA) of a matrix suitable for CA (i.e., a matrix with non-negative elements).
Adds dimension names to the results of analyses performed with ExPosition (recall that column names are needed for plots created with ggplot2 when using aes()). Note that if there is a space in a name (as is the case by default), aes_string() may, in some contexts, have to be used in lieu of aes().
Normalizes a matrix by column (i.e., each column vector has a norm of 1).
Computes a permutation test for Canonical Correlation Analysis (CCA, as implemented in tepCCA). Computes an omnibus permutation test and specific tests for the eigenvalues when performing a CCA.
Computes a permutation test for PLSC (as implemented in TExPosition::tepPLS). Computes an omnibus permutation test and specific tests for the eigenvalues when performing a PLSC.
A fast function to compute the correlation coefficient for a contingency table. Used for CA and MCA and as a helper function by the function phi2Mat4BurtTable.
In PLS regression, PLS4jack computes a supplementary projection for a jackknifed estimation of one supplementary element. PLS4jack is mainly used by PLSR_SVD for computing jackknifed estimates.
PLS regression (PLSR) using the Singular Value Decomposition (SVD) instead of the original NIPALS (faster for large data sets).
Computes latent variables / factor scores for supplementary observations in a PLSC analysis. Note: projSupplementaryObservations4PLSC is mostly used as a preparatory step for supplementaryObservations4PLSC.
Recodes a vector of real values to integers matching a given distribution for the integers.
Renorms the output of an ExPosition function (e.g., epCA, epPCA, epMCA) to a given constant.
Centers and normalizes a vector (to norm 1 or to a given norm). scale0 gives the same results as scale except 1) for the option 'SS1' (scale0 normalizes to 1) and 2) when x comprises all identical values and the parameter scale = TRUE: scale gives back a vector of NaN (because of a division by 0 error), whereas scale0 gives back a vector of 0.
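The difference is easy to see with a constant vector (assuming scale0 can be called with its default arguments):

x <- rep(3, 5)
scale(x)               # base R: a vector of NaN (division by a standard deviation of 0)
data4PCCAR::scale0(x)  # a vector of 0 instead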
Computes the projection of supplementary elements (rows or columns) for a PCA (computed with ExPosition::epPCA) followed by a Varimax rotation (computed with data4PCCAR::epVari).
Computes the value of the latent variables for supplementary observations for a PLSC model from TExPosition::PLS.
Computes the value of the latent variables for supplementary observations for a PLSCA model from TExPosition::PLSCA.
Projects supplementary variables (columns) for a PLSC analysis (computed with TExPosition::tepPLS). Includes the internal function projOnDualSet4PLSC.
Projects supplementary variables (columns) for a PLSCA analysis (computed with TExPosition::tepPLSCA).
A TExPosition-like version of Canonical Correlation Analysis (CCA). Inputs and outputs are the same as TExPosition::tepPLS. The current version uses a clone of GSVD from Derek Beaton's soon-to-be-released-on-CRAN package GSVD (see Github: derekbeaton/GSVD for details). Note that this version is still under development.
A TExPosition-like version of Redundancy Analysis. Inputs and outputs are the same as TExPosition::tepPLS. The current version uses a clone of GSVD from Derek Beaton's soon-to-be-released-on-CRAN package GSVD (see Github: derekbeaton/GSVD for details). Note that this version is still under development.
A ggplot2 theme that matches the defaults of the factorial figures in PTCA4CATA. theme_ptca2 is a (slightly smarter) cousin of PTCA4CATA::theme_ptca (which will be deprecated in the near future).