Research Summary table
Research scope
Compositional data analysis
The core research line of the group is the statistical analysis of compositional
data, which are defined as (random) vectors of positive components whose
sum is constant (e.g., 100, one, or a million). These restrictions cause standard
statistical techniques to lose their applicability and classical interpretation.
Although this problem has been known since the nineteenth century, no theoretically
founded solution was proposed until the 1980s. Then Professor J. Aitchison
gave, from a strictly statistical perspective, the first suggestions or
principles for analyzing compositional data consistently. Building on these first
indications, it was found a posteriori that the mathematical
foundation of this statistical approach rests on the definition of a specific geometry
on the simplex (the support space of compositional data), from which it
is possible to rigorously develop any classical statistical analysis (cluster
analysis, discriminant analysis, factor analysis, regression models, etc.).
Our research goals therefore include both the development of proper
statistical techniques for compositional data and the understanding of
the mathematics underlying these techniques, which belongs to the fields
of algebra and geometry, measure theory, and differential and
integral calculus on the simplex. The same reasoning can
also be applied to other support spaces, such as R+, R2+,
the (0,1) interval and many others, thus offering promising new ways to
analyse constrained data sets.
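As a minimal sketch of what working with the geometry of the simplex means in practice, the following Python fragment closes a vector of positive parts to a constant sum, applies the centred log-ratio (clr) transformation, and computes the Aitchison distance as the ordinary Euclidean distance between clr-transformed vectors. The function names and the example data are illustrative assumptions, not code from the group; only NumPy is assumed.

```python
import numpy as np


def closure(x, total=1.0):
    """Rescale a vector of strictly positive parts so that it sums to `total`."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    return total * x / x.sum()


def clr(x):
    """Centred log-ratio: log of each part minus the mean of the logs."""
    logx = np.log(closure(x))
    return logx - logx.mean()


def aitchison_distance(x, y):
    """Euclidean distance between the clr images of two compositions."""
    return float(np.linalg.norm(clr(x) - clr(y)))


# Hypothetical three-part compositions, for illustration only.
a = [10.0, 30.0, 60.0]   # expressed in percent
b = [0.1, 0.3, 0.6]      # the same composition expressed as proportions
c = [0.2, 0.3, 0.5]      # a different composition

d_ab = aitchison_distance(a, b)
d_ac = aitchison_distance(a, c)
```

Because the distance depends only on ratios between parts, `d_ab` is zero up to rounding error even though `a` and `b` use different constant sums, whereas `d_ac` is strictly positive.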
In Geology, Petrology, Chemistry, Economics, Archaeometry, etc., people
usually work with vectors of data whose components represent the relative
contribution of different parts to a whole. The main goal of the group
is to move forward in the statistical analysis of compositional data and
its mathematical foundation. This general goal currently comprises the following
specific objectives:
-
Mathematical foundations of compositional data analysis. Once
compositions are defined as equivalence classes, the whole methodology
developed on the simplex can be applied to the quotient space of
compositions, and even extended. In this way these techniques
are given a precise and rigorous foundation, while classical concepts
of geometry, measure theory, and differential and integral calculus
must be translated and interpreted on the compositional quotient space
or, equivalently, on the simplex.
-
Orthogonality and independence on the simplex. Since the simplex
with a proper metric becomes a Euclidean space, we can define orthogonal
and orthonormal bases on it, as well as the associated isometric log-ratio
transformations. Hence we propose to study subcompositional independence
as closely related to orthogonality of subspaces in the simplex.
-
Parametric cluster analysis of compositional data. In recent
years we have dealt with non-parametric techniques for the classification of
compositional data, essentially based on the compositional distance introduced
by Aitchison. We are now tackling some parametric classification methodologies
for compositional data, based on the hypothesis that the clusters are samples
generated from distributions of the additive logistic normal (aln) class.
However, geochemical compositional data usually contain null components,
that is, values close to zero that are reported as null because they lie below the
detection limit of the measuring instruments. These null components must
therefore be replaced by non-zero values before any classification procedure
can be applied. We are currently trying to combine the multiplicative substitution
methodology (introduced by J.A. Martín, a member of the group) with the
parametric classification techniques.
-
Additive logistic skew normal distribution (alsn). We have
recently completed the study and modelling of compositional data with
the skew normal distribution, introduced by Azzalini (1996), by using the
same log-ratio transformation technique originally applied by Aitchison
to the normal distribution, and complementing it with some results from
measure theory. In this way we introduced the alsn distribution
and studied its properties further, especially those related to the real
vector space structure of the simplex and to subcompositions. This distribution
family offers a promising remedy for one of the shortcomings of additive logistic
normal distributions: the fact that the amalgamation of two (or more) aln
components is seldom aln-distributed.
-
Goodness-of-fit tables for skew normal distributions. Since the
skew normal distribution was introduced so recently, there are still no
statistical tools devised to test whether these skewed models fit real
data sets reasonably well, especially in comparison with the classical symmetric normal
model. The group has therefore had to construct specific tables
to test this goodness of fit, which are currently being developed according
to the methodology suggested by Stephens, for different sample sizes and
significance levels.
-
Statistical analysis of spatially-dependent compositional data sets.
It is quite common in Geostatistics (from mining to environmental studies)
to deal with compositional data showing spatial dependence. Until now,
the standard cokriging techniques used in this framework have been based on an
extension of the transformation techniques suggested by Aitchison to the
spatial case, but without considering the vector space structure of the
simplex. We therefore plan to reformulate these techniques by taking into
account the Euclidean metric introduced on the simplex, which we expect
will resolve some flaws detected in the classical application of cokriging to log-ratio
transformed variables.
-
Linear and non-linear models on the simplex. The most recent advances
in building an algebraic-geometric structure on the simplex suggest
the need to reinterpret, from a compositional point of view, the modelling
techniques for linear and non-linear processes in terms of compositional
processes. We have begun with some real case studies in a geological setting,
with very promising results.
-
Compositional software (CoDaPack).
Since the beginning of the twenty-first century, the research group has developed a package called CoDaPack,
a set of routines aimed at users with minimal computing knowledge and designed to be simple and easy to use.
The user interacts with the package through menus, and it returns both numerical and graphical output.
Graphical output can be displayed in 3D, and plots can be zoomed and rotated.
Originally CoDaPack was linked to Excel by means of Visual Basic routines, so that it ran as
a menu in Excel and the results were also placed in Excel sheets. Later, still within Excel, the graphics
were improved by means of OpenGL.
Since May 2011 there has been a new version, CoDaPack 2.0, which does not depend on Excel. This version is written
in Java and requires only an installed Java virtual machine (version 1.5 or later). This allows
CoDaPack 2.0 to run on any computer with a Java virtual machine, whatever its operating system.
In particular, Mac computers
from Apple and Unix operating systems can now run CoDaPack 2.0.
The package is continually being extended with new routines and improvements to existing ones.
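Two of the objectives listed above, orthonormal bases on the simplex with their associated isometric log-ratio (ilr) transformation, and the multiplicative replacement of null components, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not CoDaPack code: the function names are hypothetical, the orthonormal basis is one particular choice among many, and a single replacement value `delta` is assumed for every null part.

```python
import numpy as np


def clr(x):
    """Centred log-ratio of a vector of strictly positive parts."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()


def contrast_matrix(D):
    """One possible D x (D-1) matrix with orthonormal, zero-sum columns.

    Column i contrasts the first i parts against part i+1 (an assumed,
    illustrative choice of orthonormal basis).
    """
    V = np.zeros((D, D - 1))
    for i in range(1, D):
        V[:i, i - 1] = 1.0 / i
        V[i, i - 1] = -1.0
        V[:, i - 1] *= np.sqrt(i / (i + 1.0))
    return V


def ilr(x):
    """Coordinates of a composition with respect to the basis above."""
    x = np.asarray(x, dtype=float)
    return clr(x) @ contrast_matrix(x.size)


def multiplicative_replacement(x, delta, total=1.0):
    """Replace zeros by `delta` and shrink the non-zero parts
    multiplicatively so that the constant sum `total` is preserved."""
    x = np.asarray(x, dtype=float)
    x = total * x / x.sum()
    zeros = x == 0
    shrink = 1.0 - zeros.sum() * delta / total
    return np.where(zeros, delta, x * shrink)


# Hypothetical data, for illustration only.
x = np.array([0.1, 0.3, 0.6])
y = np.array([0.2, 0.3, 0.5])
V = contrast_matrix(3)
d_clr = np.linalg.norm(clr(x) - clr(y))
d_ilr = np.linalg.norm(ilr(x) - ilr(y))

with_zero = np.array([0.0, 0.2, 0.8])
replaced = multiplicative_replacement(with_zero, delta=0.005)
```

In this sketch `d_ilr` equals `d_clr` up to rounding error, which is the isometry property, and `replaced` still sums to one while the ratio between the two non-zero parts is unchanged, so the transformed data can then be passed to any log-ratio method.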
Factor analysis
(soon)
Survey design and processing
(soon)