This document is intended as a guide on how to work with quantities data (magnitudes with units and/or uncertainty) in two distinct workflows: R base and the so-called tidyverse. Units and errors (and, by extension, quantities) objects are essentially numeric vectors, arrays and matrices with associated metadata. This metadata is not always compatible with some functions, and thus here we explore the most common operations in data wrangling (subsetting, ordering, transformations, aggregations…) to identify potential issues and propose possible workarounds.
Let us consider the traditional iris data set for this exercise. According to its documentation, iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species, with values provided in centimeters. If we consider, for instance, a 5% uncertainty, the first step is to define proper quantities. We will then work on the resulting data frame for the rest of this article.
library(quantities)
#> Loading required package: units
#> udunits system database from /usr/share/udunits
#> Loading required package: errors
iris.q <- iris
for (i in 1:4)
  quantities(iris.q[,i]) <- list("cm", iris.q[,i] * 0.05)
head(iris.q)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1(3) [cm] 3.5(2) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa
#> 2 4.9(2) [cm] 3.0(2) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa
#> 3 4.7(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm] setosa
#> 4 4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm] setosa
#> 5 5.0(2) [cm] 3.6(2) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa
#> 6 5.4(3) [cm] 3.9(2) [cm] 1.70(8) [cm] 0.40(2) [cm] setosa
Note that, throughout this document, and unless otherwise stated, we will talk about quantities objects as a shortcut for quantities, units and errors objects.
In this section, we consider all the methods and functions included in the default packages, i.e., those that are automatically installed along with any R distribution:
#> [1] "base" "compiler" "datasets" "graphics" "grDevices"
#> [6] "grid" "methods" "parallel" "splines" "stats"
#> [11] "stats4" "tcltk" "tools" "utils"
Quantities objects have all the subsetting methods defined ([, [[, [<-, [[<-). Therefore they can be used in the same way as plain numeric vectors, and in conjunction with which and other functions to perform subsetting. The subset function is very handy too and achieves the same result:
iris.q[which(iris.q$Sepal.Length > set_quantities(7.5, cm)), ]
#> Warning: In '>' : boolean operators not defined for 'errors' objects,
#> uncertainty dropped
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 106 7.6(4) [cm] 3.0(2) [cm] 6.6(3) [cm] 2.1(1) [cm] virginica
#> 118 7.7(4) [cm] 3.8(2) [cm] 6.7(3) [cm] 2.2(1) [cm] virginica
#> 119 7.7(4) [cm] 2.6(1) [cm] 6.9(3) [cm] 2.3(1) [cm] virginica
#> 123 7.7(4) [cm] 2.8(1) [cm] 6.7(3) [cm] 2.0(1) [cm] virginica
#> 132 7.9(4) [cm] 3.8(2) [cm] 6.4(3) [cm] 2.0(1) [cm] virginica
#> 136 7.7(4) [cm] 3.0(2) [cm] 6.1(3) [cm] 2.3(1) [cm] virginica
subset(iris.q, Sepal.Length > set_quantities(7.5, cm))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 106 7.6(4) [cm] 3.0(2) [cm] 6.6(3) [cm] 2.1(1) [cm] virginica
#> 118 7.7(4) [cm] 3.8(2) [cm] 6.7(3) [cm] 2.2(1) [cm] virginica
#> 119 7.7(4) [cm] 2.6(1) [cm] 6.9(3) [cm] 2.3(1) [cm] virginica
#> 123 7.7(4) [cm] 2.8(1) [cm] 6.7(3) [cm] 2.0(1) [cm] virginica
#> 132 7.9(4) [cm] 3.8(2) [cm] 6.4(3) [cm] 2.0(1) [cm] virginica
#> 136 7.7(4) [cm] 3.0(2) [cm] 6.1(3) [cm] 2.3(1) [cm] virginica
Note that another quantities object is defined for the comparison. This is needed because different units are incomparable. Also note that the first line throws a warning telling us that the uncertainty was dropped for this operation. This kind of warning is thrown only once, which is why subset succeeds silently.
The sort function, as its name suggests, sorts vectors, and it is compatible with quantities:
iris.q$Sepal.Length[1:5]
#> Units: [cm]
#> Errors: 0.255 0.245 0.235 0.230 0.250
#> [1] 5.1 4.9 4.7 4.6 5.0
sort(iris.q$Sepal.Length[1:5])
#> Units: [cm]
#> Errors: 0.230 0.235 0.245 0.250 0.255
#> [1] 4.6 4.7 4.9 5.0 5.1
More generally, the order function can be used for data frame ordering:
head(iris.q[order(iris.q$Sepal.Length), ])
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14 4.3(2) [cm] 3.0(2) [cm] 1.10(6) [cm] 0.100(5) [cm] setosa
#> 9 4.4(2) [cm] 2.9(1) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa
#> 39 4.4(2) [cm] 3.0(2) [cm] 1.30(6) [cm] 0.20(1) [cm] setosa
#> 43 4.4(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm] setosa
#> 42 4.5(2) [cm] 2.3(1) [cm] 1.30(6) [cm] 0.30(2) [cm] setosa
#> 4 4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm] setosa
The transform function is able to modify variables in a data frame or to create new ones. The within function provides a similar but more flexible approach. Both are fully compatible with quantities:
head(within(iris.q, {
  Sepal.Area <- Sepal.Length * Sepal.Width
  Petal.Area <- Petal.Length * Petal.Width
  rm(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
}))
#> Species Petal.Area Sepal.Area
#> 1 setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 2 setosa 0.28(2) [cm^2] 15(1) [cm^2]
#> 3 setosa 0.26(2) [cm^2] 15(1) [cm^2]
#> 4 setosa 0.30(2) [cm^2] 14(1) [cm^2]
#> 5 setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 6 setosa 0.68(5) [cm^2] 21(1) [cm^2]
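For completeness, a rough equivalent using transform (a sketch; the intermediate iris.t name is ours, and the output, which should match the within result above, is omitted):
iris.t <- transform(iris.q,
  Petal.Area = Petal.Length * Petal.Width,
  Sepal.Area = Sepal.Length * Sepal.Width
)
head(iris.t[, c("Species", "Petal.Area", "Sepal.Area")])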
Row aggregation is the process of summarising data based on some grouping variable(s). There are several ways of working with data split by factors in R base, and, although they tend to preserve classes, they are generally not very kind to other metadata (i.e., attributes) by default.
In the following example, the average Sepal.Length is computed per Species, but the metadata gets dropped:
tapply(iris.q$Sepal.Length, iris.q$Species, mean)
#> setosa versicolor virginica
#> 5.006 5.936 6.588
Many of these functions include a simplify parameter which, if set to FALSE, preserves quantities metadata:
(sepal.length.agg <-
  tapply(iris.q$Sepal.Length, iris.q$Species, mean, simplify=FALSE))
#> $setosa
#> 5.0(3) [cm]
#>
#> $versicolor
#> 5.9(3) [cm]
#>
#> $virginica
#> 6.6(3) [cm]
The only drawback is that the result is a list, which must be unlisted with care; otherwise, the metadata gets dropped again:
# drops quantities
unlist(sepal.length.agg)
#> setosa versicolor virginica
#> 5.006 5.936 6.588
# preserves quantities
do.call(c, sepal.length.agg)
#> Units: [cm]
#> Errors: 0.2503 0.2968 0.3294
#> setosa versicolor virginica
#> 5.006 5.936 6.588
The by function is an object-oriented wrapper for tapply applied to data frames, which also provides a simplify parameter. A more convenient way of working with summary statistics is the aggregate generic from the stats namespace. Although there is an aggregate.data.frame method, the aggregate.formula method provides a more intuitive interface. Again, it is necessary to set simplify=FALSE to keep quantities:
(iris.q.agg <- aggregate(. ~ Species, data = iris.q, mean, simplify=FALSE))
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.006 3.428 1.462 0.246
#> 2 versicolor 5.936 2.77 4.26 1.326
#> 3 virginica 6.588 2.974 5.552 2.026
Apparently, the output has no associated metadata, but what really happens is that the resulting columns are lists.
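A quick check (illustrative) of one of those columns:
class(iris.q.agg$Sepal.Length)
#> [1] "list"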
Therefore, as in the tapply/by case, they must be unlisted with care to preserve the metadata:
unlist_quantities <- function(x) {
  stopifnot(is.list(x) || is.data.frame(x))

  # custom unlist: concatenate with 'c' to preserve quantities metadata,
  # leave any other column untouched
  unlist <- function(x) {
    if (any(class(x[[1]]) %in% c("quantities", "units", "errors")))
      do.call(c, x)
    else x
  }

  # apply it column-wise for data frames, directly for plain lists
  if (is.data.frame(x))
    as.data.frame(lapply(x, unlist), col.names=colnames(x))
  else unlist(x)
}
unlist_quantities(iris.q.agg)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.0(3) [cm] 3.4(2) [cm] 1.46(7) [cm] 0.25(1) [cm]
#> 2 versicolor 5.9(3) [cm] 2.8(1) [cm] 4.3(2) [cm] 1.33(7) [cm]
#> 3 virginica 6.6(3) [cm] 3.0(1) [cm] 5.6(3) [cm] 2.0(1) [cm]
And this method works for the tapply/by case too:
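unlist_quantities(sepal.length.agg)
#> Units: [cm]
#> Errors: 0.2503 0.2968 0.3294
#> setosa versicolor virginica
#> 5.006 5.936 6.588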
Joining data frames by common columns can be done with the merge generic. Such operations are based on appending columns, which may be subset or replicated to fit the length of the merged observations. Therefore, quantities should be preserved in all cases. In the following example, we generate a data frame with the height per species and then merge it with the main data set:
height <- data.frame(
  Height = set_quantities(c(55, 60, 45), cm, c(45, 30, 35)),
  Species = c("setosa", "virginica", "versicolor")
)
head(merge(iris.q, height))
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width Height
#> 1 setosa 5.1(3) [cm] 3.5(2) [cm] 1.40(7) [cm] 0.20(1) [cm] 60(40) [cm]
#> 2 setosa 4.9(2) [cm] 3.0(2) [cm] 1.40(7) [cm] 0.20(1) [cm] 60(40) [cm]
#> 3 setosa 4.7(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm] 60(40) [cm]
#> 4 setosa 4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm] 60(40) [cm]
#> 5 setosa 5.0(2) [cm] 3.6(2) [cm] 1.40(7) [cm] 0.20(1) [cm] 60(40) [cm]
#> 6 setosa 5.4(3) [cm] 3.9(2) [cm] 1.70(8) [cm] 0.40(2) [cm] 60(40) [cm]
The reshape function, from the stats namespace, provides an interface for both pivoting and unpivoting (i.e., tidying data). In the case of the iris data set, we would say that it is in the wide format, because each row has more than one observation.
This function has quite a peculiar nomenclature. First of all, the unpivoting operation is accessed by providing the argument direction="long". We need to define the varying columns (columns to unpivot), as character or indices, and they are unpivoted based on their names. By default, the separator sep="." is used, which means that Sepal.Width will be broken down into Sepal and Width, and the former will be unpivoted with the latter as the grouping variable. We can specify the name of the grouping variable with the timevar argument.
Putting everything together, this is how to unpivot the data set by the dimension (which we will call dim) of the petal/sepal:
long.1 <- reshape(iris.q, varying=1:4, timevar="dim", idvar="dim.id", direction="long")
head(long.1)
#> Species dim Sepal Petal dim.id
#> 1.Length setosa Length 5.1(3) [cm] 1.40(7) [cm] 1
#> 2.Length setosa Length 4.9(2) [cm] 1.40(7) [cm] 2
#> 3.Length setosa Length 4.7(2) [cm] 1.30(6) [cm] 3
#> 4.Length setosa Length 4.6(2) [cm] 1.50(8) [cm] 4
#> 5.Length setosa Length 5.0(2) [cm] 1.40(7) [cm] 5
#> 6.Length setosa Length 5.4(3) [cm] 1.70(8) [cm] 6
It can be noted that the unpivoting also generates an index to identify multiple records from the same group. We have changed the name of that identifier to dim.id (it is just id by default).
We can further unpivot sepal and petal as the part of the flower. First, we need to prepend a common identifier to columns 3 and 4, which are to be unpivoted:
names(long.1)[3:4] <- paste0("value.", names(long.1)[3:4])
long.2 <- reshape(long.1, varying=3:4, timevar="part", idvar="part.id", direction="long")
head(long.2)
#> Species dim dim.id part value part.id
#> 1.Sepal setosa Length 1 Sepal 5.1(3) [cm] 1
#> 2.Sepal setosa Length 2 Sepal 4.9(2) [cm] 2
#> 3.Sepal setosa Length 3 Sepal 4.7(2) [cm] 3
#> 4.Sepal setosa Length 4 Sepal 4.6(2) [cm] 4
#> 5.Sepal setosa Length 5 Sepal 5.0(2) [cm] 5
#> 6.Sepal setosa Length 6 Sepal 5.4(3) [cm] 6
And the final result has one tidy observation per row.
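As a quick sanity check, the long table should contain 150 observations x 2 dimensions x 2 parts = 600 rows:
nrow(long.2)
#> [1] 600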
The pivoting operation can be accessed by providing the argument direction="wide". The process is almost symmetrical, but we need to specify v.names, as character, instead of varying columns. First, we can pivot by flower part:
wide.1 <- reshape(long.2, v.names="value", timevar="part", idvar="part.id", direction="wide")
head(wide.1)
#> Species dim dim.id part.id value.Sepal value.Petal
#> 1.Sepal setosa Length 1 1 5.1(3) [cm] 1.40(7) [cm]
#> 2.Sepal setosa Length 2 2 4.9(2) [cm] 1.40(7) [cm]
#> 3.Sepal setosa Length 3 3 4.7(2) [cm] 1.30(6) [cm]
#> 4.Sepal setosa Length 4 4 4.6(2) [cm] 1.50(8) [cm]
#> 5.Sepal setosa Length 5 5 5.0(2) [cm] 1.40(7) [cm]
#> 6.Sepal setosa Length 6 6 5.4(3) [cm] 1.70(8) [cm]
Then, we remove "value." from the column names and pivot by dimension (note that indices are removed to match the initial data frame):
names(wide.1)[5:6] <- sub("value\\.", "", names(wide.1)[5:6])
wide.2 <- reshape(wide.1, v.names=c("Sepal", "Petal"), timevar="dim", idvar="dim.id", direction="wide")
#> Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
#> varying, : some constant variables (part.id) are really varying
wide.2$dim.id <- NULL
wide.2$part.id <- NULL
head(wide.2)
#> Species Sepal.Length Petal.Length Sepal.Width Petal.Width
#> 1.Sepal setosa 5.1(3) [cm] 1.40(7) [cm] 3.5(2) [cm] 0.20(1) [cm]
#> 2.Sepal setosa 4.9(2) [cm] 1.40(7) [cm] 3.0(2) [cm] 0.20(1) [cm]
#> 3.Sepal setosa 4.7(2) [cm] 1.30(6) [cm] 3.2(2) [cm] 0.20(1) [cm]
#> 4.Sepal setosa 4.6(2) [cm] 1.50(8) [cm] 3.1(2) [cm] 0.20(1) [cm]
#> 5.Sepal setosa 5.0(2) [cm] 1.40(7) [cm] 3.6(2) [cm] 0.20(1) [cm]
#> 6.Sepal setosa 5.4(3) [cm] 1.70(8) [cm] 3.9(2) [cm] 0.40(2) [cm]
We have seen that quantities have been correctly preserved through the whole process. Finally, we can check whether both data frames are identical. Given that the order of the columns has changed, we simply check them column by column and then put everything together:
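# one possible check: compare the two data frames column by column
all(sapply(colnames(iris.q), function(col) all(iris.q[[col]] == wide.2[[col]])))
#> [1] TRUE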
The core tidyverse includes the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats. This section covers use cases for dplyr (everything except for pivoting and unpivoting) and tidyr (for pivoting and unpivoting).
library(dplyr); packageVersion("dplyr")
#> [1] '0.8.3'
library(tidyr); packageVersion("tidyr")
#> [1] '1.0.0'
The filter generic finds observations where conditions hold. The main difference with base subsetting is that, if a condition evaluates to NA for a certain row, that row is dropped. As in the base case, another quantities object must be defined for the comparison:
iris.q %>%
  filter(Sepal.Length > set_quantities(7.5, cm)) %>%
  head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 7.6(4) [cm] 3.0(2) [cm] 6.6(3) [cm] 2.1(1) [cm] virginica
#> 2 7.7(4) [cm] 3.8(2) [cm] 6.7(3) [cm] 2.2(1) [cm] virginica
#> 3 7.7(4) [cm] 2.6(1) [cm] 6.9(3) [cm] 2.3(1) [cm] virginica
#> 4 7.7(4) [cm] 2.8(1) [cm] 6.7(3) [cm] 2.0(1) [cm] virginica
#> 5 7.9(4) [cm] 3.8(2) [cm] 6.4(3) [cm] 2.0(1) [cm] virginica
#> 6 7.7(4) [cm] 3.0(2) [cm] 6.1(3) [cm] 2.3(1) [cm] virginica
There are also three scoped variants available (filter_all, filter_if, filter_at) and a subsetting function by row number called slice. All of them preserve quantities.
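For instance, a minimal sketch with filter_at (the 1 cm threshold is arbitrary, and the output is omitted):
iris.q %>%
  filter_at(vars(-Species), all_vars(. > set_quantities(1, cm)))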
The arrange generic sorts variables in a straightforward way, and it is compatible with quantities:
iris.q %>%
  arrange(Sepal.Length) %>%
  head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 4.3(2) [cm] 3.0(2) [cm] 1.10(6) [cm] 0.100(5) [cm] setosa
#> 2 4.4(2) [cm] 2.9(1) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa
#> 3 4.4(2) [cm] 3.0(2) [cm] 1.30(6) [cm] 0.20(1) [cm] setosa
#> 4 4.4(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm] setosa
#> 5 4.5(2) [cm] 2.3(1) [cm] 1.30(6) [cm] 0.30(2) [cm] setosa
#> 6 4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm] setosa
The desc function can be applied to individual variables to arrange in descending order.
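For example (a sketch, output omitted):
iris.q %>%
  arrange(desc(Sepal.Length)) %>%
  head()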
There are two generics for column transformations: mutate modifies or adds new variables preserving the existing ones, while transmute drops the existing variables. The syntax is very similar to the base functions transform and within, and equally compatible with quantities:
iris.q %>%
  transmute(
    Species = Species,
    Petal.Area = Petal.Length * Petal.Width,
    Sepal.Area = Sepal.Length * Sepal.Width
  ) %>%
  head()
#> Species Petal.Area Sepal.Area
#> 1 setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 2 setosa 0.28(2) [cm^2] 15(1) [cm^2]
#> 3 setosa 0.26(2) [cm^2] 15(1) [cm^2]
#> 4 setosa 0.30(2) [cm^2] 14(1) [cm^2]
#> 5 setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 6 setosa 0.68(5) [cm^2] 21(1) [cm^2]
dplyr breaks down aggregation operations into two distinct parts: grouping (with group_by) and summarising (using summarise and others). The shortcoming of this approach is that it is possible to apply other operations (such as subsetting) to grouped data, which may lead to performance degradation. Another shortcoming is that dplyr’s grouped operations are not yet fully compatible with quantities (see tidyverse/dplyr#2773):
iris.q %>%
  group_by(Species) %>%
  summarise_all(mean)
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <[(err) cm]> <[(err) cm]> <[(err) cm]> <[(err) cm]>
#> 1 setosa 5.0(3) cm 3.4(2) cm 1.46(7) cm 0.25(1) cm
#> 2 versicolor 5.936(NA) cm 2.77(NA) cm 4.26(NA) cm 1.326(NA) cm
#> 3 virginica 6.588(NA) cm 2.974(NA) cm 5.552(NA) cm 2.026(NA) cm
As we can see above, although units are correctly preserved, the uncertainty is not correctly handled. In fact, correct summaries are only obtained if the errors are dropped beforehand:
iris.q %>%
  mutate_at(vars(-Species), drop_errors) %>%
  group_by(Species) %>%
  summarise_all(mean)
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> [cm] [cm] [cm] [cm]
#> 1 setosa 5.006 3.428 1.462 0.246
#> 2 versicolor 5.936 2.770 4.260 1.326
#> 3 virginica 6.588 2.974 5.552 2.026
Units alone work without issue, but errors or full-featured quantities are not compatible with dplyr’s grouped operations.
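Until this is resolved, one possible workaround (a sketch) is to fall back on the base aggregate approach and restore the quantities with the unlist_quantities helper defined above, which should reproduce the base R result (output omitted):
iris.q %>%
  aggregate(. ~ Species, data = ., mean, simplify = FALSE) %>%
  unlist_quantities()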
Several verbs are provided for different types of joins, such as inner_join, left_join, right_join or full_join. It seems that, internally, they use the same grouping mechanism as summaries, and therefore they will generally fail for errors and full-featured quantities (note the missing uncertainty, as in the previous case):
iris.q %>%
  left_join(data.frame(
    Height = set_quantities(c(55, 60, 45), cm, c(45, 30, 35)),
    Species = c("setosa", "virginica", "versicolor")
  )) %>%
  head()
#> Joining, by = "Species"
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Height
#> 1 5.1(3) [cm] 3.5(2) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa 60(40) [cm]
#> 2 4.9(2) [cm] 3.0(2) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa 60(40) [cm]
#> 3 4.7(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm] setosa 60(40) [cm]
#> 4 4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm] setosa 60(40) [cm]
#> 5 5.0(2) [cm] 3.6(2) [cm] 1.40(7) [cm] 0.20(1) [cm] setosa 60(40) [cm]
#> 6 5.4(3) [cm] 3.9(2) [cm] 1.70(8) [cm] 0.40(2) [cm] setosa 60(40) [cm]
Finally, pivoting and unpivoting are handled by a separate package, tidyr, using the verbs spread (pivot) and gather (unpivot).
The unpivoting operation is substantially more straightforward. In the next example, we directly merge the four columns of interest into the value column, and the corresponding column names are gathered into the key column. Such a column is then separated into the flower part (sepal, petal) and dim (length, width):
iris.q %>%
  gather("key", "value", 1:4) %>%
  separate(key, c("part", "dim")) %>%
  head()
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
#> Species part dim value
#> 1 setosa Sepal Length 5.1
#> 2 setosa Sepal Length 4.9
#> 3 setosa Sepal Length 4.7
#> 4 setosa Sepal Length 4.6
#> 5 setosa Sepal Length 5.0
#> 6 setosa Sepal Length 5.4
Unfortunately, it is evident that the operation completely drops all classes and attributes, so quantities are not preserved.
The pivoting operation does preserve classes and attributes, but the latter are not correctly handled. In the following example, we first gather the original data set, then assign quantities and try to spread it back to obtain iris.q:
wide <- iris %>%
  # first gather, with row numbers as row_id
  mutate(row_id = 1:n()) %>%
  gather("key", "value", 1:4) %>%
  separate(key, c("part", "dim")) %>%
  # assign quantities
  mutate(value = set_quantities(value, cm, value * 0.05)) %>%
  # now spread
  unite(key, part, dim, sep=".") %>%
  spread(key, value) %>%
  select(-row_id)
head(wide)
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> 1 setosa 1.40(7) [cm] 0.20(7) [cm] 5.10(7) [cm] 3.50(7) [cm]
#> 2 setosa 1.40(7) [cm] 0.20(7) [cm] 4.90(7) [cm] 3.00(7) [cm]
#> 3 setosa 1.30(6) [cm] 0.20(6) [cm] 4.70(6) [cm] 3.20(6) [cm]
#> 4 setosa 1.50(8) [cm] 0.20(8) [cm] 4.60(8) [cm] 3.10(8) [cm]
#> 5 setosa 1.40(7) [cm] 0.20(7) [cm] 5.00(7) [cm] 3.60(7) [cm]
#> 6 setosa 1.70(8) [cm] 0.40(8) [cm] 5.40(8) [cm] 3.90(8) [cm]
Apparently, everything worked, but in fact it didn’t:
all(sapply(colnames(iris.q), function(col) all(iris.q[[col]] == wide[[col]])))
#> Error in `errors<-.errors`(`*tmp*`, value = e): any(length(value) == c(length(x), 1L)) is not TRUE
length(errors(iris.q$Sepal.Length))
#> [1] 150
length(errors(wide$Sepal.Length))
#> [1] 600
As shown above, the uncertainty was not properly subset and pivoted.
R base works smoothly with quantities in most cases. The only shortcoming is that some care must be applied to aggregations. In particular, simplification must be explicitly disabled (simplify=FALSE), and such a simplification (i.e., converting lists to vectors of quantities) must be applied manually while avoiding unlist.
The tidyverse handles quantities correctly for subsetting, ordering and transformations. It fails to do so for aggregations (grouped operations in general), column joining and (un)pivoting. Most of these incompatibilities are due to the same internal grouping mechanism, which is implemented in C and prevents the R subsetting operator from being called (which in turn calls the subsetting operator on the errors attribute). Interestingly, those operations still work for units alone, except for column gathering, which drops all classes and attributes.
data.table
The data.table package is another popular data tool, which provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.
Long story short, we have not included a section on data.table because currently (v1.11.4) it does not work well with vectorised attributes. The underlying problem is similar to dplyr’s issue, but unfortunately it affects more operations, including row subsetting and ordering. Only column transformation seems to work, and other operations generate corrupted objects.
We have found that defining quantities columns as lists (where each element consists of a single value, with unit and uncertainty) may be a workaround, as sketched below, but this would probably impose a serious performance penalty on a package that is typically chosen for speed.
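For illustration, a minimal sketch of that list-column idea for a single column (the 5% uncertainty mirrors the example above, and we have not benchmarked it):
library(data.table)

# each cell holds a length-1 quantities object, so data.table never needs to
# subset a vectorised errors attribute
dt <- data.table(
  Species = iris$Species,
  Sepal.Length = lapply(iris$Sepal.Length, function(x)
    set_quantities(x, cm, x * 0.05))
)
dt[Species == "setosa", head(Sepal.Length, 2)]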