6. Data normalization options¶
The microarray expression data provided by the AHBA has been subjected to some normalization procedures designed to mitigate potential differences in expression values between donors due to “batch effects.” Despite these procedures, there are still some notable differences between donors present in the downloadable data.
By default, abagen.get_expression_data() aggregates expression data across donors (though this can be prevented via the return_donors parameter). Prior to aggregation, the function performs a within-donor normalization procedure to attempt to mitigate donor-specific effects; however, there are a number of ways to achieve this.
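Currently, abagen supports nine options for normalizing data: centering, z-score, min-max, sigmoid, scaled sigmoid, scaled sigmoid quantiles, robust sigmoid, scaled robust sigmoid, and mixed sigmoid.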
Most of the options have been used at various points throughout the published record, so while there is no “right” choice, we encourage using the default option (scaled robust sigmoid): recent work by Arnatkevičiūtė et al. (2019) showed that it is, as the name might suggest, robust to the outlier effects commonly observed in microarray data.
We describe all the methods in detail here; these can be implemented by passing the sample_norm and gene_norm keyword arguments to abagen.get_expression_data(). For a selection of references to published works that have used these different methods, please see the documentation of abagen.normalize_expression().
6.1. sample_norm vs gene_norm¶
Microarray expression data can be normalized in two directions: each sample can be normalized across all genes, or each gene can be normalized across all samples.
These different forms of normalization are controlled by two parameters in the abagen.get_expression_data() function: sample_norm and gene_norm.
Note that normalization of each sample across all genes occurs before
normalization of each gene across all samples.
Both parameters can accept the same arguments (detailed below), and both are turned on by default.
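The ordering described above (samples first, then genes) can be illustrated with a small standard-library sketch. The toy matrix, the helper names, and the choice of z-scoring for both steps are assumptions made for illustration; this is not abagen's internal code.

```python
from statistics import mean, stdev


def zscore(values):
    """Z-score a list of numbers using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def normalize_rows(matrix, func):
    """Apply `func` to every row (one sample across all genes)."""
    return [func(row) for row in matrix]


def normalize_columns(matrix, func):
    """Apply `func` to every column (one gene across all samples)."""
    columns = [func(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


# Hypothetical toy data: 3 samples (rows) x 4 genes (columns)
expression = [
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 3.0, 2.0, 6.0],
    [5.0, 9.0, 7.0, 3.0],
]

# Step 1: normalize each sample across genes
sample_normed = normalize_rows(expression, zscore)
# Step 2: normalize each gene across samples
fully_normed = normalize_columns(sample_normed, zscore)
```

Because the gene-wise step runs last, each gene (column) of the final matrix has zero mean and unit variance, while the rows generally no longer do.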
6.2. Normalization methods¶
6.2.1. Centering¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='center')
Microarray values are centered with:
\[x_{norm} = x_{i} - \bar{x}\]
where \(x_{i}\) is a single expression value and \(\bar{x}\) is the mean of the microarray expression values.
6.2.2. Z-score¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='zscore')
Microarray values are normalized using a basic z-score function:
\[x_{norm} = \frac{x_{i} - \bar{x}}{\sigma_{x}}\]
where \(x_{i}\) is a single expression value, \(\bar{x}\) is the mean, and \(\sigma_{x}\) is the sample standard deviation of the microarray expression values.
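As a rough illustration of centering and z-scoring, the following standard-library sketch applies both to a short list of made-up values. The function names are hypothetical and the code is not taken from abagen.

```python
from statistics import mean, stdev


def center(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def zscore(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)  # stdev divides by n - 1
    return [(v - m) / s for v in values]


values = [2.0, 4.0, 6.0, 8.0]
centered = center(values)  # [-3.0, -1.0, 1.0, 3.0]
zscored = zscore(values)   # zero mean, unit sample standard deviation
```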
6.2.3. Min-max¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='minmax')
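Microarray values are rescaled to the unit interval with:
\[x_{norm} = \frac{x_{i} - \text{min}(x)}{\text{max}(x) - \text{min}(x)}\]
where \(x_{i}\) is a single expression value and \(\text{min}(x)\) and \(\text{max}(x)\) are the smallest and largest of the microarray expression values.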
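A minimal sketch of min-max rescaling on made-up values is given below. The helper name is hypothetical, and the sketch assumes the input is not constant (otherwise the denominator would be zero).

```python
def minmax(values):
    """Rescale values to the unit interval [0, 1]."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant input would divide by zero here
    return [(v - lo) / (hi - lo) for v in values]


rescaled = minmax([2.0, 4.0, 6.0, 10.0])  # [0.0, 0.25, 0.5, 1.0]
```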
6.2.4. Sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='sigmoid')
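Microarray values are normalized using a general sigmoid function:
\[x_{norm} = \frac{1}{1 + \exp\left(\frac{-(x_{i} - \bar{x})}{\sigma_{x}}\right)}\]
where \(x_{i}\) is a single expression value, \(\bar{x}\) is the mean, and \(\sigma_{x}\) is the sample standard deviation of the microarray expression values.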
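The sigmoid transform can be sketched as a logistic function of the z-scored values. This standard-library version is an illustration under that reading, with a hypothetical function name, rather than abagen's own implementation.

```python
from math import exp
from statistics import mean, stdev


def sigmoid(values):
    """Logistic function centered on the mean, scaled by the sample SD."""
    m, s = mean(values), stdev(values)
    return [1 / (1 + exp(-(v - m) / s)) for v in values]


out = sigmoid([1.0, 2.0, 3.0, 4.0, 5.0])
# The value equal to the mean (3.0) maps to 0.5; all outputs lie in (0, 1)
```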
6.2.5. Scaled sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_sigmoid')
Microarray values are processed with the sigmoid function and then rescaled to the unit interval with the min-max function.
6.2.6. Scaled sigmoid quantiles¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_sigmoid_quantiles')
Input data are clipped to the 5th and 95th percentiles before being processed with the scaled sigmoid transform. The clipped distribution is only used for calculation of \(\bar{x}\) and \(\sigma_{x}\); the full (i.e., unclipped) distribution is processed through the transformation.
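The sketch below illustrates one reading of this procedure using only the standard library. It assumes that "clipped" means clamping values to the percentile bounds (excluding out-of-range values instead would be an equally plausible reading), and the function names and data are hypothetical.

```python
from math import exp
from statistics import mean, quantiles, stdev


def minmax(values):
    """Rescale values to the unit interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def scaled_sigmoid_quantiles(values, low=5, high=95):
    """Scaled sigmoid whose mean and SD come from a clipped copy of the data."""
    # 99 percentile cut points; "inclusive" uses linear interpolation
    cuts = quantiles(values, n=100, method="inclusive")
    lower, upper = cuts[low - 1], cuts[high - 1]
    # Assumption: clipping clamps values to [lower, upper]
    clipped = [min(max(v, lower), upper) for v in values]
    m, s = mean(clipped), stdev(clipped)
    # The full (unclipped) values go through the transform
    sig = [1 / (1 + exp(-(v - m) / s)) for v in values]
    return minmax(sig)


# Twenty well-behaved values plus one extreme outlier
values = [float(v) for v in range(1, 21)] + [1000.0]
out = scaled_sigmoid_quantiles(values)
```

Since the outlier does not inflate the mean and standard deviation, the remaining values stay spread across the unit interval instead of being compressed toward one end.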
6.2.7. Robust sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='robust_sigmoid')
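Microarray values are normalized using a robust sigmoid function:
\[x_{norm} = \frac{1}{1 + \exp\left(\frac{-(x_{i} - \langle x \rangle)}{\text{IQR}_{x}}\right)}\]
where \(x_{i}\) is a single expression value, \(\langle x \rangle\) is the median, and \(\text{IQR}_{x}\) is the normalized interquartile range of the microarray expression values given as:
\[\text{IQR}_{x} = \frac{Q_{3} - Q_{1}}{2 \sqrt{2} \, \text{erf}^{-1}\left(\frac{1}{2}\right)} \approx \frac{Q_{3} - Q_{1}}{1.35}\]
with \(Q_{1}\) and \(Q_{3}\) the first and third quartiles.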
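A standard-library sketch of the robust sigmoid follows. It assumes the interquartile range is normalized by the IQR of a standard normal distribution (about 1.35), as in the Fulcher et al. formulation, and that quartiles are computed by linear interpolation. The function name is hypothetical.

```python
from math import exp
from statistics import NormalDist, median, quantiles

# IQR of a standard normal distribution: 2 * sqrt(2) * erfinv(1/2), about 1.35
IQR_SCALE = 2 * NormalDist().inv_cdf(0.75)


def robust_sigmoid(values):
    """Logistic function centered on the median, scaled by the normalized IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    iqr_norm = (q3 - q1) / IQR_SCALE  # assumes q3 > q1
    return [1 / (1 + exp(-(v - med) / iqr_norm)) for v in values]


out = robust_sigmoid([1.0, 2.0, 3.0, 4.0, 5.0])
# The median (3.0) maps to 0.5 regardless of how extreme other values are
```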
6.2.8. Scaled robust sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_robust_sigmoid')
Microarray values are processed with the robust sigmoid function and then rescaled to the unit interval with the min-max function.
6.2.9. Mixed sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='mixed_sigmoid')
Microarray values are processed with the scaled sigmoid function when their interquartile range is 0; otherwise, they are processed with the scaled robust sigmoid function.
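The fallback logic can be sketched as follows, using only the standard library. The helper names are hypothetical, and the quartile method and the 1.35 normalization constant are assumptions. The point of the sketch is that a zero interquartile range would make the robust variant divide by zero, so the mean-and-SD variant is used instead.

```python
from math import exp
from statistics import NormalDist, mean, median, quantiles, stdev

IQR_SCALE = 2 * NormalDist().inv_cdf(0.75)  # about 1.35


def _scaled_logistic(values, loc, scale):
    """Apply a sigmoid centered on `loc`, then min-max rescale to [0, 1]."""
    sig = [1 / (1 + exp(-(v - loc) / scale)) for v in values]
    lo, hi = min(sig), max(sig)
    return [(s - lo) / (hi - lo) for s in sig]


def mixed_sigmoid(values):
    """Scaled robust sigmoid, falling back to scaled sigmoid when IQR is 0."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    if q3 - q1 == 0:
        # Robust variant would divide by zero; use mean and sample SD
        return _scaled_logistic(values, mean(values), stdev(values))
    return _scaled_logistic(values, median(values), (q3 - q1) / IQR_SCALE)


mostly_constant = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 9.0]  # IQR is 0
spread_out = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 9.0]       # IQR is > 0

fallback = mixed_sigmoid(mostly_constant)
robust = mixed_sigmoid(spread_out)
```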
6.2.10. No normalization¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm=None)
Providing None to the sample_norm and gene_norm parameters will prevent any normalization procedure from being performed on the data. Use this with caution!
Note
Some of the more advanced methods described on this page were initially proposed in:
Fulcher, B.D., Little, M.A., & Jones, N.S. (2013). Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83), 20130048.
If using one of these methods please consider citing this paper in your work!
Applicable methods: robust sigmoid, scaled robust sigmoid, mixed sigmoid
6.3. Normalizing only matched samples¶
While sample normalization is _always_ performed across all genes, you can control which samples are used when performing gene normalization. By default, only those samples matched to regions in the provided atlas are used in the normalization process (controllable via the norm_matched parameter):
>>> abagen.get_expression_data(atlas['image'], norm_matched=True)
However, when a smaller atlas is provided with only a few regions, normalizing over just those samples matched to the atlas can be less desirable. To make it so that all available samples are used instead of only those matched, set norm_matched to False:
>>> abagen.get_expression_data(atlas['image'], norm_matched=False)
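To see why the choice of samples matters, consider the toy sketch below, which z-scores one gene either over matched samples only or over all samples. The data, the matching mask, and the use of z-scoring are assumptions for illustration; the code does not call abagen.

```python
from statistics import mean, stdev


def zscore(values):
    """Z-score a list of numbers using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical expression of one gene in six samples;
# only the first three samples fall within atlas regions
gene_expression = [4.0, 5.0, 6.0, 10.0, 12.0, 14.0]
matched = [True, True, True, False, False, False]

# Analogue of norm_matched=True: drop unmatched samples, then normalize
matched_only = zscore([v for v, m in zip(gene_expression, matched) if m])

# Analogue of norm_matched=False: normalize over all samples,
# then keep the values for the matched samples
all_samples = zscore(gene_expression)
matched_from_all = [v for v, m in zip(all_samples, matched) if m]
```

In the first case the matched samples are centered on zero; in the second they all fall below zero, because the unmatched samples pull the mean upward.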
Warning
Given the preponderance of parameters in abagen.get_expression_data(), it is perhaps unsurprising that they will interact with one another. However, it is worth pointing out that norm_matched interacts with the missing parameter in a relatively surprising manner. This is due to the order in which sample-to-region matching, normalization, and “missing” regions are handled: when norm_matched is set to True, all samples not matched to regions are removed prior to normalization. As such, if the missing parameter is set, the program is only able to fill in missing regions with samples that had already been assigned to other regions. If, instead, norm_matched=False and the missing parameter is set, the program can use the full range of samples to fill in missing regions. For this reason, we suggest using norm_matched=False when also setting the missing parameter; however, we do not impose a restriction on this.
6.4. Normalizing within structural classes¶
There are known differences in microarray expression between broad structural designations (e.g. cortex, subcortex/brainstem, cerebellum). As such, it may occasionally be desirable to constrain normalization such that the procedure is performed separately for each structural designation. This process can be controlled via the norm_structures parameter:
>>> abagen.get_expression_data(atlas['image'], norm_structures=True)
By default, this parameter is set to False and normalization uses all available samples. Note that changing this parameter will _dramatically_ modify the returned expression information, so use with caution. Because both parameters determine which samples enter the normalization, this parameter will interact heavily with the norm_matched parameter described above.
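A toy sketch of normalizing within structural classes is shown below. The labels, values, and helper names are hypothetical, and z-scoring stands in for whichever method is selected. It is meant only to show how much the result can differ from pooled normalization.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore(values):
    """Z-score a list of numbers using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def zscore_within_groups(values, groups):
    """Z-score `values` separately within each label in `groups`."""
    indices = defaultdict(list)
    for i, label in enumerate(groups):
        indices[label].append(i)
    out = [0.0] * len(values)
    for idx in indices.values():
        normed = zscore([values[i] for i in idx])
        for i, v in zip(idx, normed):
            out[i] = v
    return out


# Hypothetical expression of one gene, with a large offset between structures
expression = [4.0, 5.0, 6.0, 10.0, 12.0, 14.0]
structure = ["cortex"] * 3 + ["cerebellum"] * 3

pooled = zscore(expression)                            # one normalization
within = zscore_within_groups(expression, structure)   # one per structure
```

Pooled normalization preserves the offset between the two structures (all cortex values end up negative), whereas within-structure normalization removes it and centers each structure on zero.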