6. Data normalization options

The microarray expression data provided by the AHBA has been subjected to some normalization procedures designed to mitigate potential differences in expression values between donors due to “batch effects.” Despite these procedures, there are still some notable differences between donors present in the downloadable data.

By default, abagen.get_expression_data() aggregates expression data across donors (though this can be prevented via the return_donors parameter). Prior to aggregation, the function performs a within-donor normalization procedure to attempt to mitigate donor-specific effects; however, there are a number of ways to achieve this.

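Currently, abagen supports nine options for normalizing data:

  1. Centering

  2. Z-score

  3. Min-max

  4. Sigmoid

  5. Scaled sigmoid

  6. Scaled sigmoid quantiles

  7. Robust sigmoid

  8. Scaled robust sigmoid

  9. Mixed sigmoid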

Most of these options have been used at various points in the published literature, so there is no single “right” choice. We nonetheless encourage using the default option (scaled robust sigmoid), because recent work by Arnatkevičiūtė et al., 2019 showed that it is, as the name suggests, robust to the outlier effects commonly observed in microarray data.

We describe all the methods in detail here; these can be implemented by passing the sample_norm and gene_norm keyword arguments to abagen.get_expression_data(). For a selection of references to published works that have used these different methods please see the documentation of abagen.normalize_expression().

6.1. sample_norm vs gene_norm

Microarray expression data can be normalized in two directions:

  1. Each sample can be normalized across all genes, or

  2. Each gene can be normalized across all samples

These different forms of normalization are controlled by two parameters in the abagen.get_expression_data() function: sample_norm and gene_norm. Note that normalization of each sample across all genes occurs before normalization of each gene across all samples.

Both parameters can accept the same arguments (detailed below), and both are turned on by default.
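The two directions and their order can be sketched with NumPy on a toy matrix. This is only an illustration, not abagen's implementation: the samples-by-genes layout, the `zscore` helper, and the random values are all assumptions made for the example.

```python
import numpy as np

def zscore(x, axis):
    """Z-score along `axis` (a stand-in for any of the methods below)."""
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, ddof=1, keepdims=True)
    return (x - mean) / std

# Toy matrix: 6 samples (rows) x 4 genes (columns); values are made up
rng = np.random.default_rng(0)
expression = rng.normal(loc=8.0, scale=2.0, size=(6, 4))

# Step 1 (sample_norm analogue): each sample is normalized across all genes
sample_normed = zscore(expression, axis=1)
# Step 2 (gene_norm analogue): each gene is then normalized across all samples
gene_normed = zscore(sample_normed, axis=0)
```

Because the gene-wise step runs last, the columns of the final matrix have zero mean and unit variance, while the rows generally no longer do.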

6.2. Normalization methods

6.2.1. Centering

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='center')

Microarray values are centered with:

\[x_{norm} = x_{y} - \bar{x}\]

where \(\bar{x}\) is the mean of the microarray expression values.
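As a small worked example of the centering formula (plain NumPy on made-up values, not a call into abagen):

```python
import numpy as np

x = np.array([2.0, 4.0, 9.0])  # toy expression values; their mean is 5.0
x_norm = x - x.mean()          # centered values: -3, -1, 4
```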

6.2.2. Z-score

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='zscore')

Microarray values are normalized using a basic z-score function:

\[x_{norm} = \frac{x_{y} - \bar{x}} {\sigma_{x}}\]

where \(\bar{x}\) is the mean and \(\sigma_{x}\) is the sample standard deviation of the microarray expression values.
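A minimal NumPy sketch of this formula follows. The `zscore` helper is hypothetical, and reading "sample standard deviation" as `ddof=1` is an assumption about the intended estimator.

```python
import numpy as np

def zscore(x):
    """Z-score a 1-D vector using the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    # ddof=1 is an assumption: it gives the (n - 1)-normalized estimate
    return (x - x.mean()) / x.std(ddof=1)

x_norm = zscore([2.0, 4.0, 9.0])  # toy values with mean 5 and variance 13
```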

6.2.3. Min-max

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='minmax')

Microarray values are rescaled to the unit interval with:

\[x_{norm} = \frac{x_{y} - \text{min}(x)} {\text{max}(x) - \text{min}(x)}\]
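The rescaling can be sketched in a few lines of NumPy (the `minmax` helper is hypothetical and the values are made up):

```python
import numpy as np

def minmax(x):
    """Rescale a 1-D vector to the unit interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

x_norm = minmax([2.0, 4.0, 9.0])  # smallest value -> 0, largest -> 1
```

Note that in this sketch a constant input makes the denominator zero, so the result is undefined.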

6.2.4. Sigmoid

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='sigmoid')

Microarray values are normalized using a general sigmoid function:

\[x_{norm} = \frac{1} {1 + \exp \left( \frac{-(x_{y} - \bar{x})} {\sigma_{x}} \right)}\]

where \(\bar{x}\) is the mean and \(\sigma_{x}\) is the sample standard deviation of the microarray expression values.
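A hedged NumPy sketch of the sigmoid transform is shown below. The `sigmoid` helper is hypothetical, and `ddof=1` is again an assumption about how the sample standard deviation is computed.

```python
import numpy as np

def sigmoid(x):
    """Logistic transform of a 1-D vector, centered on its mean."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)  # ddof=1 is an assumption
    return 1 / (1 + np.exp(-(x - mean) / std))

# Toy values with mean 5.0, so the last entry sits exactly at the mean
x_norm = sigmoid([2.0, 4.0, 9.0, 5.0])
```

A value equal to the mean maps to 0.5, and all outputs lie strictly between 0 and 1.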

6.2.5. Scaled sigmoid

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_sigmoid')

Microarray values are processed with the sigmoid function and then rescaled to the unit interval with the min-max function.

6.2.6. Scaled sigmoid quantiles

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_sigmoid_quantiles')

The mean \(\bar{x}\) and standard deviation \(\sigma_{x}\) are computed from a copy of the input data clipped to the 5th and 95th percentiles. The full (i.e., unclipped) distribution is then processed with the scaled sigmoid transform using those values.
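One way to read this description is sketched below. It assumes that "clipped" means clamping values to the percentile bounds (as `np.clip` does) rather than discarding them; the helper name and the `low`/`high` arguments are hypothetical.

```python
import numpy as np

def scaled_sigmoid_quantiles(x, low=5, high=95):
    """Scaled sigmoid whose mean/std come from percentile-clipped data."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [low, high])
    clipped = np.clip(x, lo, hi)  # assumption: clamp, do not drop, outliers
    mean, std = clipped.mean(), clipped.std(ddof=1)
    # The full (unclipped) data go through the sigmoid...
    sig = 1 / (1 + np.exp(-(x - mean) / std))
    # ...and are then rescaled to the unit interval
    return (sig - sig.min()) / (sig.max() - sig.min())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 60.0])  # toy data
x_norm = scaled_sigmoid_quantiles(x)
```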

6.2.7. Robust sigmoid

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='robust_sigmoid')

Microarray values are normalized using a robust sigmoid function:

\[x_{norm} = \frac{1} {1 + \exp \left( \frac{-(x_{y} - \langle x \rangle)} {\text{IQR}_{x}} \right)}\]

where \(\langle x \rangle\) is the median and \(\text{IQR}_{x}\) is the normalized interquartile range of the microarray expression values given as:

\[\DeclareMathOperator\erf{erf} \text{IQR}_{x} = \frac{Q_{3} - Q_{1}} {2 \cdot \sqrt{2} \cdot \erf^{-1}\left(\frac{1}{2}\right)} \approx \frac{Q_{3} - Q_{1}} {1.35}\]
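The sketch below illustrates the robust sigmoid and checks the constant in the denominator. The factor \(2 \sqrt{2} \, \text{erf}^{-1}(1/2)\) equals the interquartile range of a standard normal distribution, so it can be computed as twice the 75th-percentile quantile. Helper names are hypothetical, and the quartiles use NumPy's default linear interpolation, which may differ from abagen's choice.

```python
from statistics import NormalDist

import numpy as np

# IQR of a standard normal: 2 * Phi^{-1}(0.75) = 2 * sqrt(2) * erfinv(1/2)
iqr_constant = 2 * NormalDist().inv_cdf(0.75)  # approximately 1.35

def normalized_iqr(x):
    """Interquartile range rescaled by the normal-distribution constant."""
    q1, q3 = np.percentile(x, [25, 75])
    return (q3 - q1) / iqr_constant

def robust_sigmoid(x):
    """Logistic transform centered on the median and scaled by the IQR."""
    x = np.asarray(x, dtype=float)
    return 1 / (1 + np.exp(-(x - np.median(x)) / normalized_iqr(x)))

x_norm = robust_sigmoid([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier
```

The median (here 3.0) maps to 0.5 no matter how extreme the outlier is, because neither the median nor the quartiles depend on the largest value.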

6.2.8. Scaled robust sigmoid

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_robust_sigmoid')

Microarray values are processed with the robust sigmoid function and then rescaled to the unit interval with the min-max function.
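To illustrate why this combination is less sensitive to outliers, the following sketch compares it with plain min-max scaling on made-up values. All helper names are hypothetical and this is not abagen's code.

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

def robust_sigmoid(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr_norm = (q3 - q1) / 1.35  # normalized interquartile range
    return 1 / (1 + np.exp(-(x - np.median(x)) / iqr_norm))

def scaled_robust_sigmoid(x):
    """Robust sigmoid followed by rescaling to the unit interval."""
    x = np.asarray(x, dtype=float)
    return minmax(robust_sigmoid(x))

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # toy values; 100 is an outlier
srs = scaled_robust_sigmoid(x)
mm = minmax(x)
```

With min-max alone the four typical values are squeezed into the bottom few percent of the range, whereas after the scaled robust sigmoid they remain spread across it.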

6.2.9. Mixed sigmoid

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='mixed_sigmoid')

Microarray values are processed with the scaled sigmoid function when their interquartile range is 0; otherwise, they are processed with the scaled robust sigmoid function.
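The fallback exists because the robust sigmoid divides by the interquartile range, which is undefined when that range is 0. A sketch of the selection logic for a single vector follows; helper names are hypothetical, and applying the rule one vector at a time is an assumption.

```python
import numpy as np

def minmax(x):
    return (x - x.min()) / (x.max() - x.min())

def scaled_sigmoid(x):
    sig = 1 / (1 + np.exp(-(x - x.mean()) / x.std(ddof=1)))
    return minmax(sig)

def scaled_robust_sigmoid(x):
    q1, q3 = np.percentile(x, [25, 75])
    sig = 1 / (1 + np.exp(-(x - np.median(x)) / ((q3 - q1) / 1.35)))
    return minmax(sig)

def mixed_sigmoid(x):
    """Use the scaled sigmoid when the IQR is 0, else the robust variant."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    if q3 - q1 == 0:
        return scaled_sigmoid(x)
    return scaled_robust_sigmoid(x)

# Mostly-constant toy vector: both quartiles are 0, so the IQR is 0
flat = mixed_sigmoid([0, 0, 0, 0, 0, 0, 0, 5, 10])
# Toy vector with a non-zero IQR: the robust variant is used
spread = mixed_sigmoid([1, 2, 3, 4, 100])
```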

6.2.10. No normalization

>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm=None)

Providing None to the sample_norm and gene_norm parameters will prevent any normalization procedure from being performed on the data. Use this with caution!

Note

Some of the more advanced methods described on this page were initially proposed in:

Fulcher, B.D., Little, M.A., & Jones, N.S. (2013). Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83), 20130048.

If using one of these methods please consider citing this paper in your work!

Applicable methods: robust sigmoid, scaled robust sigmoid, mixed sigmoid

6.3. Normalizing only matched samples

While sample normalization is _always_ performed across all genes, you can control which samples are used when performing gene normalization. By default, only those samples matched to regions in the provided atlas are used in the normalization process (controllable via the norm_matched parameter):

>>> abagen.get_expression_data(atlas['image'], norm_matched=True)

However, when a smaller atlas is provided with only a few regions, normalizing over just those samples matched to the atlas can be less desirable. To make it so that all available samples are used instead of only those matched, set norm_matched to False:

>>> abagen.get_expression_data(atlas['image'], norm_matched=False)

Warning

Given the preponderance of parameters in abagen.get_expression_data(), it is perhaps unsurprising that some of them interact with one another. In particular, norm_matched interacts with the missing parameter in a way that may be surprising. This is due to the order in which sample-to-region matching, normalization, and “missing” regions are handled: when norm_matched is set to True, all samples not matched to regions are removed prior to normalization. As such, if the missing parameter is set, the program can only fill in missing regions with samples that have already been assigned to other regions. If, instead, norm_matched=False and the missing parameter is set, the program can use the full range of samples to fill in missing regions. For this reason, we suggest using norm_matched=False when also setting the missing parameter; however, we do not enforce this.
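The effect of norm_matched on gene normalization can be pictured with a toy example. This is a conceptual sketch only: the boolean `matched` mask, the `zscore_cols` helper, and the use of a z-score as the normalization are all assumptions, and abagen's internal steps may differ.

```python
import numpy as np

def zscore_cols(x):
    """Normalize each gene (column) across the samples (rows) given."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Toy data: 8 samples x 3 genes, of which 5 samples fall inside the atlas
expression = np.arange(24, dtype=float).reshape(8, 3)
matched = np.array([True, True, False, True, False, False, True, True])

# norm_matched=True analogue: drop unmatched samples, then normalize
normed_matched = zscore_cols(expression[matched])

# norm_matched=False analogue: normalize over all samples, then keep matched
normed_all = zscore_cols(expression)[matched]
```

Both results describe the same five samples, but only in the first case are the normalization statistics computed from those five samples alone.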

6.4. Normalizing within structural classes

There are known differences in microarray expression between broad structural designations (e.g. cortex, subcortex/brainstem, cerebellum). As such, it may occasionally be desirable to constrain normalization such that the procedure is performed separately for each structural designation. This process can be controlled via the norm_structures parameter:

>>> abagen.get_expression_data(atlas['image'], norm_structures=True)

By default, this parameter is set to False and normalization uses all available samples. Note that changing this parameter will _dramatically_ modify the returned expression information, so use it with caution. Because both parameters determine which samples are normalized together, norm_structures will interact heavily with the norm_matched parameter described above.
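Conceptually, normalizing within structural classes means computing the normalization statistics separately for each group of samples. The sketch below uses made-up values and labels with a z-score as the stand-in method; it is an illustration under those assumptions, not abagen's implementation.

```python
import numpy as np

def zscore_cols(x):
    """Normalize each gene (column) across the samples (rows) given."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Toy data: 6 samples x 2 genes, with a hypothetical structure label each
expression = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 9.0],
                       [10.0, 20.0], [12.0, 21.0], [14.0, 25.0]])
structure = np.array(['cortex', 'cortex', 'cortex',
                      'cerebellum', 'cerebellum', 'cerebellum'])

# Normalize the samples of each structural class independently
normed = np.empty_like(expression)
for name in np.unique(structure):
    mask = structure == name
    normed[mask] = zscore_cols(expression[mask])
```

After this step each class has zero mean per gene, so baseline differences between classes are removed rather than preserved, which is why the returned values change so markedly.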