4. Probe selection options

The probes used to measure microarray expression levels in the AHBA data are often redundant; that is, there are frequently several probes indexing the same gene. Since the output of the abagen.get_expression_data() workflow is a region by gene dataframe, at some point we need to transition from indexing probe expression levels to indexing gene expression levels. Effectively, this means we need to select from or condense the redundant probes for each gene; however, there are a number of ways to do that.

Currently, abagen supports eight options for this probe-to-gene conversion. All of them have been used at various points in the published literature, so while there is no “right” choice, we encourage using the default option (differential stability): recent work by Arnatkevičiūtė et al., 2019 showed that it provides the highest fidelity to RNA sequencing data.

Available methods for probe_selection fall into two primary families: selecting a representative probe (Section 4.1) and collapsing across probes (Section 4.2).

We describe all the methods within these families here. Methods can be implemented by passing the probe_selection argument to the abagen.get_expression_data() function. For a selection of references to published works that have used these different methods please see the documentation of abagen.probes_.collapse_probes().

4.1. Selecting a representative probe

The first group of methods aims to select a single probe from each redundant group. This involves generating a selection criterion and masking the original probe-by-sample expression matrix to extract only the chosen probe, whose values are then used to represent the associated gene’s microarray expression:

../_images/select_probes.png

The only difference between methods in this group is the criterion used to select which probe to retain. The descriptions below explain how the mask in the above diagram is generated for each available option; in each diagram the red outline on the generated vector indicates which entry will be used to mask the original matrix. The extraction procedure (i.e., applying the mask to the original probe-by-sample expression matrix) is identical for all these methods.
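The shared extraction step can be sketched with pandas as follows. This is only an illustration on invented toy data, not abagen's internal code: the names expression, gene_of_probe, and score are hypothetical, and score stands in for whichever per-probe criterion a given method produces.

```python
import pandas as pd

# Toy probe-by-sample expression matrix (values invented for illustration)
expression = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [2.0, 2.0, 2.0]],
    index=["probe_a", "probe_b", "probe_c"],
    columns=["sample_1", "sample_2", "sample_3"],
)
# Hypothetical annotation mapping each probe to the gene it indexes
gene_of_probe = pd.Series(["GENE1", "GENE1", "GENE2"], index=expression.index)
# Stand-in for any per-probe selection criterion (arbitrary made-up scores)
score = pd.Series([0.2, 0.9, 0.5], index=expression.index)

# For each gene, find the probe with the highest score ...
selected = score.groupby(gene_of_probe).idxmax()
# ... turn that choice into a boolean mask over all probes ...
mask = expression.index.isin(selected)
# ... and apply the mask, relabelling the surviving rows by gene
gene_expression = expression.loc[mask].rename(index=gene_of_probe.to_dict())
```

Here gene_expression has one row per gene, and only the way score is computed would change from one selection method to the next.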

4.1.1. Max intensity

>>> abagen.get_expression_data(atlas['image'], probe_selection='max_intensity')

Selects the probe with the highest average expression across all samples (where samples are concatenated across donors).

../_images/max_intensity.png
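As a rough sketch of this criterion (toy values and hypothetical names, not abagen's own implementation), the probes indexing a single gene can be scored by their mean across the concatenated samples:

```python
import pandas as pd

# Probes (rows) indexing one gene; samples (columns) concatenated across donors
probes = pd.DataFrame(
    [[1.0, 5.0, 9.0],
     [6.0, 7.0, 8.0]],
    index=["probe_a", "probe_b"],
    columns=["donor1_s1", "donor1_s2", "donor2_s1"],
)

# Average expression of each probe across all samples
mean_intensity = probes.mean(axis=1)
# The probe with the highest average is retained for this gene
selected = mean_intensity.idxmax()
```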

4.1.2. Max variance

>>> abagen.get_expression_data(atlas['image'], probe_selection='max_variance')

Selects the probe with the highest variance in expression across all samples (where samples are concatenated across donors).

../_images/max_variance.png
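A minimal sketch of the variance criterion, using invented values in which the most variable probe is not the most strongly expressed one. The names are hypothetical, and the sample variance (pandas' default, ddof=1) is an assumption; the choice of ddof does not change which probe ranks highest.

```python
import pandas as pd

# Probes (rows) indexing one gene; samples (columns) concatenated across donors
probes = pd.DataFrame(
    [[1.0, 5.0, 9.0],
     [6.0, 7.0, 8.0]],
    index=["probe_a", "probe_b"],
    columns=["donor1_s1", "donor1_s2", "donor2_s1"],
)

# Variance of each probe across all samples (sample variance, ddof=1)
variance = probes.var(axis=1)
# The most variable probe is retained, even though probe_b has the higher mean
selected = variance.idxmax()
```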

4.1.3. Principal component loading

>>> abagen.get_expression_data(atlas['image'], probe_selection='pc_loading')

Selects the probe with the highest loading on the first principal component derived from the probe microarray expression across all samples (where samples are concatenated across donors).

../_images/pc_loading.png
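One way to sketch this criterion is with a singular value decomposition of the centred probe-by-sample matrix for a single gene. This is an illustration under stated assumptions rather than abagen's exact procedure: probes are centred (not scaled) across samples, and the absolute value of the loading is used because the sign of a principal component is arbitrary. All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Probes (rows) indexing one gene; samples (columns) concatenated across donors
probes = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [2.0, 4.0, 6.0, 8.0],
     [3.0, 3.1, 2.9, 3.0]],
    index=["probe_a", "probe_b", "probe_c"],
    columns=["s1", "s2", "s3", "s4"],
)

# Centre each probe across samples (assumption: centring without scaling)
centered = probes.sub(probes.mean(axis=1), axis=0)
# Columns of u are the principal axes in probe space, so u[:, 0] holds each
# probe's loading on the first principal component
u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
loadings = pd.Series(np.abs(u[:, 0]), index=probes.index)
# The probe with the largest absolute loading is retained
selected = loadings.idxmax()
```

Without scaling, probes with larger variance tend to dominate the first component, which is why probe_b is chosen in this toy example.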

4.1.4. Correlation

>>> abagen.get_expression_data(atlas['image'], probe_selection='corr_intensity')
>>> abagen.get_expression_data(atlas['image'], probe_selection='corr_variance')

When there are more than two probes indexing the same gene, selects the probe with the highest average correlation to other probes across all samples (where samples are concatenated across donors).

../_images/correlation.png

When there are exactly two probes the correlation procedure cannot be used (each probe's only correlation is with the other, so both receive the same score), and the method falls back to either the Max intensity (corr_intensity) or the Max variance (corr_variance) criterion.
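The correlation criterion and its two-probe fallback can be sketched as below. The function name select_by_correlation and its fallback argument are hypothetical, the data are invented, and the use of Pearson correlation is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def select_by_correlation(probes, fallback="intensity"):
    """Pick one probe (row) from a probe-by-sample frame for a single gene."""
    if len(probes) == 1:
        return probes.index[0]
    if len(probes) == 2:
        # Both probes share one correlation value, so use a fallback criterion
        if fallback == "intensity":
            return probes.mean(axis=1).idxmax()
        return probes.var(axis=1).idxmax()
    # Probe-by-probe correlation matrix (assumption: Pearson correlation)
    corr = np.corrcoef(probes.to_numpy())
    # Ignore each probe's correlation with itself
    np.fill_diagonal(corr, np.nan)
    mean_corr = np.nanmean(corr, axis=1)
    return probes.index[int(mean_corr.argmax())]

samples = ["s1", "s2", "s3", "s4", "s5"]
three_probes = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [1.0, 3.0, 2.0, 5.0, 4.0],
     [2.0, 1.0, 4.0, 3.0, 5.0]],
    index=["probe_a", "probe_b", "probe_c"],
    columns=samples,
)
two_probes = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, 4.0, 4.0]],
    index=["probe_a", "probe_b"],
    columns=samples[:3],
)

choice_three = select_by_correlation(three_probes)
choice_two_intensity = select_by_correlation(two_probes, fallback="intensity")
choice_two_variance = select_by_correlation(two_probes, fallback="variance")
```

In the three-probe case probe_a correlates well with both other probes, while probe_b and probe_c correlate only weakly with each other, so probe_a has the highest average correlation.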

4.1.5. Differential stability

>>> abagen.get_expression_data(atlas['image'], probe_selection='diff_stability')

Computes the Spearman correlation of microarray expression values for each probe across brain regions for every pair of donors. The correlations are averaged across donor pairs, and the probe with the highest average correlation for each gene is retained.

../_images/diff_stability.png
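A minimal sketch of the differential stability score, assuming each donor's data have already been reduced to a probe-by-region matrix with matching rows and columns. The helper names and the toy values are hypothetical. Spearman correlation is computed here as the Pearson correlation of ranks, which is equivalent.

```python
from itertools import combinations

import pandas as pd

def rowwise_spearman(x, y):
    """Spearman correlation between matching rows of two probe-by-region frames."""
    x_ranks, y_ranks = x.rank(axis=1), y.rank(axis=1)
    # corrwith correlates matching columns, so put probes on the columns
    return x_ranks.T.corrwith(y_ranks.T)

def differential_stability(donor_frames):
    """Average the probe-wise correlations over every pair of donors."""
    pairwise = [rowwise_spearman(a, b) for a, b in combinations(donor_frames, 2)]
    return pd.concat(pairwise, axis=1).mean(axis=1)

regions = ["region_1", "region_2", "region_3", "region_4"]
index = ["probe_a", "probe_b"]  # two probes indexing the same gene
donor_1 = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4]], index=index, columns=regions)
donor_2 = pd.DataFrame([[2, 3, 4, 5], [4, 3, 2, 1]], index=index, columns=regions)
donor_3 = pd.DataFrame([[1, 3, 5, 7], [2, 1, 4, 3]], index=index, columns=regions)

stability = differential_stability([donor_1, donor_2, donor_3])
# probe_a has the same regional pattern in every donor, so it is retained
selected = stability.idxmax()
```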

4.1.6. RNAseq

>>> abagen.get_expression_data(atlas['image'], probe_selection='rnaseq')

Computes the Spearman correlation of microarray expression values for each probe across brain regions with RNAseq data for the corresponding gene. As only two donors have RNAseq data (donors #9861 and #10021), this method only computes the correlations for these two donors. Correlations are averaged across the two donors and the probe with the highest average correlation for each gene is retained.

../_images/rnaseq.png
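The same idea can be sketched for the RNAseq criterion. The example assumes that, for each of the two donors, the microarray data form a probe-by-region matrix and the RNAseq data for the gene form a region-indexed vector. The function name rnaseq_score and all values are hypothetical, and Spearman correlation is again computed as the Pearson correlation of ranks.

```python
import pandas as pd

def rnaseq_score(microarray, rnaseq):
    """Spearman correlation of each probe (row) with a gene's RNAseq values."""
    # Rank across regions, then correlate each probe's ranks with the RNAseq ranks
    return microarray.rank(axis=1).T.corrwith(rnaseq.rank())

regions = ["region_1", "region_2", "region_3", "region_4"]
index = ["probe_a", "probe_b"]  # two probes indexing the same gene

# Toy data for the two donors that have RNAseq measurements
micro_1 = pd.DataFrame([[1, 2, 3, 4], [4, 1, 3, 2]], index=index, columns=regions)
rna_1 = pd.Series([10.0, 20.0, 30.0, 40.0], index=regions)
micro_2 = pd.DataFrame([[2, 1, 3, 4], [1, 2, 4, 3]], index=index, columns=regions)
rna_2 = pd.Series([5.0, 6.0, 7.0, 8.0], index=regions)

per_donor = pd.concat(
    [rnaseq_score(micro_1, rna_1), rnaseq_score(micro_2, rna_2)], axis=1
)
# Average across the two donors and keep the best-matching probe
mean_score = per_donor.mean(axis=1)
selected = mean_score.idxmax()
```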

4.2. Collapsing across probes

In contrast to selecting a single representative probe for each gene and discarding the others, we can instead opt to use all available probes and collapse them into a unified representation of the associated gene:

../_images/collapse_probes.png

Currently only one method supports this operation.

4.2.1. Average

>>> abagen.get_expression_data(atlas['image'], probe_selection='average')

Takes the average expression values for all probes indexing the same gene.

../_images/average.png

Providing 'mean' instead of 'average' will return identical results.
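Collapsing by averaging can be sketched as a group-wise mean over probes. As before, this is an illustration on invented data with hypothetical names, not abagen's internal code.

```python
import pandas as pd

# Toy probe-by-sample expression matrix (values invented for illustration)
expression = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [3.0, 4.0, 5.0],
     [2.0, 2.0, 2.0]],
    index=["probe_a", "probe_b", "probe_c"],
    columns=["sample_1", "sample_2", "sample_3"],
)
# Hypothetical annotation mapping each probe to the gene it indexes
gene_of_probe = pd.Series(["GENE1", "GENE1", "GENE2"], index=expression.index)

# Average all probes indexing the same gene, sample by sample
gene_expression = expression.groupby(gene_of_probe).mean()
```

Unlike the selection methods, no probe is discarded: every probe contributes to its gene's row in gene_expression.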

5. Donor aggregation in probe selection

Unless otherwise specified in the description of a method, probe selection is performed using data aggregated across samples from all donors. However, this may not be desired: the probe that most reliably indexes a gene in one donor may differ from the probe that does so in another donor.

To allow for this possibility, we describe three options for modifying how probe selection is performed across donors in detail below. These methods can be implemented by passing the donor_probes argument to the abagen.get_expression_data() function.

5.1. Aggregate selection across donors

>>> abagen.get_expression_data(atlas['image'], donor_probes='aggregate')

The default option, this will aggregate tissue samples from all donors and apply the chosen probe_selection method to this single probe x sample matrix. The probe chosen to represent each gene will be identical across all donors.
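To make the aggregation concrete, the sketch below concatenates samples from two toy donors into one probe-by-sample matrix and then applies a max-intensity criterion to a single gene's probes. The frame names and values are hypothetical.

```python
import pandas as pd

index = ["probe_a", "probe_b"]  # two probes indexing the same gene
donor_1 = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]], index=index, columns=["d1_s1", "d1_s2"]
)
donor_2 = pd.DataFrame(
    [[9.0, 9.0, 9.0], [1.0, 1.0, 1.0]],
    index=index,
    columns=["d2_s1", "d2_s2", "d2_s3"],
)

# Aggregate: one probe-by-sample matrix holding samples from all donors
aggregated = pd.concat([donor_1, donor_2], axis=1)
# Apply the chosen criterion once (here, mean intensity) to the pooled samples
selected = aggregated.mean(axis=1).idxmax()
```

Note that donor_1 taken alone would favour probe_b, but the pooled data favour probe_a, and that single choice is then used for every donor.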

5.2. Independent selection for donors

>>> abagen.get_expression_data(atlas['image'], donor_probes='independent')

Performs the chosen probe_selection method independently for each donor. The probe chosen to represent each gene may be different across donors.

Note: this option cannot be used when the specified probe_selection is one of: ‘diff_stability’, ‘rnaseq’, or ‘average’.

5.3. Most common selection across donors

>>> abagen.get_expression_data(atlas['image'], donor_probes='common')

Performs the chosen probe_selection method independently for each donor and then uses the most commonly-selected probe to represent each gene. The probe chosen to represent each gene will be identical across all donors.

Note: this option cannot be used when the specified probe_selection is one of: ‘diff_stability’, ‘rnaseq’, or ‘average’.
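The difference between independent and most-common selection can be sketched for a single gene as follows, using max intensity as the per-donor criterion. All names and values are hypothetical, and taking the first mode when donors tie is an arbitrary choice made only for this illustration.

```python
import pandas as pd

index = ["probe_a", "probe_b"]  # two probes indexing the same gene
columns = ["s1", "s2"]
donors = {
    "donor_1": pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=index, columns=columns),
    "donor_2": pd.DataFrame([[9.0, 9.0], [1.0, 1.0]], index=index, columns=columns),
    "donor_3": pd.DataFrame([[5.0, 6.0], [2.0, 3.0]], index=index, columns=columns),
}

# Independent: apply the criterion separately within each donor
independent = pd.Series(
    {name: frame.mean(axis=1).idxmax() for name, frame in donors.items()}
)
# Common: keep the probe chosen most often across donors
# (assumption: ties resolved by taking the first mode)
common = independent.mode().iloc[0]
```

With independent selection donor_1 would be represented by probe_b and the other donors by probe_a; with common selection all three donors use probe_a.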