3. Parcellating expression data¶
3.1. Basic usage¶
Once you’ve downladed the microarray data and selected your parcellation you can process the data. This should be as simple as:
>>> expression = abagen.get_expression_data(atlas['image'])
If you are using a volumetric image it is highly recommended you provide the additional information on your parcellation, which can be done with:
>>> expression = abagen.get_expression_data(atlas['image'], atlas['info'])
Note
By default this function will use data from all the donors! Wrangling all
this raw microarray data can be quite time-consuming, so if you’d like to
speed up this step make sure you’ve performed the IO installation.
Alternatively, if you don’t want to use all the donors when testing or
debugging the code, you can provide the subject IDs of the donors you want
(e.g., donors=['9861', '10021']
).
The abagen.get_expression_data()
function will print out some information
about what’s happening as it goes. However, briefly the function:
Fetches the microarray data from AHBA (if this has not already been done). Refer to parameters
data_dir
anddonors
for more info.Updates the MNI coordinates of all the tissue samples from AHBA using the coordinates from the
alleninf
package. This occurs by default; refer to parametercorrected_mni
for more info.Mirrors samples across hemispheres to increase spatial coverage. This does not occur by default; refer to parameter
lr_mirror
for more info (or see Duplicating samples with the lr_mirror parameter).Reannotates microarray probe-to-gene mappings with information from Arnatkevic̆iūtė et al., 2019, NeuroImage. This occurs by default; refer to parameter
reannotated
for more info.Performs a similarity-based filtering of tissue samples, removing those samples whose expression across probes is poorly correlated with other samples. This does not occur by default; refer to parameter
sim_threshold
for more info.Performs intensity-based filtering of probes to remove those that do not exceed background noise. This occurs by default with a threshold of 0.5 (i.e., probes must exceed background noise in 50% of all tissue samples); refer to parameter
ibf_threshold
for more info.Selects a representative probe amongst those probes indexing the same gene. This occurs by default by selecting the probe with the highest differential stability amongst donors; refer to parameter
probe_selection
for more info (or see Probe selection options).Matches tissue samples to regions in the user-specified
atlas
. Refer to parametersatlas
,atlas_info
,missing
, andtolerance
for more info (or see Filling in data with the missing parameter).Normalizes expression values for each sample across genes for each donor. This occurs by default using a scaled robust sigmoid normalization function; refere to parameter
sample_norm
for more info.Normalizes expression values for each gene across samples for each donor. This occurs by default using a scaled robust sigmoid normalization function; refer to parameter
gene_norm
for more info.Aggregates samples within regions in the user-specified
atlas
based on matches made in Step 7. By default, samples are averaged separately for each donor and then averaged across donors. Refer to parametersregion_agg
,agg_metric
, andreturn_donors
for more info.
You can investigate all these parameters and options for modifying how the
expression
array is generated by looking at the Reference API.
3.2. The parcellated expression DataFrame¶
The expression
object returned by abagen.get_expression_data()
is a
pandas.DataFrame
, where rows correspond to region labels as defined in the
atlas image, columns correspond to genes, and entry values are microarray
expression data normalized and aggregated across donors:
>>> print(expression)
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
1 0.498266 0.664570 0.395276 ... 0.675843 0.555539 0.487572
2 0.649068 0.578997 0.496142 ... 0.483165 0.382653 0.504041
3 0.530613 0.623289 0.516300 ... 0.732930 0.359707 0.450664
... ... ... ... ... ... ... ...
81 0.388748 0.277961 0.474202 ... 0.279683 0.480953 0.405504
82 0.825836 0.602271 0.334143 ... 0.195722 0.447894 0.746475
83 0.384593 0.203654 0.746060 ... 0.379274 0.706803 0.509437
[83 rows x 15633 columns]
By default the data are normalized using a scaled robust sigmoid function such that expression values for a given gene will range from 0-1, where 0 indicates the region with the lowest expression of that gene and 1 indicates the region with highest.
Since the generated DataFrame is an aggregate (default: average) of multiple donors it is possible (likely) that a given region may not have any expression values exactly equal to 0 or 1.
3.3. Getting dense expression data¶
Unfortunately, due to how tissue samples were collected from the donor brains it is possible that some regions in an atlas may not be represented by any expression data. In the above example, two of the rows are missing data:
>>> print(expression.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 NaN NaN NaN ... NaN NaN NaN
73 NaN NaN NaN ... NaN NaN NaN
[2 rows x 15633 columns]
These regions, corresponding to the right frontal pole (label 72) and right temporal pole (label 73) in the Desikan-Killiany atlas, were not matched to any tissue samples; this is likely due to the fact that only two of the six donors have tissue samples taken from the right hemisphere.
If you require a dense matrix—that is, you need expression values for
every region in your atlas
—there are a few parameters that you can
consider tuning to try and achieve this.
3.3.1. Filling in data with the missing
parameter¶
By default, the abagen.get_expression_data()
function will attempt to be
as precise as possible in matching microarray samples with brain regions. It
takes the following steps to do this for each tissue sample:
Determine if the sample falls directly within a region of
atlas
.Check to see if the sample is close to any regions by slowly expanding the search space (in 1mm increments) to include nearby voxels up to a specified distance threshold (specified via the
tolerance
parameter).If there are multiple nearby regions, determine which region is closer by calculating the center-of-mass of the abutting regions.
If at any step a sample can be assigned to a region in atlas
the sample is
assigned to that region and the matching procedure is terminated. However, as
we saw, regions with no assigned samples from any donor are simply left as NaN.
If you would like to force all regions to be assigned at least one sample you
can set the missing
parameter. This parameter accepts three options:
None
(default), "centroids"
, and "interpolate"
. By setting this
parameter the workflow will go through the normal procedure as documented above
and then, once all samples are matched, check for any empty regions and assign
them expression values based on the specified method.
When using the ‘centroid’ method the empty regions in the atlas will be assigned the expression values of the tissue sample falling closest to the centroid of that region. Note that this procedure is only performed when _all_ donors are missing data in a given region. In this case, a weighted average of the matched samples are taken across donors, where weights are calculated as the inverse distance between the tissue sample matched to the parcel centroid for each donor.
When using the ‘interpolate’ method, expression values will be interpolated in the empty regions by assigning every node in the region the expression of the nearest tissue sample. The weighted (inverse distance) average of the densely-interpolated map will be taken and used to represent parcellated expression values for the region. Note that, unlike in the centroid matching procedure described above, this interpolation is done independently for every donor, irrespective of whether other donors have tissue samples that fall within a given region.
Thus, setting the missing
parameter when calling
abagen.get_expression_data()
will always return a dense expression
matrix (at the expense of some anatomical precision):
# first, check with ``missing='centroids'``
>>> exp_centroids = abagen.get_expression_data(atlas['image'], atlas['info'],
... missing='centroids')
>>> print(exp_centroids.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 0.574699 0.750184 0.246746 ... 0.656938 0.193677 0.647785
73 0.725151 0.652906 0.528831 ... 0.478334 0.501293 0.483642
[2 rows x 15633 columns]
# then, check with ``missing='interpolate'``
>>> exp_interpolate = abagen.get_expression_data(atlas['image'], atlas['info'],
... missing='interpolate')
>>> print(exp_interpolate.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 0.532308 0.710846 0.299322 ... 0.675837 0.301105 0.586290
73 0.736345 0.663072 0.497092 ... 0.507378 0.467046 0.531494
[2 rows x 15633 columns]
Warning
Refer to the documentation for normalization
for additional information on how other settings interact with the
missing
parameter.
3.3.2. Duplicating samples with the lr_mirror
parameter¶
If your parcellation is sufficiently low-resolution it is likely that most regions in the left hemisphere (for which all six donors have tissue samples) will be matched to at least one sample, whereas regions in the right hemisphere may come up short.
To remedy this you can try modifying the lr_mirror
parameter when calling
abagen.get_expression_data()
. This parameter accepts four options:
None
(default), "bidirectional"
, "leftright"
, and "rightleft"
.
As the name suggests, the lr_mirror
options control whether tissue samples
are mirrored across the left/right hemisphere axis. By supplying the
‘bidirectional’ options, all samples in the left hemisphere are duplicated and
mirrored onto the right hemisphre, and vice-versa for right to left. The other
options (‘leftright’ and ‘rightleft) will mirror only one hemisphere (i.e.,
‘leftright’ will mirror samples in the left onto the right hemisphere).
Unlike the missing
parameter this will not guarantee that all regions are
matched to a sample, but it will increase the likelihood that this happens:
>>> exp_mirror = abagen.get_expression_data(atlas['image'], atlas['info'],
... lr_mirror='bidirectional')
>>> print(exp_mirror.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 0.832617 0.648154 0.425707 ... 0.580406 0.439378 0.799856
73 0.682180 0.569551 0.627497 ... 0.430146 0.302926 0.425995
[2 rows x 15633 columns]
Note that since this effectively duplicates the number of tissue samples the
function runtime will increase somewhat. Also, importantly, setting the
lr_mirror
parameter will change the expression values of all of the
regions in the generated matrix–not just the regions that are missing data. It
is worth considering which (if either!) of these options best suits your
intended analysis.