3. Parcellating expression data

3.1. Basic usage

Once you’ve downladed the microarray data and selected your parcellation you can process the data. This should be as simple as:

>>> expression = abagen.get_expression_data(atlas['image'])

If you are using a volumetric image it is highly recommended you provide the additional information on your parcellation, which can be done with:

>>> expression = abagen.get_expression_data(atlas['image'], atlas['info'])

Note

By default this function will use data from all the donors! Wrangling all this raw microarray data can be quite time-consuming, so if you’d like to speed up this step make sure you’ve performed the IO installation. Alternatively, if you don’t want to use all the donors when testing or debugging the code, you can provide the subject IDs of the donors you want (e.g., donors=['9861', '10021']).

The abagen.get_expression_data() function will print out some information about what’s happening as it goes. However, briefly the function:

  1. Fetches the microarray data from AHBA (if this has not already been done). Refer to parameters data_dir and donors for more info.

  2. Updates the MNI coordinates of all the tissue samples from AHBA using the coordinates from the alleninf package. This occurs by default; refer to parameter corrected_mni for more info.

  3. Mirrors samples across hemispheres to increase spatial coverage. This does not occur by default; refer to parameter lr_mirror for more info (or see Duplicating samples with the lr_mirror parameter).

  4. Reannotates microarray probe-to-gene mappings with information from Arnatkevic̆iūtė et al., 2019, NeuroImage. This occurs by default; refer to parameter reannotated for more info.

  5. Performs a similarity-based filtering of tissue samples, removing those samples whose expression across probes is poorly correlated with other samples. This does not occur by default; refer to parameter sim_threshold for more info.

  6. Performs intensity-based filtering of probes to remove those that do not exceed background noise. This occurs by default with a threshold of 0.5 (i.e., probes must exceed background noise in 50% of all tissue samples); refer to parameter ibf_threshold for more info.

  7. Selects a representative probe amongst those probes indexing the same gene. This occurs by default by selecting the probe with the highest differential stability amongst donors; refer to parameter probe_selection for more info (or see Probe selection options).

  8. Matches tissue samples to regions in the user-specified atlas. Refer to parameters atlas, atlas_info, missing, and tolerance for more info (or see Filling in data with the missing parameter).

  9. Normalizes expression values for each sample across genes for each donor. This occurs by default using a scaled robust sigmoid normalization function; refere to parameter sample_norm for more info.

  10. Normalizes expression values for each gene across samples for each donor. This occurs by default using a scaled robust sigmoid normalization function; refer to parameter gene_norm for more info.

  11. Aggregates samples within regions in the user-specified atlas based on matches made in Step 7. By default, samples are averaged separately for each donor and then averaged across donors. Refer to parameters region_agg, agg_metric, and return_donors for more info.

You can investigate all these parameters and options for modifying how the expression array is generated by looking at the Reference API.

3.2. The parcellated expression DataFrame

The expression object returned by abagen.get_expression_data() is a pandas.DataFrame, where rows correspond to region labels as defined in the atlas image, columns correspond to genes, and entry values are microarray expression data normalized and aggregated across donors:

>>> print(expression)
gene_symbol      A1BG  A1BG-AS1       A2M  ...       ZYX     ZZEF1      ZZZ3
label                                      ...
1            0.498266  0.664570  0.395276  ...  0.675843  0.555539  0.487572
2            0.649068  0.578997  0.496142  ...  0.483165  0.382653  0.504041
3            0.530613  0.623289  0.516300  ...  0.732930  0.359707  0.450664
...               ...       ...       ...  ...       ...       ...       ...
81           0.388748  0.277961  0.474202  ...  0.279683  0.480953  0.405504
82           0.825836  0.602271  0.334143  ...  0.195722  0.447894  0.746475
83           0.384593  0.203654  0.746060  ...  0.379274  0.706803  0.509437

[83 rows x 15633 columns]

By default the data are normalized using a scaled robust sigmoid function such that expression values for a given gene will range from 0-1, where 0 indicates the region with the lowest expression of that gene and 1 indicates the region with highest.

Since the generated DataFrame is an aggregate (default: average) of multiple donors it is possible (likely) that a given region may not have any expression values exactly equal to 0 or 1.

3.3. Getting dense expression data

Unfortunately, due to how tissue samples were collected from the donor brains it is possible that some regions in an atlas may not be represented by any expression data. In the above example, two of the rows are missing data:

>>> print(expression.loc[[72, 73]])
gene_symbol  A1BG  A1BG-AS1  A2M  ...  ZYX  ZZEF1  ZZZ3
label                             ...
72            NaN       NaN  NaN  ...  NaN    NaN   NaN
73            NaN       NaN  NaN  ...  NaN    NaN   NaN

[2 rows x 15633 columns]

These regions, corresponding to the right frontal pole (label 72) and right temporal pole (label 73) in the Desikan-Killiany atlas, were not matched to any tissue samples; this is likely due to the fact that only two of the six donors have tissue samples taken from the right hemisphere.

If you require a dense matrix—that is, you need expression values for every region in your atlas—there are a few parameters that you can consider tuning to try and achieve this.

3.3.1. Filling in data with the missing parameter

By default, the abagen.get_expression_data() function will attempt to be as precise as possible in matching microarray samples with brain regions. It takes the following steps to do this for each tissue sample:

  1. Determine if the sample falls directly within a region of atlas.

  2. Check to see if the sample is close to any regions by slowly expanding the search space (in 1mm increments) to include nearby voxels up to a specified distance threshold (specified via the tolerance parameter).

  3. If there are multiple nearby regions, determine which region is closer by calculating the center-of-mass of the abutting regions.

If at any step a sample can be assigned to a region in atlas the sample is assigned to that region and the matching procedure is terminated. However, as we saw, regions with no assigned samples from any donor are simply left as NaN.

If you would like to force all regions to be assigned at least one sample you can set the missing parameter. This parameter accepts three options: None (default), "centroids", and "interpolate". By setting this parameter the workflow will go through the normal procedure as documented above and then, once all samples are matched, check for any empty regions and assign them expression values based on the specified method.

When using the ‘centroid’ method the empty regions in the atlas will be assigned the expression values of the tissue sample falling closest to the centroid of that region. Note that this procedure is only performed when _all_ donors are missing data in a given region. In this case, a weighted average of the matched samples are taken across donors, where weights are calculated as the inverse distance between the tissue sample matched to the parcel centroid for each donor.

When using the ‘interpolate’ method, expression values will be interpolated in the empty regions by assigning every node in the region the expression of the nearest tissue sample. The weighted (inverse distance) average of the densely-interpolated map will be taken and used to represent parcellated expression values for the region. Note that, unlike in the centroid matching procedure described above, this interpolation is done independently for every donor, irrespective of whether other donors have tissue samples that fall within a given region.

Thus, setting the missing parameter when calling abagen.get_expression_data() will always return a dense expression matrix (at the expense of some anatomical precision):

# first, check with ``missing='centroids'``
>>> exp_centroids = abagen.get_expression_data(atlas['image'], atlas['info'],
...                                            missing='centroids')
>>> print(exp_centroids.loc[[72, 73]])
gene_symbol      A1BG  A1BG-AS1       A2M  ...       ZYX     ZZEF1      ZZZ3
label                                      ...
72           0.574699  0.750184  0.246746  ...  0.656938  0.193677  0.647785
73           0.725151  0.652906  0.528831  ...  0.478334  0.501293  0.483642

[2 rows x 15633 columns]

# then, check with ``missing='interpolate'``
>>> exp_interpolate = abagen.get_expression_data(atlas['image'], atlas['info'],
...                                              missing='interpolate')
>>> print(exp_interpolate.loc[[72, 73]])
gene_symbol      A1BG  A1BG-AS1       A2M  ...       ZYX     ZZEF1      ZZZ3
label                                ...
72           0.532308  0.710846  0.299322  ...  0.675837  0.301105  0.586290
73           0.736345  0.663072  0.497092  ...  0.507378  0.467046  0.531494

[2 rows x 15633 columns]

Warning

Refer to the documentation for normalization for additional information on how other settings interact with the missing parameter.

3.3.2. Duplicating samples with the lr_mirror parameter

If your parcellation is sufficiently low-resolution it is likely that most regions in the left hemisphere (for which all six donors have tissue samples) will be matched to at least one sample, whereas regions in the right hemisphere may come up short.

To remedy this you can try modifying the lr_mirror parameter when calling abagen.get_expression_data(). This parameter accepts four options: None (default), "bidirectional", "leftright", and "rightleft". As the name suggests, the lr_mirror options control whether tissue samples are mirrored across the left/right hemisphere axis. By supplying the ‘bidirectional’ options, all samples in the left hemisphere are duplicated and mirrored onto the right hemisphre, and vice-versa for right to left. The other options (‘leftright’ and ‘rightleft) will mirror only one hemisphere (i.e., ‘leftright’ will mirror samples in the left onto the right hemisphere).

Unlike the missing parameter this will not guarantee that all regions are matched to a sample, but it will increase the likelihood that this happens:

>>> exp_mirror = abagen.get_expression_data(atlas['image'], atlas['info'],
...                                         lr_mirror='bidirectional')
>>> print(exp_mirror.loc[[72, 73]])
gene_symbol      A1BG  A1BG-AS1       A2M  ...       ZYX     ZZEF1      ZZZ3
label                                ...
72           0.832617  0.648154  0.425707  ...  0.580406  0.439378  0.799856
73           0.682180  0.569551  0.627497  ...  0.430146  0.302926  0.425995

[2 rows x 15633 columns]

Note that since this effectively duplicates the number of tissue samples the function runtime will increase somewhat. Also, importantly, setting the lr_mirror parameter will change the expression values of all of the regions in the generated matrix–not just the regions that are missing data. It is worth considering which (if either!) of these options best suits your intended analysis.