abagen.get_expression_data

abagen.get_expression_data(atlas, atlas_info=None, *, ibf_threshold=0.5, probe_selection='diff_stability', donor_probes='aggregate', sim_threshold=None, lr_mirror=None, exact=None, missing=None, tolerance=2, sample_norm='srs', gene_norm='srs', norm_matched=True, norm_structures=False, region_agg='donors', agg_metric='mean', corrected_mni=True, reannotated=True, return_counts=False, return_donors=False, return_report=False, donors='all', data_dir=None, verbose=0, n_proc=1)

Assigns microarray expression data to ROIs defined in atlas

This function aims to provide a workflow for generating pre-processed microarray expression data from the Allen Human Brain Atlas ([A2]) for arbitrary atlas designations. First, some basic filtering of genetic probes is performed, including:

  1. Intensity-based filtering of microarray probes to remove probes that do not exceed a certain level of background noise (specified via the ibf_threshold parameter),

  2. Selection of a single, representative probe (or collapsing across probes) for each gene, specified via the probe_selection parameter (and influenced by the donor_probes parameter), and

  3. Optional mirroring of the tissue samples across the left/right hemisphere boundary, as specified via the lr_mirror parameter (turned off by default).
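As an illustration of the first step, intensity-based filtering can be sketched as follows. This is a minimal sketch, not the library's implementation: the function name `filter_probes`, the dictionary layout, and the choice to retain a probe that sits exactly at the threshold are all assumptions made for the example.

```python
def filter_probes(detected, ibf_threshold=0.5):
    """Return the IDs of probes detected in a large enough share of samples.

    `detected` maps a probe ID to a list of booleans, one per sample pooled
    across all donors, where True means the probe's signal exceeded
    background noise in that sample.
    """
    keep = []
    for probe_id, calls in detected.items():
        ratio = sum(calls) / len(calls)
        # Assumption: a probe exactly at the threshold is retained.
        if ratio >= ibf_threshold:
            keep.append(probe_id)
    return keep


detected = {
    "probe_a": [True, True, True, False],    # detected in 75% of samples
    "probe_b": [True, False, False, False],  # detected in 25% of samples
    "probe_c": [True, True, False, False],   # detected in 50% of samples
}
kept = filter_probes(detected, ibf_threshold=0.5)
```

With the default threshold of 0.5, `probe_b` is dropped because it rises above background in only a quarter of the samples.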

Tissue samples are then matched to parcels in the defined atlas for each donor. If atlas_info is provided then this matching is constrained by both hemisphere and tissue class designation (e.g., cortical samples from the left hemisphere are only matched to ROIs in the left cortex, subcortical samples from the right hemisphere are only matched to ROIs in the right subcortex); see the atlas_info parameter description for more information.
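The hemisphere and tissue-class constraint can be pictured as a filter applied before any spatial matching takes place. The sketch below is illustrative only: `candidate_parcels` is a hypothetical helper, the atlas_info rows are made up, and treating a "B" parcel as compatible with samples from either hemisphere is an assumption.

```python
def candidate_parcels(sample, atlas_info):
    """Return the parcel IDs a sample is allowed to be matched to."""
    matches = []
    for row in atlas_info:
        same_structure = row["structure"] == sample["structure"]
        # Assumption: parcels labelled "B" accept samples from either side.
        same_hemisphere = row["hemisphere"] in (sample["hemisphere"], "B")
        if same_structure and same_hemisphere:
            matches.append(row["id"])
    return matches


# Made-up atlas_info rows with the three required columns.
atlas_info = [
    {"id": 1, "hemisphere": "L", "structure": "cortex"},
    {"id": 2, "hemisphere": "R", "structure": "cortex"},
    {"id": 3, "hemisphere": "L", "structure": "subcortex/brainstem"},
    {"id": 4, "hemisphere": "R", "structure": "subcortex/brainstem"},
    {"id": 5, "hemisphere": "B", "structure": "cerebellum"},
]

left_cortical = {"hemisphere": "L", "structure": "cortex"}
right_subcortical = {"hemisphere": "R", "structure": "subcortex/brainstem"}

left_candidates = candidate_parcels(left_cortical, atlas_info)       # [1]
right_candidates = candidate_parcels(right_subcortical, atlas_info)  # [4]
```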

Matching of microarray samples to parcels in atlas is done via a multi-step process:

  1. Determine if the sample falls directly within a parcel,

  2. Check to see if there are nearby parcels by slowly expanding the search space to include nearby voxels, up to a specified distance (specified via the tolerance parameter),

  3. If there are multiple nearby parcels, the sample is assigned to the closest parcel, as determined by the parcel centroid.

If at any step a sample can be assigned to a parcel the matching process is terminated. When the provided atlas is not volumetric (i.e., surface-based) the samples are simply matched to the nearest vertex, and tolerance is used as a standard deviation threshold. More control over the sample matching can be obtained by setting the exact parameter; see the parameter description for more information.
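The volumetric matching steps above can be sketched on a toy voxel grid. This is a simplified illustration rather than the library's code: it assumes 1 mm isotropic voxels stored in a dictionary, a cubic search neighbourhood that grows by one voxel per iteration, and hypothetical function names.

```python
import math
from itertools import product


def parcel_centroids(atlas):
    """Average the voxel coordinates belonging to each parcel."""
    sums = {}
    for voxel, label in atlas.items():
        total, count = sums.get(label, ((0, 0, 0), 0))
        sums[label] = (tuple(t + v for t, v in zip(total, voxel)), count + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}


def match_sample(sample, atlas, tolerance=2):
    """Assign `sample` (a voxel coordinate) to a parcel label, or None."""
    # Step 1: the sample falls directly within a parcel.
    if sample in atlas:
        return atlas[sample]
    centroids = parcel_centroids(atlas)
    # Step 2: widen the search one voxel at a time, up to `tolerance`.
    for radius in range(1, tolerance + 1):
        nearby = set()
        for offset in product(range(-radius, radius + 1), repeat=3):
            voxel = tuple(s + o for s, o in zip(sample, offset))
            if voxel in atlas:
                nearby.add(atlas[voxel])
        if len(nearby) == 1:
            return nearby.pop()
        # Step 3: several parcels are nearby, so use the closest centroid.
        if len(nearby) > 1:
            return min(nearby, key=lambda label: math.dist(sample, centroids[label]))
    # Nothing within tolerance: the sample stays unassigned.
    return None


# Toy atlas: parcel 1 spans x=0..1 and parcel 2 spans x=5..7 along one axis.
atlas = {(0, 0, 0): 1, (1, 0, 0): 1, (5, 0, 0): 2, (6, 0, 0): 2, (7, 0, 0): 2}

direct = match_sample((0, 0, 0), atlas)                 # step 1 applies
tie_break = match_sample((3, 0, 0), atlas)              # both parcels found at radius 2
too_far = match_sample((3, 0, 0), atlas, tolerance=1)   # nothing within 1 voxel
```

In the `tie_break` case both parcels are first reached at the same search radius, so the sample goes to parcel 1, whose centroid (at x = 0.5) is closer than that of parcel 2 (at x = 6).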

Once all samples have been matched to parcels for all supplied donors, the microarray expression data are optionally normalized via the provided sample_norm and gene_norm functions (which are influenced by the norm_matched and norm_structures parameters) before being aggregated across donors via the supplied region_agg and agg_metric parameters.
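The difference between the two region_agg settings is easiest to see when donors contribute unequal numbers of samples to a region. The sketch below handles a single gene in a single region; `aggregate_region` and its input layout are hypothetical and only mirror the behaviour described on this page.

```python
from statistics import mean


def aggregate_region(samples_by_donor, region_agg="donors", metric=mean):
    """Collapse one gene's sample values within one region to a single number.

    `samples_by_donor` maps a donor ID to the expression values of that
    donor's samples assigned to the region.
    """
    if region_agg == "samples":
        # Pool every sample from every donor, then reduce once.
        pooled = [v for values in samples_by_donor.values() for v in values]
        return metric(pooled)
    if region_agg == "donors":
        # Reduce within each donor first, then reduce across donors.
        per_donor = [metric(values) for values in samples_by_donor.values()]
        return metric(per_donor)
    raise ValueError(f"unknown region_agg: {region_agg!r}")


# Donor 1 contributes three samples to the region, donor 2 only one.
region = {"donor_1": [1.0, 2.0, 3.0], "donor_2": [6.0]}

by_samples = aggregate_region(region, "samples")  # (1 + 2 + 3 + 6) / 4 = 3.0
by_donors = aggregate_region(region, "donors")    # (2.0 + 6.0) / 2 = 4.0
```

With 'samples', the donor with more samples dominates the regional value; with 'donors', each donor carries equal weight regardless of how many samples it contributed.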

Parameters:
  • atlas (niimg-like object or dict) – A parcellation image in MNI space or a tuple of GIFTI images in fsaverage5 space, where each parcel is identified by a unique integer ID. Alternatively, a dictionary where keys are donor IDs and values are parcellation images (or surfaces) in the native space of each donor.

  • atlas_info (os.PathLike or pandas.DataFrame, optional) – Filepath to or pre-loaded dataframe containing information about atlas. Must have at least columns ‘id’, ‘hemisphere’, and ‘structure’ containing information mapping atlas IDs to hemisphere (i.e., “L”, “R”, “B”) and broad structural class (i.e., “cortex”, “subcortex/brainstem”, “cerebellum”). If provided, this will constrain matching of tissue samples to regions in atlas. If atlas is a tuple of GIFTI images with valid label tables this will be intuited from the data. Default: None

  • ibf_threshold ([0, 1] float, optional) – Threshold for intensity-based filtering. This number specifies the ratio of samples, across all supplied donors, for which a probe must have signal significantly greater than background noise in order to be retained. Default: 0.5

  • probe_selection (str, optional) – Selection method for subsetting (or collapsing across) probes that index the same gene. Must be one of ‘average’, ‘max_intensity’, ‘max_variance’, ‘pc_loading’, ‘corr_variance’, ‘corr_intensity’, ‘diff_stability’, or ‘rnaseq’; see Notes for more information on different options. Default: ‘diff_stability’

  • donor_probes ({'aggregate', 'independent', 'common'}, optional) – Whether specified probe_selection method should be performed with microarray data from all donors (‘aggregate’), independently for each donor (‘independent’), or based on the most common selected probe across donors (‘common’). Not all combinations of probe_selection and donor_probes methods are viable. Default: ‘aggregate’

  • sim_threshold ((0, inf) float, optional) – Threshold for inter-areal similarity filtering. Samples are correlated across probes and those samples with a total correlation less than sim_threshold standard deviations below the mean across samples are excluded from further analysis. If not specified no filtering is performed. Default: None

  • lr_mirror ({None, 'bidirectional', 'leftright', 'rightleft'}, optional) – Whether to mirror microarray expression samples across hemispheres to increase spatial coverage. Using ‘bidirectional’ will mirror samples across both hemispheres, ‘leftright’ will mirror samples in the left hemisphere to the right, and ‘rightleft’ will mirror the right to the left. Default: None

  • missing ({'centroids', 'interpolate', None}, optional) – How to handle regions in atlas that are not assigned any tissue samples. If ‘centroids’, any empty regions will be assigned the expression value of the nearest tissue sample (defined as the sample with the closest Euclidean distance to the parcel centroid). If ‘interpolate’, expression values will be interpolated in the empty regions by assigning every node in the region the expression of the nearest sample and taking a weighted (inverse distance) average. If not specified empty regions will be returned with expression values of NaN. Default: None

  • tolerance (int, optional) – Maximum distance (in mm) that a sample may be from a parcel and still be matched to that parcel. If atlas is a tuple of surface files then this measure is a standard deviation threshold (i.e., samples greater than tolerance SDs away from the mean matched distance are ignored). Default: 2

  • sample_norm ({'rs', 'srs', 'minmax', 'center', 'zscore', None}, optional) – Method by which to normalize microarray expression values for each sample. Expression values are normalized separately for each sample and donor across all genes; see Notes for more information on different methods. If None is specified then no normalization is performed. Default: ‘srs’

  • gene_norm ({'rs', 'srs', 'minmax', 'center', 'zscore', None}, optional) – Method by which to normalize microarray expression values for each donor. Expression values are normalized separately for each gene and donor across all samples; see Notes for more information on different methods. If None is specified then no normalization is performed. Default: ‘srs’

  • norm_matched (bool, optional) – Whether to perform gene normalization (gene_norm) across only those samples matched to regions in atlas instead of all available samples. If atlas is very small (i.e., only a few regions of interest), using norm_matched=False is suggested. Default: True

  • norm_structures (bool, optional) – Whether to perform gene normalization (gene_norm) within structural classes (i.e., ‘cortex’, ‘subcortex/brainstem’, ‘cerebellum’) instead of across all available samples. Default: False

  • region_agg ({'samples', 'donors'}, optional) – When multiple samples are identified as belonging to a region in atlas this determines how they are aggregated. If ‘samples’, expression data from all samples for all donors assigned to a given region are combined. If ‘donors’, expression values for all samples assigned to a given region are combined independently for each donor before being combined across donors. See agg_metric for mechanism by which samples are combined. Default: ‘donors’

  • agg_metric ({'mean', 'median'} or callable, optional) – Mechanism by which to reduce sample-level expression data into region-level expression (see region_agg). If a callable, should be able to accept an N-dimensional input and the axis keyword argument and return an N-1-dimensional output. Default: ‘mean’

  • corrected_mni (bool, optional) – Whether to use the “corrected” MNI coordinates shipped with the alleninf package instead of the coordinates provided with the AHBA data when matching tissue samples to anatomical regions. Default: True

  • reannotated (bool, optional) – Whether to use reannotated probe information provided by [A1] instead of the default probe information from the AHBA dataset. Using reannotated information will discard probes that could not be reliably matched to genes. Default: True

  • return_counts (bool, optional) – Whether to return dataframe containing information on how many samples were assigned to each parcel in atlas for each donor. Default: False

  • return_donors (bool, optional) – Whether to return donor-level expression arrays instead of aggregating expression across donors with provided agg_metric. Default: False

  • return_report (bool, optional) – Whether to return a string containing longform text describing the processing procedures used to generate the expression DataFrames returned by this function. Default: False

  • donors (list, optional) – List of donors to use as sources of expression data. Can be either donor numbers or UID. If not specified will use all available donors. Note that donors ‘9861’ and ‘10021’ have samples from both left + right hemispheres; all other donors have samples from the left hemisphere only. Default: ‘all’

  • data_dir (os.PathLike, optional) – Directory where expression data should be downloaded (if it does not already exist) / loaded. If not specified will use the current directory. Default: None

  • verbose (int, optional) – Specifies verbosity of status messages to display during workflow. Higher numbers increase verbosity of messages while zero suppresses all messages. Default: 0

  • n_proc (int, optional) – Number of processors to use to download AHBA data. Can parallelize up to six times. Default: 1
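As an illustration of the ‘centroids’ option of the missing parameter, an empty region can be filled by looking up the sample nearest to the region's centroid. This is a minimal sketch under assumptions: the helper `fill_from_nearest_sample`, the coordinates, and the expression values are invented for the example.

```python
import math


def fill_from_nearest_sample(centroid, samples):
    """Return the expression of the sample closest to `centroid`.

    `samples` is a list of (coordinate, expression) pairs, where closeness
    is measured as Euclidean distance to the parcel centroid.
    """
    _, expression = min(samples, key=lambda s: math.dist(s[0], centroid))
    return expression


samples = [
    ((10.0, 0.0, 0.0), {"GENE_A": 0.8}),
    ((-2.0, 1.0, 0.0), {"GENE_A": 0.3}),
]
empty_parcel_centroid = (0.0, 0.0, 0.0)

# The second sample is about 2.2 mm away versus 10 mm for the first.
filled = fill_from_nearest_sample(empty_parcel_centroid, samples)
```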

Returns:

  • expression ((R, G) pandas.DataFrame) – Microarray expression for R regions in atlas for G genes, aggregated across donors, where the index corresponds to the unique integer IDs of atlas and the columns are gene names. If return_donors=True then this is a list of (R, G) dataframes, one for each donor.

  • counts ((R, D) pandas.DataFrame) – Number of samples assigned to each of R regions in atlas for each of D donors (if multiple donors were specified); only returned if return_counts=True.

  • report (str) – Methods describing processing procedures implemented to generate expression, suitable to be used in a manuscript Methods section. Only returned if return_report=True.

Notes

The following methods can be used for collapsing across probes when multiple probes are available for the same gene:

  1. probe_selection='average'

Takes the average of expression data across all probes indexing the same gene. Providing ‘mean’ as the input method will return the same thing. This method can only be used when donor_probes='aggregate'.

  2. probe_selection='max_intensity'

Selects the probe with the maximum average expression across samples from all donors.

  3. probe_selection='max_variance'

Selects the probe with the maximum variance in expression across samples from all donors.

  4. probe_selection='pc_loading'

Selects the probe with the maximum loading along the first principal component of a decomposition performed across samples from all donors.

  5. probe_selection='corr_intensity'

Selects the probe with the maximum correlation to other probes from the same gene when >2 probes exist; otherwise, uses the same procedure as max_intensity.

  6. probe_selection='corr_variance'

Selects the probe with the maximum correlation to other probes from the same gene when >2 probes exist; otherwise, uses the same procedure as max_variance.

  7. probe_selection='diff_stability'

Selects the probe with the most consistent pattern of regional variation across donors (i.e., the highest average correlation across brain regions between all pairs of donors). This method can only be used when donor_probes='aggregate'.

  8. probe_selection='rnaseq'

Selects the probe with the most consistent pattern of regional variation to RNAseq data (across the two donors with RNAseq data). This method can only be used when donor_probes='aggregate'.

Note that for incompatible combinations of probe_selection and donor_probes (as detailed above), the probe_selection choice will take precedence. For example, providing probe_selection='diff_stability' and donor_probes='independent' will cause donor_probes to be reset to ‘aggregate’.
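Two of the simpler selection rules, ‘max_intensity’ and ‘max_variance’, can be sketched for a single gene as follows. The helper `select_probe` and the probe values are hypothetical, and the use of the sample variance (rather than the population variance) is an assumption of the example.

```python
from statistics import mean, variance


def select_probe(probes, method="max_intensity"):
    """Pick one representative probe ID for a gene.

    `probes` maps a probe ID to its expression values across samples pooled
    from all donors.
    """
    if method == "max_intensity":
        score = mean
    elif method == "max_variance":
        score = variance  # sample variance; an assumption of this sketch
    else:
        raise ValueError(f"unsupported method in this sketch: {method!r}")
    return max(probes, key=lambda probe_id: score(probes[probe_id]))


probes = {
    "probe_1": [5.0, 5.0, 5.0, 5.0],  # high but flat expression
    "probe_2": [1.0, 3.0, 1.0, 3.0],  # lower but variable expression
}

by_intensity = select_probe(probes, "max_intensity")  # "probe_1"
by_variance = select_probe(probes, "max_variance")    # "probe_2"
```

The two rules can disagree on the same gene: `probe_1` has the higher mean, while `probe_2` carries all of the variance.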

The following methods can be used for normalizing microarray expression values prior to aggregating:

  1. {sample,gene}_norm='rs'

Uses a robust sigmoid function as in [A3] to normalize values

  2. {sample,gene}_norm='srs'

Same as ‘rs’ but scales output to the unit interval (i.e., range 0-1)

  3. {sample,gene}_norm='minmax'

Scales data to the unit interval (i.e., range 0-1)

  4. {sample,gene}_norm='center'

Removes the mean of expression values

  5. {sample,gene}_norm='zscore'

Applies a basic z-score (subtract mean, divide by standard deviation); uses degrees of freedom equal to one for standard deviation
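The normalization options can be sketched on a one-dimensional list of values. This is an illustrative sketch, not the library's code: the function names are hypothetical, and the robust sigmoid follows the form commonly attributed to [A3], using the median and the interquartile range divided by 1.35, which is an assumption about the exact scaling.

```python
import math
from statistics import mean, median, quantiles, stdev


def minmax(values):
    """Rescale values linearly so they span the range 0-1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def center(values):
    """Remove the mean."""
    mu = mean(values)
    return [v - mu for v in values]


def zscore(values):
    """Subtract the mean and divide by the SD (one degree of freedom)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def robust_sigmoid(values):
    """Sigmoid centred on the median and scaled by a normalized IQR."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    # Assumption: dividing by 1.35 makes the IQR comparable to an SD.
    scale = (q3 - q1) / 1.35
    return [1 / (1 + math.exp(-(v - med) / scale)) for v in values]


def scaled_robust_sigmoid(values):
    """Robust sigmoid followed by rescaling to the range 0-1."""
    return minmax(robust_sigmoid(values))


# One extreme value (100.0) among otherwise small values.
expression = [1.0, 2.0, 3.0, 4.0, 100.0]

rs = robust_sigmoid(expression)
srs = scaled_robust_sigmoid(expression)
mm = minmax(expression)
```

The outlier shows why the sigmoid variants can be preferable: under ‘minmax’ the four small values are squeezed close to zero, whereas under ‘rs’ and ‘srs’ they remain spread out because the scaling depends on the median and interquartile range rather than on the extremes.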

References

[A1] Arnatkevičiūtė, A., Fulcher, B. D., & Fornito, A. (2019). A practical guide to linking brain-wide gene expression and neuroimaging data. NeuroImage, 189, 353-367.

[A2] Hawrylycz, M. J., et al. (2012). An anatomically comprehensive atlas of the adult human transcriptome. Nature, 489, 391-399.

[A3] Fulcher, B. D., & Fornito, A. (2016). A transcriptional signature of hub connectivity in the mouse connectome. Proceedings of the National Academy of Sciences, 113(5), 1435-1440.