Command-line usage

You can use many of the primary workflows in abagen from the command line.

The abagen command

Assigns microarray expression data to ROIs defined in the specified atlas.

This command aims to provide a workflow for generating pre-processed microarray expression data from the Allen Human Brain Atlas for arbitrary atlas designations. First, some basic filtering of genetic probes is performed, including:

  1. Intensity-based filtering of microarray probes to remove probes that do not exceed a certain level of background noise (specified via the --ibf_threshold parameter),

  2. Selection of a single, representative probe (or collapsing across probes) for each gene, specified via the --probe_selection parameter (and influenced by the --donor_probes parameter), and

  3. Optional mirroring of the tissue samples across the left/right hemisphere boundary, as specified via the --lr_mirror parameter (turned off by default).
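The first of these filtering steps can be sketched in a few lines of plain Python. This is an illustration rather than abagen's own code: the function name, the dictionary layout, and the inclusive comparison (>=) are assumptions made for the example.

```python
def intensity_based_filter(above_noise, threshold=0.5):
    """Return IDs of probes whose signal exceeds background noise often enough.

    above_noise maps a probe ID to a list of booleans, one per tissue sample
    (pooled across donors), that are True where the probe's signal exceeded
    background noise. This layout is hypothetical.
    """
    kept = []
    for probe, flags in above_noise.items():
        ratio = sum(flags) / len(flags)  # proportion of samples above noise
        if ratio >= threshold:           # assumption: the threshold is inclusive
            kept.append(probe)
    return kept


# Toy data: probe_a passes in 3 of 4 samples, probe_b in 1 of 4.
flags = {
    "probe_a": [True, True, False, True],
    "probe_b": [False, False, True, False],
}
retained = intensity_based_filter(flags, threshold=0.5)
print(retained)
```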

Tissue samples are then matched to parcels in the defined atlas for each donor. If --atlas_info is provided then this matching is constrained by both hemisphere and tissue class designation (e.g., cortical samples from the left hemisphere are only matched to ROIs in the left cortex, subcortical samples from the right hemisphere are only matched to ROIs in the right subcortex); see the atlas_info parameter description for more information.

Matching of microarray samples to parcels in atlas is done via a multi-step process:

  1. Determine if the sample falls directly within a parcel,

  2. Check to see if there are nearby parcels by incrementally expanding the search space to include nearby voxels, up to a specified distance (set via the --tolerance parameter),

  3. If there are multiple nearby parcels, the sample is assigned to the closest parcel, as determined by the parcel centroid.

If a sample can be assigned to a parcel at any step, the matching process terminates. When the provided atlas is not volumetric (i.e., is surface-based), samples are simply matched to the nearest vertex and --tolerance is used as a standard deviation threshold. More control over the sample matching can be obtained by setting the --missing parameter.
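The three-step volumetric matching described above can be sketched on a toy voxel grid. The sketch makes several simplifying assumptions that are not taken from abagen itself: the atlas is a dictionary from voxel index to parcel ID, one voxel equals 1 mm, sample coordinates are already voxel indices, and the search space grows as a cube one voxel at a time. The function names are hypothetical.

```python
import math
from itertools import product


def parcel_centroids(atlas):
    """Compute the mean voxel coordinate of every parcel in the toy atlas."""
    coords = {}
    for voxel, label in atlas.items():
        coords.setdefault(label, []).append(voxel)
    return {
        label: tuple(sum(axis) / len(voxels) for axis in zip(*voxels))
        for label, voxels in coords.items()
    }


def match_sample(sample, atlas, tolerance=2):
    """Return the parcel ID matched to a sample, or None if nothing is close."""
    # Step 1: the sample falls directly within a parcel.
    if sample in atlas:
        return atlas[sample]
    centroids = parcel_centroids(atlas)
    # Step 2: expand the search space one voxel at a time, up to the tolerance.
    for radius in range(1, tolerance + 1):
        nearby = set()
        for offset in product(range(-radius, radius + 1), repeat=3):
            voxel = tuple(s + o for s, o in zip(sample, offset))
            if voxel in atlas:
                nearby.add(atlas[voxel])
        # Step 3: with several candidates, pick the closest parcel centroid.
        if nearby:
            return min(nearby, key=lambda label: math.dist(sample, centroids[label]))
    return None


# Toy atlas: parcel 1 spans x = 0..1, parcel 2 spans x = 3..5.
toy_atlas = {(0, 0, 0): 1, (1, 0, 0): 1, (3, 0, 0): 2, (4, 0, 0): 2, (5, 0, 0): 2}
print(match_sample((1, 0, 0), toy_atlas))  # inside parcel 1
print(match_sample((2, 0, 0), toy_atlas))  # between parcels; centroid decides
print(match_sample((9, 0, 0), toy_atlas))  # beyond the tolerance
```

The search stops at the first radius that yields any candidate, which mirrors the statement that matching terminates as soon as a sample can be assigned.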

Once all samples have been matched to parcels for all supplied donors, the microarray expression data are optionally normalized via the provided --sample_norm and --gene_norm functions (which are influenced by the --norm_all and --norm_structures parameters) before being aggregated across donors via the supplied --region_agg and --agg_metric parameters.

usage: abagen [-h] [--version] [-v] [--atlas_info PATH]
              [--donors DONOR_ID [DONOR_ID ...]] [--data_dir PATH]
              [--n_proc N_PROC] [--ibf_threshold THRESHOLD]
              [--probe_selection METHOD] [--lr_mirror METHOD]
              [--sim_threshold THRESHOLD] [--missing METHOD] [--tol TOLERANCE]
              [--sample_norm METHOD] [--gene_norm METHOD] [--norm_all]
              [--norm_structures] [--region_agg METHOD] [--agg_metric METHOD]
              [--no-reannotated] [--no-corrected-mni] [--stdout]
              [--output-file PATH] [--save-counts] [--save-donors]
              atlas [atlas ...]
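As an illustration, an invocation can be assembled and inspected from Python before it is run. This is only a sketch: the file names atlas.nii.gz, atlas_info.csv, and expression.csv are hypothetical placeholders, and only flags listed in the usage summary above are used.

```python
import shlex

# Hypothetical file names; substitute your own atlas and output paths.
atlas = "atlas.nii.gz"
command = [
    "abagen",
    "--atlas_info", "atlas_info.csv",
    "--donors", "9861", "10021",
    "--probe_selection", "diff_stability",
    "--output-file", "expression.csv",
    atlas,
]

# shlex.join quotes each argument so the string can be pasted into a shell.
command_line = shlex.join(command)
print(command_line)
```

Because --donors accepts one or more values, the atlas is placed after a different option here so that it is not read as another donor ID. To actually run the command, the list could be passed to subprocess.run(command, check=True).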

Positional Arguments

atlas

A NIFTI image in MNI152 space or two GIFTI images in fsaverage5 space, where each parcel is identified by a unique integer ID.

Named Arguments

--version

Show program version and exit.

-v, --verbose

Increase verbosity of status messages to display during workflow.

Options to specify information about the atlas used

--atlas_info, --atlas-info

Filepath to CSV file containing information about atlas. The CSV file must have at least columns [“id”, “hemisphere”, “structure”] which contain information mapping the atlas IDs to hemispheres (i.e., “L”, “R”, or “B”) and broad structural groups (i.e., “cortex”, “subcortex/brainstem”, “cerebellum”). If provided, this will constrain matching of tissue samples to regions in atlas. If the supplied atlas is a pair of GIFTI files with valid label tables, this information will be inferred.
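A minimal file with the three required columns could be generated as follows. The parcel IDs and their assignments are made up for illustration; in practice the IDs must match the integer labels in the atlas image.

```python
import csv
import io

# Hypothetical parcels; IDs must match the integer labels in the atlas image.
rows = [
    {"id": 1, "hemisphere": "L", "structure": "cortex"},
    {"id": 2, "hemisphere": "R", "structure": "cortex"},
    {"id": 3, "hemisphere": "B", "structure": "subcortex/brainstem"},
    {"id": 4, "hemisphere": "R", "structure": "cerebellum"},
]

# Write to an in-memory buffer; use open("atlas_info.csv", "w", newline="")
# instead to create a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "hemisphere", "structure"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

atlas_info_csv = buffer.getvalue()
print(atlas_info_csv)
```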

Options to specify which AHBA data to use during processing

--donors

List of donors to use as sources of expression data. Specified IDs can be either donor numbers (e.g., 9861, 10021) or UIDs (e.g., H0351.2001). Can specify “all” to use all available donors. Default: “all”

--data_dir, --data-dir

Directory where expression data should be downloaded to (if it does not already exist) / loaded from. If not specified, this will check the environment variable $ABAGEN_DATA, the $HOME/abagen-data directory, and the current working directory. If data do not already exist at one of those locations, they will be downloaded to the first of these locations that exists and for which write access is enabled.
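The lookup order described above can be sketched as follows. The function names are hypothetical and the sketch only covers choosing a download location; it does not check whether AHBA data are already present in a directory.

```python
import os
from pathlib import Path


def candidate_data_dirs(environ=None, home=None, cwd=None):
    """List the locations that are checked, in the order given in the text."""
    environ = os.environ if environ is None else environ
    candidates = []
    if environ.get("ABAGEN_DATA"):
        candidates.append(Path(environ["ABAGEN_DATA"]))
    candidates.append(Path(home or Path.home()) / "abagen-data")
    candidates.append(Path(cwd or Path.cwd()))
    return candidates


def pick_download_dir(candidates):
    """Return the first candidate that exists and is writable, else None."""
    for path in candidates:
        if path.is_dir() and os.access(path, os.W_OK):
            return path
    return None


print(candidate_data_dirs())
print(pick_download_dir(candidate_data_dirs()))
```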

--n_proc, --n-proc

Number of processors to use to download AHBA data. Can parallelize up to six times if all donors are requested. Default: 1

Options to specify processing options

--ibf_threshold, --ibf-threshold

Threshold for intensity-based filtering of probes. This number should specify the ratio of samples, across all supplied donors, for which a probe must have signal above background noise in order to be retained. Default: 0.5

--probe_selection, --probe-selection

Possible choices: average, corr_intensity, corr_variance, diff_stability, max_intensity, max_variance, mean, pc_loading, rnaseq

Selection method for subsetting (or collapsing across) probes that index the same gene. Must be one of {“average”, “mean”, “max_intensity”, “max_variance”, “pc_loading”, “corr_variance”, “corr_intensity”, “diff_stability”, “rnaseq”}. Default: “diff_stability”

--lr_mirror, --lr-mirror

Possible choices: None, bidirectional, leftright, rightleft

Whether to mirror microarray expression samples across hemispheres to increase spatial coverage. Using “bidirectional” will mirror samples across both hemispheres, “leftright” will mirror samples in the left hemisphere to the right, and “rightleft” will mirror the right to the left. Default: None

--sim_threshold, --sim-threshold

Threshold for inter-areal similarity filtering. Samples are correlated across probes, and samples whose total correlation falls more than the provided threshold (in standard deviations) below the mean across samples are excluded from further analysis. If not specified, no filtering is performed. Default: None
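One way to read this rule is sketched below. Several details are assumptions made for the example and may differ from abagen's implementation: Pearson correlation is used, a sample's correlation with itself is left out of its total, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def similarity_filter(samples, threshold):
    """Return indices of samples that survive similarity filtering.

    samples is a list of per-sample expression vectors (one value per probe).
    """
    totals = [
        sum(pearson(s, other) for j, other in enumerate(samples) if j != i)
        for i, s in enumerate(samples)
    ]
    cutoff = mean(totals) - threshold * pstdev(totals)
    return [i for i, total in enumerate(totals) if total >= cutoff]


# Four samples share a profile; the last one is anti-correlated with them.
toy_samples = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0, 40.0],
    [4.0, 3.0, 2.0, 1.0],
]
kept = similarity_filter(toy_samples, threshold=1)
print(kept)
```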

--missing

Possible choices: None, centroids, interpolate

How to handle regions in atlas that are not assigned any tissue samples. If “centroids”, any empty regions will be assigned the expression value of the nearest tissue sample (defined as the sample with the closest Euclidean distance to the parcel centroid). If “interpolate”, expression values will be interpolated in the empty regions by assigning every node in the region the expression of the nearest sample and taking a weighted (inverse distance) average. If not specified empty regions will be returned with expression values of NaN. Default: None
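The “centroids” behaviour amounts to a nearest-neighbour lookup, sketched here for a single gene. The function name and the (coordinate, expression) tuple layout are hypothetical.

```python
import math


def fill_from_nearest_sample(centroid, samples):
    """Return the expression of the sample closest to a parcel centroid.

    samples is a list of (coordinate, expression) tuples; coordinates are
    assumed to share the centroid's space and units.
    """
    _, expression = min(samples, key=lambda s: math.dist(centroid, s[0]))
    return expression


# Two toy samples with a single expression value each.
toy = [((0.0, 0.0, 0.0), 0.2), ((10.0, 0.0, 0.0), 0.9)]
filled = fill_from_nearest_sample((7.0, 1.0, 0.0), toy)
print(filled)
```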

--tol, --tolerance

Distance (in mm) that a sample can be from a parcel for it to be matched to that parcel. If atlas is GIFTI files then this measure is a standard deviation threshold (i.e., samples greater than tolerance SDs away from the mean matched distance are ignored). Default: 2
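For surface atlases, the standard deviation rule can be sketched as follows. The use of the population standard deviation and an inclusive cutoff are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def within_tolerance(distances, tolerance=2):
    """Flag samples whose matched distance is not an outlier.

    distances holds each sample's distance to its nearest vertex. A sample is
    kept when its distance is at most `tolerance` standard deviations above
    the mean matched distance.
    """
    cutoff = mean(distances) + tolerance * pstdev(distances)
    return [d <= cutoff for d in distances]


# Nine samples sit about 1 mm from the surface; the last is 9 mm away.
toy_distances = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 0.9, 9.0]
keep = within_tolerance(toy_distances, tolerance=2)
print(keep)
```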

--sample_norm, --sample-norm

Possible choices: center, demean, minmax, mixed_sig, mixed_sigmoid, robust_sigmoid, rs, rsig, scaled_robust_sigmoid, scaled_rsig, scaled_sig, scaled_sig_qnt, scaled_sigmoid, scaled_sigmoid_quantiles, sig, sigmoid, srs, zscore, None

Method by which to normalize microarray expression values for each sample prior to collapsing into regions in atlas. Expression values are normalized separately for each sample and donor across genes. If None is specified then no normalization is performed. Default: “srs”

--gene_norm, --gene-norm

Possible choices: center, demean, minmax, mixed_sig, mixed_sigmoid, robust_sigmoid, rs, rsig, scaled_robust_sigmoid, scaled_rsig, scaled_sig, scaled_sig_qnt, scaled_sigmoid, scaled_sigmoid_quantiles, sig, sigmoid, srs, zscore, None

Method by which to normalize microarray expression values for each donor prior to collapsing across donors. Expression values are normalized separately for each gene for each donor across all expression samples. If None is specified then no normalization is performed. Default: “srs”
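The two normalization options differ mainly in the axis they operate on: sample normalization rescales each sample across genes, while gene normalization rescales each gene across samples. The sketch below illustrates this with a scaled robust sigmoid. The formula used here (a sigmoid centred on the median and scaled by IQR / 1.35, followed by rescaling to the unit interval) is the commonly cited definition and is an assumption about what “srs” computes; the quartile method may also differ from abagen's.

```python
import math
from statistics import median, quantiles


def scaled_robust_sigmoid(values):
    """Robust sigmoid followed by rescaling to the [0, 1] interval."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr_norm = (q3 - q1) / 1.35  # normalized interquartile range (assumed)
    sig = [1 / (1 + math.exp(-(v - med) / iqr_norm)) for v in values]
    lo, hi = min(sig), max(sig)
    return [(s - lo) / (hi - lo) for s in sig]


# Toy matrix: rows are samples, columns are genes.
expr = [
    [1.0, 2.0, 4.0],
    [2.0, 3.0, 9.0],
    [0.5, 5.0, 6.0],
]

# Sample normalization: each row (sample) is rescaled across its genes.
by_sample = [scaled_robust_sigmoid(row) for row in expr]

# Gene normalization: each column (gene) is rescaled across samples.
by_gene_cols = [scaled_robust_sigmoid(col) for col in zip(*expr)]
by_gene = [list(row) for row in zip(*by_gene_cols)]

print(by_sample)
print(by_gene)
```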

--norm_all, --norm-all

Whether to perform gene normalization (gene_norm) across all available samples instead of only across samples that were matched to regions in atlas. If atlas is very small (i.e., only a few regions of interest) using --norm_all is suggested.

--norm_structures, --norm-structures

Whether to perform gene normalization (gene_norm) within structural classes (i.e., “cortex”, “subcortex/brainstem”, “cerebellum”) instead of across all available samples.

--region_agg, --region-agg

Possible choices: donors, samples

When multiple samples are identified as belonging to a region in atlas this determines how they are aggregated. If ‘samples’, expression data from all samples for all donors assigned to a given region are combined. If ‘donors’, expression values for all samples assigned to a given region are combined independently for each donor before being combined across donors. See agg_metric for mechanism by which samples are combined. Default: ‘donors’
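The difference between the two choices shows up whenever donors contribute unequal numbers of samples to a region. The sketch below works through one gene in one region; the function name and the dictionary layout are hypothetical.

```python
from statistics import mean


def aggregate(values_by_donor, region_agg="donors", metric=mean):
    """Combine one gene's expression values within a single region.

    values_by_donor maps a donor ID to that donor's sample values.
    """
    if region_agg == "samples":
        # Pool every sample from every donor, then reduce once.
        pooled = [v for values in values_by_donor.values() for v in values]
        return metric(pooled)
    if region_agg == "donors":
        # Reduce within each donor first, then reduce across donors.
        per_donor = [metric(values) for values in values_by_donor.values()]
        return metric(per_donor)
    raise ValueError(f"unknown region_agg: {region_agg!r}")


# Donor 9861 contributes three samples; donor 10021 contributes one.
toy_region = {"9861": [1.0, 2.0, 3.0], "10021": [6.0]}
pooled_mean = aggregate(toy_region, region_agg="samples")
donor_mean = aggregate(toy_region, region_agg="donors")
print(pooled_mean, donor_mean)
```

With ‘samples’ the donor with more samples dominates the result, whereas ‘donors’ gives every donor equal weight.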

--agg_metric, --agg-metric

Possible choices: mean, median

Mechanism by which to (1) reduce expression data of multiple samples in the same atlas region, and (2) reduce donor-level expression data into a single “group” expression dataframe. Must be one of {“mean”, “median”}. Default: “mean”

Options to modify the AHBA data used

--no-reannotated, --no_reannotated

Whether to use the original probe information from the AHBA dataset instead of the reannotated probe information from Arnatkevic̆iūtė et al., 2019. Using reannotated probe information discards probes that could not be reliably matched to genes. Default: False (i.e., use reannotations)

--no-corrected-mni, --no_corrected_mni

Whether to use the original MNI coordinates provided with the AHBA data instead of the “corrected” MNI coordinates shipped with the alleninf package when matching tissue samples to anatomical regions. Default: False (i.e., use corrected coordinates)

Options to modify how data are output

--stdout

Generated region x gene dataframes will be printed to stdout for piping to other things. You should REALLY consider just using --output-file instead and working with the generated CSV file(s). Incompatible with --save-counts and --save-donors (i.e., this will override those options). Default: False

--output-file, --output_file

Path to desired output file. The generated region x gene dataframe will be saved here. Default: $PWD/abagen_expression.csv

--save-counts, --save_counts

Whether to save dataframe containing number of samples from each donor that were assigned to each region in atlas. If specified, will be saved to the path specified by output-file, appending “counts” to the end of the filename. Default: False

--save-donors, --save_donors

Whether to save donor-level expression dataframes instead of aggregating expression across donors with provided agg_metric. If specified, dataframes will be saved to path specified by output-file, appending donor IDs to the end of the filename. Default: False