abagen: A toolbox for the Allen Brain Atlas genetics data¶
This package provides a Python interface for fetching and working with the Allen Human Brain Atlas (AHBA) microarray expression data.
Overview¶
In 2013, the Allen Institute for Brain Science released the Allen Human Brain Atlas, a dataset containing microarray expression data collected from six human brains (Hawrylycz et al., 2012) . This dataset has offered an unprecedented opportunity to examine the genetic underpinnings of the human brain, and has already yielded novel insight into e.g., adolescent brain development and functional brain organization.
However, in order to be effectively used in most analyses, the AHBA microarray expression data often needs to be (1) collapsed into regions of interest (e.g., parcels or networks), and (2) combined across donors. While this may potentially seem trivial, there are a number of analytic choices in these steps that can dramatically influence the resulting data and any downstream analyses. Arnatkevičiūte et al., 2019 provided a thorough treatment of this in a recent publication, demonstrating how the techniques and code used to prepare the raw AHBA data have varied widely across published reports. We extended this work in a recent preprint (Markello et al., 2021) to quantify how different processing choices can impact statistical analyses of the AHBA.
The current Python package, abagen
, aims to provide reproducible workflows
for processing and preparing the AHBA microarray expression data for analysis.
Installation requirements¶
Currently, abagen
works with Python 3.6+ and requires a few dependencies:
nibabel
numpy (>=1.14.0)
pandas (>=0.25.0), and
scipy
There are some additional (optional) dependencies you can install to speed up some functions:
fastparquet, and
python-snappy
These latter packages are primarily used to facilitate loading the (rather large!) microarray expression dataframes provided by the Allen Institute,
For detailed information on how to install abagen
, including these
dependencies, refer to our installation instructions.
Quickstart¶
At it’s core, using abagen
is as simple as:
>>> import abagen
>>> expression = abagen.get_expression_data('myatlas.nii.gz')
where 'myatlas.nii.gz'
points to a brain parcellation file.
This function can also be called from the command line with:
$ abagen --output-file expression.csv myatlas.nii.gz
For more detailed instructions on how to use abagen
please refer to our
user guide!
Development and getting involved¶
If you’ve found a bug, are experiencing a problem, or have a question about
using the package, please head on over to our GitHub issues and make a new
issue with some information about it! Someone will try and get back to you
as quickly as possible, though please note that the primary developer for
abagen
(@rmarkello) is a graduate student so responses make take some time!
If you’re interested in getting involved in the project: welcome ✨!
We’re thrilled to welcome new contributors. You should start by reading our
code of conduct; all activity on abagen
should adhere to the CoC. After
that, take a look at our contributing guidelines so you’re familiar with the
processes we (generally) try to follow when making changes to the repository!
Once you’re ready to jump in head on over to our issues to see if there’s
anything you might like to work on.
Citing abagen
¶
For up-to-date instructions on how to cite abagen please refer to our documentation.
License Information¶
This codebase is licensed under the 3-clause BSD license. The full license
can be found in the LICENSE file in the abagen
distribution.
Reannotated gene information located at abagen/data/reannotated.csv.gz
and
individualized donor parcellations for the Desikan-Killiany atlas located at
abagen/data/native_dk
are taken from Arnatkevičiūte et al., 2018 and are
separately licensed under the CC BY 4.0; these data can also be found on
figshare.
Corrected MNI coordinates used to match AHBA tissues samples to MNI space
located at abagen/data/corrected_mni_coordinates.csv
are taken from the
alleninf package, provided under the 3-clause BSD license.
All microarray expression data is copyrighted under non-commercial reuse policies by the Allen Institute for Brain Science (© 2010 Allen Institute for Brain Science. Allen Human Brain Atlas. Available from: Allen Human Brain Atlas).
All trademarks referenced herein are property of their respective holders.
Contents¶
Installation and setup¶
Note
Using the instructions for the IO installation is highly recommended! It may take a little more set-up depending on your operating system, but the benefits during processing are noticeable!
Basic installation¶
This package requires Python 3.6+. Assuming you have the correct version of
Python installed, you can install abagen
by opening a terminal and running
the following:
pip install abagen
Alternatively, you can install the most up-to-date version of from GitHub:
git clone https://github.com/rmarkello/abagen.git
cd abagen
pip install .
Important
If you are going to install the version directly from GitHub make sure that you are using the most up-to-date documentation!
IO installation¶
The data supplied by the Allen Human Brain Atlas is quite large—on the order
of ~4GB for all six donors. Because loading these datasets into memory can be
quite time-consuming, abagen
has integrated support for parquet and can do some on-the-fly conversions to speed things
up. However, using parquet is completely optional, and therefore support for it
is not installed when using the “vanilla” installation procedures.
If you would like to enable parquet support, you will need to install some
additional dependencies. This can be done using pip
:
pip install abagen[io]
Note
If you are using a Linux OS, you will need to install the libsnappy-dev
(Debian) or snappy-devel
(Fedora) package before running the above
installation command!
You can also install these extra packages from the GitHub source:
git clone https://github.com/rmarkello/abagen.git
cd abagen
pip install .[io]
Important
If you are going to install the version directly from GitHub make sure that you are using the most up-to-date documentation!
What’s new¶
0.1.3 (June 18, 2021)¶
More bug fixes for (identified thanks to @richardajdear and @Silflame!) as well as some minor QOL updates for getting sample-level information from workflows.
[REF] Handle atlases w/decimal vox sizes (#197), @rmarkello
[REF] Re-orient img in leftify_atlas() (#193), @rmarkello
[FIX] Fixes logger error w/reports (#192), @rmarkello
[FIX] Error with donor-specific atlases and tolerance=0 (#191), @rmarkello
[REF] Fix coordinates of get_samples_in_mask (#189), @rmarkello
0.1.1 (March 29, 2021)¶
Small bug fix when probe_selection=’average’ and minor QOL update to check_atlas().
[ENH] Add geometry/space option to check_atlas (#186), @rmarkello
[FIX] Fixes ‘average’ probe selection bug (#185), @rmarkello
0.1 (March 25, 2021)¶
First minor release! :tada: This update brings some major changes to the codebase, including a massive internal refactoring to allow handling of surface parcellations, new methods for interpolating missing data, and more!
Important: as this is our first official minor release, note that the changes in this release are not backwards compatible with previous versions.
[ENH] Adds interareal similarity thresholding (#184), @rmarkello
[REF] Allow negative tolerance for surfaces (#183), @rmarkello
[REF] Updates reports for missing param (#182), @rmarkello
[DOC] Updates master refs to main (#181), @rmarkello
[ENH] Adds interpolation options for missing data (#180), @rmarkello
[REF] Updates mouse data cache location (#179), @rmarkello
[FIX] Bugfix for lr_mirror parameter (#178), @rmarkello
[ENH] Adds reporting of processing methods (#177), @rmarkello
[FIX] Nan hemisphere designation (#176), @rmarkello
[FIX] Addresses bug when tolerance=0 (#175), @rmarkello
[TEST] Add py3.8 to azure tests (#174), @rmarkello
[REF] Constrains centroid matching by atlas_info (#173), @rmarkello
[REF] Updates CLI to match main workflow (#172), @rmarkello
[REF] Modifies lr_mirror to accept single hemisphere mirroring (#171), @rmarkello
[ENH] Add norm_structures parameter (#170), @rmarkello
[ENH] Adds ability to handle surface parcellations (#169), @rmarkello
[TEST] Parametrize test_resid_dist (#164), @4lovi4
0.0.8 (January 29, 2021)¶
Largely internal codebase changes, but an important bug fix for compatibility
with newer versions of pandas
.
Huge thanks to @4lovi4 for all their contributions!
[FIX] Read_csv error when using pandas >= 1.2.0 (#167), @rmarkello
[TEST] Change Azure VM image to fix SSL (#165), @rmarkello
Adds Python 3.8 testing support to CircleCI (#163), @4lovi4
issue #73 unquote url (#162), @4lovi4
issue #80 doc string fix (#161), @4lovi4
issue #139 readme fix (#160), @4lovi4
0.0.7 (October 15, 2020)¶
Only two changes—but relatively impactful ones—including the addition of a new
workflow-style function (abagen.get_samples_in_mask()
) to get preprocessed
tissue samples without aggregating in specific brain regions and the ability to
provide parcellations in donor-native space (rather than MNI space).
Documentation has been updated to reflect all new features!
[ENH] Adds handling for donor-specific atlases (#156), @rmarkello
[ENH] Allows users to get microarray samples in mask (#155), @rmarkello
0.0.6 (August 17, 2020)¶
Minor bug fixes, including:
[FIX] Allow None input to CLI norm opts (#153), @rmarkello
[MNT] Pin minimum pandas version to 0.25 (#151), @rmarkello
[DOC] Adds Fulcher norm ref (#149), @rmarkello
[FIX] Always remap annotation dataframe cols (#147), @rmarkello
[FIX] Allow ‘common’ for donor_probes method (#146), @rmarkello
0.0.5 (March 24, 2020)¶
This release includes quite a bit of new functionality, including:
Several new gene/sample normalization methods,
A fetcher / new loader functions for RNAseq data (
abagen.fetch_rnaseq()
),The ability to use RNAseq data for selecting probes (
probe_selection='rnaseq
), andA new
donor_probes
parameter forabagen.get_expression_data()
to control how probe selection is performed
Check out the documentation for more details!
[ENH] Adds new mechanisms for probe selection (#145), @rmarkello
[ENH,REF] Adds RNAseq probe selection method (#144), @rmarkello
[TEST] Fix Azure maybe? (#143), @rmarkello
[ENH] Adds FreeSurfer data fetcher (#142), @rmarkello
[ENH] Adds fetchers / loaders for RNAseq data (#140), @rmarkello
[TEST] Adds Windows testing with Azure (#141), @rmarkello
[FIX,ENH] Error in get_expression_data, gene stability calculation (#136), @rmarkello
[ENH,REF] New norms, utils, renamed modules (#135), @rmarkello
0.0.4 (February 26, 2020)¶
A release with a small bugfixes (#134) and a small fix-fix (#133)!
[FIX] Coerce atlas dataobj to arr before indexing (#134), @rmarkello
[REF] Drop probes with invalid/missing Entrez ID (#133), @rmarkello
0.0.3 (November 26, 2019)¶
A new release with some added features but primarily a good bit of re-arranging
in the primary abagen.get_expression_data()
workflow. Notable changes
include:
New parameters
region_agg
,agg_metric
,sample_norm
, andgene_norm
(the latter of which supplantsdonor_norm
), controlling how microarray samples are normalized and aggregated across donors;Large reductions in memory usage (#130), such that the primary workflow should only use ~2GB of RAM at its peak; and,
Migration to CircleCI for all testing purposes!
Special thanks to @rhannema, @Ellen8780, @gifuni, and @VinceBaz for their contributions.
[REF] Adds option to suppress norm warnings (#132), @rmarkello
[ENH,REF] Adds new region_agg parameter (#131), @rmarkello
[REF] Massive reduction in memory usage (#130), @rmarkello
[FIX] Solves pandas bug with max_variance method (#128), @gifuni
[STY] Fixes trailing whitespace (#129), @rmarkello
[FIX] Fixes strange TypeError for pc_loading (#1), @rmarkello
[REF] Ensures integer DataFrame when return_count=True (#127), @rhannema
[ENH] Add fetcher for donor info (#126), @Ellen8780
[REF,ENH] Modifies and adds normalization procedures (#119), @rmarkello
[TEST] Updating CircleCI build (#122), @rmarkello
[TEST] TravisCI –> CircleCI (#121), @rmarkello
[REF] Removes .get_data() nibabel calls (#120), @rmarkello
[FIX] Specify engine=’python’ in pandas queries (#117), @VinceBaz
0.0.2 (September 19, 2019)¶
This release comes with a lot of new changes, including:
Several new arguments for
abagen.get_expression_data()
, including new probe selection methods, donor normalization techniques, and hemispheric mirroring of tissue samples;A command-line version of the primary workflow accessible via the
abagen
command;Improved data handling, using $HOME/abagen-data as the default storage location for data from the AHBA;
New functionality for fetching raw AHBA donor MRI scans;
Zenodo integration to make it easy to cite
abagen
; and,Massive documentation overhauls, with a dramatically updated user guide and API reference!
Special thanks to Golia Shafiei (@gshafiei), Ying-Qiu Zheng (@yingqiuz), James Frierson (@JamesFrierson1), and Arda Kosar (@abkosar) for their contributions.
[MNT] Finishes Zenodo integration (#109), @rmarkello
[MNT] Adds framework for Zenodo integration (#108), @rmarkello
[REF] Fixes one-donor bug in get_expression_data() (#107), @rmarkello
[REF] Identifies hippocampus as subcortex in Allen ontology (#106), @rmarkello
[DOC] Updates documentation + contributing (#105), @rmarkello
[REF] Mirroring before probe filtering (#101), @rmarkello
[FIX] Installation not correctly bundling package data (#102), @rmarkello
[MNT] Docs / package structure updates in prep for 0.2.0 (#95), @rmarkello
[REF] abagen.io functions don’t copy dataframes by default (#94), @rmarkello
[FIX] Fixes broken include directive in API reference (#91), @rmarkello
[ENH] Adds parameter for normalizing donor microarray expression values (#90), @rmarkello
[ENH] Adds option to mirror samples across L/R hemispheres (#87), @rmarkello
[ENH] Adds CLI for abagen.get_expression_data functionality (#82), @rmarkello
[ENH] Adds ability to fetch raw AHBA MRIs (#85), @rmarkello
[ENH] Adds ability to query gene groups (#83), @rmarkello
[MNT,REF] Updates install, versioning, dependencies (#84), @rmarkello
[REF] Adds brainstem to abagen.process ontology (#81), @rmarkello
[DOC] Updates API documentation (#76), @rmarkello
[REF,ENH] Adds new abagen.probes module (#67), @rmarkello
[REF] Changes data directory locator for abagen data (#66), @rmarkello
[FIX] Fixes doctest in abagen.mouse (#65), @rmarkello
[REF] Removes .get_values() references (#64), @rmarkello
[DOC] Adds logging to workflow functions (#61), @rmarkello
Fixed abagen.mouse column ordering (#62), @abkosar
[DOC] Update refs and http (#60), @rmarkello
[REF] Use cached alleninf coordinates only (#59), @rmarkello
[FIX] Removes RuntimeWarning in example code (#58), @rmarkello
Updated README to include Allen Institute citations and disclaimers (#57), @JamesFrierson1
[FIX] Catches AttributeError w/pandas fastparquet (#41), @rmarkello
[REF] Updates get_expression_data() (#38), @rmarkello
[TEST] Fixes tests (#34), @rmarkello
Add mouse features (#32), @yingqiuz
[TEST] Fix pytest version and update travis (#33), @rmarkello
[TEST] Update travis testing (#31), @rmarkello
[FIX] More fixes for atlas numbering (#30), @rmarkello
[FIX] Allow non-sequential atlas numbering (#29), @rmarkello
[ENH] Adds input check for remove_distance() (#28), @rmarkello
[ENH] Allow label exclusion in remove_distance() (#27), @rmarkello
[REF] Changes remove_distance() inputs (#26), @rmarkello
[ENH] Add function for aggregating donors (#25), @rmarkello
[ENH] Adds reannotated probe information (#24), @rmarkello
[ENH] Adds abagen.correct for postprocessing (#20), @rmarkello
[TEST] Removes pytest capturing (#23), @rmarkello
[TEST] Calculates coverage only for extras (#22), @rmarkello
[DOC] Updates doc-strings for primary functions (#19), @rmarkello
[TEST] Add early test to reduce timeouts (#21), @rmarkello
[FIX] Adds updated MNI coordinates file as backup (#17), @rmarkello
[DOC] Updates default tolerance (#16), @gshafiei
0.0.1 (September 7, 2018)¶
Initial release of abagen
, a toolbox for working with the Allen Brain Atlas human genetics data.
[DOC] Updates various documentation (#15), @rmarkello
[DOC] Adds LICENSE reference for alleninf (#14), @rmarkello
[DOC] Updates README links and example usage (#13), @rmarkello
[TEST] Updates tests of abagen.get_expression_data() (#12), @rmarkello
[DOC] Adds Sphinx documentation (#11), @rmarkello
[FIX] Resolves dataframe formatting issue (#10), @rmarkello
[ENH] Adds DK atlas fetcher and updates README.md (#8), @rmarkello
[REF] Cleaning up unused code (#7), @rmarkello
[ENH] MAJOR refactoring of IO and processing (#4), @rmarkello
[TEST] Adds .travis.yml and initial tests (#3), @rmarkello
[STY] Stylistic updates to abagen.datasets (#2), @rmarkello
Command-line usage¶
You can use many of the primary workflows in abagen
from the command line.
The abagen
command¶
Assigns microarray expression data to ROIs defined in the specified atlas
This command aims to provide a workflow for generating pre-processed microarray expression data from the Allen Human Brain Atlas for arbitrary atlas designations. First, some basic filtering of genetic probes is performed, including:
Intensity-based filtering of microarray probes to remove probes that do not exceed a certain level of background noise (specified via the –ibf_threshold parameter),
Selection of a single, representative probe (or collapsing across probes) for each gene, specified via the –probe_selection parameter (and influenced by the –donor_probes parameter), and
Optional mirroring of the tissue samples across the left/right hemisphere boundary, as specified via the –lr_mirror parameter (turned off by default).
Tissue samples are then matched to parcels in the defined atlas for each donor. If –atlas_info is provided then this matching is constrained by both hemisphere and tissue class designation (e.g., cortical samples from the left hemisphere are only matched to ROIs in the left cortex, subcortical samples from the right hemisphere are only matched to ROIs in the left subcortex); see the atlas_info parameter description for more information.
Matching of microarray samples to parcels in atlas is done via a multi-step process:
Determine if the sample falls directly within a parcel,
Check to see if there are nearby parcels by slowly expanding the search space to include nearby voxels, up to a specified distance (specified via the –tolerance parameter),
If there are multiple nearby parcels, the sample is assigned to the closest parcel, as determined by the parcel centroid.
If at any step a sample can be assigned to a parcel the matching process is terminated. When the provided atlas is not volumetric (i.e., is surface-based) the samples are simply matched to the nearest vertex, and –tolerance is used as a standard deviation threshold. More control over the sample matching can be obtained by setting the –missing parameter.
Once all samples have been matched to parcels for all supplied donors, the microarray expression data are optionally normalized via the provided –sample_norm and –gene_norm functions (which are influenced by the –norm_matched and –norm_structures parameters) before being aggregated across donors via the supplied –region_agg and –agg_metric parameters.
usage: abagen [-h] [--version] [-v] [--atlas_info PATH]
[--donors DONOR_ID [DONOR_ID ...]] [--data_dir PATH]
[--n_proc N_PROC] [--ibf_threshold THRESHOLD]
[--probe_selection METHOD] [--lr_mirror METHOD]
[--sim_threshold THRESHOLD] [--missing METHOD] [--tol TOLERANCE]
[--sample_norm METHOD] [--gene_norm METHOD] [--norm_all]
[--norm_structures] [--region_agg METHOD] [--agg_metric METHOD]
[--no-reannotated] [--no-corrected-mni] [--stdout]
[--output-file PATH] [--save-counts] [--save-donors]
atlas [atlas ...]
Positional Arguments¶
- atlas
A NIFTI image in MNI152 space or two GIFTI images in fsaverage5 space, where each parcel is identified by a unique integer ID.
Named Arguments¶
- --version
Show program version and exit.
- -v, --verbose
Increase verbosity of status messages to display during workflow.
Options to specify information about the atlas used¶
- --atlas_info, --atlas-info
Filepath to CSV file containing information about atlas. The CSV file must have at least columns [“id”, “hemisphere”, “structure”] which contain information mapping the atlas IDs to hemispheres (i.e, “L”, “R”, or “B”) and broad structural groups (i.e., “cortex”, “subcortex/brainstem”, “cerebellum”). If provided, this will constrain matching of tissue samples to regions in atlas. If the supplied atlas is a pair of GIFTI files with valid label tables this information will be intuited.
Options to specify which AHBA data to use during processing¶
- --donors
List of donors to use as sources of expression data. Specified IDs can be either donor numbers (i.e., 9861, 10021) or UIDs (i.e., H0351.2001). Can specify “all” to use all available donors. Default: “all”
- --data_dir, --data-dir
Directory where expression data should be downloaded to (if it does not already exist) / loaded from. If not specified this will check the environmental variable $ABAGEN_DATA, the $HOME/abagen-data directory, and the current working directory. If data does not already exist at one of those locations then it will be downloaded to the first of these location that exists and for which write access is enabled.
- --n_proc, --n-proc
Number of processors to use to download AHBA data. Can paralellize up to six times if all donors are requested. Default: 1
Options to specify processing options¶
- --ibf_threshold, --ibf-threshold
Threshold for intensity-based filtering of probes. This number should specify the ratio of samples, across all supplied donors, for which a probe must have signal above background noise in order to be retained. Default: 0.5
- --probe_selection, --probe-selection
Possible choices: average, corr_intensity, corr_variance, diff_stability, max_intensity, max_variance, mean, pc_loading, rnaseq
Selection method for subsetting (or collapsing across) probes that index the same gene. Must be one of {“average”, “mean”, “max_intensity”, “max_variance”, “pc_loading”, “corr_variance”, “corr_intensity”, “diff_stability”, “rnaseq”}. Default: “diff_stability”
- --lr_mirror, --lr-mirror
Possible choices: None, bidirectional, leftright, rightleft
Whether to mirror microarray expression samples across hemispheres to increase spatial coverage. Using “bidirectional” will mirror samples across both hemispheres, “leftright” will mirror samples in the left hemisphere to the right, and “rightleft” will mirror the right to the left. Default: None
- --sim_threshold, --sim-threshold
Threshold for inter-areal similarity filtering. Samples are correlated across probes and those samples with a total correlation less than the the provided threshold s.d. below the mean across samples are excluded from futheranalysis. If not specified no filtering is performed. Default: None
- --missing
Possible choices: None, centroids, interpolate
How to handle regions in atlas that are not assigned any tissue samples. If “centroids”, any empty regions will be assigned the expression value of the nearest tissue sample (defined as the sample with the closest Euclidean distance to the parcel centroid). If “interpolate”, expression values will be interpolated in the empty regions by assigning every node in the region the expression of the nearest sample and taking a weighted (inverse distance) average. If not specified empty regions will be returned with expression values of NaN. Default: None
- --tol, --tolerance
Distance (in mm) that a sample can be from a parcel for it to be matched to that parcel. If atlas is GIFTI files then this measure is a standard deviation threshold (i.e., samples greater than tolerance SDs away from the mean matched distance are ignored). Default: 2
- --sample_norm, --sample-norm
Possible choices: center, demean, minmax, mixed_sig, mixed_sigmoid, robust_sigmoid, rs, rsig, scaled_robust_sigmoid, scaled_rsig, scaled_sig, scaled_sig_qnt, scaled_sigmoid, scaled_sigmoid_quantiles, sig, sigmoid, srs, zscore, None, None
Method by which to normalize microarray expression values for each sample prior to collapsing into regions in atlas. Expression values are normalized separately for each sample and donor across genes. If None is specified then no normalization is performed. Default: “srs”
- --gene_norm, --gene-norm
Possible choices: center, demean, minmax, mixed_sig, mixed_sigmoid, robust_sigmoid, rs, rsig, scaled_robust_sigmoid, scaled_rsig, scaled_sig, scaled_sig_qnt, scaled_sigmoid, scaled_sigmoid_quantiles, sig, sigmoid, srs, zscore, None, None
Method by which to normalize microarray expression values for each donor prior to collapsing across donors. Expression values are normalized separately for each gene for each donor across all expression samples. If None is specified then no normalization is performed. Default: “srs”
- --norm_all, --norm-all
Whether to perform gene normalization (gene_norm) across all available samples instead of only across samples that were matched to regions in atlas. If atlas is very small (i.e., only a few regions of interest) using –norm_all is suggested.
- --norm_structures, --norm-structures
Whether to perform gene normalization (gene_norm) within structural classes (i.e., “cortex”, “subcortex/brainstem”, “cerebellum”) instead of across all available samples.
- --region_agg, --region-agg
Possible choices: donors, samples
When multiple samples are identified as belonging to a region in atlas this determines how they are aggegated. If ‘samples’, expression data from all samples for all donors assigned to a given region are combined. If ‘donors’, expression values for all samples assigned to a given region are combined independently for each donor before being combined across donors. See agg_metric for mechanism by which samples are combined. Default: ‘donors’
- --agg_metric, --agg-metric
Possible choices: mean, median
Mechanism by which to (1) reduce expression data of multiple samples in the same atlas region, and (2) reduce donor-level expression data into a single “group” expression dataframe. Must be one of {“mean”, “median”}. Default: “mean”
Options to modify the AHBA data used¶
- --no-reannotated, --no_reannotated
Whether to use the original probe information from the AHBA dataset instead of the reannotated probe information from Arnatkevic̆iūtė et al., 2019. Using reannotated probe information discards probes that could not be reliably matched to genes. Default: False (i.e., use reannotations)
- --no-corrected-mni, --no_corrected_mni
Whether to use the original MNI coordinates provided with the AHBA data instead of the “corrected” MNI coordinates shipped with the alleninf package when matching tissue samples to anatomical regions. Default: False (i.e., use corrected coordinates)
Options to modify how data are output¶
- --stdout
Generated region x gene dataframes will be printed to stdout for piping to other things. You should REALLY consider just using –output-file instead and working with the generated CSV file(s). Incompatible with –save-counts and –save-donors (i.e., this will override those options). Default: False
- --output-file, --output_file
Path to desired output file. The generated region x gene dataframe will be saved here. Default: $PWD/abagen_expression.csv
- --save-counts, --save_counts
Whether to save dataframe containing number of samples from each donor that were assigned to each region in atlas. If specified, will be saved to the path specified by output-file, appending “counts” to the end of the filename. Default: False
- --save-donors, --save_donors
Whether to save donor-level expression dataframes instead of aggregating expression across donors with provided agg_metric. If specified, dataframes will be saved to path specified by output-file, appending donor IDs to the end of the filename. Default: False
User guide¶
abagen
aims to provide a reproducible pipeline for processing and preparing
microarray expression data provided by the Allen Human Brain Atlas (AHBA) for
research analyses.
This user guide steps through the basics of fetching microarray expression data from the AHBA, defining an atlas or parcellation for wrangling that data, and actually parcellating the data into a more usable, analyis-ready format. If you still have questions after going through this guide you can refer to the Reference API or ask a question on GitHub.
The Allen Human Brain Atlas dataset¶
Fetching the AHBA data¶
In order to use abagen
, you’ll need to download the AHBA microarray data. You can download it with the
following command:
>>> import abagen
>>> files = abagen.fetch_microarray(donors='all', verbose=0)
Note
Downloading the entire dataset (about 4GB) can take a long time depending
on your internet connection speed! If you don’t want to download all the
donors you can provide the subject IDs of the donors you want as a list
(e.g., ['9861', '10021']
) instead of passing 'all'
.
This command will download data from the specified donors into a folder called
microarray
in the $HOME/abagen-data
directory. If you have already
downloaded the data you can provide the data_dir
argument to specify where
the files have been stored:
>>> files = abagen.fetch_microarray(donors=['12876', '15496'], data_dir='/path/to/my/data/')
Alternatively, abagen
will check the directory specified by the
environmental variable $ABAGEN_DATA
and use that as the download location
if the dataset does not already exist there.
If you provide a path to data_dir
(or specify a path with $ABAGEN_DATA
)
the directory specified should have the following structure:
/path/to/my/data/
├── normalized_microarray_donor10021/
│ ├── MicroarrayExpression.csv
│ ├── Ontology.csv
│ ├── PACall.csv
│ ├── Probes.csv
│ └── SampleAnnot.csv
├── normalized_microarray_donor12876/
├── normalized_microarray_donor14380/
├── normalized_microarray_donor15496/
├── normalized_microarray_donor15697/
└── normalized_microarray_donor9861/
(Note the directory does not have to be named microarray
for this to work.)
Loading the AHBA data¶
The files
object returned by abagen.fetch_microarray()
is a nested
dictionary with filepaths to the five different file types in the AHBA
microarray dataset. The keys are the donor IDs:
>>> print(files.keys())
dict_keys(['9861', '10021', '12876', '14380', '15496', '15697'])
And the values for each entry are a sub-dictionary of the downloaded files:
>>> print(sorted(files['9861']))
['annotation', 'microarray', 'ontology', 'pacall', 'probes']
You can load the data in these files using the abagen.io
functions.
There are IO functions for each of the five file types; you can get more
information on the functions and the data contained in each file type by
looking at the Reference API. Notably, all IO functions return
pandas.DataFrame
objects for ease-of-use.
For example, you can load the annotation file for the first donor with:
>>> data = files['9861']
>>> annotation = abagen.io.read_annotation(data['annotation'])
>>> print(annotation)
structure_id slab_num well_id ... mni_x mni_y mni_z
sample_id ...
1 4077 22 594 ... 5.9 -27.7 49.7
2 4323 11 2985 ... 29.2 17.0 -2.9
3 4323 18 2801 ... 28.2 -22.8 16.8
... ... ... ... ... ... ... ...
944 4758 67 1074 ... 7.9 -72.3 -40.6
945 4760 67 1058 ... 8.3 -57.4 -59.0
946 4761 67 1145 ... 9.6 -46.7 -47.6
[946 rows x 13 columns]
And you can do the same for, e.g., the probe file with:
>>> probes = abagen.io.read_probes(data['probes'])
>>> print(probes)
probe_name gene_id gene_symbol gene_name entrez_id chromosome
probe_id
1058685 A_23_P20713 729 C8G complement component 8, gamma polypeptide 733 9
1058684 CUST_15185_PI416261804 731 C9 complement component 9 735 5
1058683 A_32_P203917 731 C9 complement component 9 735 5
... ... ... ... ... ... ...
1071209 A_32_P885445 1012197 A_32_P885445 AGILENT probe A_32_P885445 (non-RefSeq) <NA> NaN
1071210 A_32_P9207 1012198 A_32_P9207 AGILENT probe A_32_P9207 (non-RefSeq) <NA> NaN
1071211 A_32_P94122 1012199 A_32_P94122 AGILENT probe A_32_P94122 (non-RefSeq) <NA> NaN
[58692 rows x 6 columns]
The other IO functions work similarly for the remaining filetypes.
Defining a parcellation¶
Acceptable parcellations¶
In order to process the microarray expression data from AHBA you’ll need a
parcellation (or “atlas”). Here, we define a parcellation (atlas) as either (1)
a NIFTI image in MNI space, or (2) a tuple of GIFTI images in fsaverage
space (and with fsaverage5 resolution!). In both cases, parcels in the atlas
should be denoted by unique integer IDs (distinct across hemispheres). The
primary workflows in abagen
are designed to readily accept any
parcellations / atlases in this format; however, if you want to use a different
format please refer to Non-standard parcellations.
For demonstration purposes, abagen
has a copy of the Desikan-Killiany
atlas that you can use. Here, we load the volumetric atlas by default:
>>> import abagen
>>> atlas = abagen.fetch_desikan_killiany()
The returned object atlas
is a dictionary with two keys: image
, which
is filepath to a NIFTI image containing the atlas data, and info
, which is
a filepath to a CSV file containing extra information about the parcellation:
>>> print(atlas['image'])
/.../data/atlas-desikankilliany.nii.gz
>>> print(atlas['info'])
/.../data/atlas-desikankilliany.csv
You can load the surface version of the atlas by providing the surface
parameter:
>>> atlas = abagen.fetch_desikan_killiany(surface=True)
>>> print(atlas['image'])
('/.../data/atlas-desikankilliany-lh.label.gii.gz', '/.../data/atlas-desikankilliany-rh.label.gii.gz')
Providing additional parcellation info¶
While only the image (i.e., NIFTI or GIFTIs) is required for processing the
microarray data, the CSV file with information on the parcellation scheme can
also be very useful. In particular, abagen
can use the CSV file to
constrain the matching of tissue samples to anatomical regions in the atlas
image.
Note
If you are using a surface atlas and your GIFTI files have valid label
tables then abagen
will
automatically create a pandas.DataFrame with all the relevant information
described below. However, you can always provide a separate CSV file if
you are unsure and this will override any label tables present in the
GIFTI files.
If you want to supply your own CSV file with information about an atlas you must ensure it has (at least) the following columns:
id
: an integer ID corresponding to the labels in theatlas
image
hemisphere
: a left/right/bilateral hemispheric designation (i.e., ‘L’, ‘R’, or ‘B’)
structure
: a broad structural class designation (i.e., one of ‘cortex’, ‘subcortex/brainstem’, ‘cerebellum’, ‘white matter’, or ‘other’)
For example, a valid CSV might look like this:
>>> import pandas as pd
>>> atlas_info = pd.read_csv(atlas['info'])
>>> print(atlas_info)
id label hemisphere structure
0 1 bankssts L cortex
1 2 caudalanteriorcingulate L cortex
2 3 caudalmiddlefrontal L cortex
.. .. ... ... ...
80 81 hippocampus R subcortex/brainstem
81 82 amygdala R subcortex/brainstem
82 83 brainstem B subcortex/brainstem
[83 rows x 4 columns]
Notice that extra columns (i.e., label
) are okay as long as the three
required columns are present! If you want to confirm your file is formatted
correctly you can use abagen.images.check_atlas()
:
>>> from abagen import images
>>> atlas = abagen.fetch_desikan_killiany()
>>> atlas = images.check_atlas(atlas['image'], atlas['info']);
If something is amiss with the file this function will raise an error and try to give some information about what you should check for.
Important
You might be asking: “why should I provide this extra information for
my parcellation?” Providing this CSV file will ensure that microarray
samples designated as belonging to a given hemisphere/structure by the AHBA
ontology are not matched to regions in the atlas
image with different
hemispheric/structural designations. That is, if the AHBA ontology
specifies that a tissue sample comes from the left hemisphere subcortex, it
will only ever be matched to regions in atlas
belonging to the left
hemisphere subcortex.
While this seems trivial, it is very important because there are
numerous tissue samples which occur on the boundaries of hemispheres and
structural classes (i.e., cortex/subcortex). In many instances, these
samples won’t fall directly within a region of the atlas
, at which
point abagen
will attempt to match them to nearby regions. Without the
hemisphere/structure information provided by this CSV file there is a high
likelihood of misassigning samples, leading to biased or skewed expression
data.
Individualized parcellations¶
Instead of providing a single parcellation image that will be used for all
donors, you can instead provide a parcellation image for each donor in the
space of their “raw” (or native) T1w image. abagen
ships with versions
of the Desikan-Killiany parcellation defined in donor-native space:
>>> atlas = abagen.fetch_desikan_killiany(native=True)
>>> print(atlas['image'].keys())
dict_keys(['9861', '10021', '12876', '14380', '15496', '15697'])
>>> print(atlas['image']['9861'])
/.../data/native_dk/9861/atlas-desikankilliany.nii.gz
Note here that atlas['image']
is a dictionary, where the keys are donor IDs
and the corresponding values are paths to the parcellation for each donor. The
primary workflows in abagen
that accept a single atlas (i.e.,
abagen.get_expression_data()
and abagen.get_samples_in_mask()
) will
also accept a dictionary of this format.
We also provide donor-specific surface atlases (derived from the FreeSurfer
outputs that can be fetched with abagen.datasets.fetch_freesurfer()
).
These atlases are also shipped with abagen
and can be loaded with:
>>> atlas = abagen.fetch_desikan_killiany(native=True, surface=True)
>>> print(atlas['image'].keys())
dict_keys(['9861', '10021', '12876', '14380', '15496', '15697'])
>>> print(atlas['image']['9861'])
('/.../9861/atlas-desikankilliany-lh.label.gii.gz', '/.../9861/atlas-desikankilliany-rh.label.gii.gz')
Note that if you are using your own donor-specific surface atlases they must,
by default, be based on the geometry of the FreeSurfer surfaces provided with
abagen.datasets.fetch_freesurfer()
. If you wish to use surface atlases
based on different geometry please refer to Non-standard parcellations,
below.
Finally, when in doubt we recommend simply using a standard-space, group-level
atlas; however, we are actively investigating whether native-space atlases
provide any measurable benefits to the abagen
workflows.
Note
The donor-native volumetric versions of the DK parcellation shipped with
abagen
were generated by Arnatkevičiūte et al., 2018, NeuroImage, and
are provided under the CC BY 4.0 license. The donor-native surface versions
of the DK parcellation were generated by Romero-Garcia et al., 2017,
NeuroImage, and are also provided under the CC BY 4.0 license.
Non-standard parcellations¶
If you’d like to use a non-standard atlas in the primary abagen
workflows
that may be possible—with some caveats. That is, the constraining factor here
is the coordinates of the tissue samples from the AHBA: they are available in
(1) the native space of each donor’s MRI, or (2) MNI152 space, and we strongly
encourage you to use one of these options (rather than e.g., attempting to
register the coordinates to a new space). If you provide a group-level atlas
the toolbox will default to using the MNI152 coordinates; if you provide
donor-specific atlases then the tooblox will use the native coordinates. Thus,
by default, abagen
prefers you use one of the atlas conformations described
above.
However, if you have an atlas in a different space or resolution you can
(potentially) use it in the primary abagen
workflows. To do this you will
need to create a abagen.AtlasTree
object. All atlases provided are
internally coerced to AtlasTree instances, which is then used to assign
microarray tissue samples to parcels in the atlas.
Take, for example, a surface atlas in fsaverage6 resolution (by default, surface atlases are assumed to be fsaverage5 resolution). In this case, you simply need to supply the relevant geometry files for the atlas and specify the space of the atlas:
>>> from abagen import images
>>> atlas = ('/.../fsaverage6-lh.label.gii', '/.../fsaverage6-rh.label.gii')
>>> surf = ('/.../fsaverage6-lh.surf.gii', '/.../fsaverage6-lh.surf.gii')
>>> atlas = images.check_atlas(atlas, geometry=surf, space='fsaverage6')
The same procedure can be used for an atlas using fsLR geometry:
>>> from abagen import images
>>> atlas = ('/.../fslr32k-lh.label.gii', '/.../fslr32k-rh.label.gii')
>>> surf = ('/.../fslr32k-lh.surf.gii', '/.../fslr32k-lh.surf.gii')
>>> atlas = images.check_atlas(atlas, geometry=surf, space='fslr')
Parcellating expression data¶
Basic usage¶
Once you’ve downladed the microarray data and selected your parcellation you can process the data. This should be as simple as:
>>> expression = abagen.get_expression_data(atlas['image'])
If you are using a volumetric image it is highly recommended you provide the additional information on your parcellation, which can be done with:
>>> expression = abagen.get_expression_data(atlas['image'], atlas['info'])
Note
By default this function will use data from all the donors! Wrangling all
this raw microarray data can be quite time-consuming, so if you’d like to
speed up this step make sure you’ve performed the IO installation.
Alternatively, if you don’t want to use all the donors when testing or
debugging the code, you can provide the subject IDs of the donors you want
(e.g., donors=['9861', '10021']
).
The abagen.get_expression_data()
function will print out some information
about what’s happening as it goes. However, briefly the function:
Fetches the microarray data from AHBA (if this has not already been done). Refer to parameters
data_dir
anddonors
for more info.Updates the MNI coordinates of all the tissue samples from AHBA using the coordinates from the
alleninf
package. This occurs by default; refer to parametercorrected_mni
for more info.Mirrors samples across hemispheres to increase spatial coverage. This does not occur by default; refer to parameter
lr_mirror
for more info (or see Duplicating samples with the lr_mirror parameter).Reannotates microarray probe-to-gene mappings with information from Arnatkevic̆iūtė et al., 2019, NeuroImage. This occurs by default; refer to parameter
reannotated
for more info.Performs a similarity-based filtering of tissue samples, removing those samples whose expression across probes is poorly correlated with other samples. This does not occur by default; refer to parameter
sim_threshold
for more info.Performs intensity-based filtering of probes to remove those that do not exceed background noise. This occurs by default with a threshold of 0.5 (i.e., probes must exceed background noise in 50% of all tissue samples); refer to parameter
ibf_threshold
for more info.Selects a representative probe amongst those probes indexing the same gene. This occurs by default by selecting the probe with the highest differential stability amongst donors; refer to parameter
probe_selection
for more info (or see Probe selection options).Matches tissue samples to regions in the user-specified
atlas
. Refer to parametersatlas
,atlas_info
,missing
, andtolerance
for more info (or see Filling in data with the missing parameter).Normalizes expression values for each sample across genes for each donor. This occurs by default using a scaled robust sigmoid normalization function; refere to parameter
sample_norm
for more info.Normalizes expression values for each gene across samples for each donor. This occurs by default using a scaled robust sigmoid normalization function; refer to parameter
gene_norm
for more info.Aggregates samples within regions in the user-specified
atlas
based on matches made in Step 7. By default, samples are averaged separately for each donor and then averaged across donors. Refer to parametersregion_agg
,agg_metric
, andreturn_donors
for more info.
You can investigate all these parameters and options for modifying how the
expression
array is generated by looking at the Reference API.
The parcellated expression DataFrame¶
The expression
object returned by abagen.get_expression_data()
is a
pandas.DataFrame
, where rows correspond to region labels as defined in the
atlas image, columns correspond to genes, and entry values are microarray
expression data normalized and aggregated across donors:
>>> print(expression)
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
1 0.498266 0.664570 0.395276 ... 0.675843 0.555539 0.487572
2 0.649068 0.578997 0.496142 ... 0.483165 0.382653 0.504041
3 0.530613 0.623289 0.516300 ... 0.732930 0.359707 0.450664
... ... ... ... ... ... ... ...
81 0.388748 0.277961 0.474202 ... 0.279683 0.480953 0.405504
82 0.825836 0.602271 0.334143 ... 0.195722 0.447894 0.746475
83 0.384593 0.203654 0.746060 ... 0.379274 0.706803 0.509437
[83 rows x 15633 columns]
By default the data are normalized using a scaled robust sigmoid function such that expression values for a given gene will range from 0-1, where 0 indicates the region with the lowest expression of that gene and 1 indicates the region with highest.
Since the generated DataFrame is an aggregate (default: average) of multiple donors it is possible (likely) that a given region may not have any expression values exactly equal to 0 or 1.
Getting dense expression data¶
Unfortunately, due to how tissue samples were collected from the donor brains it is possible that some regions in an atlas may not be represented by any expression data. In the above example, two of the rows are missing data:
>>> print(expression.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 NaN NaN NaN ... NaN NaN NaN
73 NaN NaN NaN ... NaN NaN NaN
[2 rows x 15633 columns]
These regions, corresponding to the right frontal pole (label 72) and right temporal pole (label 73) in the Desikan-Killiany atlas, were not matched to any tissue samples; this is likely due to the fact that only two of the six donors have tissue samples taken from the right hemisphere.
If you require a dense matrix—that is, you need expression values for
every region in your atlas
—there are a few parameters that you can
consider tuning to try and achieve this.
Filling in data with the missing
parameter¶
By default, the abagen.get_expression_data()
function will attempt to be
as precise as possible in matching microarray samples with brain regions. It
takes the following steps to do this for each tissue sample:
Determine if the sample falls directly within a region of
atlas
.Check to see if the sample is close to any regions by slowly expanding the search space (in 1mm increments) to include nearby voxels up to a specified distance threshold (specified via the
tolerance
parameter).If there are multiple nearby regions, determine which region is closer by calculating the center-of-mass of the abutting regions.
If at any step a sample can be assigned to a region in atlas
the sample is
assigned to that region and the matching procedure is terminated. However, as
we saw, regions with no assigned samples from any donor are simply left as NaN.
If you would like to force all regions to be assigned at least one sample you
can set the missing
parameter. This parameter accepts three options:
None
(default), "centroids"
, and "interpolate"
. By setting this
parameter the workflow will go through the normal procedure as documented above
and then, once all samples are matched, check for any empty regions and assign
them expression values based on the specified method.
When using the ‘centroid’ method the empty regions in the atlas will be assigned the expression values of the tissue sample falling closest to the centroid of that region. Note that this procedure is only performed when _all_ donors are missing data in a given region. In this case, a weighted average of the matched samples are taken across donors, where weights are calculated as the inverse distance between the tissue sample matched to the parcel centroid for each donor.
When using the ‘interpolate’ method, expression values will be interpolated in the empty regions by assigning every node in the region the expression of the nearest tissue sample. The weighted (inverse distance) average of the densely-interpolated map will be taken and used to represent parcellated expression values for the region. Note that, unlike in the centroid matching procedure described above, this interpolation is done independently for every donor, irrespective of whether other donors have tissue samples that fall within a given region.
Thus, setting the missing
parameter when calling
abagen.get_expression_data()
will always return a dense expression
matrix (at the expense of some anatomical precision):
# first, check with ``missing='centroids'``
>>> exp_centroids = abagen.get_expression_data(atlas['image'], atlas['info'],
... missing='centroids')
>>> print(exp_centroids.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 0.574699 0.750184 0.246746 ... 0.656938 0.193677 0.647785
73 0.725151 0.652906 0.528831 ... 0.478334 0.501293 0.483642
[2 rows x 15633 columns]
# then, check with ``missing='interpolate'``
>>> exp_interpolate = abagen.get_expression_data(atlas['image'], atlas['info'],
... missing='interpolate')
>>> print(exp_interpolate.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 0.532308 0.710846 0.299322 ... 0.675837 0.301105 0.586290
73 0.736345 0.663072 0.497092 ... 0.507378 0.467046 0.531494
[2 rows x 15633 columns]
Warning
Refer to the documentation for normalization
for additional information on how other settings interact with the
missing
parameter.
Duplicating samples with the lr_mirror
parameter¶
If your parcellation is sufficiently low-resolution it is likely that most regions in the left hemisphere (for which all six donors have tissue samples) will be matched to at least one sample, whereas regions in the right hemisphere may come up short.
To remedy this you can try modifying the lr_mirror
parameter when calling
abagen.get_expression_data()
. This parameter accepts four options:
None
(default), "bidirectional"
, "leftright"
, and "rightleft"
.
As the name suggests, the lr_mirror
options control whether tissue samples
are mirrored across the left/right hemisphere axis. By supplying the
‘bidirectional’ options, all samples in the left hemisphere are duplicated and
mirrored onto the right hemisphre, and vice-versa for right to left. The other
options (‘leftright’ and ‘rightleft) will mirror only one hemisphere (i.e.,
‘leftright’ will mirror samples in the left onto the right hemisphere).
Unlike the missing
parameter this will not guarantee that all regions are
matched to a sample, but it will increase the likelihood that this happens:
>>> exp_mirror = abagen.get_expression_data(atlas['image'], atlas['info'],
... lr_mirror='bidirectional')
>>> print(exp_mirror.loc[[72, 73]])
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
label ...
72 0.832617 0.648154 0.425707 ... 0.580406 0.439378 0.799856
73 0.682180 0.569551 0.627497 ... 0.430146 0.302926 0.425995
[2 rows x 15633 columns]
Note that since this effectively duplicates the number of tissue samples the
function runtime will increase somewhat. Also, importantly, setting the
lr_mirror
parameter will change the expression values of all of the
regions in the generated matrix–not just the regions that are missing data. It
is worth considering which (if either!) of these options best suits your
intended analysis.
Probe selection options¶
The probes used to measure microarray expression levels in the AHBA data are
often redundant; that is, there are frequently several probes indexing the same
gene. Since the output of the abagen.get_expression_data()
workflow is a
region by gene dataframe, at some point we need to transition from indexing
probe expression levels to indexing gene expression levels. Effectively, this
means we need to select from or condense the redundant probes for each gene;
however, there are a number of ways to do that.
Currently, abagen
supports eight options for this probe to gene conversion.
All the options have been used at various points throughout the published
record, so while there is no “right” choice we do encourage using the default
option (differential stability) due to
recent work by Arnatkevičiūte et al., 2019 showing that it provides the highest
fidelity to RNA sequencing data.
Available methods for probe_selection
fall into two primary families:
We describe all the methods within these families here. Methods can be
implemented by passing the probe_selection
argument to the
abagen.get_expression_data()
function. For a selection of references to
published works that have used these different methods please see the
documentation of abagen.probes_.collapse_probes()
.
Selecting a representative probe¶
The first group of methods aim to select a single probe from each redundant group. This involves generating some sort of selection criteria and masking the original probe by sample expression matrix to extract only the chosen probe, which will be used to represent the associated gene’s microarray expression values:

The only difference between methods in this group is the criteria used to select which probe to retain. The descriptions below explain how the mask in the above diagram is generated for each available option; in each diagram the red outline on the generated vector indicates which entry will be used to mask the original matrix. The extraction procedure (i.e., applying the mask to the original probe by sample expression matrix) is identical for all these methods.
Max intensity¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='max_intensity')
Selects the probe with the highest average expression across all samples (where samples are concatenated across donors).

Max variance¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='max_variance')
Selects the probe with the highest variance in expression across all samples (where samples are concatenated across donors).

Principal component loading¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='pc_loading')
Selects the probe with the highest loading on the first principal component derived from the probe microarray expression across all samples (where samples are concatenated across donors).

Correlation¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='corr_intensity')
>>> abagen.get_expression_data(atlas['image'], probe_selection='corr_variance')
When there are more than two probes indexing the same gene, selects the probe with the highest average correlation to other probes across all samples (where samples are concatenated across donors).

When there are exactly two probes the correlation procedure cannot be used, and
so you can fall back to either the Max intensity
(corr_intensity
) or the Max variance
(corr_variance
) criteria.
Differential stability¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='diff_stability')
Computes the Spearman correlation of microarray expression values for each probe across brain regions for every pair of donors. Correlations are averaged and the probe with the highest correlation is retained.

RNAseq¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='rnaseq')
Computes the Spearman correlation of microarray expression values for each probe across brain regions with RNAseq data for the corresponding gene. As only two donors have RNAseq data (donors #9861 and 10021), this method only computes the correlations for these two donors. Correlations are averaged across the two donors and the probe with the highest correlation for each gene is retained.

Collapsing across probes¶
In contrast to selecting a single representative probe for each gene and discarding the others, we can instead opt to use all available probes and collapse them into a unified representation of the associated gene:

Currently only one method supports this operation.
Average¶
>>> abagen.get_expression_data(atlas['image'], probe_selection='average')
Takes the average expression values for all probes indexing the same gene.

Providing 'mean'
instead of 'average'
will return identical results.
Donor aggregation in probe selection¶
Unless otherwise specified in the description of that method, probe selection is performed using data aggregated across samples from all donors. However, this may not be desired: the probe that most reliably indexes a gene in one donor may differ from the probe that does so in another donor.
To allow for this possibility, we describe three options for modifying how
probe selection is performed across donors in detail below. These methods can
be implemented by passing the donor_probes
argument to the
abagen.get_expression_data()
function.
Aggregate selection across donors¶
>>> abagen.get_expression_data(atlas['image'], donor_probes='aggregate')
The default option, this will aggregate tissue samples from all donors and
apply the chosen probe_selection
method to this single probe x sample
matrix. The probe chosen to represent each gene will be identical across all
donors.
Independent selection for donors¶
>>> abagen.get_expression_data(atlas['image'], donor_probes='independent')
Performs the chosen probe_selection
method independently for each donor.
The probe chosen to represent each gene may be different across donors.
Note: this option cannot be used when the specified probe_selection
is one
of: ‘diff_stability’, ‘rnaseq’, or ‘average’.
Most common selection across donors¶
>>> abagen.get_expression_data(atlas['image'], donor_probes='common')
Performs the chosen probe_selection
method independently for each donor and
then uses the most commonly-selected probe to represent each gene. The probe
chosen to represent each gene will be identical across all donors.
Note: this option cannot be used when the specified probe_selection
is one
of: ‘diff_stability’, ‘rnaseq’, or ‘average’.
Data normalization options¶
The microarray expression data provided by the AHBA has been subjected to some normalization procedures designed to mitigate potential differences in expression values between donors due to “batch effects.” Despite these procedures, there are still some notable differences between donors present in the downloadable data.
By default, abagen.get_expression_data()
aggregates expression data
across donors (though this can be prevented via the return_donors
parameter). Prior to aggregation, the function performs a within-donor
normalization procedure to attempt to mitigate donor-specific effects; however,
there are a number of ways to achieve this.
Currently, abagen
supports nine options for normalizing data:
Most of the options have been used at various points throughout the published record, so while there is no “right” choice we do encourage using the default option (scaled robust sigmoid) due to recent work by Arnatkevičiūte et al., 2019 showing that it is—as the name might suggest—robust to outlier effects commonly observed in microarray data.
We describe all the methods in detail here; these can be implemented by passing
the sample_norm
and gene_norm
keyword arguments to
abagen.get_expression_data()
. For a selection of references to published
works that have used these different methods please see the documentation of
abagen.normalize_expression()
.
sample_norm
vs gene_norm
¶
Microarray expression data can be normalized in two directions:
Each sample can be normalized across all genes, or
Each gene can be normalized across all samples
These different forms of normalization are controlled by two parameters in the
abagen.get_expression_data()
function: sample_norm
and gene_norm
.
Note that normalization of each sample across all genes occurs before
normalization of each gene across all samples.
Both parameters can accept the same arguments (detailed below), and both are turned on by default.
Normalization methods¶
Centering¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='center')
Microarray values are centered with:
where \(\bar{x}\) is the mean of the microarray expression values.
Z-score¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='zscore')
Microarray values are normalized using a basic z-score function:
where \(\bar{x}\) is the mean and \(\sigma_{x}\) is the sample standard deviation of the microarray expression values.
Min-max¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='minmax')
Microarray values are rescaled to the unit interval with:
Sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='sigmoid')
Microarray values are normalized using a general sigmoid function:
where \(\bar{x}\) is the mean and \(\sigma_{x}\) is the sample standard deviation of the microarray expression values.
Scaled sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_sigmoid')
Microarray values are processed with the sigmoid function and then rescaled to the unit interval with the min-max function.
Scaled sigmoid quantiles¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_sigmoid_quantiles')
Input data are clipped to the 5th and 95th percentiles before being processed with the scaled sigmoid transform. The clipped distribution is only used for calculation of \(\bar{x}\) and \(\sigma_{x}\); the full (i.e., unclipped) distribution is processed through the transformation.
Robust sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='robust_sigmoid')
Microarray values are normalized using a robust sigmoid function:
where \(\langle x \rangle\) is the median and \(\text{IQR}_{x}\) is the normalized interquartile range of the microarray expression values given as:
Scaled robust sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='scaled_robust_sigmoid')
Microarray values are processed with the robust sigmoid function and then rescaled to the unit interval with the min-max function.
Mixed sigmoid¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm='mixed_sigmoid')
Microarray values are processed with the scaled sigmoid function when their interquartile range is 0; otherwise, they are processed with the scaled robust sigmoid function.
No normalization¶
>>> abagen.get_expression_data(atlas['image'], {sample,gene}_norm=None)
Providing None
to the sample_norm
and gene_norm
parameters will
prevent any normalization procedure from being performed on the data. Use this
with caution!
Note
Some of the more advanced methods described on this page were initially proposed in:
Fulcher, B.F., Little, M.A., & Jones, N.S. (2013). Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83), 20130048.
If using one of these methods please consider citing this paper in your work!
Applicable methods: robust sigmoid, scaled robust sigmoid, mixed sigmoid
Normalizing only matched samples¶
While sample normalization is _always_ performed across all genes, you can
control which samples are used when performing gene normalization. By default,
only those samples matched to regions in the provided atlas are used in the
normalization process (controllable via the norm_matched
parameter):
>>> abagen.get_expression_data(atlas['image'], norm_matched=True)
However, when a smaller atlas is provided with only a few regions, normalizing
over just those samples matched to the atlas can be less desirable. To make it
so that all available samples are used instead of only those matched, set
norm_matched
to False
:
>>> abagen.get_expression_data(atlas['image'], norm_matched=False)
Warning
Given the preponderence of parameters in abagen.get_expression_data()
it is perhaps unsurprising that they will interact with one another.
However, it is worth pointing out that norm_matched
will interact with
the missing
parameter in a relatively surprising manner (hence why we
feel the need to make this note). This is due to the order in which
sample-to-region matching, normalization, and “missing” regions are
handled: when norm_matched
is set to True
all samples not matched
to regions are removed prior to normalization. As such, if the missing
parameter is set, the program is only able to fill in missing regions with
samples that had already been assigned to other regions. If, instead,
norm_matched=False
and the missing
parameter is set, the program
can use the full range of samples to fill in missing regions. For this
reason, we suggest using norm_matched=False
when also setting the
missing
parameter; however, we do not impose a restriction on this.
Normalizing within structural classes¶
There are known differences in microarray expression between broad structural
designations (e.g. cortex, subcortex/brainstem, cerebellum). As such, it may
occasionally be desirable to constrain normalization such that the procedure is
performed separately for each structural designation. This process can be
controlled via the norm_structures
parameter:
>>> abagen.get_expression_data(atlas['image'], norm_structures=True)
By default, this parameter is set to False
and normalization uses all
available samples. Note that changing this parameter will _dramatically_ modify
the returned expression information, so use with caution. For obvious reasons
this parameter will interact heavily with the norm_matched
parameter
described above.
Sample aggregation options¶
The primary goal of abagen.get_expression_data()
is to allow users to
aggregate the ~3,700 disparate tissue samples from the Allen Human Brain Atlas
into regions of interest defined by an atlas or parcellation file. However,
there exist several options for exactly how to aggregate samples within each
region of the specified atlas.
These options are controlled via two parameters to
abagen.get_expression_data()
: region_agg
and agg_metric
. We
discuss both parameters and the different options available to each below.
The region_agg parameter¶
This parameter determines how samples are aggregated together to generate the expression values for a region. It can take two values: ‘donors’ or ‘samples’.
If set to ‘donors’, expression values for all samples assigned to a region are aggregated independently for each donor and then aggregated across donors. If set to ‘samples’, expression values for all samples for all donors assigned to a region are aggregated simultaneously.
The agg_metric parameter¶
This parameter determines the actual metric used for aggregating samples into
regional expression values. It can be set to any callable function (as long as
that function accepts the keyword axis
argument), but generally either
‘mean’ (the default) or ‘median’ will suffice.
Using a binary mask¶
Basic usage¶
Sometimes, you’re not interested in aggregating microarray expression samples
within regions of an atlas—you want the actual, sample-level data instead. In
this case, we provides the abagen.get_samples_in_mask()
function.
To demonstrate how this works, we’ll first make a brainmask for the left parahippocampal gyrus using the region definition from the Desikan-Killiany atlas:
>>> import nibabel as nib
>>> import pandas as pd
>>> atlas = abagen.fetch_desikan_killiany()
>>> dk = nib.load(atlas['image'])
>>> info = pd.read_csv(atlas['info'])
>>> phg = int(info.query('label == "parahippocampal" & hemisphere == "L"')['id'])
>>> img = dk.__class__(dk.dataobj[:] == phg, dk.affine, dk.header)
We can then use this mask to obtain all the microarray samples that fall within its boundaries:
>>> expression, coords = abagen.get_samples_in_mask(mask=img)
abagen.get_samples_in_mask()
returns two objects: (1) the the samples x
gene expression matrix (exp
), and (2) an array of MNI coordinates for those
samples (coords
). Because this is using abagen.get_expression_data()
under the hood, the returned expression data have been preprocessed (i.e.,
filtered, normalized) according to that workflow. As such, you can provide all
the same parameters and keyword arguments to abagen.get_samples_in_mask()
as you can to abagen.get_expression_data()
(with the exception of
atlas
which is superseded by mask
and region_agg
/agg_metric
which will be ignored). Refer to the API documentation) for
more details!
Since the returned expression dataframe is a samples x gene matrix (rather than regions x gene), the index of the dataframe corresponds to the unique well ID of the relevant sample (rather than the atlas region):
>>> print(expression)
gene_symbol A1BG A1BG-AS1 A2M ... ZYX ZZEF1 ZZZ3
well_id ...
2850 0.654914 0.234039 0.283280 ... 0.020379 0.228080 0.000000
998 0.428705 0.375819 0.457741 ... 0.254195 0.315383 0.502122
990 0.400673 0.409852 0.561666 ... 0.270064 0.397740 0.522261
... ... ... ... ... ... ... ...
159226055 0.418706 0.751837 0.087808 ... 0.651541 0.410095 0.462773
159226117 0.533079 0.773214 0.265615 ... 0.441826 0.389615 0.455249
158158343 0.362038 0.553050 0.314730 ... 0.346605 0.261426 0.337738
[43 rows x 15633 columns]
This allows you to match up the samples with additional data provided by the AHBA (e.g., ontological information) as desired.
Get ALL the samples¶
If you want all of the available processed samples rather than only those
within a given mask you can call the function without providing an explicit
mask (this is the default when no mask
parameter is passed):
>>> expression, coords = abagen.get_samples_in_mask(mask=None)
This will return all samples (after dropping those where the listed MNI coordinates don’t match the listed hemisphere designation, etc.).
Generating reporting methods¶
There are a lot of options and parameters to choose from when processing data
with abagen.get_expression_data()
and while we’ve attempted to select
reasonable defaults, we don’t want to limit your options—you are free to pick
and choose any combination of inputs to process the AHBA data! That said, we
also wanted to make it easy for you to report exactly what was done to the
data based on the parameters you choose, so to that end we have added the
return_report
parameter to abagen.get_expression_data()
.
When return_report=True
, the workflow will return an extra output. That is,
in addition to the default regional microarray expression dataframe, a string
will be returned that describes, in detail, all the processing that was done to
the AHBA in the process of generating the expression matrix. We have tried to
write this in such a way that you can simply copy-and-paste the provided text
into the methods section of a paper, though you are of course free to edit it
as you see fit (though if you feel edits are necessary please let us know and
we can modify the generation more permanently!).
Example report¶
A report can be generated with:
>>> expression, report = abagen.get_expression_data(atlas['image'], atlas['info'],
... return_report=True)
Alternatively, you can use the abagen.reporting
module to generate a report
directly without having to re-run the entire pipeline. (Note that the Report
class accepts (nearly) all the same parameters as the get_expression_data()
workflow.)
>>> from abagen import reporting
>>> generator = reporting.Report(atlas['image'], atlas['info'])
>>> report = generator.gen_report()
The returned report
(with default parameters) will look something like this
example:
Regional microarry expression data were obtained from 6 post-mortem brains (1 female, ages 24.0–57.0, 42.50 +/- 13.38) provided by the Allen Human Brain Atlas (AHBA, https://human.brain-map.org; [H2012N]). Data were processed with the abagen toolbox (version X.Y; https://github.com/rmarkello/abagen) using a 83-region volumetric atlas in MNI space.
First, microarray probes were reannotated using data provided by [A2019N]; probes not matched to a valid Entrez ID were discarded. Next, probes were filtered based on their expression intensity relative to background noise [Q2002N], such that probes with intensity less than the background in >=50.00% of samples across donors were discarded. When multiple probes indexed the expression of the same gene, we selected and used the probe with the most consistent pattern of regional variation across donors (i.e., differential stability; [H2015N]), calculated with:
\(\Delta_{S}(p) = \frac{1}{\binom{N}{2}} \, \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \rho[B_{i}(p), B_{j}(p)]\)
where \(\rho\) is Spearman’s rank correlation of the expression of a single probe, p, across regions in two donor brains \(B_{i}\) and \(B_{j}\), and N is the total number of donors. Here, regions correspond to the structural designations provided in the ontology from the AHBA.
The MNI coordinates of tissue samples were updated to those generated via non-linear registration using the Advanced Normalization Tools (ANTs; https://github.com/chrisfilo/alleninf). Samples were assigned to brain regions in the provided atlas if their MNI coordinates were within 2 mm of a given parcel. To reduce the potential for misassignment, sample-to-region matching was constrained by hemisphere and gross structural divisions (i.e., cortex, subcortex/brainstem, and cerebellum, such that e.g., a sample in the left cortex could only be assigned to an atlas parcel in the left cortex; [A2019N]). All tissue samples not assigned to a brain region in the provided atlas were discarded.
Inter-subject variation was addressed by normalizing tissue sample expression values across genes using a robust sigmoid function [F2013J]:
\(x_{norm} = \frac{1}{1 + \exp(-\frac{(x-\langle x \rangle)} {\text{IQR}_{x}})}\)
where \(\langle x \rangle\) is the median and \(\text{IQR}_{x}\) is the normalized interquartile range of the expression of a single tissue sample across genes. Normalized expression values were then rescaled to the unit interval:
\(x_{scaled} = \frac{x_{norm} - \min(x_{norm})} {\max(x_{norm}) - \min(x_{norm})}\)
Gene expression values were then normalized across tissue samples using an identical procedure. Samples assigned to the same brain region were averaged separately for each donor and then across donors.
REFERENCES
----------
[A2019N]: Arnatkevic̆iūtė, A., Fulcher, B. D., & Fornito, A. (2019). A practical guide to linking brain-wide gene expression and neuroimaging data. Neuroimage, 189, 353-367.
[F2013J]: Fulcher, B. D., Little, M. A., & Jones, N. S. (2013). Highly comparative time-series analysis: the empirical structure of time series and their methods. Journal of the Royal Society Interface, 10(83), 20130048.
[H2012N]: Hawrylycz, M. J., Lein, E. S., Guillozet-Bongaarts, A. L., Shen, E. H., Ng, L., Miller, J. A., … & Jones, A. R. (2012). An anatomically comprehensive atlas of the adult human brain transcriptome. Nature, 489(7416), 391-399.
[H2015N]: Hawrylycz, M., Miller, J. A., Menon, V., Feng, D., Dolbeare, T., Guillozet-Bongaarts, A. L., … & Lein, E. (2015). Canonical genetic signatures of the adult human brain. Nature Neuroscience, 18(12), 1832.
[Q2002N]: Quackenbush, J. (2002). Microarray data normalization and transformation. Nature Genetics, 32(4), 496-501.
Note that due to text formatting limitations in Python, relevant equations used for e.g., normalizing the expression data will be provided in LaTeX format (i.e., surrounded by $$ characters and with TeX math commands).
Important
Please note that we explicitly release all text in the abagen.reporting
module (used to generate the above-referenced reports)
under a CC0 license
such that it can be used in manuscripts without modification.
Getting involved¶
Please refer to our contributing guidelines for a practical walkthrough on how to get
involved and start contributing to abagen
.
Citing abagen¶
Note
We strongly encourage you to use the automatically-generated methods
reports built into abagen
, which will provide a list
of citations based on your selected processing options!
We’re thrilled you’ve found abagen
useful in your work! Please cite the
following manuscripts when referencing your use of the toolbox:
Markello, RD, Arnatkevic̆iūtė, A, Poline, J-B, Fulcher, BD, Fornito, A, & Misic, B. (2021). Standardizing workflows in imaging transcriptomics with the abagen toolbox. Biorxiv. doi:10.1101/2021.07.08.451635
Arnatkevic̆iūtė, A, Fulcher, BD, & Fornito, A. (2019). A practical guide to linking brain-wide gene expression and neuroimaging data. NeuroImage, 189, 353-367. doi:10.1016/j.neuroimage.2019.01.011
Hawrylycz, MJ, Lein, ES, Guillozet-Bongaarts, AL, Shen, EH, Ng, L, Miller, JA, …, & Jones, AR. (2012). An anatomically comprehensive atlas of the adult human brain transcriptome. Nature, 489(7416), 391–399. doi:10.1038/nature11405
Additionally, to cite the specific version of the toolbox used in your analyses you can use the following Zenodo reference:
abagen Available from: 10.5281/zenodo.3451463.
Note that this will always point to the most recent abagen
release; for
older releases please refer to the Zenodo listing.
For more information about why citing software is important please refer to this article from the Software Sustainability Institute.
Reference API¶
List of modules
abagen.allen
- Primary workflowsabagen.datasets
- Fetching AHBA datasetsabagen.images
- Image processing functionsabagen.correct
- Post-processing correctionsabagen.matching
- Functions for matching samplesabagen.reporting
- Functions for generating reportsabagen.io
- Loading AHBA data filesabagen.mouse
- Working with the Allen Mouse Brain Atlas
abagen.allen
- Primary workflows¶
Functions for mapping AHBA microarray dataset to atlases and and parcellations
|
Assigns microarray expression data to ROIs defined in atlas |
|
Returns preprocessed microarray expression data for samples in mask |
|
Generates dense (i.e., interpolated) expression maps for genes |
abagen.datasets
- Fetching AHBA datasets¶
Functions for fetching data relevant to the Allen Brain Atlas human microarray dataset
|
Fetches Desikan-Killiany atlas shipped with abagen |
Returns dataframe with donor demographic information |
|
|
Downloads FreeSurfer reconstructions of the Allen Human Brain Atlas MRIs |
|
Return list of gene acronyms belonging to provided group |
|
Downloads the Allen Human Brain Atlas microarray expression dataset |
|
Downloads the “raw” Allen Human Brain Atlas T1w/T2w MRI images |
|
Downloads RNA-sequencing data from the Allen Human Brain Atlas |
|
Fetches and optionally loads fsaverage5 surface |
|
Fetches and optionally loads fsnative surface of donor |
abagen.images
- Image processing functions¶
|
Zeroes out all ROIs in the right hemisphere of volumetric atlas |
|
Checks that atlas is a valid atlas |
|
Converts FreeSurfer-style annotation file atlas to in-memory GIFTI image |
|
Updates GIFTI images so label IDs are consecutive across hemispheres |
abagen.correct
- Post-processing corrections¶
Functions for processing and correcting gene expression data
|
Corrects for distance-dependent correlation effects in coexpression |
|
Removes genes in expression with differential stability < threshold |
|
Performs normalization on expression data |
abagen.matching
- Functions for matching samples¶
Structures and functions used for matching samples to atlas
|
Representation of a parcellation as a cKDtree for NN lookups |
|
Finds centroids of data in coordinates space |
|
Returns index of centroids closest to coords (Euclidean distance) |
abagen.reporting
- Functions for generating reports¶
Functions for generating workflow methods reports
Note: all text contained within this module is released under a CC-0 license.
|
Generates report of methods for |
abagen.io
- Loading AHBA data files¶
Functions for loading the various files associated with the AHBA microarray and RNAseq datasets.
This also contains functionality for optionally converting the downloaded CSV files to parquet format, which provides much faster I/O access / quicker load times.
|
Loads SampleAnnot.csv file found at fname |
|
Loads MicroarrayExpression.csv file found at fname |
|
Loads Ontology.csv file found at fname |
|
Loads PACall.csv file found at fname |
|
Loads Probes.csv file found at fname |
|
Loads Genes.csv file found at fname |
|
Loads RNAseqTPM.csv file found at fname |
|
Loads RNAseqCounts.csv file found at fname |
abagen.mouse
- Working with the Allen Mouse Brain Atlas¶
Lists available attributes for |
|
Lists available attributes for |
|
Lists attributes for |
|
|
Queries Allen API for information about given gene |
|
Queries Allen API for information about given gene |
Finds xyz coordinates of provided structure(s) in reference_space |
|
Gets unionization data for provided gene(s) |
|
Loads all genes from Allen Reference database |
|
Loads all anatomical structures in the Allen Reference Atlas |
|
Loads subset of anatomical structures in Allen Reference Atlas from [MI1] |