1. The Allen Human Brain Atlas dataset
1.1. Fetching the AHBA data
In order to use abagen, you'll need to download the AHBA microarray data. You can download it with the following command:
>>> import abagen
>>> files = abagen.fetch_microarray(donors='all', verbose=0)
Note
Downloading the entire dataset (about 4GB) can take a long time depending on your internet connection speed! If you don't want to download all the donors, you can provide the subject IDs of the donors you want as a list (e.g., ['9861', '10021']) instead of passing 'all'.
This command will download data from the specified donors into a folder called microarray in the $HOME/abagen-data directory. If you have already downloaded the data, you can provide the data_dir argument to specify where the files have been stored:
>>> files = abagen.fetch_microarray(donors=['12876', '15496'], data_dir='/path/to/my/data/')
Alternatively, abagen will check the directory specified by the environment variable $ABAGEN_DATA and use that as the download location if the dataset does not already exist there.
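For example, you could point abagen at an existing directory by setting the variable before calling the fetcher. This is a minimal sketch: the path below is just a placeholder, and it assumes the variable is read when fetch_microarray() is called:
>>> import os
>>> os.environ['ABAGEN_DATA'] = '/path/to/my/data/'  # placeholder path
>>> files = abagen.fetch_microarray(donors='all', verbose=0)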
If you provide a path to data_dir (or specify a path with $ABAGEN_DATA), the directory specified should have the following structure:
/path/to/my/data/
├── normalized_microarray_donor10021/
│   ├── MicroarrayExpression.csv
│   ├── Ontology.csv
│   ├── PACall.csv
│   ├── Probes.csv
│   └── SampleAnnot.csv
├── normalized_microarray_donor12876/
├── normalized_microarray_donor14380/
├── normalized_microarray_donor15496/
├── normalized_microarray_donor15697/
└── normalized_microarray_donor9861/
(Note the directory does not have to be named microarray for this to work.)
Note
If downloading fails for some reason, you can try downloading the files directly from the AHBA microarray data website and unzipping them into the appropriate directory structure (in the default case, $HOME/abagen-data/microarray/).
1.2. Loading the AHBA data
The files object returned by abagen.fetch_microarray() is a nested dictionary with filepaths to the five different file types in the AHBA microarray dataset. The keys are the donor IDs:
>>> print(files.keys())
dict_keys(['9861', '10021', '12876', '14380', '15496', '15697'])
And the values for each entry are a sub-dictionary of the downloaded files:
>>> print(sorted(files['9861']))
['annotation', 'microarray', 'ontology', 'pacall', 'probes']
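As a quick, purely illustrative check, you can confirm that every donor entry contains the same five file types using standard dictionary access:
>>> keys = ['annotation', 'microarray', 'ontology', 'pacall', 'probes']
>>> all(sorted(files[donor]) == keys for donor in files)
True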
You can load the data in these files using the abagen.io functions. There are IO functions for each of the five file types; you can get more information on the functions and the data contained in each file type by looking at the Reference API. Notably, all IO functions return pandas.DataFrame objects for ease of use.
For example, you can load the annotation file for the first donor with:
>>> data = files['9861']
>>> annotation = abagen.io.read_annotation(data['annotation'])
>>> print(annotation)
structure_id slab_num well_id ... mni_x mni_y mni_z
sample_id ...
1 4077 22 594 ... 5.9 -27.7 49.7
2 4323 11 2985 ... 29.2 17.0 -2.9
3 4323 18 2801 ... 28.2 -22.8 16.8
... ... ... ... ... ... ... ...
944 4758 67 1074 ... 7.9 -72.3 -40.6
945 4760 67 1058 ... 8.3 -57.4 -59.0
946 4761 67 1145 ... 9.6 -46.7 -47.6
[946 rows x 13 columns]
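Because the returned object is a plain pandas.DataFrame, standard pandas indexing applies. For instance, you could pull out the MNI coordinates of every sample (column names taken from the output above):
>>> coords = annotation[['mni_x', 'mni_y', 'mni_z']]
>>> print(coords.shape)
(946, 3)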
And you can do the same for, e.g., the probe file with:
>>> probes = abagen.io.read_probes(data['probes'])
>>> print(probes)
probe_name gene_id gene_symbol gene_name entrez_id chromosome
probe_id
1058685 A_23_P20713 729 C8G complement component 8, gamma polypeptide 733 9
1058684 CUST_15185_PI416261804 731 C9 complement component 9 735 5
1058683 A_32_P203917 731 C9 complement component 9 735 5
... ... ... ... ... ... ...
1071209 A_32_P885445 1012197 A_32_P885445 AGILENT probe A_32_P885445 (non-RefSeq) <NA> NaN
1071210 A_32_P9207 1012198 A_32_P9207 AGILENT probe A_32_P9207 (non-RefSeq) <NA> NaN
1071211 A_32_P94122 1012199 A_32_P94122 AGILENT probe A_32_P94122 (non-RefSeq) <NA> NaN
[58692 rows x 6 columns]
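As the output shows, some probes are not annotated with an entrez_id. If you only want probes that map to a gene, a standard pandas filter works (shown purely as an illustration, not a required preprocessing step):
>>> mapped = probes.dropna(subset=['entrez_id'])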
The other IO functions work similarly for the remaining filetypes.
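For instance, assuming the remaining readers follow the same naming pattern as abagen.io.read_annotation() and abagen.io.read_probes(), the expression values, the ontology, and the PA (present/absent) calls could be loaded with:
>>> microarray = abagen.io.read_microarray(data['microarray'])
>>> ontology = abagen.io.read_ontology(data['ontology'])
>>> pacall = abagen.io.read_pacall(data['pacall'])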