CELLxGENE: scRNA-seq¶

CZ CELLxGENE hosts the world's largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, …).

You can use the CELLxGENE data in two ways:

Query collections of AnnData objects.
Query a large array store, produced by concatenating AnnData objects, via tiledbsoma.

If you are interested in building similar data assets in-house:

See the transfer guide to zero-copy data to your own LaminDB instance.
See the scRNA guide to create a growing, standardized & versioned scRNA-seq dataset collection.

Load the public LaminDB instance that mirrors CELLxGENE:

# !pip install 'lamindb[bionty,jupyter]'
!lamin load laminlabs/cellxgene

import lamindb as ln
import bionty as bt

Query & understand metadata¶

Auto-complete metadata¶

You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.

Let’s use auto-complete to look up cell types:

cell_types = bt.CellType.lookup()
cell_types.effector_t_cell

You can also arbitrarily chain filters and create lookups from them:

users = ln.User.lookup()
organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(name="is_suspension_type").one().children.lookup()  # suspension types
# here we choose to return .name directly
features = ln.Feature.lookup(return_field="name")
assays = bt.ExperimentalFactor.lookup(return_field="name")
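
The attribute names on the lookup objects used on this page (for example effector_t_cell for "effector T cell", or ln_10x_3_v2 for "10x 3' v2") suggest a simple naming rule. The sketch below approximates that rule in plain Python. The function to_lookup_key and its prefix default are hypothetical, inferred from those examples only, and are not LaminDB's implementation.

```python
import re


def to_lookup_key(name: str, prefix: str = "ln") -> str:
    """Approximate how a record name maps to a Python attribute name."""
    # Replace every run of non-alphanumeric characters with one underscore.
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Identifiers cannot start with a digit, so prepend the (assumed) prefix.
    if not key.isidentifier():
        key = f"{prefix}_{key}"
    return key


for name in ["effector T cell", "B cell", "10x 3' v2"]:
    print(f"{name!r} -> {to_lookup_key(name)}")
```

If auto-complete does not offer the attribute you expect, inspecting the lookup object in your session is more reliable than guessing the key.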

Search & filter metadata¶

We can use search & filters for metadata:

bt.CellType.search("effector T cell").df().head()

	uid	name	ontology_id	abbr	synonyms	description	source_id	run_id	created_by_id	updated_at
id
1623	3nfZTVV4	effector T cell	CL:0000911	None	effector T-cell\|effector T-lymphocyte\|effector...	A Differentiated T Cell With Ability To Traffi...	48	NaN	1	2023-11-28 22:30:57.481778+00:00
1229	69TEBGqb	exhausted T cell	CL:0011025	None	Tex cell\|An effector T cell that displays impa...	None	48	NaN	1	2023-11-28 22:27:55.572884+00:00
1331	43cBCa7s	helper T cell	CL:0000912	None	helper T-lymphocyte\|T-helper cell\|helper T lym...	A Effector T Cell That Provides Help In The Fo...	48	NaN	1	2023-11-28 22:27:55.575955+00:00
1169	6JD5JCZC	CD8-positive, alpha-beta cytokine secreting ef...	CL:0000908	None	CD8-positive, alpha-beta cytokine secreting ef...	A Cd8-Positive, Alpha-Beta T Cell With The Phe...	48	NaN	1	2023-11-28 22:27:55.571576+00:00
1503	1oa5G2Mq	memory T cell	CL:0000813	None	memory T-cell\|memory T lymphocyte\|memory T-lym...	A Long-Lived, Antigen-Experienced T Cell That ...	48	NaN	1	2023-11-28 22:27:55.580290+00:00

Use a uid to get exactly one metadata record:

effector_t_cell = bt.CellType.get("3nfZTVV4")
effector_t_cell

Understand ontologies¶

View the related ontology terms:

effector_t_cell.view_parents(distance=2, with_children=True)

Or access them programmatically:

effector_t_cell.children.df()

	uid	name	ontology_id	abbr	synonyms	description	source_id	run_id	created_by_id	updated_at
id
931	2VQirdSp	effector CD8-positive, alpha-beta T cell	CL:0001050	None	effector CD8-positive, alpha-beta T lymphocyte...	A Cd8-Positive, Alpha-Beta T Cell With The Phe...	48	None	1	2023-11-28 22:27:55.565981+00:00
1088	490Xhb24	effector CD4-positive, alpha-beta T cell	CL:0001044	None	effector CD4-positive, alpha-beta T lymphocyte...	A Cd4-Positive, Alpha-Beta T Cell With The Phe...	48	None	1	2023-11-28 22:27:55.569832+00:00
1229	69TEBGqb	exhausted T cell	CL:0011025	None	Tex cell\|An effector T cell that displays impa...	None	48	None	1	2023-11-28 22:27:55.572884+00:00
1309	5s4gCMdn	cytotoxic T cell	CL:0000910	None	cytotoxic T lymphocyte\|cytotoxic T-lymphocyte\|...	A Mature T Cell That Differentiated And Acquir...	48	None	1	2023-11-28 22:27:55.575444+00:00
1331	43cBCa7s	helper T cell	CL:0000912	None	helper T-lymphocyte\|T-helper cell\|helper T lym...	A Effector T Cell That Provides Help In The Fo...	48	None	1	2023-11-28 22:27:55.575955+00:00

Query individual datasets¶

Query artifacts¶

Here we query sets of .h5ad files, which correspond to AnnData objects. Individual datasets or studies normally correspond to records of the ln.Artifact model.

To see what you can query for, simply look at the registry representation:

ln.Artifact

Here is an example query that uses strings:

ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__name__in=["B cell", "T cell"],  # cell types measured in AnnData
    created_by__handle="sunnyosun"  # creator
).order_by(
    "created_at"
).df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()

	cell_types__name	created_by__handle	uid	version	is_latest	description	key	suffix	type	size	...	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
879	[conventional dendritic cell, classical monocy...	sunnyosun	BCutg5cxmqLmy2Z5SS8J	2023-07-25	False	Type I interferon autoantibodies are associate...	cell-census/2023-07-25/h5ads/01ad3cd7-3929-465...	.h5ad	None	6353682597	...	600929	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:14:10.959155+00:00
1106	[immature B cell, monocyte, naive thymus-deriv...	sunnyosun	3xdOASXuAxxJtSchJO3D	2023-07-25	False	HSC/immune cells (all hematopoietic-derived ce...	cell-census/2023-07-25/h5ads/48101fa2-1a63-451...	.h5ad	None	6214230662	...	589390	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:11:10.324135+00:00
1174	[monocyte, conventional dendritic cell, plasma...	sunnyosun	wt7eD72sTzwL3rfYaZr2	2023-07-25	False	A scRNA-seq atlas of immune cells at the CNS b...	cell-census/2023-07-25/h5ads/58b01044-c5e5-4b0...	.h5ad	None	1052158249	...	130908	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:09:45.364255+00:00
1377	[monocyte, ciliated cell, macrophage, natural ...	sunnyosun	znTBqWgfYgFlLjdQ6Ba7	2023-07-25	False	Large-scale single-cell analysis reveals criti...	cell-census/2023-07-25/h5ads/9dbab10c-118d-496...	.h5ad	None	13929140098	...	1462702	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:14:24.084706+00:00
1482	[effector CD4-positive, alpha-beta T cell, con...	sunnyosun	dEP0dZ8UxLgwnkLjz6Iq	2023-07-25	False	Single-cell sequencing links multiregional imm...	cell-census/2023-07-25/h5ads/bd65a70f-b274-413...	.h5ad	None	1204103287	...	167283	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:05:49.602044+00:00

5 rows × 22 columns
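
The keyword arguments in the filter above follow the Django-style field__operator convention. As a rough illustration of what suffix=..., description__contains=... and size__gt=... mean, the sketch below applies the same lookups to plain dictionaries. The helper matches is hypothetical and only covers four operators on flat fields; it does not traverse relations such as cell_types__name__in, and LaminDB itself evaluates these lookups in the database rather than in Python.

```python
from typing import Any


def matches(record: dict[str, Any], **lookups: Any) -> bool:
    """Return True if `record` satisfies every `field__operator=value` lookup."""
    operators = {
        "exact": lambda actual, expected: actual == expected,
        "contains": lambda actual, expected: expected in actual,
        "gt": lambda actual, expected: actual > expected,
        "in": lambda actual, expected: actual in expected,
    }
    for lookup, expected in lookups.items():
        # "size__gt" -> ("size", "gt"); a bare "suffix" means an exact match.
        field, _, op = lookup.partition("__")
        if not operators[op or "exact"](record[field], expected):
            return False
    return True


# Two records with values copied from artifact tables on this page.
records = [
    {"suffix": ".h5ad", "description": "A scRNA-seq atlas of immune cells at the CNS b...", "size": 1052158249},
    {"suffix": ".h5ad", "description": "Mature kidney dataset: immune", "size": 45158726},
]
hits = [
    r for r in records
    if matches(r, suffix=".h5ad", description__contains="immune", size__gt=1e9)
]
print(len(hits))
```

Only the first record passes, because the second one is smaller than 1 GB.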

Queries by string are prone to typos. Let’s query User and CellType with auto-completed records instead.

ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__in=[cell_types.b_cell, cell_types.t_cell],  # cell types measured in AnnData
    created_by=users.sunnyosun   # creator
).order_by(
    "created_at"
).df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()

	cell_types__name	created_by__handle	uid	version	is_latest	description	key	suffix	type	size	...	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
879	[conventional dendritic cell, classical monocy...	sunnyosun	BCutg5cxmqLmy2Z5SS8J	2023-07-25	False	Type I interferon autoantibodies are associate...	cell-census/2023-07-25/h5ads/01ad3cd7-3929-465...	.h5ad	None	6353682597	...	600929	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:14:10.959155+00:00
1106	[immature B cell, monocyte, naive thymus-deriv...	sunnyosun	3xdOASXuAxxJtSchJO3D	2023-07-25	False	HSC/immune cells (all hematopoietic-derived ce...	cell-census/2023-07-25/h5ads/48101fa2-1a63-451...	.h5ad	None	6214230662	...	589390	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:11:10.324135+00:00
1174	[monocyte, conventional dendritic cell, plasma...	sunnyosun	wt7eD72sTzwL3rfYaZr2	2023-07-25	False	A scRNA-seq atlas of immune cells at the CNS b...	cell-census/2023-07-25/h5ads/58b01044-c5e5-4b0...	.h5ad	None	1052158249	...	130908	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:09:45.364255+00:00
1377	[monocyte, ciliated cell, macrophage, natural ...	sunnyosun	znTBqWgfYgFlLjdQ6Ba7	2023-07-25	False	Large-scale single-cell analysis reveals criti...	cell-census/2023-07-25/h5ads/9dbab10c-118d-496...	.h5ad	None	13929140098	...	1462702	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:14:24.084706+00:00
1482	[effector CD4-positive, alpha-beta T cell, con...	sunnyosun	dEP0dZ8UxLgwnkLjz6Iq	2023-07-25	False	Single-cell sequencing links multiregional imm...	cell-census/2023-07-25/h5ads/bd65a70f-b274-413...	.h5ad	None	1204103287	...	167283	md5-n	AnnData	1	False	2	11	16	1	2024-01-24 07:05:49.602044+00:00

5 rows × 22 columns

Slice an AnnData-like artifact¶

Let’s look at an artifact and show its metadata using .describe().

artifact = ln.Artifact.filter(description="Mature kidney dataset: immune", is_latest=True).one()
artifact.describe()

Artifact(uid='WwmBIhBNLTlRcSoBDt76', version='2024-07-01', is_latest=True, description='Mature kidney dataset: immune', key='cell-census/2024-07-01/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad', suffix='.h5ad', type='dataset', size=45158726, hash='GCMHkdQSTeXxRVF7gMZFIA', n_observations=7803, _hash_type='md5-n', _accessor='AnnData', visibility=1, _key_is_virtual=False, updated_at='2024-07-12 12:40:43 UTC')
  Provenance
    .storage = 's3://cellxgene-data-public'
    .transform = 'Census release 2024-07-01 (LTS)'
    .run = '2024-07-16 12:49:41 UTC'
    .created_by = 'sunnyosun'
  Labels
    .organisms = 'human'
    .tissues = 'cortex of kidney', 'renal medulla', 'kidney', 'kidney blood vessel', 'renal pelvis'
    .cell_types = 'classical monocyte', 'plasmacytoid dendritic cell', 'natural killer cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'mast cell', 'neutrophil', 'non-classical monocyte', 'CD8-positive, alpha-beta T cell', 'B cell', ...
    .diseases = 'normal'
    .phenotypes = 'male', 'female'
    .experimental_factors = '10x 3' v2'
    .developmental_stages = '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage', ...
    .ethnicities = 'unknown'
    .ulabels = 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', 'Wilms3', ...
  Features
    'donor_id' = 'Wilms3', 'TTx', 'pRCC', 'VHL', 'RCC3', 'TxK1', 'TxK4', 'TxK3', 'RCC2', 'Wilms2', ...
    'organism' = 'human'
    'suspension_type' = 'cell'
  Feature sets
    'obs' = 'assay', 'cell_type', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'sex', 'tissue', 'organism', 'tissue_type', 'suspension_type'
    'var' = 'None', 'EBF1', 'LINC02202', 'RNF145', 'LINC01932', 'UBLCP1', 'IL12B', 'LINC01845', 'LINC01847', 'ADRA1B', 'TTC1', 'PWWP2A', 'FABP6', 'FABP6-AS1', 'CCNJL', 'C1QTNF2'

If you want to query a slice of the array data, you have three options:

Cache the file on disk via artifact.cache() and get the path to the cached data. Nothing is downloaded if the file is already in the cache.
Cache & load the entire array into memory via artifact.load() -> AnnData (caches the h5ad on disk, so that you only download once).
Stream the array using a (cloud-backed) accessor via artifact.open() -> AnnDataAccessor.

All three will run much faster in the AWS us-west-2 data center.
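
Which option fits best depends largely on the artifact size. The sketch below is one possible heuristic, not part of LaminDB: the helper choose_access_mode and its 4 GiB default budget are assumptions made for illustration. Note that the on-disk size of an .h5ad file can differ from its in-memory footprint, so treat the threshold as a rough guide.

```python
def choose_access_mode(size_bytes: int, memory_budget_bytes: int = 4 * 1024**3) -> str:
    """Suggest 'load' for artifacts within the budget and 'open' otherwise."""
    return "load" if size_bytes <= memory_budget_bytes else "open"


# Sizes in bytes as reported in the artifact tables on this page.
sizes = {
    "Mature kidney dataset: immune": 45158726,
    "All major cell types in adult human retina": 14638089351,
}
for description, size in sizes.items():
    print(f"{description}: artifact.{choose_access_mode(size)}()")
```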

Cache:

cache_path = artifact.cache()
cache_path

Cache & load:

adata = artifact.load()
adata

Now we have an AnnData object, which stores observation annotations matching our artifact-level query in the .obs slot, and we can re-use almost the same query on the array-level.

Open the artifact for streaming without loading the whole array into memory:

with artifact.open() as adata_backed:
    display(adata_backed)

We now have an AnnDataAccessor object, which behaves much like an AnnData, and the query looks the same.

Query collections of datasets¶

Exploring data by collection¶

Often, you work with collections of artifacts, which Collection helps you manage.

Alternatively,

you can search for a file on the LaminHub UI and fetch it via ln.Artifact.get(uid)
or query for a collection you found on CZ CELLxGENE Discover.

Pin the version of the cellxgene-census release:

census_version = "2024-07-01"

Let’s search the collections from CELLxGENE within the 2024-07-01 release:

ln.Collection.filter(version=census_version).search("human retina", limit=10)

<QuerySet [Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=True, name='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC'), Collection(uid='8ohRJQq8e3F7pdlBZbhz', version='2024-07-01', is_latest=True, name='Single cell atlas of the human retina', description='10.1101/2023.11.07.566105', hash='_vU7tll3t-0NCuJL-fm0', reference='4c6eaf5c-6d57-4c76-b1e9-60df8c655f1e', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:19:25 UTC'), Collection(uid='tZYmzwfh0bIYzKBQVuro', version='2024-07-01', is_latest=True, name='Cell Types of the Human Retina and Its Organoids at Single-Cell Resolution', description='10.1016/j.cell.2020.08.013', hash='nGcCV4HJONcma2SExXw2', reference='2f4c738f-e2f3-4553-9db2-0582a38ea4dc', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'), Collection(uid='2gBKIwx8AtCHc4nfcQqc', version='2024-07-01', is_latest=True, name='A single-cell transcriptome atlas of the adult human retina', description='10.15252/embj.2018100811', hash='sCh4gUTJJJjECsp1dj0q', reference='3472f32d-4a33-48e2-aad5-666d4631bf4c', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC'), Collection(uid='zZLyhpo1aDdxdbULFbVT', version='2024-07-01', is_latest=True, name='Single-cell transcriptomic atlas of the human retina identifies cell types associated with age-related macular degeneration', description='10.1038/s41467-019-12780-8', hash='1B0m9_FahAvefSTM8_AV', reference='1a486c4c-c115-4721-8c9f-f9f096e10857', reference_type='CELLxGENE Collection ID', 
visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'), Collection(uid='Yxth0JJgMb2VVOCfSgWj', version='2024-07-01', is_latest=True, name='Single-cell transcriptomics of the human retinal pigment epithelium and choroid in health and macular degeneration', description='10.1073/pnas.1914143116', hash='j2LqihaaNawOtEFysl3c', reference='f8057c47-fcd8-4fcf-88b0-e2f930080f6e', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC'), Collection(uid='kDJ9Xb8d11d93LAHMJpf', version='2024-07-01', is_latest=True, name='Human Brain Cell Atlas v1.0', description='10.1126/science.add7046', hash='pD7t82V30Qg-8Nbm52qI', reference='283d65eb-dd53-496d-adb7-7570c7caa443', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'), Collection(uid='kAcitlx0g6C2lgacOCAS', version='2024-07-01', is_latest=True, name='Human breast cell atlas', description='10.1038/s41588-024-01688-9', hash='wXMzOvp8a-_nGgkwfjSM', reference='48259aa8-f168-4bf5-b797-af8e88da6637', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC'), Collection(uid='yql5LxVFGGa5LiIEOnE9', version='2024-07-01', is_latest=True, name='Cellular heterogeneity of human fallopian tubes in normal and hydrosalpinx disease states identified by scRNA-seq', description='10.1101/2021.09.16.460628', hash='tC_mN86VmrXsdcGDij3W', reference='fc77d2ae-247d-44d7-aa24-3f4859254c2c', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:39 UTC'), Collection(uid='XGeEFfpeKAYMtQlnJAaY', version='2024-07-01', is_latest=True, name='Multi-scale spatial mapping of cell populations across anatomical sites in healthy human skin and basal cell carcinoma', description='10.1073/pnas.2313326120', 
hash='SR4yp3Hfk5B3SrqRoNXN', reference='34f12de7-c5e5-4813-a136-832677f98ac8', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:17:41 UTC')]>

Let’s get the record of the top hit collection:

collection = ln.Collection.get("quQDnLsMLkP3JRsC8gp4")
collection

The DOI in the description field (10.1016/j.xgen.2023.100298) points to a Cell Genomics paper, and we could find more information using the DOI or the CELLxGENE collection ID.

Check different versions of this collection:

collection.versions.df()

	uid	version	is_latest	name	description	hash	reference	reference_type	visibility	transform_id	meta_artifact_id	run_id	created_by_id	updated_at
id
134	quQDnLsMLkP3JRsC6WWz	2023-07-25	False	Single-cell transcriptomic atlas for adult hum...	10.1016/j.xgen.2023.100298	xhfSShX8lypXPx00zevx	af893e86-8e9f-41f1-a474-ef05359b1fb7	CELLxGENE Collection ID	1	NaN	None	NaN	1	2024-01-08 12:22:12.891941+00:00
291	quQDnLsMLkP3JRsCJNGB	2023-12-15	False	Single-cell transcriptomic atlas for adult hum...	10.1016/j.xgen.2023.100298	FsD52kpR7dF2h78-P3ka	af893e86-8e9f-41f1-a474-ef05359b1fb7	CELLxGENE Collection ID	1	17.0	None	22.0	1	2024-01-29 07:53:59.197813+00:00
606	quQDnLsMLkP3JRsC8gp4	2024-07-01	True	Single-cell transcriptomic atlas for adult hum...	10.1016/j.xgen.2023.100298	NIo8G6_reJTEqMzW2nMc	af893e86-8e9f-41f1-a474-ef05359b1fb7	CELLxGENE Collection ID	1	22.0	None	27.0	1	2024-07-16 12:24:39.223727+00:00
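
The version strings in the table above are ISO dates, so the most recent release can also be determined by parsing them. This is a minimal sketch that assumes versions keep following the YYYY-MM-DD pattern; the list is copied by hand from the table rather than read from the instance.

```python
from datetime import date

# Version strings copied from the table above.
versions = ["2023-07-25", "2023-12-15", "2024-07-01"]

# Parse each string as a date and pick the most recent one.
latest = max(versions, key=date.fromisoformat)
print(latest)
```

The result matches the row flagged with is_latest=True.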

Each collection has at least one Artifact associated with it. Let’s get the associated artifacts:

collection.artifacts.df()

! no run & transform get linked, consider calling ln.context.track()

	uid	version	is_latest	description	key	suffix	type	size	hash	n_objects	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
2852	Oc6ANFJ0FgOW1B70mNIq	2024-07-01	True	Photoreceptor cells in human retina (rod cells...	cell-census/2024-07-01/h5ads/00e5dedd-b9b7-43b...	.h5ad	dataset	990594324	qFT65q6_k30pki8-1_2HoQ	None	21422	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:44.668025+00:00
2855	wYiUe9hn4TJijpoX90Mr	2024-07-01	True	All major cell types in adult human retina	cell-census/2024-07-01/h5ads/0129dbd9-a7d3-4f6...	.h5ad	dataset	14638089351	bXxaz_quQ4mIbVlarLZZKQ	None	244474	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:43.933700+00:00
2919	GA2BXWwoJlcRfzNp3iyQ	2024-07-01	True	Horizontal cells in human retina	cell-census/2024-07-01/h5ads/11ef37ee-2173-458...	.h5ad	dataset	404987285	fR0O7fSUHxmAfEDC8J7Ipw	None	7348	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:45.065488+00:00
3018	QpuY5RsGTBBMN61QGY4t	2024-07-01	True	Amacrine cells in human retina	cell-census/2024-07-01/h5ads/359f7af4-87d4-411...	.h5ad	dataset	3382221253	S7gXlC-cJ362BOqYZFxMOA	None	56507	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:43.940079+00:00
3273	1OyQQLNfu1nzvVADODND	2024-07-01	True	Bipolar cells in human retina	cell-census/2024-07-01/h5ads/8f10185b-e0b3-46a...	.h5ad	dataset	3075818557	1GQwZcymSrr7d2Xit-5Deg	None	53040	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:46.454782+00:00
3378	Ce4Mqe4X2vUhwkwnh5YQ	2024-07-01	True	Retinal ganglion cells in human retina	cell-census/2024-07-01/h5ads/aad97cb5-f375-45e...	.h5ad	dataset	784580498	w-_LJDfBv7vsZqw-9Jt72g	None	11617	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:47.016308+00:00
3600	80xlsVmayPPBCCEZ7aBc	2024-07-01	True	Non-neuronal cells in human retina	cell-census/2024-07-01/h5ads/ed419b4e-db9b-40f...	.h5ad	dataset	1070671504	slN6j-9aSrYFw-IPL-wv-A	None	18011	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:48.497869+00:00

Let’s look at the collection that corresponds to the cellxgene-census release of .h5ad artifacts.

collection = ln.Collection.filter(name="cellxgene-census", version=census_version).one()
collection

You can count all contained artifacts or get them as a dataframe.

collection.artifacts.count()

collection.artifacts.df().head()  # not tracking run & transform because read-only instance

! no run & transform get linked, consider calling ln.context.track()

	uid	version	is_latest	description	key	suffix	type	size	hash	n_objects	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
3042	GcVBvpW5MYlrsH1izOjN	2024-07-01	True	All cells	cell-census/2024-07-01/h5ads/3dc61ca1-ce40-46b...	.h5ad	dataset	947738392	NDhyYVxRpOG6UiEkDZKswg	None	71752	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:43.667567+00:00
3587	1AeEHLQzGyRZL5nwpffu	2024-07-01	True	wilms	cell-census/2024-07-01/h5ads/ea01c125-67a7-4bd...	.h5ad	dataset	75413467	TNsJMqhUOekqUh4qtxvccA	None	4636	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:48.218901+00:00
2850	vEw6vGy47Zi0Qj6TG6l7	2024-07-01	True	Tabula Sapiens - Skin	cell-census/2024-07-01/h5ads/0041b9c3-6a49-4bf...	.h5ad	dataset	199210144	sV0vZMpxZsTXIb6qqCg8ng	None	9424	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:44.720154+00:00
3230	tggrprv4cllqGOrH8RlL	2024-07-01	True	Dissection: Amygdaloid complex (AMY) - Basolat...	cell-census/2024-07-01/h5ads/7d3ab174-e433-40f...	.h5ad	dataset	330480233	eS_gAyJD_P0oLd6IHEsPJQ	None	28984	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:46.355994+00:00
3309	RCzyhZz9tfi6YI4F7mxb	2024-07-01	True	Single cell RNA sequencing of follicular lymphoma	cell-census/2024-07-01/h5ads/99950e99-2758-41d...	.h5ad	dataset	749041844	FaUU0Z0Uk6w2oewwJq8zZg	None	137147	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:41.753173+00:00

You can query across artifacts by arbitrary metadata combinations, for instance:

query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame

	uid	version	is_latest	description	key	suffix	type	size	hash	n_objects	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
2961	WwmBIhBNLTlRcSoBDt76	2024-07-01	True	Mature kidney dataset: immune	cell-census/2024-07-01/h5ads/20d87640-4be8-487...	.h5ad	dataset	45158726	GCMHkdQSTeXxRVF7gMZFIA	None	7803	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:43.756335+00:00
2961	WwmBIhBNLTlRcSoBDt76	2024-07-01	True	Mature kidney dataset: immune	cell-census/2024-07-01/h5ads/20d87640-4be8-487...	.h5ad	dataset	45158726	GCMHkdQSTeXxRVF7gMZFIA	None	7803	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:43.756335+00:00
3000	gHlQ5Muwu3G9pvFCx3x8	2024-07-01	True	Fetal kidney dataset: immune	cell-census/2024-07-01/h5ads/2d31c0ca-0233-41c...	.h5ad	dataset	64546349	2qy8uy-65Sd_XcBU-nrPgA	None	6847	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:45.273783+00:00
3324	P4Oai3OLGAzRwoicHfLM	2024-07-01	True	Mature kidney dataset: full	cell-census/2024-07-01/h5ads/9ea768a2-87ab-46b...	.h5ad	dataset	194047623	aZVpGZwAfMCziff_5ow2bg	None	40268	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:44.478948+00:00
3324	P4Oai3OLGAzRwoicHfLM	2024-07-01	True	Mature kidney dataset: full	cell-census/2024-07-01/h5ads/9ea768a2-87ab-46b...	.h5ad	dataset	194047623	aZVpGZwAfMCziff_5ow2bg	None	40268	md5-n	AnnData	1	False	2	22	27	1	2024-07-12 12:40:44.478948+00:00
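
The table above lists two artifacts twice. A likely cause (an assumption, not stated in the source) is that filtering across several many-to-many label relations yields one row per matching combination. As a sketch, such repeats can be removed on the pandas side. The frame below is rebuilt by hand from three columns of the output above rather than fetched from the instance.

```python
import pandas as pd

# Three columns copied from the query output above, including the repeats.
rows = pd.DataFrame(
    {
        "uid": [
            "WwmBIhBNLTlRcSoBDt76",
            "WwmBIhBNLTlRcSoBDt76",
            "gHlQ5Muwu3G9pvFCx3x8",
            "P4Oai3OLGAzRwoicHfLM",
            "P4Oai3OLGAzRwoicHfLM",
        ],
        "description": [
            "Mature kidney dataset: immune",
            "Mature kidney dataset: immune",
            "Fetal kidney dataset: immune",
            "Mature kidney dataset: full",
            "Mature kidney dataset: full",
        ],
        "size": [45158726, 45158726, 64546349, 194047623, 194047623],
    }
)

# Keep the first occurrence of each artifact, preserving the size ordering.
unique_rows = rows.drop_duplicates(subset="uid").reset_index(drop=True)
print(unique_rows)
```

Django-style query sets also offer a distinct() method for deduplicating on the query side; whether it gives the intended result for this particular query is untested here.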

Slice a tiledbsoma-like artifact¶

The previous section showed how to query for AnnData objects.

This section queries “Census”, i.e., a tiledbsoma array store that concatenates many AnnData objects.

Create a query expression for a tiledbsoma array store.

value_filter = (
    f'{features.tissue} == "{tissues.brain.name}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell.name}", "{cell_types.neuron.name}"] and'
    f' {features.suspension_type} == "{suspension_types.cell.name}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter

'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
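
Writing such expressions by hand gets error-prone as the number of conditions grows. The sketch below assembles an equivalent expression from plain dictionaries. The helper build_value_filter is hypothetical and not part of LaminDB or tiledbsoma; it only supports == and in conditions joined by and, and it does not escape double quotes inside values.

```python
def build_value_filter(equals: dict[str, str], any_of: dict[str, list[str]]) -> str:
    """Join column conditions into a single `and`-combined filter expression."""
    # One `column == "value"` clause per equality condition.
    clauses = [f'{column} == "{value}"' for column, value in equals.items()]
    # One `column in ["a", "b"]` clause per membership condition.
    for column, values in any_of.items():
        quoted = ", ".join(f'"{v}"' for v in values)
        clauses.append(f"{column} in [{quoted}]")
    return " and ".join(clauses)


expr = build_value_filter(
    equals={"tissue": "brain", "suspension_type": "cell", "assay": "10x 3' v3"},
    any_of={"cell_type": ["microglial cell", "neuron"]},
)
print(expr)
```

The clauses come out in a different order than in the expression above, which does not change the result because all conditions are combined with and.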

Query for the tiledbsoma array store that contains all concatenated expression data.

census = ln.Artifact.filter(description=f"Census {census_version}").one()

Query slices within the array store. (This will run a lot faster from within the AWS us-west-2 data center.)

human = "homo_sapiens"  # subset to human data

# open the array store for queries
with census.open() as store:
    # read SOMADataFrame as a slice
    cell_metadata = store["census_data"][human].obs.read(value_filter=value_filter)
    # concatenate results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # convert to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

cell_metadata.shape

cell_metadata.head()

	soma_joinid	dataset_id	assay	assay_ontology_term_id	cell_type	cell_type_ontology_term_id	development_stage	development_stage_ontology_term_id	disease	disease_ontology_term_id	...	tissue	tissue_ontology_term_id	tissue_type	tissue_general	tissue_general_ontology_term_id	raw_sum	nnz	raw_mean_nnz	raw_variance_nnz	n_measured_vars
0	48182177	c888b684-6c51-431f-972a-6c963044cef0	10x 3' v3	EFO:0009922	microglial cell	CL:0000129	68-year-old human stage	HsapDv:0000162	glioblastoma	MONDO:0018177	...	brain	UBERON:0000955	tissue	brain	UBERON:0000955	15204.0	3959	3.840364	209.374207	27229
1	48182178	c888b684-6c51-431f-972a-6c963044cef0	10x 3' v3	EFO:0009922	microglial cell	CL:0000129	68-year-old human stage	HsapDv:0000162	glioblastoma	MONDO:0018177	...	brain	UBERON:0000955	tissue	brain	UBERON:0000955	39230.0	5885	6.666100	875.502870	27229
2	48182185	c888b684-6c51-431f-972a-6c963044cef0	10x 3' v3	EFO:0009922	microglial cell	CL:0000129	68-year-old human stage	HsapDv:0000162	glioblastoma	MONDO:0018177	...	brain	UBERON:0000955	tissue	brain	UBERON:0000955	9576.0	2738	3.497443	121.333753	27229
3	48182187	c888b684-6c51-431f-972a-6c963044cef0	10x 3' v3	EFO:0009922	microglial cell	CL:0000129	68-year-old human stage	HsapDv:0000162	glioblastoma	MONDO:0018177	...	brain	UBERON:0000955	tissue	brain	UBERON:0000955	19374.0	4096	4.729980	464.331956	27229
4	48182188	c888b684-6c51-431f-972a-6c963044cef0	10x 3' v3	EFO:0009922	microglial cell	CL:0000129	68-year-old human stage	HsapDv:0000162	glioblastoma	MONDO:0018177	...	brain	UBERON:0000955	tissue	brain	UBERON:0000955	8466.0	2477	3.417844	162.555950	27229

5 rows × 28 columns

Create an AnnData object.

from tiledbsoma import AxisQuery

with census.open() as store:
    
    experiment = store["census_data"][human]
    
    adata = experiment.axis_query(
        "RNA",
        obs_query=AxisQuery(value_filter=value_filter)
    ).to_anndata(
        X_name="raw",
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        }
    )

adata.var = adata.var.set_index("feature_id")
adata

adata.var.head()

	soma_joinid	feature_name	feature_length	nnz	n_measured_obs
feature_id
ENSG00000000003	0	TSPAN6	4530	4530448	73855064
ENSG00000000005	1	TNMD	1476	236059	61201828
ENSG00000000419	2	DPM1	9276	17576462	74159149
ENSG00000000457	3	SCYL3	6883	9117322	73988868
ENSG00000000460	4	C1orf112	5970	6287794	73636201

adata.obs.head()

	assay	cell_type	tissue	disease	suspension_type
0	10x 3' v3	microglial cell	brain	glioblastoma	cell
1	10x 3' v3	microglial cell	brain	glioblastoma	cell
2	10x 3' v3	microglial cell	brain	glioblastoma	cell
3	10x 3' v3	microglial cell	brain	glioblastoma	cell
4	10x 3' v3	microglial cell	brain	glioblastoma	cell

Train ML models¶

You can directly train ML models on very large collections of AnnData objects.

See Train a machine learning model on a collection.