Differential expression on Packer C. elegans data

This notebook was contributed by Eduardo Beltrame [@Munfred](https://github.com/Munfred) and edited by Romain Lopez with help from Adam Gayoso.

Processing and visualizing 89k cells from Packer et al. 2019 C. elegans 10xv2 single cell data

Original article: A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

https://science.sciencemag.org/content/365/6459/eaax1971.long

The anndata object we provide has 89701 cells and 20222 genes. It includes short gene descriptions from WormBase that will show up when mousing over the interactive plots.

Steps performed:

  1. Loading the data from anndata containing cell labels and gene descriptions

  2. Training the model with batch labels for integration with scVI

  3. Retrieving the scVI latent space and imputed values

  4. Visualize the latent space with an interactive UMAP plot using Plotly

  5. Perform differential expression and visualize with interactive volcano plot and heatmap using Plotly

This notebook was designed to be run in Google Colab.

Open In Colab

[ ]:
# If running in Colab, navigate to Runtime -> Change runtime type
# and ensure you're using a Python 3 runtime with GPU hardware accelerator
# Installation of scVI in Colab can take several minutes

[ ]:
import sys
IN_COLAB = "google.colab" in sys.modules

show_plot = True
save_path = "./"

if IN_COLAB:
    !pip install --quiet scvi[notebooks]==0.6.3
[4]:
import scvi
scvi.__version__
[4]:
'0.6.3'
[ ]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Control warnings
import warnings; warnings.simplefilter('ignore')

import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from scvi.dataset import GeneExpressionDataset
from scvi.models import VAE
from scvi.inference import UnsupervisedTrainer
import torch
import anndata

import plotly.express as px
import plotly.graph_objects as go

from umap import UMAP

if IN_COLAB:
    %matplotlib inline

[6]:
## Change the path where the models will be saved
save_path = "./"
vae_file_name = 'worm_vae.pkl'

if os.path.isfile('packer2019.h5ad'):
    print ("Found the data file! No need to download.")
else:
    print ("Downloading data...")
    ! wget https://github.com/Munfred/wormcells-site/releases/download/packer2019/packer2019.h5ad

Found the data file! No need to download.
[ ]:
adata = anndata.read('packer2019.h5ad')
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 89701 × 20222
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    var: 'gene_id', 'gene_name', 'gene_description'
[9]:
adata.X
[9]:
<89701x20222 sparse matrix of type '<class 'numpy.float32'>'
        with 82802059 stored elements in Compressed Sparse Column format>

Take a look at the gene descriptions

The gene descriptions were taken using the WormBase API.

[10]:
display(adata.var.head().style.set_properties(subset=['gene_description'], **{'width': '600px'}))
gene_id gene_name gene_description
index
WBGene00010957 WBGene00010957 nduo-6 Is affected by several genes including daf-16; daf-12; and hsf-1 based on RNA-seq and tiling array studies. Is affected by six chemicals including Rotenone; Psoralens; and metformin based on RNA-seq and microarray studies.
WBGene00010958 WBGene00010958 ndfl-4 Is enriched in Psub2 based on RNA-seq studies. Is affected by several genes including daf-16; daf-12; and clk-1 based on RNA-seq and microarray studies. Is affected by six chemicals including Alovudine; Psoralens; and metformin based on RNA-seq studies.
WBGene00010959 WBGene00010959 nduo-1 Is an ortholog of human MT-ND1 (mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1). Is predicted to contribute to NADH dehydrogenase activity. Human ortholog(s) of this gene are implicated in Leber hereditary optic neuropathy and MELAS syndrome.
WBGene00010960 WBGene00010960 atp-6 Is predicted to contribute to proton-transporting ATP synthase activity, rotational mechanism.
WBGene00010961 WBGene00010961 nduo-2 Is affected by several genes including hsf-1; clk-1; and elt-2 based on RNA-seq and microarray studies. Is affected by eight chemicals including stavudine; Zidovudine; and Psoralens based on RNA-seq and microarray studies.
[11]:
adata.obs.head().T
[11]:
index AAACCTGAGACAATAC-300.1.1 AAACCTGAGGGCTCTC-300.1.1 AAACCTGAGTGCGTGA-300.1.1 AAACCTGAGTTGAGTA-300.1.1 AAACCTGCAAGACGTG-300.1.1
cell AAACCTGAGACAATAC-300.1.1 AAACCTGAGGGCTCTC-300.1.1 AAACCTGAGTGCGTGA-300.1.1 AAACCTGAGTTGAGTA-300.1.1 AAACCTGCAAGACGTG-300.1.1
numi 1630 2319 3719 4251 1003
time_point 300_minutes 300_minutes 300_minutes 300_minutes 300_minutes
batch Waterston_300_minutes Waterston_300_minutes Waterston_300_minutes Waterston_300_minutes Waterston_300_minutes
size_factor 1.02319 1.45821 2.33828 2.65905 0.62961
cell_type Body_wall_muscle nan nan Body_wall_muscle Ciliated_amphid_neuron
cell_subtype BWM_head_row_1 nan nan BWM_anterior AFD
plot_cell_type BWM_head_row_1 nan nan BWM_anterior AFD
raw_embryo_time 360 260 270 260 350
embryo_time 380 220 230 280 350
embryo_time_bin 330-390 210-270 210-270 270-330 330-390
raw_embryo_time_bin 330-390 210-270 270-330 210-270 330-390
lineage MSxpappp MSxapaap nan Dxap ABalpppapav/ABpraaaapav
passed_qc True True True True True

Loading data

We load the Packer data and use the batch annotations for scVI. Here, each experiment correspond to a batch.

[12]:
gene_dataset = GeneExpressionDataset()

# we provide the `batch_indices` so that scvi can perform batch correction
gene_dataset.populate_from_data(
            adata.X,
            gene_names=adata.var.index.values,
            cell_types=adata.obs['cell_type'].values,
            batch_indices=adata.obs['batch'].cat.codes.values,
            )

# We select the 1000 most variable genes, which is a recommended selection criteria of scvi
# this method in particular is based on the method of Seurat v3
gene_dataset.subsample_genes(1000)
sel_genes = gene_dataset.gene_names

[2020-04-02 03:49:05,753] INFO - scvi.dataset.dataset | Remapping labels to [0,N]
[2020-04-02 03:49:05,759] INFO - scvi.dataset.dataset | Remapping batch_indices to [0,N]
[2020-04-02 03:49:05,765] INFO - scvi.dataset.dataset | extracting highly variable genes using seurat_v3 flavor
Transforming to str index.
[2020-04-02 03:49:56,429] INFO - scvi.dataset.dataset | Downsampling from 20222 to 1000 genes
[2020-04-02 03:49:56,775] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-04-02 03:49:57,163] INFO - scvi.dataset.dataset | Filtering non-expressing cells.
[2020-04-02 03:49:57,579] INFO - scvi.dataset.dataset | Computing the library size for the new data
[2020-04-02 03:49:57,965] INFO - scvi.dataset.dataset | Downsampled from 89701 to 89701 cells

At this point we may make the scVI dataset RNA count matrix dense, as we filtered the genes. This makes inference faster because when the matrix is sparse, the trainer has to make each minibatch dense before it loads it on the model and the GPU is using CUDA.

[13]:
gene_dataset.X = gene_dataset.X.toarray()
[2020-04-02 03:50:00,523] INFO - scvi.dataset.dataset | Computing the library size for the new data
[14]:
adata.obs['cell_type'].values
[14]:
[Body_wall_muscle, nan, nan, Body_wall_muscle, Ciliated_amphid_neuron, ..., Rectal_gland, nan, nan, nan, nan]
Length: 89701
Categories (37, object): [ABarpaaa_lineage, Arcade_cell, Body_wall_muscle, Ciliated_amphid_neuron, ...,
                          hmc_and_homolog, hmc_homolog, hyp1V_and_ant_arc_V, nan]

Training

  • n_epochs: Maximum number of epochs to train the model. If the likelihood change is small than a set threshold training will stop automatically.

  • lr: learning rate. Set to 0.001 here.

  • use_cuda: Set to true to use CUDA (GPU required)

[ ]:
# for this dataset 50 epochs is sufficient
# note that smaller datasets require hundreds of epochs
n_epochs = 50
lr = 1e-3
use_cuda = True

We now create the model and the trainer object. We train the model and output model likelihood every epoch. In order to evaluate the likelihood on a test set, we split the datasets (the current code can also so train/validation/test).

If a pre-trained model already exist in the save_path then load the same model rather than re-training it. This is particularly useful for large datasets.

[ ]:
# set the VAE to perform batch correction
vae = VAE(gene_dataset.nb_genes, n_batch=gene_dataset.n_batches)
[ ]:
# Use per minibatch warmup (iters) for large datasets
trainer = UnsupervisedTrainer(
    vae,
    gene_dataset,
    train_size=0.85, # number between 0 and 1, default 0.8
    use_cuda=use_cuda,
    frequency=1,
    n_epochs_kl_warmup=None,
    n_iter_kl_warmup=int(128*5000/400)
)
[18]:
# check if a previously trained model already exists, if yes load it

full_file_save_path = os.path.join(save_path, vae_file_name)

if os.path.isfile(full_file_save_path):
    trainer.model.load_state_dict(torch.load(full_file_save_path))
    trainer.model.eval()
else:
    trainer.train(n_epochs=n_epochs, lr=lr)
    torch.save(trainer.model.state_dict(), full_file_save_path)
[2020-04-02 03:50:07,270] INFO - scvi.inference.inference | KL warmup for 1600 iterations
training: 100%|██████████| 50/50 [08:53<00:00, 10.66s/it]

Plotting the likelihood change across training

[19]:
train_test_results = pd.DataFrame(trainer.history).rename(columns={'elbo_train_set':'Train', 'elbo_test_set':'Test'})

train_test_results
[19]:
Train Test
0 8881.413828 9174.447856
1 857.742913 865.714189
2 821.685303 829.391246
3 808.042363 816.110580
4 796.994002 804.426280
5 787.069519 795.110373
6 775.612372 782.976124
7 785.185200 792.350435
8 778.276230 786.173274
9 766.440439 773.111641
10 763.470819 770.053581
11 762.309879 769.090608
12 759.699914 766.290829
13 760.565406 767.061953
14 755.770818 762.247230
15 755.289439 761.668258
16 760.560715 767.147357
17 755.285547 761.907396
18 752.282227 758.606450
19 752.231292 758.707572
20 751.001023 757.318010
21 750.882924 757.270630
22 753.712712 760.248563
23 752.402924 759.038684
24 750.010072 756.424613
25 751.233984 757.741389
26 748.450227 754.827124
27 752.164061 758.740302
28 748.442852 754.916773
29 753.610544 760.393845
30 748.497168 755.029036
31 749.499188 756.388857
32 748.662938 755.379153
33 747.743216 754.304865
34 747.289906 753.938626
35 746.901389 753.527914
36 747.174424 753.747798
37 746.882975 753.544082
38 746.582319 753.286055
39 747.826352 754.543043
40 747.763953 754.599980
41 746.903656 753.518663
42 746.093606 752.914513
43 747.682025 754.577081
44 748.042126 754.908751
45 746.043148 752.920929
46 745.559933 752.408477
47 745.672933 752.533768
48 748.517461 755.667539
49 745.237304 752.205676
50 745.312490 752.293802
[20]:
ax = train_test_results.plot()
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
plt.show()
../_images/contributed_tutorials_scVI_DE_worm_27_0.png

Obtaining the posterior object and sample latent space

The posterior object contains a model and a gene_dataset, as well as additional arguments that for Pytorch’s DataLoader. It also comes with many methods or utilities querying the model, such as differential expression, imputation and differential analyisis.

To get an ordered output result, we might use .sequential posterior’s method which return another instance of posterior (with shallow copy of all its object references), but where the iteration is in the same ordered as its indices attribute.

[21]:
# This provides a posterior with sequential index sampling
full = trainer.create_posterior()
latent, batch_indices, labels = full.get_latent()
batch_indices = batch_indices.ravel()
latent.shape
[21]:
(89701, 10)
[22]:
# store the latent space in a new anndata object
post_adata = anndata.AnnData(X=gene_dataset.X)
post_adata.obs=adata.obs
post_adata.obsm["latent"] = latent
post_adata
[22]:
AnnData object with n_obs × n_vars = 89701 × 1000
    obs: 'cell', 'numi', 'time_point', 'batch', 'size_factor', 'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
    obsm: 'latent'

Using Plotly’s Scattergl we can easily and speedily make interactive plots with 89k cells!

[ ]:
# here's a hack to randomize categorical colors, since plotly can't do that in a straightforward manner
# we take the list of named css colors that it recognizes, and we picked a color based on the code of
# the cluster we are coloring
css_colors=[
'aliceblue','antiquewhite','aqua','aquamarine','azure','bisque','black','blanchedalmond','blue',
'blueviolet','brown','burlywood','cadetblue','chartreuse','chocolate','coral','cornflowerblue',
'crimson','cyan','darkblue','darkcyan','darkgoldenrod','darkgray','darkgrey','darkgreen','darkkhaki',
'darkmagenta','darkolivegreen','darkorange','darkorchid','darkred','darksalmon','darkseagreen',
'darkslateblue','darkslategray','darkslategrey','darkturquoise','darkviolet','deeppink','deepskyblue',
'dimgray','dimgrey','dodgerblue','firebrick','floralwhite','forestgreen','fuchsia','gainsboro','ghostwhite',
'gold','goldenrod','gray','grey','green','greenyellow','honeydew','hotpink','indianred','indigo',
'ivory','khaki','lavender','lavenderblush','lawngreen','lemonchiffon','lightblue','lightcoral','lightcyan',
'lightgoldenrodyellow','lightgray','lightgrey','lightgreen','lightpink','lightsalmon','lightseagreen',
'lightskyblue','lightslategray','lightslategrey','lightsteelblue','lightyellow','lime','limegreen','linen',
'magenta','maroon','mediumaquamarine','mediumblue','mediumorchid','mediumpurple','mediumseagreen',
'mediumslateblue','mediumspringgreen','mediumturquoise','mediumvioletred','midnightblue','mintcream',
'mistyrose','moccasin','navajowhite','navy','oldlace','olive','olivedrab','orange','orangered','orchid',
'palegoldenrod','palegreen','paleturquoise','palevioletred','papayawhip','peachpuff','peru','pink','plum'
,'powderblue','purple','red','rosybrown','royalblue','saddlebrown','salmon','sandybrown','seagreen',
'seashell','sienna','silver','skyblue','slateblue','slategray','slategrey','snow','springgreen','steelblue',
'tan','teal','thistle','tomato','turquoise','violet','wheat','white','whitesmoke','yellow','yellowgreen']

# we just repeat the list of colors a bunch of times to ensure we always have more colors than clusters
css_colors = css_colors*100

# now define a function to plot any embedding

def plot_embedding(embedding_kind,              # the embedding must be a label in the post_adata.obsm
                   adata=adata,                 # the original adata for taking the cluster labels
                   post_adata=post_adata,
                   cluster_feature ='cell_type',
                   xlabel="Dimension 1",
                   ylabel="Dimension 2",
                   plot_title="Embedding on single cell data"):

    # `cluster_feature` should be the name of one of the categorical annotation columns
    # e.g. `cell_type`, `cell_subtype`, `time_point`

    cluster_ids = adata.obs[cluster_feature].cat.codes.unique()
    id_to_cluster_map = dict( zip( adata.obs[cluster_feature].cat.codes, adata.obs[cluster_feature] ) )
    cluster_to_id_map  = dict([[v,k] for k,v in id_to_cluster_map.items()])

    fig = go.Figure()
    for _cluster_id in adata.obs[cluster_feature].cat.codes.unique():

        fig.add_trace(
            go.Scattergl(
                  x = post_adata.obsm[embedding_kind][:,0][post_adata.obs[cluster_feature].cat.codes==_cluster_id]
                , y = post_adata.obsm[embedding_kind][:,1][post_adata.obs[cluster_feature].cat.codes==_cluster_id]
                , mode='markers'
                , marker=dict(
                    # we randomize colors by starting at a random place in the list of named css colors
                    color=css_colors[_cluster_id+np.random.randint(0,len(np.unique(css_colors)))]
                    , size = 3
                        )
                , showlegend=True
                , name=id_to_cluster_map[_cluster_id]
                , hoverinfo=['name']
                )
            )

    layout={
        "title": {"text": plot_title
                  , 'x':0.5
                 }
        , 'xaxis': {'title': {"text": xlabel}}
        , 'yaxis': {'title': {"text": ylabel}}
        , "height": 800
        , "width":1000
    }
    fig.update_layout(layout)
    fig.update_layout(showlegend=True)
    return fig
[ ]:
latent_umap = UMAP(spread=2).fit_transform(latent)
[ ]:
post_adata.obsm["UMAP"] = np.array(latent_umap)
[ ]:
fig = plot_embedding(embedding_kind='UMAP',
               cluster_feature ='cell_type',
               xlabel="UMAP 1",
               ylabel="UMAP 2",
               plot_title="UMAP on scVI latent space for Packer C. elegans single cell data")
[27]:
# uncomment this line to save an interactive html plot, in case it is not rendering
#fig.write_html('worms_interactive_tsne.html')
fig.show()