SC-ATACSEQ-EXPLORER PIPELINE

This single-cell ATAC-seq analysis pipeline is designed for integrative analysis of datasets initially processed by 10x Genomics Cell Ranger ATAC.

In addition to the 10x Genomics results, it offers:

  • Flexible and clear data preprocessing and normalization methods
  • Per-condition summaries for aggregated datasets
  • Different types of clustering followed by heatmap and t-SNE visualizations in a low-dimensional space
  • Top cluster markers visualization on heatmap / t-SNE plots
  • Closest-gene annotations for peaks and clusters
  • Annotated markers analysis
  • BigWig and BED files for clusters and markers, ready to be visualized in the JBR Genome Browser by JetBrains Research
  • Data preparation for the single cell explorer by Artyomov Lab, Washington University in St. Louis
  • All figures saved in PDF format, ready for publication

Required 10x Genomics Cell Ranger ATAC files:

  • fragments.tsv - fragments file provided by Cell Ranger
  • peaks.bed - peaks file (union of peaks in case of merged datasets)
  • clusters.csv - graph clustering results after median normalization, IDF scaling, SVD projection and L2 normalization
  • projection.csv - t-SNE 2D projection of all the cells

Pipeline steps:

  • Cell Calling
  • Peak barcode matrix
  • UMI normalization
  • Feature selection
  • Dimensionality reduction
  • t-SNE
  • Differential markers analysis
  • Supervised annotation of clusters by gene markers

Questions and comments are welcome at oleg dot shpynov at jetbrains dot com.

In [1]:
# Configuration
FRAGMENTS_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/BAAJ_Spleen/outs/fragments.tsv.gz'
# PEAKS_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/union_peaks.bed'
PEAKS_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/merged_200_1.0E-10_2.peak.0E-10_2.peak'
CLUSTERS_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/BAAJ_Spleen/outs/analysis/clustering/graphclust/clusters.csv'
TSNE_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/BAAJ_Spleen/outs/analysis/tsne/2_components/projection.csv'

# ! wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz
# ! gunzip mm10.blacklist.bed.gz
BLACKLIST_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/mm10.blacklist.bed'
# ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M22/gencode.vM22.annotation.gtf.gz 
# ! gunzip gencode.vM22.annotation.gtf.gz
GTF_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/gencode.vM22.annotation.gtf'

# representative DNase hypersensitivity sites for mm10 https://www.encodeproject.org/annotations/ENCSR489XTV/
DNASE_FILE = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/ENCFF108APF.bed'

FACTOR = 'age'
FACTORS_MAP = {1: 'old', 2: 'young'}
FACTOR_FUNCTION = lambda x: 1 if x.endswith('-1') else 2

OUTPUT_DIR = '/mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline'
In [2]:
MARKERS_REGIONS = {
    'GZMK classic promoter': 'chr13:113180223-113181928',
    'GZMK alternative promoter': 'chr13:113182148-113184892',
    'CD68 promoter': 'chr11:69665600-69667000',
    'Cd3d promoter': 'chr9:44981200-44982800',
    'CD19 promoter': 'chr7:126414200-126415200',
    'NCR1 promoter': 'chr7:4337400-4337800',
}
In [3]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import glob
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import re
import seaborn as sns
sns.set_style("whitegrid")

from pybedtools import BedTool
from tqdm.auto import tqdm
from scipy.stats import zscore
from matplotlib.backends.backend_pdf import PdfPages

# Fix random seed
np.random.seed(0)

figures_path = os.path.join(OUTPUT_DIR, 'figures')
! mkdir -p {figures_path}

Peaks file stats

In [4]:
with PdfPages(os.path.join(figures_path, 'peaks_length.pdf')) as pdf:
    peaks_df = pd.read_csv(PEAKS_FILE, sep='\t', header=None)
    print('Peaks', len(peaks_df))
    sns.distplot(np.log10(peaks_df[2] - peaks_df[1]))
    plt.title('Peak length distribution')
    plt.xlabel('Log10 Length')
    plt.ylabel('Frequency')
    pdf.savefig()
Peaks 80914

Overlap of peaks with representative DNase hypersensitive sites

In [5]:
dnase = BedTool(DNASE_FILE)
peaks_file = BedTool(PEAKS_FILE)
peak_count = peaks_file.count()
overlap = peaks_file.intersect(dnase, wa=True, u=True).count()
peak_by_dnase = 0 if overlap == 0 else overlap * 100.0 / peak_count
overlap = dnase.intersect(peaks_file, wa=True, u=True).count()
dnase_by_peak = overlap * 100.0 / dnase.count()
print('Fraction of peaks overlapping with representative DNase', int(peak_by_dnase), '%')
print('Fraction of representative DNase overlapping with peaks', int(dnase_by_peak), '%')
Fraction of peaks overlapping with representative DNase 93 %
Fraction of representative DNase overlapping with peaks 11 %

Cell Calling

Pipeline uses noise_threshold and duplets_threshold (cutoffs on per-barcode fragment counts: below the first a barcode is treated as background noise, above the second as a likely doublet).


Cell Ranger ATAC aggregation: when combining data from multiple GEM groups, the cellranger-atac aggr pipeline automatically equalizes the sensitivity of the groups before merging, which is the recommended approach in order to avoid the batch effect introduced by sequencing depth. Default method: Subsample fragments from higher-depth GEM wells until they all have an equal number of unique fragments per cell.
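
A minimal sketch of this depth-equalization idea (not part of this notebook; equalize_depth and its inputs are hypothetical), assuming a per-well table of unique fragments per cell barcode:

import numpy as np
import pandas as pd

def equalize_depth(counts_per_well, rng=np.random.default_rng(0)):
    """counts_per_well: dict mapping GEM well -> Series of unique fragments per barcode."""
    # Target the lowest median unique-fragment count across wells
    target = min(s.median() for s in counts_per_well.values())
    subsampled = {}
    for well, s in counts_per_well.items():
        rate = min(1.0, target / s.median())
        # Binomial thinning approximates random subsampling of fragments
        subsampled[well] = pd.Series(rng.binomial(s.astype(int), rate), index=s.index)
    return subsampled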

For each barcode:

  • We have the record of mapped high-quality fragments that passed all filters (the fragments.tsv file).
  • Having determined peaks prior to this, we use the number of fragments that overlap any peak regions.
  • We separate the signal from the noise.

This works better in practice than naively using the number of fragments per barcode. We first subtract a depth-dependent fixed count from all barcode counts to model whitelist contamination. This fixed count is the estimated number of fragments per barcode that originated from a different GEM, assuming a contamination rate of 0.02. Then we fit a mixture model of two negative binomial distributions to capture the signal and noise. Setting an odds ratio of 1000, we separate the barcodes that correspond to real cells from the non-cell barcodes.
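
Cell Ranger's negative binomial mixture fit is internal to the tool; as an illustrative stand-in (a Gaussian mixture on log10 counts, a related but different technique; call_cells_gmm is hypothetical), the separation step might be sketched as:

import numpy as np
from sklearn.mixture import GaussianMixture

def call_cells_gmm(counts, contamination=0.02, odds_ratio=1000):
    counts = np.asarray(counts, dtype=float)
    # Subtract a fixed contamination estimate, as described above (assumed form)
    adjusted = np.maximum(counts - contamination * np.median(counts), 1.0)
    x = np.log10(adjusted).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    signal = int(np.argmax(gmm.means_.ravel()))  # higher-mean component = cells
    probs = gmm.predict_proba(x)
    # Keep barcodes whose posterior odds favor the signal component
    return probs[:, signal] / np.maximum(probs[:, 1 - signal], 1e-12) > odds_ratio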

Scasat If a cell has open peaks below a user-defined threshold (default: 50 peaks) in the peak accessibility matrix, we remove that cell from subsequent analysis. Also, peaks not observed across a user-defined number of valid cells (default: 10 cells) are not considered for further downstream analysis.

SnapATAC takes an alternative approach: instead of pre-defined peaks, it segments the genome into uniform bins at 1 kb, 5 kb and 10 kb resolution.
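
A hedged sketch of such cell-by-bin counting (cell_by_bin is hypothetical; assumes a fragments dataframe with 'chr', 'start' and 'barcode' columns like the one loaded later in this notebook):

import pandas as pd

def cell_by_bin(fragments, bin_size=5000):
    # Assign each fragment start to a fixed-width genomic bin
    bins = fragments['chr'] + ':' + (fragments['start'] // bin_size * bin_size).astype(str)
    counts = (pd.DataFrame({'bin': bins, 'barcode': fragments['barcode']})
                .groupby(['bin', 'barcode']).size().rename('count').reset_index())
    # Pivot to a (bins x barcodes) matrix and binarize, as SnapATAC does
    matrix = counts.pivot(index='bin', columns='barcode', values='count').fillna(0)
    return (matrix > 0).astype(int)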

Existing computational methods rely on pre-defined regions of transposase accessibility identified from the aggregate signals. ...

Limitations:

  1. They require a sufficient number of single-cell profiles to create a robust aggregate signal for peak calling.
  2. The cell type identification is biased toward the most abundant cell types in the tissue.
  3. These techniques lack the ability to reveal regulatory elements in the rare cell populations which are underrepresented in the aggregate signal. This concern is critical, for example, in brain tissue, where key neuron types may represent less than 1% of all cells while still playing a critical role in the neural circuit.
In [6]:
import re

# Fragments file provided by cell ranger
fragments_file_gz = FRAGMENTS_FILE

# BedTools doesn't work with gz files, so unzip it
!gunzip {fragments_file_gz}
fragments_file = re.sub(r'\.gz$', '', fragments_file_gz)
gzip: /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/BAAJ_Spleen/outs/fragments.tsv.gz: No such file or directory
In [ ]:
# Intersect using bedtools rather than pybedtools, because pybedtools is too slow for files of this size!
import tempfile
import pandas as pd
import numpy as np
import os

def intersect_fragments_and_peaks(fragments_file, peaks_file, blacklist_file):
    print('Fragments')
    ! wc -l {fragments_file}
    idf = None
    print('Blacklist')
    ! wc -l {blacklist_file}
    print('Peaks')
    ! wc -l {peaks_file}

    with tempfile.TemporaryDirectory(prefix='pipeline') as td:
        print('Filtering out non-standard chromosomes')
        chr_filtered_file = os.path.join(td, 'chromosome_filtered.bed')
        ! cat {fragments_file} | grep -i -E 'chr[0-9mt]+[^_]' > {chr_filtered_file}
        ! wc -l {chr_filtered_file}

        print('Blacklist regions filtration')
        blacklist_filtered_file = os.path.join(td, 'blacklist_filtered.bed')
        ! bedtools intersect -v -a {chr_filtered_file} -b {blacklist_file} > {blacklist_filtered_file}
        ! wc -l {blacklist_filtered_file}

        print('Fragments and peaks intersection')
        intersection_file = os.path.join(td, 'intersection.bed')
        ! bedtools intersect -wa -wb -a {blacklist_filtered_file} -b {peaks_file} > {intersection_file}
        ! wc -l {intersection_file}

        idf = pd.read_csv(intersection_file, sep='\t', header=None)
        icolnames = ['chr', 'start', 'end', 'barcode', 'reads', 'peak_chr', 'peak_start', 'peak_end']
        assert len(idf.columns) >= len(icolnames)
        idf.rename(columns={o: n for (o, n) in zip(idf.columns[:len(icolnames)], icolnames)}, inplace=True)
        idf = idf[icolnames] # Reorder and drop others
    print('Done intersecting fragments and peaks')
    return idf

idf = intersect_fragments_and_peaks(fragments_file, PEAKS_FILE, BLACKLIST_FILE)
Fragments
117241325 /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/BAAJ_Spleen/outs/fragments.tsv
Blacklist
164 /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/mm10.blacklist.bed
Peaks
80914 /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/merged_200_1.0E-10_2.peak.0E-10_2.peak
Filtering out non-standard chromosomes
114810913 /tmp/pipelinecihnz2n4/chromosome_filtered.bed
Blacklist regions filtration
114592939 /tmp/pipelinecihnz2n4/blacklist_filtered.bed
Fragments and peaks intersection
In [ ]:
def filter_cells(idf, noise_threshold=None, duplets_threshold=None):
    # 10x Genomics Cell Ranger ATAC-Seq marks duplicates
    # 'count' ignores multiple reads per barcode at same position
    pidf = pd.pivot_table(idf, values='reads', index=['barcode'], aggfunc='count')
    pidf.reset_index(level=0, inplace=True)
    
    print('Total barcodes', len(pidf))    
    plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
    plt.subplot(1, 2, 1)

    # n is for shoulder graph
    counts = sorted(pidf['reads'], reverse=True)
    df_counts = pd.DataFrame(data={'count': counts, 'n': range(1, len(counts) + 1)})
    cells_filter = [True] * len(df_counts)
    
    if noise_threshold is not None:
        noise_filter = df_counts['count'] <= noise_threshold
        cells_filter = np.logical_and(cells_filter, np.logical_not(noise_filter))
        df_noise = df_counts.loc[noise_filter]
        print('Noise', len(df_noise))
        plt.plot(np.log10(df_noise['n']), 
                 np.log10(df_noise['count']), 
                 label='Noise', linewidth=3, color='red')        

    if duplets_threshold is not None:
        duplets_filter = df_counts['count'] >= duplets_threshold
        cells_filter = np.logical_and(cells_filter, np.logical_not(duplets_filter))
        df_duplets = df_counts.loc[duplets_filter]
        print('Duplets', len(df_duplets))
        plt.plot(np.log10(df_duplets['n']), 
                 np.log10(df_duplets['count']), 
                 label='Duplets', linewidth=3, color='orange')        

    df_cells = df_counts.loc[cells_filter]

    cells_filter = [True] * len(pidf)
    if noise_threshold is not None:
        cells_filter = np.logical_and(cells_filter, pidf['reads'] > noise_threshold)
    if duplets_threshold is not None:
        cells_filter = np.logical_and(cells_filter, pidf['reads'] < duplets_threshold)
        
    cells = set(pidf.loc[cells_filter]['barcode'])
    idfcells = idf.loc[[c in cells for c in idf['barcode']]]
    print('Estimated number of cells', len(cells))
    plt.title('Cell calling graph')
    plt.plot(np.log10(df_cells['n']), 
             np.log10(df_cells['count']), 
             label='Cell', linewidth=3, color='green')

    plt.xlabel('Log10 number of barcodes')
    plt.ylabel('Log10 fragments overlapping Peaks')
    plt.legend(loc='upper right')

    plt.subplot(1, 2, 2)
    sns.distplot(np.log10(df_cells['count']))
    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')
    
    return idfcells
In [ ]:
with PdfPages(os.path.join(figures_path, 'cell_calling.pdf')) as pdf:
    idfcells = filter_cells(idf, noise_threshold=200, duplets_threshold=8000)
    pdf.savefig()

Peak-Barcode Matrix

Pipeline Computes the number of distinct UMIs that intersect each peak; multiple copies of a UMI are ignored.


Cell Ranger ATAC-Seq We produce a count matrix consisting of the counts of fragment ends (or cut sites) within each peak region for each barcode. This is the raw peak-barcode matrix and it captures the enrichment of open chromatin per barcode. The matrix is then filtered to consist of only cell barcodes, which is then used in subsequent analysis such as dimensionality reduction, clustering and visualization.

A barcoded fragment may get sequenced multiple times due to PCR amplification. We mark duplicates in order to identify the original fragments that constitute the library and contribute to its complexity.
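
The duplicate marking itself happens upstream in Cell Ranger; a minimal sketch of the equivalent collapse on a fragments table (deduplicate_fragments is hypothetical; assumes 'chr', 'start', 'end', 'barcode' columns):

import pandas as pd

def deduplicate_fragments(fragments):
    # Identical cut-site pairs from the same barcode are PCR copies of one fragment
    return fragments.drop_duplicates(subset=['chr', 'start', 'end', 'barcode'])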

In [ ]:
print('Barcode vs summary fragments overlap with peaks')
# 10x Genomics Cell Ranger ATAC-Seq marks duplicates
# 'count' ignores multiple reads per barcode at same position
fulldf = pd.pivot_table(idfcells, values='reads', 
                        index=['peak_chr', 'peak_start', 'peak_end', 'barcode'], 
                        aggfunc='count').reset_index()

fulldf['peak'] = fulldf['peak_chr'] + ':' + \
    fulldf['peak_start'].astype(str) + '-' + fulldf['peak_end'].astype(str)
fulldf.drop(columns=['peak_chr', 'peak_start', 'peak_end'], inplace=True)
# display(fulldf.head())
In [ ]:
print('Transforming dataframe to peaks x barcode format')
fulldf = pd.pivot_table(fulldf, index='peak', columns='barcode', values='reads').fillna(0)
# Remove extra labels from pivot_table columns
fulldf.columns = fulldf.columns.values
fulldf.index.name = None
# display(fulldf.head())
In [ ]:
with PdfPages(os.path.join(figures_path, 'umis.pdf')) as pdf:
    print('Summary UMI distribution per factor')
    factors = [FACTOR_FUNCTION(bc) for bc in fulldf.columns]
    for k, v in tqdm(FACTORS_MAP.items()):
        sns.kdeplot(np.log10(fulldf[fulldf.columns[np.flatnonzero(np.equal(factors, k))]].sum()), label=v)
    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')
    pdf.savefig()
In [48]:
print('Cells per peak with non-zero coverage distribution')
cells_per_peak = np.count_nonzero(fulldf, axis=1)
with PdfPages(os.path.join(figures_path, 'cells_per_peaks.pdf')) as pdf:
    sns.distplot(np.log10(cells_per_peak))
    plt.title('Cells with coverage > 0 per peak')
    plt.xlabel('Log10 cells')
    plt.ylabel('Frequency')
    pdf.savefig()
Cells per peak with non-zero coverage distribution
In [49]:
print('Peaks per cell with non-zero coverage distribution')
peaks_per_cell = np.count_nonzero(fulldf, axis=0)
with PdfPages(os.path.join(figures_path, 'peaks_per_cell.pdf')) as pdf:
    sns.distplot(np.log10(peaks_per_cell))
    plt.title('Peaks with coverage > 0 per cell')
    plt.xlabel('Log10 peaks')
    plt.ylabel('Frequency')
    pdf.savefig()
Peaks per cell with non-zero coverage distribution

Normalization

Pipeline

  • Global scaling to median coverage of UMI for coverage reporting
  • Peak length normalization for cluster analysis

Cell Ranger uses normalization to the median UMI coverage depth per barcode - we do the same here.

Seurat we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
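
A minimal NumPy rendering of that description (log_normalize is hypothetical, not Seurat's code), assuming a (peaks x barcodes) matrix:

import numpy as np

def log_normalize(m, scale_factor=10_000):
    # Scale each cell by its total counts, multiply by the scale factor, log-transform
    return np.log1p(m / m.sum(axis=0) * scale_factor)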

SnapATAC does not require population-level peak annotation, and instead assembles chromatin landscapes by directly clustering cells based on the similarity of their genome-wide accessibility profile. Using a regression-based normalization procedure, SnapATAC adjusts for differing read depth between cells.
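
The diffexp module imported below is local to this pipeline; a plausible sketch of what estimate_size_factors computes (an assumption about its behavior, in the spirit of median-depth normalization):

import numpy as np

def estimate_size_factors_sketch(m):
    # m is (peaks x barcodes); one size factor per barcode (column)
    totals = m.sum(axis=0)
    return totals / np.median(totals)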

In [50]:
from diffexp import estimate_size_factors

print('UMI normalization by median fragments count per barcode')
# (peaks x barcodes) format
size_factors = estimate_size_factors(fulldf.values)
normdf = fulldf / size_factors # per column, i.e. barcode
UMI normalization by median fragments count per barcode
In [51]:
print('Peak length normalization')
lengths = [int(re.split(':|-', p)[2]) - int(re.split(':|-', p)[1]) for p in fulldf.index]
rpkdf = fulldf.divide(lengths, axis=0) * 1000
Peak length normalization

Feature selection (Identification of highly variable features)

Pipeline

  • peaks with mean coverage above the 99th percentile are removed as likely alignment errors or housekeeping genes
  • peaks with standard deviation below the 1st percentile are removed as noise

Cell Ranger ATAC removes features with zero variance.

SnapATAC The vast majority of the items in the cell-by-bin count matrix is “0”, some items have abnormally high coverage (often > 200) perhaps due to alignment error. Bins of exceedingly high coverage which likely represent the genomic regions that are invariable between cells such as housekeeping gene promoters were removed. We noticed that filtering bins of extremely low coverage perhaps due to random noise can also improve the robustness of the downstream clustering analysis.

  • Calculated the coverage of each bin using the binary matrix
  • Removed the top 0.1% of bins with the highest coverage
  • Normalized the coverage by log10(count + 1)
  • The log-scaled coverage approximately obeys a Gaussian distribution, which is then converted into a z-score
  • Bins with a z-score beyond ±2 were filtered out before further analysis

Signac The largely binary nature of scATAC-seq data makes it challenging to perform ‘variable’ feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in less than n cells with the FindTopFeatures function. Here, we will use all features, though we note that we see very similar results when using only a subset of features (try setting min.cutoff to ‘q75’ to use the top 25% of all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as VariableFeatures for the Seurat object by this function. See the explanation at: https://satijalab.org/signac/articles/pbmc_vignette.html

Seurat calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

In [52]:
def feature_selection(df, stdp=1, meanp=99):
    # df is (peaks x barcodes)
    means = df.T.mean()
    stds = df.T.std()
    print('Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.')
    means_high = np.percentile(means, meanp)
    print('Feature selection: filter out non-variable peaks.')
    stds_low = np.percentile(stds, stdp)
    peaks_filter = np.logical_and(stds_low < stds, means < means_high)
    return peaks_filter
In [53]:
# Normalization to peak length to take into account signal strength, not total coverage
with PdfPages(os.path.join(figures_path, 'feature_selection_before.pdf')) as pdf:
    # df is (peaks x barcodes)
    sns.jointplot(x='mean', y='std', data=pd.DataFrame({'mean': rpkdf.T.mean(), 'std': rpkdf.T.std()}))
    pdf.savefig()
    plt.show()

with PdfPages(os.path.join(figures_path, 'feature_selection_after.pdf')) as pdf:
    peaks_filter = feature_selection(rpkdf)
    sns.jointplot(x='mean', y='std', data=pd.DataFrame({'mean': rpkdf.loc[peaks_filter].T.mean(), 
                                                        'std': rpkdf.loc[peaks_filter].T.std()}))
    pdf.savefig()
    plt.show()

print('Total peaks', len(rpkdf))
print('Filtered peaks', sum(peaks_filter))
Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.
Feature selection: filter out non-variable peaks.
Total peaks 79194
Filtered peaks 77610

Dimensionality Reduction, Clustering and t-SNE Projection

Pipeline Uses the Cell Ranger ATAC approach and IRLBA code for dimensionality reduction.

In [54]:
def plot_colored(data, c,
                 clusters=False, show_centers=False, 
                 size=10, alpha=0.5, no_ticks=True):
    """ Plot colored scatterplot """
    cmap = plt.cm.get_cmap('jet', len(set(c))) if clusters else plt.cm.viridis
    fig, ax = plt.subplots(1, figsize=(10, 8))
    sc = plt.scatter(data[:, 0], data[:, 1], 
                     c=c, cmap=cmap,
                     marker='o',
                     alpha=alpha,
                     s=size)
    if no_ticks:
        plt.setp(ax, xticks=[], yticks=[])
    if clusters and show_centers:
        # For each cluster, we add a text with name
        for cluster in set(c):
            cluster_mean = data[np.flatnonzero(c == cluster), :].mean(axis=0)
            plt.text(cluster_mean[0], cluster_mean[1], str(cluster), 
                     horizontalalignment='center', size='large', color='black', weight='bold')            
    if clusters:
        cbar = plt.colorbar(boundaries=np.arange(len(set(c)) + 1) + min(c) - 0.5)
        cbar.set_ticks(np.arange(len(set(c))) + min(c))
        cbar.set_ticklabels(list(set(c)))
    else:
        plt.colorbar()

Check that markers of interest were not filtered out

In [131]:
print('Check marker regions of interest')
peaks = fulldf.index[np.flatnonzero(peaks_filter)]
marker_regions_peaks = {}
for k, r in MARKERS_REGIONS.items():
    cr, sr, er = re.split(':|-', r)
    tp = [(c,s,e) for (c,s,e) in map(lambda p: re.split(':|-', p), peaks) 
        if c == cr and not (int(er) < int(s) or int(e) < int(sr))]
    assert tp, f'{k} not found'
    assert len(tp) == 1, f'Too many peaks for region {k}'
    cp, sp, ep = tp[0]
    marker_regions_peaks[k] = peak = f'{cp}:{sp}-{ep}'
    print('Region', k, 'Peak', peak)
Check marker regions of interest
Region GZMK classic promoter Peak chr13:113180800-113181000
Region GZMK alternative promoter Peak chr13:113183200-113183800
Region CD68 promoter Peak chr11:69665600-69667000
Region Cd3d promoter Peak chr9:44981200-44982800
Region CD19 promoter Peak chr7:126414200-126415200
Region NCR1 promoter Peak chr7:4337400-4337800

PCA (median and log scaling + classic algorithm)

  1. UMI normalization
  2. Log scale
  3. IRLBA SVD decomposition to dim 10 instead of PCA
In [56]:
from sklearn.decomposition import PCA

print('PCA on selected feature peaks vs log10 summary coverage')
# Number of dimensions recommended by Cell Ranger ATAC-Seq algorithm
pca = PCA(n_components=15)
t = np.log1p(normdf.loc[peaks_filter].T) # (n_samples x n_features)
result_pca = pca.fit_transform(t)
print('Explained variation', np.sum(pca.explained_variance_ratio_))
with PdfPages(os.path.join(figures_path, 'pca.pdf')) as pdf:
    plot_colored(result_pca, np.log10(normdf.sum()), no_ticks=False)
    pdf.savefig()
PCA on selected feature peaks vs log10 summary coverage
Explained variation 0.016781439160583392

LSA (IDF + IRLBA + L2 normalization)

Cell Ranger ATAC default method is LSA.

  1. IDF scaling
  2. IRLBA SVD decomposition to dim 15
  3. L2 normalization in low dimension projection
  4. Graph clustering
  5. T-SNE visualization

We found that the combination of these normalization techniques obviates the need to remove the first component. The number of dimensions is fixed to 15 as it was found to sufficiently separate clusters visually and in a biologically meaningful way when tested on peripheral blood mononuclear cells (PBMCs).

In [57]:
from sklearn.preprocessing import Normalizer
from irlb import irlb

def lsa(df, dim):
    print('Normalizing by IDF')
    idf = np.log1p(df.shape[1]) - np.log1p(np.count_nonzero(df, axis=1))
    df_idf = df.multiply(idf, axis=0).T # Transpose to (barcodes x peaks) format
    
    print('Performing IRLBA without scaling or centering')
    # Number of dimensions recommended by Cell Ranger ATAC-Seq algorithm
    (_, _, v, _, _) = irlb(df_idf, dim)
    # project the matrix to complete the transform: X --> X*v = u*d
    df_idf_irlb = df_idf.dot(v) 
    assert df_idf_irlb.shape == (df_idf.shape[0], dim), 'check dimensions'

    # IMPORTANT! We found that the combination of these normalization techniques 
    # obviates the need to remove the first component.
    print('Normalizing each barcode data point to unit L2-norm in the lower-dimensional space')
    idf_irlb_l2 = Normalizer(norm='l2').fit_transform(df_idf_irlb) # (n_samples, n_features) 
    return pd.DataFrame(idf_irlb_l2, index=df_idf_irlb.index, columns=df_idf_irlb.columns)

# NO UMI normalization required for the LSA approach!
# IDF is invariant to scales; IRLBA doesn't need scaling.
# Normalization to peak length takes into account signal strength, not total coverage
result_lsa = lsa(rpkdf.loc[peaks_filter], 15)
Normalizing by IDF
Performing IRLBA without scaling or centering
Normalizing each barcode data point to unit L2-norm in the lower-dimensional space

UMAP for dimensionality reduction

In [58]:
# UMAP visualization
import umap
umap_coords = pd.DataFrame(umap.UMAP().fit_transform(result_lsa.values)[:, :2], 
                           index=result_lsa.index,
                           columns=['umap1', 'umap2'])

print('Saving UMAP coordinates to umap.tsv')
umap_coords.to_csv(os.path.join(OUTPUT_DIR, 'umap.tsv'), sep='\t')
Saving UMAP coordinates to umap.tsv
In [59]:
print('UMAP on LSA normalized vs log10 coverage depth')
with PdfPages(os.path.join(figures_path, 'umap.pdf')) as pdf:
    plot_colored(umap_coords.values, np.log10(fulldf.sum()), size=20)
    pdf.savefig()
UMAP on LSA normalized vs log10 coverage depth
In [60]:
print('UMAP on LSA normalized vs', FACTOR, FACTORS_MAP)
with PdfPages(os.path.join(figures_path, f'umap_{FACTOR}.pdf')) as pdf:
    factors = [FACTOR_FUNCTION(bc) for bc in result_lsa.index]
    plot_colored(umap_coords.values, factors, clusters=True, size=20, alpha=0.1)
    pdf.savefig()
UMAP on LSA normalized vs age {1: 'old', 2: 'young'}
In [61]:
def n_rows_columns(n, maxcols=5):
    ncols = max([r for r in range(1, min(maxcols, math.ceil(math.sqrt(n)) + 1)) if n % r == 0])
    nrows = int(n / ncols)
    return nrows, ncols

    
def plot_colored_split(data, c, size=10, alpha=0.5, contrast = 100):
    """ Plot colored scatterplot split by factors"""
    cmap = plt.cm.get_cmap('jet', len(set(c)))
    nrows, ncols = n_rows_columns(len(set(c)))
    plt.subplots(nrows=nrows, ncols=ncols, figsize=(5 * ncols, 5 * nrows))
    xmin, xmax = np.min(data[:,0]), np.max(data[:,0])
    xlim = (xmin - 0.1 * (xmax - xmin), xmax + 0.1 * (xmax - xmin)) # +10% margin
    ymin, ymax = np.min(data[:,1]), np.max(data[:,1])
    ylim = (ymin - 0.1 * (ymax - ymin), ymax + 0.1 * (ymax - ymin)) # +10% margin
    for i, f in enumerate(tqdm(set(c))):
        ax = plt.subplot(nrows, ncols, i + 1)
        plt.setp(ax, xticks=[], yticks=[])
        plt.xlim(xlim)
        plt.ylim(ylim)
        plt.title(str(f))
        # Plot slightly visible other as gray
        nfdata = data[np.flatnonzero(np.logical_not(np.equal(c, f))), :]
        plt.scatter(nfdata[:,0], nfdata[:,1], 
            color='gray',
            marker='o',
            alpha=alpha / contrast,
            s=size)
        # And factor data
        fdata = data[np.flatnonzero(np.equal(c, f)), :]        
        plt.scatter(fdata[:,0], fdata[:,1], 
                    color=cmap(i),
                    marker='o',
                    alpha=alpha,
                    s=size)
In [62]:
with PdfPages(os.path.join(figures_path, f'umap_{FACTOR}_split.pdf')) as pdf:
    plot_colored_split(umap_coords.values, factors, size=5, alpha=0.1)
    pdf.savefig()

Graph clustering of LSA normalized data

Clustering should:

  • produce enough clusters - 10-15 are recommended by Cell Ranger
  • avoid clusters that are too small - clusters with fewer than 100 cells are likely artifacts
  • place cluster centers inside their clusters - otherwise we should increase the number of clusters OR the dimensionality of the low-dimensional space
In [63]:
from collections import Counter
from sklearn.cluster import AgglomerativeClustering

def graph_clustering(result, n_clusters):
    # Ward method as default
    print('Clustering', n_clusters)
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(result).astype(int)
    cluster_counter = Counter()
    for c in clusters:
        cluster_counter[c] += 1

    # reorder clusters by size descending
    clusters_reord = np.zeros(len(clusters), dtype=int)    
    for i, (c, n) in enumerate(cluster_counter.most_common()):
        clusters_reord[clusters == c] = i

    return clusters_reord

# 10-15 clusters is the range recommended by Cell Ranger ATAC-Seq
n_clusters = 15
clusters = graph_clustering(result_lsa, n_clusters)
print(f'Saving clusters to clusters{n_clusters}.tsv')
clusters_df = pd.DataFrame({'cluster': clusters}, index=result_lsa.index)
clusters_df.to_csv(os.path.join(OUTPUT_DIR, f'clusters{n_clusters}.tsv'))
Clustering 15
Saving clusters to clusters15.tsv
In [140]:
print('QC clustermap of normalized LSA coordinates and clusters')
n = 100
print('Sampling', n, 'cells from each cluster')
ts = []
for cluster in tqdm(range(n_clusters)):
    t = result_lsa.loc[clusters == cluster].sample(n=n, axis=0, replace=True).copy()
    t['cluster'] = cluster / n_clusters # Normalize cluster number to 0-1
    ts.append(t)

with PdfPages(os.path.join(figures_path, 'cluster_lsa.pdf')) as pdf:
    sns.clustermap(pd.concat(ts), col_cluster=False, row_cluster=False, figsize=(8, 8))
    pdf.savefig()
QC clustermap of normalized LSA coordinates and clusters
Sampling 100 cells from each cluster

In [65]:
def clusters_sizes(df, value):
    df_sum = df[['cluster', value]].groupby(['cluster']).sum().reset_index()
    plt.figure(figsize=(10, 4))
    cmap=plt.cm.get_cmap('jet', len(df_sum))
    sns.barplot(x=df_sum.index, y=df_sum[value],
                palette=[cmap(i) for i in range(len(df_sum))])


def clusters_factors(df, value):
    df_sum = df[['cluster', value]].groupby(['cluster']).sum().reset_index()    
    df_sum_factor = df[['cluster', FACTOR, value]].groupby(['cluster', FACTOR]).sum().reset_index()
    plt.figure(figsize=(10, 4))
    cmap=plt.cm.get_cmap('jet', len(FACTORS_MAP))
    
    # Plot 1 - background - "total" 
    sns.barplot(x = df_sum.index, y = np.ones(len(df_sum)) * 100, color = cmap(0), alpha=0.5)
    
    percentages = np.zeros(len(df_sum))
    for i, (k,v) in enumerate(FACTORS_MAP.items()):
        if i > 0:
            percentages += np.array(df_sum_factor.loc[df_sum_factor[FACTOR] == k][value]) / np.array(df_sum[value]) * 100.0
            # Plot N - overlay - "bottom" series
            bottom_plot = sns.barplot(x=df_sum.index, y=percentages, color = cmap(i), alpha=0.5)


    legend_colors = [plt.Rectangle((0,0),1,1, fc=cmap(i), edgecolor = 'none') for i in range(len(FACTORS_MAP))]
    legend_names = [v for _,v in FACTORS_MAP.items()]
    l = plt.legend(legend_colors, legend_names, prop={'size': 16})
    l.draw_frame(False)

    #Optional code - Make plot look nicer
    sns.despine(left=True)
    bottom_plot.set_ylabel('Cluster composition ' + FACTOR + ' %')
    bottom_plot.set_xlabel('Cluster')
In [66]:
print('Computing summary for clusters')
t = fulldf.T.copy()
t[FACTOR] = [FACTOR_FUNCTION(bc) for bc in t.index]
t['cluster'] = clusters
t['counts'] = np.ones(len(t))

print('Summary by', FACTOR, FACTORS_MAP)
display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())

with PdfPages(os.path.join(figures_path, 'clusters_sizes.pdf')) as pdf:
    clusters_sizes(t, 'counts')
    pdf.savefig()

with PdfPages(os.path.join(figures_path, f'clusters_{FACTOR}.pdf')) as pdf:
    clusters_factors(t, 'counts')
    pdf.savefig()  
Computing summary for clusters
Summary by age {1: 'old', 2: 'young'}
     counts
age
1    9695.0
2    7386.0
In [67]:
print('Summary UMI distribution per cluster')
with PdfPages(os.path.join(figures_path, 'clusters_umi.pdf')) as pdf:
    plt.figure(figsize=(10, 6))
    for c in tqdm(set(clusters)):
        sns.kdeplot(np.log10(fulldf.T.iloc[np.flatnonzero(clusters == c)].sum(axis=1)), 
                    label=str(c))

    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 summary coverage')
    plt.ylabel('Frequency')
    plt.legend()
    pdf.savefig()
Summary UMI distribution per cluster

In [68]:
print('Peaks with non-zero coverage distribution per cluster')
with PdfPages(os.path.join(figures_path, 'clusters_cell_peaks.pdf')) as pdf:
    plt.figure(figsize=(10, 6))
    for c in tqdm(set(clusters)):
        sns.kdeplot(np.log10(np.count_nonzero(fulldf.T.iloc[np.flatnonzero(clusters == c)], axis=1)), 
                    label=str(c))

    plt.title('Peaks per cell')
    plt.xlabel('Log10 peaks')
    plt.ylabel('Frequency')
    plt.legend()
    pdf.savefig()
Peaks with non-zero coverage distribution per cluster

Clusters visualization

In [69]:
print('UMAP on LSA normalized vs graph clusters')
with PdfPages(os.path.join(figures_path, 'umap_clusters.pdf')) as pdf:
    plot_colored(umap_coords.values, clusters, clusters=True, show_centers=True, size=20, alpha=0.3)
    pdf.savefig()
UMAP on LSA normalized vs graph clusters
In [70]:
print('UMAP on LSA normalized vs graph clusters')
with PdfPages(os.path.join(figures_path, 'umap_clusters_split.pdf')) as pdf:
    plot_colored_split(umap_coords.values, clusters, size=5)
    pdf.savefig()
UMAP on LSA normalized vs graph clusters

T-SNE for dimensionality reduction

In [71]:
from sklearn.manifold import TSNE
print('Processing t-SNE on L2-normalized data')
tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(result_lsa.values),
                           index=result_lsa.index,
                           columns=['tsne1', 'tsne2'])
print('t-SNE done!')
# display(tsne_coords.head())

print('Saving tSNE coordinates to tsne.tsv')
tsne_coords.to_csv(os.path.join(OUTPUT_DIR, 'tsne.tsv'), sep='\t')
Processing t-SNE on L2-normalized data
t-SNE done!
Saving tSNE coordinates to tsne.tsv
In [72]:
print('tSNE on LSA normalized vs log10 coverage depth')
with PdfPages(os.path.join(figures_path, 'tsne.pdf')) as pdf:
    plot_colored(tsne_coords.values, np.log10(fulldf.sum()), size=20)
    pdf.savefig()
tSNE on LSA normalized vs log10 coverage depth
In [73]:
print('tSNE on LSA normalized vs', FACTOR, FACTORS_MAP)
factors = np.array([FACTOR_FUNCTION(bc) for bc in result_lsa.index])
with PdfPages(os.path.join(figures_path, f'tsne_{FACTOR}.pdf')) as pdf:
    plot_colored(tsne_coords.values, factors, clusters=True, size=20, alpha=0.2)
    pdf.savefig()
tSNE on LSA normalized vs age {1: 'old', 2: 'young'}
In [74]:
with PdfPages(os.path.join(figures_path, f'tsne_{FACTOR}_split.pdf')) as pdf:
    plot_colored_split(tsne_coords.values, factors, alpha=0.1)
    pdf.savefig()

In [75]:
print('tSNE on LSA normalized vs graph clusters')
with PdfPages(os.path.join(figures_path, 'tsne_clusters.pdf')) as pdf:
    plot_colored(tsne_coords.values, clusters, clusters=True, show_centers=True, size=20, alpha=0.5)
    pdf.savefig()
tSNE on LSA normalized vs graph clusters
In [76]:
with PdfPages(os.path.join(figures_path, 'tsne_clusters_split.pdf')) as pdf:
    plot_colored_split(tsne_coords.values, clusters, size=5)
    pdf.savefig()

Analysis of markers of interest

In [169]:
from matplotlib.colors import LinearSegmentedColormap

def regions_plot(regions, df, 
                 xs=None, 
                 ys=None,
                 factors=None,
                 clusters=None):
    print(f'Coverage t-SNE, distribution and fraction of cells with non-zero coverage w.r.t. {FACTOR} and clusters')
    if xs is None:
        xs = tsne_coords['tsne1'] 
    if ys is None:    
        ys = tsne_coords['tsne2']
    if factors is None:        
        factors = [FACTORS_MAP[FACTOR_FUNCTION(bc)] for bc in df.columns]
    if clusters is None:    
        clusters = clusters_df['cluster']
    
    plt.subplots(nrows=len(regions), ncols=4, figsize=(4*4, 4*len(regions)))
    for i, (k, peak) in enumerate(regions.items()):
        vals = df.loc[peak]
        print(k, peak, 'cov>0:', sum(vals > 0), 'of', len(vals), 'max cov:', max(vals))
        zs = zscore(vals)
        ax = plt.subplot(len(regions), 4, 4 * i + 1)        
        # Plot zscores scatter plot with 2 segments colormap blue-white-red
        cmap = LinearSegmentedColormap.from_list(name='zscore', 
            colors=[(0, 'darkblue'), ((0 - np.min(zs)) / (np.max(zs) - np.min(zs)), 'white'), (1, 'red')])
        plt.setp(ax, xticks=[], yticks=[])
        ax.scatter(xs, ys, marker='.', c=zs, cmap=cmap, s=3, alpha=0.3)
        plt.title(k)
        plt.xlabel('tSNE1')
        plt.ylabel('tSNE2')        
        ax = plt.subplot(len(regions), 4, 4 * i + 2)        
        sns.distplot(vals)
        ax = plt.subplot(len(regions), 4, 4 * i + 3)        
        fvdf = pd.DataFrame({'factor': factors, 
                             'val': [min(v, 1) for v in vals]})
        sns.barplot(x='factor', y='val', 
                    data=pd.pivot_table(fvdf, values='val', index=['factor'], aggfunc='mean').reset_index())
        ax = plt.subplot(len(regions), 4, 4 * i + 4)        
        cldf = pd.DataFrame({'cluster': clusters, 'val': [min(v, 1) for v in vals]})
        sns.barplot(x='cluster', y='val', 
                    data=pd.pivot_table(cldf, values='val', index=['cluster'], aggfunc='mean').reset_index())
In [170]:
print('Analysing regions of interest')
with PdfPages(os.path.join(figures_path, 'tsne_markers.pdf')) as pdf:
    regions_plot(marker_regions_peaks, fulldf)
    pdf.savefig()
Analysing regions of interest
Coverage t-SNE, distribution and fraction of cells with non-zero coverage w.r.t. age and clusters
GZMK classic promoter chr13:113180800-113181000 cov>0: 49 of 17081 max cov: 2.0
GZMK alternative promoter chr13:113183200-113183800 cov>0: 178 of 17081 max cov: 2.0
CD68 promoter chr11:69665600-69667000 cov>0: 547 of 17081 max cov: 3.0
Cd3d promoter chr9:44981200-44982800 cov>0: 493 of 17081 max cov: 3.0
CD19 promoter chr7:126414200-126415200 cov>0: 342 of 17081 max cov: 4.0
NCR1 promoter chr7:4337400-4337800 cov>0: 112 of 17081 max cov: 4.0

BigWig RPM normalized profiles

Pipeline Splits UMI reads by cluster and uses RPM normalization.


Seurat We created pseudo-bulk ATAC-seq profiles by pooling together cells for each cell type. Each cell type showed enriched accessibility near canonical marker genes. Chromatin accessibility tracks are normalized to sequencing depth (RPKM normalization) in each pooled group.

SnapATAC Next we aggregate cells from each cluster to create an ensemble track for peak calling and visualization.

In [ ]:
print('Loading full fragments dataframe')
fdf = pd.read_csv(fragments_file, sep='\t', names=['chr', 'start', 'end', 'barcode', 'reads'])

print('Aggregating fragments and clusters')
df_for_bigwigs = pd.merge(left=fdf, right=clusters_df,
                          left_on='barcode', right_index=True)
display(df_for_bigwigs.head()) # already sorted by chromosomes

Writing BigWig files

In [ ]:
import glob
import pyBigWig

step = 100
bigwigs_path = os.path.join(OUTPUT_DIR, 'bigwig')
! mkdir -p {bigwigs_path}
    
for cluster in set(clusters):
    bw_path = os.path.join(bigwigs_path, f'cluster_{cluster}.bw')
    if os.path.exists(bw_path):
        print(f'Bigwig file {bw_path} exists, skipping...')
        continue
    print('Processing cluster', cluster, bw_path)
    df_for_bigwigs_cluster = df_for_bigwigs.loc[df_for_bigwigs['cluster'] == cluster]
    # Total fragments in the cluster, for track-wide RPM scaling
    total_fragments = len(df_for_bigwigs_cluster)
    chr_lengths = df_for_bigwigs_cluster[['chr', 'end']].groupby('chr').max().reset_index()

    with pyBigWig.open(bw_path, 'w') as bw:
        bw.addHeader(list(zip(chr_lengths['chr'], chr_lengths['end'])))
        for chromosome in tqdm(chr_lengths['chr']):
            df_for_bigwigs_cluster_chr =\
                df_for_bigwigs_cluster.loc[df_for_bigwigs_cluster['chr'] == chromosome].sort_values(['start', 'end'])

            starts = list(df_for_bigwigs_cluster_chr['start'])
            ends = list(df_for_bigwigs_cluster_chr['end'])
            reads = list(df_for_bigwigs_cluster_chr['reads'])
            chr_length = int(chr_lengths.loc[chr_lengths['chr'] == chromosome]['end'])
            values = np.zeros(int(math.floor(chr_length / step)) + 1)

            for i in range(len(df_for_bigwigs_cluster_chr)):
                # Ignore PCR duplicates here!
                # v = reads[i] 
                v = 1.0
                si = int(math.floor(starts[i] / step)) 
                ei = int(math.floor(ends[i] / step)) 
                if ei == si:
                    values[si] += v
                else:
                    values[si] += v / 2 
                    values[ei] += v / 2
            
            non_zero = values > 0
            non_zero_inds = np.flatnonzero(non_zero)
            starts_np = non_zero_inds * step
            ends_np = (non_zero_inds + 1) * step
            values_np = values[non_zero]
            # RPM normalization: scale by total fragments in the cluster, not per chromosome
            values_np = values_np * (1000000.0 / total_fragments)
            chroms_np = np.array([chromosome] * sum(non_zero))
            bw.addEntries(chroms_np, starts=starts_np, ends=ends_np, values=values_np)                     

print('Done')

Closest genes annotation

In [81]:
print('Preprocess GTF file')

gtf_file = GTF_FILE
transcripts_bed = gtf_file + '.bed'
if not os.path.exists(transcripts_bed):
    GTF_TO_BED_SH = """
TNF=$(cat $1 | grep transcript_id | grep -v '#' | head -n 1 | awk '{for (i=4; i<NF; i++) {if ($3=="transcript" && $i=="gene_name") print (i+1)}}')
cat $1 | awk -v TNF=$TNF -v OFS="\t" '{if ($3=="transcript") {print $1,$4-1,$5,$TNF,0,$7}}' | tr -d '";' | sort -k1,1 -k2,2n -k3,3n --unique > $2
    """
    with tempfile.NamedTemporaryFile(prefix='gtf_to_bed', suffix='.sh', delete=False) as f:
        f.write(GTF_TO_BED_SH.encode('utf-8'))
        f.close()
        ! bash {f.name} {gtf_file} {transcripts_bed} 
print('Transcripts bed file', transcripts_bed)

transcripts_tss = gtf_file + '.tss'
if not os.path.exists(transcripts_tss):
    BED_TO_TSS_SH = """
cat $1 | awk -v OFS="\t" '{if ($6=="+") {print $1, $2, $2+1, $4, 0, "+"} else {print $1,$3-1,$3, $4, 0, "-"}}' | sort -k1,1 -k2,2n -k3,3n --unique > $2
    """
    with tempfile.NamedTemporaryFile(prefix='bed_to_tss', suffix='.sh', delete=False) as f:
        f.write(BED_TO_TSS_SH.encode('utf-8'))
        f.close()
        ! bash {f.name} {transcripts_bed} {transcripts_tss} 
print('Transcripts tss file', transcripts_tss)
Preprocess GTF file
Transcripts bed file /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/gencode.vM22.annotation.gtf.bed
Transcripts tss file /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/gencode.vM22.annotation.gtf.tss

Pipeline Preprocesses genes to TSSs and uses bedtools closest -D b to report the distance to the TSS, upstream or downstream w.r.t. the gene strand.


Cell Ranger ATAC We use bedtools closest -D b to associate each peak with genes based on closest transcription start sites (packaged within the reference) such that the peak is within 1000 bases upstream or 100 bases downstream of the TSS.

In [82]:
def locations_closest_genes(locations, transcripts_tss):
    print('Annotate list of locations in format "chr:start-end" with closest tss of genes')
    with tempfile.NamedTemporaryFile(delete=False) as f:
        # Save index to restore original order after sorting
        for i, t in enumerate([re.split(':|-', p) for p in locations]):
            f.write('{}\t{}\t{}\t{}\n'.format(*t, i).encode('utf-8'))
        f.close()
        closest_file = f.name + '.closest'
        sorted_file = f.name + '.sorted'
        ! sort -k1,1 -k2,2n {f.name} > {sorted_file}
        ! bedtools closest -a {sorted_file} -b {transcripts_tss} -D b > {closest_file}
        closest_df = pd.read_csv(
            closest_file, sep='\t', 
            names=['chr', 'start', 'end', 'index', 
                   'gene_chr', 'gene_start', 'gene_end', 'gene', 'gene_score', 'gene_strand', 'distance'])
        # Restore the original order using the saved index
        closest_df.sort_values(by=['index'], inplace=True)
        # Pick only first closest gene in case of several
        closest_df.drop_duplicates(['index'], inplace=True)
        return closest_df[['chr', 'start', 'end', 'gene', 'distance']]


closest_genes = locations_closest_genes(list(normdf.index), transcripts_tss)
# Restore 'peak' to be able to merge on it
closest_genes['peak'] = closest_genes['chr'] +\
                        ':' + closest_genes['start'].astype(str) +\
                        '-' + closest_genes['end'].astype(str)
display(closest_genes.head())
Annotate list of locations in format "chr:start-end" with closest tss of genes
         chr      start        end     gene  distance                       peak
12433  chr10  100016000  100016400     Kitl        85  chr10:100016000-100016400
12434  chr10  100148200  100148400  Gm25287      3947  chr10:100148200-100148400
12435  chr10  100160400  100160600  Gm25287     16147  chr10:100160400-100160600
12441  chr10  100486600  100488000   Cep290         0  chr10:100486600-100488000
12442  chr10  100507800  100508000   Cep290      -726  chr10:100507800-100508000

Save cluster mean values

In [83]:
# Transpose to (barcodes x peaks) format
t = normdf.T 

# Per cluster mean values for peaks
clusters_means = {str(c): t.loc[clusters == c].mean() for c in set(clusters)}
clusters_means_df = pd.DataFrame(clusters_means).reset_index()[[str(c) for c in set(clusters)]]
clusters_means_genes_df = pd.concat([closest_genes.reset_index(), clusters_means_df], axis=1).drop(columns=['index'])
# Save peak without dashes or colons
clusters_means_genes_df['peak'] = [re.sub('[:-]', '_', p) for p in clusters_means_genes_df['peak']]

print('Saving cluster peak mean values to clusters_peaks_values.tsv')
display(clusters_means_genes_df.head())
clusters_means_genes_df.to_csv(os.path.join(OUTPUT_DIR, 'clusters_peaks_values.tsv'), sep='\t', index=False)
Saving cluster peak mean values to clusters_peaks_values.tsv
chr start end gene distance peak 0 1 2 3 ... 5 6 7 8 9 10 11 12 13 14
0 chr10 100016000 100016400 Kitl 85 chr10_100016000_100016400 0.004028 0.000551 0.001265 0.000186 ... 0.000363 0.001882 0.002265 0.005023 0.000899 0.002557 0.003703 0.001369 0.000427 0.000000
1 chr10 100148200 100148400 Gm25287 3947 chr10_100148200_100148400 0.007615 0.000460 0.001468 0.000000 ... 0.000512 0.000000 0.000000 0.008215 0.006299 0.000000 0.000921 0.000000 0.000000 0.002975
2 chr10 100160400 100160600 Gm25287 16147 chr10_100160400_100160600 0.003169 0.000000 0.001028 0.000312 ... 0.000000 0.000000 0.000475 0.001015 0.002375 0.000000 0.000000 0.000000 0.001710 0.000000
3 chr10 100486600 100488000 Cep290 0 chr10_100486600_100488000 0.031504 0.029206 0.028765 0.037455 ... 0.020672 0.024397 0.027820 0.017437 0.019234 0.031801 0.037309 0.040517 0.034370 0.078381
4 chr10 100507800 100508000 Cep290 -726 chr10_100507800_100508000 0.005323 0.000000 0.001861 0.000000 ... 0.000000 0.000000 0.000000 0.002184 0.000000 0.000000 0.000000 0.000000 0.000693 0.000000

5 rows × 21 columns

In [84]:
print('Hierarchical clustering of clusters mean values')
with PdfPages(os.path.join(figures_path, 'clusters_hclust.pdf')) as pdf:
    sns.clustermap(clusters_means_genes_df[[str(i) for i in set(clusters)]], 
                   col_cluster=True, row_cluster=False, figsize=(10, 5), cmap=plt.cm.viridis)
    pdf.savefig()
Hierarchical clustering of clusters mean values

Differential peaks analysis (Markers)

Pipeline

Use Cell Ranger ATAC to get differential markers.


Cell Ranger ATAC For each peak we test each cluster against all the others; since several clusters may yield significant results for the same peak, we take into account only the cluster with the greatest mean value. The test is performed using Cell Ranger's default testing procedure based on a Negative Binomial GLM.
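
The sseq test itself lives in the local diffexp module imported in the next cell; as a hedged sketch of only the surrounding filtering logic (filter_markers is hypothetical), the FDR and fold-change gate can be expressed as:

import numpy as np
from statsmodels.stats.multitest import multipletests

def filter_markers(p_values, log2_fc, fdr=0.05):
    # Benjamini-Hochberg adjustment, then keep peaks enriched in the cluster
    _, adj_p, _, _ = multipletests(p_values, method='fdr_bh')
    return np.logical_and(adj_p < fdr, np.asarray(log2_fc) > 0)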

In [85]:
# Launch 10x Genomics Cell Ranger DE
from diffexp import *

def run_10x_differential_expression(df, clusters, fdr=0.05):
    """ Compute differential expression for each cluster vs all other cells
        Args: df          - feature expression data (peak x barcode)
              clusters    - 0-based cluster labels
              fdr         - false discovery rate"""

    peaks = list(df.index)
    distinct_clusters = set(clusters)
    print('Detected clusters', len(distinct_clusters))

    print("Computing params...")
    m = df.values
    sseq_params = compute_sseq_params(m)
    
    cluster_markers = []
    for cluster in distinct_clusters:
        in_cluster = clusters == cluster
        group_a = np.flatnonzero(in_cluster)
        group_b = np.flatnonzero(np.logical_not(in_cluster))
        print(f'Computing DE for cluster {cluster}...')

        de_result = sseq_differential_expression(m, group_a, group_b, sseq_params)
        de_result['cluster'] = [cluster] * len(de_result)
        de_result['peak'] = peaks
        passed_fdr = de_result['adjusted_p_value'] < fdr
        passed_log2_fc = de_result['log2_fold_change'] > 0
        # Filter out different by statistical tests with log2 fold change > 0
        passed = np.logical_and(passed_fdr, passed_log2_fc)
        print('DE:', sum(passed), 'of', len(de_result), 
              'FDR:', sum(passed_fdr), 'log2fc>0:', sum(passed_log2_fc))

        passed_de = de_result.loc[passed]
        cluster_de_markers = passed_de[
            ['cluster', 'peak', 'norm_mean_a', 'norm_mean_b', 'log2_fold_change', 'p_value', 'adjusted_p_value']
        ].rename(columns={'norm_mean_a': 'norm_mean_cluster', 'norm_mean_b': 'norm_mean_others'}) 

        cluster_markers.append(cluster_de_markers)    

    return pd.concat(cluster_markers)
In [86]:
# Cell Ranger ATAC like differential analysis
# Each peak is tested independently, no normalization to peak length required.
markers = run_10x_differential_expression(fulldf.loc[peaks_filter], clusters)
print('Total markers', len(markers))
# display(markers.head())
Detected clusters 15
Computing params...
Computing DE for cluster 0...
Computing 77510 exact tests and 100 asymptotic tests.
DE: 14336 of 77610 FDR: 32971 log2fc>0: 36462
Computing DE for cluster 1...
Computing 77536 exact tests and 74 asymptotic tests.
DE: 12204 of 77610 FDR: 30574 log2fc>0: 33209
Computing DE for cluster 2...
Computing 77557 exact tests and 53 asymptotic tests.
DE: 10788 of 77610 FDR: 28557 log2fc>0: 36626
Computing DE for cluster 3...
Computing 77586 exact tests and 24 asymptotic tests.
DE: 10577 of 77610 FDR: 23083 log2fc>0: 35101
Computing DE for cluster 4...
Computing 77609 exact tests and 1 asymptotic tests.
DE: 12380 of 77610 FDR: 20543 log2fc>0: 41468
Computing DE for cluster 5...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 6414 of 77610 FDR: 10929 log2fc>0: 33254
Computing DE for cluster 6...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 3739 of 77610 FDR: 7609 log2fc>0: 37698
Computing DE for cluster 7...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 8065 of 77610 FDR: 13780 log2fc>0: 35400
Computing DE for cluster 8...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 205 of 77610 FDR: 219 log2fc>0: 50716
Computing DE for cluster 9...
Computing 77609 exact tests and 1 asymptotic tests.
DE: 7138 of 77610 FDR: 12581 log2fc>0: 41978
Computing DE for cluster 10...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 5787 of 77610 FDR: 9815 log2fc>0: 36868
Computing DE for cluster 11...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 3364 of 77610 FDR: 5778 log2fc>0: 39593
Computing DE for cluster 12...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 5700 of 77610 FDR: 10017 log2fc>0: 36056
Computing DE for cluster 13...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 7 of 77610 FDR: 40 log2fc>0: 45017
Computing DE for cluster 14...
Computing 77610 exact tests and 0 asymptotic tests.
DE: 2475 of 77610 FDR: 2703 log2fc>0: 55259
Total markers 103179

Markers summary

In [87]:
print('Number of differential markers per cluster')
t = markers[['cluster']].copy()
t['count'] = 1
display(pd.pivot_table(t, values='count', index=['cluster'], aggfunc=np.sum))
Number of differential markers per cluster
         count
cluster
0        14336
1        12204
2        10788
3        10577
4        12380
5         6414
6         3739
7         8065
8          205
9         7138
10        5787
11        3364
12        5700
13           7
14        2475

Annotate markers with genes

In [88]:
markers_with_genes = pd.merge(left=markers[['cluster', 'norm_mean_cluster', 'norm_mean_others', 
                                            'log2_fold_change', 'p_value', 'adjusted_p_value', 'peak']],
                         right=closest_genes,
                         left_on='peak',
                         right_on='peak')

# Save peak without dashes or colons
markers_with_genes['peak'] = [re.sub('[:-]', '_', p) for p in markers_with_genes['peak']]

# rearrange columns, sort
markers_with_genes = markers_with_genes[['chr', 'start', 'end', 'peak',
                                         'cluster', 'norm_mean_cluster', 'norm_mean_others', 
                                         'log2_fold_change', 'p_value', 'adjusted_p_value', 
                                         'gene', 'distance']].sort_values(by=['cluster', 'adjusted_p_value'])
# display(markers_with_genes.head())

print('Saving all the markers', len(markers_with_genes), 'to markers.tsv')
markers_with_genes.to_csv(os.path.join(OUTPUT_DIR, 'markers.tsv'), sep='\t', index=None)
Saving all the markers 103179 to markers.tsv

Analyzing top markers

In [89]:
def get_top_markers(df, n_clusters=n_clusters, top=100):
    top_markers = pd.concat([
        df.loc[df['cluster'] == c].sort_values(by=['adjusted_p_value']).head(top) 
                    for c in range(0, n_clusters)])
    top_markers['peak'] = top_markers['chr'] +\
                          ':' + top_markers['start'].astype(str) + '-'\
                          + top_markers['end'].astype(str)
    return top_markers
In [115]:
print('Computing Z scores for mean values in cluster for top markers')
top_markers = get_top_markers(markers_with_genes, top=100)
t = normdf.T # Transpose to (barcodes x peaks) format
t = t[list(top_markers['peak'])] # top markers only

t = pd.concat([pd.DataFrame(t.loc[clusters == cluster].mean()).T for cluster in set(clusters)])
t.index = range(n_clusters)

# Transform to Z-score
for c in t.columns:
    t[c] = zscore(t[c])

print('Clustermap of Z scores')
with PdfPages(os.path.join(figures_path, 'markers_z.pdf')) as pdf:
    sns.clustermap(t.T, col_cluster=False, row_cluster=False, figsize=(10, 10), cmap=plt.cm.seismic)
    pdf.savefig()
Computing Z scores for mean values in cluster for top markers
Clustermap of Z scores

Top markers visualization on T-SNE

In [117]:
print('t-SNE based Z-SCORE visualizations for markers, mean z-score for 100 top markers')
t = normdf
top_markers = get_top_markers(markers_with_genes, top=100)
with PdfPages(os.path.join(figures_path, 'tsne_diff_markers_100.pdf')) as pdf:
    nrows, ncols = n_rows_columns(n_clusters)
    plt.subplots(nrows=nrows, ncols=ncols, figsize=(5 * ncols, 4 * nrows))
    for i, cluster in enumerate(tqdm(set(clusters))):
        # Table (marker for cluster x cells)
        markers_df = t.loc[top_markers.loc[top_markers['cluster'] == cluster]['peak']].T.copy()

        # Z scale for each marker peak across all cells
        for c in markers_df.columns:
            markers_df[c] = zscore(markers_df[c])

        # Average Z score for each cell
        zs = markers_df.T.mean()
        t1 = tsne_coords['tsne1']
        t2 = tsne_coords['tsne2']
        ax = plt.subplot(nrows, ncols, i + 1)
        # Average is much more stable than coverage, we can use linear colormap here
        plt.setp(ax, xticks=[], yticks=[])    
        # Plot zscores scatter plot with 2 segments colormap blue-white-red
        cmap = LinearSegmentedColormap.from_list(name='zscore', 
            colors=[(0, 'darkblue'), ((0 - np.min(zs)) / (np.max(zs) - np.min(zs)), 'white'), (1, 'red')])        
        sc = ax.scatter(t1, t2, s=5, c=zs, cmap=cmap, edgecolor='none', alpha=1)
#         sc = ax.scatter(t1, t2, s=5, c=zs, cmap=plt.cm.seismic, edgecolor='none', alpha=1)
        plt.colorbar(sc)
        plt.title(f'Cluster {cluster}')
        plt.xlabel('tSNE axis 1')
        plt.ylabel('tSNE axis 2')
    pdf.savefig()
t-SNE based Z-SCORE visualizations for markers, mean z-score for 100 top markers

In [114]:
print('t-SNE based Z-SCORE visualizations for markers, z-score for 1 top marker')
t = normdf
top_markers_1 = get_top_markers(markers_with_genes, top=1)
top_markers_map = {}
for i, row in top_markers_1.iterrows():
    top_markers_map[str(row['cluster'])] = row['peak']
    
with PdfPages(os.path.join(figures_path, 'tsne_diff_markers_1.pdf')) as pdf:
    regions_plot(top_markers_map, fulldf)
    pdf.savefig()
t-SNE based Z-SCORE visualizations for markers, z-score for 1 top marker
Coverage t-SNE, distribution and fraction of non-zero covered peaks w.r.t. age and clusters
0 cov>0: 1591 of 17081 max cov: 5.0
1 cov>0: 309 of 17081 max cov: 5.0
2 cov>0: 1938 of 17081 max cov: 7.0
3 cov>0: 655 of 17081 max cov: 5.0
4 cov>0: 373 of 17081 max cov: 4.0
5 cov>0: 270 of 17081 max cov: 4.0
6 cov>0: 466 of 17081 max cov: 5.0
7 cov>0: 283 of 17081 max cov: 4.0
8 cov>0: 163 of 17081 max cov: 4.0
9 cov>0: 68 of 17081 max cov: 4.0
10 cov>0: 249 of 17081 max cov: 4.0
11 cov>0: 261 of 17081 max cov: 6.0
12 cov>0: 147 of 17081 max cov: 5.0
13 cov>0: 73 of 17081 max cov: 3.0
14 cov>0: 37 of 17081 max cov: 3.0

Save differential markers to BED

In [93]:
cluster_peaks_path = os.path.join(OUTPUT_DIR, 'markers')
! mkdir -p {cluster_peaks_path}
print('Cleanup markers directory', cluster_peaks_path)
for f in glob.glob(f'{cluster_peaks_path}/markers_*.bed'):
    os.remove(f)

mlap_max = (-np.log10(markers.loc[markers['adjusted_p_value'] != 0]['adjusted_p_value'])).max()
print('BED scoring as -log10 adjusted pval')
print('Max of -log10 adjusted pval', mlap_max)

for c in set(clusters):
    bed_file = os.path.join(cluster_peaks_path, f'markers_{c}.bed')
    # Ascending by log2 fold change
    markers_cluster = markers.loc[markers['cluster'] == c].sort_values('log2_fold_change')
    markers_cluster.index = range(len(markers_cluster))
    print('Writing cluster', c, 'differential peaks to', bed_file)
    with open(bed_file, 'w') as bed:
        for i in range(len(markers_cluster)):
            peak = markers_cluster['peak'][i]
            ap = markers_cluster['adjusted_p_value'][i]
            mlap = -np.log10(ap) if ap != 0.0 else mlap_max
            line = '{}\t{}\t{}\t.\n'.format(
                re.sub(':|-', '\t', peak), 
                c,
                mlap)
            bed.write(line)
print('Done')                
Cleanup markers directory /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers
BED scoring as -log10 adjusted pval
Max of -log10 adjusted pval 123.7617523697129
Writing cluster 0 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_0.bed
Writing cluster 1 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_1.bed
Writing cluster 2 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_2.bed
Writing cluster 3 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_3.bed
Writing cluster 4 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_4.bed
Writing cluster 5 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_5.bed
Writing cluster 6 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_6.bed
Writing cluster 7 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_7.bed
Writing cluster 8 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_8.bed
Writing cluster 9 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_9.bed
Writing cluster 10 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_10.bed
Writing cluster 11 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_11.bed
Writing cluster 12 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_12.bed
Writing cluster 13 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_13.bed
Writing cluster 14 differential peaks to /mnt/stripe/shpynov/2019_gxfer1_DT1634_Denis/pipeline/markers/markers_14.bed
Done
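
Each markers_<cluster>.bed row is BED6-style: chrom, start, end, name (cluster id), score (-log10 adjusted p-value) and a '.' strand placeholder. A hypothetical format check (not part of the original run):

In [ ]:
# Hypothetical format check: 6 tab-separated fields with a '.' strand placeholder
with open(os.path.join(cluster_peaks_path, 'markers_0.bed')) as bed:
    fields = bed.readline().strip().split('\t')
assert len(fields) == 6 and fields[5] == '.'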

Preparing data for single cell explorer

Preprocess the data for the single cell explorer (SCE): https://ctlab.github.io/SCE

In [182]:
# Sanity check indexes
df = fulldf.loc[peaks_filter]
assert np.array_equal(tsne_coords.index, umap_coords.index)
assert np.array_equal(tsne_coords.index, fulldf.columns)
assert np.array_equal(clusters_df.index, fulldf.columns)
assert np.array_equal(clusters, clusters_df['cluster'])
In [184]:
import json
import h5py

sce_path = os.path.join(OUTPUT_DIR, 'sce')
! mkdir -p {sce_path}

print('Processing', 'dataset.json')
with open(os.path.join(sce_path, 'dataset.json'), 'w') as f:
    f.write(json.dumps({
        'token': '2019_scatacseq_Denis',
#         'public': True,
        'name': 'Aged mice spleen single cell atac-seq dataset by Denis Mogilenko',
        'organism': 'mm',
        'cells': df.shape[1]
    }))
    

print('Processing', 'plot_data.json')
fields = {
  "tSNE_1": {
    "type": "numeric",
    "range": [
      min(tsne_coords.loc[df.columns.values]['tsne1']),
      max(tsne_coords.loc[df.columns.values]['tsne1'])
    ]
  },
  "tSNE_2": {
    "type": "numeric",
    "range": [
      min(tsne_coords.loc[df.columns.values]['tsne2']),
      max(tsne_coords.loc[df.columns.values]['tsne2'])
    ]
  },
  "UMAP_1": {
    "type": "numeric",
    "range": [
      min(umap_coords.loc[df.columns.values]['umap1']),
      max(umap_coords.loc[df.columns.values]['umap1'])
    ]
  },
  "UMAP_2": {
    "type": "numeric",
    "range": [
      min(umap_coords.loc[df.columns.values]['umap2']),
      max(umap_coords.loc[df.columns.values]['umap2'])
    ]
  },
  "Cluster": {
    "type": "factor",
    "levels": [str(c) for c in set(clusters)]
  },
  "Age": {
    "type": "factor",
    "levels": list(FACTORS_MAP.values())
  },    
  "nUmi": {
    "type": "numeric",
    "range": [
      min(df.sum()),
      max(df.sum()),
    ]
  }
}

data = pd.DataFrame({      
    "tSNE_1": tsne_coords.loc[df.columns.values]['tsne1'].tolist(),
    "tSNE_2": tsne_coords.loc[df.columns.values]['tsne2'].tolist(),
    "UMAP_1": umap_coords.loc[df.columns.values]['umap1'].tolist(),
    "UMAP_2": umap_coords.loc[df.columns.values]['umap2'].tolist(),
    "Cluster": [str(c) for c in clusters_df.loc[df.columns.values]['cluster']],
    "Age": [FACTORS_MAP[FACTOR_FUNCTION(bc)] for bc in df.columns],
    "nUmi": df.sum().tolist(),
    "_row": list(df.columns)
}).to_dict('records')

annotations = {
  "tsne_Cluster_centers": {
    "type": "text",
    "value": "Cluster",
    "coords": [
      "tSNE_1",
      "tSNE_2"
    ],
    "data": [
      {
        "Cluster": str(c),
        "tSNE_1": float(np.mean(tsne_coords.loc[df.columns.values].values[
            np.flatnonzero(clusters_df.loc[df.columns.values]['cluster'] == c), 0])),
        "tSNE_2": float(np.mean(tsne_coords.loc[df.columns.values].values[
            np.flatnonzero(clusters_df.loc[df.columns.values]['cluster'] == c), 1])),
        "Text": str(c)
      } for c in set(clusters)
    ]    
  },
}

with open(os.path.join(sce_path, 'plot_data.json'), 'w') as f:
    f.write(json.dumps({
        "fields": fields,
        "data": data,
        "annotations": annotations,
    }))


# exp_data.json contains gene names, cell barcodes and total UMI counts per cell.
# It must match the row and column names of the expression matrix stored in data.h5.
print('Processing', 'exp_data.json')
with open(os.path.join(sce_path, 'exp_data.json'), 'w') as f:
    f.write(json.dumps({
      "genes": list(df.index.values),
      "barcodes": list(df.columns.values),
      "totalCounts": df.sum(axis=0).astype(int).tolist()
    }))


print('Processing', 'data.h5')
! rm -f {sce_path}/data.h5
# Disable HDF5 file locking, which is not available on some network file systems
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
# Don't use pandas to_hdf here to avoid its extra overhead.
# IMPORTANT: transpose the matrix because of the storage-order difference!
# Column-major order is used by Fortran, MATLAB, R, and most core linear algebra libraries (BLAS),
# while row-major ("C" style) order is the NumPy default; NumPy exposes these
# in array flags as C_CONTIGUOUS and F_CONTIGUOUS, respectively.
# https://cran.r-project.org/web/packages/reticulate/vignettes/arrays.html
expdf = df.astype(int).clip(upper=1) # Clip coverage to 0-1 for better visualization
with h5py.File(os.path.join(sce_path, 'data.h5'), 'w') as hf:
    hf.create_dataset("expression/mat", 
                      data=expdf.T.values, # Column-major ordering!
                      compression="gzip", compression_opts=9, 
                      dtype='i4') # 32-bit integers ('i4') are required by SCE, see https://github.com/ctlab/SCE/issues/4
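
# Optional sanity check (an added suggestion, not in the original run): after the
# transpose above, the stored matrix should be (cells x peaks)
with h5py.File(os.path.join(sce_path, 'data.h5'), 'r') as hf:
    assert hf['expression/mat'].shape == (expdf.shape[1], expdf.shape[0])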


print('Processing', 'markers.json')
# Take into account only 100 top markers
with open(os.path.join(sce_path, 'markers.json'), 'w') as f:
    f.write(json.dumps({
        "Cluster": [ 
          {
              'X': row['gene'],
              'p_val': row['p_value'],
              'p_val_adj': row['adjusted_p_value'],
              'avg_logFC': row['log2_fold_change'],
              'pct.1': row['norm_mean_cluster'],
              'pct.2': row['norm_mean_others'],
              'cluster': str(row['cluster']),
              'gene': row['peak'],
              'distance': row['distance']
          } for index, row in get_top_markers(markers_with_genes, top=100).iterrows()
      ]
    }))
    
print('Done')
Processing dataset.json
Processing plot_data.json
Processing exp_data.json
Processing data.h5
Processing markers.json
Done
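
As a quick consistency check across the artifacts written above (a hypothetical sketch, not part of the original run), the JSON files can be cross-validated:

In [ ]:
# Hypothetical cross-validation of the SCE artifacts (not in the original run)
with open(os.path.join(sce_path, 'exp_data.json')) as f:
    exp = json.load(f)
with open(os.path.join(sce_path, 'plot_data.json')) as f:
    plot = json.load(f)
# One plot record per barcode, in the same order as the expression matrix columns
assert [r['_row'] for r in plot['data']] == exp['barcodes']
assert len(exp['totalCounts']) == len(exp['barcodes'])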

Original 10x Cell Ranger results

In [ ]:
clusters10xpath = CLUSTERS_FILE
clusters10xdf = pd.read_csv(clusters10xpath, sep=',')
clusters10xdf['Cluster'] = clusters10xdf['Cluster'].astype(int) - 1 # Clusters start with 1 in 10x
clusters10xdf[FACTOR] = [FACTOR_FUNCTION(bc) for bc in clusters10xdf['Barcode']]
# display(clusters10xdf.head())
print('Total clusters', len(set(clusters10xdf['Cluster'])))
In [ ]:
print('Cluster size analysis')
t = clusters10xdf.copy()
t['counts'] = np.ones(len(t))
t.rename(columns={'Cluster': 'cluster'}, inplace=True)
t.index = t['Barcode']
print('Summary by', FACTOR, FACTORS_MAP)
display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())

clusters_sizes(t[['cluster', 'counts', FACTOR]], 'counts')
plt.show()

clusters_factors(t[['cluster', 'counts', FACTOR]], 'counts')
plt.show()
In [ ]:
print('Cell Ranger ATAC original T-SNE visualization')
tsne10xpath = TSNE_FILE
tsne10xdf = pd.read_csv(tsne10xpath, sep=',')

mergeddf = pd.merge(left=tsne10xdf,
                    right=clusters10xdf, 
                    left_on='Barcode', right_on='Barcode')

plot_colored(mergeddf[['TSNE-1', 'TSNE-2']].values, mergeddf['Cluster'], 
             clusters=True,
             show_centers=True,
             size=20, alpha=0.5)
plt.show()
In [ ]:
print('Cell Ranger ATAC t-SNE vs', FACTOR, FACTORS_MAP)
plot_colored(mergeddf[['TSNE-1', 'TSNE-2']].values, clusters10xdf[FACTOR], clusters=True, size=20, alpha=0.2)
plt.show()

Clustering comparison vs 10x Cell Ranger ATAC

In [ ]:
print('sc-atacseq-explorer clusters into Cell Ranger ATAC coordinates') 
mergeddf = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index, 'Cluster': clusters}),
                    right=tsne10xdf, 
                    left_on='Barcode', right_on='Barcode')

print('Cell Ranger ATAC t-SNE colored by sc-atacseq-explorer clusters')
plot_colored(mergeddf[['TSNE-1', 'TSNE-2']].values, 
             mergeddf['Cluster'], 
             clusters=True,
             show_centers=True,
             size=20, alpha=0.5)
plt.show()
In [ ]:
print('Clusters from Cell Ranger ATAC into sc-atacseq-explorer coordinates') 
mergeddf = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index,
                                 'tsne1': tsne_coords['tsne1'],
                                 'tsne2': tsne_coords['tsne2']}),
                    right=clusters10xdf, 
                    left_on='Barcode', right_on='Barcode')

print('LSA-normalized t-SNE colored by Cell Ranger ATAC clusters')
plot_colored(mergeddf[['tsne1', 'tsne2']].values, 
             mergeddf['Cluster'], 
             clusters=True,
             show_centers=True,
             size=20, alpha=0.5)
plt.show()
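
Beyond the visual comparison, a contingency table gives a quick quantitative view of cluster agreement (a hypothetical sketch using pandas.crosstab, not part of the original run):

In [ ]:
# Hypothetical cluster-agreement table between sc-atacseq-explorer and Cell Ranger ATAC
t = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index, 'cluster': clusters}),
             clusters10xdf[['Barcode', 'Cluster']], on='Barcode')
display(pd.crosstab(t['cluster'], t['Cluster']))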
In [ ]: