Single cell ATAC-seq explorer

This single-cell ATAC-seq analysis pipeline is designed for advanced analysis of datasets produced by 10X Genomics Cell Ranger ATAC. Aggregated datasets are also supported!

In addition to the 10X Genomics results it offers:

  • Processing of aggregated data produced by 10x Genomics Cell Ranger ATAC
  • A summary of the different conditions in case of an aggregated dataset
  • Different types of clustering followed by heatmap and t-SNE/UMAP visualizations in low-dimensional space
  • Visualization of top cluster markers as a heatmap on the t-SNE/UMAP plot
  • Closest-gene annotations for peaks and clusters
  • Annotated markers analysis
  • BigWig and BED files for clusters and markers, ready to be visualized in the JBR Genome Browser by JetBrains Research
  • Data preparation for Single Cell Explorer by Artyomov Lab, Washington University in St. Louis
  • All figures saved in PDF format, ready for publication

Required 10X Genomics Cell Ranger ATAC files:

  • fragments.tsv - fragments file provided by Cell Ranger
  • peaks.bed - peaks file (the union of peaks in case of merging)
  • clusters.csv - clustering provided by Cell Ranger
  • projection.csv - 2D t-SNE projection of all the cells

Pipeline steps:

  • Cell Calling
  • Peak barcode matrix
  • Feature selection
  • Dimensionality reduction
  • t-SNE / UMAP
  • Differential markers analysis
  • Closest genes association
  • Supervised annotation of clusters by gene markers
  • Export data to Single Cell Explorer format

Source code is available at: https://github.com/JetBrains-Research/sc-atacseq-explorer. Questions and comments are welcome at os at jetbrains dot com.

Pipeline paths configuration

In [1]:
# # Configuration of 10X Genomics files
# FRAGMENTS_FILE_10X = 'fragments.tsv.gz'
# CLUSTERS_FILE_10X = 'clusters.csv'
# TSNE_FILE_10X = 'tsne.csv'
# PEAKS_FILE = 'peaks.bed'

# # See https://www.encodeproject.org/annotations/ENCSR636HFF/
# # Set this value to None if blacklist regions bed file is not available
# BLACKLIST_FILE = 'blacklist.bed'

# # For mm10 and hg38
# # ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
# # ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz
# # ! gunzip *.gtf.gz
# GTF_FILE = 'gencodegenes.gtf'

# # Download DNase hypersensitivity sites from encodeproject.org
# # For mm10
# # ! wget https://www.encodeproject.org/files/ENCFF108APF/@@download/ENCFF108APF.bed.gz
# # ! gunzip *.gz
# # Set this to None if DNASe file is not available
# DNASE_FILE = 'dnase.bed'


# # Output configuration
# OUTPUT_DIR = 'pipeline'

# Configuration of 10X Genomics files
FRAGMENTS_FILE_10X = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/ALAW-XNN-AS011/outs/fragments.tsv.gz'
CLUSTERS_FILE_10X = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/ALAW-XNN-AS011/outs/analysis/clustering/graphclust/clusters.csv'
TSNE_FILE_10X = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/ALAW-XNN-AS011/outs/analysis/tsne/2_components/projection.csv'
PEAKS_FILE = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/ALAW-XNN-AS011/outs/peaks.bed'

# See https://www.encodeproject.org/annotations/ENCSR636HFF/
# Set this value to None if blacklist regions bed file is not available
BLACKLIST_FILE = None

# For mm10 and hg38
# ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
# ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz
# ! gunzip *.gtf.gz
GTF_FILE = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/gencode.v32.annotation.gtf'

# Download DNase hypersensitivity sites from encodeproject.org
# For mm10
# ! wget https://www.encodeproject.org/files/ENCFF108APF/@@download/ENCFF108APF.bed.gz
# ! gunzip *.gz
# Set this to None if DNASe file is not available
DNASE_FILE = None

# Output configuration
OUTPUT_DIR = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline'

Pipeline parameters configuration

In [2]:
# # Factors configuration, the numbers should be consistent with the number of merged samples
# FACTOR = 'factor'
# FACTORS_MAP = {1: 'f1', 2: 'f2'}

# # Aggregated barcodes always have suffix as a marker of original track
# def FACTOR_FUNCTION(x):
#     for k in [1, 2]:
#         if x.endswith(f'-{k}'):
#             return k

# # Cell calling filtration thresholds
# NOISE_THRESHOLD = 200
# DUPLETS_THRESHOLD = 8000

# # Default by Cell Ranger ATAC
# LSA_COMPONENTS = 15

# # 20 clusters is default by Cell Ranger ATAC
# N_CLUSTERS = 20
        
# # Markers configuration
# MARKERS_GENES = []

# # Export to Single Cell Explorer configuration
# SCE_TOKEN = '2020_scatacseq'
# SCE_NAME = 'Single cell ATAC-Seq'
# SCE_PEAK_GENE_TSS_MAX_DISTANCE = 10000
# SCE_PUBLIC = False
# SCE_ORGANISM = 'mm' # mm or hg are supported

# Factors configuration, the numbers should be consistent with the number of merged samples
FACTOR = 'age'
FACTORS_MAP = {1: 'young', 2: 'young', 3: 'young', 4: 'old', 5:'old', 6: 'old'}

# Aggregated barcodes always have suffix as a marker of original track
def FACTOR_FUNCTION(x):
    for k in FACTORS_MAP.keys():
        if x.endswith(f'-{k}'):
            return k

# Cell calling filtration thresholds
NOISE_THRESHOLD = 2000
DUPLETS_THRESHOLD = 20000

# Default by Cell Ranger ATAC
LSA_COMPONENTS = 15

# 20 clusters is default by Cell Ranger ATAC
N_CLUSTERS = 20
        
# Markers configuration
MARKERS_GENES = ['Igkc', 'Jchain', 'Ifit3', 'Ifit1', 'Mki67', 'Cd4', 'Gzmb', 'Pld4', 'Ighe', 'Ighd', 'Ighm',
                 'Cd19', 'Cd79a', 'Cd8b', 'Il7r', 'Tcf7', 'Cd3e', 'Cd3d', 'Ccr7', 'Cd1c', 'Fcer1a', 'Cd34',
                 'Gata2', 'Pf4', 'Ppbp', 'Sox4', 'Eomes', 'Gzmk', 'Klrg1', 'Trav1-2', 'Ccl5', 'Sell',
                 'Ikzf2', 'Foxp3', 'Gzma', 'Ncam1', 'Prf1', 'Trdc', 'Nkg7', 'Ncr1', 'Batf3', 'Clec9a', 
                 'Lyz', 'Flt3', 'Fcrg3a', 'Cd14', 'Cd44', 'Seprina1', 'Trgv9', 'Trgv2']

# Export to Single Cell Explorer configuration
SCE_TOKEN = '2019_scatacseq_human_3vs3'
SCE_NAME = 'Single cell ATAC-Seq human 3vs3'
SCE_PEAK_GENE_TSS_MAX_DISTANCE = 10000
SCE_PUBLIC = False
SCE_ORGANISM = 'hg' # mm or hg are supported

Pipeline utils

In [3]:
# Ensure all the libraries are installed
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import gc
import glob
import hashlib
import math
import matplotlib.pyplot as plt
import numpy as np
# Workaround for the issue: https://github.com/pandas-dev/pandas/issues/26314
# pip install pandas==0.21
import pandas as pd
import os
import re
import sys
import seaborn as sns
import umap
sns.set_style("whitegrid")
import tempfile
import pyBigWig
import json
import h5py
from scipy import io, sparse
from scipy.cluster import hierarchy

import matplotlib.cm as cm
from matplotlib.colors import LinearSegmentedColormap
from collections import Counter
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE
from sklearn.preprocessing import Normalizer

from pybedtools import BedTool
from tqdm.auto import tqdm
from scipy.stats import zscore
from matplotlib.backends.backend_pdf import PdfPages

# Import pipeline functions


# These functions are taken from the original Cell Ranger 10x ATAC pipeline
# See https://github.com/10XGenomics/cellranger/
from irlb import irlb
from diffexp import *

# Fix random seed
np.random.seed(0)
In [4]:
# Create pipeline output folders
figures_path = os.path.join(OUTPUT_DIR, 'figures')
! mkdir -p {figures_path}
! mkdir -p {OUTPUT_DIR}/cache
In [5]:
# Used memory analysis utility
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

# Cleanup
gc.collect()

# Print all allocated variables
def print_mem_usage():
    for name, size in sorted(((name, sys.getsizeof(value)) for name, value in globals().items()),
                             key= lambda x: x[1],
                             reverse=True)[:10]:
        print("Global {:>30}: {:>8}".format(name, sizeof_fmt(size)))    
    for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                             key= lambda x: x[1],
                             reverse=True)[:10]:
        print("Local {:>30}: {:>8}".format(name, sizeof_fmt(size)))

Visualize 10x Cell Ranger results

In [6]:
# clusters10xdf = pd.read_csv(CLUSTERS_FILE_10X, sep=',')
# clusters10xdf['Cluster'] = clusters10xdf['Cluster'].astype(int) - 1 # Clusters start with 1 in 10x
# clusters10xdf[FACTOR] = [FACTOR_FUNCTION(bc) for bc in clusters10xdf['Barcode']]
# # display(clusters10xdf.head())
# print('Total clusters', len(set(clusters10xdf['Cluster'])))
In [7]:
def clusters_sizes(df, value):
    df_sum = df[['cluster', value]].groupby(['cluster']).sum().reset_index()
    plt.figure(figsize=(10, 4))
    cmap=plt.cm.get_cmap('jet', len(df_sum))
    sns.barplot(x=df_sum.index, y=df_sum[value],
                palette=[cmap(i) for i in range(len(df_sum))])


def clusters_factors(df, value, FACTORS_MAP):
    df_sum = df[['cluster', value]].groupby(['cluster']).sum().reset_index()    
    df_sum_factor = df[['cluster', FACTOR, value]].groupby(['cluster', FACTOR]).sum().reset_index()
    plt.figure(figsize=(10, 4))
    cmap=plt.cm.get_cmap('jet', len(FACTORS_MAP))
    
    bottom = np.zeros(len(df_sum))
    for i, (k, v) in enumerate(FACTORS_MAP.items()):            
            heights = np.array(df_sum_factor.loc[df_sum_factor[FACTOR] == k][value]) /\
                np.array(df_sum[value]) * 100.0
            plt.bar(x = df_sum.index, height = heights, bottom=bottom, color = cmap(i), width=0.8)
            bottom += heights

    # Cosmetics
    plt.xticks(df_sum.index)
    plt.gca().xaxis.grid(False)
    plt.gca().margins(x=0.005)
    
    legend_colors = [plt.Rectangle((0,0), 1, 1, fc=cmap(i), edgecolor = 'none') 
                         for i in range(len(FACTORS_MAP))]
    legend_names = [v for _,v in FACTORS_MAP.items()]
    l = plt.legend(legend_colors, legend_names, prop={'size': 10}, bbox_to_anchor=(1.01, 0.99))
    l.draw_frame(False)

    plt.ylabel('Cluster composition ' + FACTOR + ' %')
    plt.xlabel('Cluster')
In [8]:
# print('Size of 10x clusters analysis')
# t = clusters10xdf.copy()
# t['counts'] = np.ones(len(t))
# t.rename(columns={'Cluster': 'cluster'}, inplace=True)
# t.index = t['Barcode']
# print('Summary by', FACTOR, FACTORS_MAP)
# display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())

# clusters_sizes(t[['cluster', 'counts', FACTOR]], 'counts')
# plt.show()

# clusters_factors(t[['cluster', 'counts', FACTOR]], 'counts', FACTORS_MAP)
# plt.show()

# # Cleanup memory
# del t
# gc.collect()
In [394]:
def plot_colored(xs, ys, c,
                 clusters=False, show_centers=False, 
                 size=10, alpha=0.5, no_ticks=True, fig_ax=None):
    """ Plot colored scatterplot """
    cmap = plt.cm.get_cmap('jet', len(set(c))) if clusters else plt.cm.viridis
    if fig_ax is not None:
        fig, ax = fig_ax
    else:
        fig, ax = plt.subplots(1, figsize=(10, 8))
    sc = ax.scatter(xs, ys, 
                    c=c, cmap=cmap,
                    marker='o',
                    alpha=alpha,
                    s=size)
    mesh = ax.get_children()[0]
    if no_ticks:
        plt.setp(ax, xticks=[], yticks=[])
    if clusters and show_centers:
        # For each cluster, we add a text with name
        for cluster in set(c):
            cluster_x = np.mean(xs.iloc[np.flatnonzero(c == cluster)])
            cluster_y = np.mean(ys.iloc[np.flatnonzero(c == cluster)])
            ax.text(cluster_x, cluster_y, str(cluster), 
                     horizontalalignment='center', size='large', color='black', weight='bold')            
    if clusters:
        cbar = plt.colorbar(mesh, ax=ax, boundaries=np.arange(len(set(c)) + 1) + min(c) - 0.5)
        cbar.set_ticks(np.arange(len(set(c))) + min(c))
        cbar.set_ticklabels(list(set(c)))
    else:
        plt.colorbar(mesh, ax=ax)

def n_rows_columns(n, maxcols=5):
    ncols = max([r for r in range(1, min(maxcols, math.ceil(math.sqrt(n)) + 1)) if n % r == 0])
    nrows = int(n / ncols)
    return nrows, ncols

    
def plot_colored_split(xs, ys, c, size=10, alpha=0.5, contrast = 100):
    """ Plot colored scatterplot split by factors"""
    cmap = plt.cm.get_cmap('jet', len(set(c)))
    nrows, ncols = n_rows_columns(len(set(c)))
    plt.subplots(nrows=nrows, ncols=ncols, figsize=(5 * ncols, 5 * nrows))
    xmin, xmax = np.min(xs), np.max(xs)
    xlim = (xmin - 0.1 * (xmax - xmin), xmax + 0.1 * (xmax - xmin)) # +10% margin
    ymin, ymax = np.min(ys), np.max(ys)
    ylim = (ymin - 0.1 * (ymax - ymin), ymax + 0.1 * (ymax - ymin)) # +10% margin
    for i, f in enumerate(tqdm(set(c))):
        ax = plt.subplot(nrows, ncols, i + 1)
        plt.setp(ax, xticks=[], yticks=[])
        plt.xlim(xlim)
        plt.ylim(ylim)
        plt.title(str(f))
        # Plot slightly visible other as gray
        nfxs = xs.iloc[np.flatnonzero(np.logical_not(np.equal(c, f)))]
        nfys = ys.iloc[np.flatnonzero(np.logical_not(np.equal(c, f)))]
        plt.scatter(nfxs, nfys, 
            color='gray',
            marker='o',
            alpha=alpha / contrast,
            s=size)
        # And factor data
        fxs = xs.iloc[np.flatnonzero(np.equal(c, f))]
        fys = ys.iloc[np.flatnonzero(np.equal(c, f))]
        plt.scatter(fxs, fys, 
                    color=cmap(i),
                    marker='o',
                    alpha=alpha,
                    s=size)
In [384]:
# print('10x Cell Ranger ATAC original T-SNE visualization')
# tsne10xpath = TSNE_FILE_10X
# tsne10xdf = pd.read_csv(tsne10xpath, sep=',')

# tsne_clusters_10xdf = pd.merge(left=tsne10xdf,right=clusters10xdf, left_on='Barcode', right_on='Barcode')

# plot_colored(tsne_clusters_10xdf['TSNE-1'], tsne_clusters_10xdf['TSNE-2'], tsne_clusters_10xdf['Cluster'], 
#              clusters=True,
#              show_centers=True,
#              size=20, alpha=0.5)
# plt.show()
In [11]:
# print('10x tSNE on LSA normalized vs', FACTOR, FACTORS_MAP)
# plot_colored(tsne_clusters_10xdf['TSNE-1'], tsne_clusters_10xdf['TSNE-2'], clusters10xdf[FACTOR], 
#              clusters=True, size=20, alpha=0.1)
# plt.show()
In [12]:
# Uncomment to plot split by factor
# with PdfPages(os.path.join(figures_path, '10x_tsne_clusters_split.pdf')) as pdf:
#     plot_colored_split(tsne_clusters_10xdf['TSNE-1'], tsne_clusters_10xdf['TSNE-2'], clusters10xdf[FACTOR], 
#                        size=5, alpha=0.1)
#     pdf.savefig()

Peaks file stats

In [13]:
with PdfPages(os.path.join(figures_path, 'peaks_length.pdf')) as pdf:
    peaks_df = pd.read_csv(PEAKS_FILE, sep='\t', header=None)
    print('Peaks', len(peaks_df))
    sns.distplot(np.log10(peaks_df[2] - peaks_df[1]))
    plt.title('Peak length distribution')
    plt.xlabel('Log10 Length')
    plt.ylabel('Frequency')
    pdf.savefig()
Peaks 93236

Overlap of peaks with representative DNAse hypersensitive sites

In [14]:
if DNASE_FILE is not None:
    dnase = BedTool(DNASE_FILE)
    peaks_file = BedTool(PEAKS_FILE)
    peak_count = peaks_file.count()
    overlap = peaks_file.intersect(dnase, wa=True, u=True).count()
    peak_by_dnase = 0 if overlap == 0 else overlap * 100.0 / peak_count
    overlap = dnase.intersect(peaks_file, wa=True, u=True).count()
    dnase_by_peak = overlap * 100.0 / dnase.count()
    print('Fraction of peaks overlapping with representative DNAse', int(peak_by_dnase), '%')
    print('Fraction of representative DNAse overlapping with peaks', int(dnase_by_peak), '%')

Annotate peaks with closest genes

Pipeline Preprocess the genes GTF into TSS coordinates and use bedtools closest -D b to report the distance to the closest TSS, upstream or downstream w.r.t. the gene strand.


Cell Ranger ATAC We use bedtools closest -D b to associate each peak with genes based on closest transcription start sites (packaged within the reference) such that the peak is within 1000 bases upstream or 100 bases downstream of the TSS.

In [15]:
def gtf_to_tss(gtf_file, transcripts_tss):
    print(f'GTF file {gtf_file} > {transcripts_tss}')
    GTF_TO_TSS_SH = """
# Compute transcript name field in GTF file
TNF=$(cat $1 | grep transcript_id | grep -v '#' | head -n 1 |\
    awk '{for (i=4; i<NF; i++) {if ($3=="transcript" && $i=="gene_name") print (i+1)}}')

# Process GTF file into transcripts
cat $1 |\
    awk -v TNF=$TNF -v OFS="\t" '{if ($3=="transcript") {print $1,$4-1,$5,$TNF,0,$7}}' |\
    tr -d '";' |\
    awk -v OFS="\t" '{if ($6=="+") {print $1,$2,$2+1,$4,0,"+"} else {print $1,$3-1,$3,$4,0,"-"}}' |\
    sort -k1,1 -k2,2n -k3,3n --unique > $2
    """
    with tempfile.NamedTemporaryFile(prefix='gtf_to_tss', suffix='.sh', delete=False) as f:
        f.write(GTF_TO_TSS_SH.encode('utf-8'))
        f.close()
        ! bash {f.name} {gtf_file} {transcripts_tss} 
In [16]:
transcripts_tss = os.path.join(OUTPUT_DIR, 'cache', os.path.basename(GTF_FILE) + '.tss')
if not os.path.exists(transcripts_tss):
    gtf_to_tss(GTF_FILE, transcripts_tss)
In [599]:
def locations_closest_genes(locations, transcripts_tss):
    print('Annotate list of locations in format "chr:start-end" with closest tss of genes')
    with tempfile.NamedTemporaryFile(delete=False) as f:
        # Save index to restore original order after sorting
        for i, t in enumerate([re.split(':|-', p) for p in locations]):
            f.write('{}\t{}\t{}\t{}\n'.format(*t, i).encode('utf-8'))
        f.close()
        closest_file = f.name + '.closest'
        sorted_file = f.name + '.sorted'
        ! sort -k1,1 -k2,2n {f.name} > {sorted_file}
        ! bedtools closest -a {sorted_file} -b {transcripts_tss} -D b > {closest_file}
        closest_df = pd.read_csv(
            closest_file, sep='\t', 
            names=['chr', 'start', 'end', 'index', 
                   'gene_chr', 'gene_start', 'gene_end', 'gene', 'gene_score', 'gene_strand', 'distance'])
        closest_df['gene_length'] = [len(g) for g in closest_df['gene']]
        # Sort by peak index and gene name length in case of several existing genes
        closest_df.sort_values(by=['index', 'gene_length'], inplace=True)
        # HACK: Pick only first closest gene with shortest name in case of several
        # Example: hg38 GZMK and AC034238.1 -> GZMK
        closest_df.drop_duplicates(['index'], inplace=True)
        return closest_df[['chr', 'start', 'end', 'gene', 'distance']]
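
The Cell Ranger rule quoted above (a peak within 1000 bases upstream or 100 bases downstream of the TSS) can be applied directly to the output of locations_closest_genes: with bedtools closest -D b the reported distance is signed w.r.t. the gene strand, negative values meaning upstream of the TSS. A minimal sketch of that window filter (the -1000/+100 window is Cell Ranger's convention; this pipeline instead filters on |distance| < 2000 below):

In [ ]:
# Hedged sketch: annotate all peaks and keep those within the Cell Ranger TSS window
annotated = locations_closest_genes(
    peaks_df[0] + ':' + peaks_df[1].astype(str) + '-' + peaks_df[2].astype(str),
    transcripts_tss
)
# With -D b, a negative distance means upstream of the TSS, positive means downstream
tss_window = annotated.loc[(annotated['distance'] >= -1000) & (annotated['distance'] <= 100)]
print(f'Peaks within the TSS window: {len(tss_window)} of {len(annotated)}')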

Find peaks closest to markers

In [600]:
def find_peaks_for_markers(peaks_df, markers_genes):
    print('Building gene peaks: the closest peak is mapped to the gene, the second closest gets suffix _1, etc')
    closest_genes = locations_closest_genes(
        peaks_df[0] + ':' + peaks_df[1].astype(str) + '-' + peaks_df[2].astype(str),
        transcripts_tss
    )
    # Restore 'peak' to be able to merge on it
    closest_genes['peak'] = closest_genes['chr'] +\
                            ':' + closest_genes['start'].astype(str) +\
                            '-' + closest_genes['end'].astype(str)
    t = closest_genes[['gene', 'peak']].copy()
    t['distance'] = np.abs(closest_genes['distance'])

    # Filter peak close to gene tss of markers_genes
    marker_genes_lower = set([s.lower() for s in markers_genes])
    t = t.loc[np.logical_and(t['distance'] < 2000, [g.lower() in marker_genes_lower for g in t['gene']])]
    t.sort_values(by=['gene', 'distance'], inplace=True)

    marker_regions_peaks_map = {}
    prev_gene = None
    prev_count = 1
    seen_genes = set()
    for _, (gene, peak, distance) in t.iterrows():
        if prev_gene != gene:
            marker_regions_peaks_map[gene] = peak
            seen_genes.add(gene.lower())
            prev_gene = gene
            prev_count = 1
        elif prev_count < 3:  # Max regions per gene
            marker_regions_peaks_map[f'{gene}_{prev_count}'] = peak
            prev_count += 1

    print(f'Found {len(marker_regions_peaks_map)} marker peaks close to TSS for {len(markers_genes)} genes')
    for g in markers_genes:
        if g.lower() not in seen_genes:
            print(f'Missing peak for {g}')
    for k, v in marker_regions_peaks_map.items():
        print(f'{k}: {v}')
    return marker_regions_peaks_map
In [601]:
# IMPORTANT: check these peaks, they are useful for potential cluster identification without BigWig visualization
marker_regions_peaks_map = find_peaks_for_markers(peaks_df, MARKERS_GENES)
Building gene peaks: the closest peak is mapped to the gene, the second closest gets suffix _1, etc
Annotate list of locations in format "chr:start-end" with closest tss of genes
Found 69 marker peaks close to TSS for 50 genes
Missing peak for Igkc
Missing peak for Jchain
Missing peak for Ighe
Missing peak for Ighd
Missing peak for Ighm
Missing peak for Fcer1a
Missing peak for Pf4
Missing peak for Ppbp
Missing peak for Trdc
Missing peak for Clec9a
Missing peak for Fcrg3a
Missing peak for Seprina1
BATF3: chr1:212698723-212703045
CCL5: chr17:35879692-35880845
CCL5_1: chr17:35878285-35878447
CCR7: chr17:40559971-40566397
CD14: chr5:140630684-140634020
CD14_1: chr5:140634855-140635445
CD19: chr16:28936042-28937103
CD19_1: chr16:28930671-28931126
CD1C: chr1:158290313-158291008
CD34: chr1:207890190-207890745
CD34_1: chr1:207910846-207911249
CD3D: chr11:118341714-118344926
CD3D_1: chr11:118338269-118340851
CD3E: chr11:118304170-118306326
CD3E_1: chr11:118311586-118311873
CD4: chr12:6786345-6787374
CD4_1: chr12:6788915-6789997
CD4_2: chr12:6812523-6813021
CD44: chr11:35137513-35148923
CD44_1: chr11:35214561-35215372
CD44_2: chr11:35176822-35178207
CD79A: chr19:41876828-41878415
CD79A_1: chr19:41875045-41876247
CD8B: chr2:86861618-86861998
EOMES: chr3:27719330-27724415
FLT3: chr13:28099927-28101128
FLT3_1: chr13:28101876-28102199
FOXP3: chrX:49268609-49270493
FOXP3_1: chrX:49257997-49258408
GATA2: chr3:128486587-128490741
GATA2_1: chr3:128491326-128493636
GATA2_2: chr3:128483048-128484300
GZMA: chr5:55101354-55103038
GZMA_1: chr5:55103938-55105442
GZMB: chr14:24631290-24634931
GZMK: chr5:55023932-55024403
IFIT1: chr10:89391803-89393430
IFIT3: chr10:89327448-89328601
IFIT3_1: chr10:89331992-89334548
IKZF2: chr2:213148644-213152799
IKZF2_1: chr2:213015737-213016210
IKZF2_2: chr2:213147508-213147954
IL7R: chr5:35852139-35860312
IL7R_1: chr5:35851098-35851220
KLRG1: chr12:8989337-8989976
KLRG1_1: chr12:8991755-8992722
KLRG1_2: chr12:8987048-8988241
LYZ: chr12:69348061-69349029
MKI67: chr10:128102851-128104120
MKI67_1: chr10:128125083-128127377
NCAM1: chr11:112960987-112963670
NCR1: chr19:54905524-54907338
NCR1_1: chr19:54903821-54904704
NKG7: chr19:51371846-51373956
PLD4: chr14:104924585-104925292
PLD4_1: chr14:104932864-104934354
PLD4_2: chr14:104926416-104927062
PRF1: chr10:70598809-70604586
SELL: chr1:169707874-169709099
SELL_1: chr1:169711115-169712476
SELL_2: chr1:169704633-169704651
SOX4: chr6:21593160-21597370
TCF7: chr5:134113227-134120351
TCF7_1: chr5:134121570-134126500
TCF7_2: chr5:134136866-134137895
TRAV1-2: chr14:21641156-21641451
TRGV2: chr7:38362340-38363879
TRGV9: chr7:38316583-38319257
TRGV9_1: chr7:38319804-38320340

The pipeline

Cell Calling

Pipeline uses the noise_threshold and duplets_threshold cutoffs configured above.


Cell Ranger ATAC aggregation: when combining data from multiple GEM groups, the cellranger-atac aggr pipeline automatically equalizes the sensitivity of the groups before merging, which is the recommended approach in order to avoid the batch effect introduced by sequencing depth. Default method: Subsample fragments from higher-depth GEM wells until they all have an equal number of unique fragments per cell.

For each barcode:

  • We have the record of mapped high-quality fragments that passed all filters (the fragments.tsv file).
  • Having determined peaks prior to this, we use the number of fragments that overlap any peak regions.
  • We separate the signal from noise.

This works better in practice as compared to naively using the number of fragments per barcode. We first subtract a depth-dependent fixed count from all barcode counts to model whitelist contamination. This fixed count is the estimated number of fragments per barcode that originated from a different GEM, assuming a contamination rate of 0.02. Then we fit a mixture model of two negative binomial distributions to capture the signal and noise. Setting an odds ratio of 1000, we separate the barcodes that correspond to real cells from the non-cell barcodes.
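
Cell Ranger fits a mixture of two negative binomials and classifies barcodes at an odds ratio of 1000. As a rough, hedged illustration of the mixture-model idea only, the sketch below fits a two-component Gaussian mixture on log-transformed per-barcode counts; barcode_counts is a hypothetical array of fragments-in-peaks per barcode, and this pipeline itself uses the fixed NOISE_THRESHOLD / DUPLETS_THRESHOLD cutoffs instead.

In [ ]:
from sklearn.mixture import GaussianMixture

# barcode_counts: hypothetical array of fragments overlapping peaks per barcode
log_counts = np.log10(barcode_counts + 1).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(log_counts)
signal = np.argmax(gmm.means_.ravel())  # the higher-mean component corresponds to cells
is_cell = gmm.predict(log_counts) == signal
print(f'Estimated cells: {is_cell.sum()} of {len(barcode_counts)} barcodes')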

Scasat If a cell has fewer open peaks in the peak accessibility matrix than a user-defined threshold (default: 50 peaks), we would remove that cell from subsequent analysis. Also, peaks not observed across a user-defined number of valid cells (default: 10 cells) are not considered for further downstream analysis.
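
A minimal sketch of this Scasat-style filtering, applied to the peaks x cells coverage dataframe built later in this notebook (the 50-peak and 10-cell defaults are from the quote above):

In [ ]:
# Binarize the coverage dataframe (peaks x cells)
binary_df = (coveragedf > 0).astype(int)
# Drop cells with fewer than 50 open peaks
cell_ok = binary_df.sum(axis=0) >= 50
# Drop peaks not observed in at least 10 valid cells
peak_ok = binary_df.loc[:, cell_ok].sum(axis=1) >= 10
print(f'Kept {peak_ok.sum()} peaks x {cell_ok.sum()} cells')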

SnapATAC takes an alternative approach - cell-by-bin matrices at 1kb, 5kb and 10kb resolution.

Existing computational methods rely on pre-defined regions of transposase accessibility identified from the aggregate signals. ...

Limitations:

  1. It requires a sufficient number of single-cell profiles to create a robust aggregate signal for peak calling.
  2. The cell type identification is biased toward the most abundant cell types in the tissues.
  3. These techniques lack the ability to reveal regulatory elements in rare cell populations that are underrepresented in the aggregate signal. This concern is critical, for example, in brain tissue, where key neuron types may represent less than 1% of all cells while still playing a critical role in the neural circuit.
In [20]:
# Intersect using bedtools directly rather than pybedtools, which is too slow for files of this size!
def intersect_fragments_and_peaks(fragments_file_gz, peaks_file, blacklist_file):
    print('Fragments')
    ! zcat {fragments_file_gz} | wc -l
    idf = None
    if blacklist_file is not None:
        print('Blacklist')
        ! wc -l {blacklist_file}
    print('Peaks')
    ! wc -l {peaks_file}

    with tempfile.TemporaryDirectory(prefix='pipeline', dir=OUTPUT_DIR) as td:
        print('Filtering out non-standard chromosomes')
        chr_filtered_file = os.path.join(td, 'chromosome_filtered.bed')
        ! zcat {fragments_file_gz} | grep -i -E 'chr[0-9mt]+[^_]' > {chr_filtered_file}
        ! wc -l {chr_filtered_file}

        if blacklist_file is not None:
            print('Blacklist regions filtration')
            blacklist_filtered_file = os.path.join(td, 'blacklist_filtered.bed')
            ! bedtools intersect -v -a {chr_filtered_file} -b {blacklist_file} > {blacklist_filtered_file}
            ! wc -l {blacklist_filtered_file}
            # Cleanup
            ! rm {chr_filtered_file}
        else:
            blacklist_filtered_file = chr_filtered_file

        print('Fragments and peaks intersection')
        intersection_file = os.path.join(td, 'intersection.bed')
        ! bedtools intersect -wa -wb -a {blacklist_filtered_file} -b {peaks_file} > {intersection_file}
        ! wc -l {intersection_file}
        # Cleanup
        ! rm {blacklist_filtered_file}

        idf = pd.read_csv(intersection_file, sep='\t', header=None)
        # Cleanup
        ! rm {intersection_file}        
        icolnames = ['chr', 'start', 'end', 'barcode', 'reads', 'peak_chr', 'peak_start', 'peak_end']
        assert len(idf.columns) >= len(icolnames)
        idf.rename(columns={o: n for (o, n) in zip(idf.columns[:len(icolnames)], icolnames)}, inplace=True)
        idf = idf[icolnames] # Reorder and drop others
        idf = idf.astype({'reads': 'int8'}) # Consume less memory
    print('Done intersecting of fragments and peaks')
    return idf
In [21]:
def load_or_compute_idf(fragments_file_gz, peaks_file, blacklist_file, cache=None):
    # Load or compute fragments and peaks intersection data frame
    if cache is None:
        cache = os.path.join(OUTPUT_DIR, 'cache')
    idfpath = os.path.join(cache, 'idf.csv.gz')
    if os.path.exists(idfpath):
        print(f'Loading IDF from {idfpath}')
        idf = pd.read_csv(idfpath, compression='gzip')
    else:
        print(f'No IDF file found {idfpath}')    
        idf = intersect_fragments_and_peaks(fragments_file_gz, peaks_file, blacklist_file)
        idf.to_csv(idfpath, index=False, compression='gzip')
        print(f'Saved IDF to {idfpath}')   
    return idf
In [169]:
def filter_cells(idf, noise_threshold=None, duplets_threshold=None):
    # 10x Genomics Cell Ranger ATAC-Seq marks duplicates
    # 'count' ignores multiple reads per barcode at same position
    pidf = pd.pivot_table(idf, values='reads', index=['barcode'], aggfunc='count')
    pidf.reset_index(level=0, inplace=True)
    
    print(f'Total barcodes intersecting peaks: {len(pidf)}')    
    plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
    
    plt.subplot(1, 3, 1)
    sns.distplot(np.log10(pidf['reads']))
    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')    

    # n is for shoulder graph
    counts = sorted(pidf['reads'], reverse=True)
    df_counts = pd.DataFrame(data={'count': counts, 'n': range(1, len(counts) + 1)})
    cells_filter = [True] * len(df_counts)

    plt.subplot(1, 3, 2)    
    if noise_threshold is not None:
        noise_filter = df_counts['count'] <= noise_threshold
        cells_filter = np.logical_and(cells_filter, np.logical_not(noise_filter))
        df_noise = df_counts.loc[noise_filter]
        print(f'Noise <= {noise_threshold} reads per barcode: {len(df_noise)}')
        plt.plot(np.log10(df_noise['n']), 
                 np.log10(df_noise['count']), 
                 label='Noise', linewidth=3, color='red')        

    if duplets_threshold is not None:
        duplets_filter = df_counts['count'] > duplets_threshold
        cells_filter = np.logical_and(cells_filter, np.logical_not(duplets_filter))
        df_duplets = df_counts.loc[duplets_filter]
        print(f'Duplets > {duplets_threshold} reads per barcode: {len(df_duplets)}')        
        plt.plot(np.log10(df_duplets['n']), 
                 np.log10(df_duplets['count']), 
                 label='Duplets', linewidth=3, color='orange')        

    df_cells = df_counts.loc[cells_filter]

    cells_filter = [True] * len(pidf)
    if noise_threshold is not None:
        cells_filter = np.logical_and(cells_filter, pidf['reads'] > noise_threshold)
    if duplets_threshold is not None:
        cells_filter = np.logical_and(cells_filter, pidf['reads'] < duplets_threshold)
        
    cells = set(pidf.loc[cells_filter]['barcode'])
    idfcells = idf.loc[idf['barcode'].isin(cells)]
    print(f'Estimated number of cells: {len(cells)}')
    plt.title('Cell calling')
    plt.plot(np.log10(df_cells['n']), 
             np.log10(df_cells['count']), 
             label='Cell', linewidth=3, color='green')

    plt.xlabel('Log10 number of barcodes')
    plt.ylabel('Log10 fragments overlapping Peaks')
    plt.legend(loc='upper right')

    plt.subplot(1, 3, 3)
    sns.distplot(np.log10(df_cells['count']))
    plt.title('Cells UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')
    
    return idfcells
In [235]:
def idfcells_from_idf(fragments_tsv_gz, peaks_file, blacklist_file, cache=None):    
    # Intersection of fragments with peaks dataframe
    idf = load_or_compute_idf(fragments_tsv_gz, peaks_file, blacklist_file, cache)
    with PdfPages(os.path.join(figures_path, 'cell_calling.pdf')) as pdf:
        idfcells = filter_cells(idf, noise_threshold=NOISE_THRESHOLD, duplets_threshold=DUPLETS_THRESHOLD)
        pdf.savefig()
    return idfcells

Peak-Barcode Matrix

Pipeline Compute the number of distinct UMIs that intersect each peak. Multiple copies of a UMI are ignored.


Cell Ranger ATAC-Seq We produce a count matrix consisting of the counts of fragment ends (or cut sites) within each peak region for each barcode. This is the raw peak-barcode matrix and it captures the enrichment of open chromatin per barcode. The matrix is then filtered to consist of only cell barcodes, which is then used in subsequent analysis such as dimensionality reduction, clustering and visualization.

A barcoded fragment may get sequenced multiple times due to PCR amplification. We mark duplicates in order to identify the original fragments that constitute the library and contribute to its complexity.

In [24]:
def compute_coverage(idfcells):
    print('Barcode vs summary fragments overlap with peaks')
    # 10x Genomics Cell Ranger ATAC-Seq marks duplicates
    # 'count' ignores multiple reads per barcode at same position
    cell_peak_cov_df = pd.pivot_table(idfcells, values='reads', 
                            index=['peak_chr', 'peak_start', 'peak_end', 'barcode'], 
                            aggfunc='count').reset_index()

    cell_peak_cov_df['peak'] = cell_peak_cov_df['peak_chr'] + ':' + \
        cell_peak_cov_df['peak_start'].astype(str) + '-' + cell_peak_cov_df['peak_end'].astype(str)
    cell_peak_cov_df.drop(columns=['peak_chr', 'peak_start', 'peak_end'], inplace=True)

    print('Transforming dataframe to peaks x barcode format')
    coveragedf = pd.pivot_table(cell_peak_cov_df, index='peak', columns='barcode', values='reads').fillna(0)
    # Remove extra labels from pivot_table columns
    coveragedf.columns = coveragedf.columns.values
    coveragedf.index.name = None
    return coveragedf
In [172]:
def load_or_compute_coverage(idfcells_computable, cache=None):
    if cache is None:
        cache = os.path.join(OUTPUT_DIR, 'cache')
    cov_mtx_path = os.path.join(cache, 'coverage.npz')
    barcodes_path = os.path.join(cache, 'barcodes.txt')
    peaks_path = os.path.join(cache, 'peaks.txt')

    if os.path.exists(cov_mtx_path):
        print(f'Loading coverage matrix {cov_mtx_path}')
        csc = sparse.load_npz(cov_mtx_path)
        barcodes = [l.rstrip('\n') for l in open(barcodes_path)]
        peaks = [l.rstrip('\n') for l in open(peaks_path)]
        print(f'Loading into dataframe')
        coveragedf = pd.DataFrame(csc.toarray(), index=peaks, columns=barcodes)
        coveragedf.fillna(0, inplace=True)
    else:
        print(f'No coverage matrix found {cov_mtx_path}')
        idfcells = idfcells_computable()
        coveragedf = compute_coverage(idfcells)
        print(f'Saving coverage matrix')
        csc = sparse.csc_matrix(coveragedf.values, dtype=int)
        sparse.save_npz(cov_mtx_path, csc)
        print('Saving barcodes')
        with open(barcodes_path, 'w') as f:
            f.write('\n'.join(coveragedf.columns))
        print('Saving peaks')
        with open(peaks_path, 'w') as f:
            f.write('\n'.join(coveragedf.index))
            
    print('DF', coveragedf.shape[0], 'peaks x', coveragedf.shape[1], 'cells')
    return coveragedf
In [236]:
# Computing main coverage dataframe (Peaks x Barcodes)
coveragedf = load_or_compute_coverage(
    lambda: idfcells_from_idf(FRAGMENTS_FILE_10X, PEAKS_FILE, BLACKLIST_FILE)
)
No coverage matrix found /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/coverage.npz
Loading IDF from /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/idf.csv.gz
Total barcodes intersecting peaks: 1538845
Noise <= 2000 reads per barcode: 1489881
Duplets > 20000 reads per barcode: 340
Estimated number of cells: 48624
Barcode vs summary fragments overlap with peaks
Transforming dataframe to peaks x barcode format
Saving coverage matrix
Saving barcodes
Saving peaks
DF 90746 peaks x 48624 cells

Coverage dataframe analysis

In [27]:
print('Summary coverage per UMI')
umi_coverage_log10 = np.log10(coveragedf.sum())
with PdfPages(os.path.join(figures_path, 'umi_coverage.pdf')) as pdf:
    sns.distplot(umi_coverage_log10)
    plt.title('Coverage per UMI')
    plt.xlabel('Log10 barcodes')
    plt.ylabel('Frequency')
    pdf.savefig()
Summary coverage per UMI
In [28]:
print('Peaks per cell with non-zero coverage distribution')
peaks_per_cell = np.count_nonzero(coveragedf, axis=0)
with PdfPages(os.path.join(figures_path, 'peaks_per_cell.pdf')) as pdf:
    sns.distplot(np.log10(peaks_per_cell))
    plt.title('Peaks with coverage > 0 per cell')
    plt.xlabel('Log10 peaks')
    plt.ylabel('Frequency')
    pdf.savefig()
Peaks per cell with non-zero coverage distribution
In [29]:
with PdfPages(os.path.join(figures_path, 'umis.pdf')) as pdf:
    print('Summary UMI distribution per factor')
    factors = [FACTOR_FUNCTION(bc) for bc in coveragedf.columns]
    for k, v in tqdm(FACTORS_MAP.items()):
        sns.kdeplot(np.log10(coveragedf[coveragedf.columns[np.flatnonzero(np.equal(factors, k))]].sum()), 
                    label=v)
    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')
    pdf.savefig()
Summary UMI distribution per factor

In [30]:
print('Cells per peak with non-zero coverage distribution')
cells_per_peak = np.count_nonzero(coveragedf, axis=1)
with PdfPages(os.path.join(figures_path, 'cells_per_peaks.pdf')) as pdf:
    sns.distplot(np.log10(cells_per_peak))
    plt.title('Cells with coverage > 0 per peak')
    plt.xlabel('Log10 cells')
    plt.ylabel('Frequency')
    pdf.savefig()
Cells per peak with non-zero coverage distribution
In [31]:
# Plot peak coverage vs peak length
with PdfPages(os.path.join(figures_path, 'peaks_coverage_vs_length.pdf')) as pdf:
    sns.jointplot(x=np.log10(cells_per_peak), 
                  y=[int(re.split(':|-', p)[2]) - int(re.split(':|-', p)[1]) for p in coveragedf.index], 
                  alpha=0.1)
    plt.xlabel('Log10 cells')
    plt.ylabel('Peaks length')
    plt.suptitle('Peak coverage vs peak length')
    pdf.savefig()
    plt.show()

Feature selection (Identification of highly variable features)

Pipeline

  • peaks with mean coverage above the 99th percentile are filtered out as alignment errors or housekeeping genes
  • peaks with standard deviation below the 1st percentile are filtered out as noise

Cell Ranger ATAC removes features with zero variance during feature selection.

SnapATAC The vast majority of the items in the cell-by-bin count matrix are “0”, while some items have abnormally high coverage (often > 200), perhaps due to alignment errors. Bins of exceedingly high coverage, which likely represent genomic regions that are invariable between cells such as housekeeping gene promoters, were removed. We noticed that filtering bins of extremely low coverage, perhaps due to random noise, can also improve the robustness of the downstream clustering analysis.

  • Calculated the coverage of each bin using the binary matrix
  • Removed the top 0.1% items of the highest coverage
  • Normalized the coverage by log10(count + 1)
  • The log-scaled coverage approximately obeys a Gaussian distribution, which is then converted into a z-score
  • Bins with a z-score beyond ±2 were filtered out before further analysis (see the sketch below)
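
As an illustration, a minimal sketch of this SnapATAC-style filtering, applied here to peaks instead of the fixed-size genomic bins SnapATAC actually uses:

In [ ]:
# Binary coverage of each peak (a stand-in for SnapATAC's bins)
bin_cov = np.count_nonzero(coveragedf.values, axis=1)
# Remove the top 0.1% items of the highest coverage
below_top = bin_cov < np.percentile(bin_cov, 99.9)
# Normalize the coverage by log10(count + 1) and convert to z-scores
z = zscore(np.log10(bin_cov[below_top] + 1))
# Keep bins with |zscore| <= 2
kept = np.flatnonzero(below_top)[np.abs(z) <= 2]
print(f'Kept {len(kept)} of {len(bin_cov)} bins')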

Signac The largely binary nature of scATAC-seq data makes it challenging to perform ‘variable’ feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in fewer than n cells with the FindTopFeatures function. Here, we use all features, though we note that we see very similar results when using only a subset of features (try setting min.cutoff to ‘q75’ to use the top 25% of all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as VariableFeatures for the Seurat object by this function. See https://satijalab.org/signac/articles/pbmc_vignette.html for an explanation.
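
For reference, a hedged Python sketch of the 'q75' cutoff described above (FindTopFeatures itself belongs to Signac, an R package, and is not used by this pipeline):

In [ ]:
# Keep peaks whose total count across cells is in the top 25% (min.cutoff='q75' analogue)
counts_per_peak = coveragedf.sum(axis=1)
top_features = coveragedf.index[counts_per_peak >= np.percentile(counts_per_peak, 75)]
print(f'Top features: {len(top_features)} of {len(coveragedf)}')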

Seurat calculates a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

In [32]:
print('Computing mean and std values for peaks...')
peaks_mean_std_df = pd.DataFrame(
    {'mean': np.mean(coveragedf.values, axis=1), 'std': np.std(coveragedf.values, axis=1)},
    index=coveragedf.index
)
Computing mean and std values for peaks...
In [33]:
with PdfPages(os.path.join(figures_path, 'feature_selection_before.pdf')) as pdf:
    # df is (peaks x barcodes)
    sns.jointplot(x='mean', y='std', data=peaks_mean_std_df, alpha=0.05)
    pdf.savefig()
    plt.show()
In [34]:
def feature_selection(peaks_mean_std_df, stdp=1, meanp=99):
    print('Total peaks', len(peaks_mean_std_df))
    print('Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.')
    means_high = np.percentile(peaks_mean_std_df['mean'], meanp)
    mean_filter = peaks_mean_std_df['mean'] < means_high
    print('Feature selection: filter out non-variable peaks.')          
    stds_low = np.percentile(peaks_mean_std_df['std'], stdp)
    std_filter = peaks_mean_std_df['std'] > stds_low
    peaks_filter = np.logical_and(mean_filter, std_filter)
    print('Filtered peaks', sum(peaks_filter))
    return peaks_filter
In [35]:
peaks_filter = feature_selection(peaks_mean_std_df)
Total peaks 90746
Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.
Feature selection: filter out non-variable peaks.
Filtered peaks 88916
In [36]:
with PdfPages(os.path.join(figures_path, 'feature_selection_after.pdf')) as pdf:
    sns.jointplot(x='mean', y='std', data=pd.DataFrame({'mean': peaks_mean_std_df.loc[peaks_filter]['mean'], 
                                                        'std': peaks_mean_std_df.loc[peaks_filter]['std']}),
                  alpha=0.05)
    pdf.savefig()
    plt.show()
In [37]:
# print('Check for markers')
# t = set(coveragedf.index[np.flatnonzero(peaks_filter)])
# for k,v in marker_regions_peaks_map.items():
#     if v not in t:
#         print(f'{k} {v} is missing in coveragedf')
In [38]:
# Feature selection is required only for LSA projection!
# # Filter out peaks
# coveragedf = coveragedf.loc[peaks_filter].copy()  # Copy is important here

Dimensionality Reduction, Clustering and visualization

Pipeline Uses the Cell Ranger ATAC approach and its IRLB code for dimensionality reduction.

Normalization

Cell Ranger uses normalization to the median coverage depth per barcode for PCA analysis, and doesn't do any normalization for LSA (the default scheme) - we do the same here.

Seurat we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
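
Neither of these normalizations is applied in this pipeline (LSA below uses IDF scaling of the raw counts instead), but for reference, hedged sketches of both schemes on the peaks x cells coverage dataframe:

In [ ]:
# Cell Ranger-style normalization to the median coverage depth per barcode (its PCA scheme)
depth = coveragedf.sum(axis=0)
median_norm = coveragedf * (np.median(depth) / depth)

# Seurat-style LogNormalize: scale each cell's counts to 10,000, then log-transform
log_norm = np.log1p(coveragedf.divide(depth, axis=1) * 10_000)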

SnapATAC does not require population-level peak annotation, and instead assembles chromatin landscapes by directly clustering cells based on the similarity of their genome-wide accessibility profile. Using a regression-based normalization procedure, SnapATAC adjusts for differing read depth between cells.

LSA (IDF + IRLBA + L2 normalization)

Cell Ranger ATAC default method is LSA.

  1. IDF scaling
  2. IRLBA SVD decomposition to dim 15
  3. L2 normalization in low dimension projection
  4. T-SNE visualization
  5. Clustering

IRLBA: The augmented implicitly restarted Lanczos bidiagonalization algorithm finds a few approximate largest singular values and corresponding singular vectors of a sparse or dense matrix using a method of Baglama and Reichel.

PCA if your features are least sensitive (informative) towards the mean of the distribution, then it makes sense to subtract the mean. If the features are most sensitive towards the high values, then subtracting the mean does not make sense.

SVD does not subtract the means but often as a first step projects the data on the mean of all data points. In this way the SVD first takes care of global structure - this is important for IDF transformation.

We found that the combination of these normalization techniques obviates the need to remove the first component. The number of dimensions is fixed to 15 as it was found to sufficiently separate clusters visually and in a biologically meaningful way when tested on peripheral blood mononuclear cells (PBMCs).
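
The irlb helper bundled from Cell Ranger is used below; if it were unavailable, scipy's truncated SVD would be a close substitute. A sketch, assuming the IDF-scaled matrix df_idf from the lsa function below (note that svds returns singular values in ascending order):

In [ ]:
from scipy.sparse.linalg import svds

# Truncated SVD to LSA_COMPONENTS dimensions, analogous to irlb(df_idf, dim)
u, s, vt = svds(sparse.csr_matrix(df_idf.values), k=LSA_COMPONENTS)
order = np.argsort(s)[::-1]               # reorder to descending singular values
projection = df_idf.values @ vt[order].T  # project: X -> X*v = u*d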

In [39]:
def lsa(coveragedf, dim):
    print('Normalizing by IDF')
    idf = np.log1p(coveragedf.shape[1]) - np.log1p(np.count_nonzero(coveragedf, axis=1))
    df_idf = coveragedf.multiply(idf, axis=0).T # Transpose to (barcodes x peaks) format
    
    print('Performing IRLBA without scaling or centering')
    # Number of dimensions recommended by Cell Ranger ATAC-Seq algorithm
    (_, _, v, _, _) = irlb(df_idf, dim)
    # project the matrix to complete the transform: X --> X*v = u*d
    df_idf_irlb = df_idf.dot(v) 
    assert df_idf_irlb.shape == (df_idf.shape[0], dim), 'check dimensions'

    # IMPORTANT! We found that the combination of these normalization techniques 
    # obviates the need to remove the first component.
    print('Normalizing each barcode data point to unit L2-norm in the lower dimensional space')
    idf_irlb_l2 = Normalizer(norm='l2').fit_transform(df_idf_irlb) # (n_samples, n_features) 
    return pd.DataFrame(idf_irlb_l2, index=df_idf_irlb.index, columns=df_idf_irlb.columns)
In [40]:
# Perform LSA on variable peaks. Peaks_filter is used only for this purpose!
lsa_coords = lsa(coveragedf.loc[peaks_filter], LSA_COMPONENTS)
Normalizing by IDF
Performing IRLBA without scaling or centering
Normalizing each barcode data point to unit L2-norm in the lower dimensional space

t-SNE visualization

In [41]:
print('Processing t-SNE on L2-normalized data')
tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(lsa_coords.values),
                           index=lsa_coords.index,
                           columns=['tsne1', 'tsne2'])
Processing t-SNE on L2-normalized data
In [42]:
print('tSNE of log10 coverage')
plot_colored(tsne_coords['tsne1'], tsne_coords['tsne2'], umi_coverage_log10, size=20)
plt.show()
tSNE of log10 coverage

Batch doublets filtration

In [43]:
coverage_filter = coveragedf.columns[np.flatnonzero(umi_coverage_log10 <= 4.2)]
print(f'Total cells {coveragedf.shape[1]}, filtered cells {len(coverage_filter)}')
plot_colored(tsne_coords['tsne1'].loc[coverage_filter],
             tsne_coords['tsne2'].loc[coverage_filter],
             umi_coverage_log10.loc[coverage_filter], size=20)
Total cells 48624, filtered cells 47950
In [44]:
# Filter by proper indices
# coveragedf = coveragedf[coverage_filter].copy()
# Filter current values or reprocess from scratch
# lsa_coords = lsa_coords.loc[coverage_filter].copy()
# tsne_coords = tsne_coords.loc[coverage_filter].copy()
In [45]:
print('tSNE on LSA normalized vs', FACTOR, FACTORS_MAP)
factors = np.array([FACTOR_FUNCTION(bc) for bc in lsa_coords.loc[coverage_filter].index])
plot_colored(tsne_coords['tsne1'].loc[coverage_filter], 
             tsne_coords['tsne2'].loc[coverage_filter],
             factors, clusters=True, size=20, alpha=0.1)
plt.show()
tSNE on LSA normalized vs age {1: 'young', 2: 'young', 3: 'young', 4: 'old', 5: 'old', 6: 'old'}
In [46]:
# # Uncomment to plot split by factor
# print(FACTORS_MAP)
# with PdfPages(os.path.join(figures_path, f'tsne_{FACTOR}_split.pdf')) as pdf:
#     plot_colored_split(
#         tsne_coords['tsne1'].loc[coverage_filter], 
#         tsne_coords['tsne2'].loc[coverage_filter],
#         factors, alpha=0.1
#     )
#     pdf.savefig()

Ward clustering of LSA normalized data

Clustering should

  • produce enough clusters (10-15 are recommended by Cell Ranger)
  • not produce too-small clusters (< 100 cells), which are likely artifacts
  • place cluster centers inside their clusters; otherwise we should increase the number of clusters OR increase the dimensionality of the low-dimensional space

Hierarchical clustering

  • Assumption: Ward's (Ward, 1963) and “complete” linkage assume round, equally-sized clusters, like k-means
  • Used in pcaReduce (Žurauskienė and Yau, 2016), SINCERA (Guo et al., 2015), and CIDR (Lin et al., 2017)
  • Dominating Method: Ward's clustering

Density-based

  • Does not assume clusters of a particular shape or size, but often assumes that all clusters are equally dense
  • Since density-based clustering requires a large number of samples to accurately estimate densities, it works well on droplet-based datasets
  • The dominant method is DBSCAN (Ester et al., 1996)
  • Used in Seurat (Macosko et al., 2015) combined with dimensionality reduction or a rare cell-type sensitive feature selection method in GiniClust (Jiang et al., 2016)

Graph clustering

  • Extension of density-based clustering specifically for data represented as a graph
  • Advantage: the ability to scale to millions of cells (Blondel et al., 2008, Danon et al., 2005, Rosvall and Bergstrom, 2008, Schaeffer, 2007)
  • Density in a graph can be measured as the number of edges connecting a set of cells and can be readily compared to a null hypothesis, e.g. a fully random graph or a degree-controlled random graph, using a metric called modularity.
  • Main method: Louvain algorithm (Blondel et al., 2008, Lancichinetti and Fortunato, 2009)
  • Used in PhenoGraph (Levine et al., 2015) and version 1.4 of Seurat.
  • The main drawback: scRNA-seq data does not have an inherent graph structure

See https://www.sciencedirect.com/science/article/pii/S0098299717300493
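
For comparison with the Ward clustering used below, a hedged sketch of Louvain graph clustering on a kNN graph of the LSA coordinates (assumes the networkx (< 3.0) and python-louvain packages, which this pipeline does not otherwise use):

In [ ]:
import networkx as nx
import community as community_louvain  # python-louvain package
from sklearn.neighbors import kneighbors_graph

# Build a kNN graph on the L2-normalized LSA coordinates and run Louvain on it
knn = kneighbors_graph(lsa_coords.values, n_neighbors=15, include_self=False)
g = nx.from_scipy_sparse_matrix(knn)  # from_scipy_sparse_array in networkx >= 3.0
partition = community_louvain.best_partition(g, random_state=0)
louvain_clusters = np.array([partition[i] for i in range(len(lsa_coords))])
print(f'Louvain found {len(set(louvain_clusters))} clusters')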

In [291]:
def reorder_clusters_by_size(clusters):
    clusters = np.array(clusters)
    cluster_counter = Counter()
    for c in clusters:
        cluster_counter[c] += 1
    clusters_reord = np.zeros(len(clusters), dtype=int)    
    for i, (c, n) in enumerate(cluster_counter.most_common()):
        clusters_reord[np.array(clusters) == c] = i
    return clusters_reord
In [216]:
def get_clusters(x, n_clusters):
    # Ward linkage method as default with euclidean distance
    print('Clustering', n_clusters)
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(x).astype(int)
    return reorder_clusters_by_size(clusters)
In [49]:
clusters = get_clusters(lsa_coords.loc[coverage_filter], N_CLUSTERS)
clusters_df = pd.DataFrame({'cluster': clusters}, index=coverage_filter)
Clustering 20
In [50]:
def clusters_summary(coveragedf, clusters):
    # Clusters summary by factors
    t = pd.DataFrame({FACTOR: [FACTOR_FUNCTION(bc) for bc in coveragedf.columns],
                     'cluster': clusters, 'counts': np.ones(len(coveragedf.columns))}, 
                     index=coveragedf.columns)
    clusters_sizes(t, 'counts')
    clusters_factors(t, 'counts', FACTORS_MAP)    

def clusters_hierarchy(X, clusters):
    # Hierarchical clustering of X - mean normalized LSA coordinates per cluster
    t = X.reset_index(drop=True).values
    clusters_means_X = pd.DataFrame({
        str(c): np.mean(t[np.flatnonzero(clusters == c), :], axis=0) for c in sorted(set(clusters))
    }).reset_index()
    clusters_linkage = hierarchy.linkage(
        clusters_means_X[[str(i) for i in set(clusters)]].T.values, method='ward'
    )
#     hierarchy.dendrogram(clusters_linkage, orientation='top')
#     hierarchy.set_link_color_palette(None)  # reset to default after use
    return clusters_linkage
In [51]:
clusters_summary(coveragedf[coverage_filter], clusters)
In [52]:
print('tSNE on LSA normalized vs Ward clusters')
plot_colored(tsne_coords['tsne1'].loc[coverage_filter],
             tsne_coords['tsne2'].loc[coverage_filter],
             clusters, clusters=True, show_centers=True, size=20, alpha=0.5)
plt.show()
tSNE on LSA normalized vs Ward clusters
In [53]:
def show_umi_coverage_per_cell_in_clusters(coveragedf, clusters):
    ts = []
    plt.figure(figsize=(int(len(set(clusters)) * 0.75), 6))
    for c in tqdm(set(clusters)):
        t = pd.DataFrame({'umi': np.log10(coveragedf.T.iloc[np.flatnonzero(clusters == c)].sum(axis=1) + 1)})
        t['cluster'] = str(c)
        ts.append(t)
    sns.violinplot(data=pd.concat(ts), x='cluster', y='umi', order=[str(c) for c in sorted(set(clusters))])
    plt.title('UMI summary coverage per cell')
    plt.xlabel('Cluster')
    plt.ylabel('Summary log10 coverage')
In [54]:
show_umi_coverage_per_cell_in_clusters(coveragedf[coverage_filter], clusters)
plt.show()

In [55]:
def show_cells_per_peak_in_clusters(coveragedf, clusters):
    ts = []
    plt.figure(figsize=(int(len(set(clusters)) * 0.75), 6))    
    for c in tqdm(set(clusters)):
        t = pd.DataFrame({
            'nzpeaks': np.log10(np.count_nonzero(coveragedf.T.iloc[np.flatnonzero(clusters == c)], axis=1) + 1)
        })
        t['cluster'] = str(c)
        ts.append(t)
    sns.violinplot(data=pd.concat(ts), x='cluster', y='nzpeaks', order=[str(c) for c in sorted(set(clusters))])
    plt.title('Peaks with non-zero coverage distribution per cell')
    plt.xlabel('Cluster')
    plt.ylabel('Log10 peaks with non-zero coverage')
In [56]:
show_cells_per_peak_in_clusters(coveragedf[coverage_filter], clusters)
plt.show()

In [123]:
def clusters_mean_coverage(coveragedf, clusters):
    print('Compute mean coverage per cluster')
    # Transpose to (barcodes x peaks) format
    t = coveragedf.T 
    # Per cluster mean values for peaks
    clusters_means = {str(c): t.loc[clusters == c].mean() for c in sorted(set(clusters))}
    return pd.DataFrame(clusters_means).reset_index()
    
def show_markers_mean_values(coveragedf, clusters, clusters_linkage, marker_regions_peaks_map):
    clusters_means_df = clusters_mean_coverage(coveragedf, clusters)
    print('Hierarchical clustering of coverage in markers')
    t = clusters_means_df[[str(c) for c in set(clusters)]].copy()
    t.index = clusters_means_df['index']
    mt = [(k, v) for k, v in marker_regions_peaks_map.items() if v in t.index]
    t = t.loc[[v for _, v in mt]]
    t.index = [k for k, _ in mt]
    # zscore per row
    # t = pd.DataFrame(zscore(t, axis=1), index=t.index, columns=t.columns)
    # 0-1 scale per row
    t = t.divide(t.max(axis=1), axis='rows')
    sns.clustermap(t, figsize=(1 + int(len(set(clusters)) / 3), 1 + int(len(mt) / 4)), cmap='bwr',
                   col_linkage=clusters_linkage)
In [124]:
# Visualize markers mean values    
clusters_linkage = clusters_hierarchy(lsa_coords.loc[coverage_filter], clusters)
show_markers_mean_values(coveragedf[coverage_filter], 
                         clusters, clusters_linkage, marker_regions_peaks_map)
plt.show()
Compute mean coverage per cluster
Hierarchical clustering of coverage in markers
In [59]:
# Group detected by global clustering
refined_group = (0, 6, 9, 3, 5, 7, 8, 13, 2, 17)
In [60]:
print_mem_usage()
Global                     coveragedf: 32.9 GiB
Global                     lsa_coords: 10.3 MiB
Global              peaks_mean_std_df:  8.3 MiB
Global                   peaks_filter:  7.0 MiB
Global                       peaks_df:  6.9 MiB
Global             umi_coverage_log10:  5.1 MiB
Global                    tsne_coords:  5.1 MiB
Global                    clusters_df:  5.0 MiB
Global                coverage_filter:  4.7 MiB
Global                 cells_per_peak: 709.0 KiB
Local                           name:   63.0 B
Local                           size:   28.0 B

Iterative clustering refinement

  • Pick clusters to refine
  • Create fragments.tsv.gz
  • Create reads in BED format for peak calling
  • Call peaks with MACS2 --keep-dup all --extsize 200 --shift 100 --nomodel (see the sketch below)
  • Clustering round 2 with 10 and 20 clusters
  • For each granularity 20, 10
    1. Remove doublets
    2. Fix meaningful clusters / groups of clusters
    3. Propagate the result back to the coarser granularity
  • Propagate back to the previous clustering round
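
A hedged sketch of the peak-calling step with the flags listed above, run on the per-group reads BED produced by create_fragments_peaks_bws below (the genome size -g hs and the output names are assumptions; rgn and rgn_dir are defined at the end of this section):

In [ ]:
# Call peaks on the refined group's reads with MACS2 (flags as listed above)
macs2_out = os.path.join(rgn_dir, 'macs2')
! mkdir -p {macs2_out}
! macs2 callpeak -t {rgn_dir}/cluster_{rgn}.bed -f BED -g hs -n cluster_{rgn} --outdir {macs2_out} --keep-dup all --extsize 200 --shift 100 --nomodel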

BigWig RPM normalized profiles

Pipeline Split UMI reads by clusters and use RPM normalization.


Seurat We created pseudo-bulk ATAC-seq profiles by pooling together cells for each cell type. Each cell type showed enriched accessibility near canonical marker genes. Chromatin accessibility tracks are normalized to sequencing depth (RPKM normalization) in each pooled group.

In [61]:
def write_bigwig(df, bw_path, step):
    chr_lengths = df[['chr', 'end']].groupby('chr').max().reset_index()
    with pyBigWig.open(bw_path, 'w') as bw:
        bw.addHeader(list(zip(chr_lengths['chr'], chr_lengths['end'])))
        for chromosome in tqdm(chr_lengths['chr']):
            df_chr = df.loc[df['chr'] == chromosome].sort_values(['start', 'end'])
            starts = list(df_chr['start'])
            ends = list(df_chr['end'])
            reads = list(df_chr['reads'])
            chr_length = int(chr_lengths.loc[chr_lengths['chr'] == chromosome]['end'])
            values = np.zeros(int(math.floor(chr_length / step)) + 1)

            for i in range(len(df_chr)):
                # Ignore PCR duplicates here!
                # v = reads[i] 
                v = 1.0
                si = int(math.floor(starts[i] / step)) 
                ei = int(math.floor(ends[i] / step)) 
                if ei == si:
                    values[si] += v
                else:
                    values[si] += v / 2 
                    values[ei] += v / 2

            non_zero = values > 0
            non_zero_inds = np.flatnonzero(non_zero)
            starts_np = non_zero_inds * step
            ends_np = (non_zero_inds + 1) * step
            values_np = values[non_zero]
            # RPM normalization (note: applied per chromosome, since values_np covers one chromosome)
            values_np = values_np * (1_000_000.0 / sum(values_np))
            chroms_np = np.array([chromosome] * sum(non_zero))
            bw.addEntries(chroms_np, starts=starts_np, ends=ends_np, values=values_np)                     

def write_cluster_bigwigs(fragments_tsv_gz, barcodes, clusters, bigwigs_path, step):
    df_for_bigwigs = get_fragments_and_clusters(fragments_tsv_gz, barcodes, clusters)

    for cluster in sorted(set(clusters)):
        bw_path = os.path.join(bigwigs_path, f'cluster_{cluster}.bw')
        print('Processing', cluster, bw_path)
        df_for_bigwigs_cluster = df_for_bigwigs.loc[df_for_bigwigs['cluster'] == cluster]
        write_bigwig(df_for_bigwigs_cluster, bw_path, step)

    print('Done')
In [62]:
def get_fragments_and_clusters(fragments_tsv_gz, barcodes, clusters):
    print('Loading full fragments dataframe')
    fdf = pd.read_csv(fragments_tsv_gz, sep='\t', 
                    names=['chr', 'start', 'end', 'barcode', 'reads'], compression='gzip')
    print('Aggregating fragments and clusters')
    t = pd.merge(left=fdf, 
                 right=pd.DataFrame({'barcode': barcodes, 'cluster': clusters}), 
                 left_on='barcode', right_on='barcode')
    return t
In [63]:
fragments_and_clusters = get_fragments_and_clusters(FRAGMENTS_FILE_10X, clusters_df.index, clusters)
Loading full fragments dataframe
Aggregating fragments and clusters
In [64]:
# Process cluster fragments.tsv.gz and clusters peaks, long operation!
def create_fragments_peaks_bws(fragments_and_clusters, refined_group):
    rgn = '_'.join(map(str, refined_group))
    print(f'Processing {rgn}')
    rgn_dir = os.path.join(OUTPUT_DIR, 'cache', rgn)
    ! mkdir -p {rgn_dir}
    bed_path = os.path.join(rgn_dir, f'cluster_{rgn}.bed')
    fragments_path = os.path.join(rgn_dir, f'cluster_{rgn}.tsv.gz')
    bw_path = os.path.join(rgn_dir, f'cluster_{rgn}.bw')
    t = fragments_and_clusters.loc[fragments_and_clusters['cluster'].isin(refined_group)]
    print('Saving reads', bed_path)
    t[['chr', 'start', 'end']].to_csv(bed_path, sep='\t', header=False, index=False)
    print('Saving fragments', fragments_path)
    t[['chr', 'start', 'end', 'barcode', 'reads']].to_csv(fragments_path, sep='\t', 
                                                          header=False, index=False,
                                                          compression='gzip')
    print('Saving visualization', bw_path)
    write_bigwig(t, bw_path, step=200)
In [65]:
# Prepare input for processing
create_fragments_peaks_bws(fragments_and_clusters, refined_group)
Processing 0_6_9_3_5_7_8_13_2_17
Saving reads /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/0_6_9_3_5_7_8_13_2_17/cluster_0_6_9_3_5_7_8_13_2_17.bed
Saving fragments /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/0_6_9_3_5_7_8_13_2_17/cluster_0_6_9_3_5_7_8_13_2_17.tsv.gz
Saving visualization /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/0_6_9_3_5_7_8_13_2_17/cluster_0_6_9_3_5_7_8_13_2_17.bw

In [66]:
# Memory cleanup
del fragments_and_clusters
In [67]:
rgn = '_'.join(map(str, refined_group))
rgn_dir = os.path.join(OUTPUT_DIR, 'cache', rgn)

# Call peaks with MACS2
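# Standard MACS2 flag semantics: --nomodel skips fragment-size model building,
# --shift/--extsize turn each read into a fixed 200 bp window, and
# --keep-dup all keeps duplicates (the 10x fragments file is already deduplicated per barcode)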
print('Launch MACS2 with command line')
print(f'cd {rgn_dir}; macs2 callpeak --keep-dup all --extsize 200 --shift 100 --nomodel \
    -t cluster_{rgn}.bed -f BED -n cluster_{rgn}')
Launch MACS2 with command line
cd /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/0_6_9_3_5_7_8_13_2_17; macs2 callpeak --keep-dup all --extsize 200 --shift 100 --nomodel     -t cluster_0_6_9_3_5_7_8_13_2_17.bed -f BED -n cluster_0_6_9_3_5_7_8_13_2_17
In [586]:
# IMPORTANT: inspect marker peaks here; this allows identifying potential clusters without BigWig visualization
rgn_peaks_df = pd.read_csv(os.path.join(rgn_dir, f'cluster_{rgn}_peaks.narrowPeak'), sep='\t', header=None)
rgn_marker_regions_peaks_map = find_peaks_for_markers(rgn_peaks_df, MARKERS_GENES)
Building gene peaks: the closest peak is mapped to the gene, the second closest gets suffix _1, etc
Annotate list of locations in format "chr:start-end" with closest TSS of genes
Found 70 marker peaks for 50 peaks close to TSS
Missing peak for Igkc
Missing peak for Ighe
Missing peak for Ighd
Missing peak for Ighm
Missing peak for Cd8b
Missing peak for Cd1c
Missing peak for Fcer1a
Missing peak for Pf4
Missing peak for Ppbp
Missing peak for Ncam1
Missing peak for Trdc
Missing peak for Clec9a
Missing peak for Lyz
Missing peak for Fcrg3a
Missing peak for Seprina1
BATF3: chr1:212699083-212701630
CCL5: chr17:35879997-35880841
CCR7: chr17:40565323-40565875
CCR7_1: chr17:40561025-40561383
CCR7_2: chr17:40560312-40560777
CD14: chr5:140632125-140633435
CD14_1: chr5:140635270-140635501
CD19: chr16:28936327-28937077
CD19_1: chr16:28930795-28931321
CD34: chr1:207910924-207911398
CD3D: chr11:118342731-118343038
CD3D_1: chr11:118339648-118339866
CD3D_2: chr11:118338594-118339381
CD3E: chr11:118304467-118305448
CD3E_1: chr11:118305870-118306275
CD4: chr12:6786617-6787290
CD4_1: chr12:6789340-6789838
CD4_2: chr12:6790844-6791286
CD44: chr11:35138779-35140030
CD44_1: chr11:35177145-35178007
CD44_2: chr11:35185936-35186136
CD79A: chr19:41875442-41876167
EOMES: chr3:27721169-27722711
FLT3: chr13:28100057-28101219
FOXP3: chrX:49257907-49258573
FOXP3_1: chrX:49257172-49257649
GATA2: chr3:128487561-128488207
GATA2_1: chr3:128490406-128490652
GATA2_2: chr3:128492381-128493318
GZMA: chr5:55101642-55102986
GZMA_1: chr5:55104075-55104521
GZMB: chr14:24634163-24634813
GZMB_1: chr14:24631983-24632333
GZMK: chr5:55024065-55024502
IFIT1: chr10:89392076-89392957
IFIT1_1: chr10:89393225-89393481
IFIT3: chr10:89327862-89328620
IFIT3_1: chr10:89332301-89332814
IFIT3_2: chr10:89333929-89334560
IKZF2: chr2:213151479-213152596
IKZF2_1: chr2:213015660-213016556
IL7R: chr5:35852566-35852894
IL7R_1: chr5:35856750-35857306
IL7R_2: chr5:35853704-35854877
JCHAIN: chr4:70668138-70668386
KLRG1: chr12:8949492-8950572
KLRG1_1: chr12:8970111-8970567
KLRG1_2: chr12:8989573-8990001
MKI67: chr10:128103034-128104219
MKI67_1: chr10:128126135-128127272
NCR1: chr19:54905777-54906476
NCR1_1: chr19:54906929-54907423
NCR1_2: chr19:54904494-54904781
NKG7: chr19:51372720-51373601
PLD4: chr14:104933172-104933888
PRF1: chr10:70602641-70604039
SELL: chr1:169708143-169709007
SELL_1: chr1:169711377-169712524
SELL_2: chr1:169701282-169701916
SOX4: chr6:21593537-21595726
TCF7: chr5:134114169-134115033
TCF7_1: chr5:134115348-134115885
TCF7_2: chr5:134140532-134141012
TRAV1-2: chr14:21642314-21642683
TRAV1-2_1: chr14:21643635-21643907
TRAV1-2_2: chr14:21641301-21641621
TRGV2: chr7:38362614-38363928
TRGV9: chr7:38318877-38319310
TRGV9_1: chr7:38317554-38318138
TRGV9_2: chr7:38316926-38317290
In [506]:
# Computing coverage dataframe (Peaks x Barcodes)
rgn_coveragedf = load_or_compute_coverage(
    lambda: idfcells_from_idf(
        os.path.join(rgn_dir, f'cluster_{rgn}.tsv.gz'), 
        os.path.join(rgn_dir, f'cluster_{rgn}_peaks.narrowPeak'), 
        BLACKLIST_FILE,
        cache=rgn_dir
    ),
    cache=rgn_dir
)
Loading coverage matrix /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_3vs3_human/pipeline/cache/0_6_9_3_5_7_8_13_2_17/coverage.npz
Loading into dataframe
DF 90490 peaks x 31233 cells
In [183]:
def factors_summary(coveragedf):
    t = pd.DataFrame(
        {FACTOR: [FACTOR_FUNCTION(bc) for bc in coveragedf.columns],
         'counts': np.ones(len(coveragedf.columns))}, 
        index=coveragedf.columns
    )
    display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())
In [184]:
factors_summary(rgn_coveragedf)
     counts
age
1    5454.0
2    5033.0
3    5972.0
4    4925.0
5    5109.0
6    4740.0
In [224]:
# Feature selection
rgn_peaks_filter = feature_selection(pd.DataFrame(
    {'mean': np.mean(rgn_coveragedf.values, axis=1), 'std': np.std(rgn_coveragedf.values, axis=1)},
    index=rgn_coveragedf.index
))
# Perform LSA on variable peaks. Peaks_filter is used only for this purpose!
rgn_lsa_coords = lsa(rgn_coveragedf.loc[rgn_peaks_filter], LSA_COMPONENTS)
Total peaks 90490
Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.
Feature selection: filter out non-variable peaks.
Filtered peaks 88679
Normalizing by IDF
Performing IRLBA without scaling or centering
Normalizing each barcode data point to unit L2-norm in the lower dimensional space
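
The log above summarizes the LSA steps. For orientation, they roughly correspond to the sketch below (the actual lsa helper is defined earlier in the notebook; this is an illustrative approximation, not the pipeline code):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

def lsa_sketch(counts, n_components):
    """counts: peaks x cells array; returns cells x components coordinates."""
    # IDF weighting: peaks covered in fewer cells get higher weight
    idf = np.log(1 + counts.shape[1] / (1 + (counts > 0).sum(axis=1)))
    weighted = counts * idf[:, np.newaxis]
    # Truncated SVD (IRLBA-like), without scaling or centering
    coords = TruncatedSVD(n_components=n_components).fit_transform(weighted.T)
    # Normalize each barcode to unit L2-norm in the reduced space
    return normalize(coords, norm='l2')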
In [225]:
print('Processing t-SNE on L2-normalized data')
rgn_tsne_coords = pd.DataFrame(
    TSNE(n_components=2).fit_transform(rgn_lsa_coords.values),
    index=rgn_lsa_coords.index,
    columns=['tsne1', 'tsne2']
)
Processing t-SNE on L2-normalized data
In [226]:
# The main difference is here: whether we remove non-variable peaks before UMI selection or not!
rgn_umi_coverage_log10 = np.log10(rgn_coveragedf.loc[rgn_peaks_filter].sum())
In [227]:
plot_colored(rgn_tsne_coords['tsne1'], rgn_tsne_coords['tsne2'], rgn_umi_coverage_log10, size=20)
In [228]:
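# Keep barcodes with log10 total coverage <= 4.1; higher-coverage barcodes are likely doublets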
rgn_coverage_filter = rgn_coveragedf.columns[np.flatnonzero(rgn_umi_coverage_log10 <= 4.1)] 
print(f'Total cells {rgn_coveragedf.shape[1]}, filtered cells {len(rgn_coverage_filter)}')
plot_colored(rgn_tsne_coords['tsne1'].loc[rgn_coverage_filter], 
             rgn_tsne_coords['tsne2'].loc[rgn_coverage_filter], 
             rgn_umi_coverage_log10.loc[rgn_coverage_filter], size=20)
Total cells 31233, filtered cells 31160
In [314]:
def get_clusters_with_summary(coveragedf, lsa_coords, tsne_coords, marker_regions_peaks_map, n_clusters):
    clusters = get_clusters(lsa_coords, n_clusters)
    # Summary
    clusters_summary(coveragedf, clusters)
    plt.show()
    # Plot clusters on t-SNE
    plot_colored(tsne_coords['tsne1'], tsne_coords['tsne2'], 
                 clusters, clusters=True, show_centers=True, size=20, alpha=0.5)
    plt.show()
    # Visualize coverage in clusters
    show_umi_coverage_per_cell_in_clusters(coveragedf, clusters)
    plt.show()
    # Visualize markers mean values    
    clusters_linkage = clusters_hierarchy(lsa_coords, clusters)
    show_markers_mean_values(coveragedf, clusters, clusters_linkage, marker_regions_peaks_map) 
    plt.show()
    return clusters
In [315]:
rgn_clusters_map = {}
for n_clusters in [10, 20]:
    rgn_clusters_map[n_clusters] = get_clusters_with_summary(
        rgn_coveragedf[rgn_coverage_filter], 
        rgn_lsa_coords.loc[rgn_coverage_filter], 
        rgn_tsne_coords.loc[rgn_coverage_filter], 
        rgn_marker_regions_peaks_map, 
        n_clusters
    )
Clustering 10

No handles with labels found to put in legend.
Compute mean coverage per cluster
Hierarchical clustering of coverage in markers
Clustering 20

No handles with labels found to put in legend.
Compute mean coverage per cluster
Hierarchical clustering of coverage in markers
In [316]:
print_mem_usage()
Global                     coveragedf: 32.9 GiB
Global                 rgn_coveragedf: 21.1 GiB
Global             refined_coveragedf: 234.7 MiB
Global                   rgn_peaks_df: 24.3 MiB
Global              clusters_means_df: 20.7 MiB
Global                     lsa_coords: 10.3 MiB
Global              peaks_mean_std_df:  8.3 MiB
Global                 rgn_lsa_coords:  7.1 MiB
Global                   peaks_filter:  7.0 MiB
Global               rgn_peaks_filter:  7.0 MiB
Local                           name:   65.0 B
Local                           size:   28.0 B

Project clusters back

  1. Remove doublets from the 20-cluster solution - see UMI coverage and markers coverage
  2. Fix some meaningful clusters based on markers in the fine-grained clustering
  3. Merge some clusters and project onto a smaller number of clusters
  4. Switch to the previous clustering
  5. Iterate until all the clusters are fixed for all the cells
In [507]:
# Configure clustering actions: remove cluster 19 as doublets, preserve selected fine-grained clusters;
# remaining cells fall back to the 10-cluster labels (earlier assignments take precedence)
clustering_actions = [
    (rgn_clusters_map[20], {'preserve': [7, 12, 14, 15, 16], 'remove': [19]}),
    (rgn_clusters_map[10], {'preserve': list(set(rgn_clusters_map[10])), 'remove': []})
]
In [508]:
def reorder_clusters_map_by_size(clusters_map):
    """Renumber the remaining clusters by size (via reorder_clusters_by_size);
       barcodes marked -1 (removed) are dropped."""
    result_clusters_map = {}
    clusters_kvs = list(clusters_map.items())
    preserve_clusters = [c for (_, c) in clusters_kvs if c != -1]
    for bc, c in zip([bc for (bc, c) in clusters_kvs if c != -1],
                     reorder_clusters_by_size(preserve_clusters)):
        result_clusters_map[bc] = c
    return result_clusters_map
In [509]:
refined_clusters_map = {}  # Here we'll collect clustering
refined_n_clusters = 0  # This will be used for reordering
rgn_cells = list(rgn_coveragedf[rgn_coverage_filter].columns)

for (clustering, actions) in clustering_actions:
    # Remove cells
    for bc in [rgn_cells[i] for i in np.flatnonzero(np.isin(clustering, actions['remove']))]:
        refined_clusters_map[bc] = -1  # Mark for removal
    # Preserve clustering
    for i, c in enumerate(actions['preserve']):
        for bc in [rgn_cells[j] for j in np.flatnonzero(clustering == c)]:
            if bc not in refined_clusters_map:
                refined_clusters_map[bc] = refined_n_clusters + i
        refined_n_clusters += 1

# Reorder clusters by sizes
refined_clusters_map = reorder_clusters_map_by_size(refined_clusters_map)
refined_n_clusters = len(set(filter(lambda c: c!=-1, refined_clusters_map.values())))
print('Result clusters', refined_n_clusters)
print('Result cells', len([c for c in refined_clusters_map.values() if c != -1]))
Result clusters 12
Result cells 30830

Final subclustering

In [540]:
refined_cells = [bc for bc in rgn_coveragedf.columns if 
                     bc in refined_clusters_map and refined_clusters_map[bc] != -1]
refined_coveragedf = rgn_coveragedf[refined_cells]
refined_lsa_coords = rgn_lsa_coords.loc[refined_cells]
refined_tsne_coords = rgn_tsne_coords.loc[refined_cells]
refined_clusters = [refined_clusters_map[bc] for bc in refined_cells]

# Plot clusters on t-SNE
plot_colored(refined_tsne_coords['tsne1'], refined_tsne_coords['tsne2'], refined_clusters, 
             clusters=True, show_centers=True, size=20, alpha=0.5)
plt.show()
# Visualize markers mean values 
refined_clusters_linkage = clusters_hierarchy(refined_lsa_coords, refined_clusters)
show_markers_mean_values(refined_coveragedf, refined_clusters, 
                         refined_clusters_linkage, rgn_marker_regions_peaks_map) 
plt.show()
Compute mean coverage per cluster
Hierarchical clustering of coverage in markers
In [513]:
# print_mem_usage()

Project subclusters back to global clustering

In [514]:
# Doublet clusters to remove
final_clusters_to_remove = {12}
In [515]:
final_clusters_map = {}
n_clusters = len(set(clusters))
for bc, c in zip(coveragedf[coverage_filter].columns, clusters):
    if c in refined_group:
        if bc in refined_clusters_map:
            final_clusters_map[bc] = n_clusters + refined_clusters_map[bc]
    else:
        if c not in final_clusters_to_remove:
            final_clusters_map[bc] = c
# Reorder clusters by sizes
final_clusters_map = reorder_clusters_map_by_size(final_clusters_map)

print('Result clusters', len(set(final_clusters_map.values())))
print('Result cells', len([c for c in final_clusters_map.values() if c != -1]))
Result clusters 21
Result cells 45674
In [539]:
final_cells = [bc for bc in coveragedf.columns if 
                     bc in final_clusters_map and final_clusters_map[bc] != -1]
final_coveragedf = coveragedf[final_cells]
final_lsa_coords = lsa_coords.loc[final_cells]
final_tsne_coords = tsne_coords.loc[final_cells]
final_clusters = [final_clusters_map[bc] for bc in final_cells]

# Plot clusters on t-SNE
plot_colored(final_tsne_coords['tsne1'], final_tsne_coords['tsne2'], final_clusters, 
             clusters=True, show_centers=True, size=20, alpha=0.5)
plt.show()

# Visualize markers mean values 
final_clusters_linkage = clusters_hierarchy(final_lsa_coords, final_clusters)
show_markers_mean_values(final_coveragedf, final_clusters, 
                         final_clusters_linkage, marker_regions_peaks_map) 
plt.show()
Compute mean coverage per cluster
Hierarchical clustering of coverage in markers
In [493]:
# print_mem_usage()

Cluster visualization vs. 10x Cell Ranger ATAC

In [ ]:
# # print('These clusters into Cell Ranger ATAC coordinates') 
# mergeddf1 = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index, 'Cluster': clusters}),
#                      right=tsne10xdf, 
#                      left_on='Barcode', right_on='Barcode')

# mergeddf2 = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index,
#                                   'tsne1': tsne_coords['tsne1'],
#                                   'tsne2': tsne_coords['tsne2']}),
#                      right=clusters10xdf, 
#                      left_on='Barcode', right_on='Barcode')


# fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 7))

# ax1.title.set_text('Clusters in 10x tsne coordinates')
# plot_colored(mergeddf1[['TSNE-1', 'TSNE-2']].values, 
#              mergeddf1['Cluster'], 
#              clusters=True,
#              show_centers=True,
#              size=20, alpha=0.3, fig_ax=(fig, ax1))

# ax2.title.set_text('10x clusters in tsne coordinates')
# plot_colored(mergeddf2[['tsne1', 'tsne2']].values, 
#              mergeddf2['Cluster'], 
#              clusters=True,
#              show_centers=True,
#              size=20, alpha=0.3, fig_ax=(fig, ax2))
# plt.show()

Differential peaks analysis (Markers)

Pipeline

Use Cell Ranger ATAC to get differential markers.


Cell Ranger ATAC For each peak we perform a test of one cluster vs. all others; since several clusters may yield significant results, we take into account only the cluster with the greatest mean value.\ The test is performed using the Cell Ranger default testing procedure based on a Negative Binomial GLM.
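
The "greatest mean value" rule can be applied to the resulting markers table in one step; a minimal sketch, assuming a dataframe with peak, cluster and norm_mean_cluster columns as produced by the function below:

# For peaks significant in several clusters, keep only the row
# with the maximal within-cluster mean
t = diff_markers.reset_index(drop=True)
best_markers = t.loc[t.groupby('peak')['norm_mean_cluster'].idxmax()]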

In [626]:
# Launch 10x Genomics Cell Ranger DE
def run_10x_differential_expression(df, clusters, fdr=0.05):
    """ Compute differential expression for each cluster vs all other cells
        Args: df          - feature expression data (peak x barcode)
              clusters    - 0-based cluster labels
              fdr         - false discovery rate"""

    # Accept plain lists as well as arrays for cluster labels
    clusters = np.asarray(clusters)

    print("Computing params...")
    m = df.values
    sseq_params = compute_sseq_params(m)
    peaks = list(df.index)    

    cluster_markers = []
    for cluster in tqdm(sorted(set(clusters))):
        in_cluster = clusters == cluster
        group_a = np.flatnonzero(in_cluster)
        group_b = np.flatnonzero(np.logical_not(in_cluster))
        print(f'Computing DE for cluster {cluster}...')

        de_result = sseq_differential_expression(m, group_a, group_b, sseq_params)
        de_result['cluster'] = [cluster] * len(de_result)
        de_result['peak'] = peaks
        passed_fdr = de_result['adjusted_p_value'] < fdr
        passed_log2_fc = de_result['log2_fold_change'] > 0
        # Keep peaks that pass the FDR threshold and have log2 fold change > 0
        passed = np.logical_and(passed_fdr, passed_log2_fc)
        print('DE:', sum(passed), 'of', len(de_result), 
              'FDR:', sum(passed_fdr), 'log2fc>0:', sum(passed_log2_fc))

        passed_de = de_result.loc[passed]
        cluster_de_markers = passed_de[
            ['cluster', 'peak', 'norm_mean_a', 'norm_mean_b', 'log2_fold_change', 'p_value', 'adjusted_p_value']
        ].rename(columns={'norm_mean_a': 'norm_mean_cluster', 'norm_mean_b': 'norm_mean_others'}) 

        cluster_markers.append(cluster_de_markers)    

    return pd.concat(cluster_markers)
In [627]:
def diff_markers_summary(diff_markers):
    print('Number of differential markers per cluster')
    t = diff_markers[['cluster']].copy()
    t['count'] = 1
    display(pd.pivot_table(t, values='count', index=['cluster'], aggfunc=np.sum))

Annotate markers with genes

In [628]:
def annotate_diff_markers(diff_markers, closest_genes):
    return pd.merge(
        left=diff_markers[['cluster', 'norm_mean_cluster', 'norm_mean_others', 
                           'log2_fold_change', 'p_value', 'adjusted_p_value', 'peak']],
        right=closest_genes,
        left_on='peak',
        right_on='peak')

def save_diff_markers(markers_with_genes, path):
    print('Saving all the markers', len(markers_with_genes), 'to', path)
    # rearrange columns, sort
    t = markers_with_genes[
        ['chr', 'start', 'end', 'peak',
         'cluster', 'norm_mean_cluster', 'norm_mean_others', 
         'log2_fold_change', 'p_value', 'adjusted_p_value', 
         'gene', 'distance']
    ].sort_values(by=['cluster', 'adjusted_p_value'])
    t.to_csv(path, sep='\t', index=None)

Analyzing top markers

In [629]:
def get_top_markers(df, unique_clusters, top=100):
    return pd.concat([
        df.loc[df['cluster'] == c].sort_values(by=['adjusted_p_value']).head(top) 
                    for c in unique_clusters])
In [630]:
def get_diff_markers_z(markers_with_genes, coveragedf, clusters):
    print('Computing Z scores for mean values in cluster for top markers')
    clusters = np.asarray(clusters)  # Accept plain lists as well as arrays
    top_markers = get_top_markers(markers_with_genes, sorted(set(clusters)), top=100)
    t = coveragedf.T # Transpose to (barcodes x peaks) format
    t = t[list(top_markers['peak'])] # top markers only

    t = pd.concat([pd.DataFrame(t.loc[clusters == cluster].mean()).T for cluster in sorted(set(clusters))])
    t.index = sorted(set(clusters))

    # Transform to Z-score
    for c in t.columns:
        t[c] = zscore(t[c])
    return t.T

Top markers visualization on t-SNE

In [631]:
def show_top_markers_z(t, xs, ys, markers_with_genes, clusters, n_clusters, top=100):
    top_markers = get_top_markers(markers_with_genes, sorted(set(clusters)), top)
    nrows, ncols = n_rows_columns(n_clusters)
    plt.subplots(nrows=nrows, ncols=ncols, figsize=(5 * ncols, 4 * nrows))
    for i, cluster in enumerate(tqdm(set(clusters))):
        # Table (marker for cluster x cells)
        markers_df = t.loc[top_markers.loc[top_markers['cluster'] == cluster]['peak']].T.copy()

        # Z scale for each marker peak across all cells
        for c in markers_df.columns:
            markers_df[c] = zscore(markers_df[c])

        # Average Z score for each cell
        zs = markers_df.T.mean()
        ax = plt.subplot(nrows, ncols, i + 1)
        # Average is much more stable than coverage, we can use linear colormap here
        plt.setp(ax, xticks=[], yticks=[])    
        # Plot zscores scatter plot with 2 segments colormap blue-white-red
        cmap = LinearSegmentedColormap.from_list(name='zscore', 
            colors=[(0, 'darkblue'), ((0 - np.min(zs)) / (np.max(zs) - np.min(zs)), 'white'), (1, 'red')])        
        sc = ax.scatter(xs, ys, s=5, c=zs, cmap=cmap, edgecolor='none', alpha=1)
#         sc = ax.scatter(xs, ys, s=5, c=zs, cmap=plt.cm.seismic, edgecolor='none', alpha=1)
        plt.colorbar(sc)
        plt.title(f'Cluster {cluster}')
In [632]:
def regions_plot(regions, df, xs, ys, xlabel, ylabel, factors, clusters, binary):
    print(f'Coverage z-score, distribution and fraction of non-zero covered peaks w.r.t. {FACTOR} and clusters')
    fregions = []
    for k, peak in regions.items():
        if peak in df.index:
            fregions.append((k, peak))
        else:
            print(f'{k} {peak} is missing')
            
    fig = plt.figure(figsize=(4*4, 4*len(fregions)))
    grid = plt.GridSpec(nrows=len(fregions), ncols=8, wspace=0.5, hspace=0.5, figure=fig)
    for i, (k, peak) in enumerate(fregions):
        # Missing peaks were already filtered out above
        vals = df.loc[peak]
        print(k, peak, 'cov>0:', sum(vals > 0), 'of', len(vals), 'max cov:', max(vals))

        ax = plt.subplot(grid[i, 0:2])
        # Plot zscores scatter plot with 2 segments colormap blue-white-red
        zs = zscore(vals)
        cmap = LinearSegmentedColormap.from_list(name='zscore', 
            colors=[(0, 'darkblue'), ((0 - np.min(zs)) / (np.max(zs) - np.min(zs)), 'white'), (1, 'red')])
        plt.setp(ax, xticks=[], yticks=[])
        ax.scatter(xs, ys, marker='.', c=zs, cmap=cmap, s=2, alpha=0.2)
        plt.title(k)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)        
        
        t = pd.DataFrame({'cluster': clusters, 'factor': factors, 
                          'val': [min(v, 1) for v in vals] if binary else vals})
        
        
        ax = plt.subplot(grid[i, 2])
        # mean non-zero covered peaks by type
        tf = pd.pivot_table(t, values='val', index=['factor'], aggfunc='mean').reset_index()
        sns.barplot(x='factor', y='val', data=tf)
        ax.tick_params(axis='x', labelrotation=90)
        plt.ylabel('')

        ax = plt.subplot(grid[i, 3:])
        # mean non-zero covered peaks by cluster and factor
        tc = t.groupby(['cluster', 'factor']).mean().reset_index()
        sns.barplot(x='cluster', y='val', data=tc, hue='factor')
        plt.ylabel('')
In [633]:
# print('t-SNE based Z-SCORE visualizations for markers, mean z-score for 100 top markers')
    
# with PdfPages(os.path.join(figures_path, f'tsne_diff_markers_100.pdf')) as pdf:    
#     show_top_markers_z(coveragedf, tsne_coords['tsne1'], tsne_coords['tsne2'], 
#                        markers_with_genes, clusters, n_clusters, 100)
#     pdf.savefig()
In [634]:
# print('t-SNE based Z-SCORE visualizations for markers, z-score for 1 top marker')
# t = coveragedf
# top_markers_1 = get_top_markers(markers_with_genes, n_clusters, top=1)
# top_markers_map = {}
# for i, row in top_markers_1.iterrows():
#     top_markers_map[str(row['cluster'])] = row['peak']
    
# with PdfPages(os.path.join(figures_path, 'tsne_diff_markers_1.pdf')) as pdf:
#     regions_plot(top_markers_map, coveragedf, 
#                  tsne_coords['tsne1'], tsne_coords['tsne2'],
#                  'tsne1', 'tsne2',
#                  factors, clusters, True)
#     pdf.savefig()

Save differential markers to BED

In [635]:
def save_differential_markers(clusters, diff_markers, cluster_peaks_path):
    ! mkdir -p {cluster_peaks_path}
    print('Cleanup peaks_clusters', cluster_peaks_path)
    for f in glob.glob(f'{cluster_peaks_path}/markers_*.bed'):
        os.remove(f)

    mlap_max = (-np.log10(diff_markers.loc[diff_markers['adjusted_p_value'] != 0]['adjusted_p_value'])).max()
    print('BED scoring as -log10 adjusted pval')
    print('Max of -log10 adjusted pval', mlap_max)

    for c in set(clusters):
        bed_file = os.path.join(cluster_peaks_path, f'markers_{c}.bed')
        markers_cluster = diff_markers.loc[diff_markers['cluster'] == c].sort_values(
            ['log2_fold_change'], ascending=True  # strongest markers last
        )
        markers_cluster.index = range(len(markers_cluster))
        print('Saving cluster', c, 'marker peaks to', bed_file)
        with open(bed_file, 'w') as bed:
            for i in range(len(markers_cluster)):
                peak = markers_cluster['peak'][i]
                ap = markers_cluster['adjusted_p_value'][i]
                mlap = -np.log10(ap) if ap != 0.0 else mlap_max
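                # Resulting BED6-like line: chrom, start, end,
                # name = cluster id, score = -log10(adjusted p-value), strand = '.'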
                line = '{}\t{}\t{}\t.\n'.format(
                    re.sub(':|-', '\t', peak), 
                    c,
                    mlap)
                bed.write(line)
    print('Done')                

Preparing data for Single Cell Explorer

Preprocess data for single cell explorer https://ctlab.github.io/SCE

In [636]:
def export_to_cse(df, clusters, tsne_coords, umap_coords, markers,
                  sce_path, token, name, public, organism,
                  markers_x_column='gene', markers_gene_column='peak'):
    print('SCE path', sce_path)
    print('SCE token', token)
    print('SCE name', name)
    print('SCE public', public)
    print('SCE organism', organism)
    
    # Sanity check indexes
    cells = list(df.columns)
    assert np.array_equal(cells, list(tsne_coords.index))
    assert np.array_equal(cells, list(umap_coords.index))
    assert len(cells) == len(clusters)

    print('Processing', 'dataset.json')
    with open(os.path.join(sce_path, 'dataset.json'), 'w') as f:
        f.write(json.dumps({
            'token': token,
            'public': public,
            'name': name,
            'organism': organism,
            'cells': df.shape[1]
        }))


    print('Processing', 'plot_data.json')
    fields = {
      "tSNE_1": {
        "type": "numeric",
        "range": [
          min(tsne_coords.loc[df.columns.values]['tsne1']),
          max(tsne_coords.loc[df.columns.values]['tsne1'])
        ]
      },
      "tSNE_2": {
        "type": "numeric",
        "range": [
          min(tsne_coords.loc[df.columns.values]['tsne2']),
          max(tsne_coords.loc[df.columns.values]['tsne2'])
        ]
      },
      "UMAP_1": {
        "type": "numeric",
        "range": [
          min(umap_coords.loc[df.columns.values]['umap1']),
          max(umap_coords.loc[df.columns.values]['umap1'])
        ]
      },
      "UMAP_2": {
        "type": "numeric",
        "range": [
          min(umap_coords.loc[df.columns.values]['umap2']),
          max(umap_coords.loc[df.columns.values]['umap2'])
        ]
      },
      "Cluster": {
        "type": "factor",
        "levels": [str(c) for c in set(clusters)]
      },
      "Age": {
        "type": "factor",
        "levels": list(FACTORS_MAP.values())
      },    
      "nUmi": {
        "type": "numeric",
        "range": [
          min(df.sum()),
          max(df.sum()),
        ]
      }
    }

    data = pd.DataFrame({
        "tSNE_1": tsne_coords.loc[df.columns.values]['tsne1'].tolist(),
        "tSNE_2": tsne_coords.loc[df.columns.values]['tsne2'].tolist(),
        "UMAP_1": umap_coords.loc[df.columns.values]['umap1'].tolist(),
        "UMAP_2": umap_coords.loc[df.columns.values]['umap2'].tolist(),
        "Cluster": [str(c) for c in clusters],  # use the passed-in cluster labels
        "Age": [FACTORS_MAP[FACTOR_FUNCTION(bc)] for bc in df.columns],
        "nUmi": df.sum().tolist(),
        "_row": list(df.columns)
    }).to_dict('records')

    annotations = {
      "tsne_Cluster_centers": {
        "type": "text",
        "value": "Cluster",
        "coords": [
          "tSNE_1",
          "tSNE_2"
        ],
        "data": [
          {
            "Cluster": str(c),
            # Cluster centers computed from the passed-in cluster labels
            "tSNE_1": float(np.mean(tsne_coords.loc[df.columns.values].values[
                np.flatnonzero(np.asarray(clusters) == c), 0])),
            "tSNE_2": float(np.mean(tsne_coords.loc[df.columns.values].values[
                np.flatnonzero(np.asarray(clusters) == c), 1])),
            "Text": str(c)
          } for c in set(clusters)
        ]
      },
    }

    with open(os.path.join(sce_path, 'plot_data.json'), 'w') as f:
        f.write(json.dumps({
            "fields": fields,
            "data": data,
            "annotations": annotations,
        }))


    # This file simply contains gene names, cell barcodes/names and total UMI per cell. 
    # This file must reflect row and column names of matrix data.h5 which contains expression data.    
    print('Processing', 'exp_data.json')
    with open(os.path.join(sce_path, 'exp_data.json'), 'w') as f:
        f.write(json.dumps({
          "genes": list(df.index.values),
          "barcodes": list(df.columns.values),
          "totalCounts": df.sum(axis=0).astype(int).tolist()
        }))

    print('Processing', 'markers.json')
    # Take into account only 100 top markers
    with open(os.path.join(sce_path, 'markers.json'), 'w') as f:
        f.write(json.dumps({
            "Cluster": [ 
              {
                  'X': row[markers_x_column],
                  'p_val': row['p_value'],
                  'p_val_adj': row['adjusted_p_value'],
                  'avg_logFC': row['log2_fold_change'],
                  'pct.1': row['norm_mean_cluster'],
                  'pct.2': row['norm_mean_others'],
                  'cluster': str(row['cluster']),
                  'gene': row[markers_gene_column],
                  'distance': row['distance']
              } for index, row in markers.iterrows()
          ]
        }))
                

    print('Processing', 'data.h5')
    ! rm -f {sce_path}/data.h5
    # Fix missing locks
    os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
    # Don't use pandas to_hdf5 because of unused overhead.
    # IMPORTANT: transpose matrix because of different order!
    # Column-major order is used by Fortran, Matlab, R, and most underlying core linear algebra libraries (BLAS). 
    # Row-major ordering is sometimes called “C” style ordering and column-major ordering “Fortran” style. 
    # Python/NumPy refers to the orderings in array flags as C_CONTIGUOUS and F_CONTIGUOUS, respectively.
    # https://cran.r-project.org/web/packages/reticulate/vignettes/arrays.html
    expdf = df.astype(int).clip(upper=1) # Clip coverage to 0-1 for better visualization
    with h5py.File(os.path.join(sce_path, 'data.h5'), 'w') as hf:
        hf.create_dataset("expression/mat", 
                          data=expdf.T.values, # Column-major ordering!
                          compression="gzip", compression_opts=9, 
                          dtype='i4') # 32-bit int is required here for SCE, see https://github.com/ctlab/SCE/issues/4

    print('Done')

Save single cell explorer data with respect to genes instead of peaks

In [637]:
def prepare_sce_for_genes(coveragedf, closest_genes, markers_with_genes):
    print(f'Associated genes {len(closest_genes)}')
    t = closest_genes[['gene', 'peak']].copy()
    t['distance'] = np.abs(closest_genes['distance'])
    t = t.loc[t['distance'] < SCE_PEAK_GENE_TSS_MAX_DISTANCE]
    print(f'Associated genes with distance to TSS < {SCE_PEAK_GENE_TSS_MAX_DISTANCE} {len(t)}')

    print('Building mapping peak -> gene: the closest peak is mapped to the gene, the second closest gets suffix _1, etc')
    t.sort_values(by=['gene', 'distance'], inplace=True)
    associated_genes_map = {}
    prev_gene = None
    prev_count = 1
    for _, (gene, peak, distance) in t.iterrows():
        if prev_gene != gene:
            associated_genes_map[peak] = gene
            prev_count = 1
            prev_gene = gene
        else:
            associated_genes_map[peak] = f'{gene}_{prev_count}'
            prev_count += 1

    print('Preparing coverage dataframe per gene')
    genes_filtereddf = coveragedf.loc[coveragedf.index.isin(associated_genes_map)].copy()
    genes_filtereddf.index = [associated_genes_map[peak] for peak in genes_filtereddf.index]

    print('Preparing top markers by gene')
    associated_markers_with_genes = \
        markers_with_genes.loc[np.abs(markers_with_genes['distance']) < SCE_PEAK_GENE_TSS_MAX_DISTANCE].copy()

    print('Updating marker gene names: peaks further from the gene get a numeric suffix')
    associated_markers_with_genes['gene'] = [
        associated_genes_map[f'{chr}:{start}-{end}'] for chr, start, end in 
            zip(associated_markers_with_genes['chr'],
                associated_markers_with_genes['start'],
                associated_markers_with_genes['end'])
    ]
    return genes_filtereddf, associated_markers_with_genes

Full analysis package includes:

  1. UMI coverage
  2. t-SNE projection
  3. Clustering
  4. UMAP projection
  5. Markers
  6. BigWigs
  7. SCE and SCE with genes
In [643]:
def prepare_full_package1(coveragedf, clusters, lsa_coords, tsne_coords, umap_coords,
                          marker_regions_peaks_map, 
                          name='final'):
    print('1. Saving tSNE coordinates')
    tsne_coords.to_csv(os.path.join(OUTPUT_DIR, f'{name}_tsne.tsv'), sep='\t')
    
    print('2. tSNE log10 coverage depth')
    with PdfPages(os.path.join(figures_path, f'{name}_tsne.pdf')) as pdf:
        plot_colored(tsne_coords['tsne1'], tsne_coords['tsne2'], np.log10(coveragedf.sum()), size=20)
        pdf.savefig()

    print('3. tSNE', FACTOR, FACTORS_MAP)
    factors = np.array([FACTOR_FUNCTION(bc) for bc in lsa_coords.index])
    with PdfPages(os.path.join(figures_path, f'{name}_tsne_{FACTOR}.pdf')) as pdf:
        plot_colored(tsne_coords['tsne1'], tsne_coords['tsne2'], 
                     factors, clusters=True, size=20, alpha=0.1)
        pdf.savefig()    
    
    print('4. Saving clusters')
    n_clusters = len(set(clusters))
    clusters_df = pd.DataFrame({'cluster': clusters}, index=lsa_coords.index)
    clusters_df.to_csv(os.path.join(OUTPUT_DIR, f'{name}_clusters_{n_clusters}.tsv'))
    
    print('5. Computing summary for clusters by', FACTOR)
    print(FACTORS_MAP)
    t = pd.DataFrame({FACTOR: [FACTOR_FUNCTION(bc) for bc in coveragedf.columns],
                     'cluster': clusters, 'counts': np.ones(len(coveragedf.columns))}, 
                     index=coveragedf.columns)
    print('6. Saving clusters sizes')    
    with PdfPages(os.path.join(figures_path, f'{name}_clusters_sizes.pdf')) as pdf:
        clusters_sizes(t, 'counts')
        pdf.savefig()

    print('7. Saving clusters composition by', FACTOR)    
    with PdfPages(os.path.join(figures_path, f'{name}_clusters_{FACTOR}.pdf')) as pdf:
        clusters_factors(t, 'counts', FACTORS_MAP)
        pdf.savefig()

    print('8. Saving UMI per clusters')    
    with PdfPages(os.path.join(figures_path, f'{name}_clusters_umi.pdf')) as pdf:
        show_umi_coverage_per_cell_in_clusters(coveragedf, clusters)
        pdf.savefig()

    print('9. Saving cells per peak')            
    with PdfPages(os.path.join(figures_path, f'{name}_clusters_cell_peaks.pdf')) as pdf:
        show_cells_per_peak_in_clusters(coveragedf, clusters)
        pdf.savefig()
    
    print('10. tSNE clusters')
    with PdfPages(os.path.join(figures_path, f'{name}_tsne_clusters.pdf')) as pdf:
        plot_colored(tsne_coords['tsne1'], tsne_coords['tsne2'], 
                     clusters, clusters=True, show_centers=True, size=20, alpha=0.5)
        pdf.savefig()    

    print('11. tSNE clusters split')        
    with PdfPages(os.path.join(figures_path, f'{name}_tsne_clusters_split.pdf')) as pdf:
        plot_colored_split(tsne_coords['tsne1'], tsne_coords['tsne2'], clusters, size=5)
        pdf.savefig()    
        
    print('12. Visualize markers mean values')
    clusters_linkage = clusters_hierarchy(lsa_coords, clusters)
    with PdfPages(os.path.join(figures_path, f'{name}_markers_heatmap.pdf')) as pdf:
        show_markers_mean_values(coveragedf, clusters, clusters_linkage, marker_regions_peaks_map)
        pdf.savefig()

    print('13. Analysing REGIONS of interest TSNE')
    with PdfPages(os.path.join(figures_path, f'{name}_tsne_markers.pdf')) as pdf:
        regions_plot(marker_regions_peaks_map, coveragedf, 
                     tsne_coords['tsne1'], tsne_coords['tsne2'],
                     'tsne1', 'tsne2',
                    [FACTORS_MAP[f] for f in factors], clusters, True)
        pdf.savefig()

    print('14. Saving UMAP coordinates to umap.tsv')
    umap_coords.to_csv(os.path.join(OUTPUT_DIR, f'{name}_umap.tsv'), sep='\t')    
    
    print('15. UMAP log10 coverage depth')
    with PdfPages(os.path.join(figures_path, f'{name}_umap.pdf')) as pdf:
        plot_colored(umap_coords['umap1'], umap_coords['umap2'], np.log10(coveragedf.sum()), size=20)
        pdf.savefig()
        
    print('16. UMAP', FACTOR, FACTORS_MAP)
    with PdfPages(os.path.join(figures_path, f'{name}_umap_{FACTOR}.pdf')) as pdf:
        factors = [FACTOR_FUNCTION(bc) for bc in lsa_coords.index]
        plot_colored(umap_coords['umap1'], umap_coords['umap2'], factors, clusters=True, size=20, alpha=0.01)
        pdf.savefig()
    
    print('17. Plot UMAP split by factor')
    print(FACTORS_MAP)
    with PdfPages(os.path.join(figures_path, f'{name}_umap_{FACTOR}_split.pdf')) as pdf:
        plot_colored_split(umap_coords['umap1'], umap_coords['umap2'], factors, size=5, alpha=0.1)
        pdf.savefig()
    
    print('18. UMAP clusters')
    with PdfPages(os.path.join(figures_path, f'{name}_umap_clusters.pdf')) as pdf:
        plot_colored(umap_coords['umap1'], umap_coords['umap2'], 
                     clusters, clusters=True, show_centers=True, size=20, alpha=0.3)
        pdf.savefig()
    
    print('19. UMAP clusters split')
    with PdfPages(os.path.join(figures_path, f'{name}_umap_clusters_split.pdf')) as pdf:
        plot_colored_split(umap_coords['umap1'], umap_coords['umap2'], clusters, size=5)
        pdf.savefig()  
                
    print('20. Analysing REGIONS of interest UMAP')
    with PdfPages(os.path.join(figures_path, f'{name}_umap_markers.pdf')) as pdf:
        regions_plot(marker_regions_peaks_map, coveragedf, 
                     umap_coords['umap1'], umap_coords['umap2'],
                     'umap1', 'umap2',
                     [FACTORS_MAP[f] for f in factors], clusters, True)
        pdf.savefig()

    print('21. Save cluster mean values to table with closest genes information')
    clusters_means_df = clusters_mean_coverage(coveragedf, clusters)
    closest_genes = locations_closest_genes(list(coveragedf.index), transcripts_tss)
    # Restore 'peak' to be able to merge on it
    closest_genes['peak'] = closest_genes['chr'] +\
                            ':' + closest_genes['start'].astype(str) +\
                            '-' + closest_genes['end'].astype(str)
    clusters_means_genes_df = pd.concat([closest_genes.reset_index(), clusters_means_df], axis=1)\
        .drop(columns=['index'])
    clusters_means_genes_df.to_csv(os.path.join(OUTPUT_DIR, f'{name}_clusters_peaks_values.tsv'), 
                                   sep='\t', index=False)
        
    print('Done')
In [644]:
def prepare_full_package2(fragments_tsv_gz, coveragedf, clusters,
                         name='final'):
    print('1. Saving bigwigs')
    bigwigs_path = os.path.join(OUTPUT_DIR, f'{name}_bigwig')
    ! mkdir -p {bigwigs_path}
    write_cluster_bigwigs(fragments_tsv_gz, list(coveragedf.columns), clusters, bigwigs_path, step=200)
    
    print('Done')
In [645]:
def prepare_full_package3(coveragedf, clusters, tsne_coords, umap_coords,
                          name='final'):
    print('1. Cell Ranger ATAC like differential analysis')
    # Each peak is tested independently, no normalization to peak length required.
    diff_markers = run_10x_differential_expression(coveragedf, clusters)
    print('Total diff markers', len(diff_markers))

    diff_markers_summary(diff_markers)    
    print('2. Saving differential markers')
    save_differential_markers(clusters, diff_markers, os.path.join(OUTPUT_DIR, f'{name}_markers'))    

    print('3. Saving differential markers with closest genes info')    
    closest_genes = locations_closest_genes(list(coveragedf.index), transcripts_tss)
    # Restore 'peak' to be able to merge on it
    closest_genes['peak'] = closest_genes['chr'] +\
                            ':' + closest_genes['start'].astype(str) +\
                            '-' + closest_genes['end'].astype(str)
    markers_with_genes = annotate_diff_markers(diff_markers, closest_genes)
    save_diff_markers(markers_with_genes, os.path.join(OUTPUT_DIR, f'{name}_markers.tsv'))

    print('4. Saving differential markers z scores')        
    diff_markers_z = get_diff_markers_z(markers_with_genes, coveragedf, clusters)
    with PdfPages(os.path.join(figures_path, f'{name}_markers_z.pdf')) as pdf:
        sns.clustermap(diff_markers_z, col_cluster=False, row_cluster=False, 
                       figsize=(10, 10), cmap=plt.cm.seismic)
        pdf.savefig()

    print('5. Saving SCE dataset')                
    sce_path = os.path.join(OUTPUT_DIR, f'{name}_sce')
    ! mkdir -p {sce_path}
    export_to_cse(coveragedf, clusters, tsne_coords, umap_coords, 
                  markers=markers_with_genes,
                  sce_path=sce_path,
                  token=f'{SCE_TOKEN}_{name}', 
                  name=f'{SCE_NAME} {name}',
                  public=SCE_PUBLIC,
                  organism=SCE_ORGANISM)
    
    print('6. Saving SCE genes dataset')                
    genes_sce_path = os.path.join(OUTPUT_DIR, f'{name}_sce_genes')
    ! mkdir -p {genes_sce_path}
    genes_filtereddf, associated_markers_with_genes =\
        prepare_sce_for_genes(coveragedf, closest_genes, markers_with_genes)
    export_to_cse(genes_filtereddf, clusters, tsne_coords, umap_coords, 
                  markers=associated_markers_with_genes,
                  sce_path=genes_sce_path,
                  token=f'{SCE_TOKEN}_{name}_genes', 
                  name=f'{SCE_NAME} {name} by genes',
                  public=SCE_PUBLIC,
                  organism=SCE_ORGANISM,
                  markers_x_column='peak', markers_gene_column='gene')    

    print('Done')
In [646]:
def prepare_full_package(fragments_tsv_gz, coveragedf, lsa_coords, clusters, marker_regions_peaks_map,
                         name='final',
                         tsne_coords=None, umap_coords=None):
    if tsne_coords is None:
        print('Processing t-SNE')
        tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(lsa_coords.values),
                                   index=lsa_coords.index,
                                   columns=['tsne1', 'tsne2'])

    if umap_coords is None:
        print('Processing UMAP')
        umap_coords = pd.DataFrame(umap.UMAP().fit_transform(lsa_coords.values)[:, :2], 
                                   index=lsa_coords.index,
                                   columns=['umap1', 'umap2'])

    # Sanity check indexes
    cells = list(coveragedf.columns)
    assert np.array_equal(cells, list(tsne_coords.index))
    assert np.array_equal(cells, list(umap_coords.index))
    assert len(clusters) == len(cells)
        
    print('Processing part 1')
    prepare_full_package1(coveragedf, clusters, lsa_coords, tsne_coords, umap_coords,
                          marker_regions_peaks_map, 
                          name)

    print('Processing part 2')
    prepare_full_package2(fragments_tsv_gz, coveragedf, clusters,
                          name)

    print('Processing part 3')
    prepare_full_package3(coveragedf, clusters, tsne_coords, umap_coords,
                          name)

Save full packages

In [ ]:
# Don't capture figures
%matplotlib agg

prepare_full_package(
    os.path.join(rgn_dir, f'cluster_{rgn}.tsv.gz'), 
    refined_coveragedf, refined_lsa_coords, refined_clusters, 
    rgn_marker_regions_peaks_map,
    name='nk_t',
    tsne_coords=refined_tsne_coords
)

# Plot inline figures
%matplotlib inline
In [ ]:
# Don't capture figures
%matplotlib agg

prepare_full_package(
    FRAGMENTS_FILE_10X, 
    final_coveragedf, final_lsa_coords, final_clusters, 
    marker_regions_peaks_map,
    name='pbmc',
    tsne_coords=final_tsne_coords
)

# Plot inline figures
%matplotlib inline