Single cell ATAC-seq explorer

This single-cell ATAC-seq analysis pipeline is designed for advanced analysis of datasets produced by 10X Genomics Cell Ranger ATAC. Aggregated datasets are also supported!

In addition to the 10X Genomics results, it offers:

  • Processing of aggregated data produced by 10x Genomics Cell Ranger ATAC
  • Per-condition summaries for aggregated datasets
  • Different types of clustering followed by heatmap and t-SNE/UMAP visualizations in a low-dimensional space
  • Visualization of top cluster markers as a heatmap and on t-SNE/UMAP plots
  • Closest-gene annotations for peaks and clusters
  • Annotated markers analysis
  • BigWig and BED files for clusters and markers, ready to be visualized in the JBR Genome Browser by JetBrains Research
  • Data preparation for the Single Cell Explorer by the Artyomov Lab, Washington University in St. Louis
  • All figures saved in PDF format - ready for publication

Required 10X Genomics Cell Ranger ATAC files:

  • fragments.tsv - fragments file provided by Cell Ranger
  • peaks.bed - peaks file (union of peaks in case of merging)
  • clusters.csv - graph clustering after median normalization, IDF scaling, SVD projection and L2 normalization
  • projection.csv - 2D t-SNE projection of all the cells

Pipeline steps:

  • Cell Calling
  • Peak-barcode matrix
  • Feature selection
  • Dimensionality reduction
  • t-SNE / UMAP
  • Differential markers analysis
  • Closest genes association
  • Supervised annotation of clusters by gene markers
  • Export data to Single Cell Explorer format

Other pipelines:

Source code is available at: https://github.com/JetBrains-Research/sc-atacseq-explorer. Questions and comments are welcome at os at jetbrains dot com.

Pipeline paths configuration

In [1]:
# # Configuration of 10X Genomics files
# FRAGMENTS_FILE_10X = 'fragments.tsv.gz'
# CLUSTERS_FILE_10X = 'clusters.csv'
# TSNE_FILE_10X = 'tsne.csv'
# PEAKS_FILE = 'peaks.bed'

# # See https://www.encodeproject.org/annotations/ENCSR636HFF/
# # Set this value to None if blacklist regions bed file is not available
# BLACKLIST_FILE = 'blacklist.bed'

# # For mm10 and hg38
# # ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
# # ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz
# # ! gunzip *.gtf.gz
# GTF_FILE = 'gencodegenes.gtf'

# # Download DNase hypersensitivity sites from encodeproject.org
# # For mm10
# # ! wget https://www.encodeproject.org/files/ENCFF108APF/@@download/ENCFF108APF.bed.gz
# # ! gunzip *.gz
# # Set this to None if the DNase file is not available
# DNASE_FILE = 'dnase.bed'


# # Output configuration
# OUTPUT_DIR = 'pipeline'
# Configuration of 10X Genomics files
FRAGMENTS_FILE_10X = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/cra_BAAJ/outs/fragments.tsv.gz'
CLUSTERS_FILE_10X = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/cra_BAAJ/outs/analysis/clustering/graphclust/clusters.csv'
TSNE_FILE_10X = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/cra_BAAJ/outs/analysis/tsne/2_components/projection.csv'
PEAKS_FILE = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/cra_BAAJ/outs/peaks.bed'

# See https://www.encodeproject.org/annotations/ENCSR636HFF/
# Set this value to None if blacklist regions bed file is not available
BLACKLIST_FILE = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/mm10.blacklist.bed'

# For mm10 and hg38
# ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.annotation.gtf.gz
# ! wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz
# ! gunzip *.gtf.gz
GTF_FILE = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/gencode.vM22.annotation.gtf'

# Download DNase hypersensitivity sites from encodeproject.org
# For mm10
# ! wget https://www.encodeproject.org/files/ENCFF108APF/@@download/ENCFF108APF.bed.gz
# ! gunzip *.gz
# Set this to None if the DNase file is not available
DNASE_FILE = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/ENCFF108APF.bed'

# Output configuration
OUTPUT_DIR = '/mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline'

Pipeline parameters configuration

In [365]:
# # Factors configuration; the number of factors should be consistent with the number of merged datasets
# FACTOR = 'factor'
# FACTORS_MAP = {1: 'f1', 2: 'f2'}

# # Aggregated barcodes always have suffix as a marker of original track
# def FACTOR_FUNCTION(x):
#     for k in [1, 2]:
#         if x.endswith(f'-{k}'):
#             return k

# # Cell calling filtration thresholds
# NOISE_THRESHOLD = 200
# DUPLETS_THRESHOLD = 8000

# # Default by Cell Ranger ATAC
# LSA_COMPONENTS = 15

# # 20 clusters is default by Cell Ranger ATAC
# N_CLUSTERS = 20
        
# # Markers configuration
# MARKERS_GENES = []

# # Export to Single Cell Explorer configuration
# SCE_TOKEN = '2020_scatacseq'
# SCE_NAME = 'Single cell ATAC-Seq'
# SCE_PEAK_GENE_TSS_MAX_DISTANCE = 10000
# SCE_PUBLIC = False
# SCE_ORGANISM = 'mm' # mm or hg are supported

# Factors configuration; the number of factors should be consistent with the number of merged datasets
FACTOR = 'age'
FACTORS_MAP = {1: 'young', 2: 'old'}

# Aggregated barcodes always have suffix as a marker of original track
def FACTOR_FUNCTION(x):
    for k in FACTORS_MAP.keys():
        if x.endswith(f'-{k}'):
            return k

# Cell calling filtration thresholds
NOISE_THRESHOLD = 300
DUPLETS_THRESHOLD = 8000

# Default by Cell Ranger ATAC
LSA_COMPONENTS = 15

# 20 clusters is default by Cell Ranger ATAC
N_CLUSTERS = 20
        
# Markers configuration
MARKERS_GENES = ['Sell', 'Lcn2', 'Cd79a', 'Iglc1', 'Fcer1a', 'Gata2', 'Il4', 'Il1rl1', 'Arg1', 'Il5',
                'Fn1', 'Lyz2', 'Cx3cr1', 'Cd44', 'Cd36', 'Adgre1', 'Itgax', 'Siglecf', 'Gdf15', 'Mki67',
                'Flt3', 'Il12b', 'Cd209a', 'Foxp3', 'Tnfsf4', 'Cd4', 'Lef1', 'Tcf7', 'Cd3d', 'Il23r',
                'Tcrg-C1', 'Cd8b1', 'Cd8a', 'Gzmk', 'Ncr1', 'Eomes', 'Ccl5']

# Export to Single Cell Explorer configuration
SCE_TOKEN = '2019_scatacseq_mice_lungs_1vs1'
SCE_NAME = '2019_scatacseq_mice_lungs_1vs1'
SCE_PEAK_GENE_TSS_MAX_DISTANCE = 10000
SCE_PUBLIC = False
SCE_ORGANISM = 'mm10' # mm or hg are supported

Pipeline utils

In [3]:
# Ensure all the libraries are installed
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import gc
import glob
import math
import matplotlib.pyplot as plt
import numpy as np
# Workaround for the issue: https://github.com/pandas-dev/pandas/issues/26314
# pip install pandas==0.21
import pandas as pd
import os
import re
import sys
import seaborn as sns
import umap
sns.set_style("whitegrid")
import tempfile
import pyBigWig
import json
import h5py
from scipy import io, sparse
from scipy.cluster import hierarchy

from matplotlib.colors import LinearSegmentedColormap
from collections import Counter
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE
from sklearn.preprocessing import Normalizer

from pybedtools import BedTool
from tqdm.auto import tqdm
from scipy.stats import zscore
from matplotlib.backends.backend_pdf import PdfPages

# These functions are taken from the original Cell Ranger 10x ATAC pipeline
# See https://github.com/10XGenomics/cellranger/
from irlb import irlb
from diffexp import *

# Fix random seed
np.random.seed(0)
In [4]:
# Create pipeline output folders
figures_path = os.path.join(OUTPUT_DIR, 'figures')
! mkdir -p {figures_path}
! mkdir -p {OUTPUT_DIR}/cache
In [5]:
# Used memory analysis utility
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

# Cleanup
gc.collect()

# Print all allocated variables
def print_mem_usage():
    for name, size in sorted(((name, sys.getsizeof(value)) for name, value in globals().items()),
                             key= lambda x: x[1],
                             reverse=True)[:10]:
        print("Global {:>30}: {:>8}".format(name, sizeof_fmt(size)))    
    for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                             key= lambda x: x[1],
                             reverse=True)[:10]:
        print("Local {:>30}: {:>8}".format(name, sizeof_fmt(size)))

Visualize 10x Cell Ranger results

In [6]:
clusters10xdf = pd.read_csv(CLUSTERS_FILE_10X, sep=',')
clusters10xdf['Cluster'] = clusters10xdf['Cluster'].astype(int) - 1 # Clusters start with 1 in 10x
clusters10xdf[FACTOR] = [FACTOR_FUNCTION(bc) for bc in clusters10xdf['Barcode']]
# display(clusters10xdf.head())
print('Total clusters', len(set(clusters10xdf['Cluster'])))
Total clusters 16
In [7]:
def clusters_sizes(df, value):
    df_sum = df[['cluster', value]].groupby(['cluster']).sum().reset_index()
    plt.figure(figsize=(10, 4))
    cmap=plt.cm.get_cmap('jet', len(df_sum))
    sns.barplot(x=df_sum.index, y=df_sum[value],
                palette=[cmap(i) for i in range(len(df_sum))])


def clusters_factors(df, value, FACTORS_MAP):
    df_sum = df[['cluster', value]].groupby(['cluster']).sum().reset_index()    
    df_sum_factor = df[['cluster', FACTOR, value]].groupby(['cluster', FACTOR]).sum().reset_index()
    plt.figure(figsize=(10, 4))
    cmap=plt.cm.get_cmap('jet', len(FACTORS_MAP))
    
    bottom = np.zeros(len(df_sum))
    for i, (k,v) in enumerate(FACTORS_MAP.items()):            
            heights = np.array(df_sum_factor.loc[df_sum_factor[FACTOR] == k][value]) / np.array(df_sum[value]) * 100.0
            plt.bar(x = df_sum.index, height = heights, bottom=bottom, color = cmap(i), width=0.8)
            bottom += heights

    # Cosmetics
    plt.xticks(df_sum.index)
    plt.gca().xaxis.grid(False)
    plt.gca().margins(x=0.005)
    
    legend_colors = [plt.Rectangle((0,0), 1, 1, fc=cmap(i), edgecolor = 'none') 
                         for i in range(len(FACTORS_MAP))]
    legend_names = [v for _,v in FACTORS_MAP.items()]
    l = plt.legend(legend_colors, legend_names, prop={'size': 10}, bbox_to_anchor=(1.01, 0.99))
    l.draw_frame(False)

    plt.ylabel('Cluster composition ' + FACTOR + ' %')
    plt.xlabel('Cluster')
In [8]:
print('Size of clusters analysis')
t = clusters10xdf.copy()
t['counts'] = np.ones(len(t))
t.rename(columns={'Cluster': 'cluster'}, inplace=True)
t.index = t['Barcode']
print('Summary by', FACTOR, FACTORS_MAP)
display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())

clusters_sizes(t[['cluster', 'counts', FACTOR]], 'counts')
plt.show()

clusters_factors(t[['cluster', 'counts', FACTOR]], 'counts', FACTORS_MAP)
plt.show()

# Cleanup memory
del t
gc.collect()
Size of clusters analysis
Summary by age {1: 'young', 2: 'old'}
     counts
age
1    7582.0
2    8838.0
Out[8]:
9980
In [9]:
def plot_colored(data, c,
                 clusters=False, show_centers=False, 
                 size=10, alpha=0.5, no_ticks=True, fig_ax=None):
    """ Plot colored scatterplot """
    cmap = plt.cm.get_cmap('jet', len(set(c))) if clusters else plt.cm.viridis
    if fig_ax is not None:
        fig, ax = fig_ax
    else:
        fig, ax = plt.subplots(1, figsize=(10, 8))
    sc = ax.scatter(data[:, 0], data[:, 1], 
                    c=c, cmap=cmap,
                    marker='o',
                    alpha=alpha,
                    s=size)
    mesh = ax.get_children()[0]
    if no_ticks:
        plt.setp(ax, xticks=[], yticks=[])
    if clusters and show_centers:
        # For each cluster, we add a text with name
        for cluster in set(c):
            cluster_mean = data[np.flatnonzero(c == cluster), :].mean(axis=0)
            ax.text(cluster_mean[0], cluster_mean[1], str(cluster), 
                     horizontalalignment='center', size='large', color='black', weight='bold')            
    if clusters:
        cbar = plt.colorbar(mesh, ax=ax, boundaries=np.arange(len(set(c)) + 1) + min(c) - 0.5)
        cbar.set_ticks(np.arange(len(set(c))) + min(c))
        cbar.set_ticklabels(list(set(c)))
    else:
        plt.colorbar(mesh, ax=ax)

def n_rows_columns(n, maxcols=5):
    ncols = max([r for r in range(1, min(maxcols, math.ceil(math.sqrt(n)) + 1)) if n % r == 0])
    nrows = int(n / ncols)
    return nrows, ncols

    
def plot_colored_split(data, c, size=10, alpha=0.5, contrast = 100):
    """ Plot colored scatterplot split by factors"""
    cmap = plt.cm.get_cmap('jet', len(set(c)))
    nrows, ncols = n_rows_columns(len(set(c)))
    plt.subplots(nrows=nrows, ncols=ncols, figsize=(5 * ncols, 5 * nrows))
    xmin, xmax = np.min(data[:,0]), np.max(data[:,0])
    xlim = (xmin - 0.1 * (xmax - xmin), xmax + 0.1 * (xmax - xmin)) # +10% margin
    ymin, ymax = np.min(data[:,1]), np.max(data[:,1])
    ylim = (ymin - 0.1 * (ymax - ymin), ymax + 0.1 * (ymax - ymin)) # +10% margin
    for i, f in enumerate(tqdm(set(c))):
        ax = plt.subplot(nrows, ncols, i + 1)
        plt.setp(ax, xticks=[], yticks=[])
        plt.xlim(xlim)
        plt.ylim(ylim)
        plt.title(str(f))
        # Plot slightly visible other as gray
        nfdata = data[np.flatnonzero(np.logical_not(np.equal(c, f))), :]
        plt.scatter(nfdata[:,0], nfdata[:,1], 
            color='gray',
            marker='o',
            alpha=alpha / contrast,
            s=size)
        # And factor data
        fdata = data[np.flatnonzero(np.equal(c, f)), :]        
        plt.scatter(fdata[:,0], fdata[:,1], 
                    color=cmap(i),
                    marker='o',
                    alpha=alpha,
                    s=size)
In [10]:
print('Cell Ranger ATAC original T-SNE visualization')
tsne10xpath = TSNE_FILE_10X
tsne10xdf = pd.read_csv(tsne10xpath, sep=',')

mergeddf = pd.merge(left=tsne10xdf,
                    right=clusters10xdf, 
                    left_on='Barcode', right_on='Barcode')

plot_colored(mergeddf[['TSNE-1', 'TSNE-2']].values, mergeddf['Cluster'], 
             clusters=True,
             show_centers=True,
             size=20, alpha=0.5)
plt.show()
Cell Ranger ATAC original T-SNE visualization
In [11]:
print('tSNE on LSA normalized vs', FACTOR, FACTORS_MAP)
plot_colored(mergeddf[['TSNE-1', 'TSNE-2']].values, clusters10xdf[FACTOR], clusters=True, size=20, alpha=0.1)
plt.show()
tSNE on LSA normalized vs age {1: 'young', 2: 'old'}
In [12]:
# Uncomment to plot split by factor
# with PdfPages(os.path.join(figures_path, '10x_tsne_clusters_split.pdf')) as pdf:
#     plot_colored_split(mergeddf[['TSNE-1', 'TSNE-2']].values, clusters10xdf[FACTOR], size=5, alpha=0.1)
#     pdf.savefig()

Peaks file stats

In [69]:
with PdfPages(os.path.join(figures_path, 'peaks_length.pdf')) as pdf:
    peaks_df = pd.read_csv(PEAKS_FILE, sep='\t', header=None)
    print('Peaks', len(peaks_df))
    sns.distplot(np.log10(peaks_df[2] - peaks_df[1]))
    plt.title('Peak length distribution')
    plt.xlabel('Log10 Length')
    plt.ylabel('Frequency')
    pdf.savefig()
Peaks 98449

Overlap of peaks with representative DNAse hypersensitive sites

In [70]:
if DNASE_FILE is not None:
    dnase = BedTool(DNASE_FILE)
    peaks_file = BedTool(PEAKS_FILE)
    peak_count = peaks_file.count()
    overlap = peaks_file.intersect(dnase, wa=True, u=True).count()
    peak_by_dnase = 0 if overlap == 0 else overlap * 100.0 / peak_count
    overlap = dnase.intersect(peaks_file, wa=True, u=True).count()
    dnase_by_peak = overlap * 100.0 / dnase.count()
    print('Fraction of peaks overlapping with representative DNAse', int(peak_by_dnase), '%')
    print('Fraction of representative DNAse overlapping with peaks', int(dnase_by_peak), '%')
Fraction of peaks overlapping with representative DNAse 94 %
Fraction of representative DNAse overlapping with peaks 14 %

Annotate peaks with closest genes

Pipeline Preprocess genes to TSS positions and use bedtools closest -D b to report the distance to the TSS, upstream or downstream w.r.t. the gene strand.


Cell Ranger ATAC We use bedtools closest -D b to associate each peak with genes based on closest transcription start sites (packaged within the reference) such that the peak is within 1000 bases upstream or 100 bases downstream of the TSS.
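
For illustration, the Cell Ranger window can be applied directly to the signed distance that bedtools closest -D b reports (the same distance column returned by the locations_closest_genes helper below); this is a sketch, not part of the pipeline, and the window sizes are the Cell Ranger defaults quoted above.

# Sketch only: keep peak-gene pairs within the Cell Ranger window of
# 1000 bp upstream / 100 bp downstream of the TSS. With `bedtools closest -D b`
# the distance is signed w.r.t. the gene strand, so upstream positions are negative.
def filter_by_tss_window(closest_df, upstream=1000, downstream=100):
    in_window = (closest_df['distance'] >= -upstream) & (closest_df['distance'] <= downstream)
    return closest_df.loc[in_window]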

In [371]:
print('Preprocess GTF file')

gtf_file = GTF_FILE
transcripts_bed = os.path.join(OUTPUT_DIR, 'cache', os.path.basename(gtf_file) + '.bed')
if not os.path.exists(transcripts_bed):
    GTF_TO_BED_SH = """
# Compute transcript name field in GTF file
TNF=$(cat $1 | grep transcript_id | grep -v '#' | head -n 1 | awk '{for (i=4; i<NF; i++) {if ($3=="transcript" && $i=="gene_name") print (i+1)}}')
# Process GTF file - transcripts of protein_coding genes
cat $1 | grep 'protein_coding' | awk -v TNF=$TNF -v OFS="\t" '{if ($3=="transcript") {print $1,$4-1,$5,$TNF,0,$7}}' | tr -d '";' | sort -k1,1 -k2,2n -k3,3n --unique > $2
    """
    with tempfile.NamedTemporaryFile(prefix='gtf_to_bed', suffix='.sh', delete=False) as f:
        f.write(GTF_TO_BED_SH.encode('utf-8'))
        f.close()
        ! bash {f.name} {gtf_file} {transcripts_bed} 
print('Transcripts bed file', transcripts_bed)

transcripts_tss = os.path.join(OUTPUT_DIR, 'cache', os.path.basename(gtf_file) + '.tss')
if not os.path.exists(transcripts_tss):
    BED_TO_TSS_SH = """
cat $1 | awk -v OFS="\t" '{if ($6=="+") {print $1, $2, $2+1, $4, 0, "+"} else {print $1,$3-1,$3, $4, 0, "-"}}' | sort -k1,1 -k2,2n -k3,3n --unique > $2
    """
    with tempfile.NamedTemporaryFile(prefix='bed_to_tss', suffix='.sh', delete=False) as f:
        f.write(BED_TO_TSS_SH.encode('utf-8'))
        f.close()
        ! bash {f.name} {transcripts_bed} {transcripts_tss} 
print('Transcripts tss file', transcripts_tss)
Preprocess GTF file
grep: write error
grep: write error
cat: write error: Broken pipe
Transcripts bed file /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/cache/gencode.vM22.annotation.gtf.bed
Transcripts tss file /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/cache/gencode.vM22.annotation.gtf.tss
In [372]:
def locations_closest_genes(locations, transcripts_tss):
    print('Annotate list of locations in format "chr:start-end" with closest tss of genes')
    with tempfile.NamedTemporaryFile(delete=False) as f:
        # Save index to restore original order after sorting
        for i, t in enumerate([re.split(':|-', p) for p in locations]):
            f.write('{}\t{}\t{}\t{}\n'.format(*t, i).encode('utf-8'))
        f.close()
        closest_file = f.name + '.closest'
        sorted_file = f.name + '.sorted'
        ! sort -k1,1 -k2,2n {f.name} > {sorted_file}
        ! bedtools closest -a {sorted_file} -b {transcripts_tss} -D b > {closest_file}
        closest_df = pd.read_csv(
            closest_file, sep='\t', 
            names=['chr', 'start', 'end', 'index', 
                   'gene_chr', 'gene_start', 'gene_end', 'gene', 'gene_score', 'gene_strand', 'distance'])
        # Restore original order using the saved index
        closest_df.sort_values(by=['index'], inplace=True)
        # Pick only first closest gene in case of several
        closest_df.drop_duplicates(['index'], inplace=True)
        return closest_df[['chr', 'start', 'end', 'gene', 'distance']]

Find peaks closest to markers

In [397]:
print('Building gene peaks: the closest peak is mapped to the gene name, the second closest gets suffix _1, etc')
closest_genes = locations_closest_genes(
    peaks_df[0] + ':' + peaks_df[1].astype(str) + '-' + peaks_df[2].astype(str),
    transcripts_tss
)
# Restore 'peak' to be able to merge on it
closest_genes['peak'] = closest_genes['chr'] +\
                        ':' + closest_genes['start'].astype(str) +\
                        '-' + closest_genes['end'].astype(str)
t = closest_genes[['gene', 'peak']].copy()
t['distance'] = np.abs(closest_genes['distance'])

# Filter peak close to gene tss of MARKERS_GENES
t = t.loc[np.logical_and(t['distance'] < 2000, t['gene'].isin(MARKERS_GENES))]
t.sort_values(by=['gene', 'distance'], inplace=True)

marker_regions_peaks = {}
prev_gene = None
prev_count = 1
seen_genes = set()
for _, (gene, peak, distance) in t.iterrows():
    if prev_gene != gene:
        marker_regions_peaks[gene] = peak
        seen_genes.add(gene)
        prev_gene = gene
        prev_count = 1
    elif prev_count < 3:  # Max regions per gene
        marker_regions_peaks[f'{gene}_{prev_count}'] = peak
        prev_count += 1

print(f'Found {len(marker_regions_peaks)} marker peaks close to TSS for {len(MARKERS_GENES)} marker genes')
for g in MARKERS_GENES:
    if g not in seen_genes:
        print(f'Missing peak for {g}')
for k, v in marker_regions_peaks.items():
    print(f'{k}: {v}')
Building gene peaks: the closest peak is mapped to the gene name, the second closest gets suffix _1, etc
Annotate list of locations in format "chr:start-end" with closest tss of genes
Found 46 marker peaks close to TSS for 37 marker genes
Missing peak for Iglc1
Missing peak for Fcer1a
Missing peak for Il1rl1
Missing peak for Cd209a
Missing peak for Il23r
Missing peak for Tcrg-C1
Missing peak for Cd8b1
Adgre1: chr17:57357648-57360318
Arg1: chr10:24927559-24928588
Ccl5: chr11:83530201-83530919
Cd36: chr5:17835474-17836353
Cd3d: chr9:44981247-44982948
Cd3d_1: chr9:44984591-44987779
Cd4: chr6:124888061-124888422
Cd4_1: chr6:124885026-124886486
Cd44: chr2:102891370-102902279
Cd79a: chr7:24896134-24898376
Cd79a_1: chr7:24899055-24899354
Cd8a: chr6:71373171-71373707
Cx3cr1: chr9:120068000-120068987
Eomes: chr9:118476559-118479571
Flt3: chr5:147399724-147401356
Fn1: chr1:71652564-71653969
Foxp3: chrX:7586716-7586735
Gata2: chr6:88193465-88194436
Gata2_1: chr6:88198394-88198862
Gata2_2: chr6:88203277-88203640
Gdf15: chr8:70627862-70632106
Gdf15_1: chr8:70633937-70636173
Gzmk: chr13:113225812-113226325
Gzmk_1: chr13:113180923-113181119
Gzmk_2: chr13:113227022-113228154
Il12b: chr11:44398688-44399420
Il4: chr11:53617166-53617684
Il5: chr11:53719752-53720008
Itgax: chr7:128128712-128129903
Itgax_1: chr7:128130634-128134704
Lcn2: chr2:32387565-32388359
Lcn2_1: chr2:32384996-32385439
Lcn2_2: chr2:32389069-32389613
Lef1: chr3:131109105-131114154
Lyz2: chr10:117281323-117282601
Mki67: chr7:135715135-135717364
Mki67_1: chr7:135710234-135710711
Ncr1: chr7:4337407-4337914
Sell: chr1:164061184-164064453
Sell_1: chr1:164068995-164079614
Sell_2: chr1:164067890-164067902
Siglecf: chr7:43351062-43351737
Siglecf_1: chr7:43349488-43350414
Siglecf_2: chr7:43352989-43353647
Tcf7: chr11:52280678-52284680
Tnfsf4: chr1:161395198-161397036

The pipeline

Cell Calling

Pipeline uses noise_threshold and duplets_threshold.


Cell Ranger ATAC aggregation: when combining data from multiple GEM groups, the cellranger-atac aggr pipeline automatically equalizes the sensitivity of the groups before merging, which is the recommended approach in order to avoid the batch effect introduced by sequencing depth. Default method: Subsample fragments from higher-depth GEM wells until they all have an equal number of unique fragments per cell.
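
A rough sketch of that depth-equalization idea on the fragments table used in this notebook (a simplified per-group version of Cell Ranger's per-cell subsampling; the helper name and the per-group median heuristic are assumptions, not the cellranger-atac aggr implementation):

# Sketch only: subsample fragments of deeper groups so that the median number of
# unique fragments per barcode is roughly equal across groups. Expects a fragments
# dataframe with columns ['chr', 'start', 'end', 'barcode', 'reads'].
def equalize_depth(fragments_df, factor_function=FACTOR_FUNCTION, seed=0):
    rng = np.random.RandomState(seed)
    groups = fragments_df['barcode'].map(factor_function)
    # Median unique fragments per barcode within each group
    depth = fragments_df.groupby(groups)['barcode'].value_counts().groupby(level=0).median()
    target = depth.min()
    parts = []
    for g, part in fragments_df.groupby(groups):
        frac = min(1.0, target / depth[g])
        parts.append(part.sample(frac=frac, random_state=rng))
    return pd.concat(parts)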

For each barcode:

  • we have the record of mapped high-quality fragments that passed all filters (the fragments.tsv file).
  • Having determined peaks prior to this, we use the number of fragments that overlap any peak regions.
  • Separate the signal from noise.

This works better in practice as compared to naively using the number of fragments per barcode. We first subtract a depth-dependent fixed count from all barcode counts to model whitelist contamination. This fixed count is the estimated number of fragments per barcode that originated from a different GEM, assuming a contamination rate of 0.02. Then we fit a mixture model of two negative binomial distributions to capture the signal and noise. Setting an odds ratio of 1000, we separate the barcodes that correspond to real cells from the non-cell barcodes.
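
The negative binomial mixture is not re-implemented in this notebook; as a simplified stand-in, the same signal/noise separation idea can be sketched with a two-component Gaussian mixture on log counts and a posterior odds cutoff (sklearn.mixture is an extra dependency used only for this illustration):

# Sketch only: a 2-component Gaussian mixture on log10 counts as a simplified
# stand-in for Cell Ranger's negative binomial mixture; barcodes whose posterior
# odds in favour of the high-count component exceed `odds_ratio` are called cells.
from sklearn.mixture import GaussianMixture

def call_cells_mixture(per_barcode_counts, odds_ratio=1000):
    x = np.log10(np.asarray(per_barcode_counts) + 1).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    signal = int(np.argmax(gmm.means_.ravel()))  # component with the higher mean = cells
    posterior = gmm.predict_proba(x)
    odds = posterior[:, signal] / np.maximum(posterior[:, 1 - signal], 1e-12)
    return odds > odds_ratio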

Scasat If a cell has open peaks below a user defined threshold (default: 50 peaks) in the peak accessibility matrix we would remove that cell from subsequent analysis. Also peaks not observed across a user defined number of valid cells (default: 10 cells) are not considered for further downstream analysis.
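
For comparison, a Scasat-style filter with the quoted defaults could be applied to the peaks x cells coverage matrix built later in this notebook (a sketch, not used by the pipeline):

# Sketch only: Scasat-style filtering of a peaks x cells matrix such as coveragedf:
# drop cells with fewer than `min_peaks` open peaks and peaks open in fewer than
# `min_cells` cells (defaults follow the Scasat values quoted above).
def scasat_filter(coveragedf, min_peaks=50, min_cells=10):
    open_peaks_per_cell = np.count_nonzero(coveragedf.values, axis=0)
    cells_per_peak = np.count_nonzero(coveragedf.values, axis=1)
    return coveragedf.loc[cells_per_peak >= min_cells,
                          coveragedf.columns[open_peaks_per_cell >= min_peaks]]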

SnapATAC alternative approach - 1kb, 5kb and 10kb resolution.

Existing computational methods rely on pre-defined regions of transposase accessibility identified from the aggregate signals. ...

Limitations:

  1. It requires sufficient number of single cell profiles to create robust aggregate signal for peak calling.
  2. The cell type identification is biased toward the most abundant cell types in the tissues.
  3. These techniques lack the ability to reveal regulatory elements in the rare cell populations which are underrepresented in the aggregate signal. This concern is critical, for example, in brain tissue, where key neuron types may represent less than 1% of all cells while still playing a critical role in the neural circuit.
In [18]:
# Intersect using bedtools directly rather than pybedtools, because pybedtools is too slow for files of this size!
def intersect_fragments_and_peaks(fragments_file_gz, peaks_file, blacklist_file):
    print('Fragments')
    ! zcat {fragments_file_gz} | wc -l
    idf = None
    if blacklist_file is not None:
        print('Blacklist')
        ! wc -l {blacklist_file}
    print('Peaks')
    ! wc -l {peaks_file}

    with tempfile.TemporaryDirectory(prefix='pipeline', dir=OUTPUT_DIR) as td:
        print('Filtering out non-standard chromosomes')
        chr_filtered_file = os.path.join(td, 'chromosome_filtered.bed')
        ! zcat {fragments_file_gz} | grep -i -E 'chr[0-9mt]+[^_]' > {chr_filtered_file}
        ! wc -l {chr_filtered_file}

        if blacklist_file is not None:
            print('Blacklist regions filtration')
            blacklist_filtered_file = os.path.join(td, 'blacklist_filtered.bed')
            ! bedtools intersect -v -a {chr_filtered_file} -b {blacklist_file} > {blacklist_filtered_file}
            ! wc -l {blacklist_filtered_file}
            # Cleanup
            ! rm {chr_filtered_file}
        else:
            blacklist_filtered_file = chr_filtered_file

        print('Fragments and peaks intersection')
        intersection_file = os.path.join(td, 'intersection.bed')
        ! bedtools intersect -wa -wb -a {blacklist_filtered_file} -b {peaks_file} > {intersection_file}
        ! wc -l {intersection_file}
        # Cleanup
        ! rm {blacklist_filtered_file}

        idf = pd.read_csv(intersection_file, sep='\t', header=None)
        # Cleanup
        ! rm {intersection_file}        
        icolnames = ['chr', 'start', 'end', 'barcode', 'reads', 'peak_chr', 'peak_start', 'peak_end']
        assert len(idf.columns) >= len(icolnames)
        idf.rename(columns={o: n for (o, n) in zip(idf.columns[:len(icolnames)], icolnames)}, inplace=True)
        idf = idf[icolnames] # Reorder and drop others
        idf = idf.astype({'reads': 'int8'}) # Consume less memory
    print('Done intersecting of fragments and peaks')
    return idf
In [200]:
# Load or compute fragments and peaks intersection data frame
idfpath = os.path.join(OUTPUT_DIR, 'cache', 'idf.csv.gz')
if os.path.exists(idfpath):
    print(f'Loading IDF from {idfpath}')
    idf = pd.read_csv(idfpath, compression='gzip')
else:
    print(f'No IDF file found {idfpath}')    
    idf = intersect_fragments_and_peaks(FRAGMENTS_FILE_10X, PEAKS_FILE, BLACKLIST_FILE)
    idf.to_csv(idfpath, index=False, compression='gzip')
    print(f'Saved IDF to {idfpath}')   
Loading IDF from /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/cache/idf.csv.gz
In [201]:
def filter_cells(idf, noise_threshold=None, duplets_threshold=None):
    # 10x Genomics Cell Ranger ATAC-Seq marks duplicates
    # 'count' ignores multiple reads per barcode at the same position
    pidf = pd.pivot_table(idf, values='reads', index=['barcode'], aggfunc='count')
    pidf.reset_index(level=0, inplace=True)
    
    print(f'Total barcodes intersecting peaks: {len(pidf)}')    
    plt.subplots(nrows=1, ncols=3, figsize=(20, 6))
    
    plt.subplot(1, 3, 1)
    sns.distplot(np.log10(pidf['reads']))
    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')    

    # n is for shoulder graph
    counts = sorted(pidf['reads'], reverse=True)
    df_counts = pd.DataFrame(data={'count': counts, 'n': range(1, len(counts) + 1)})
    cells_filter = [True] * len(df_counts)

    plt.subplot(1, 3, 2)    
    if noise_threshold is not None:
        noise_filter = df_counts['count'] <= noise_threshold
        cells_filter = np.logical_and(cells_filter, np.logical_not(noise_filter))
        df_noise = df_counts.loc[noise_filter]
        print(f'Noise <= {noise_threshold} reads per barcode: {len(df_noise)}')
        plt.plot(np.log10(df_noise['n']), 
                 np.log10(df_noise['count']), 
                 label='Noise', linewidth=3, color='red')        

    if duplets_threshold is not None:
        duplets_filter = df_counts['count'] > duplets_threshold
        cells_filter = np.logical_and(cells_filter, np.logical_not(duplets_filter))
        df_duplets = df_counts.loc[duplets_filter]
        print(f'Duplets > {duplets_threshold} reads per barcode: {len(df_duplets)}')        
        plt.plot(np.log10(df_duplets['n']), 
                 np.log10(df_duplets['count']), 
                 label='Duplets', linewidth=3, color='orange')        

    df_cells = df_counts.loc[cells_filter]

    cells_filter = [True] * len(pidf)
    if noise_threshold is not None:
        cells_filter = np.logical_and(cells_filter, pidf['reads'] > noise_threshold)
    if duplets_threshold is not None:
        cells_filter = np.logical_and(cells_filter, pidf['reads'] < duplets_threshold)
        
    cells = set(pidf.loc[cells_filter]['barcode'])
    idfcells = idf.loc[[c in cells for c in idf['barcode']]]
    print(f'Estimated number of cells: {len(cells)}')
    plt.title('Cells calling')
    plt.plot(np.log10(df_cells['n']), 
             np.log10(df_cells['count']), 
             label='Cell', linewidth=3, color='green')

    plt.xlabel('Log10 number of barcodes')
    plt.ylabel('Log10 fragments overlapping Peaks')
    plt.legend(loc='upper right')

    plt.subplot(1, 3, 3)
    sns.distplot(np.log10(df_cells['count']))
    plt.title('Cells UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')
    
    return idfcells
In [202]:
with PdfPages(os.path.join(figures_path, 'cell_calling.pdf')) as pdf:
    idfcells = filter_cells(idf, noise_threshold=NOISE_THRESHOLD, duplets_threshold=DUPLETS_THRESHOLD)
    pdf.savefig()
Total barcodes intersecting peaks: 383417
Noise <= 300 reads per barcode: 366769
Duplets > 8000 reads per barcode: 2177
Estimated number of cells: 14471
In [203]:
# Cleanup memory
del idf
gc.collect()
Out[203]:
85879

Peak-Barcode Matrix

Pipeline Compute the number of distinct UMIs that intersect each peak. Multiple copies of the same UMI are ignored.


Cell Ranger ATAC-Seq We produce a count matrix consisting of the counts of fragment ends (or cut sites) within each peak region for each barcode. This is the raw peak-barcode matrix and it captures the enrichment of open chromatin per barcode. The matrix is then filtered to consist of only cell barcodes, which is then used in subsequent analysis such as dimensionality reduction, clustering and visualization.

A barcoded fragment may get sequenced multiple times due to PCR amplification. We mark duplicates in order to identify the original fragments that constitute the library and contribute to its complexity.

In [204]:
def cellsdf_to_coverage(idfcells):
    print('Barcode vs summary fragments overlap with peaks')
    # 10x Genomics Cell Ranger ATAC-Seq marks duplicates
    # 'count' ignores multiple reads per barcode at same position
    cell_peak_cov_df = pd.pivot_table(idfcells, values='reads', 
                            index=['peak_chr', 'peak_start', 'peak_end', 'barcode'], 
                            aggfunc='count').reset_index()

    cell_peak_cov_df['peak'] = cell_peak_cov_df['peak_chr'] + ':' + \
        cell_peak_cov_df['peak_start'].astype(str) + '-' + cell_peak_cov_df['peak_end'].astype(str)
    cell_peak_cov_df.drop(columns=['peak_chr', 'peak_start', 'peak_end'], inplace=True)

    print('Transforming dataframe to peaks x barcode format')
    coveragedf = pd.pivot_table(cell_peak_cov_df, index='peak', columns='barcode', values='reads').fillna(0)
    # Remove extra labels from pivot_table columns
    coveragedf.columns = coveragedf.columns.values
    coveragedf.index.name = None
    return coveragedf
In [205]:
cov_mtx_path = os.path.join(OUTPUT_DIR, 'cache', 'coverage.mtx')
barcodes_path = os.path.join(OUTPUT_DIR, 'cache', 'barcodes.txt')
peaks_path = os.path.join(OUTPUT_DIR, 'cache', 'peaks.txt')

if os.path.exists(cov_mtx_path):
    print(f'Loading coverage matrix {cov_mtx_path}')
    csc = io.mmread(cov_mtx_path)
    barcodes = [l.rstrip('\n') for l in open(barcodes_path)]
    peaks = [l.rstrip('\n') for l in open(peaks_path)]
    print(f'Loading into dataframe')
    coveragedf = pd.DataFrame(csc.toarray(), index=peaks, columns=barcodes)
    coveragedf.fillna(0, inplace=True)
    # Cleanup memory
    del csc
    del barcodes
    del peaks
    gc.collect()    
else:
    print(f'No coverage matrix found {cov_mtx_path}')
    coveragedf = cellsdf_to_coverage(idfcells)
    print(f'Saving coverage matrix')
    csc = sparse.csc_matrix(coveragedf.values, dtype=int)
    io.mmwrite(cov_mtx_path, csc)

    print('Saving barcodes')
    with open(barcodes_path, 'w') as f:
        for b in coveragedf.columns.values:
            f.write(b + '\n')

    print('Saving peaks')
    with open(peaks_path, 'w') as f:
        for p in coveragedf.index.values:
            f.write(p + '\n')    
    # Cleanup memory
    del csc
    gc.collect()    

print('DF', coveragedf.shape[0], 'peaks x', coveragedf.shape[1], 'cells')
No coverage matrix found /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/cache/coverage.mtx
Barcode vs summary fragments overlap with peaks
Transforming dataframe to peaks x barcode format
Saving coverage matrix
Saving barcodes
Saving peaks
DF 96544 peaks x 14471 cells
In [206]:
# Cleanup memory
del idfcells
gc.collect()
Out[206]:
21

Coverage dataframe analysis

In [207]:
print('Peaks per cell with non-zero coverage distribution')
peaks_per_cell = np.count_nonzero(coveragedf, axis=0)
with PdfPages(os.path.join(figures_path, 'peaks_per_cell.pdf')) as pdf:
    sns.distplot(np.log10(peaks_per_cell))
    plt.title('Peaks with coverage > 0 per cell')
    plt.xlabel('Log10 peaks')
    plt.ylabel('Frequency')
    pdf.savefig()
Peaks per cell with non-zero coverage distribution
In [208]:
with PdfPages(os.path.join(figures_path, 'umis.pdf')) as pdf:
    print('Summary UMI distribution per factor')
    factors = [FACTOR_FUNCTION(bc) for bc in coveragedf.columns]
    for k, v in tqdm(FACTORS_MAP.items()):
        sns.kdeplot(np.log10(coveragedf[coveragedf.columns[np.flatnonzero(np.equal(factors, k))]].sum()), 
                    label=v)
    plt.title('UMI summary coverage distribution')
    plt.xlabel('Log10 coverage')
    plt.ylabel('Frequency')
    pdf.savefig()
Summary UMI distribution per factor

In [209]:
print('Cells per peak with non-zero coverage distribution')
cells_per_peak = np.count_nonzero(coveragedf, axis=1)
with PdfPages(os.path.join(figures_path, 'cells_per_peaks.pdf')) as pdf:
    sns.distplot(np.log10(cells_per_peak))
    plt.title('Cells with coverage > 0 per peak')
    plt.xlabel('Log10 cells')
    plt.ylabel('Frequency')
    pdf.savefig()
Cells per peak with non-zero coverage distribution
In [210]:
# Plot peak coverage vs peak length
with PdfPages(os.path.join(figures_path, 'peaks_coverage_vs_length.pdf')) as pdf:
    sns.jointplot(x=np.log10(cells_per_peak), 
                  y=[int(re.split(':|-', p)[2]) - int(re.split(':|-', p)[1]) for p in coveragedf.index], 
                  alpha=0.1)
    plt.xlabel('Log10 cells')
    plt.ylabel('Peak length')
    plt.suptitle('Peak coverage vs peak length')
    pdf.savefig()
    plt.show()

Normalization

Cell Ranger uses normalization to the median coverage depth per barcode for PCA analysis, and doesn't do any normalization for LSA (the default scheme) - we do the same here.

Seurat we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
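
A minimal sketch of that LogNormalize scheme applied to a peaks x cells matrix such as coveragedf (illustrative only; as noted above, this pipeline applies no normalization before LSA):

# Sketch only: Seurat-style LogNormalize - divide each cell (column) by its total
# counts, multiply by a scale factor (10,000 by default) and log1p-transform.
def log_normalize(coveragedf, scale_factor=10_000):
    per_cell_totals = coveragedf.sum(axis=0)
    return np.log1p(coveragedf.divide(per_cell_totals, axis=1) * scale_factor)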

SnapATAC does not require population-level peak annotation, and instead assembles chromatin landscapes by directly clustering cells based on the similarity of their genome-wide accessibility profile. Using a regression-based normalization procedure, SnapATAC adjusts for differing read depth between cells.

Feature selection (Identification of highly variable features)

Pipeline

  • peaks with mean coverage above the 99th percentile are removed as likely alignment errors or housekeeping genes
  • peaks with standard deviation below the 1st percentile are removed as noise

Cell Ranger ATAC removes features with zero variance.

SnapATAC The vast majority of the items in the cell-by-bin count matrix are “0”, and some items have abnormally high coverage (often > 200), perhaps due to alignment errors. Bins of exceedingly high coverage, which likely represent genomic regions that are invariable between cells such as housekeeping gene promoters, were removed. We noticed that filtering bins of extremely low coverage, perhaps due to random noise, can also improve the robustness of the downstream clustering analysis.

  • Calculate the coverage of each bin using the binary matrix
  • Remove the top 0.1% items of the highest coverage
  • Normalize the coverage by log10(count + 1)
  • Log-scaled coverage approximately follows a Gaussian distribution, which is then converted into z-scores
  • Bins with a z-score beyond ±2 are filtered out before further analysis (see the sketch after this list)
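
A minimal sketch of that SnapATAC-style bin filter on a bins/peaks x cells matrix (illustrative only; this pipeline instead uses the mean/std feature selection implemented below; zscore is the scipy.stats function imported above):

# Sketch only: binarize, compute per-bin coverage, drop the top 0.1% bins,
# log10-transform, convert to z-scores and keep bins with |z| <= 2.
def snapatac_bin_filter(matrix_df, top_quantile=0.999, z_cutoff=2):
    coverage = (matrix_df.values > 0).sum(axis=1)
    keep = coverage < np.quantile(coverage, top_quantile)
    z = zscore(np.log10(coverage[keep] + 1))
    selected = np.flatnonzero(keep)[np.abs(z) <= z_cutoff]
    return matrix_df.iloc[selected]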

Signac The largely binary nature of scATAC-seq data makes it challenging to perform ‘variable’ feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in less than n cells with the FindTopFeatures function. Here, we will use all features, though we note that we see very similar results when using only a subset of features (try setting min.cutoff to ‘q75’ to use the top 25% of all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as VariableFeatures for the Seurat object by this function. See for explanation: https://satijalab.org/signac/articles/pbmc_vignette.html
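
The Signac-style alternative could look like this on coveragedf (a sketch; the 25% cutoff mirrors min.cutoff = 'q75' and is not what this pipeline uses):

# Sketch only: keep the top 25% of peaks by total counts, analogous to Signac's
# FindTopFeatures with min.cutoff = 'q75'.
def top_quantile_peaks(coveragedf, q=0.75):
    totals = coveragedf.sum(axis=1)
    return coveragedf.loc[totals >= totals.quantile(q)]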

Seurat calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

In [211]:
print('Computing mean and std values for peaks...')
peaksdf = pd.DataFrame({'mean': np.mean(coveragedf.values, axis=1), 
                        'std': np.std(coveragedf.values, axis=1)},
                       index=coveragedf.index)
Computing mean and std values for peaks...
In [212]:
with PdfPages(os.path.join(figures_path, 'feature_selection_before.pdf')) as pdf:
    # df is (peaks x barcodes)
    sns.jointplot(x='mean', y='std', data=peaksdf, alpha=0.05)
    pdf.savefig()
    plt.show()
In [213]:
def feature_selection(peaksdf, stdp=1, meanp=99):
    print('Total peaks', len(peaksdf))
    print('Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.')
    means_high = np.percentile(peaksdf['mean'], meanp)
    mean_filter = peaksdf['mean'] < means_high
    print('Feature selection: filter out non-variable peaks.')          
    stds_low = np.percentile(peaksdf['std'], stdp)
    std_filter = peaksdf['std'] > stds_low
    peaks_filter = np.logical_and(mean_filter, std_filter)
    print('Filtered peaks', sum(peaks_filter))
    return peaks_filter
In [214]:
peaks_filter = feature_selection(peaksdf)
Total peaks 96544
Feature selection: filter out most highly covered peaks as alignment errors or housekeeping genes.
Feature selection: filter out non-variable peaks.
Filtered peaks 94612
In [215]:
with PdfPages(os.path.join(figures_path, 'feature_selection_after.pdf')) as pdf:
    sns.jointplot(x='mean', y='std', data=pd.DataFrame({'mean': peaksdf.loc[peaks_filter]['mean'], 
                                                        'std': peaksdf.loc[peaks_filter]['std']}),
                  alpha=0.05)
    pdf.savefig()
    plt.show()
In [216]:
print('Check for markers')
for k,v in marker_regions_peaks.items():
    if v not in coveragedf.index:
        print(f'{k} {v} is missing in coveragedf')
Check for markers
Foxp3 chrX:7586716-7586735 is missing in coveragedf

Dimensionality Reduction, Clustering and visualization

Pipeline Use Cell Ranger ATAC approach and IRLB code for dimensionality reduction.

LSA (IDF + IRLBA + L2 normalization)

Cell Ranger ATAC default method is LSA.

  1. IDF scaling
  2. IRLBA SVD decomposition to dim 15
  3. L2 normalization in low dimension projection
  4. Graph clustering
  5. T-SNE visualization

IRLBA: The augmented implicitly restarted Lanczos bidiagonalization algorithm finds a few approximate largest singular values and corresponding singular vectors of a sparse or dense matrix using a method of Baglama and Reichel.

PCA if your features are least sensitive (informative) towards the mean of the distribution, then it makes sense to subtract the mean. If the features are most sensitive towards the high values, then subtracting the mean does not make sense.

SVD does not subtract the means but often as a first step projects the data on the mean of all data points. In this way the SVD first takes care of global structure - this is important for IDF transformation.

We found that the combination of these normalization techniques obviates the need to remove the first component. The number of dimensions is fixed to 15 as it was found to sufficiently separate clusters visually and in a biologically meaningful way when tested on peripheral blood mononuclear cells (PBMCs).

In [217]:
def lsa(df, dim):
    print('Normalizing by IDF')
    idf = np.log1p(df.shape[1]) - np.log1p(np.count_nonzero(df, axis=1))
    df_idf = df.multiply(idf, axis=0).T # Transpose to (barcodes x peaks) format
    del idf # Cleanup    
    
    print('Performing IRLBA without scaling or centering')
    # Number of dimensions recommended by Cell Ranger ATAC-Seq algorithm
    (_, _, v, _, _) = irlb(df_idf, dim)
    # project the matrix to complete the transform: X --> X*v = u*d
    df_idf_irlb = df_idf.dot(v) 
    assert df_idf_irlb.shape == (df_idf.shape[0], dim), 'check dimensions'
    del df_idf # Cleanup

    # IMPORTANT! We found that the combination of these normalization techniques 
    # obviates the need to remove the first component.
    print('Normalizing each barcode data point to unit L2-norm in the lower dimensional space')
    idf_irlb_l2 = Normalizer(norm='l2').fit_transform(df_idf_irlb) # (n_samples, n_features) 
    return pd.DataFrame(idf_irlb_l2, index=df_idf_irlb.index, columns=df_idf_irlb.columns)

# NO UMI normalization required for LSA approach!
# IDF is invariant to scales, and IRLBA doesn't need scaling.
result_lsa = lsa(coveragedf.loc[peaks_filter], LSA_COMPONENTS)
Normalizing by IDF
Performing IRLBA without scaling or centering
Normalizing each barcode data point to unit L2-norm in the lower dimensional space

TSNE visualization

In [218]:
print('Processing t-SNE on L2-normalized data')
tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(result_lsa.values),
                           index=result_lsa.index,
                           columns=['tsne1', 'tsne2'])
print('t-SNE done!')
# display(tsne_coords.head())

print('Saving tSNE coordinates to tsne.tsv')
tsne_coords.to_csv(os.path.join(OUTPUT_DIR, 'tsne.tsv'), sep='\t')
Processing t-SNE on L2-normalized data
t-SNE done!
Saving tSNE coordinates to tsne.tsv
In [219]:
print('tSNE on LSA normalized vs log10 coverage depth')
with PdfPages(os.path.join(figures_path, 'tsne.pdf')) as pdf:
    plot_colored(tsne_coords.values, np.log10(coveragedf.sum()), size=20)
    pdf.savefig()
tSNE on LSA normalized vs log10 coverage depth

Additional duplets filtration

In [ ]:
def process_feature_selection_dim_reduction(coveragedf):
    # Feature selection
    peaksdf = pd.DataFrame({'mean': np.mean(coveragedf.values, axis=1), 
                            'std': np.std(coveragedf.values, axis=1)},
                            index=coveragedf.index)
    peaks_filter = feature_selection(peaksdf)
    # Normalization and dimensionality reduction
    result_lsa = lsa(coveragedf.loc[peaks_filter], LSA_COMPONENTS)
    return result_lsa
In [246]:
# Reprocess coveragedf removing duplets
# Be careful: it can remove all the markers!
# print('Clip dupletes by cutoff')
# coveragedf = coveragedf.iloc[np.flatnonzero(np.log10(coveragedf.sum()) <= 3.8), :]

# print('Normalization and dimensionality reduction')
# result_lsa = process_feature_selection_dim_reduction(coveragedf)

# print('Processing t-SNE on L2-normalized data')
# tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(result_lsa.values),
#                            index=result_lsa.index,
#                            columns=['tsne1', 'tsne2'])
# plot_colored(tsne_coords.values, np.log10(coveragedf.sum()), size=20)
In [247]:
print('tSNE on LSA normalized vs', FACTOR, FACTORS_MAP)
factors = np.array([FACTOR_FUNCTION(bc) for bc in result_lsa.index])
with PdfPages(os.path.join(figures_path, f'tsne_{FACTOR}.pdf')) as pdf:
    plot_colored(tsne_coords.values, factors, clusters=True, size=20, alpha=0.1)
    pdf.savefig()
tSNE on LSA normalized vs age {1: 'young', 2: 'old'}
In [248]:
# Uncomment to plot split by factor
# print(FACTORS_MAP)
# with PdfPages(os.path.join(figures_path, f'tsne_{FACTOR}_split.pdf')) as pdf:
#     plot_colored_split(tsne_coords.values, factors, alpha=0.1)
#     pdf.savefig()

Graph clustering of LSA normalized data

Clustering should:

  • produce enough clusters (10-15 recommended by Cell Ranger)
  • avoid clusters that are too small (< 100 cells), as these are likely artifacts
  • place cluster centers inside their clusters; otherwise we should increase the number of clusters OR the size of the low-dimensional space (see the sketch below)
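
A quick sketch of checks for these criteria, run on result_lsa and the clusters array computed below (the helper name and the nearest-cell centroid test are assumptions, not part of the pipeline):

# Sketch only: flag clusters smaller than 100 cells and clusters whose centroid
# lies nearest to a cell that belongs to a different cluster.
def check_clusters(result_lsa, clusters, min_size=100):
    X = result_lsa.values
    for c in sorted(set(clusters)):
        members = np.flatnonzero(clusters == c)
        if len(members) < min_size:
            print(f'Cluster {c}: only {len(members)} cells, likely an artifact')
        center = X[members].mean(axis=0)
        nearest = np.argmin(np.linalg.norm(X - center, axis=1))
        if clusters[nearest] != c:
            print(f'Cluster {c}: centroid is nearest to a cell of cluster {clusters[nearest]}')
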
In [299]:
def graph_clustering(result, n_clusters):
    # Ward linkage method as default with euclidean distance
    print('Clustering', n_clusters)
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(result).astype(int)

    # reorder clusters by size descending
    cluster_counter = Counter()
    for c in clusters:
        cluster_counter[c] += 1
    clusters_reord = np.zeros(len(clusters), dtype=int)    
    for i, (c, n) in enumerate(cluster_counter.most_common()):
        clusters_reord[clusters == c] = i

    return clusters_reord

n_clusters = N_CLUSTERS
clusters = graph_clustering(result_lsa, n_clusters)
print(f'Saving clusters to clusters{n_clusters}.tsv')
clusters_df = pd.DataFrame({'cluster': clusters}, index=result_lsa.index)
clusters_df.to_csv(os.path.join(OUTPUT_DIR, f'clusters{n_clusters}.tsv'))
Clustering 20
Saving clusters to clusters20.tsv
In [386]:
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

def clusters_summary(coveragedf, clusters):
    # Clusters summary by factors
    t = pd.DataFrame({FACTOR: [FACTOR_FUNCTION(bc) for bc in coveragedf.columns],
                     'cluster': clusters, 'counts': np.ones(len(coveragedf.columns))}, 
                     index=coveragedf.columns)
    clusters_sizes(t, 'counts')
    clusters_factors(t, 'counts', FACTORS_MAP)    

def clusters_hierarchy(X, clusters):
    # Hierarchical clustering of X - mean normalized LSA coordinates per cluster
    t = X.reset_index(drop=True).values
    clusters_means = {
        str(c): np.mean(t[np.flatnonzero(clusters == c), :], axis=0) for c in sorted(set(clusters))
    }
    clusters_means_df = pd.DataFrame(clusters_means).reset_index()
    clusters_linkage = hierarchy.linkage(
        clusters_means_df[[str(i) for i in set(clusters)]].T.values, method='ward'
    )
    hierarchy.dendrogram(clusters_linkage, orientation='top')
    hierarchy.set_link_color_palette(None)  # reset to default after use
    return clusters_linkage
    
def clusters_silhouette(X, clusters):
    # X - mean normalized LSA coordinates per cluster
    # The silhouette_score gives the average value for all the samples
    silhouette_avg = silhouette_score(X, clusters)
    print("For n_clusters =", len(set(clusters)), "the average silhouette_score is ", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, clusters)

    plt.figure(figsize=(6, 10))
    ax = plt.gca()

    y_lower = 10
    for i, c in enumerate(sorted(set(clusters))):
        # Aggregate the silhouette scores for samples belonging to cluster c, and sort them
        cluster_silhouette_values = sample_silhouette_values[clusters == c]
        cluster_silhouette_values.sort()

        size_cluster = cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster

        color = cm.nipy_spectral(float(i) / len(set(clusters)))
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                          0, cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax.text(-0.05, y_lower + 0.5 * size_cluster, str(c))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax.set_title("The silhouette plot for the various clusters.")
    ax.set_xlabel("The silhouette coefficient values")
    ax.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax.set_yticks([])  # Clear the yaxis labels / ticks
In [387]:
print('Computing summary for clusters by', FACTOR)
print(FACTORS_MAP)

t = pd.DataFrame({FACTOR: [FACTOR_FUNCTION(bc) for bc in coveragedf.columns],
                 'cluster': clusters, 'counts': np.ones(len(coveragedf.columns))}, 
                 index=coveragedf.columns)
display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())

with PdfPages(os.path.join(figures_path, 'clusters_sizes.pdf')) as pdf:
    clusters_sizes(t, 'counts')
    pdf.savefig()

with PdfPages(os.path.join(figures_path, f'clusters_{FACTOR}.pdf')) as pdf:
    clusters_factors(t, 'counts', FACTORS_MAP)
    pdf.savefig() 
Computing summary for clusters by age
{1: 'young', 2: 'old'}
     counts
age
1    6368.0
2    8103.0
In [302]:
with PdfPages(os.path.join(figures_path, 'clusters_umi.pdf')) as pdf:
    ts = []
    plt.figure(figsize=(int(n_clusters * 0.75), 6))
    for c in tqdm(set(clusters)):
        t = pd.DataFrame({'umi': np.log10(coveragedf.T.iloc[np.flatnonzero(clusters == c)].sum(axis=1) + 1)})
        t['cluster'] = str(c)
        ts.append(t)
    sns.violinplot(data=pd.concat(ts), x='cluster', y='umi', order=[str(c) for c in sorted(set(clusters))])
    plt.title('UMI summary coverage per cell')
    plt.xlabel('Cluster')
    plt.ylabel('Summary log10 coverage')
    plt.legend()
    pdf.savefig()

No handles with labels found to put in legend.
In [303]:
with PdfPages(os.path.join(figures_path, 'clusters_cell_peaks.pdf')) as pdf:
    ts = []
    plt.figure(figsize=(int(n_clusters * 0.75), 6))    
    for c in tqdm(set(clusters)):
        t = pd.DataFrame({
            'nzpeaks': np.log10(np.count_nonzero(coveragedf.T.iloc[np.flatnonzero(clusters == c)], axis=1) + 1)
        })
        t['cluster'] = str(c)
        ts.append(t)
    sns.violinplot(data=pd.concat(ts), x='cluster', y='nzpeaks', order=[str(c) for c in sorted(set(clusters))])
    plt.title('Peaks with non-zero coverage distribution per cell')
    plt.xlabel('Cluster')
    plt.ylabel('Frequency')
    plt.legend()
    pdf.savefig()

No handles with labels found to put in legend.

Clusters visualization

In [389]:
clusters_linkage = clusters_hierarchy(result_lsa, clusters)
In [318]:
# Silhouette plot for clusters
clusters_silhouette(result_lsa, clusters)
For n_clusters = 20 the average silhouette_score is  0.2232267087605424
In [319]:
print('tSNE on LSA normalized vs graph clusters')
with PdfPages(os.path.join(figures_path, 'tsne_clusters.pdf')) as pdf:
    plot_colored(tsne_coords.values, clusters, clusters=True, show_centers=True, size=20, alpha=0.5)
    pdf.savefig()
tSNE on LSA normalized vs graph clusters
In [320]:
## Uncomment to plot split by cluster
# with PdfPages(os.path.join(figures_path, 'tsne_clusters_split.pdf')) as pdf:
#     plot_colored_split(tsne_coords.values, clusters, size=5)
#     pdf.savefig()

Analyse marker regions vs clustering

In [392]:
print('Hierarchical clustering of mean coverage per peak in cluster')
# Transpose to (barcodes x peaks) format
t = coveragedf.T 
# Per cluster mean values for peaks
clusters_means = {str(c): t.loc[clusters == c].mean() for c in sorted(set(clusters))}
clusters_means_df = pd.DataFrame(clusters_means).reset_index()

clusters_linkage_covdf = hierarchy.linkage(
    clusters_means_df[[str(i) for i in set(clusters)]].T.values, method='ward'
)
hierarchy.dendrogram(clusters_linkage_covdf, orientation='top')
hierarchy.set_link_color_palette(None)  # reset to default after use
Hierarchical clustering of mean coverage per peak in cluster
In [395]:
print('Hierarchical clustering of coverage in markers')
t = clusters_means_df[[str(c) for c in set(clusters)]].copy()
t.index = clusters_means_df['index']
mt = [(k, v) for k, v in marker_regions_peaks.items() if v in t.index]
t = t.loc[[v for _, v in mt]]
t.index = [k for k, _ in mt]
# zscore per row
# t = pd.DataFrame(zscore(t, axis=1), index=t.index, columns=t.columns)
# 0-1 scale per row
t = t.divide(t.max(axis=1), axis='rows')
sns.clustermap(t, figsize=(1 + int(n_clusters / 3), 1 + int(len(mt) / 4)), cmap='bwr',
              col_linkage=clusters_linkage)
plt.show()
Hierarchical clustering of coverage in markers
In [323]:
def regions_plot(regions, df, xs, ys, xlabel, ylabel, factors, clusters, binary):
    print(f'Coverage z-score, distribution and fraction of non-zero covered peaks w.r.t. {FACTOR} and clusters')
    fregions = []
    for k, peak in regions.items():
        if peak in df.index:
            fregions.append((k, peak))
        else:
            print(f'{k} {peak} is missing')
            
    fig = plt.figure(figsize=(4*4, 4*len(fregions)))
    grid = plt.GridSpec(nrows=len(fregions), ncols=8, wspace=0.5, hspace=0.5, figure=fig)
    for i, (k, peak) in enumerate(fregions):
        if peak not in df.index:
            print(f'{k} {peak} is missing')
            continue
        vals = df.loc[peak]
        print(k, peak, 'cov>0:', sum(vals > 0), 'of', len(vals), 'max cov:', max(vals))

        ax = plt.subplot(grid[i, 0:2])
        # Plot zscores scatter plot with 2 segments colormap blue-white-red
        zs = zscore(vals)
        cmap = LinearSegmentedColormap.from_list(name='zscore', 
            colors=[(0, 'darkblue'), ((0 - np.min(zs)) / (np.max(zs) - np.min(zs)), 'white'), (1, 'red')])
        plt.setp(ax, xticks=[], yticks=[])
        ax.scatter(xs, ys, marker='.', c=zs, cmap=cmap, s=2, alpha=0.2)
        plt.title(k)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)        
        
        t = pd.DataFrame({'cluster': clusters, 'factor': factors, 
                          'val': [min(v, 1) for v in vals] if binary else vals})
        
        
        ax = plt.subplot(grid[i, 2])
        # mean non-zero covered peaks by type
        tf = pd.pivot_table(t, values='val', index=['factor'], aggfunc='mean').reset_index()
        sns.barplot(x='factor', y='val', data=tf)
        ax.tick_params(axis='x', labelrotation=90)
        plt.ylabel('')

        ax = plt.subplot(grid[i, 3:])
        # mean non-zero covered peaks by cluster and factor
        tc = t.groupby(['cluster', 'factor']).mean().reset_index()
        sns.barplot(x='cluster', y='val', data=tc, hue='factor')
        plt.ylabel('')
In [326]:
# regions_plot(marker_regions_peaks, coveragedf, 
#              tsne_coords['tsne1'], tsne_coords['tsne2'],
#              'tsne1', 'tsne2',
#             [FACTORS_MAP[f] for f in factors], clusters, True)

Iterative clustering refinement

  • Refine clusters using clustering on LSA-normalized data
  • Create BAM files for each group
  • Call peaks on each cluster (see the sketch below this list)
  • Process a clustering iteration
  • Merge the clusterings into a final result
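
The "call peaks on each cluster" step is not implemented in the cells below; one possible sketch runs MACS2 on the per-cluster BED files that are written further down (MACS2 and its parameters are an assumption here, not something this pipeline prescribes):

# Sketch only: call peaks for each refined group with MACS2 on the per-cluster BED
# files produced below; MACS2 is an assumed external dependency.
def call_cluster_peaks(beds_path, refinement_groups, genome='mm'):
    for refined_group in refinement_groups:
        rgn = '_'.join(str(c) for c in refined_group)
        bed_path = os.path.join(beds_path, f'cluster_{rgn}.bed')
        ! macs2 callpeak -t {bed_path} -f BED -g {genome} -n cluster_{rgn} --outdir {beds_path}
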
In [418]:
def get_fragments(fragments_tsv_gz):
    print('Loading full fragments dataframe')
    return pd.read_csv(fragments_tsv_gz, sep='\t', 
                       names=['chr', 'start', 'end', 'barcode', 'reads'], compression='gzip')

def get_fragments_and_clusters(fragments_tsv_gz, barcodes, clusters):
    fdf = get_fragments(fragments_tsv_gz)

    print('Aggregating fragments and clusters')
    t = pd.merge(left=fdf, 
                 right=pd.DataFrame({'barcode': barcodes, 'cluster': clusters}), 
                 left_on='barcode', right_on='barcode')
    return t
In [ ]:
t = get_fragments_and_clusters(FRAGMENTS_FILE_10X, clusters_df.index, clusters)
In [ ]:
refinement_groups = [(1, 2, 3)]
In [327]:
def write_bigwig(df, bw_path, step):
    chr_lengths = df[['chr', 'end']].groupby('chr').max().reset_index()
    with pyBigWig.open(bw_path, 'w') as bw:
        bw.addHeader(list(zip(chr_lengths['chr'], chr_lengths['end'])))
        for chromosome in tqdm(chr_lengths['chr']):
            df_chr = df.loc[df['chr'] == chromosome].sort_values(['start', 'end'])
            starts = list(df_chr['start'])
            ends = list(df_chr['end'])
            reads = list(df_chr['reads'])
            chr_length = int(chr_lengths.loc[chr_lengths['chr'] == chromosome]['end'])
            values = np.zeros(int(math.floor(chr_length / step)) + 1)

            for i in range(len(df_chr)):
                # Ignore PCR duplicates here!
                # v = reads[i] 
                v = 1.0
                si = int(math.floor(starts[i] / step)) 
                ei = int(math.floor(ends[i] / step)) 
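                # Each fragment counts once, split between the bins containing its start
                # and end cut sites (or assigned fully when both fall into the same bin)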
                if ei == si:
                    values[si] += v
                else:
                    values[si] += v / 2 
                    values[ei] += v / 2

            non_zero = values > 0
            non_zero_inds = np.flatnonzero(non_zero)
            starts_np = non_zero_inds * step
            ends_np = (non_zero_inds + 1) * step
            values_np = values[non_zero]
            # RPM normalization
            values_np = values_np * (1_000_000.0 / sum(values_np))
            chroms_np = np.array([chromosome] * sum(non_zero))
            bw.addEntries(chroms_np, starts=starts_np, ends=ends_np, values=values_np)                     

def write_cluster_bigwigs(fragments_tsv_gz, barcodes, clusters, bigwigs_path, step):
    df_for_bigwigs = get_fragments_and_clusters(fragments_tsv_gz, barcodes, clusters)

    for cluster in sorted(set(clusters)):
        bw_path = os.path.join(bigwigs_path, f'cluster_{cluster}.bw')
        print('Processing', cluster, bw_path)
        df_for_bigwigs_cluster = df_for_bigwigs.loc[df_for_bigwigs['cluster'] == cluster]
        write_bigwig(df_for_bigwigs_cluster, bw_path, step)

    print('Done')
In [ ]:
import pandas as pd
import time 
import pysam
import os

beds_path = os.path.join(OUTPUT_DIR, 'cache', '1')
! mkdir -p {beds_path}

for refined_group in refinement_groups:
    rgn = '_'.join(map(lambda c: str(c), refined_group))
    bed_path = os.path.join(beds_path, f'cluster_{rgn}.bed')
    fragments_path = os.path.join(beds_path, f'cluster_{rgn}.tsv.gz')
    bw_path = os.path.join(beds_path, f'cluster_{rgn}.bw')
    print('Processing', refined_group)
    tg = t.loc[t['cluster'].isin(refined_group)]
    print('Saving reads', bed_path)
    tg[['chr', 'start', 'end']].to_csv(bed_path, sep='\t', header=False, index=False)
    print('Saving fragments', fragments_path)
    tg[['chr', 'start', 'end', 'barcode', 'reads']].to_csv(fragments_path, sep='\t', 
                                                           header=False, index=False,
                                                           compression='gzip')
    print('Saving visualization', bw_path)
    write_bigwig(tg, bw_path, step=200)
    del tg
    

# # variable to hold barcode index
# CB_hold = 'unset'
# itr = 0
# # read in unsplit file and process reads line by line
# samfile = pysam.AlignmentFile(unsplit_file, "rb")

# start_time = time.time()
# for read in samfile.fetch(until_eof=True):
#     if(len([x for x in read.tags if x[0] == "CB"])>0):
#         # barcode itr for current read
#         CB_itr = read.get_tag( 'CB')
#         if(CB_itr in df_barcodes[0].tolist()):
#             # if change in barcode or first line; open new file  
#             if(CB_itr!=CB_hold or itr==0):
#                 # close previous split file, only if not first read in file
#                 if(itr!=0):
#                     split_file.close()
#                 CB_hold = CB_itr
#                 itr+=1
#                 split_file = pysam.AlignmentFile(out_dir + "{}.bam".format(CB_hold), "wb", template=samfile)
#             # write reads with the same barcode to file
#             split_file.write(read)    
# split_file.close()
# samfile.close()
# end_time = time.time()


# print(end_time - start_time)
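
The intersection step below expects a cluster_1_2_3_peaks.narrowPeak file, but peak calling itself is not shown in the notebook. Below is a minimal sketch of how such peaks could be produced with MACS2 from the per-group reads BED files written above; MACS2 availability on PATH, the mouse effective genome size (-g mm) and the ATAC-style shift/extension parameters are assumptions, not fixed pipeline settings.

In [ ]:
# Hedged sketch: call peaks for each refined group with MACS2,
# writing cluster_<group>_peaks.narrowPeak into beds_path for the intersection step below
for refined_group in refinement_groups:
    rgn = '_'.join(str(c) for c in refined_group)
    group_bed = os.path.join(beds_path, f'cluster_{rgn}.bed')
    ! macs2 callpeak -t {group_bed} -f BED -g mm -n cluster_{rgn} --outdir {beds_path} --nomodel --shift -100 --extsize 200 -q 0.05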
In [ ]:
# Process cluster fragments.tsv.gz and clusters peaks, long operation!
idf = intersect_fragments_and_peaks(
    os.path.join(beds_path, 'cluster_1_2_3.tsv.gz'),
    os.path.join(beds_path, 'cluster_1_2_3_peaks.narrowPeak'),
    BLACKLIST_FILE
)
In [ ]:
def process_coverage(idf):
    idfcells = filter_cells(idf, noise_threshold=NOISE_THRESHOLD, duplets_threshold=DUPLETS_THRESHOLD)
    # Compute coverage df
    coveragedf = cellsdf_to_coverage(idfcells)
    return coveragedf
coveragedf = process_coverage(idf)
In [ ]:
# Cleanup memory
del idf
gc.collect()
In [ ]:
result_lsa = process_feature_selection_dim_reduction(coveragedf)
In [ ]:
def factors_summary(coveragedf):
    t = pd.DataFrame({FACTOR: [FACTOR_FUNCTION(bc) for bc in coveragedf.columns],
                     'counts': np.ones(len(coveragedf.columns))}, 
                     index=coveragedf.columns)
    display(t[[FACTOR, 'counts']].groupby([FACTOR]).sum())
    
factors_summary(coveragedf)
In [ ]:
print('Processing t-SNE on L2-normalized data')
tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(result_lsa.values),
                           index=result_lsa.index,
                           columns=['tsne1', 'tsne2'])
plot_colored(tsne_coords.values, np.log10(coveragedf.sum()), size=20)
In [ ]:
# Reprocess coveragedf removing duplets

print('Clip duplets by cutoff')
coveragedf = coveragedf.iloc[np.flatnonzero(np.log10(coveragedf.sum()) <= 4.1), :]

print('Normalization and dimensionality reduction')
result_lsa = process_feature_selection_dim_reduction(coveragedf)

print('Processing t-SNE on L2-normalized data')
tsne_coords = pd.DataFrame(TSNE(n_components=2).fit_transform(result_lsa.values),
                           index=result_lsa.index,
                           columns=['tsne1', 'tsne2'])
plot_colored(tsne_coords.values, np.log10(coveragedf.sum()), size=20)
In [ ]:
for n_clusters in [5, 10, 20]:
    clusters = graph_clustering(result_lsa, n_clusters) 
    clusters_summary(coveragedf, clusters)
    plt.show()
    clusters_hierarchy(result_lsa, clusters)
    plt.show()
    clusters_silhouette(result_lsa, clusters)
    plot_colored(tsne_coords.values, clusters, clusters=True, show_centers=True, size=20, alpha=0.5)
    plt.show()

Create bigwig per cluster

In [ ]:
fdf = get_fragments(os.path.join(beds_path, 'cluster_1_2_3.tsv.gz'))
In [ ]:
for n_clusters in [5, 10, 20]:
    print('Processing', n_clusters)
    group_bigwigs_path = os.path.join(OUTPUT_DIR, f'cache/1/cluster_1_2_3/{n_clusters}')
    if os.path.exists(group_bigwigs_path):
        continue  # Already processed
    ! mkdir -p {group_bigwigs_path}
    clusters = graph_clustering(result_lsa, n_clusters)
    print('Aggregating fragments and clusters')
    df_for_bigwigs = pd.merge(left=fdf, 
                              right=pd.DataFrame({'barcode': result_lsa.index, 'cluster': clusters}), 
                              left_on='barcode', right_on='barcode')
    for cluster in sorted(set(clusters)):
        bw_path = os.path.join(group_bigwigs_path, f'cluster_{cluster}.bw')
        print('Processing', cluster, bw_path)
        write_bigwig(df_for_bigwigs.loc[df_for_bigwigs['cluster'] == cluster], bw_path, step=200)
In [ ]:
# Cleanup
del fdf
del df_for_bigwigs
gc.collect()

Clusters visualization vs 10x Cell Ranger ATAC

In [ ]:
# # print('These clusters into Cell Ranger ATAC coordinates') 
# mergeddf1 = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index, 'Cluster': clusters}),
#                      right=tsne10xdf, 
#                      left_on='Barcode', right_on='Barcode')

# mergeddf2 = pd.merge(pd.DataFrame({'Barcode': tsne_coords.index,
#                                   'tsne1': tsne_coords['tsne1'],
#                                   'tsne2': tsne_coords['tsne2']}),
#                      right=clusters10xdf, 
#                      left_on='Barcode', right_on='Barcode')


# fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 7))

# ax1.title.set_text('Clusters in 10x tsne coordinates')
# plot_colored(mergeddf1[['TSNE-1', 'TSNE-2']].values, 
#              mergeddf1['Cluster'], 
#              clusters=True,
#              show_centers=True,
#              size=20, alpha=0.3, fig_ax=(fig, ax1))

# ax2.title.set_text('10x clusters in tsne coordinates')
# plot_colored(mergeddf2[['tsne1', 'tsne2']].values, 
#              mergeddf2['Cluster'], 
#              clusters=True,
#              show_centers=True,
#              size=20, alpha=0.3, fig_ax=(fig, ax2))
# plt.show()

UMAP visualization

In [331]:
# UMAP visualization
umap_coords = pd.DataFrame(umap.UMAP().fit_transform(result_lsa.values)[:, :2], 
                           index=result_lsa.index,
                           columns=['umap1', 'umap2'])

print('Saving UMAP coordinates to umap.tsv')
umap_coords.to_csv(os.path.join(OUTPUT_DIR, 'umap.tsv'), sep='\t')
Saving UMAP coordinates to umap.tsv
In [332]:
print('UMAP on LSA normalized vs log10 coverage depth')
with PdfPages(os.path.join(figures_path, 'umap.pdf')) as pdf:
    plot_colored(umap_coords.values, np.log10(coveragedf.sum()), size=20)
    pdf.savefig()
UMAP on LSA normalized vs log10 coverage depth
In [333]:
print('UMAP on LSA normalized vs', FACTOR, FACTORS_MAP)
with PdfPages(os.path.join(figures_path, f'umap_{FACTOR}.pdf')) as pdf:
    factors = [FACTOR_FUNCTION(bc) for bc in result_lsa.index]
    plot_colored(umap_coords.values, factors, clusters=True, size=20, alpha=0.01)
    pdf.savefig()
UMAP on LSA normalized vs age {1: 'young', 2: 'old'}
In [334]:
# # Uncomment to plot split by factor
# print(FACTORS_MAP)
# with PdfPages(os.path.join(figures_path, f'umap_{FACTOR}_split.pdf')) as pdf:
#     plot_colored_split(umap_coords.values, factors, size=5, alpha=0.1)
#     pdf.savefig()
In [335]:
print('UMAP on LSA normalized vs graph clusters')
with PdfPages(os.path.join(figures_path, 'umap_clusters.pdf')) as pdf:
    plot_colored(umap_coords.values, clusters, clusters=True, show_centers=True, size=20, alpha=0.3)
    pdf.savefig()
UMAP on LSA normalized vs graph clusters
In [ ]:
# # Uncomment to plot split by cluster
# print('UMAP on LSA normalized vs graph clusters')
# with PdfPages(os.path.join(figures_path, 'umap_clusters_split.pdf')) as pdf:
#     plot_colored_split(umap_coords.values, clusters, size=5)
#     pdf.savefig()

Visualize markers of interest vs clusters

In [ ]:
print('Analysing REGIONS of interest TSNE')
with PdfPages(os.path.join(figures_path, 'tsne_markers.pdf')) as pdf:
    regions_plot(marker_regions_peaks, coveragedf, 
                 tsne_coords['tsne1'], tsne_coords['tsne2'],
                 'tsne1', 'tsne2',
                [FACTORS_MAP[f] for f in factors], clusters, True)
    pdf.savefig()
In [ ]:
print('Analysing REGIONS of interest UMAP')
with PdfPages(os.path.join(figures_path, 'umap_markers.pdf')) as pdf:
    regions_plot(marker_regions_peaks, coveragedf, 
                 umap_coords['umap1'], umap_coords['umap2'],
                 'umap1', 'umap2',
                 [FACTORS_MAP[f] for f in factors], clusters, True)
    pdf.savefig()

BigWig RPM normalized profiles

Pipeline Split reads by clusters and apply RPM normalization (see write_bigwig above).


Seurat We created pseudo-bulk ATAC-seq profiles by pooling together cells for each cell type. Each cell type showed enriched accessibility near canonical marker genes. Chromatin accessibility tracks are normalized to sequencing depth (RPKM normalization) in each pooled group.
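
For reference, a minimal sketch of the difference between per-million (RPM) and per-kilobase-per-million (RPKM) scaling on toy counts; this is illustrative only, since the bigwig tracks below use fixed-size bins and write_bigwig applies just the per-million factor.

In [ ]:
import numpy as np

# Toy example: raw read counts and region lengths (kb) for three regions
counts = np.array([10., 250., 40.])
lengths_kb = np.array([0.2, 1.0, 0.5])

rpm = counts * 1e6 / counts.sum()   # reads per million: depth normalization only
rpkm = rpm / lengths_kb             # additionally corrected for region length
print('RPM:', rpm)
print('RPKM:', rpkm)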

In [328]:
bigwigs_path = os.path.join(OUTPUT_DIR, 'bigwig')
! mkdir -p {bigwigs_path}

write_cluster_bigwigs(FRAGMENTS_FILE_10X, clusters_df.index, clusters, bigwigs_path, step=200)
Loading full fragments dataframe
Aggregating fragments and clusters
Processing 0 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_0.bw
Processing 1 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_1.bw
Processing 2 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_2.bw
Processing 3 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_3.bw
Processing 4 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_4.bw
Processing 5 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_5.bw
Processing 6 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_6.bw
Processing 7 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_7.bw
Processing 8 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_8.bw
Processing 9 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_9.bw
Processing 10 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_10.bw
Processing 11 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_11.bw
Processing 12 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_12.bw
Processing 13 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_13.bw
Processing 14 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_14.bw
Processing 15 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_15.bw
Processing 16 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_16.bw
Processing 17 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_17.bw
Processing 18 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_18.bw
Processing 19 /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/bigwig/cluster_19.bw
Done

Save cluster mean values to table with closest genes information

In [329]:
closest_genes = locations_closest_genes(list(coveragedf.index), transcripts_tss)
# Restore 'peak' to be able to merge on it
closest_genes['peak'] = closest_genes['chr'] +\
                        ':' + closest_genes['start'].astype(str) +\
                        '-' + closest_genes['end'].astype(str)
# display(closest_genes.head())
Annotate list of locations in format "chr:start-end" with closest tss of genes
In [398]:
clusters_means_genes_df = pd.concat([closest_genes.reset_index(), clusters_means_df], axis=1).drop(columns=['index'])
# Save peak without dashes or colons
clusters_means_genes_df['peak'] = [re.sub('[:-]', '_', p) for p in clusters_means_genes_df['peak']]

print('Saving clusters peaks_means to clusters_peaks_values.tsv')
# display(clusters_means_genes_df.head())
clusters_means_genes_df.to_csv(os.path.join(OUTPUT_DIR, 'clusters_peaks_values.tsv'), sep='\t', index=False)
Saving clusters peaks_means to clusters_peaks_values.tsv

Differential peaks analysis (Markers)

Pipeline

Use Cell Ranger ATAC to get differential markers.


Cell Ranger ATAC For each peak, we perform a test of the cluster vs. all other cells; since several clusters may yield significant results for the same peak, only the cluster with the greatest mean value is taken into account (a hedged sketch of this selection follows the test output below).\ The test is performed using the Cell Ranger default testing procedure based on a Negative Binomial GLM.

In [400]:
# Launch 10x Genomics Cell Ranger DE
def run_10x_differential_expression(df, clusters, fdr=0.05):
    """ Compute differential expression for each cluster vs all other cells
        Args: df          - feature expression data (peak x barcode)
              clusters    - 0-based cluster labels
              fdr         - false discovery rate"""

    print("Computing params...")
    m = df.values
    sseq_params = compute_sseq_params(m)
    peaks = list(df.index)    

    cluster_markers = []
    for cluster in tqdm(sorted(set(clusters))):
        in_cluster = clusters == cluster
        group_a = np.flatnonzero(in_cluster)
        group_b = np.flatnonzero(np.logical_not(in_cluster))
        print(f'Computing DE for cluster {cluster}...')

        de_result = sseq_differential_expression(m, group_a, group_b, sseq_params)
        de_result['cluster'] = [cluster] * len(de_result)
        de_result['peak'] = peaks
        passed_fdr = de_result['adjusted_p_value'] < fdr
        passed_log2_fc = de_result['log2_fold_change'] > 0
        # Keep peaks that pass the FDR threshold and have log2 fold change > 0
        passed = np.logical_and(passed_fdr, passed_log2_fc)
        print('DE:', sum(passed), 'of', len(de_result), 
              'FDR:', sum(passed_fdr), 'log2fc>0:', sum(passed_log2_fc))

        passed_de = de_result.loc[passed]
        cluster_de_markers = passed_de[
            ['cluster', 'peak', 'norm_mean_a', 'norm_mean_b', 'log2_fold_change', 'p_value', 'adjusted_p_value']
        ].rename(columns={'norm_mean_a': 'norm_mean_cluster', 'norm_mean_b': 'norm_mean_others'}) 

        cluster_markers.append(cluster_de_markers)    

    return pd.concat(cluster_markers)
In [401]:
# Cell Ranger ATAC like differential analysis
# Each peak is tested independently, no normalization to peak length required.
diff_markers = run_10x_differential_expression(coveragedf.loc[peaks_filter], clusters)
print('Total diff markers', len(diff_markers))
Computing params...
Computing DE for cluster 0...
Computing 94609 exact tests and 3 asymptotic tests.
DE: 14364 of 94612 FDR: 36871 log2fc>0: 40980
Computing DE for cluster 1...
Computing 94611 exact tests and 1 asymptotic tests.
DE: 16469 of 94612 FDR: 32187 log2fc>0: 42733
Computing DE for cluster 2...
Computing 94480 exact tests and 132 asymptotic tests.
DE: 13803 of 94612 FDR: 49973 log2fc>0: 27029
Computing DE for cluster 3...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 6617 of 94612 FDR: 11801 log2fc>0: 49092
Computing DE for cluster 4...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 17589 of 94612 FDR: 31049 log2fc>0: 45513
Computing DE for cluster 5...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 11391 of 94612 FDR: 21962 log2fc>0: 43057
Computing DE for cluster 6...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 10387 of 94612 FDR: 19324 log2fc>0: 43817
Computing DE for cluster 7...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 6260 of 94612 FDR: 15725 log2fc>0: 46847
Computing DE for cluster 8...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 6989 of 94612 FDR: 13281 log2fc>0: 49723
Computing DE for cluster 9...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 5647 of 94612 FDR: 7889 log2fc>0: 54695
Computing DE for cluster 10...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 4508 of 94612 FDR: 6185 log2fc>0: 60058
Computing DE for cluster 11...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 12 of 94612 FDR: 12 log2fc>0: 62556
Computing DE for cluster 12...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 3030 of 94612 FDR: 4469 log2fc>0: 48290
Computing DE for cluster 13...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 50 of 94612 FDR: 51 log2fc>0: 66224
Computing DE for cluster 14...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 4201 of 94612 FDR: 5430 log2fc>0: 62910
Computing DE for cluster 15...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 51 of 94612 FDR: 249 log2fc>0: 70795
Computing DE for cluster 16...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 2627 of 94612 FDR: 2911 log2fc>0: 74505
Computing DE for cluster 17...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 1619 of 94612 FDR: 1741 log2fc>0: 73166
Computing DE for cluster 18...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 632 of 94612 FDR: 769 log2fc>0: 66802
Computing DE for cluster 19...
Computing 94612 exact tests and 0 asymptotic tests.
DE: 1057 of 94612 FDR: 1057 log2fc>0: 82054

Total diff markers 127303
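
Following the rule in the method summary above, when a peak passes the test in several clusters only the cluster with the greatest mean value is kept. Below is a minimal sketch of such a post-processing step on diff_markers, assuming the column names produced by run_10x_differential_expression; downstream cells continue to use the full diff_markers table.

In [ ]:
# Hedged sketch: deduplicate peaks that pass the test in several clusters,
# keeping the row with the greatest within-cluster mean (norm_mean_cluster)
unique_markers = (diff_markers
                  .sort_values('norm_mean_cluster', ascending=False)
                  .drop_duplicates(subset='peak', keep='first'))
print('Unique marker peaks', len(unique_markers))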

Markers summary

In [402]:
print('Number of differential markers per cluster')
t = diff_markers[['cluster']].copy()
t['count'] = 1
display(pd.pivot_table(t, values='count', index=['cluster'], aggfunc=np.sum))
Number of differential markers per cluster
count
cluster
0 14364
1 16469
2 13803
3 6617
4 17589
5 11391
6 10387
7 6260
8 6989
9 5647
10 4508
11 12
12 3030
13 50
14 4201
15 51
16 2627
17 1619
18 632
19 1057

Annotate markers with genes

In [403]:
markers_with_genes = pd.merge(left=diff_markers[['cluster', 'norm_mean_cluster', 'norm_mean_others', 
                                                 'log2_fold_change', 'p_value', 'adjusted_p_value', 'peak']],
                         right=closest_genes,
                         left_on='peak',
                         right_on='peak')

# Save peak without dashes or colons
t = markers_with_genes['peak'].copy()
markers_with_genes['peak'] = [re.sub('[:-]', '_', p) for p in markers_with_genes['peak']]

# rearrange columns, sort
markers_with_genes = markers_with_genes[['chr', 'start', 'end', 'peak',
                                         'cluster', 'norm_mean_cluster', 'norm_mean_others', 
                                         'log2_fold_change', 'p_value', 'adjusted_p_value', 
                                         'gene', 'distance']].sort_values(by=['cluster', 'adjusted_p_value'])
# display(markers_with_genes.head())

print('Saving all the markers', len(markers_with_genes), 'to markers.tsv')
markers_with_genes.to_csv(os.path.join(OUTPUT_DIR, 'markers.tsv'), sep='\t', index=None)

# Restore peaks
markers_with_genes['peak'] = t
Saving all the markers 127303 to markers.tsv

Analyzing top markers

In [404]:
def get_top_markers(df, n_clusters, top=100):
    top_markers = pd.concat([
        df.loc[df['cluster'] == c].sort_values(by=['adjusted_p_value']).head(top) 
                    for c in range(0, n_clusters)])
    return top_markers
In [405]:
print('Computing Z scores for mean values in cluster for top markers')
top_markers = get_top_markers(markers_with_genes, n_clusters, top=100)
t = coveragedf.T # Transpose to (barcodes x peaks) format
t = t[list(top_markers['peak'])] # top markers only

t = pd.concat([pd.DataFrame(t.loc[clusters == cluster].mean()).T for cluster in set(clusters)])
t.index = range(n_clusters)

# Transform to Z-score
for c in t.columns:
    t[c] = zscore(t[c])

print('Clustermap of Z scores')
with PdfPages(os.path.join(figures_path, 'markers_z.pdf')) as pdf:
    sns.clustermap(t.T, col_cluster=False, row_cluster=False, figsize=(10, 10), cmap=plt.cm.seismic)
    pdf.savefig()
Computing Z scores for mean values in cluster for top markers
Clustermap of Z scores

Top markers visualization on t-SNE

In [406]:
def show_top_markers_z(t, xs, ys, markers_with_genes, clusters, n_clusters, top=100):
    top_markers = get_top_markers(markers_with_genes, n_clusters, top)
    nrows, ncols = n_rows_columns(n_clusters)
    plt.subplots(nrows=nrows, ncols=ncols, figsize=(5 * ncols, 4 * nrows))
    for i, cluster in enumerate(tqdm(set(clusters))):
        # Table (marker for cluster x cells)
        markers_df = t.loc[top_markers.loc[top_markers['cluster'] == cluster]['peak']].T.copy()

        # Z scale for each marker peak across all cells
        for c in markers_df.columns:
            markers_df[c] = zscore(markers_df[c])

        # Average Z score for each cell
        zs = markers_df.T.mean()
        ax = plt.subplot(nrows, ncols, i + 1)
        # Average is much more stable than coverage, we can use linear colormap here
        plt.setp(ax, xticks=[], yticks=[])    
        # Plot zscores scatter plot with 2 segments colormap blue-white-red
        cmap = LinearSegmentedColormap.from_list(name='zscore', 
            colors=[(0, 'darkblue'), ((0 - np.min(zs)) / (np.max(zs) - np.min(zs)), 'white'), (1, 'red')])        
        sc = ax.scatter(xs, ys, s=5, c=zs, cmap=cmap, edgecolor='none', alpha=1)
#         sc = ax.scatter(xs, ys, s=5, c=zs, cmap=plt.cm.seismic, edgecolor='none', alpha=1)
        plt.colorbar(sc)
        plt.title(f'Cluster {cluster}')
#             plt.xlabel('tSNE axis 1')
#             plt.ylabel('tSNE axis 2')
In [407]:
print('t-SNE based Z-SCORE visualizations for markers, mean z-score for 100 top markers')
    
with PdfPages(os.path.join(figures_path, f'tsne_diff_markers_100.pdf')) as pdf:    
    show_top_markers_z(coveragedf, tsne_coords['tsne1'], tsne_coords['tsne2'], 
                       markers_with_genes, clusters, n_clusters, 100)
    pdf.savefig()
t-SNE based Z-SCORE visualizations for markers, mean z-score for 100 top markers

In [408]:
print('t-SNE based Z-SCORE visualizations for markers, z-score for 1 top marker')
t = coveragedf
top_markers_1 = get_top_markers(markers_with_genes, n_clusters, top=1)
top_markers_map = {}
for i, row in top_markers_1.iterrows():
    top_markers_map[str(row['cluster'])] = row['peak']
    
with PdfPages(os.path.join(figures_path, 'tsne_diff_markers_1.pdf')) as pdf:
    regions_plot(top_markers_map, coveragedf, 
                 tsne_coords['tsne1'], tsne_coords['tsne2'],
                 'tsne1', 'tsne2',
                 factors, clusters, True)
    pdf.savefig()
t-SNE based Z-SCORE visualizations for markers, z-score for 1 top marker
Coverage z-score, distribution and fraction of non-zero covered peaks w.r.t. age and clusters
0 chr4:44666690-44675453 cov>0: 1264 of 14471 max cov: 9.0
1 chr2:160365287-160368341 cov>0: 655 of 14471 max cov: 7.0
2 chr12:26184365-26189432 cov>0: 1540 of 14471 max cov: 13.0
3 chr1:21039967-21041448 cov>0: 340 of 14471 max cov: 3.0
4 chr10:21607385-21611283 cov>0: 564 of 14471 max cov: 5.0
5 chr1:153084958-153086006 cov>0: 582 of 14471 max cov: 4.0
6 chr4:149282114-149283092 cov>0: 315 of 14471 max cov: 4.0
7 chr17:34141403-34148053 cov>0: 1904 of 14471 max cov: 6.0
8 chr11:75318407-75319423 cov>0: 190 of 14471 max cov: 4.0
9 chr10:61260994-61261978 cov>0: 188 of 14471 max cov: 6.0
10 chr14:25421452-25422471 cov>0: 242 of 14471 max cov: 3.0
11 chr3:137957946-137958320 cov>0: 32 of 14471 max cov: 6.0
12 chr5:90880879-90882610 cov>0: 1279 of 14471 max cov: 7.0
13 chr7:128569303-128570642 cov>0: 153 of 14471 max cov: 3.0
14 chr18:21299483-21300880 cov>0: 368 of 14471 max cov: 6.0
15 chr10:115395910-115396460 cov>0: 112 of 14471 max cov: 2.0
16 chr1:88211647-88215023 cov>0: 170 of 14471 max cov: 10.0
17 chr10:127176074-127180682 cov>0: 430 of 14471 max cov: 12.0
18 chr2:173132838-173133237 cov>0: 43 of 14471 max cov: 2.0
19 chr6:125452755-125461983 cov>0: 745 of 14471 max cov: 13.0

Save differential markers to BED

In [413]:
cluster_peaks_path = os.path.join(OUTPUT_DIR, 'markers')
! mkdir -p {cluster_peaks_path}
print('Cleanup peaks_clusters', cluster_peaks_path)
for f in glob.glob(f'{cluster_peaks_path}/markers_*.bed'):
    os.remove(f)

mlap_max = (-np.log10(diff_markers.loc[diff_markers['adjusted_p_value'] != 0]['adjusted_p_value'])).max()
print('BED scoring as -log10 adjusted pval')
print('Max of -log10 adjusted pval', mlap_max)

for c in set(clusters):
    bed_file = os.path.join(cluster_peaks_path, f'markers_{c}.bed')
    markers_cluster = diff_markers.loc[diff_markers['cluster'] == c].sort_values(
        ['log2_fold_change'], ascending=False
    ).iloc[::-1]
    markers_cluster.index = range(len(markers_cluster))
    print('Saving cluster', c, 'marker peaks to', bed_file)
    with open(bed_file, 'w') as bed:
        for i in range(len(markers_cluster)):
            peak = markers_cluster['peak'][i]
            ap = markers_cluster['adjusted_p_value'][i]
            mlap = -np.log10(ap) if ap != 0.0 else mlap_max
            line = '{}\t{}\t{}\t.\n'.format(
                re.sub(':|-', '\t', peak), 
                c,
                mlap)
            bed.write(line)
print('Done')                
Cleanup peaks_clusters /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers
BED scoring as -log10 adjusted pval
Max of -log10 adjusted pval 201.37745804466806
Saving cluster 0 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_0.bed
Saving cluster 1 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_1.bed
Saving cluster 2 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_2.bed
Saving cluster 3 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_3.bed
Saving cluster 4 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_4.bed
Saving cluster 5 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_5.bed
Saving cluster 6 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_6.bed
Saving cluster 7 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_7.bed
Saving cluster 8 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_8.bed
Saving cluster 9 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_9.bed
Saving cluster 10 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_10.bed
Saving cluster 11 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_11.bed
Saving cluster 12 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_12.bed
Saving cluster 13 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_13.bed
Saving cluster 14 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_14.bed
Saving cluster 15 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_15.bed
Saving cluster 16 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_16.bed
Saving cluster 17 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_17.bed
Saving cluster 18 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_18.bed
Saving cluster 19 marker peaks to /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/markers/markers_19.bed
Done

Data preparation for Single Cell Explorer

Preprocess data for Single Cell Explorer: https://ctlab.github.io/SCE

In [414]:
def export_to_cse(df, clusters_df, tsne_coords, umap_coords, markers,
                  sce_path, token, name, public, organism,
                  markers_x_column='gene', markers_gene_column='peak'):
    print('SCE path', sce_path)
    print('SCE token', token)
    print('SCE name', name)
    print('SCE public', public)
    print('SCE organism', organism)
    
    # Sanity check indexes
    assert np.array_equal(tsne_coords.index, umap_coords.index)
    assert np.array_equal(tsne_coords.index, df.columns)
    assert np.array_equal(clusters_df.index, df.columns)
    clusters = clusters_df['cluster']

    print('Processing', 'dataset.json')
    with open(os.path.join(sce_path, 'dataset.json'), 'w') as f:
        f.write(json.dumps({
            'token': token,
            'public': public,
            'name': name,
            'organism': organism,
            'cells': df.shape[1]
        }))


    print('Processing', 'plot_data.json')
    fields = {
      "tSNE_1": {
        "type": "numeric",
        "range": [
          min(tsne_coords.loc[df.columns.values]['tsne1']),
          max(tsne_coords.loc[df.columns.values]['tsne1'])
        ]
      },
      "tSNE_2": {
        "type": "numeric",
        "range": [
          min(tsne_coords.loc[df.columns.values]['tsne2']),
          max(tsne_coords.loc[df.columns.values]['tsne2'])
        ]
      },
      "UMAP_1": {
        "type": "numeric",
        "range": [
          min(umap_coords.loc[df.columns.values]['umap1']),
          max(umap_coords.loc[df.columns.values]['umap1'])
        ]
      },
      "UMAP_2": {
        "type": "numeric",
        "range": [
          min(umap_coords.loc[df.columns.values]['umap2']),
          max(umap_coords.loc[df.columns.values]['umap2'])
        ]
      },
      "Cluster": {
        "type": "factor",
        "levels": [str(c) for c in set(clusters)]
      },
      "Age": {
        "type": "factor",
        "levels": list(FACTORS_MAP.values())
      },    
      "nUmi": {
        "type": "numeric",
        "range": [
          min(df.sum()),
          max(df.sum()),
        ]
      }
    }

    data = pd.DataFrame({      
        "tSNE_1": tsne_coords.loc[df.columns.values]['tsne1'].tolist(),
        "tSNE_2": tsne_coords.loc[df.columns.values]['tsne2'].tolist(),
        "UMAP_1": umap_coords.loc[df.columns.values]['umap1'].tolist(),
        "UMAP_2": umap_coords.loc[df.columns.values]['umap2'].tolist(),
        "Cluster": [str(c) for c in clusters_df.loc[df.columns.values]['cluster']],
        "Age": [FACTORS_MAP[FACTOR_FUNCTION(bc)] for bc in df.columns],
        "nUmi": df.sum().tolist(),
        "_row": list(df.columns)
    }).to_dict('records')

    annotations = {
      "tsne_Cluster_centers": {
        "type": "text",
        "value": "Cluster",
        "coords": [
          "tSNE_1",
          "tSNE_2"
        ],
        "data": [
          {
            "Cluster": str(c),
            "tSNE_1": float(np.mean(tsne_coords.loc[df.columns.values].values[
                np.flatnonzero(clusters_df.loc[df.columns.values]['cluster'] == c), 0])),
            "tSNE_2": float(np.mean(tsne_coords.loc[df.columns.values].values[
                np.flatnonzero(clusters_df.loc[df.columns.values]['cluster'] == c), 1])),
            "Text": str(c)
          } for c in set(clusters)
        ]    
      },
    }

    with open(os.path.join(sce_path, 'plot_data.json'), 'w') as f:
        f.write(json.dumps({
            "fields": fields,
            "data": data,
            "annotations": annotations,
        }))


    # This file simply contains gene names, cell barcodes/names and total UMI per cell. 
    # This file must reflect row and column names of matrix data.h5 which contains expression data.    
    print('Processing', 'exp_data.json')
    with open(os.path.join(sce_path, 'exp_data.json'), 'w') as f:
        f.write(json.dumps({
          "genes": list(df.index.values),
          "barcodes": list(df.columns.values),
          "totalCounts": df.sum(axis=0).astype(int).tolist()
        }))


    print('Processing', 'data.h5')
    ! rm -f {sce_path}/data.h5
    # Fix missing locks
    os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'
    # Don't use pandas to_hdf because of unnecessary overhead.
    # IMPORTANT: transpose matrix because of different order!
    # Column-major order is used by Fortran, Matlab, R, and most underlying core linear algebra libraries (BLAS). 
    # Row-major ordering is sometimes called “C” style ordering and column-major ordering “Fortran” style. 
    # Python/NumPy refers to the orderings in array flags as C_CONTIGUOUS and F_CONTIGUOUS, respectively.
    # https://cran.r-project.org/web/packages/reticulate/vignettes/arrays.html
    expdf = df.astype(int).clip(upper=1) # Clip coverage to 0-1 for better visualization
    with h5py.File(os.path.join(sce_path, 'data.h5'), 'w') as hf:
        hf.create_dataset("expression/mat", 
                          data=expdf.T.values, # Column-major ordering!
                          compression="gzip", compression_opts=9, 
                          dtype='i4') # int32 ('i4') is required here for SCE, see https://github.com/ctlab/SCE/issues/4


    print('Processing', 'markers.json')
    # Export all provided markers
    with open(os.path.join(sce_path, 'markers.json'), 'w') as f:
        f.write(json.dumps({
            "Cluster": [ 
              {
                  'X': row[markers_x_column],
                  'p_val': row['p_value'],
                  'p_val_adj': row['adjusted_p_value'],
                  'avg_logFC': row['log2_fold_change'],
                  'pct.1': row['norm_mean_cluster'],
                  'pct.2': row['norm_mean_others'],
                  'cluster': str(row['cluster']),
                  'gene': row[markers_gene_column],
                  'distance': row['distance']
              } for index, row in markers.iterrows()
          ]
        }))

    print('Done')
In [415]:
sce_path = os.path.join(OUTPUT_DIR, 'sce')
! mkdir -p {sce_path}
filtereddf = coveragedf.loc[peaks_filter]
export_to_cse(filtereddf, clusters_df, tsne_coords, umap_coords, 
              markers=markers_with_genes,
              sce_path=sce_path,
              token=SCE_TOKEN, 
              name=SCE_NAME,
              public=SCE_PUBLIC,
              organism=SCE_ORGANISM)
SCE path /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/sce
SCE token 2019_scatacseq_mice_lungs_1vs1
SCE name 2019_scatacseq_mice_lungs_1vs1
SCE public False
SCE organism mm10
Processing dataset.json
Processing plot_data.json
Processing exp_data.json
Processing data.h5
Processing markers.json
Done

Save single cell explorer data with respect to genes instead of peaks

In [416]:
print(f'Associated genes {len(closest_genes)}')
t = closest_genes[['gene', 'peak']].copy()
t['distance'] = np.abs(closest_genes['distance'])
t = t.loc[t['distance'] < SCE_PEAK_GENE_TSS_MAX_DISTANCE]
print(f'Associated genes with distance to TSS < {SCE_PEAK_GENE_TSS_MAX_DISTANCE} {len(t)}')

print('Building mapping peak -> gene, closest peak is mapped to gene, second closest with suffix _1, etc')
t.sort_values(by=['gene', 'distance'], inplace=True)
associated_genes_map = {}
prev_gene = None
prev_count = 1
for _, (gene, peak, distance) in t.iterrows():
    if prev_gene != gene:
        associated_genes_map[peak] = gene
        prev_count = 1
        prev_gene = gene
    else:
        associated_genes_map[peak] = f'{gene}_{prev_count}'
        prev_count += 1
        
print('Preparing coverage dataframe per gene')
genes_filtereddf = filtereddf.loc[filtereddf.index.isin(associated_genes_map)].copy()
genes_filtereddf.index = [associated_genes_map[peak] for peak in genes_filtereddf.index]

print('Preparing top markers by gene')
associated_markers_with_genes = \
    markers_with_genes.loc[np.abs(markers_with_genes['distance']) < SCE_PEAK_GENE_TSS_MAX_DISTANCE].copy()

print('Updating marker gene names by distance to gene with suffix')
associated_markers_with_genes['gene'] = [
    associated_genes_map[f'{chr}:{start}-{end}'] for chr, start, end in 
        zip(associated_markers_with_genes['chr'],
            associated_markers_with_genes['start'],
            associated_markers_with_genes['end'])
]
Associated genes 98449
Associated genes with distance to TSS < 10000 49814
Building mapping peak -> gene, closest peak is mapped to gene, second closest with suffix _1, etc
Preparing coverage dataframe per gene
Preparing top markers by gene
Updating marker gene names by distance to gene with suffix
In [417]:
genes_sce_path = os.path.join(OUTPUT_DIR, 'sce_genes')
! mkdir -p {genes_sce_path}

export_to_cse(genes_filtereddf, clusters_df, tsne_coords, umap_coords, 
              markers=associated_markers_with_genes,
              sce_path=genes_sce_path,
              token=f'{SCE_TOKEN}_genes', 
              name=f'{SCE_NAME} by genes',
              public=SCE_PUBLIC,
              organism=SCE_ORGANISM,
              markers_x_column='peak', markers_gene_column='gene')
SCE path /mnt/stripe/shpynov/2019_sc_atacseq_DT1755_1vs1_mice/pipeline/sce_genes
SCE token 2019_scatacseq_mice_lungs_1vs1_genes
SCE name 2019_scatacseq_mice_lungs_1vs1 by genes
SCE public False
SCE organism mm10
Processing dataset.json
Processing plot_data.json
Processing exp_data.json
Processing data.h5
Processing markers.json
Done
In [ ]: