Welcome to HiC1Dmetrics’s documentation!

1. Overview and Installation

1.1 Introduction

We presented “HiC1Dmetrics”, a pipeline that are able to calculate and analysis one-dimensional (1D) metrics for Hi-C data. HiC1Dmetrics is a Python3-based program (https://github.com/wangjk321/HiC1Dmetrics) and provide command line interface for UNIX system.

RTDimport

1.2 Installation

HiC1Dmetrics were released on PyPI, and could be accessed by:

pip3 install h1d

After installation, try:

$ h1d -V
h1d version 0.1.20

1.3 Requirements

HiC1Dmetrics is based on python3 and it requires:

  • Python (>= 3.6) packages:

    • pandas

    • numpy

    • scipy

    • scikit-learn

    • statsmodels

    • matplotlib

    • seaborn

    • fithic==2.0.7

    • multiprocess

    • cooler == 0.8.2

  • Others:

    • bedtools >= 2.29.2

All required python packages will be automatically installed when use pip install h1d

1.4 Test file

In this tutorial, we prepared servel files for test. All results in this tutorial can be reproduced using test data. We mainly used the data from GSE104334.

  • Zipped intra-chromosomal contacts (can be find in the ./test_data/ of GitHub page)

    • GSE104334_Ctrl.chr21.matrix.gz

    • GSE104334_Rad21KD.chr21.matrix.gz

  • Hi-C file in .hic (Juicer) format (the download link is in ./test_data/ of GitHub page)

    • GSE104334_Ctrl.hic

    • GSE104334_Rad21KD.hic

  • Hi-C file in .cool (Cooler) format (the download link is in ./test_data/ of GitHub page)

    • GSE104334_Ctrl.cool

    • GSE104334_Rad21KD.cool

  • Two folder for the test of multisamples as described here:

    • multisample1

    • multisample2

  • Other files that support the test (can be find in the ./test_data/ of GitHub page):

    • hg19_genome_table.txt

    • hg19_geneDensity50000.txt

1.5 Input format

HiC1Dmetrics support:

  • raw .hic defined by juicer software.

  • OR raw cool defined by cooler software

  • OR dense matrix (raw or zipped) of intra-chromosomal contacts, like:

0 25000 50000 75000 ...
0 0 0 0 0 ...
25000 0 8 3 5 ...
50000 0 3 8 4 ...
75000 0 5 4 0 ...
... ... ... ... ... ...

Note 1: when use .hic file, a genome table file (tab-separated) must be prepared, which described the length of each chromosome for your genome reference:

chr1 248956422
chr2 242193529
... ...

Note 2: Calculation of Interaction Frequency (IF) can only accept raw .hic file

1.6 Overall usage

The function of HiC1Dmetrics include serveral sub-command:

$ h1d -h
positional arguments:
  {basic,one,two,multitypes,multisamples,call}
                        Choose the mode to use sub-commands
    basic               Provide basic functions to visualize and handle Hi-C
                        data.
    one                 1D metrics designed for one Hi-C sample
    two                 1D metrics designed for comparison of two Hi-C samples
    multitypes          Various types of 1D metrics for the same sample
    multisamples        The same metrics for muliple samples
    call                Extract secondary information from metrics (dTAD,
                        stripeTAD, et.al)

In the next section, we mainly use a high-resolution Hi-C data by Rao et,al. (GSE104334) to make this tutorial

1.7 Memory requirement

  • HiC1Dmetrics uses acceptable memory (serveral gigabytes) on high-resolution (i.e. >=5kb) Hi-C。

  • Current HiC1Dmetrics does not assume the Hi-C data with super-high resolution (e.g. <1kb). The application on fine resolution Hi-C will use much more memory. i.e., for 1kb resolution Hi-C data (GSE63525, chromosome 21), ‘.hic’ input uses ~69 Gb memory, while ‘dense matrix’ input uses ~10 Gb memory; For mouse chromosome 1, 1kb-data uses >200 Gb memory.

  • One reason is that we extract matrix from .hic file using Juicertools (the first step), which requires hundreds of gigabytes of memory (>300 Gb at chromosome 1) in 1kb resolution. The other reason is that h1d still need to load and deal with (the second step) the huge 2D matrix (e.g., the file size of 1kb dense matrix of chromosome 1 is ~ 233 Gb).

  • We recommend using the dense matrix as the input instead of .hic/ .cool file to reduce the memory usage.

2. Calculate 1D metrics for one sample

2.1 Quick start

h1d one IS ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
		50000 chr21 -o Control_IS_chr21

This command would generate a bedGraph file (Control_IS_chr21.bedGraph) for Insulation Score. An example file can be download fron here

... ... ...
chr21 24300000 24350000 0.59419
chr21 24350000 24400000 0.604341
... ... ...

2.2 Usage

The analysis of one-sample metrics cound be run by h1d one sub-command :

$ h1d one -h # type -h for help
usage: h1d one [-h] [-p PARAMETER] [-o OUTNAME] [-d] [-s START] [-e END]
               [--datatype DATATYPE] [--gt GT] [--prefix PREFIX]
               [--maxchr MAXCHR] [-n NPROCESSER] [-t TADFILE]
               type data resolution chromosome

1D metrics designed for one Hi-C sample.

positional arguments:
  type                  Type of 1D metrics,,should be one of
                        {IS,CI,DI,SS,DLR,PC1,IES,IAS,IF}.
  data                  Path of matrix or rawhic file.
  resolution            Resolution of input matrix.
  chromosome            Chromosome number.

optional arguments:
  -h, --help            show this help message and exit
  -p PARAMETER, --parameter PARAMETER
                        Parameter for indicated metrics.
  -o OUTNAME, --outname OUTNAME
                        output name (default: 'metrics').
  -d, --draw            Plot figure for candidate region.
  -s START, --start START
                        Start sites for plotting.
  -e END, --end END     End sites for plotting.
  --datatype DATATYPE   Type of input data: [matrix(default),rawhic,cool]
  --msi MSI             Method for significant interactions: [fithic2,hiccups]
  --gt GT               genome_table file.
  --prefix PREFIX       ${prefix}chr1.matrix.gz
  --maxchr MAXCHR       Maximum index of chromosome (human genome is 22,i.e.)
  -n NPROCESSER, --nProcesser NPROCESSER
                        Number of processors
  -t TADFILE, --TADfile TADFILE
                        Give a TAD file, instead of using building-in TAD
                        calling method
  • type : type of 1D metrics could be one of {IS,CI,DI,SS,DLR,PC1,IES,IAS,IF}:

    Details is shown in our paper: link in the future

  • -p, --parameter for each 1D metric is :

Type Description default value
DI length of bins 1000000
IS square size 300000
CI length of bins 300000
SS length of bins 300000
DLR Local distance 3000000
PC1 Gene density file None
IAS Paramter for TAD calling 300000
IES Paramter for TAD calling 300000
IF FDR threshold 0.05

Note !! : The sign of PC1 value is arbitrary unless provide a geneDensity file.

  • -t TADFILE, specify a TAD file (.bed format) to replace the built-in TAD calling method.

  • --msi, specify the method to calculate significant interactions. ‘fithic2’ (default) or ‘hiccups’ is supported.

2.3 Calculate 1D metrics (one-sample)

  • Use contact matrix:

    h1d one CI ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
    	50000 chr21 -p 300000 -o control_CI_chr21 --datatype matrix
    
  • Use raw .hic file:

    h1d one CI ./test_data/GSE104334_Ctrl.hic \
    	50000 chr21  -p 300000 -o control_CI_chr21 --datatype rawhic \
    	--gt ./test_data/hg19_genome_table.txt	
    
  • Use raw .cool file

    h1d one CI ./test_data/GSE104334_Ctrl.50000.cool \
    	50000 chr21 -p 300000 -o control_CI_chr21 --datatype cool
    	--gt ./test_data/hg19_genome_table.txt
    

    Output will be control_CI_chr21.bedGraph as described before.

Calculation of IF only support rawhic datatype

Multiprocessing for all chromomes:

To deal with multiple chromosomes, you should first use dump function in h1d.

To run all chromosomes parallel, do:

h1d one IS ./test/Control/ 50000 all 
	--maxchr 22 --prefix observed.KR. -n 30 -o control
  • chromosome , set chromosome to “all” will compute metrics for all chromosomes.

  • data, if calculating for all chromosomes, the input file should be absolute folder of contact matrix.

  • -maxchr, Maximum index of chromosome (human genome is 22,i.e.). It will compute chromosome 1~maxchr plus chromosome X.

  • --prefix, the prefix of matrix file, please modify the name of zipped matrix to ${prefix}chr1.matrix.gz. If you used our dump function, the file should be:

    ├── observed.KR.chr1.matrix.gz
    ├── observed.KR.chr10.matrix.gz
    ├── observed.KR.chr11.matrix.gz
    ├── observed.KR.chr12.matrix.gz
    ├── observed.KR.chr13.matrix.gz
    ├── observed.KR.chr14.matrix.gz
    

    so the prefix is observed.KR.

  • -n, Number of processors

Output would be control_IS_allchr.csv.

2.4 Visulize 1D metrics (one-sample)

  • Use contact matrix:

    h1d one CI ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
    	50000 chr21 -p 300000 -o Control_CI_chr21 --datatype matrix \
      --draw -s 26000000 -e 33000000
    
  • Use raw hic:

    h1d one CI ./test_data/GSE104334_Ctrl.hic \
    	50000 chr21 -p 300000 -o Control_CI_chr21 --datatype rawhic \
    	--gt ./test_data/hg19_genome_table.txt --draw -s 26000000 -e 33000000
    
  • Use cool file:

    h1d one CI ./test_data/GSE104334_Ctrl.50000.cool \
    	50000 chr21 -p 300000 -o Control_CI_chr21 --datatype cool \
    	--gt ./test_data/hg19_genome_table.txt --draw -s 26000000 -e 33000000
    

The output will be control_CI_chr21.bedGraph and control_CI_chr21.pdf:

RTDimport

3. Calculate 1D metrics for comparison of two samples

3.1 Quick start

h1d two ISC ./test_data/GSE104334_Rad21KD.chr21.matrix.gz \
	./test_data/GSE104334_Ctrl.chr21.matrix.gz \
	50000 chr21 -o treat_vs_control_ISC

This command would generate a bedGraph file (treat_vs_control_ISC.bedGraph) for ISC. An example result can be checked from here

3.2 Usage

The analysis of two-sample metrics cound be run by h1d two sub-command :

$ h1d two -h # type -h for help
usage: __main__.py two [-h] [-p PARAMETER] [-o OUTNAME] [-d] [-s START]
                       [-e END] [--datatype DATATYPE] [--gt GT]
                       type matrix controlmatrix resolution chromosome

1D metrics designed for comparison of two Hi-C samples

positional arguments:
  type                  Type of 1D metrics for two-sample comparison,should be
                        one of {ISC,CIC,SSC,deltaDLR,CD,IESC,IASC,IFC,DRF}
  matrix                Path of treated file (matrix or rawhic).
  controlmatrix         Path of control file (matrix or rawhic).
  resolution            Resolution of input matrix
  chromosome            Chromosome number ('chr21',i.e).

optional arguments:
  -h, --help            show this help message and exit
  -p PARAMETER, --parameter PARAMETER
                        Parameter for indicated metrics
  -o OUTNAME, --outname OUTNAME
                        output name (default: metricsChange)
  -d, --draw            Plot figure for candidate region
  -s START, --start START
                        Start sites for plotting
  -e END, --end END     End sites for plotting
  --datatype DATATYPE   matrix or rawhic
  --gt GT               genome table file
  • type : type of 1D metrics could be one of {ISC,CIC,SSC,deltaDLR,CD,IESC,IASC,IFC,DRF}:

    • Insulation Score Change (ISC) (PMID: 31495782)

    • Contrast Index Change(CIC)

    • Separation Score Change(SSC)

    • Delta Distal-to-Local Ratio (deltaDLR) (PMID: 30146161)

    • Correlation Difference (CD) (PMID: 20513432)

    • IntraScore Change(IASC)

    • InterScore Change (IESC)

    • Interaction Frequency Change (IFC)

    • Directional Relative Frequency (DRF)(Original metric)

    Details is shown in our paper: link in the future

  • -p or --parameter for each metrics:

Type Description default value
ISC square size 300000
CIC length of bins 300000
SSC length of bins 300000
deltaDLR Local distance 3000000
CD Correlation method pearson
IASC Paramter for TAD calling 300000
IESC Paramter for TAD calling 300000
IFC FDR threshold 0.05
DRF length of bins 200000-5000000

3.3 Calculate 1D metrics (two-sample)

  • Use contact matrix:

    h1d two ISC ./test_data/GSE104334_Rad21KD.chr21.matrix.gz \
    	./test_data/GSE104334_Ctrl.chr21.matrix.gz 50000 chr21 \
    	--datatype matrix -p 300000 -o treat_vs_control_ISC
    
  • Use raw hic:

    h1d two ISC ./test_data/GSE104334_Rad21KD.hic \
    	./test_data/GSE104334_Ctrl.hic 50000 chr21 \
    	--datatype rawhic --gt ./test_data/hg19_genome_table.txt \
    	-p 300000 -o treat_vs_control_ISC
    
  • Use cool file

    h1d two ISC ./test_data/GSE104334_Rad21KD.50000.cool \
    	./test_data/GSE104334_Ctrl.50000.cool 50000 chr21 \
    	--datatype cool --gt ./test_data/hg19_genome_table.txt \
    	-p 300000 -o treat_vs_control_ISC
    

3.3.1 Multiprocessing for all chromomes:

  • chromosome , set chromosome to “all” will compute metrics for all chromosomes.

  • data, if calculating for all chromosomes, the input file should be absolute folder of contact matrix.

  • -maxchr, Maximum index of chromosome (human genome is 22,i.e.). It will compute chromosome 1~maxchr plus chromosome X.

  • --prefix, the prefix of matrix file, please modify the name of zipped matrix to ${prefix}chr1.matrix.gz. If you used our dump function, the file should be:

    ├── observed.KR.chr1.matrix.gz
    ├── observed.KR.chr10.matrix.gz
    ├── observed.KR.chr11.matrix.gz
    ├── observed.KR.chr12.matrix.gz
    ├── observed.KR.chr13.matrix.gz
    ├── observed.KR.chr14.matrix.gz
    

    so the prefix is observed.KR.

  • -n, Number of processors

To run all chromosomes parallel (treat vs control), do:

h1d one ISC ./test/Treat/ ./test/Control/  50000 all 
	--maxchr 22 --prefix observed.KR. -n 30 -o treat_vs_control

Output would be treat_vs_control_IS_allchr.csv.

3.4 Visualize 1D metrics (two-sample)

  • Use contact matrix:

    h1d two CIC ./test_data/GSE104334_Rad21KD.chr21.matrix.gz \
    	./test_data/GSE104334_Ctrl.chr21.matrix.gz 50000 chr21 \
        --datatype matrix -p 300000 -o treat_vs_control_ISC \
    	--draw -s 26000000 -e 33000000
    
  • Use raw .hic file

    h1d two CIC ./test_data/GSE104334_Rad21KD.hic \
    	./test_data/GSE104334_Ctrl.hic 50000 chr21 \
    	--datatype rawhic --gt ./test_data/hg19_genome_table.txt \
    	-p 300000 -o treat_vs_control_ISC \
    	--draw -s 26000000 -e 33000000
    
  • Use .cool file

    h1d two CIC ./test_data/GSE104334_Rad21KD.50000.cool \
    	./test_data/GSE104334_Ctrl.50000.cool 50000 chr21 \
    	--datatype cool --gt ./test_data/hg19_genome_table.txt \
    	-p 300000 -o treat_vs_control_ISC \
    	--draw -s 26000000 -e 33000000
    
RTDimport

4. Extract secondary information from metrics (dTAD, stripeTAD, etc.)

4.1 Quick start

# dense matrix
h1d call stripe ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
	50000 chr21 -o testname

# .hic
h1d call stripe ./test_data/GSE104334_Ctrl.hic \
	50000 chr21 -o testname --datatype rawhic --gt ./test_data/hg19_genome_table.txt
	
# .cool
h1d call stripe ./test_data/GSE104334_Ctrl.50000.cool \
	50000 chr21 -o testname --datatype cool --gt ./test_data/hg19_genome_table.txt

The output will be: testname_stripe.csv

4.2 Parameters

h1d call -h
usage: __main__.py call [-h] [-o OUTNAME] [-c CONTROLMATRIX]
                        [--datatype DATATYPE] [--gt GT] [-p PARAMETER]
                        mode matrix resolution chromosome
  • Required parameters:

    • mode, Running mode,,should be one of {dTAD,stripe,stripeTAD, TAD,hubs}

    • data, Path of matrix file or raw .hic file.

    • resolution, resolution (50000, i.e.) of given contact matrix, or choosed resolution for analyzing .hic file.

    • chromosome, selected chromosome to be analyzed.

  • Optional parameters:

    • -o, output name, default: defaultname

    • -c, contact matrix or .hic file of control sample, which is required when using “dTAD” mode.

    • --datatype, type of input data: “matrix” (default) or “rawhic”.

    • --gt, genome table file when using raw .hic data.

    • -p, parameters for particular mode:

mode Which parameter Default
dTAD for DRF 200000-5000000
stripe for 'strong IAS peak' 0.02
stripeTAD for IAS 300000
TAD for IS 300000
Hubs for IF 0.05

!! Please note that “stripe” is different from “stripeTAD”: Stripe is the regions with “stripe” structure, whereas stripe-TAD is asymmetric TAD (may contain many stripes). Thus stripe-TAD is the classfication from all TAD.

4.3 dTAD

h1d provide the function to call dTAD as

h1d call dTAD ./test_data/GSE104334_Rad21KD.chr21.matrix.gz \
	50000 chr21 -c ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
  --datatype matrix -o testname -p 200000-5000000

The output will be testname_leftdTAD.csv and testname_rightdTAD.csv, as:

chr TADstart TADend
chr21 17900000 18250000
chr21 26800000 27700000
... ... ...

4.4 stripe

This function automatically identify all regions with stripe ‘structure’. The key idea is to find the sharp, strong IAS peaks. We used the similar strategy which use insulation score to identify TAD boundaries. For the stripe calling, after extracting the local maximum positions of IAS, only positions with IAS > IASmean is retained. Then, similar to Crane et.al, Nature 2015, we calculate a delta vector of IAS for each bin to extract only strong IAS peaks. To avoid clustered small peaks, the IAS value of a ‘stripe’ position should be higher than any position around 100kb.

h1d call stripe ./test_data/GSE104334_Ctrl.chr21.matrix.gz 50000 chr21 \
	--datatype matrix -o testname

This will output the summit of IAS signal, i.e. the stripes:

chr start end IAS
chr21 15450000 15500000 2.333717
chr21 15900000 15950000 2.561192
chr21 16300000 16350000 1.892257
... ... ... ...

4.5 stripe-TAD

This function simply divide all TAD into “loop”, “left-stripe”, “right-stripe” and “other” TAD:

h1d call stripeTAD ./test_data/GSE104334_Ctrl.chr21.matrix.gz 50000 chr21 \
	--datatype matrix -o testname -p 300000

The output will be testname_stripe.csv, as:

chr TADstart TADend TADtype
chr21 9500000 10250000 loopTAD
chr21 14800000 15600000 leftStripe
chr21 15600000 16850000 otherTAD
... ... ... ...

4.6 Hubs

This function extract chromatin Hubs as described in PMID: 26272203

Comutation of hubs requires rawhic input
h1d call hubs ./test_data/GSE104334_Ctrl.hic 50000 chr21 \
	--datatype rawhic --gt ./test_data/hg19_genome_table.txt -o testname -p 0.05

The output will be testname_hubs.csv in .bed style, as :

chr21 15450000 15750000
chr21 15850000 16000000
... ... ...

4.7 TAD

This function will use Insulation Score to simply call TAD:

h1d call TAD ./test_data/GSE104334_Ctrl.chr21.matrix.gz 50000 chr21 \
	--datatype matrix -o testname -p 300000

5. Obtain the same metrics for muliple samples

5.1 Calculation

h1d multisamples IS ./test_data/multisample1/metadata.txt \
		50000 chr21 -o multisamples_metrics

The output would be multisamples_metrics.csv:

chr start end Control1 Control2 Treat1 Treat2 ...
... ... ... ... ... ... ... ...
chr21 23900000 23950000 0.600162 0.658727 0.734661 0.692226 ...
chr21 23950000 24000000 0.610387 0.674535 0.721764 0.70977 ...
... ... ... ... ... ... ... ...

5.2 Parameters

$ h1d multisamples
usage: h1d multisamples [-h] [--datatype DATATYPE]
                        [--samplelist SAMPLELIST] [--labels LABELS]
                        [-p PARAMETER] [-o OUTNAME] [--gt GT] [--corr]
                        [--heat] [--line] [--anova] [--discrete] [-s START] [-e END]
                        type data resolution chromosome

Required parameters:

  • type, type of 1D metrics could be one of {IS,CI,DI,SS,DLR,PC1,IES,IAS,IF}, as described

  • data, a tsv (Tab-separated) file contain name and paths for all samples (below). Used file could be contact matrix or raw .hic as introduced here

Control1 ./test_data/multisample1/Control1.chr21.matrix.gz
Control1 ./test_data/multisample1/Control2.chr21.matrix.gz
Treat1 ./test_data/multisample1/Treat1.chr21.matrix.gz
Treat2 ./test_data/multisample1/Treat2.chr21.matrix.gz
Treat3 ./test_data/multisample1/Treat3.chr21.matrix.gz
Treat4 ./test_data/multisample1/Treat4.chr21.matrix.gz
  • resolution, resolution (50000, i.e.) of given contact matrix, or choosed resolution for analyzing .hic file.

  • chromosome, selected chromosome to be analyzed.

Optional parameters:

  • --datatype, type of input data: “matrix” (default) or “rawhic”.

  • -p, parameters for one-sample metrics.

  • -o, output name, default: multisamples_metrics

  • --gt, genome table file when using raw .hic data.

5.3 Visualization

  • --corr, Plot correlation for all samples.

  • --heat, Plot raw heatmap for all samples.

  • --line, Plot line chart for all samples.

  • -s, start site of plot when using --corr and --line.

  • -e, end site of plot when using --corr and --line.

5.3.1 Correlation map of all samples

h1d multisamples IS ./test_data/multisample1/metadata.txt 50000 chr21 \
	-o multisamples_metrics --corr
RTDimport

5.3.2 Line chart or multiple samples:

h1d multisamples IS ./test_data/multisample1/metadata.txt 50000 chr21 -o multisamples_metrics \
	--line -s 24500000 -e 34500000
RTDimport

5.3.3 Heatmap or multiple samples:

h1d multisamples IS ./test_data/multisample1/metadata.txt 50000 chr21 -o multisamples_metrics \
	--heat -s 24500000 -e 34500000
RTDimport

5.3.4 Discrete heatmap

Some metrics such as PC1, can not be quantitatively compared, thus we convert it to discrete value to draw a heatmap:

h1d multisamples PC1 ./test_data/multisample2/metadata.txt 50000 chr19 -o multisamples2_metrics \
	--discrete -s 22500000 -e 32500000 -p test_data/multisample2/mm10_geneDensity50000.txt
RTDimport

5.4 Statistically comparison

As we mentioned in the manuscript, most of (except PC1 and DI) 1D metrics could be quantitatively compared. Considering most of them exhibited the unimodal distribution without obvious skew, we use ANOVA-like test to statistically compare multiple Hi-C samples. Each bin and its surrounding bins (default=2) are considered to to run such test. The obtained p-values is then adjusted via Benjamini/Hochberg method, to get the qvalue.

To run this, please specify a genomic region and type:

h1d multisamples IS ./test_data/multisample1/metadata.txt 50000 chr21 -o ptest --anova -s 24500000 -e 34500000

, where the test.txt is described above. The out put will be:

chr start end pvalue qvalue
... ... ... ... ...
chr21 24650000 24700000 1.4207106516019808e-05 5.25394882479223e-05
chr21 24700000 24750000 0.000600635194484068 0.0014715562264859666
chr21 24750000 24800000 0.00011666149851486222 0.0003412784135658656
... ... ... ... ...

6. Obtain various types of 1D metrics for the same sample

6.1 Calculation

# dense matrix
h1d multitypes IS,CI,DI \
	./test_data/GSE104334_Ctrl.chr21.matrix.gz \
	50000 chr21 -p 300000,300000,1000000 \
	--datatype matrix -o multi_types_metric
	
# .hic input
h1d multitypes IS,CI,DI \
	./test_data/GSE104334_Ctrl.hic  \
	50000 chr21 -p 300000,300000,1000000 \
	--datatype rawhic -o multi_types_metric --gt ./test_data/hg19_genome_table.txt 

# .cool input
h1d multitypes IS,CI,DI \
	./test_data/GSE104334_Ctrl.50000.cool  \
	50000 chr21 -p 300000,300000,1000000 \
	--datatype cool -o multi_types_metric --gt ./test_data/hg19_genome_table.txt 

This will output a csv file multi_types_metric.csv like:

chr start end InsulationScore ContrastIndex DirectionalityIndex
... ... ... ... ... ...
chr21 23900000 23950000 0.652077 0.667196 -3.776812
chr21 23950000 24000000 0.636362 0.678699 -3.448456
... ... ... ... ... ...

6.2 Parameters

$ h1d multitypes
usage: h1d multitypes [-h] -p PARAMETER [-c CONTROLMATRIX]
                      [-o OUTNAME] [--datatype DATATYPE] [--gt GT]
                      [-d] [-s START] [-e END]
                      typelist data resolution chromosome
  • Required parameters:

    • typelist the list of types separated by comma.

      • When calculating one-sample metrics, it should be subset of [IS,CI,DI,SS,DLR,PC1,IES,IAS,IF].

      • When calculating two-sample metrics, it should be subset of [ISC,CIC,SSC,deltaDLR,CD,IESC,IASC,IFC,DRF]

    • data, Path of matrix file or raw .hic file

    • resolution, resolution (50000, i.e.) of given contact matrix, or choosed resolution for .hic file

    • chromosome, selected chromosome to be analyzed.

    • -p, list of parameters (comma-separated). Please refer to one-sample or two-sample metrics.

  • Optional parameters:

    • -o, output name, default: multitypes_metrics

    • -c, contact matrix or .hic file of control sample, which is required when calculating two-sample metrics.

    • --datatype, type of input data: “matrix” (default) or “rawhic”.

    • --gt, genome table file when using raw .hic data.

6.3 Visuliazation

-d, decide whether to plot.

-s, start site for plot.

-e, end site for plot.

  • Visulize one-sample metrics (output will be all_onesample.csv and all_onesample.pdf)

h1d multitypes IS,CI,DI,SS,DLR,PC1,IAS,IES,IF \
	./test_data/GSE104334_Ctrl.hic 50000 chr21 \
	-p 300000,300000,1000000,300000,3000000,./test_data/hg19_geneDensity50000.txt,300000,300000,0.05 
	--datatype rawhic --gt ./test_data/hg19_genome_table.txt 
	-o all_onesample -d -s 24500000 -e 34500000
RTDimport
  • Visulize two-sample metrics: (output will be all_twosample.csv and all_twosample.pdf)

h1d multitypes ISC,CIC,SSC,deltaDLR,CD,IASC,IESC,IFC,DRF \
	./test_data/GSE104334_Rad21KD.hic 50000 chr21 \
	-c ./test_data/GSE104334_Ctrl.hic \
	-p 300000,300000,300000,3000000,pearson,300000,300000,0.05,200000-5000000 \ 
	--datatype rawhic --gt ./test_data/hg19_genome_table.txt \
	-o all_twosample -d -s 24500000 -e 34500000
RTDimport

7. Provide basic functions to visualize and handle Hi-C data

$ h1d basic
usage: __main__.py basic [-h] [-o OUTNAME] [-c CONTROLMATRIX]
                         [--datatype DATATYPE] [--gt GT] [--plottype PLOTTYPE]
                         [-s START] [-e END] [--normalize NORMALIZE]
                         mode data resolution chromosome
  • mode, Running mode,,should be one of {plot,dump}

  • data, Path of matrix file or raw .hic file.

  • resolution, resolution (50000, i.e.) of given contact matrix, or choosed resolution for analyzing .hic file.

  • chromosome, selected chromosome to be analyzed.

  • --datatype, type of input data: “matrix” (default) or “rawhic”.

  • --gt, genome table file when using raw .hic data.

  • -o, output name, default: unname

7.1 Plot indicated region

This function provide basic visualization for Hi-C data.

  • --plottype, Type of plot, could be one of {tri,square,tad}

  • -s, start site for plot.

  • -e, end site for plot.

  • -c, File of control sample. If provided, it will plot differential matrix of treat_vs_control.

h1d basic plot ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
	50000 chr21 --datatype matrix -o testplot --plottype square \
	-s 27500000 -e 32500000
RTDimport

If use --plottype tri:

RTDimport

Differential matrix is plotted when the control data is provided:

h1d basic plot ./test_data/GSE104334_Ctrl.chr21.matrix.gz \
	50000 chr21 --datatype matrix -o testplot --plottype square \
	-s 27500000 -e 32500000
RTDimport

7.2 Make contact matrix.

--normalize, Normalize methods {NONE/VC/VC_SQRT/KR}

h1d basic dump ./test_data/GSE104334_Ctrl.hic 50000 chr21 \
	--datatype rawhic -o testdump --gt ./test_data/hg19_genome_table.txt
	--normalize KR

The output will be:

testdump
└── 50000
    └── observed.KR.chr21.matrix.gz

which is dense matrix (zipped) of intra-chromosomal contacts, like:

0 25000 50000 75000 ...
0 0 0 0 0 ...
25000 0 8 3 5 ...
50000 0 3 8 4 ...
75000 0 5 4 0 ...
... ... ... ... ... ...

Dump all chromosomes

  • chromosome , set chromosome to “all” will compute metrics for all chromosomes.

  • data, must be .hic data

  • -maxchr, Maximum index of chromosome (human genome is 22,i.e.). It will compute chromosome 1~maxchr plus chromosome X.

  • -n, Number of processors

h1d basic dump ./test_data/GSE104334_Ctrl.hic 50000 all \
	--gt ./test_data/hg19_genome_table.txt --normalize KR -o justtest \
	--datatype rawhic --maxchr 22 -n 30

Output would be:

justtest
└── 50000
    ├── observed.KR.chr1.matrix.gz
    ├── observed.KR.chr10.matrix.gz
    ├── observed.KR.chr11.matrix.gz
    ├── observed.KR.chr12.matrix.gz
    ├── observed.KR.chr13.matrix.gz
    ├── observed.KR.chr14.matrix.gz
    ├── observed.KR.chr15.matrix.gz
    ├── observed.KR.chr16.matrix.gz
    ├── observed.KR.chr17.matrix.gz
    ├── observed.KR.chr18.matrix.gz
    ├── observed.KR.chr19.matrix.gz
    ├── observed.KR.chr2.matrix.gz
    ├── observed.KR.chr20.matrix.gz
    ├── observed.KR.chr21.matrix.gz
    ├── observed.KR.chr22.matrix.gz
    ├── observed.KR.chr3.matrix.gz
    ├── observed.KR.chr4.matrix.gz
    ├── observed.KR.chr5.matrix.gz
    ├── observed.KR.chr6.matrix.gz
    ├── observed.KR.chr7.matrix.gz
    ├── observed.KR.chr8.matrix.gz
    ├── observed.KR.chr9.matrix.gz
    └── observed.KR.chrX.matrix.gz

7.3 Make gene density file for PC1

h1d basic gd refFlat.hg19.txt 50000 \
		./test_data/hg19_genome_table.txt -o hg19.geneDensity.txt
  • refFlat.hg19.txt is defined by UCSC and should be at least 6 columns as (The first column geneName must not be NA):

geneName name chrom strand txStart txEnd
RCAN1 NM_001034679 chr1 + 327876 339078
KCNE1 NM_001077977 chr1 + 377375 383871
... ... .. ... ... ...
  • 50000 is the resolution for PC1 analysis.

  • ./test_data/hg19_genome_table.txt is genome table file (tab-separated) which described the length of each chromosome for your genome reference:

chr1 248956422
chr2 242193529
... ...
  • -o is the output name

Indices and tables