Releases: etal/cnvkit

Version 0.9.2

26 Feb 06:30

This release contains a new command import-rna to infer coarse-grained copy number from RNA expression data. (#151)

Three new HMM-based segmentation methods are offered: 'hmm', 'hmm-germline', and 'hmm-tumor'. These should be considered experimental and used with caution; the implementations are likely to change in the next release.

The option --male-reference in the commands batch, reference, fix, call, and export (at least) has been renamed to --haploid-x-reference everywhere to reduce user confusion. A shim is in place so --male-reference will continue to work.

Documentation, logging, and some error messages are improved.

Thanks to @chapmanb, @MajoroMask, and others for contributing to this release.

Dependencies

  • 'pandas' version 0.22 is supported.
  • 'pysam' version 0.13.0 is supported.
  • 'hmmlearn' version 0.2 is a run-time requirement to use the new HMM-based segmentation methods. The rest of CNVkit can be run without it. To ensure the right version is installed, install CNVkit with conda as usual, then install hmmlearn with pip within the CNVkit conda environment.
  • Assume and require pip/setuptools for installation. (This is included with stock Python 2.7 and later.)

Scripts

  • New script "skg_convert.py" to convert between BED, GATK interval list, GFF, VCF, and tabular formats using the 'skgenome.tabio' sub-package, with options for simple post-processing.
  • Removed the deprecated script refFlat2bed.py. (Use skg_convert.py instead.)

Commands

access:

  • Drop noncanonical, untargeted contigs/chromosomes by default. This affects analyses run from scratch with batch, too. (#169, #299)

segment:

  • Three new methods can be specified with -m: hmm, hmm-germline, and hmm-tumor.
  • With -m flasso, force a breakpoint at centromeres, as was already done for the default 'cbs' method.

reference:

  • The option --antitargets is no longer required to build a flat reference. Previously, building a flat reference for WGS or TAS required creating an empty file to use as antitargets alongside the target BED.
  • Print a warning if the sample sex inferred from targets does not match that of antitargets. (#281)

scatter:

  • Removed the deprecated, invisible option --background-marker. (Use --antitarget-marker instead.)
  • Trendlines should reflect small CNVs better, while preserving overall smoothing. The implementation now uses the Savitzky-Golay method instead of a Kaiser window, and the smoothing bandwidth is better-tuned. (This can also slightly improve outlier filtering in segment.)

export seg:

  • Add option --enumerate-chroms to replace chromosome or contig names with sequential integers. Previously, this renumbering was always done, following some version of the SEG format. But since most tools don't require the contigs to be sequential integers, and this behavior causes trouble for users, it's now disabled by default. (#282)
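
The renumbering that --enumerate-chroms turns back on can be pictured with a minimal sketch. The helper name `enumerate_chroms` and the sample rows are hypothetical, not CNVkit's code, and the sketch assumes names are numbered from 1 in order of first appearance.

```python
def enumerate_chroms(rows):
    """Replace each chromosome name with a 1-based integer index.

    `rows` is a list of (chromosome, start, end) tuples. Indices are
    assigned in order of first appearance (an assumption for illustration).
    """
    index = {}
    out = []
    for chrom, start, end in rows:
        if chrom not in index:
            index[chrom] = len(index) + 1
        out.append((index[chrom], start, end))
    return out


segments = [("chr1", 100, 500), ("chr1", 500, 900), ("chrX", 50, 700)]
renumbered = enumerate_chroms(segments)
```

Without the option, the original names ("chr1", "chrX") are written unchanged.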

gainloss/genemetrics:

  • Rename gainloss command to genemetrics. A shim is in place so cnvkit.py gainloss will continue to work. (#278)
  • Report segment- and bin-level weight and probes separately. (#107, #278)

Bug fixes

  • autobin: Require -g/--access for WGS (#289)

  • batch: Use the "access" regions for the WGS workflow to choose bin size; these were previously being ignored, so bin sizes were too large, being based on the size of the whole genome, not just sequencing-accessible regions.

  • call: Safely handle bins with zero weight when running call --filter cn. (bcbio/bcbio-nextgen#2112; thanks @chapmanb)

  • coverage, guess_baits.py: Handle input BED files containing >4 columns. (#301)

  • gainloss: Without -s, make 'depth' the weighted mean of bins, not just the first bin's value.

  • segment: Ensure the .cns output file's columns are sorted properly (#291)

  • vcfio: Don't crash if a record has no ALT values (#279)

  • tabio:

    • Recognize BED format with decimal in chromosome name (#293)
    • Improvements to GFF/GTF/GFF3 parsing. The new options are mostly accessible through the Python API and the script 'skg_convert.py'. (#311)
    • In 'read_auto' (and all CNVkit commands that take regions as input), determine the file format first by checking the file extension and verifying the format of the first(-ish) line. Only if that doesn't work, fall back to the original method of testing the first(-ish) line against a brittle series of regular expressions. (#315)
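
The two-stage format detection described for 'read_auto' can be sketched roughly as follows. Everything here is illustrative: the extension table, the per-format checks, the fallback patterns, and the name `sniff_format` are assumptions, not the actual skgenome.tabio implementation.

```python
import os
import re

# Hypothetical extension table; the real one covers more formats.
EXT_TO_FORMAT = {".bed": "bed", ".gff": "gff", ".interval_list": "interval"}

# Hypothetical fallback patterns, tried in order when the extension is no help.
FALLBACK_PATTERNS = [
    ("text", re.compile(r"^\S+:\d+-\d+")),
    ("bed", re.compile(r"^\S+\t\d+\t\d+")),
]


def looks_like(fmt, line):
    """Cheap sanity check that `line` is plausible for format `fmt`."""
    fields = line.rstrip("\n").split("\t")
    if fmt == "bed":
        return len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit()
    if fmt == "gff":
        return len(fields) == 9 and fields[3].isdigit() and fields[4].isdigit()
    if fmt == "interval":
        return (len(fields) == 5 and fields[1].isdigit()
                and fields[2].isdigit() and fields[3] in ("+", "-"))
    return False


def sniff_format(filename, first_line):
    """Trust the extension if the first line agrees; otherwise try regexes."""
    ext = os.path.splitext(filename)[1].lower()
    fmt = EXT_TO_FORMAT.get(ext)
    if fmt is not None and looks_like(fmt, first_line):
        return fmt
    for name, pattern in FALLBACK_PATTERNS:
        if pattern.match(first_line):
            return name
    raise ValueError("unrecognized region format")
```

A file named "odd.gff" whose first line is really 3-column BED fails the GFF check and is picked up by the fallback patterns instead.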

Python API

  • cnvlib.write: Newly available at the top level to write tabular files (like .cnr and .cns), symmetric with 'cnvlib.read()'. The 'cnvlib.tabio' alias to 'skgenome.tabio' has been removed; to read and write formats other than TSV-with-header ('tab'), import and use 'skgenome.tabio' directly.
  • CopyNumArray.squash_genes: remove deprecated keyword argument 'squash_background'. Use 'squash_antitarget' instead.
  • segmetrics: Move the functions supporting this command from 'cnvlib.command' to a new module 'cnvlib.segmetrics'.

Version 0.9.1

09 Nov 20:18

Highlights: Useful enhancements and changes to plotting and segmentation, and a new script for single-exon CNV testing.

Plus, bug fixes and usability improvements to avoid unexpected errors. (#250, #255, #262, etc.)

Dependencies

  • Compatible with the most recent pandas version 0.21.0 (#273, #274; thanks @chapmanb)
  • R dependencies were reduced to simplify installation.

Scripts

  • Renamed "cnn_*.py" to "cnv_*.py".
  • New script "cnv_ztest.py" to detect single-bin (e.g. single exon) deep deletions and high-level amplifications.
  • In "cnv_updater.py", rename "Background" (i.e. off-target) bins to "Antitarget", in addition to adding a "depth" column if it's missing.

Commands

autobin:

  • Raise the maximum target/antitarget bin sizes to 50kb/1Mb.

fix:

  • Allow specifying sample_id via -i/--sample-id, in case the input coverage filenames do not have the expected form "sample_id.targetcoverage.cnn" and "sample_id.antitargetcoverage.cnn". (#269; thanks @chapmanb)

segment:

  • Process each chromosome arm separately (with 'cbs' and 'haar', but not 'flasso'). Centromere locations are guessed from the largest gap between sequencing-accessible regions, and are not necessarily the true locations, although they do match fairly well on the human genome.
  • Logging of dropped bins is streamlined somewhat.
  • New method -m none to only calculate arm-level segment means (for testing and experimentation).
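
The centromere guess mentioned above (largest gap between sequencing-accessible regions) can be sketched for a single chromosome. The function name and the toy coordinates are hypothetical; the sketch assumes the regions are sorted, non-overlapping (start, end) pairs.

```python
def guess_centromere(accessible):
    """Return (gap_start, gap_end) for the largest gap between regions.

    `accessible` is a sorted list of non-overlapping (start, end) pairs on
    one chromosome. Returns None if there are fewer than two regions.
    """
    best = None
    for (_, prev_end), (next_start, _) in zip(accessible, accessible[1:]):
        if best is None or next_start - prev_end > best[1] - best[0]:
            best = (prev_end, next_start)
    return best


regions = [(0, 1000), (1200, 5000), (9000, 12000), (12100, 15000)]
centromere = guess_centromere(regions)
```

Bins on either side of the returned gap would then be segmented as separate arms.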

scatter:

  • Highlight non-neutral segments from .call.cns. If segments have the columns 'cn' and potentially also 'cn1' and 'cn2' (as added by the call command), use those fields to display copy number alterations, LOH and allelic imbalance with colorized segments (orange by default), and use gray for neutral segments. If a VCF is also given, the same is done for SNVs in the lower panel. Otherwise, all segments are colorized as before. (#18, #157)

  • New option --by-bins to display x-axis positions by sequential bin number on each chromosome, rather than genomic coordinates. This makes the plots much more useful with targeted amplicon sequencing data, or very small gene panels. (#63)

  • Trend line (--trend) now accounts for bin weights, which generally results in a better fit.

  • Improved interaction of -c and -g options:

    • Only apply the window margin (-w) if -g is used alone, or -c specifies a small chromosomal region with no genes.
    • Allow an empty gene list (-g '' or -g ',') to prevent highlighting and labeling of any genes / small non-genic "Selection" in the -c region.
    • If any gene in -g is not fully within the region specified by -c, name that gene and its coordinates in the error message.
    • If the -c region has size <=0, show a specific error message.
    • Handle NaN log2 values when calculating y-axis limits.

heatmap:

  • Incorporate the --by-bins argument to match scatter. (#63)
  • Warn if selected region contains no data for a sample. This helps troubleshoot if a chromosome name was mis-specified on the command line. (#268)

export seg:

  • Change column headers to match DNAcopy output. The column headers generally don't matter in the SEG format, but the DNAcopy dataframe is considered the canonical form.

Python API

  • cnvlib.do_segment -- new keyword argument min_weight to drop bins with 'weight' below the specified value. If not used, then only bins with weight 0 will be dropped. This feature is not recommended for normal usage and is not available on the command line.
  • cnvlib.do_scatter -- Remove deprecated keyword argument 'background_marker' in favor of 'antitarget_marker', corresponding to scatter options deprecated in v0.9.0.
  • cnvlib.cnary.CopyNumArray: Add method 'smoothed', which calculates the trendline displayed by the scatter command.
  • skgenome.tabio: Add read support for samtools 'dict' format, which resembles the plain-text SAM header and can contain chromosome names and sizes.
  • skgenome.gary.GenomicArray: Add magic methods __bool__ (Py3) and __nonzero__ (Py2) to ensure an empty GenomicArray, i.e. 0 rows, is treated as false-ish on both Python 2.7 and 3.x.
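
The truthiness change to GenomicArray follows a common cross-version pattern, sketched below with a hypothetical stand-in class (`Table` is not the real GenomicArray). Python 3 calls `__bool__` for truth tests, while Python 2 looks for `__nonzero__`, so the same function is bound to both names.

```python
class Table(object):
    """Hypothetical row container used only to illustrate the pattern."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __bool__(self):
        # Python 3 truth test: an instance with 0 rows is false-ish.
        return len(self.rows) > 0

    # Python 2 looks up this name instead of __bool__.
    __nonzero__ = __bool__


empty = Table([])
full = Table([("chr1", 0, 100)])
```

Without either method (and without `__len__`), every instance would be treated as true, including an empty one.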

Version 0.9.0

17 Aug 18:45

In addition to bug fixes, documentation updates, and usability improvements, this release includes some larger changes:

  • The off-target bins in .cnn and .cnr files are now assigned the label "Antitarget" instead of "Background" in the "gene" column.

    The label "Background" in existing files will still be handled the same way, but new output files generated with CNVkit 0.9.0 and later will use the "Antitarget" label -- so, earlier versions of CNVkit may have problems with files produced by CNVkit 0.9.0. Some command line options and API keyword arguments similarly replace "background" with "antitarget", with shims in place for compatibility with existing scripts. (#171)

  • The sub-packages 'genome' and 'tabio' are now in a separate top-level package 'skgenome', still included in the CNVkit distribution. (See "Python API" below.)

    This does not affect the command-line usage of CNVkit, but clears the way to extract a scikit-genome package that can be installed and used separately from CNVkit for computing with genomic intervals.

Documentation

  • Link to an example VCF file that contains matched tumor and normal samples and will work nicely with CNVkit.
  • Describe the breaks command's output columns. (#220)
  • Show a Python code example customizing a plot with matplotlib.pyplot. (#196)

Dependencies

  • pysam: Raise minimum to 0.10; support new version 0.11.2.1 (#218; thanks @chapmanb)
  • pandas: Support new version 0.20.1 (#215)
  • numpy: Support new version 1.13 (#235, #238)

Commands

batch:

  • Log the CNVkit version number at the start of the run.
  • Print a message at the end if no tumor/test samples were specified. (#214)
  • Clarify error messages for bad option combinations. (#216)
  • Removed the deprecated, suppressed/invisible option --split. It was a shim in the 0.8 series to support old scripts.

reference:

  • Ensure the inferred chromosomal sex matches between the targets and antitargets for the same sample. If the inferences do not match, prefer antitargets. (#234, #237)

fix:

  • Warn & don't reweight bins if most antitargets have no/low coverage. This avoids a variety of surprising downstream problems when the input was specified as hybrid capture (the default), but is actually from targeted amplicon sequencing, or otherwise has no reads mapped to most off-target bins.

segment:

  • Log the segmentation method and p-value/q-value threshold.

call:

  • Add option --center-at, for re-centering log2 values at a user-specified neutral value.
  • The option --center can be used without an argument, in which case it uses the default centering method 'median'.

diagram:

  • New option --title to add a custom title to the top of the generated figure. (#239; thanks @micknudsen)

export vcf:

  • When given a .cnr file corresponding to the usual segmented input file (.cns), emit the CIPOS and CIEND tags in the generated VCF. These indicate the "fuzzy" coordinates of segment breakpoints. Here, the ranges are simply the widths of the underlying bins adjacent to each segment breakpoint. These tags can help meta-methods aggregate/harmonize CNVkit's calls with those of other structural variant callers. (#72)

import-picard:

  • Don't accept directory as an argument (was deprecated).
  • Be a little more flexible in filenames accepted: instead of requiring input files to be named *.targetcoverage.??? or *.antitargetcoverage.???, strip the full suffix and default to 'targetcoverage.cnn' output suffix, or 'antitargetcoverage.cnn' if input filename contains 'antitarget'. Works the same for filenames following the earlier convention, but now is pretty safe for amplicon targets with arbitrary filenames, and behavior is generally less surprising.

Bug fixes

  • antitarget: Don't crash if -g/--access is not given (#207)
  • batch: Don't crash in 'wgs' mode when given just targets (-t) without a FASTA reference genome sequence (-f)
  • call --filter ampdel: Drop segments with copy number (cn field) between 0 and 5, exclusive, as the documentation indicates. Previously, it was just merging adjacent segments with copy number 1--4, but not dropping them. (#222)
  • export cdt: Match the CDT spec. Fix a regression in which columns could be swapped/misaligned versus the header. Add a dummy "EWEIGHT" row to ensure Java TreeView starts reading data from the correct line in the file.
  • export theta: Don't crash on bins where reference is NaN. (#168)
  • metrics, descriptives: Handle degenerate/trivial cases consistently. (#202)
  • segment: Handle sample names that are integers with leading zeros. (#213)
  • sex: Don't crash if chromosomes X and Y are both missing. (#236)
  • VCF parsing (call, scatter, segment):
    • Safely handle small or empty VCF files that previously could trigger a crash during BAF calculation. Now, with an empty VCF an all-blank "baf" will be emitted. (#218, #224; thanks @chapmanb)
    • Improve handling of Mutect2 VCF files, somewhat. Mutect2 VCFs are still not recommended as input to CNVkit; try FreeBayes or GATK HaplotypeCaller instead. (#195)
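
The corrected call --filter ampdel rule in the list above (drop segments whose 'cn' is strictly between 0 and 5) amounts to a simple predicate. This is a minimal sketch with hypothetical segment dicts, showing only the dropping step and not any merging of adjacent segments.

```python
def filter_ampdel(segments):
    """Keep only deep deletions (cn == 0) and high amplifications (cn >= 5)."""
    return [seg for seg in segments if seg["cn"] == 0 or seg["cn"] >= 5]


calls = [
    {"chromosome": "chr1", "start": 0, "end": 100, "cn": 2},
    {"chromosome": "chr1", "start": 100, "end": 200, "cn": 0},
    {"chromosome": "chr2", "start": 0, "end": 300, "cn": 4},
    {"chromosome": "chr2", "start": 300, "end": 400, "cn": 7},
]
kept = filter_ampdel(calls)
```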

Python API

Moved sub-packages 'genome' and 'tabio' to separate top-level package 'skgenome'
(#201). The top-level cnvlib API is mostly the same otherwise, but supporting
modules were refactored to decouple skgenome from cnvlib and remove
redundancies. In particular:

  • Split module cnvlib.core into skgenome.tabio and cnvlib.cmdutil
  • Remove GenomicArray static method row2label in favor of functions to_label and from_label in new module skgenome.rangelabel.
  • The SEG writer in 'tabio' now replaces chromosome names with 1-based integer indices, per SEG spec/convention. The export seg command now uses this writer directly.

Scripts

  • Remove the script coverage_bin_size.py, previously deprecated in favor of the autobin command.
  • Add skg_convert.py to convert between tabular formats (including BED and UCSC RefFlat).
  • Deprecate refFlat2bed.py in favor of skg_convert.py.
  • Add cnn_annotate.py to replace the "gene" field for each bin in a .cnn or .cnr file, given a gene annotation database like refFlat.txt. The need for this comes up occasionally when users notice at the end of an analysis that vendor-annotated targets are not the desired gene names.

Version 0.8.5

04 Mar 23:04

New 'autobin' command, replacing the script coverage_bin_size.py. Fix some bugs and usability issues. Unit tests improved, especially for the 'cnvlib.genome' sub-package.

Dependencies

  • Pandas 0.18.1 is once again supported. Previously the minimum version was 0.19.1. (bcbio/bcbio-nextgen#1836)
  • Pysam minimum version is still 0.9.1.4, but slightly older versions in the 0.9 series may still work too. (#192)

Commands

autobin:

  • New command, replacing and extending the script coverage_bin_size.py. The script is still included (and shares most of the same code), but is considered deprecated and will be removed in the 0.9.0 release. (#170)
  • In 'amplicon' and 'hybrid' modes, ensure the regions sampled for coverage are the same in every run by setting the random seed. (#191)

antitarget, autobin, batch:

  • Fix an issue in GenomicArray.subtract() that caused some of the expected output regions to be missing. In cases where this caused an entire chromosome to be lost, the coverage_bin_size.py script and autobin and batch commands in hybrid mode would crash. (bcbio/bcbio-nextgen#1799)

batch, diagram:

  • Fix creation of chromosomal diagrams with --diagram and the diagram command. (#190)

export:

  • In export seg, use 1-based indexing in the SEG output. (#197)
  • Fix export cdt format; it was generating Java TreeView (jtv) earlier.

Version 0.8.4

16 Feb 00:12

This minor release focuses on improving usability and fixing some bugs.

Documentation is updated (thanks @kyleabeauchamp for #186).

Dependencies

  • Raise minimum pandas version from 0.18.1 to 0.19.0
  • Raise minimum matplotlib version to 1.3.1

Commands

fix, metrics:

  • Set PRNG seed to ensure reproducible results. The pipeline is now fully repeatable with identical results if run in serial, i.e. without -p.

fix, reference:

  • Reduce boundary effects (expected log2 and spread values of 0 in some bins) when smoothing biases on very small gene panels, e.g. targeted amplicon sequencing of <5 genes, <100 bins. (#181)

fix:

  • Don't complain about mismatched sample IDs if antitargets are blank. This allows reusing a blank "MT" file in a shell loop for WGS and amplicon data.

reference:

  • Make antitargets (antitarget.bed or *.antitargetcoverage.cnn) an optional argument. Previously this argument was required, so processing WGS or amplicon data, which has no off-target regions or reads, required the user to create and provide a blank BED file or appropriately named, empty .cnn files. (#183)

segment:

  • Don't log "Dropped 0 low-coverage bins". Only log when it actually drops bins.

diagram, heatmap:

  • Add option --no-shift-xy. Shifting X and Y according to reference and sample sex was done in diagram, but not heatmap. Now it's optional in both.

heatmap:

  • Add a legend of log2 ratio colors to the plot. (#36)
  • Add options -x/--sample-sex and -y/--male-reference. (#172)

gender/sex:

  • Rename 'gender' command to 'sex', with shim for backward compatibility. (#182)
  • In other commands, the -g/--gender argument is renamed to -x/--sample-sex, also with a compatibility shim. Argument values x and y are accepted in addition to f/female and m/male, respectively.

import-picard:

  • Deprecate searching a directory tree for files. It was a vestige of early lab work, and makes a shaky assumption about Picard CalculateHsMetrics --PER_TARGET_COVERAGE output filenames.

API

  • The do_* function implementations moved to their named modules. The do_* functions can still be called or imported from the cnvlib and cnvlib.commands modules.
  • All parsing and serialization of "chr:start-end" genomic region labels is consolidated under a new module, cnvlib.genome.rangelabel. These functions are used in tabio.textcoord, GenomicArray.labels(), and elsewhere to ensure consistent behavior.

Internal

  • cnvlib.genome: Handle nested bins correctly in the merge, flatten, and intersect modules, functions and GenomicArray methods. Verified with thorough unit tests.
  • VCF: If the paired normal sample's genotypes are all 0/0 or missing, fall back to --zygosity-freq (inference from b-allele frequency) rather than marking all variants as somatic. Then infer and drop additional somatic SNVs based on genotype after parsing, and only if that wouldn't drop all records. This allows CNVkit to safely distinguish somatic vs. germline in VCFs from Mutect2, though Mutect2 is still not recommended. (#184)

Version 0.8.3

18 Jan 02:02

Bug fixes and a few usability improvements. Notably, for the whole-genome sequencing workflow (batch -m wgs), bin size is now inferred from a sample's genome-wide coverage depth instead of using a fixed value, which should yield better results by default.

Dependencies

  • scipy: Raise minimum version to 0.15 (for the function scipy.stats.median_test)

New scripts

  • coverage_bin_size.py: Quickly estimate on- and off-target read depths to suggest reasonable bin sizes to use with the target and antitarget or batch commands. (#170)
  • guess_baits.py: In case the baited regions for a target capture panel are not known, use sample BAM files from sequencing with that panel to infer the likely captured regions. Works either guided, given a list of potential targets (e.g. all exons in a genome), or unguided, scanning all sequencing-accessible bases in the genome to find areas with elevated coverage.

Both scripts are preliminary and may be removed in a future release.

Global changes

  • Infer read lengths automatically from the given sample BAM files where needed (coverage and batch). Remove the hard-coded parameter cnvlib.params.READ_LEN. (#74)
  • Handle VCFs generated by LoFreq. This program does not emit sample genotypes, but locus depths and allele frequencies can be found in the INFO column instead -- unusual but technically within the VCF spec. (#173)

Commands

batch, coverage, segment:

  • The option -p/--processes can now be used without an argument to specify parallelizing across all available CPUs. The now-optional argument value is the maximum number of CPUs to use; the special value -p 0 was previously used to specify all CPUs (this still works).
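
An option whose value is optional, as described for -p/--processes, is commonly built with argparse's `nargs="?"`. The sketch below is an assumption about how such a flag could be wired, not a copy of CNVkit's parser; in particular, the default of 1 when the flag is omitted and the helper name `worker_count` are illustrative.

```python
import argparse
import multiprocessing

parser = argparse.ArgumentParser()
# nargs="?" makes the value optional: a bare "-p" stores `const`,
# and omitting the flag entirely stores `default`.
parser.add_argument("-p", "--processes", nargs="?", type=int,
                    const=0, default=1)


def worker_count(argv):
    """Translate the parsed option into a number of worker processes."""
    args = parser.parse_args(argv)
    if args.processes < 1:
        # Both a bare "-p" and the older "-p 0" mean "all available CPUs".
        return multiprocessing.cpu_count()
    return args.processes
```

With this wiring, `worker_count(["-p", "4"])` gives 4, while `worker_count(["-p"])` and `worker_count(["-p", "0"])` both give the machine's CPU count.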

batch:

  • Automatically estimate a reasonable average bin size in the whole-genome workflow, -m wgs, using a fast estimate of a given normal/control sample's genome-wide average coverage depth. (If multiple normals are given, the median-sized sample is used for this calculation.) This allows CNVkit to handle low-coverage/low-pass WGS data better by default. (#170)

coverage:

  • With --count, count all reads that overlap a region, but trim any portions of each read aligned outside the region from the number of bases counted. The result should now be closer to that without --count.
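
The trimming rule for --count can be illustrated with a small sketch. It assumes half-open (start, end) coordinates and reads given as plain tuples; the function name `region_depth` is hypothetical and the real command works on BAM alignments rather than tuples.

```python
def region_depth(reads, region_start, region_end):
    """Mean depth over a region, counting only read bases inside it."""
    bases = 0
    for read_start, read_end in reads:
        # Clip the read to the region; non-overlapping reads give <= 0.
        overlap = min(read_end, region_end) - max(read_start, region_start)
        if overlap > 0:
            bases += overlap
    return bases / float(region_end - region_start)


reads = [(90, 140), (100, 150), (180, 230), (300, 350)]
depth = region_depth(reads, 100, 200)
```

Here the first and third reads contribute only their 40 and 20 in-region bases, and the last read contributes nothing.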

scatter:

  • In chromosome-level plots, the displayed x-axis range now matches the specified region (via -c or -g + -w) exactly. Previously, the displayed range depended on the bin locations. (#180)

Bug fixes

  • antitarget: Handle empty off-target regions safely. (bcbio/bcbio-nextgen#1696)
  • export theta: Rename argument --min-depth to --min-variant-depth, matching the equivalent argument in other commands. (#178; thanks @myronpeto)
  • scatter: Warn, don't crash, if a region in --region-list covers no bins. (#174; thanks @gabeng)

API changes

  • New module cnvlib.samutil for convenience functions on BAM files, using pysam.
  • New module cnvlib.autobin supporting the script coverage_bin_size.py. (#170)
  • Removed sub-package cnvlib.ngfrills, moving most functionality to samutil and tabio.
  • genome.GenomicArray: New method total_range_size, similar to pybedtools total_coverage()

Version 0.8.2

14 Dec 22:00

This release covers a number of internal changes to improve the stability and consistency of CNVkit, as well as new and improved command options to make more features available from the command line.

Due to a slight change in the binning procedure (see target and antitarget below), newly generated target and antitarget BED files, or a reference generated with batch, may not use the same bin boundaries as earlier versions. CNVkit will check these files for consistency and alert you if your BED or .cnn files do not match because of this change, e.g. running batch from scratch with the same panel but with two different CNVkit versions. If you want to update CNVkit mid-project, either keep using the same reference.cnn file as before for all new samples (as always), or regenerate all your *.targetcoverage.cnn and *.antitargetcoverage.cnn files to build a new reference.

Dependencies

  • pyvcf: No longer needed. Instead, parse VCFs with pysam, which is noticeably
    faster and better able to handle newer VCF and gVCF features. (#159)
  • pysam: Raise minimum version to 0.9.1.4.

Global changes

  • When extracting a sample ID from a filename, instead of trimming everything after the first '.' character, only drop known or single-part extensions. For example, "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" will now resolve to the sample ID "Case1.exome.tumor" instead of "Case1". Output files will be named like "Case1.exome.tumor.cnr" instead of "Case1.cnr", avoiding potential naming conflicts in the batch command when processing multiple samples. (#48)
  • Always sort regions by genomic coordinates after reading a file. This doesn't modify the input file in-place, but ensures the output files are always sorted the same way.
  • Gender detection is more robust. It now uses Mood's median test instead of the Mann-Whitney rank test. As a fallback for edge cases, e.g. only one segment per chromosome, it compares difference of weighted medians in autosomes versus sex chromosomes.
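
The new sample ID rule in the first item above can be approximated as follows. The list of known multi-part extensions is a hypothetical, shortened example, and `sample_id` is an illustrative name rather than CNVkit's own function.

```python
import os

# Illustrative subset of multi-part extensions to strip as a unit.
KNOWN_MULTIPART = (".vcf.gz", ".targetcoverage.cnn", ".antitargetcoverage.cnn")


def sample_id(path):
    """Derive a sample ID by dropping only a known or single-part extension."""
    name = os.path.basename(path)
    for ext in KNOWN_MULTIPART:
        if name.endswith(ext):
            return name[:-len(ext)]
    # Otherwise drop just the last ".xyz" part, keeping earlier dots.
    return os.path.splitext(name)[0]
```

Both "Case1.exome.tumor.bam" and "Case1.exome.tumor.vcf.gz" resolve to "Case1.exome.tumor" under this rule.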

VCF parsing:

  • Improve handling of VCFs from Mutect2 (#122, #153) and bcftools (#146).
  • Don't reject records where FILTER is 'PASS' or '.'.
  • VCF options are now consistent across the commands that can use them (call, scatter, segment, export theta and export nexus-ogt).
  • New VCF option -z/--zygosity-freq to override VCF genotype calls. (#153, #132)

Commands

target, antitarget:

  • Divide bins evenly, using the same internal mechanism (the new GenomicArray.subdivide() method). Previously, subdivided regions were not always equal-sized as they should have been. Now, the coordinates of newly generated targets from a baits BED file may be a little different than before.
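
Even subdivision of a region can be sketched as below. This is not the GenomicArray.subdivide() implementation; it is a minimal illustration, with a hypothetical function signature, of choosing a bin count from the target average size and then placing edges so bin sizes differ by at most 1 bp.

```python
def subdivide(start, end, avg_size):
    """Split [start, end) into near-equal bins of roughly `avg_size` bp."""
    length = end - start
    n_bins = max(1, int(round(length / float(avg_size))))
    # Integer edge positions spread evenly across the region.
    edges = [start + (length * i) // n_bins for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


bins = subdivide(1000, 2000, 300)
```

A 1000 bp region with a 300 bp target size is split into three bins of 333, 333, and 334 bp rather than three 300 bp bins plus a 100 bp remainder.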

target:

  • Drop zero-width bins (#167).
  • Improve assignment of gene names to targets in WGS datasets. (#164)
  • Accept any supported region format for --annotate, including BED, interval list and GFF, in addition to the already supported UCSC refFlat. The format is detected automatically. (#163)
  • Raise an error if the given annotations file (refFlat or equivalent) and the given baited/targeted intervals do not have any overlapping chromosomes.

antitarget:

  • Set the default average bin size to 150kb. Previously, the CLI default was 200kb, but the API default was 100kb; experience shows 150kb works well.

access:

  • Avoid a possible error when more than 1000 small regions are excluded from a single sequencing-accessible region. (#150)

coverage:

  • Fix a unicode vs. bytes incompatibility on Python 3. (#147)
  • Fix a crash if the input BED has more than 4 columns.

reference:

  • Add -g/--gender option to declare the chromosomal sex of the input sample(s) (same for all), instead of detecting/guessing for each sample. (#161)
  • Ensure printed table of bad bins is a reasonable width. (#140)

segment:

  • With a VCF (-v), don't output 'cn1' and 'cn2' columns; calculate the 'baf' column the same as in call. (#148)
  • Improve memory efficiency somewhat when using a VCF. (#162)
  • Fix possible 1-base overlap of output segments when using the cbs or flasso methods. Specifically, the start positions were erroneously all shifted 1 base to the left before. (#158)

scatter, heatmap:

  • Improve rendering of genomes much smaller than the human genome, e.g. yeast, by scaling telomere padding to the total genome size. The blank space at chromosome boundaries was set to a fixed number of basepairs, but is now calculated as 0.3% of the whole genome size (sum of chromosome lengths) -- which works out the same for the human genome. (#155)
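
As a worked example of the 0.3% rule, the padding can be computed from the chromosome lengths alone. The function name and the toy chromosome sizes are hypothetical.

```python
def chrom_padding(chrom_sizes):
    """Blank space per chromosome boundary: 0.3% of total genome size."""
    total = sum(chrom_sizes.values())
    # Integer arithmetic for 0.3% avoids floating-point rounding surprises.
    return total * 3 // 1000


toy_genome = {"I": 400000, "II": 600000}
pad = chrom_padding(toy_genome)
```

For this 1 Mb toy genome the padding is 3000 bp, so it shrinks or grows with the genome instead of staying fixed.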

scatter:

  • Add option --segment-color. Now you can choose 'red' if you like.

metrics:

  • Input -s/--segments is now optional. If not given, compare bin log2 values to chromosome medians instead of segment means.

import-theta, export theta:

  • Drop sex chromosomes, since THetA2 doesn't handle them well. (#103, #153)

API

tabio:

  • Read new formats: GFF (simply); UCSC genePred refFlat; sub-formats bed3, bed4
  • Detect more formats with tabio.read_auto: BED, interval list, text coordinates (chr:start-end), refFlat, GFF, TSV with column names.
  • Remove module ngfrills.regions, no longer needed.

GenomicArray:

  • Moved to new sub-package 'genome'
  • Rename method select to filter
  • Rename method match_to_bins to into_ranges and generalize.
  • New methods flatten, merge, resize_ranges, subdivide, subtract

In general, the 'genome' functionality can be reached by using the tabio sub-package to load a GenomicArray instance and use its methods directly:

from cnvlib import tabio
regions = tabio.read_auto(filename)
# Generate 500bp flanking regions
flanks = regions.resize_ranges(500).merge().subtract(regions)

Version 0.8.1

11 Oct 18:58

This is primarily a bugfix release. The documentation is also improved, particularly covering the cnvlib API.

API:

  • For convenience in scripting, the relevant functions for running each CLI command (cnvlib.commands.do_*) are exported to the top level. For example: import cnvlib; cnvlib.do_batch(...)

Bug fixes:

  • access: Avoid a type-validation error on Python 3. (#141)
  • batch: Parallel processing now selects an appropriate number of workers for each step of the pipeline, reducing CPU contention when processing multiple samples in parallel. (#138)
  • call: Apply the ci and sem filters before calculating b-allele frequencies and absolute copy number, as these filters can alter the final calls.
  • reference: Safely handle an edge case in detecting gender from sample coverage depths when all bins have identical coverage depth, e.g. no coverage. (#144)
  • segment: Fix handling and segmentation of SNV allele frequencies from a VCF. Ensure output column ordering is correct. Avoid a crash that could occur when SNV segmentation produces a segment that does not cover any coverage bins. (bcbio/bcbio-nextgen#1590)
  • cnvlib.tabio: Improve handling of empty files, including VCFs with no samples and/or no locus records. If records and samples are present but genotypes are missing or undetectable, scatter, call and export would previously reject all records when filtering for SNPs, but will now accept all records instead.

Version 0.8

13 Sep 22:48

This is a larger release and the first update since our publication.

CNVkit now runs under Python 3 as well as 2.7. (#3, #101; thanks @mpschr)

File format changes:

  • New "depth" column in .cnn, .cnr, .cns
  • In .cns, "weight" is the sum, not mean, of bin-level weights within the segment

New script cnn_updater.py can be used to add the "depth" column to existing .cnn, .cnr and .cns files. However, most CNVkit commands should still work with pre-v0.8 files without using this script first. For best results, rebuild the .cnr and .cns for an ongoing study using the existing targetcoverage, antitargetcoverage and reference .cnn files.

Algorithmic changes:

  • reference, gender, call, diagram, export: Gender, or chromosomal sex, is now inferred with a statistical test instead of a fixed threshold, significantly improving the inferences on noisy or aneuploid samples. (#116)
  • reference, fix, call: Center log2 values by median of chromosome medians, by default. (#114)
  • reference, metrics, segmetrics: Improve the calculation of biweight location and biweight midvariance (now in descriptives.py).
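
Centering by the median of chromosome medians, as listed above, can be sketched with the standard library. The function name and input layout are assumptions, and the sketch ignores any special handling of sex chromosomes that the real commands may apply.

```python
from statistics import median


def center_by_chrom_medians(bins):
    """Shift log2 values so the median of per-chromosome medians is 0.

    `bins` is a list of (chromosome, log2) pairs.
    """
    by_chrom = {}
    for chrom, log2 in bins:
        by_chrom.setdefault(chrom, []).append(log2)
    shift = median(median(vals) for vals in by_chrom.values())
    return [(chrom, log2 - shift) for chrom, log2 in bins]


bins = [("chr1", 0.1), ("chr1", 0.3), ("chr1", 0.2),
        ("chr2", 0.5), ("chr2", 0.7), ("chr3", -0.2)]
centered = center_by_chrom_medians(bins)
```

Each chromosome contributes one value to the final median, so a chromosome with many bins does not dominate the centering.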

These deprecated components (since 0.7.x) have been removed:

  • Commands rescale and loh -- use call and scatter, respectively, instead
  • Some options in export bed and export theta -- use call first instead
  • Script genome2access.py -- use cnvkit.py access instead

Updated commands:

batch:

  • New option --method, with choices "hybrid" (default), "wgs", "amplicon", to simplify/streamline usage with whole-genome or amplicon sequencing protocols. See documentation for details; in short, "wgs" and "amplicon" do not use antitargets or the edge/density bias correction; "wgs" by default uses the sequencing-accessible genome as the targets, and uses a more stringent significance threshold for segmentation.
  • Hide/deprecate --split option; it's always on now. To ensure bin coordinates do not change between batch runs (they generally won't anyway), use the -r/--reference option instead of specifying -t and -a in batch.
  • Add --drop-low-coverage option, which is passed to segment internally.
  • The -p/--processes option is also passed to coverage and segment internally (see below).

antitarget:

  • Increase the default average bin size from 100kb to 200kb.

coverage:

  • Parallelize coverage calculation over BED rows. The number of threads can be specified with the -p option. (#121; thanks @brentp)

segment:

  • Parallelize CBS and Haar segmentation methods across chromosomes. (#123, #125; thanks @brentp)

call:

  • New --filter option, with choices 'cn', 'ampdel', 'ci', 'sem' implemented.
  • With VCF b-allele frequencies (-v, 'baf'), always calculate the allele-specific integer copy numbers 'cn1' and 'cn2' so that 'cn1' is the larger one. BAF mirror direction stays majority-rules. (#105; thanks @mpschr)
  • If b-allele frequencies are used and total copy number is zero, report allelic copy numbers as 0, not NaN.
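The two b-allele rules above (cn1 is the larger allele; zero total copy number gives 0, not NaN) can be sketched like this. The rounding scheme and the function name are assumptions for illustration and may differ from what call actually does.

```python
def allelic_copy_numbers(total_cn, baf):
    """Split an integer total copy number into (cn1, cn2) with cn1 >= cn2.

    `baf` is the segment's b-allele frequency, between 0 and 1.
    """
    if total_cn == 0:
        # Nothing to split: report 0 for both alleles rather than NaN
        return 0, 0
    # Mirror the BAF so it describes the more abundant allele
    major_fraction = max(baf, 1.0 - baf)
    rounded = int(round(total_cn * major_fraction))
    # Guard against rounding ties so cn1 is never the smaller value
    cn1 = max(rounded, total_cn - rounded)
    cn2 = total_cn - cn1
    return cn1, cn2
```

For example, a segment with total copy number 3 and BAF 0.33 yields (2, 1), and a segment with total copy number 0 yields (0, 0) whatever its BAF.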

scatter:

  • Add --title option.
  • Allow selecting & labeling gene(s) w/ only segments as input.

heatmap, scatter:

  • Allow saving plots in any image file format supported by matplotlib, not just PDF. The file format is determined by the output filename's extension, e.g. 'png' saves in PNG format -- making it easier to integrate CNVkit plots with HTML reports. (#120; thanks @chapmanb)

diagram:

  • Add -g/--gender option to specify sample's known gender.

gainloss:

  • Make output tables more consistent across options. Show individual gene names (rather than all genes grouped within a segment in 1 row); don't show rows with no gene name; report the segment probe count instead of number of probes within the gene; show any extra columns present in the input .cns file. (#107, #108; thanks @mpschr)

gender:

  • Show column headers and Y-chromosome log2 values in the output table.

segmetrics:

  • Add stats options for mean, median, mode
  • Add MSE, SEM stats as options

metrics, segmetrics:

  • Add --drop-low-coverage option (like in segment and gainloss)

Internals:

  • New sub-package tabio: a more robust I/O framework unifying support for tabular formats, including CNVkit's .cnn/.cnr/.cns, BED, SEG, VCF, GATK/Picard interval list, and text coordinates (chr:start-end). Base class GenomicArray and its derived classes CopyNumArray and VariantArray do not implement their own I/O, but rather are instantiated via tabio. The "import-" commands use this as well.
  • Removed rary.RegionArray; all functionality is now in tabio and GenomicArray.
  • New module "descriptives.py" implements descriptive statistics on plain numpy arrays or pandas Series instances, independent of CNVkit.
  • Better testing on Travis, covering Python 2.7, 3.4 and 3.5, on both Linux and OS X (thanks @kyleabeauchamp, @rmcgibbo, and @mpharrigan; #110)
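As an example of the kind of statistic descriptives.py provides, here is a one-step Tukey biweight location on a plain list of numbers. It is a textbook-style sketch, not the module's code: the tuning constant c=6.0 is a common choice assumed here, and the real implementation may iterate or use different constants.

```python
from statistics import median


def biweight_location(values, c=6.0):
    """One-step Tukey biweight location of a sequence of numbers."""
    center = median(values)
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        # No spread around the median; nothing to reweight
        return center
    numerator = 0.0
    denominator = 0.0
    for x in values:
        u = (x - center) / (c * mad)
        if abs(u) < 1:
            # Points at least c MADs from the median get zero weight
            w = (1 - u * u) ** 2
            numerator += (x - center) * w
            denominator += w
    return center + numerator / denominator


robust_center = biweight_location([1, 2, 3, 4, 100])
```

Here the outlier 100 receives zero weight, so the estimate stays near the bulk of the data (about 2.57), whereas the arithmetic mean of the same values is 22.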

Bug fixes:

  • batch: Errors in parallel processes will immediately be raised as exceptions at the top level, rather than dying silently. Previously, no error would occur until a missing output file was needed later in the pipeline. (#55)
  • segment:
    • Skip possible R warning text when parsing CBS output (#106) and run Rscript with the --vanilla option (#112; thanks @jsmedmar). Non-isolated R processes were prone to add various warning messages to the expected SEG output, which could crash the "segment" command for some users.
    • Handle zero-weight bins better (#128; thanks @chapmanb).
  • scatter:
    • Handle selected segments with an empty gene name (#104; thanks @mpschr).
    • Don't crash on zero-length GenomicArray/CopyNumArray inputs.
  • VCF parsing (now within tabio) improved:
    • More robust to missing genotype (GT) & depth (DP) fields (#102)
    • Handle VCFs from MuTect2 (#122)
  • export theta: don't crash when SNP VCF is a single, unpaired sample, or if segmented input (.cns) is empty.
  • heatmap: Avoid a possible crash if a sample is missing a chromosome.
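The batch fix above, where worker errors surface at the top level, follows a general pattern that can be sketched with the standard concurrent.futures module (which the "futures" package backports to Python 2.7). The worker function and sample names are hypothetical, and a thread pool is used here only to keep the example short; Future.result() behaves the same way with a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(name):
    """Stand-in for one per-sample pipeline step (hypothetical)."""
    if name == "bad_sample":
        raise ValueError("could not process %s" % name)
    return name + ".cnr"


def run_all(sample_names, workers=2):
    """Run every sample in a pool and re-raise the first worker error."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_sample, name) for name in sample_names]
        # Future.result() re-raises any exception raised inside the worker,
        # so a failure stops the run here instead of being silently lost
        return [future.result() for future in futures]
```

If results are never collected this way, a failed worker goes unnoticed until a later step looks for its missing output file.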

Packaging:

  • Universal wheels are enabled for installation with pip (setup.cfg).

New & updated dependencies:

  • futures
  • futurize
  • numpy raised to version 1.9
  • pandas raised to version 0.18.1
  • pysam version 0.9.1.1 is specifically excluded

Version 0.7.11

20 Apr 23:22

New dependency on pyfaidx, a Python library for handling samtools-style FASTA indexes (.fai).

export vcf:

  • Add CNVkit version and current date (i.e. local calendar date that the
    "cnvkit.py export vcf" command was run) to the VCF header.

export theta:

  • Given a VCF of SNVs called jointly in paired tumor and normal samples,
    extract SNP allele counts to THetA2's custom input format
    ("snp_formatted.txt"). The two additional files CNVkit generates this way can
    be used with THetA2's "--TUMOR_SNP" and "--NORMAL_SNP" options to improve
    estimates of tumor purity and clonality.
  • Use CNVkit's segment weights and probe counts to estimate normal-sample read
    counts for each segment if no copy number reference profile (.cnn) or paired
    normal sample (.cnr) is given.
    The command's second argument is now optional and deprecated in favor of the
    -r/--reference option, which does the same thing.

import-theta:

  • Save integer copy number in the "cn" column of the output file(s) (CNVkit's
    .cns format).

call, export nexus-ogt:

  • When reading structural variants from a VCF file, interpret the END tag as the
    variant end position, not the length, per the VCF 4.2 specification.
    This bug could cause the b-allele frequencies calculated in call and export nexus-ogt to be erroneously repeated across many consecutive bins.
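The difference between reading END as a position and as a length can be seen in a small sketch. The record values and the function name are invented for illustration; this is not the parser CNVkit uses.

```python
def sv_interval(pos, info):
    """Return (start, end, length) for a structural variant record.

    `pos` is the POS column; `info` is the raw INFO string.
    """
    fields = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    # END is the variant's end position, not its length
    end = int(fields["END"])
    return pos, end, end - pos


# Hypothetical deletion record: POS=1000, INFO as below
start, end, length = sv_interval(1000, "SVTYPE=DEL;END=1500;SVLEN=-500")
# Misreading END as a length would instead place the end at 1000 + 1500
wrong_end = start + 1500
```

With the correct reading the variant spans 1000 to 1500; the misreading stretches it to 2500, which is how a single variant could spill across many consecutive bins.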

scatter:

  • When loading CNVkit files (in any command), identify and drop rows with "NaN"
    log2 values. (CNVkit never emits these, but they could happen if a user
    generates .cnr files from Illumina CGH array data files using a custom
    script.) The other columns (spread, gc, rmask) can be NaN without a problem, but
    plotting with scatter would crash when adjusting the y-axis based on NaN
    log2 values. (#95)
  • Detect & warn if input .cnr/.cns/.vcf is not sorted by genomic coordinates.
    This can happen with an input VCF or a manually constructed .cnr/.cns file
    (not generated by CNVkit). Previously the resulting error message was
    cryptic, because some bins/segments/SNVs were selected successfully but
    plotting crashed when laying out the x-axis coordinates.
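Both input checks described for scatter can be sketched on plain dictionaries. This is an illustration only: the rows are invented, the function names are hypothetical, and the real checks operate on CNVkit's own data structures.

```python
import math


def drop_nan_log2(rows):
    """Keep only rows whose "log2" value is a real number."""
    return [row for row in rows if not math.isnan(row["log2"])]


def is_sorted(rows):
    """Check that rows are ordered by start position within each chromosome.

    Chromosome order itself is not checked, only that each chromosome's
    rows are contiguous and their start positions never decrease.
    """
    seen = set()
    prev_chrom, prev_start = None, None
    for row in rows:
        chrom, start = row["chromosome"], row["start"]
        if chrom != prev_chrom:
            if chrom in seen:
                return False  # chromosome reappears after another one
            seen.add(chrom)
        elif start < prev_start:
            return False
        prev_chrom, prev_start = chrom, start
    return True


rows = [
    {"chromosome": "chr1", "start": 100, "log2": 0.1},
    {"chromosome": "chr1", "start": 50, "log2": float("nan")},
    {"chromosome": "chr1", "start": 200, "log2": -0.3},
]
clean = drop_nan_log2(rows)
```

In this example the original rows are unsorted and contain one NaN log2 value; after filtering, two rows remain and they pass the sortedness check.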

Internals & packaging:

  • Use the pyfaidx library to extract sequences from a genome FASTA file (used in
    the reference command), replacing some custom code in cnvlib. (#73; thanks
    @mdshw5)
  • Documentation updates.