Welcome to cortex_var

cortex_var is a tool for genome assembly and variation analysis from sequence data. You can use it to discover and genotype variants on single or multiple haploid or diploid samples. If you have multiple samples, you can use Cortex to look specifically for variants that distinguish one set of samples (eg phenotype=X, cases, parents, tumour) from another set of samples (eg phenotype=Y, controls, child, normal). See our Nature Genetics paper and the documentation for detailed descriptions.

If you have questions about what cortex_var can do or about experimental design, need advice on a problem, or want to know what others are doing with cortex_var, please join our Google group/mailing list (to do this, click here and follow instructions) and post or email your questions there. Note you cannot post to or email the list without first joining. If you have tried to use Cortex but something has gone wrong, then please raise a bug. Instructions on how to raise a bug are here.

cortex_var latest news

13th November 2013: release v1.0.5.21. Get it here. Minor release with updates to the 1000 Genomes Phase3 pipeline. These changes will migrate into run_calls in a forthcoming release.

4th August 2013: release v1.0.5.20. I have again made two releases in quick succession. These are small incremental releases. VCF filtering now differs depending on whether a joint or independent workflow is being used. See the Release Notes for details. There is also a new --subsample option, which is sometimes useful for examining how things (e.g. power) vary with coverage. I'm afraid v1.0.5.19 had slightly broken VCF filtering, hence the rapid new release.

10th July 2013: reissued release v1.0.5.18. Apologies everyone - on July 1st I mis-bundled the v1.0.5.18 zip file and left out one of the new scripts. I've just fixed this and reissued the release.

1st July 2013: release v1.0.5.18. I've just brought out version and then very rapidly brought out v. also. There are some small bugfixes (see the release notes), but the main benefits of this release are: more robust choice of error-cleaning threshold, in the face of real-life heterogeneous data. run_calls now looks at the coverage distribution (as always) and makes a better choice of threshold. It also dumps a PDF for each sample, showing the distribution and where the line was drawn. There is also an important bugfix, fixing a bug I introduced by reverting a previous fix - essentially I was not allowing error-cleaning of contigs longer than a SNP - this can affect high diversity species as well as microbial samples where you often have low coverage contigs of the host species.

21 June 2013: release v1.0.5.16. This is a minor release. Bugfixes include a bug in the VCF filters, which were removing calls where the ref allele in the reference FASTA was lower case, and various memory leaks. For those interested in testing beta code, there are early versions of the new Cortex error correction code (used in the 1000 Genomes Project) and a simple pan-genome analysis function, so you can compare a set of known genes with a set of samples to see which samples have which genes/plasmids/whatever. Both of these functions (including the user interface) will change in the next releases though, so bear that in mind.

6 February 2013: release v1.0.5.15. This is a bugfix release. run_calls now allows you to specify the ASCII offset if you are parsing non-standard FASTQ quality encoding. It also copes better when given a dataset containing samples with different read lengths: if, for example, some samples have 50bp reads and others have 100bp reads, it now copes properly when the user has specified going up to k>50, so that some samples have empty graphs. Some installation issues reported by a few users have been fixed by Isaac Turner. Some bugs in the 1000 Genomes Phase2b Cortex pipeline have been caught and fixed by Chunlin Xiao - this pipeline has been used so far to call variants on 1500 humans (mean depth 5x), so you might be interested if you want to work on a population of samples with large genomes. See the link elsewhere on this page (on the right) for docs on the pipeline itself. There are various other small fixes, and BAM parsing has moved to htslib. See the release notes for further details.
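To illustrate what the ASCII offset means, here is a minimal Python sketch of decoding a FASTQ quality string. It is not taken from run_calls, and the function name decode_qualities and its parameter are made up for this example. It relies only on the general convention that each quality character encodes a score as its ASCII code minus an offset (33 in the standard Sanger-style encoding, 64 in some older Illumina files).

```python
def decode_qualities(quality_string, ascii_offset=33):
    """Convert a FASTQ quality string into a list of integer quality scores.

    ascii_offset is the value subtracted from each character's ASCII code
    (33 for standard encoding; 64 for some older, non-standard files).
    """
    scores = [ord(ch) - ascii_offset for ch in quality_string]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong offset was assumed.
        raise ValueError("negative quality score: check the ASCII offset")
    return scores


# The same scores written with two different encodings.
print(decode_qualities("II5!", ascii_offset=33))  # [40, 40, 20, 0]
print(decode_qualities("hhT@", ascii_offset=64))  # [40, 40, 20, 0]
```

Decoding an offset-64 file with offset 33 would silently inflate every score by 31, which is why it matters to tell the tool which encoding your data uses.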

20th November 2012: New Cortex paper out! The next Cortex paper, "High-throughput microbial population genomics using the Cortex variation assembler" is out at Bioinformatics, early access. Get it here. Check it out to find out how we call at multiple kmers, the different discovery workflows you can use, different ways to integrate a reference assembly into your work, and how to scale to thousands of microbial samples. The two case studies are also worth a look. For microbial samples, the relatively high coverage compared with vertebrate sequencing, plus the relatively low repeat content, means that assembly can attain sensitivity comparable to mapping. On top of that you get the lower FDR, better access to non-SNP variation, and the ability to compare a sample with the entire pan-genome of known sequence, not just a reference.

14 November 2012: release v1.0.5.14. Get it here. This is a bugfix release, fixing a crash (segfault) that could occur when you set --max_var_len to very large (megabase) values, plus a few other small changes. See the release notes for further details.

12 October 2012: release v1.0.5.13. This release introduces support for reading gzipped FASTQ (and FASTA) and BAM files (alignment information in the BAM is ignored). There are also a number of bugfixes, notably for a nasty bug which created invalid graph binaries. I've also updated the description of the 1000 Genomes Cortex pipeline here. There are a few UI changes: the --format option has been removed, and --max_read_len is now only needed if you use --gt. The install process has also been updated; you should now only need to run the install shell script, and then compile Cortex itself. run_calls should also now work without needing to set any environment variables.

23rd August 2012: Bugfix release v1.0.5.11. The main change in this release is in the scripts/1000genomes directory, which I have not advertised previously. It contains scripts for running Cortex on large numbers (tens, hundreds) of samples with large genomes - i.e. for the 1000 Genomes project. These are to allow collaborators across the world to reliably run a consistent Cortex pipeline on human populations. However this is the first time people other than me have done this, so I expect there may be some smoothing-out of issues in the near future. You can see a PDF describing the pipeline here. I've had enough people ask me about running Cortex on lots of samples with big genomes, that I thought people would find it useful to see the process. This release is a bugfix for a script in that 1000 Genomes directory, plus fixes for a few potential bugs-in-waiting (array overflow errors) in Cortex itself.

17th August 2012: Bugfix release v1.0.5.10. Thanks to Akdes Serin for finding some bugs in run_calls.pl, and Fernando Cruz, for implicitly pointing out that the INSTALL file had some text referring to an old release. If you are not using run_calls.pl, there is NO benefit to upgrading to this release.

14th August 2012: I've just put a new release v1.0.5.9 up on Sourceforge. There are various new features for Cortex itself (genotyping separate from discovery, novel sequence calling), and a considerable performance improvement thanks to some changes made by Isaac Turner (e.g. if I/O is not a bottleneck, loading a human reference genome binary (k=31) now takes 15 minutes where it used to take 45 minutes). However the biggest change in this release is the introduction of a wrapper script that allows you to automate an entire analysis across many samples (from FASTQ to VCF). The manual has also been completely overhauled.

9th January 2012: The Cortex paper is now out in Nature Genetics! Check it out for a detailed description of our methods for variant discovery and genotyping.

3rd November 2011: We have just released v1.0.5.3 - new features are: better error cleaning, genotype calls and likelihoods, allows dumping of subgraphs found by alignment of sequence to the graph. There is a new dependency on the GNU Scientific Library, which for simplicity I have bundled with cortex_var - this means the zip is a lot bigger than before (about 24Mb). Apologies for this - I'll pare this down in future releases. Note I have modified the binary header for binary files slightly - Cortex will continue to be able to read old binaries.

cortex_var features

  • Variant discovery by de novo assembly - no reference genome required
  • Supports multicoloured de Bruijn graphs - have multiple samples loaded into the same graph in different colours, and find variants that distinguish them.
  • Capable of calling SNPs, indels, inversions, complex variants, small haplotypes
  • Extremely accurate variant calling - see our paper for base-pair-resolution validation of entire alleles (rather than just breakpoints) of SNPs, indels and complex variants by comparison with fully sequenced (and finished) fosmids - a level of validation beyond that demanded of any other variant caller we are aware of - currently cortex_var is the most accurate variant caller for indels and complex variants.
  • Capable of aligning a reference genome to a graph and using that to call variants
  • Support for comparing cases/controls or phenotyped strains
  • Typical memory use: 1 high coverage human in under 80GB of RAM, 1000 yeasts in under 64GB of RAM, 10 humans in under 256GB of RAM
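
As a rough illustration of the "colours" idea in the feature list above, here is a toy Python sketch. It is not Cortex's implementation, and all function names are invented for this example. It only records which samples (colours) contain each k-mer, and it ignores reverse complements, graph edges, coverage, error cleaning and the actual variant-calling algorithms. Even so, it shows how tagging k-mers by sample makes it easy to ask which sequence distinguishes one sample from another.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every k-mer (substring of length k) in a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_coloured_kmer_index(samples, k):
    """Map each k-mer to the set of colours (sample indices) it occurs in.

    samples is a list where entry i holds the reads of the sample
    assigned colour i.
    """
    index = defaultdict(set)
    for colour, reads in enumerate(samples):
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(colour)
    return index


def private_kmers(index, colour):
    """Return the sorted k-mers seen only in the given colour."""
    return sorted(kmer for kmer, colours in index.items() if colours == {colour})


# Two toy samples that differ by a single base (A versus T at position 5).
samples = [
    ["ACGTACGA"],  # colour 0
    ["ACGTTCGA"],  # colour 1
]
index = build_coloured_kmer_index(samples, k=3)

print(private_kmers(index, 0))  # ['GTA', 'TAC']
print(private_kmers(index, 1))  # ['GTT', 'TCG', 'TTC']
```

The k-mers shared by both colours (ACG, CGT, CGA) flank the difference, while the private k-mers cover the differing base itself. A real coloured de Bruijn graph also stores how k-mers connect, which is what allows whole alleles to be assembled rather than just flagged.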