EasyCluster: a fast program to build gene-oriented EST clusters

A fast program to build gene-oriented clusters of EST/FL-cDNA sequences

EasyCluster is a Python program that creates gene-oriented clusters of ESTs/FL-cDNAs given a genomic sequence. It first maps ESTs to the genome with the GMAP program and then groups related ESTs under the biological assumption that two or more ESTs belong to the same gene locus if they share at least one splice site. Optionally, EasyCluster can predict alternative splicing events for each EST cluster by means of an ASTALAVISTA-like algorithm. The resulting clusters are in GFF format and ready to be used by other computational programs, including gene finders and tools that assemble FL-transcripts. EST clusters can also be inspected in their genomic context through ad hoc HTML pages.
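The grouping rule described above can be illustrated with a minimal sketch. This is not EasyCluster's own code: the function name, the input layout (a dictionary from EST identifier to a list of splice sites) and the splice-site tuples are hypothetical, and the sketch assumes that sharing is applied transitively and that ESTs without any splice site form singleton clusters.

```python
from collections import defaultdict


def cluster_by_splice_site(est_splice_sites):
    """Group ESTs that share at least one splice site (applied transitively).

    est_splice_sites maps an EST identifier to an iterable of splice sites,
    each a hashable value such as (sequence name, strand, position).
    """
    # Union-find forest: every EST starts as its own cluster representative.
    parent = {est: est for est in est_splice_sites}

    def find(est):
        # Walk up to the representative, shortening the path on the way.
        while parent[est] != est:
            parent[est] = parent[parent[est]]
            est = parent[est]
        return est

    first_est_at_site = {}
    for est, sites in est_splice_sites.items():
        for site in sites:
            if site in first_est_at_site:
                # Same splice site seen before: merge the two clusters.
                parent[find(est)] = find(first_est_at_site[site])
            else:
                first_est_at_site[site] = est

    clusters = defaultdict(list)
    for est in est_splice_sites:
        clusters[find(est)].append(est)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical input: est1/est2 share position 250, est2/est3 share 400.
example = {
    "est1": [("chr1", "+", 100), ("chr1", "+", 250)],
    "est2": [("chr1", "+", 250), ("chr1", "+", 400)],
    "est3": [("chr1", "+", 400)],
    "est4": [("chr1", "+", 900)],
    "est5": [],
}
clusters = cluster_by_splice_site(example)
print(clusters)
```

In this toy input est1 and est3 share no splice site directly but end up in the same cluster through est2, which is the behaviour the transitivity assumption implies.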

EasyCluster can be installed and used on any operating system that runs Python and GMAP. It has also been designed for users without advanced bioinformatics skills: it can be run interactively through a step-by-step process.

- Program requirements

EasyCluster requires the Python interpreter, version 2.4 or later, and the GMAP program (version 2007-09-28 or later). For very large EST datasets, Python 2.5 is required in order to import the ctypes module and the specific C library DNA_Stat. Moreover, to view EST clusters in HTML format correctly, we strongly recommend the Mozilla Firefox browser, version 1.5.x or later.
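The requirements above can be checked before installation with a small script. The following is only a sketch and not part of the EasyCluster distribution: the function name is hypothetical, it assumes the GMAP executable is called gmap and is on the PATH, it does not verify the GMAP release date, and it uses shutil.which, which needs a modern Python 3 interpreter.

```python
import shutil
import sys

MIN_PYTHON = (2, 4)         # minimum interpreter version stated above
MIN_PYTHON_CTYPES = (2, 5)  # needed for ctypes / DNA_Stat on very large datasets


def check_requirements(python_version, gmap_path, large_dataset=False):
    """Return a list of problems; an empty list means nothing was found missing."""
    problems = []
    needed = MIN_PYTHON_CTYPES if large_dataset else MIN_PYTHON
    if tuple(python_version[:2]) < needed:
        problems.append("python %d.%d or later is required" % needed)
    if gmap_path is None:
        problems.append("gmap executable not found on PATH")
    return problems


# Check the running interpreter and look for an executable named "gmap".
for problem in check_requirements(sys.version_info, shutil.which("gmap")):
    print(problem)
```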

- Quick installation and use

The current EasyCluster release can be downloaded here. After the download, EasyCluster can be installed through the following steps:

ernesto$ gunzip easycls.tar.gz

ernesto$ tar -xvf easycls.tar

ernesto$ cd easycls

ernesto$ python setup.py

This command installs EasyCluster in the default directory EasyCluster, which is the main directory in which all calculations will be performed. Alternatively, users can change the default directory by giving the complete path of the desired location.

Testing the installed version of EasyCluster is quite straightforward:

ernesto$ cd EasyCluster

ernesto$ ./bin/test.py

To view all available options:

ernesto$ ./bin/easy.py -h

EasyCluster can be used from the command line or interactively.

In the first case:

ernesto$ ./bin/easy.py -g data/genomic.fasta -e data/est.fasta -w my_work_directory

The directory my_work_directory will store all results and data produced during the EasyCluster run.

In the second case:

ernesto$ ./bin/easy.py -i

This starts the interactive mode.

More details about EasyCluster installation and use can be found here.

- Testing EasyCluster

EasyCluster has been tested on a variety of EST data from different organisms, including Homo sapiens, Mus musculus, Arabidopsis thaliana and Vitis vinifera.

The following four datasets have been used for direct evaluation or comparison with other EST clustering tools:

1. Unigene cluster Hs.122986

This Unigene cluster has been used to compare EasyCluster to ASmodeler.

2. Human HOXA gene family

This dataset has been used to evaluate the behaviour of EasyCluster with paralogous genes.

Download HOXA genomic region, ESTs and EasyCluster results.

3. Human curated dataset

This dataset represents a high-quality and unbiased benchmark of human ESTs for reliably testing and evaluating EST clustering tools. It contains 17733 ESTs belonging to 111 human genes. All ESTs are spliced, and the EST-to-gene relationship is perfect and manually curated. Only ESTs mapping with a minimum alignment identity and coverage of 80% have been included in the dataset. The benchmark comprises non-overlapping genes as well as overlapping and nested genes. It is open to extension: all interested users are welcome to contribute new and relevant examples.

Download the human benchmark dataset and corresponding clustering results from EasyCluster, wcd, TGICL, ClustDB and BlastClust.
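The inclusion rule used for this benchmark (identity and coverage of at least 80) can be written as a simple filter. This is a sketch only: the function name, the tuple layout and the example values are hypothetical, and it assumes that both values are percentages and that the threshold is inclusive.

```python
def passes_benchmark_filter(identity, coverage, min_identity=80.0, min_coverage=80.0):
    """Return True if an alignment meets both minimum percentages (inclusive)."""
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical alignments: (EST name, percent identity, percent coverage).
alignments = [
    ("estA", 98.5, 99.0),
    ("estB", 79.9, 95.0),  # rejected: identity below the threshold
    ("estC", 80.0, 80.0),  # kept: thresholds are assumed to be inclusive
]
kept = [name for name, identity, coverage in alignments
        if passes_benchmark_filter(identity, coverage)]
print(kept)
```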

4. Ricinus communis complete genome

Ricinus communis is an oilseed plant for which no Unigene clusters are available. Its genome has been completely sequenced at 4x coverage at the J. Craig Venter Institute (JCVI). More than 57000 related ESTs are stored in the public dbEST (a current estimate can be found here). In this case EasyCluster has been used to produce a first compilation of EST clusters, together with the detection of alternative splicing events. Genomic and EST sequences can be downloaded here. EasyCluster results can be downloaded from here.

- Acknowledgments

We thank Dr. Scott Hazelhurst for valuable suggestions about EST clustering and evaluation. Special thanks go to Dr. Jurgen Kleffe for making the DNA_Stat library available and for fruitful feedback about ClustDB.

For help, suggestions or bug reports, please do not hesitate to contact E. Picardi at ernesto.picardi@uniba.it.