A fast program to build gene-oriented
clusters of EST/FL-cDNA sequences
EasyCluster is a Python program devised to create gene-oriented clusters of
ESTs/FL-cDNAs given a genomic sequence. It first maps ESTs to the genome with
the GMAP program and then groups related ESTs according to the biological
assumption that two or more ESTs belong to the same gene locus if they share
at least one splice site. Optionally, EasyCluster can predict alternative
splicing events per EST cluster by means of an ASTALAVISTA-like algorithm. The
resulting clusters are written in GFF format and are ready to be used by other
computational programs, including gene finders and tools that assemble
FL-transcripts. EST clusters can also be inspected in their genomic context
through ad hoc HTML pages.
EasyCluster can be installed and used on any operating system that runs Python
and GMAP. It has also been designed for users without advanced bioinformatics
skills: it can be run interactively through a step-by-step process.
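The grouping rule described above (ESTs that share at least one splice site belong to the same locus) can be illustrated with a small union-find sketch. This is not EasyCluster's own code: the function name, the representation of a splice site as a (strand, position) tuple, and the choice to merge clusters transitively are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_by_shared_splice_sites(est_splice_sites):
    """Group EST ids that share at least one splice site (transitively).

    est_splice_sites: dict mapping an EST id to an iterable of splice sites.
    Each site is a hashable value; here a hypothetical (strand, position) tuple.
    """
    parent = {est: est for est in est_splice_sites}

    def find(x):
        # Follow parent links to the representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first EST seen for each site; later ESTs join its cluster.
    first_seen = {}
    for est, sites in est_splice_sites.items():
        for site in sites:
            if site in first_seen:
                union(first_seen[site], est)
            else:
                first_seen[site] = est

    clusters = defaultdict(set)
    for est in est_splice_sites:
        clusters[find(est)].add(est)
    return sorted(sorted(members) for members in clusters.values())


# Made-up example: EST1/EST2 share one site, EST2/EST3 share another.
example = {
    "EST1": [("+", 1200), ("+", 1850)],
    "EST2": [("+", 1850), ("+", 2400)],
    "EST3": [("+", 2400)],
    "EST4": [("-", 9100)],
}
clusters = cluster_by_shared_splice_sites(example)
print(clusters)
```

In this toy input EST1 and EST3 share no site directly, yet they end up together because both are linked to EST2. EST4 stays in a cluster of its own.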
- Program requirements
EasyCluster requires a Python interpreter, version 2.4 or later, and the GMAP
program (version 2007-09-28 or later).
For very large EST datasets, Python 2.5 is required in order to import the
ctypes module and the specific C library DNA_Stat.
Moreover, to view EST clusters in HTML format correctly, we strongly recommend
the Mozilla Firefox browser, version 1.5.x or later.
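A quick way to check these prerequisites before installing is sketched below. It is an illustration only and is not shipped with EasyCluster: the function name is hypothetical, the sketch assumes a current Python 3 interpreter (it relies on shutil.which), it assumes the GMAP executable is named gmap and is on the PATH, and it does not attempt to verify the GMAP release date.

```python
import shutil
import sys


def check_requirements(min_python=(2, 4), gmap_executable="gmap"):
    """Return a dict describing whether each stated requirement looks satisfied."""
    report = {}

    # Interpreter version, compared as (major, minor).
    report["python_ok"] = tuple(sys.version_info[:2]) >= tuple(min_python)

    # ctypes is needed only for very large EST datasets (DNA_Stat library).
    try:
        import ctypes  # noqa: F401
        report["ctypes_available"] = True
    except ImportError:
        report["ctypes_available"] = False

    # Full path of the GMAP executable, or None if it is not on the PATH.
    report["gmap_path"] = shutil.which(gmap_executable)
    return report


report = check_requirements()
for name, value in sorted(report.items()):
    print(name, "=", value)
```

A gmap_path of None only means that no executable with that name was found on the PATH; GMAP may still be installed elsewhere.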
- Quick installation and use
The current EasyCluster
release can be downloaded here. After the download, EasyCluster
can be installed through the following steps:
ernesto$ gunzip easycls.tar.gz
ernesto$ tar -xvf easycls.tar
ernesto$ cd easycls
ernesto$ python setup.py
This command installs EasyCluster in the default directory EasyCluster, which
is the main directory where all calculations are performed. Alternatively,
users can change the default directory by indicating the complete path of the
desired location.
Testing the installed
version of EasyCluster is quite straightforward:
ernesto$ cd EasyCluster
ernesto$ ./bin/test.py
To view all available
options:
ernesto$ ./bin/easy.py -h
EasyCluster can be used from the command line or interactively.
In the first case:
ernesto$ ./bin/easy.py -g data/genomic.fasta -e data/est.fasta -w my_work_directory
The directory
my_work_directory will store all results and data produced during the
EasyCluster run.
In the second case:
ernesto$ ./bin/easy.py -i
This will start the interactive mode.
More details about
EasyCluster installation and use can be found here.
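For batch use, the command-line invocation shown above can be wrapped in a small Python helper. The sketch below uses only the -g, -e and -w options documented on this page; the helper names are hypothetical, and the script path assumes the helper is run from the EasyCluster directory.

```python
import subprocess


def build_easycluster_command(genome_fasta, est_fasta, work_dir,
                              script="./bin/easy.py"):
    """Build the argument list for a non-interactive EasyCluster run."""
    return [script, "-g", genome_fasta, "-e", est_fasta, "-w", work_dir]


def run_easycluster(genome_fasta, est_fasta, work_dir):
    """Launch EasyCluster and raise CalledProcessError if it exits non-zero."""
    command = build_easycluster_command(genome_fasta, est_fasta, work_dir)
    return subprocess.run(command, check=True)


command = build_easycluster_command(
    "data/genomic.fasta", "data/est.fasta", "my_work_directory"
)
print(" ".join(command))

# To actually start the run (from inside the EasyCluster directory):
# run_easycluster("data/genomic.fasta", "data/est.fasta", "my_work_directory")
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.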
- Testing EasyCluster
EasyCluster has been tested on a variety of EST data from different organisms,
including Homo sapiens, Mus musculus, Arabidopsis thaliana and Vitis vinifera.
The following four
datasets have been used for direct evaluation or comparison with other EST
clustering tools:
1. Unigene cluster Hs.122986
This
Unigene cluster has been used to compare EasyCluster to ASmodeler.
2. Human HOXA gene family
This dataset has been used to evaluate the behaviour of EasyCluster with
paralogous genes.
Download
HOXA genomic
region, ESTs
and EasyCluster results.
3. Human curated dataset
This dataset represents a high-quality and unbiased benchmark of human ESTs to
reliably test and evaluate EST clustering tools. It contains 17733 ESTs
belonging to 111 human genes. All ESTs are spliced, and the EST-to-gene
relationship is perfect and manually curated. Only ESTs mapping with at least
80% alignment identity and at least 80% coverage have been included in the
dataset. The benchmark comprises non-overlapping genes as well as overlapping
and nested genes. The benchmark is open to extension, and all interested users
are welcome to contribute new and relevant examples.
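The inclusion criterion for this benchmark (identity and coverage of at least 80%) amounts to a simple filter over EST alignments, sketched below. The record layout, the names and the alignment values are made up for illustration and are not taken from the dataset; treating the threshold as inclusive is also an assumption.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    est_id: str
    identity: float  # percentage of identical aligned positions
    coverage: float  # percentage of the EST length covered by the alignment


def passes_thresholds(alignment, min_identity=80.0, min_coverage=80.0):
    """Keep an alignment only if both percentages reach their minimum."""
    return (alignment.identity >= min_identity
            and alignment.coverage >= min_coverage)


# Hypothetical alignments, not real benchmark entries.
alignments = [
    Alignment("EST_A", 98.5, 95.0),
    Alignment("EST_B", 79.9, 99.0),  # identity too low
    Alignment("EST_C", 92.0, 60.0),  # coverage too low
    Alignment("EST_D", 80.0, 80.0),  # exactly on both thresholds
]
kept = [a.est_id for a in alignments if passes_thresholds(a)]
print(kept)
```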
Download the human benchmark dataset and the corresponding clustering results
from EasyCluster, wcd, TGICL, ClustDB and BlastClust.
4. Ricinus communis complete genome
Ricinus communis is an oilseed plant for which no Unigene clusters are
available. Its genome has been completely sequenced at 4x coverage at the
J. Craig Venter Institute (JCVI). More than 57000 related ESTs are stored in
the public dbEST database (a current estimate can be found here).
In this case EasyCluster has been used to produce a first compilation of EST
clusters, together with the detection of alternative splicing events. Genomic
and EST sequences can be downloaded here.
EasyCluster results can be downloaded from here.
- Acknowledgments
We thank Dr. Scott Hazelhurst for valuable suggestions about EST clustering
and evaluation. Special thanks go to Dr. Jurgen Kleffe for making the DNA_Stat
library available and for fruitful feedback about ClustDB.
For help, suggestions or bug reports, please do not hesitate to contact
E. Picardi at ernesto.picardi@uniba.it.