EasyCluster – A program for clustering EST/FL-cDNA sequences
Description
EasyCluster is a specialized Python program to build gene-oriented
clusters of ESTs given a genomic sequence and a related set of ESTs and/or
FL-cDNAs. The complete package can be freely downloaded from http://www.pesolelab.it/easycluster/.
The current release contains the following files:
easy.py: main EasyCluster script
getFastas.py: accessory script to extract EST and genomic sequences in fasta format
rmWorkdir: accessory script to remove a work directory
test.py: accessory script to test EasyCluster after installation
setup.py: script to install EasyCluster and its environment
modcluster.py: module to handle EST clusters
moddraw.py: module to draw EST clusters
modAS.py: module to predict Alternative Splicing events per cluster
src/dna_base.c: source code of DNA_Stat package
src/dna_base.h: source code of DNA_Stat package
src/dna_stat.c: source code of DNA_Stat package
src/dna_stat.h: source code of DNA_Stat package
doc/easycluster.doc: EasyCluster documentation (MSword file)
doc/easycluster.html: EasyCluster documentation (HTML file)
doc/easycluster.txt: EasyCluster documentation (TXT file)
data/genomic.fasta: example of fasta genomic region
data/est.fasta: example of fasta EST sequences
data/outfile.gmap: example of GMAP outfile in compressed format
data/README_data.txt: description of example files
Requirements
EasyCluster requires the GMAP package. A copy can be freely obtained
from the GMAP home page: http://www.gene.com/share/gmap.
If possible, we recommend version 2007-09-28, although newer releases
should work as well.
EasyCluster can use C functions from the DNA_Stat package (by J. Kleffe)
through the ctypes module to speed up EST storage and retrieval. Therefore,
Python version 2.4 or higher should be used. To check your Python version,
use the following command:
python -V
Our tests indicate that EasyCluster is stable under both Python 2.4 and
Python 2.5. Releases earlier than 2.4 may not work correctly because they
do not include the specific modules needed to handle set data structures.
In any case, EasyCluster verifies all dependencies and requirements
before starting any calculation. If GMAP is not installed or the Python
version is not suitable, the program raises an error and interrupts the
analysis.
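The kind of check described above can be sketched as follows. This is only an illustration and not the code EasyCluster itself runs: the function name check_requirements and its parameters are hypothetical, the GMAP executable is assumed to be a file named gmap inside the directory given by the -G option (default /usr/local/bin), and the sketch is written for a current Python interpreter.

```python
import os
import sys


def check_requirements(gmap_dir="/usr/local/bin", min_python=(2, 4)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Compare only the major and minor components of the running interpreter.
    if tuple(sys.version_info[:2]) < tuple(min_python):
        problems.append("Python %d.%d or higher is required" % tuple(min_python))
    # Assumption: GMAP is an executable file named 'gmap' inside gmap_dir.
    gmap_path = os.path.join(gmap_dir, "gmap")
    if not (os.path.isfile(gmap_path) and os.access(gmap_path, os.X_OK)):
        problems.append("GMAP executable not found at %s" % gmap_path)
    return problems
```

A caller could stop before any calculation when the returned list is not empty, for example by printing each message and exiting with a non-zero status.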
Installation
To install EasyCluster type the following commands:
- gunzip easycls.tar.gz
- tar -xvf easycls.tar
- cd easycls
- python setup.py
Warning messages may appear during the installation. This is expected
when the compilation is performed on different operating systems
and compiler versions.
The installation procedure has been tested on Linux (RedHat and
SUSE) and Mac OS X (10.4 and 10.5) operating systems. In principle it should
also work on Windows machines and on any system on which GMAP can be installed.
At the beginning of the setup the user is asked to provide a MAIN
directory in which EasyCluster will be installed; otherwise the package will
use a default directory. The MAIN directory is also used to create the
appropriate EasyCluster environment, so all scripts and required modules
will be placed in the MAIN directory. An overview of the EasyCluster
environment is as follows:
[MAIN directory]:
    [bin directory] --- containing main python scripts
        easy.py
        getFastas.py
        rmWorkdir
        test.py
    [lib directory] --- containing all needed modules
        [python2.x directory]
            [site-packages]
                dna.so
                modAS.py
                modAS.pyc
                modcluster.py
                modcluster.pyc
                moddraw.py
                moddraw.pyc
    [data directory] --- containing sample fasta files
        est.fasta
        genomic.fasta
        outfile.gmap
        README_data.txt
    [doc directory] --- containing this documentation
        easycluster.doc
        easycluster.html
        easycluster.txt
After the installation, enter the MAIN directory and type the following
command to test the program:
./bin/test.py
Usage
EasyCluster has been designed to make EST clustering accessible to
users with limited bioinformatics experience. It can be used from the
command line or interactively. In interactive mode the program asks
for input files and options step by step.
To run EasyCluster interactively type:
./bin/easy.py -i
To list the command line options type:
./bin/easy.py -h
List of available options:
-g to specify a file containing genomic sequence(s) in fasta format
-e to specify a file containing EST sequences in fasta format
-c minimum coverage (default=90.0)
-s minimum identity (default=95.0)
-U name of user defined GMAP database if available (does not require -g option)
-L location of user defined database by absolute path
-u name of user defined database of EST sequences
-l location of user defined database of EST sequences by absolute path
-G GMAP location by absolute path (default /usr/local/bin)
-r name of user defined GMAP outfile including its path
-f format of user defined GMAP outfile. It is not relevant because only files in compressed format (-Z option in GMAP) will be taken into account.
-q perform quick clustering (not yet available)
-I include unspliced ESTs (not yet available)
-A enable the prediction of Alternative Splicing events per cluster
-w to specify a work directory (default=workdir)
-m main directory by absolute path (default=current directory)
-a remove database of EST sequences at the end
-b remove GMAP database of genomic sequence(s) at the end
-h show help
-v show software version
-i run EasyCluster interactively
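The effect of the -c and -s thresholds can be pictured with a small filter. This is a minimal sketch under two assumptions that the documentation does not state: that both thresholds must be met, and that the comparison is inclusive. The function name passes_thresholds and the alignment values are hypothetical.

```python
def passes_thresholds(coverage, identity, min_coverage=90.0, min_identity=95.0):
    """Return True when an alignment meets both minimum percentages."""
    # Assumption: inclusive comparison, both conditions required.
    return coverage >= min_coverage and identity >= min_identity


# Hypothetical (coverage, identity) pairs, in percent, for three ESTs.
alignments = {
    "est_1": (98.2, 99.1),
    "est_2": (85.0, 99.5),   # coverage below the default of 90.0
    "est_3": (93.4, 94.9),   # identity below the default of 95.0
}
kept = sorted(name for name, (cov, ident) in alignments.items()
              if passes_thresholds(cov, ident))
print(kept)  # prints ['est_1']
```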
Example using provided genomic and EST sequences
Enter your MAIN directory and then type:
./bin/easy.py -g data/genomic.fasta -e data/est.fasta -w easy_test
EasyCluster will start the analysis by reading the genomic and EST
sequences provided in the genomic.fasta and est.fasta files respectively, and
will automatically create a GMAP database and a database of ESTs. The program
will then run GMAP and use the results to build clusters. Databases, results
and additional files will be stored in the "easy_test" work directory. Every
time EasyCluster starts a new analysis it creates the work directory given
by the user, which is organized as follows:
[my_work_directory]
    outfile.gmap --- EST to genome mapping file (GMAP outfile)
    nopaths.txt --- ESTs with no matches after the mapping
    [clusters] --- directory containing cluster details
        [firstclustering] --- results of the first clustering step
        [secondclustering] --- results of the second clustering step
    [results]
        clusters.gff --- detected clusters in GFF format (version 2)
        clusters.txt --- detected clusters in tabular format
    [database]
        maindb --- local database storing clustering info
        [GMAP] --- folder containing GMAP database
        [ESTdb] --- folder containing local database of ESTs
    [webresults]
        report.html --- main HTML page to browse results
        [webclusters] --- folder containing HTML files for each cluster
        [webregions] --- folder containing HTML files for each region
Fasta files should contain simple headers without spaces or non-canonical
characters (such as |'#@*~&%$£"!/?\:.,;). Each header should identify only
one sequence.
Examples:
>genomic_sequence_1
AGTGACAGATGACAGTAGCAGTAGCAGT
AGTGACAGATGACAGTAGCAGTAGCAGT
AGTGACAGATGACAGTAGCAGTAGCAGT
>est_1
AGTGACAGATGACAGTAGCAGTAGCAGT
AGTGACAGATGACAGTAGCAGTAGCAGT
>est_2
AGTGACAGATGACAGTAGCAGTAGCAGT
AGTGACAGATGACAGTAGCAGTAGCAGT
>est_3
AGTGACAGATGACAGTAGCAGTAGCAGT
AGTGACAGATGACAGTAGCAGTAGCAGT
Look at data/genomic.fasta and data/est.fasta for more complete examples.
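The header rules above can be checked before running an analysis. The sketch below is not part of EasyCluster; the function name check_fasta_headers is hypothetical. The set of rejected characters is copied from the examples given in the text (including both the single and double quote characters), so treating it as the complete list is an assumption.

```python
# Characters named in the documentation as examples of non-canonical ones.
FORBIDDEN = set("|'#@*~&%$£\"!/?\\:.,;")


def check_fasta_headers(lines):
    """Return (header, reason) pairs for headers that break the rules."""
    problems = []
    seen = set()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].rstrip("\r\n")
        if not header:
            problems.append((header, "empty header"))
        elif any(ch.isspace() or ch in FORBIDDEN for ch in header):
            problems.append((header, "space or non-canonical character"))
        elif header in seen:
            problems.append((header, "duplicate header"))
        seen.add(header)
    return problems


example = [
    ">est_1", "AGTGACAGATGACAGTAGCAGTAGCAGT",
    ">est 2", "AGTGACAGATGACAGTAGCAGTAGCAGT",
    ">gi|12345", "AGTGACAGATGACAGTAGCAGTAGCAGT",
    ">est_1", "AGTGACAGATGACAGTAGCAGTAGCAGT",
]
for header, reason in check_fasta_headers(example):
    print(header, "->", reason)
```

An open file can be passed directly, for instance check_fasta_headers(open("data/est.fasta")), since the function only iterates over lines.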
Clusters are generated in tabular format or GFF in order to be
easily used in gene prediction pipelines.
Example of tabular format:
SEQ1	+	CLS_1	721	5997	97	S35320039 S30503603 S30503600...
SEQ1	+	CLS_2	7860	11210	73	S37098609 S20526016 S30526400...
Each line contains 7 fields separated by a tab character. Fields are:
1 – name of the genomic region
2 – strand
3 – cluster number
4 – start coordinate
5 – end coordinate
6 – number of ESTs in the cluster
7 – list of EST names separated by space
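Because the layout is fixed, a line of clusters.txt can be read with a few lines of Python. The following is a minimal sketch: the names Cluster and parse_cluster_line are hypothetical, and the sample line is a shortened, made-up record with three ESTs rather than real output.

```python
from collections import namedtuple

# One record per line of the tabular file, in the field order listed above.
Cluster = namedtuple("Cluster", "region strand name start end n_ests ests")


def parse_cluster_line(line):
    """Split one tab-separated line into a Cluster record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, found %d" % len(fields))
    region, strand, name, start, end, n_ests, ests = fields
    # The seventh field holds the EST names separated by spaces.
    return Cluster(region, strand, name, int(start), int(end),
                   int(n_ests), ests.split())


# Shortened, hypothetical line with only three ESTs.
line = "SEQ1\t+\tCLS_1\t721\t5997\t3\tS35320039 S30503603 S30503600"
cluster = parse_cluster_line(line)
print(cluster.name, cluster.start, cluster.end, len(cluster.ests))
```

Comparing n_ests with len(ests) is a cheap consistency check when reading a complete file.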
Example of GFF format (version 2):
## gff-version 2
SEQ1 genomicseq.fasta exon 733 901 98 + . SEQ1-CLS_1-S23782858
SEQ1 genomicseq.fasta exon 1970 2146 94 + . SEQ1-CLS_1-S23782858
SEQ1 genomicseq.fasta exon 732 901 100 + . SEQ1-CLS_1-S11705887
SEQ1 genomicseq.fasta exon 1970 2146 100 + . SEQ1-CLS_1-S11705887
SEQ1 genomicseq.fasta exon 2303 2366 100 + . SEQ1-CLS_1-S11705887
All 9 fields follow the GFF standard. For more details
visit the GFF page at http://www.sanger.uk.
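The exon lines can be grouped by their ninth field to recover the exon structure of each aligned EST. This is a sketch only: the function name exons_by_group is hypothetical, fields are split on any whitespace (which assumes that no field contains a space), and the ninth field is kept as an opaque label even though in the example it appears to combine region, cluster, and EST names.

```python
from collections import defaultdict


def exons_by_group(lines):
    """Map each group label (9th field) to a list of (start, end, score) exons."""
    exons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '##' header lines
        fields = line.split(None, 8)
        if len(fields) != 9:
            raise ValueError("expected 9 fields, found %d" % len(fields))
        seqname, source, feature, start, end, score, strand, frame, group = fields
        if feature == "exon":
            exons[group].append((int(start), int(end), float(score)))
    return dict(exons)


gff_lines = [
    "## gff-version 2",
    "SEQ1 genomicseq.fasta exon 733 901 98 + . SEQ1-CLS_1-S23782858",
    "SEQ1 genomicseq.fasta exon 1970 2146 94 + . SEQ1-CLS_1-S23782858",
    "SEQ1 genomicseq.fasta exon 732 901 100 + . SEQ1-CLS_1-S11705887",
    "SEQ1 genomicseq.fasta exon 1970 2146 100 + . SEQ1-CLS_1-S11705887",
    "SEQ1 genomicseq.fasta exon 2303 2366 100 + . SEQ1-CLS_1-S11705887",
]
for group, exon_list in sorted(exons_by_group(gff_lines).items()):
    print(group, len(exon_list))
```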
EasyCluster also generates a report page in HTML format (stored in the
webresults folder), so each detected cluster can be browsed and inspected
graphically. Although the HTML pages should be compatible with all Internet
browsers, we suggest using Mozilla Firefox.
If a GMAP database is available, EasyCluster does not need a genomic file:
./bin/easy.py -U gmap_database -L /gmap_database/location -e est.fasta -w test2
Moreover, if you have a GMAP outfile in compressed format, EasyCluster
does not need to run GMAP again, saving time:
./bin/easy.py -g genomic.fasta -e est.fasta -r outfile.gmap -w test3
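When EasyCluster is called from another script, the three invocations shown in this section can be assembled programmatically. The helper below is a hypothetical convenience and not part of the package; it only uses the -g, -e, -U, -L, -r and -w options documented above and returns an argument list suitable for the standard subprocess module.

```python
def build_command(est, workdir, genomic=None, gmap_db=None, gmap_db_dir=None,
                  gmap_outfile=None, script="./bin/easy.py"):
    """Return the easy.py argument list for one analysis."""
    cmd = [script]
    if gmap_db is not None:
        # An existing GMAP database replaces the genomic fasta file.
        cmd += ["-U", gmap_db]
        if gmap_db_dir is not None:
            cmd += ["-L", gmap_db_dir]
    elif genomic is not None:
        cmd += ["-g", genomic]
    else:
        raise ValueError("a genomic fasta file or a GMAP database is required")
    cmd += ["-e", est]
    if gmap_outfile is not None:
        # Reuse a GMAP outfile in compressed format instead of running GMAP again.
        cmd += ["-r", gmap_outfile]
    cmd += ["-w", workdir]
    return cmd


print(" ".join(build_command("est.fasta", "test3", genomic="genomic.fasta",
                             gmap_outfile="outfile.gmap")))
```

The returned list can be passed to subprocess.call from inside the MAIN directory, which avoids quoting problems with paths.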
Author and contact
EasyCluster has been developed in the Pesole-Lab at the University
of Bari (Italy). For detailed questions not covered in this brief documentation
file, or to report bugs, please contact E. Picardi at e.picardi@biologia.uniba.it.
New releases, updates, and known errors can be found at www.pesolelab.it/easycluster/.