EasyCluster – A program for clustering EST/FL-cDNA sequences

Description

EasyCluster is a specialized Python program to build gene-oriented clusters of ESTs given a genomic sequence and a related set of ESTs and/or FL-cDNAs. The complete package can be freely downloaded from http://www.pesolelab.it/easycluster/.

The current release contains the following files:

easy.py : main EasyCluster script

getFastas.py : accessory script to extract EST and genomic sequences in fasta format

rmWorkdir : accessory script to remove a work directory

test.py : accessory script to test EasyCluster after installation

setup.py : script to install EasyCluster and its environment

modcluster.py : module to handle EST clusters

moddraw.py : module to draw EST clusters

modAS.py : module to predict Alternative Splicing events per cluster

src/dna_base.c : source code of DNA_Stat package

src/dna_base.h : source code of DNA_Stat package

src/dna_stat.c : source code of DNA_Stat package

src/dna_stat.h : source code of DNA_Stat package

doc/easycluster.doc : EasyCluster documentation (MS Word file)

doc/easycluster.html : EasyCluster documentation (HTML file)

doc/easycluster.txt : EasyCluster documentation (TXT file)

data/genomic.fasta : example of fasta genomic region

data/est.fasta : example of fasta EST sequences

data/outfile.gmap : example of GMAP outfile in compressed format

data/README_data.txt : description of example files

Requirements

EasyCluster requires the GMAP package, which can be freely obtained from the GMAP home page: http://www.gene.com/share/gmap. Where possible we recommend version 2007-09-28, although newer releases should work as well.

EasyCluster can use C functions from the DNA_Stat package (by J. Kleffe) through the ctypes module to speed up EST storage and retrieval. Python 2.4 or later should therefore be used. To check your Python version, use the following command:

python -V

Our tests indicate that EasyCluster is stable under both Python 2.4 and 2.5. Releases earlier than 2.4 may not work correctly because they lack the modules needed to handle set data structures. In any case, EasyCluster verifies all dependencies and requirements before starting any calculation: if GMAP is not installed or the Python version is not suitable, the program raises an error and stops the analysis.
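The kind of pre-flight check described above can be sketched as follows. This is only an illustration and not the code used by easy.py: the function names are hypothetical, the executable is assumed to be called "gmap", and the default directory is the one documented for the -G option.

```python
import os
import sys

MIN_PYTHON = (2, 4)  # minimum version named in this documentation


def python_is_supported(version_info=None):
    """Return True if the interpreter version is at least MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def find_gmap(gmap_dir="/usr/local/bin"):
    """Return the path of an executable named 'gmap' in gmap_dir, or None."""
    candidate = os.path.join(gmap_dir, "gmap")  # assumed executable name
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


def check_requirements(gmap_dir="/usr/local/bin"):
    """Raise RuntimeError describing the first unmet requirement."""
    if not python_is_supported():
        raise RuntimeError("Python %d.%d or later is required" % MIN_PYTHON)
    if find_gmap(gmap_dir) is None:
        raise RuntimeError("gmap executable not found in %s" % gmap_dir)
```

Calling check_requirements() before a long analysis gives an early, readable error instead of a failure halfway through the run.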

Installation

To install EasyCluster type the following commands:

- gunzip easycls.tar.gz

- tar -xvf easycls.tar

- cd easycls

- python setup.py

Warning messages may appear during the installation. This is expected when the compilation is performed on different operating systems and compiler versions.

The installation procedure has been tested on Linux (RedHat and SUSE) and Mac OS X (10.4 and 10.5). In principle it should also work on Windows machines and on any system on which GMAP can be installed.

At the beginning of the setup the user is asked for a MAIN directory in which EasyCluster will be installed; if none is provided, the package uses a default directory. The MAIN directory also hosts the EasyCluster environment, so all scripts and required modules are placed there. An overview of the EasyCluster environment follows:

[MAIN directory]:

[bin directory] --- containing main python scripts

easy.py

getFastas.py

rmWorkdir

test.py

[lib directory] --- containing all needed modules

[python2.x directory]

[site-packages]

dna.so

modAS.py

modAS.pyc

modcluster.py

modcluster.pyc

moddraw.py

moddraw.pyc

[data directory] --- containing sample fasta files

est.fasta

genomic.fasta

outfile.gmap

README_data.txt

[doc directory] --- containing this documentation

easycluster.doc

easycluster.html

easycluster.txt

After the installation, change to the MAIN directory and type the following command to test the program:

./bin/test.py

Usage

EasyCluster has been designed to make EST clustering accessible to users with limited bioinformatics experience. It can be run from the command line or interactively. In interactive mode the program asks for input files and options step by step.

To run EasyCluster interactively type:

./bin/easy.py -i

To list the command line options type:

./bin/easy.py -h

List of available options:

-g to specify a file containing genomic sequence(s) in fasta format

-e to specify a file containing EST sequences in fasta format

-c minimum coverage (default=90.0)

-s minimum identity (default=95.0)

-U name of user defined GMAP database if available (does not require -g option)

-L location of user defined database by absolute path

-u name of user defined database of EST sequences

-l location of user defined database of EST sequences by absolute path

-G GMAP location by absolute path (default /usr/local/bin)

-r name of user defined GMAP outfile including its path

-f format of user defined GMAP outfile. This option currently has no effect because only files in compressed format (-Z option in GMAP) are taken into account.

-q perform quick clustering (not yet available)

-I include unspliced ESTs (not yet available)

-A enable the prediction of Alternative Splicing events per cluster

-w to specify a work directory (default=workdir)

-m main directory by absolute path (default=current directory)

-a remove database of EST sequences at the end

-b remove GMAP database of genomic sequence(s) at the end

-h show help

-v show software version

-i run EasyCluster interactively

Example using provided genomic and EST sequences

Change to your MAIN directory and then type:

./bin/easy.py -g data/genomic.fasta -e data/est.fasta -w easy_test

EasyCluster starts the analysis by reading the genomic and EST sequences from genomic.fasta and est.fasta respectively, and automatically creates a GMAP database and a database of ESTs. The program then runs GMAP and uses the results to build clusters. Databases, results and additional files are stored in the “easy_test” work directory. Every time EasyCluster starts a new analysis it creates the work directory named by the user, organized as follows:

[my_work_directory]

outfile.gmap --- EST to genome mapping file (GMAP outfile)

nopaths.txt --- ESTs with no matches after the mapping

[clusters] --- directory containing cluster details

[firstclustering] --- results of the first clustering step

[secondclustering] --- results of the second clustering step

[results]

clusters.gff --- detected clusters in GFF format (version 2)

clusters.txt --- detected clusters in tabular format

[database]

maindb --- local database storing clustering info

[GMAP] --- folder containing GMAP database

[ESTdb] --- folder containing local database of ESTs

[webresults]

report.html --- main HTML page to browse results

[webclusters] --- folder containing HTML files for each cluster

[webregions] --- folder containing HTML files for each region
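Because the nesting of the folders above may differ between releases, a script that post-processes a run can search the work directory for the output files by name rather than hard-coding their paths. The following is a minimal sketch under that assumption; the function name is hypothetical and the file names are the ones listed above.

```python
import os

# File names taken from the work directory overview above.
EXPECTED_FILES = (
    "outfile.gmap",
    "nopaths.txt",
    "clusters.gff",
    "clusters.txt",
    "report.html",
)


def locate_outputs(workdir):
    """Map each expected file name to its path under workdir (None if absent)."""
    found = dict.fromkeys(EXPECTED_FILES)
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            # Keep the first match only, in case a name occurs more than once.
            if name in found and found[name] is None:
                found[name] = os.path.join(root, name)
    return found
```

For the example run, locate_outputs("easy_test") returns a dictionary whose values are None for any file that was not produced.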

Fasta headers should be simple and must not contain spaces or non-canonical characters (such as |’#@*~&%$£”!/?\:.,;). Each header should identify exactly one sequence. Examples:

>genomic_sequence_1

AGTGACAGATGACAGTAGCAGTAGCAGT

>est_1

AGTGACAGATGACAGTAGCAGTAGCAGT

>est_2

AGTGACAGATGACAGTAGCAGTAGCAGT

>est_3

AGTGACAGATGACAGTAGCAGTAGCAGT

See data/genomic.fasta and data/est.fasta for more complete examples.

Clusters are reported in tabular and GFF formats so that they can be easily used in gene prediction pipelines.
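The header rules above can be checked before starting a run. The sketch below is an illustration only and is not part of the EasyCluster package: the function name is hypothetical, and the set of rejected characters is limited to the ones listed in this section (with straight and curly quotes both included), which the text presents as examples rather than a complete list.

```python
# Characters listed in this documentation, plus the curly quote forms.
FORBIDDEN = set("|'#@*~&%$£\"!/?\\:.,;" + "\u2019\u201d")


def check_fasta_headers(lines):
    """Return (line_number, header, problem) tuples for offending headers."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        header = line[1:].rstrip("\r\n")
        if not header:
            problems.append((number, header, "empty header"))
        elif any(ch.isspace() for ch in header):
            problems.append((number, header, "contains whitespace"))
        elif FORBIDDEN.intersection(header):
            problems.append((number, header, "contains a special character"))
        elif header in seen:
            problems.append((number, header, "duplicate header"))
        seen.add(header)
    return problems
```

A file can be checked with check_fasta_headers(open("data/est.fasta")); an empty list means no header broke the rules tested here.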

Example of tabular format:

SEQ1 + CLS_1 721 5997 97 S35320039 S30503603 S30503600…

SEQ1 + CLS_2 7860 11210 73 S37098609 S20526016 S30526400…

Each line contains 7 fields separated by a tab character. Fields are:

1 – name of the genomic region

2 – strand

3 – cluster number

4 – start coordinate

5 – end coordinate

6 – number of ESTs in the cluster

7 – list of EST names separated by spaces
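A line in this tabular format can be read with a few lines of Python. The sketch below follows the field description above (7 tab-separated fields, the last one a space-separated list); the names Cluster and parse_cluster_line are hypothetical and not part of the EasyCluster modules.

```python
from collections import namedtuple

Cluster = namedtuple(
    "Cluster", "region strand cluster_id start end est_count est_names"
)


def parse_cluster_line(line):
    """Parse one line of clusters.txt into a Cluster record."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, got %d" % len(fields))
    region, strand, cluster_id, start, end, est_count, names = fields
    return Cluster(
        region,
        strand,
        cluster_id,
        int(start),
        int(end),
        int(est_count),
        names.split(),  # EST names are separated by spaces
    )
```

Comparing est_count with len(est_names) is a cheap consistency check when clusters.txt is passed on to another tool.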

Example of GFF format (version 2):

## gff-version 2

SEQ1 genomicseq.fasta exon 733 901 98 + . SEQ1-CLS_1-S23782858

SEQ1 genomicseq.fasta exon 1970 2146 94 + . SEQ1-CLS_1-S23782858

SEQ1 genomicseq.fasta exon 732 901 100 + . SEQ1-CLS_1-S11705887

SEQ1 genomicseq.fasta exon 1970 2146 100 + . SEQ1-CLS_1-S11705887

SEQ1 genomicseq.fasta exon 2303 2366 100 + . SEQ1-CLS_1-S11705887

All 9 fields follow the GFF standard. For more details visit the GFF page at http://www.sanger.uk.

EasyCluster also generates a report page in HTML format (stored in the webresults folder), so that each detected cluster can be browsed and inspected graphically. Although the HTML pages should be compatible with all Internet browsers, we suggest using Mozilla Firefox.
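As an illustration of how clusters.gff can be consumed, the sketch below reads the 9 tab-separated GFF fields and collects the exon coordinates reported for each value of the last field (for example SEQ1-CLS_1-S11705887). The function names are hypothetical, and the interpretation of the last field as region, cluster and EST name is inferred from the example above rather than stated by the format.

```python
from collections import defaultdict


def parse_gff2_line(line):
    """Split one GFF version 2 record into a dictionary of its 9 fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attribute = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,  # kept as text; GFF allows "." for a missing score
        "strand": strand,
        "frame": frame,
        "attribute": attribute,
    }


def exons_by_attribute(lines):
    """Group (start, end) exon coordinates by the value of the last field."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        record = parse_gff2_line(line)
        if record["feature"] == "exon":
            grouped[record["attribute"]].append((record["start"], record["end"]))
    return dict(grouped)
```

The resulting dictionary gives, for each aligned EST, the list of exons in file order, which is a convenient starting point for comparing exon structures within a cluster.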

If a GMAP database is available, EasyCluster does not need a genomic file:

./bin/easy.py -U gmap_database -L /gmap_database/location -e est.fasta -w test2

Moreover, if you have a GMAP outfile in compressed format, EasyCluster does not need to run GMAP again, saving time:

./bin/easy.py -g genomic.fasta -e est.fasta -r outfile.gmap -w test3
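When EasyCluster is driven from another script, the invocations shown in this section can be assembled programmatically. The helper below is a hypothetical sketch that only uses options documented above (-g, -e, -U, -L, -r, -w); it builds the argument list and leaves the actual execution to the caller.

```python
def build_easy_command(workdir, est, genomic=None, gmap_db=None,
                       gmap_db_dir=None, gmap_outfile=None,
                       script="./bin/easy.py"):
    """Return the argument list for one easy.py run."""
    if genomic is None and gmap_db is None:
        raise ValueError(
            "provide a genomic fasta file (-g) or a GMAP database (-U)"
        )
    cmd = [script]
    if genomic is not None:
        cmd += ["-g", genomic]
    if gmap_db is not None:
        cmd += ["-U", gmap_db]
        if gmap_db_dir is not None:
            cmd += ["-L", gmap_db_dir]
    cmd += ["-e", est]
    if gmap_outfile is not None:
        cmd += ["-r", gmap_outfile]  # reuse an existing GMAP outfile
    cmd += ["-w", workdir]
    return cmd
```

The returned list can be passed to subprocess.call() from the MAIN directory; building a list rather than a single string avoids quoting problems with paths that contain unusual characters.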

Author and contact

EasyCluster has been developed in the Pesole-Lab at the University of Bari (Italy). For questions not covered in this brief documentation, or to report bugs, please contact E. Picardi at e.picardi@biologia.uniba.it.

New releases, updates and known errors can be found at www.pesolelab.it/easycluster/.