EasyCluster – A program for clustering EST/FL-cDNA sequences

 

Description

 

EasyCluster is a specialized Python program to build gene-oriented clusters of ESTs given a genomic sequence and a related set of ESTs and/or FL-cDNAs. The complete package can be freely downloaded from http://www.pesolelab.it/easycluster/.

The current release contains the following files:

            easy.py : main EasyCluster script

            getFastas.py : accessory script to extract EST and genomic sequences in fasta format

            rmWorkdir : accessory script to remove a work directory

            test.py : accessory script to test EasyCluster after installation

            setup.py : script to install EasyCluster and its environment

            modcluster.py : module to handle EST clusters

            moddraw.py : module to draw EST clusters

            modAS.py : module to predict Alternative Splicing events per cluster

            src/dna_base.c : source code of DNA_Stat package

            src/dna_base.h : source code of DNA_Stat package

            src/dna_stat.c : source code of DNA_Stat package

            src/dna_stat.h : source code of DNA_Stat package

            doc/easycluster.doc : EasyCluster documentation (MS Word file)

            doc/easycluster.html : EasyCluster documentation (HTML file)

            doc/easycluster.txt : EasyCluster documentation (TXT file)

            data/genomic.fasta : example of fasta genomic region

            data/est.fasta : example of fasta EST sequences

            data/outfile.gmap : example of GMAP outfile in compressed format

            data/README_data.txt : description of example files

 

Requirements

 

EasyCluster requires the GMAP package. A copy can be freely obtained from the GMAP home page: http://www.gene.com/share/gmap. If possible, we recommend version 2007-09-28, although newer releases should work as well.

EasyCluster can use C functions from the DNA_Stat package (by J. Kleffe) through the ctypes module to speed up EST storage and retrieval. Therefore, Python versions higher than 2.4 should be used. To check your Python version, use the following command:

 

python -V

 

Our tests indicate that EasyCluster is stable under both Python 2.4 and 2.5. Releases earlier than 2.4, however, may not work correctly because they may lack the modules needed to handle set data structures. In any case, EasyCluster verifies all dependencies and requirements before starting any calculation. If GMAP is not installed or the Python version is not suitable, the program raises an error and interrupts the analysis.
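The kind of check described above can be sketched in a few lines of Python. This is only an illustration and not the code used by easy.py: the function names are hypothetical, the minimum version (2.4) is taken from the text above, and the default GMAP location is the default of the -G option.

```python
import os
import sys


def python_is_supported(version_info, minimum=(2, 4)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def gmap_is_installed(gmap_dir="/usr/local/bin"):
    """Return True if an executable file named 'gmap' exists in gmap_dir."""
    gmap_path = os.path.join(gmap_dir, "gmap")
    return os.path.isfile(gmap_path) and os.access(gmap_path, os.X_OK)


def check_requirements(gmap_dir="/usr/local/bin"):
    """Raise RuntimeError if a requirement is missing (hypothetical helper)."""
    if not python_is_supported(sys.version_info):
        raise RuntimeError("Python 2.4 or higher is required")
    if not gmap_is_installed(gmap_dir):
        raise RuntimeError("GMAP executable not found in %s" % gmap_dir)
```

Calling check_requirements() before an analysis would stop the run early with a readable message instead of failing halfway through.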

 

Installation

 

To install EasyCluster type the following commands:

 

      - gunzip easycls.tar.gz

      - tar -xvf easycls.tar

      - cd easycls

      - python setup.py

 

Warning messages may appear during the installation. This is expected when the compilation is performed on different operating systems and compiler versions.

The installation procedure has been tested on Linux (RedHat and SUSE) and Mac OS X (10.4 and 10.5) operating systems. In principle, it should also work on Windows machines and on any system on which GMAP can be installed.

At the beginning of the setup, the user is asked to provide a MAIN directory in which EasyCluster will be installed; if none is given, the package uses a default directory. The MAIN directory is also used to create the appropriate EasyCluster environment, so that all scripts and required modules are placed in the MAIN directory. An overview of the EasyCluster environment is as follows:

 

[MAIN directory]:

            [bin directory] --- containing main python scripts

                       easy.py

                       getFastas.py

                       rmWorkdir

                       test.py

            [lib directory] --- containing all needed modules

                       [python2.x directory]

                                   [site-packages]

                                               dna.so

                                               modAS.py

                                               modAS.pyc

                                               modcluster.py

                                               modcluster.pyc

                                               moddraw.py

                                               moddraw.pyc

            [data directory] --- containing sample fasta files

                       est.fasta

                       genomic.fasta

                       outfile.gmap

                       README_data.txt

            [doc directory] --- containing this documentation

                       easycluster.doc

                       easycluster.html

                       easycluster.txt

 

After the installation, move into the MAIN directory and type the following command to test the program:

 

./bin/test.py

 

Usage

 

EasyCluster has been designed to make EST clustering accessible to users with limited bioinformatics experience. It can be run either from the command line or interactively. In interactive mode, the program asks for input files and options step by step.

 

To run EasyCluster interactively type:

            ./bin/easy.py -i

To see the available command line options type:

            ./bin/easy.py -h

 

List of available options:

            -g        to specify a file containing genomic sequence(s) in fasta format

            -e         to specify a file containing EST sequences in fasta format

            -c         minimum coverage (default=90.0)

            -s         minimum identity (default=95.0)

            -U       name of user defined GMAP database if available (does not require -g option)

            -L        location of user defined database by absolute path

            -u        name of user defined database of EST sequences

            -l         location of user defined database of EST sequences by absolute path

            -G       GMAP location by absolute path (default /usr/local/bin)

            -r         name of user defined GMAP outfile including its path

            -f         format of user defined GMAP outfile. Currently not relevant, because only files in compressed format (-Z option in GMAP) are taken into account.

            -q        perform quick clustering (not yet available)

            -I         include unspliced ESTs (not yet available)

            -A       enable the prediction of Alternative Splicing events per cluster

            -w       to specify a work directory (default=workdir)

            -m       main directory by absolute path (default=current directory)

            -a         remove database of EST sequences at the end

            -b        remove GMAP database of genomic sequence(s) at the end

            -h        show help

            -v        show software version

            -i         run EasyCluster interactively
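When EasyCluster is called from another script, the options above can be assembled programmatically. The following is a minimal sketch that uses only flags from the list; the function name build_easy_command and its defaults are hypothetical, and the way numeric values are written (for example 90.0) is assumed from the defaults shown above.

```python
def build_easy_command(genomic, ests, workdir="workdir", coverage=90.0,
                       identity=95.0, predict_as=False,
                       script="./bin/easy.py"):
    """Return the argument list for a command line run of easy.py."""
    command = [script, "-g", genomic, "-e", ests,
               "-c", str(coverage), "-s", str(identity), "-w", workdir]
    if predict_as:
        # -A enables the prediction of Alternative Splicing events
        command.append("-A")
    return command


cmd = build_easy_command("data/genomic.fasta", "data/est.fasta",
                         workdir="easy_test", predict_as=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.call(cmd) from the MAIN directory to start the analysis.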

 

Example using provided genomic and EST sequences

 

Move into your MAIN directory and then type:

 

            ./bin/easy.py -g data/genomic.fasta -e data/est.fasta -w easy_test

 

EasyCluster starts the analysis by reading the genomic and EST sequences from the genomic.fasta and est.fasta files, respectively, and automatically creates a GMAP database and a database of ESTs. The program then runs GMAP and uses the results to build clusters. Databases, results and additional files are stored in the "easy_test" work directory. Every time EasyCluster starts a new analysis, it creates the user-provided work directory, which is organized as follows:

 

[my_work_directory]

            outfile.gmap --- EST to genome mapping file (GMAP outfile)

            nopaths.txt --- ESTs with no matches after the mapping

            [clusters] --- directory containing cluster details

                       [firstclustering] --- results of the first clustering step

                       [secondclustering] --- results of the second clustering step

                       [results]

                                   clusters.gff --- detected clusters in GFF format (version 2)

                                   clusters.txt --- detected clusters in tabular format

            [database]

                       maindb --- local database storing clustering info

                       [GMAP] --- folder containing GMAP database

                       [ESTdb] --- folder containing local database of ESTs

            [webresults]

                       report.html --- main HTML page to browse results

                       [webclusters] --- folder containing HTML files for each cluster

                       [webregions] --- folder containing HTML files for each region
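Given the layout above, a small helper can locate the main result files of a finished run. This is a sketch only: the function names and dictionary keys are hypothetical, while the paths are those listed in the layout.

```python
import os


def result_paths(workdir):
    """Map short names to the result files listed in the layout above."""
    results = os.path.join(workdir, "clusters", "results")
    return {
        "gmap": os.path.join(workdir, "outfile.gmap"),
        "nopaths": os.path.join(workdir, "nopaths.txt"),
        "gff": os.path.join(results, "clusters.gff"),
        "table": os.path.join(results, "clusters.txt"),
        "report": os.path.join(workdir, "webresults", "report.html"),
    }


def missing_results(workdir):
    """Return the sorted names of expected result files that do not exist."""
    paths = result_paths(workdir)
    return sorted(name for name, path in paths.items()
                  if not os.path.isfile(path))
```

An empty list returned by missing_results("easy_test") would suggest that the run produced all the files named in the layout.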

 

Fasta files should contain simple headers without spaces or non-canonical characters (such as |'#@*~&%$£"!/?\:.,;). Each header should identify exactly one sequence. Examples:

 

>genomic_sequence_1

AGTGACAGATGACAGTAGCAGTAGCAGT

AGTGACAGATGACAGTAGCAGTAGCAGT

AGTGACAGATGACAGTAGCAGTAGCAGT

 

>est_1

AGTGACAGATGACAGTAGCAGTAGCAGT

AGTGACAGATGACAGTAGCAGTAGCAGT

>est_2

AGTGACAGATGACAGTAGCAGTAGCAGT

AGTGACAGATGACAGTAGCAGTAGCAGT

>est_3

AGTGACAGATGACAGTAGCAGTAGCAGT

AGTGACAGATGACAGTAGCAGTAGCAGT

 

See data/genomic.fasta and data/est.fasta for more complete examples.

Clusters are written in both tabular and GFF format so that they can be easily used in gene prediction pipelines.
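Headers can be checked before running the program. The sketch below is not part of EasyCluster; it makes the conservative assumption that only letters, digits and underscores are safe (as in the examples above), and it also reports headers that occur more than once.

```python
import re

# Assumption: only letters, digits and underscores are treated as safe.
SAFE_HEADER = re.compile(r"^[A-Za-z0-9_]+$")


def check_fasta_headers(lines):
    """Return a list of (line_number, header, problem) tuples."""
    problems = []
    seen = set()
    for number, line in enumerate(lines, 1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        header = line[1:].rstrip("\r\n")
        if not SAFE_HEADER.match(header):
            problems.append((number, header, "unsafe characters"))
        elif header in seen:
            problems.append((number, header, "duplicate header"))
        seen.add(header)
    return problems


example = [">est_1\n", "AGTGACAG\n", ">est 2|x\n", "AGTGACAG\n",
           ">est_1\n", "AGTG\n"]
print(check_fasta_headers(example))
```

For a real file, the lines can be read with open("data/est.fasta") and passed to check_fasta_headers directly.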

Example of tabular format:

 

SEQ1  +     CLS_1 721   5997  97    S35320039 S30503603 S30503600...

SEQ1  +     CLS_2 7860  11210 73    S37098609 S20526016 S30526400...

 

Each line contains 7 fields separated by a tab character. Fields are:

1 – name of the genomic region

2 – strand

3 – cluster number

4 – start coordinate

5 – end coordinate

6 – number of ESTs in the cluster

7 – list of EST names separated by space
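The seven fields can be read with a few lines of Python. This is a minimal sketch based only on the field description above; the function name, the dictionary keys and the EST names in the usage line are hypothetical.

```python
def parse_cluster_line(line):
    """Split one line of the tabular cluster file into a dictionary."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, found %d"
                         % len(fields))
    region, strand, cluster, start, end, count, ests = fields
    return {
        "region": region,
        "strand": strand,
        "cluster": cluster,
        "start": int(start),
        "end": int(end),
        "est_count": int(count),
        "ests": ests.split(),  # EST names are separated by spaces
    }


# Hypothetical line with three made-up EST names
record = parse_cluster_line("SEQ1\t+\tCLS_1\t721\t5997\t3\tS1 S2 S3\n")
print(record["cluster"])
```

Comparing record["est_count"] with len(record["ests"]) is a simple consistency check on each line.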

 

Example of GFF format (version 2):

 

## gff-version 2

SEQ1  genomicseq.fasta  exon  733   901   98    +     .     SEQ1-CLS_1-S23782858

SEQ1  genomicseq.fasta  exon  1970  2146  94    +     .     SEQ1-CLS_1-S23782858

SEQ1  genomicseq.fasta  exon  732   901   100   +     .     SEQ1-CLS_1-S11705887

SEQ1  genomicseq.fasta  exon  1970  2146  100   +     .     SEQ1-CLS_1-S11705887

SEQ1  genomicseq.fasta  exon  2303  2366  100   +     .     SEQ1-CLS_1-S11705887

 

All 9 fields follow the GFF standard. For more details visit the GFF page at http://www.sanger.uk.
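As an illustration of how the GFF output can be used downstream, the sketch below groups exon coordinates by the ninth field, which in the example above combines region, cluster and EST name. It assumes that fields are separated by tabs, as in the GFF version 2 standard; the function name is hypothetical.

```python
def exons_by_est(gff_lines):
    """Group exon (start, end) pairs by the ninth GFF field."""
    groups = {}
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "## gff-version" header
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        start, end, label = int(fields[3]), int(fields[4]), fields[8]
        groups.setdefault(label, []).append((start, end))
    return groups


gff = [
    "## gff-version 2",
    "SEQ1\tgenomicseq.fasta\texon\t733\t901\t98\t+\t.\tSEQ1-CLS_1-S23782858",
    "SEQ1\tgenomicseq.fasta\texon\t1970\t2146\t94\t+\t.\tSEQ1-CLS_1-S23782858",
    "SEQ1\tgenomicseq.fasta\texon\t732\t901\t100\t+\t.\tSEQ1-CLS_1-S11705887",
]
groups = exons_by_est(gff)
print(groups["SEQ1-CLS_1-S23782858"])
```

Each key then holds the ordered exon list of one EST, which is a convenient starting point for comparing exon structures within a cluster.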

 

EasyCluster also generates a report page in HTML format (stored in the webresults folder), so that each detected cluster can be browsed and inspected graphically. Although the HTML pages should be compatible with all Internet browsers, we suggest using Mozilla Firefox.

 

If a GMAP database is available, EasyCluster does not need a genomic file:

 

./bin/easy.py -U gmap_database -L /gmap_database/location -e est.fasta -w test2

 

Moreover, if you have a GMAP outfile in compressed format, EasyCluster does not need to run GMAP again, saving time:

 

./bin/easy.py -g genomic.fasta -e est.fasta -r outfile.gmap -w test3

 

Author and contact

 

EasyCluster has been developed in the Pesole-Lab at the University of Bari (Italy). For questions not covered in this brief documentation, or to report bugs, please do not hesitate to contact E. Picardi at e.picardi@biologia.uniba.it.

New releases, updates and known errors can be found at www.pesolelab.it/easycluster/.