How to prepare a Genome

To organize your data according to the GeneViTo format, follow these instructions:

1. Essential files

To use the GeneViTo GUI you must provide at least the following files:

*.ptt = Protein Table

*.ffn = FASTA nucleotide coding regions file 

*.fna = FASTA Nucleic Acid file

These files are available through the GenBank FTP site: ftp://ftp.ncbi.nih.gov/genbank/genomes/

For example, to obtain the files listed above for Chlamydia trachomatis, visit ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Chlamydia_trachomatis/, where all of them are stored.
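A quick check that the three GenBank files are in place before starting the GUI could look like the sketch below. The function name and the idea of keeping the files together in one directory are assumptions made for illustration; GeneViTo itself asks for each file through a File Chooser.

```python
from pathlib import Path
from typing import List

# Extensions of the three essential GenBank files named above.
ESSENTIAL_EXTENSIONS = (".ptt", ".ffn", ".fna")


def missing_essential_files(genome_dir: str) -> List[str]:
    """Return the essential extensions that have no matching file in genome_dir."""
    present = {p.suffix.lower() for p in Path(genome_dir).iterdir() if p.is_file()}
    return [ext for ext in ESSENTIAL_EXTENSIONS if ext not in present]
```

An empty list means that a .ptt, a .ffn and a .fna file were all found; the proteome file is not covered by this check.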

A proteome file is also required, and it must follow the SWISS-PROT format. It can be obtained, for example, by downloading the entire proteome from the SWISS-PROT database through the SRS system (http://srs.ebi.ac.uk/), or from the Proteome Analysis Server at EBI (http://www.ebi.ac.uk/proteome/index.html?http://www.ebi.ac.uk/proteome/ProteomeSource.html).

e.g.

ID 6PGD_CHLTR STANDARD; PRT; 480 AA.

AC O84066;

DT 30-MAY-2000 (Rel. 39, Created)

DT 30-MAY-2000 (Rel. 39, Last sequence update)

DT 30-MAY-2000 (Rel. 39, Last annotation update)

DE 6-PHOSPHOGLUCONATE DEHYDROGENASE, DECARBOXYLATING (EC 1.1.1.44).

GN GND OR CT063.

OS Chlamydia trachomatis.

OC Bacteria; Chlamydiales; Chlamydiaceae; Chlamydia.

RN [1]
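To see which entries a proteome file contains, the ID lines can be scanned as in the following sketch. It assumes the ID line layout shown in the example above (entry name, data class, PRT, length in AA); the function name is hypothetical and the sketch does not validate the rest of the entry.

```python
import re
from typing import Iterable, Iterator, Tuple

# ID line as in the example above: "ID 6PGD_CHLTR STANDARD; PRT; 480 AA."
ID_LINE = re.compile(r"^ID\s+(\S+)\s+\S+;\s+PRT;\s+(\d+)\s+AA\.")


def iter_entries(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (entry name, sequence length) for every ID line found."""
    for line in lines:
        match = ID_LINE.match(line)
        if match:
            yield match.group(1), int(match.group(2))
```

Passing an open file object, e.g. `list(iter_entries(open("proteome.txt")))` with a file name of your own, lists every entry name together with its length.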

 

2. Additional files

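These files are not essential for the GUI to work. They supply additional information on genomic/proteomic elements: structural RNAs, COGs functional classes, the genetic code, codon usage, restriction enzyme sites, and PRED-CLASS, PRED-TMR2/orienTM and SignalP predictions. Each of them is described below.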

The structural RNAs file is available through http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html. Select the genome of your choice, e.g. for Chlamydia trachomatis click on the entry in the second column (NC_000117). In the new page that appears, select the Structural RNAs option from the feature table. A new page opens; from the format menu of the report shown on that page, select FASTA nucleotide and save the result in a plain text file.

The COGs files are available through http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html. Select the genome of your choice, e.g. for Chlamydia trachomatis click on the entry in the second column (NC_000117). In the new page that appears, the top right column offers the following options:

BLAST protein homologs:
COGs (Clusters of Orthologous Groups)
3D Structure (Sequences with known structure)
TaxMap (Sequences grouped by superkingdom)
TaxPlot (3-way genome comparison)
CDD (Conserved Domain Database)

Select the COGs (Clusters of Orthologous Groups) option. This opens a new page that displays the COGs functional classes for the organism:

Code COGs Description
J 113  Translation
A 0  RNA processing and modification
K 21  Transcription
L 54  Replication, recombination and repair
B 0  Chromatin structure and dynamics
D 9  Cell cycle control, mitosis and meiosis
Y 0  Nuclear structure
V 3  Defense mechanisms
T 12  Signal transduction mechanisms
M 38  Cell wall/membrane biogenesis
N 14  Cell motility
Z 0  Cytoskeleton
W 0  Extracellular structures
U 19  Intracellular trafficking and secretion
O 35  Posttranslational modification, protein turnover, chaperones
C 41  Energy production and conversion
G 35  Carbohydrate transport and metabolism
E 51  Amino acid transport and metabolism
F 16  Nucleotide transport and metabolism
H 35  Coenzyme transport and metabolism
I 34  Lipid transport and metabolism
P 18  Inorganic ion transport and metabolism
Q 3  Secondary metabolites biosynthesis, transport and catabolism
R 60  General function prediction only
S 30  Function unknown
- 254  not in COGs

For each category, click on its entry in the second column (COGs); a new page appears. Save this page as a .txt file (using the browser's Save As option). Repeat this for all the categories and give the resulting files the following filenames:

Amino acid transport and metabolism.txt
Carbohydrate transport and metabolism.txt
Cell cycle control, mitosis and meiosis.txt
Cell motility.txt
Cell wallmembrane biogenesis.txt
Chromatin structure and dynamics.txt
Coenzyme transport and metabolism.txt
Cytoskeleton.txt
Defense mechanisms.txt
Energy production and conversion.txt
Extracellular structures.txt
Function unknown.txt
General function prediction only.txt
Inorganic ion transport and metabolism.txt
Intracellular trafficking and secretion.txt
Lipid transport and metabolism.txt
Nuclear structure.txt
Nucleotide transport and metabolism.txt
Posttranslational modification, protein turnover, chaperones.txt
RNA processing and modification.txt
Replication, recombination and repair.txt
Secondary metabolites biosynthesis, transport and catabolism.txt
Signal transduction mechanisms.txt
Transcription.txt
Translation.txt
not in COGs.txt

Put these files inside a single directory, with a name of your choice.
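The required filenames follow directly from the category descriptions in the COGs table, with the "/" dropped (compare "Cell wall/membrane biogenesis" with "Cell wallmembrane biogenesis.txt"). The sketch below derives the names and reports which files are still missing from a directory; the function names are hypothetical and the check is only a convenience, not part of GeneViTo.

```python
from pathlib import Path
from typing import List

# Category descriptions exactly as listed in the COGs table above.
COG_CATEGORIES = [
    "Translation",
    "RNA processing and modification",
    "Transcription",
    "Replication, recombination and repair",
    "Chromatin structure and dynamics",
    "Cell cycle control, mitosis and meiosis",
    "Nuclear structure",
    "Defense mechanisms",
    "Signal transduction mechanisms",
    "Cell wall/membrane biogenesis",
    "Cell motility",
    "Cytoskeleton",
    "Extracellular structures",
    "Intracellular trafficking and secretion",
    "Posttranslational modification, protein turnover, chaperones",
    "Energy production and conversion",
    "Carbohydrate transport and metabolism",
    "Amino acid transport and metabolism",
    "Nucleotide transport and metabolism",
    "Coenzyme transport and metabolism",
    "Lipid transport and metabolism",
    "Inorganic ion transport and metabolism",
    "Secondary metabolites biosynthesis, transport and catabolism",
    "General function prediction only",
    "Function unknown",
    "not in COGs",
]


def cog_filename(description: str) -> str:
    """Filename for a category: the '/' is dropped and '.txt' is appended."""
    return description.replace("/", "") + ".txt"


def missing_cog_files(cog_dir: str) -> List[str]:
    """Return the expected COGs filenames that are not present in cog_dir."""
    present = {p.name for p in Path(cog_dir).iterdir()}
    return [name for name in map(cog_filename, COG_CATEGORIES) if name not in present]
```

Running `missing_cog_files` on the directory you created should return an empty list once all 26 pages have been saved.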
 

The genetic code file is available through http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html. Select the genome of your choice, e.g. for Chlamydia trachomatis click on the entry in the second column (NC_000117). In the new page that appears, find the Genetic Code link under the circular map of the organism:


Organism: Chlamydia trachomatis
Genetic Code: 11
Lineage: Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia.

Clicking the 11 opens a new page:

11. The Bacterial and Plant Plastid Code (transl_table=11)

 

    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M---------------M------------MMMM---------------M------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Click here to change format

Select the Click here to change format link and the page changes to:

11. The Bacterial and Plant Plastid Code (transl_table=11)

 

TTT  F Phe      TCT  S Ser      TAT  Y Tyr      TGT  C Cys  
TTC  F Phe      TCC  S Ser      TAC  Y Tyr      TGC  C Cys  
TTA  L Leu      TCA  S Ser      TAA  * Ter      TGA  * Ter  
TTG  L Leu i    TCG  S Ser      TAG  * Ter      TGG  W Trp  

CTT  L Leu      CCT  P Pro      CAT  H His      CGT  R Arg  
CTC  L Leu      CCC  P Pro      CAC  H His      CGC  R Arg  
CTA  L Leu      CCA  P Pro      CAA  Q Gln      CGA  R Arg  
CTG  L Leu i    CCG  P Pro      CAG  Q Gln      CGG  R Arg  

ATT  I Ile i    ACT  T Thr      AAT  N Asn      AGT  S Ser  
ATC  I Ile i    ACC  T Thr      AAC  N Asn      AGC  S Ser  
ATA  I Ile i    ACA  T Thr      AAA  K Lys      AGA  R Arg  
ATG  M Met i    ACG  T Thr      AAG  K Lys      AGG  R Arg  

GTT  V Val      GCT  A Ala      GAT  D Asp      GGT  G Gly  
GTC  V Val      GCC  A Ala      GAC  D Asp      GGC  G Gly  
GTA  V Val      GCA  A Ala      GAA  E Glu      GGA  G Gly  
GTG  V Val i    GCG  A Ala      GAG  E Glu      GGG  G Gly  

Click here to change format

The region highlighted in "red" above must be saved (copy and paste) in a text file with a name of your choice.
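Once the table has been saved, its contents can be sanity-checked by reading it back. The sketch below assumes the four-column layout shown above (codon, one-letter code, three-letter code, and an optional "i" marking an initiation codon); the function name is hypothetical and this is not GeneViTo's own reader.

```python
import re
from typing import Dict, Set, Tuple

# One table cell, e.g. "TTG  L Leu i": codon, one-letter amino acid code,
# three-letter code, and an optional "i" for initiation codons.
CELL = re.compile(r"\b([ACGT]{3})\s+(\S)\s+[A-Za-z]{3}(\s+i\b)?")


def parse_genetic_code(text: str) -> Tuple[Dict[str, str], Set[str]]:
    """Return (codon -> one-letter amino acid, set of initiation codons)."""
    codon_to_aa: Dict[str, str] = {}
    starts: Set[str] = set()
    for codon, amino_acid, init in CELL.findall(text):
        codon_to_aa[codon] = amino_acid
        if init:
            starts.add(codon)
    return codon_to_aa, starts
```

A complete table should yield 64 codons; stop codons appear with "*" as their one-letter code.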

The codon usage file is available through http://www.kazusa.or.jp/codon/. On the main page, select the letter that corresponds to the first letter of the name of your organism of choice.

Alphabetical lists of all organisms
A  B  C  D  E  F  G  H  I  J  K  L  M

N  O  P  Q  R  S  T  U  V  W  X  Y  Z

Chloroplast  Mitochondrion

Others (intials are not capital)

e.g. for Chlamydia trachomatis, select C, and in the new page select Chlamydia trachomatis [gbbct]: 1160.

The page that appears shows the following:

Chlamydia trachomatis [gbbct]: 1160 CDS's (394899 codons)
fields: [triplet] [frequency: per thousand] ([number])
UUU 30.0( 11828)  UCU 33.8( 13336)  UAU 19.9(  7853)  UGU 10.6(  4192)
UUC 17.0(  6717)  UCC 12.0(  4727)  UAC  9.9(  3914)  UGC  6.2(  2452)
UUA 32.0( 12628)  UCA  9.6(  3780)  UAA  1.6(   633)  UGA  0.4(   165)
UUG 19.6(  7756)  UCG  6.5(  2579)  UAG  0.9(   363)  UGG  9.3(  3690)

CUU 22.9(  9043)  CCU 23.8(  9395)  CAU 15.5(  6120)  CGU 12.6(  4985)
CUC 11.1(  4397)  CCC  5.5(  2155)  CAC  6.3(  2493)  CGC  7.6(  3009)
CUA 14.6(  5763)  CCA  9.9(  3917)  CAA 27.4( 10821)  CGA  9.4(  3697)
CUG 10.0(  3962)  CCG  3.9(  1523)  CAG 14.0(  5531)  CGG  3.9(  1559)

AUU 34.1( 13483)  ACU 17.9(  7076)  AAU 24.4(  9625)  AGU 10.0(  3967)
AUC 19.7(  7776)  ACC  8.8(  3491)  AAC 11.0(  4355)  AGC  8.7(  3426)
AUA 11.1(  4385)  ACA 18.1(  7154)  AAA 41.9( 16528)  AGA 11.9(  4717)
AUG 19.8(  7831)  ACG  8.1(  3192)  AAG 17.0(  6715)  AGG  2.4(   964)

GUU 25.2(  9956)  GCU 36.8( 14523)  GAU 35.0( 13809)  GGU 12.5(  4934)
GUC  9.8(  3882)  GCC 10.1(  3984)  GAC 10.7(  4210)  GGC  8.8(  3484)
GUA 17.7(  6992)  GCA 21.0(  8290)  GAA 41.8( 16525)  GGA 28.1( 11109)
GUG 12.8(  5036)  GCG  9.4(  3699)  GAG 23.3(  9182)  GGG 14.2(  5616)

Coding GC 41.53% 1st letter GC 51.56% 2nd letter GC 39.20% 3rd letter GC 33.85%

The region highlighted in "red" above must be saved (copy and paste) in a text file with a name of your choice.
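The saved codon usage text can be read back with a short script, for example to confirm that nothing was lost while copying. This is a minimal sketch that assumes the "[triplet] [frequency: per thousand] ([number])" layout shown above; the function name is hypothetical.

```python
import re
from typing import Dict, Tuple

# One entry, e.g. "UUU 30.0( 11828)": triplet, frequency per thousand, count.
ENTRY = re.compile(r"([ACGU]{3})\s+(\d+(?:\.\d+)?)\(\s*(\d+)\)")


def parse_codon_usage(text: str) -> Dict[str, Tuple[float, int]]:
    """Return triplet -> (frequency per thousand, number of occurrences)."""
    return {
        triplet: (float(freq), int(count))
        for triplet, freq, count in ENTRY.findall(text)
    }
```

A complete table should give 64 triplets.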

Restriction enzyme sites are obtained with Webcutter (http://www.firstmarket.com/cutter/cut2.html). Before running Webcutter, you must select among the following options on its page:

Please enter a title for this sequence:

Paste the DNA sequence into the box below

Please select the type of analysis you would like
Linear sequence analysis
Circular sequence analysis
Find sites which may be introduced by silent mutagenesis

 

Please indicate how you would like the restriction sites displayed
Map of restriction sites
Table of sites, sorted alphabetically by enzyme name
Table of sites, sorted sequentially by base pair number

Please indicate which enzymes to include in the display
All enzymes
Enzymes not cutting
Enzymes cutting once
Enzymes cutting exactly times
Enzymes cutting at least times, and at most times
highlights for enzymes from the polylinker

Please indicate which enzymes to include in the analysis
All enzymes in the database
Only enzymes with recognition sites equal to or greater than bases long
Only the following enzymes:
Use the command, control, or shift key to select multiple entries

Here is a sample of Webcutter output:

AatI          7   9333 13482 19521 42397 48711     agg/cct            More info
                  56867 57843
AatII         3   1929 49019 80792                 gacgt/c            More info
Acc113I       12  6298 11712 17493 24282 31362     agt/act            More info
                  32362 38424 45260 76393 76399
                  79306 80868
Acc16I        17  3122 3333 17333 19198 20267      tgc/gca            More info
                  27062 28425 29882 40655 46328
                  59270 61994 63929 66818 68610
                  71292 74072

For an entire genome you must provide a file with all the desired restriction enzymes in the format shown above (the "yellow region"). Bear in mind that Webcutter accepts only a limited input per run, so a whole genome has to be submitted as a series of sequence segments, and the cut positions (in bp) are numbered relative to the start of each submitted segment. For example, if you submit the segment that corresponds to positions 1000-2000 bp of a genome, the output numbers the restriction enzyme cut sites starting from position 1 bp. The GeneViTo file format requires a single file in the above format, in which the cut positions correspond to their actual positions on the genome.
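The renumbering described above can be sketched as follows: parse the output of each segment, shift its positions by the segment's start on the genome, and merge the segments. The function names are hypothetical, the parser assumes the column layout of the sample output (enzyme, number of cuts, positions, recognition site, with continuation lines holding only positions), and writing the merged result back in the Webcutter layout is left out.

```python
from typing import Dict, Iterable, List


def parse_webcutter(text: str) -> Dict[str, List[int]]:
    """Return enzyme name -> cut positions, as numbered by Webcutter."""
    sites: Dict[str, List[int]] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        tokens = line.split()
        if not line[0].isspace():
            # New enzyme row: name, number of cuts, then positions.
            current = tokens[0]
            sites[current] = []
            tokens = tokens[2:]
        if current is None:
            continue
        for token in tokens:
            if not token.isdigit():
                break  # recognition site and "More info" end the positions
            sites[current].append(int(token))
    return sites


def to_genome_positions(sites: Dict[str, List[int]], segment_start: int) -> Dict[str, List[int]]:
    """Shift positions of a segment whose first base is genome position segment_start (1-based)."""
    return {enzyme: [p + segment_start - 1 for p in positions]
            for enzyme, positions in sites.items()}


def merge_segments(per_segment: Iterable[Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Combine already shifted segment results into one sorted table."""
    merged: Dict[str, List[int]] = {}
    for sites in per_segment:
        for enzyme, positions in sites.items():
            merged.setdefault(enzyme, []).extend(positions)
    return {enzyme: sorted(positions) for enzyme, positions in merged.items()}
```

With this convention a cut reported at position 1 of the 1000-2000 bp segment maps to genome position 1000. Note that a recognition site straddling the boundary between two segments is not reported by either run unless the submitted segments overlap.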

The PRED-CLASS algorithm is available through http://biophysics.biol.uoa.gr/PRED-CLASS/, where the following options are offered:

Run PRED-CLASS on a sequence

Browse detailed results obtained with the algorithm on several test sets

View lists of data used to train and evaluate PRED-CLASS

Go to the Biophysics Lab Homepage

 

Select Run PRED-CLASS on a sequence in order to run PRED-CLASS. In the new page that appears, insert the protein sequence in the provided area. This option runs one protein at a time. For an entire proteome, a batch run might be available upon request to the PRED-CLASS authors. In addition, a list of PRED-CLASS results for 21 prokaryotic and 6 archaeal entire proteomes is available through http://biophysics.biol.uoa.gr/PRED-CLASS/Results/FULL/ (a README file provides further information).

The GeneViTo file format for PRED-CLASS results is that of the files in http://biophysics.biol.uoa.gr/PRED-CLASS/Results/FULL/. You must provide a plain .txt file in this format.

The PRED-TMR2 algorithm is available through http://biophysics.biol.uoa.gr/PRED-TMR2/, where the following options are offered:

Run PRED-TMR2 on a sequence

Browse the results obtained with the algorithm

Go to the Biophysics Lab Homepage

 

Select Run PRED-TMR2 on a sequence in order to run PRED-TMR2. In the new page that appears, insert the protein sequence in the provided area. This option runs one protein at a time. For an entire proteome, a batch run might be available upon request to the PRED-TMR2 authors. In addition, a list of PRED-TMR2 results for 7 entire proteomes is available through http://biophysics.biol.uoa.gr/PRED-TMR2/Results/index.html

The orienTM algorithm is available through http://biophysics.biol.uoa.gr/orienTM/.

Select Execute orienTM on a sequence in order to run orienTM on a protein sequence with already defined transmembrane segments. You can also run PRED-TMR2 and orienTM successively.

The GeneViTo file format for PRED-TMR2 and orienTM results is as follows (in XML format):

<SEQ>
    <NAME>AROB_CHLTR </NAME>
    <RESULTS>
        <TM>
            <FROM>93</FROM>
            <TO>112</TO>
            <ORIEN>OUTWARDS</ORIEN>
        </TM>
    </RESULTS>
</SEQ>

You must follow the above XML format strictly: the tags must be laid out on lines exactly as in the example for a protein (for instance, a <FROM> tag and its </FROM> tag cannot be on different lines). Within this format you can insert prediction results from various algorithms, e.g. inside a <TM> tag group the predicted transmembrane segments go in the <FROM> </FROM> and <TO> </TO> tag groups.

A <SEQ> tag group can contain one <NAME> </NAME> tag group and one <RESULTS> </RESULTS> tag group. Inside the RESULTS tag group there can be several <TM> </TM> tag groups. Inside a TM tag group there can be several <FROM> </FROM> tags, each followed by a <TO> </TO> tag and an <ORIEN> </ORIEN> tag. If you do not have a value for one of these elements, put a "-" between its tags.
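Results from other prediction programs can be written in this layout with a few lines of code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each segment is written as its own <TM> group, and no indentation is emitted because the text does not say whether GeneViTo requires it. Every opening tag is kept on the same line as its closing tag, and a missing value becomes "-".

```python
from typing import Iterable, Optional, Tuple

# (from, to, orientation); any element may be None when it is not known.
Segment = Tuple[Optional[int], Optional[int], Optional[str]]


def _value(value) -> str:
    """Missing elements are written as "-" between the tags."""
    return "-" if value is None else str(value)


def format_seq(name: str, segments: Iterable[Segment]) -> str:
    """Return one <SEQ> record with one <TM> group per segment."""
    lines = ["<SEQ>", f"<NAME>{name}</NAME>", "<RESULTS>"]
    for start, end, orientation in segments:
        lines += [
            "<TM>",
            f"<FROM>{_value(start)}</FROM>",
            f"<TO>{_value(end)}</TO>",
            f"<ORIEN>{_value(orientation)}</ORIEN>",
            "</TM>",
        ]
    lines += ["</RESULTS>", "</SEQ>"]
    return "\n".join(lines) + "\n"
```

For instance, `format_seq("AROB_CHLTR", [(93, 112, "OUTWARDS")])` reproduces the content of the example record above, apart from indentation and the space after the name.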

 

The SignalP algorithm is available through http://www.cbs.dtu.dk/services/SignalP/. Run it on the whole proteome, either one protein at a time or in batch mode, and provide a single file with the results in the typical SignalP format:

>DUT_CHLTR 

SignalP-NN result: 

>DUT_CHLTR length = 70
# Measure Position Value Cutoff signal peptide?
max. C 18 0.367 0.50 NO
max. Y 18 0.113 0.32 NO
max. S 4 0.433 0.90 NO
mean S 1-17 0.184 0.44 NO


SignalP-HMM result: 

>DUT_CHLTR
Prediction: Non-secretory protein
Signal peptide probability: 0.015
Max cleavage site probability: 0.009 between pos. 23 and 24

>EFG_CHLTR 

SignalP-NN result: 

>EFG_CHLTR length = 70
# Measure Position Value Cutoff signal peptide?
max. C 41 0.105 0.50 NO
max. Y 41 0.083 0.32 NO
max. S 70 0.286 0.90 NO
mean S 1-40 0.084 0.44 NO

SignalP-HMM result: 

>EFG_CHLTR
Prediction: Non-secretory protein
Signal peptide probability: 0.000
Max cleavage site probability: 0.000 between pos. -1 and 0
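Before loading the combined file, it can be useful to confirm that every protein has a complete record. The sketch below reads the HMM part of each record and assumes the layout of the sample above; the function name and the dictionary keys are hypothetical, and the NN table is ignored.

```python
from typing import Dict


def parse_signalp(text: str) -> Dict[str, dict]:
    """Return protein name -> {"prediction": str, "probability": float}."""
    results: Dict[str, dict] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            # Header lines look like ">DUT_CHLTR" or ">DUT_CHLTR length = 70".
            current = line[1:].split()[0]
            results.setdefault(current, {})
        elif current and line.startswith("Prediction:"):
            results[current]["prediction"] = line.split(":", 1)[1].strip()
        elif current and line.startswith("Signal peptide probability:"):
            results[current]["probability"] = float(line.split(":", 1)[1])
    return results
```

Proteins whose dictionary lacks either key point to a truncated or badly concatenated record.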

 

All files are loaded through this interface:

 

1. Selecting Essential files

From the menu File you choose the option Prepare Genome and the above interface appears.

If you click Prepare Genome without selecting any other feature, you will be prompted to insert, through File Choosers, the Essential Files mentioned at the beginning of this document. These files (.ptt, .ffn, .fna, and the proteome in SWISS-PROT format) are obligatory for the genome to be prepared. During this procedure you will be prompted to provide a name for the project you are preparing, along with the destination of the directory that will be created.

This procedure is standard and takes place each time you prepare a genome, regardless of the Additional input files you have chosen.

 

2. Selecting Additional files

Depending on which additional files you gathered by following the procedures described above, select the corresponding cyan toggle buttons and then enter the files through File Choosers. Afterwards, press the Prepare Genome button, enter the Essential files, and wait for your genome to be prepared.

In this last picture you can see the procedure of locating a file through a File Chooser.