How to prepare a Genome

To organize your data in the GeneViTo format, follow these instructions:

File Formats

1. Essential files

To use the GeneViTo GUI you must provide at least the following files:

GenBank files

*.ptt = Protein Table

*.ffn = FASTA nucleotide coding regions file

*.fna = FASTA Nucleic Acid file

These files are available through the GenBank FTP site: ftp://ftp.ncbi.nih.gov/genbank/genomes/

e.g.

To acquire the above-mentioned files for Chlamydia trachomatis, visit ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Chlamydia_trachomatis/, where all these files are stored.
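Before starting GeneViTo it can be useful to confirm that the three GenBank files are present. The following is a minimal sketch in Python, assuming that all downloaded files sit in one directory; the function name missing_genbank_files is hypothetical and is not part of GeneViTo.

```python
from pathlib import Path

# Extensions of the essential GenBank files listed above.
REQUIRED_EXTENSIONS = (".ptt", ".ffn", ".fna")


def missing_genbank_files(directory):
    """Return the required extensions that have no matching file in `directory`."""
    present = {p.suffix.lower() for p in Path(directory).iterdir() if p.is_file()}
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in present]
```

An empty list means that a .ptt, a .ffn and a .fna file were all found; the proteome file still has to be checked separately.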

Proteome file

The proteome file must follow the SWISS-PROT format. It can be acquired, for example, by downloading the entire proteome from the SWISS-PROT database through the SRS system (http://srs.ebi.ac.uk/), or from the Proteome Analysis Server at EBI (http://www.ebi.ac.uk/proteome/index.html?http://www.ebi.ac.uk/proteome/ProteomeSource.html).

e.g.

ID 6PGD_CHLTR STANDARD; PRT; 480 AA.

AC O84066;

DT 30-MAY-2000 (Rel. 39, Created)

DT 30-MAY-2000 (Rel. 39, Last sequence update)

DT 30-MAY-2000 (Rel. 39, Last annotation update)

DE 6-PHOSPHOGLUCONATE DEHYDROGENASE, DECARBOXYLATING (EC 1.1.1.44).

GN GND OR CT063.

OS Chlamydia trachomatis.

OC Bacteria; Chlamydiales; Chlamydiaceae; Chlamydia.

RN [1]
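To check that a proteome file really follows this layout, you can list the entry names and accession numbers it contains. The sketch below is only an illustration: it relies on the two-letter line codes ID and AC shown in the example above, and the function name read_swissprot_ids is hypothetical.

```python
def read_swissprot_ids(lines):
    """Collect (entry name, first accession) pairs from SWISS-PROT flat-file lines."""
    entries = []
    entry_name = None
    for line in lines:
        code, _, rest = line.partition(" ")
        if code == "ID":
            # The entry name is the first word after the ID line code.
            entry_name = rest.split()[0]
        elif code == "AC" and entry_name is not None:
            # Accessions are separated by ";"; keep the first one only.
            accession = rest.strip().split(";")[0].strip()
            entries.append((entry_name, accession))
            entry_name = None  # ignore further AC lines of the same entry
    return entries


sample = [
    "ID 6PGD_CHLTR STANDARD; PRT; 480 AA.",
    "AC O84066;",
    "DE 6-PHOSPHOGLUCONATE DEHYDROGENASE, DECARBOXYLATING (EC 1.1.1.44).",
]
print(read_swissprot_ids(sample))  # [('6PGD_CHLTR', 'O84066')]
```

If the number of pairs is far from the number of proteins listed in the .ptt file, the proteome file is probably incomplete or in a different format.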

2. Additional files

These files are not essential for the GUI to work; they supply additional information on genomic/proteomic elements. They are:

Structural RNAs (GenBank file)

This file is available through http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html. Select the genome of choice, e.g. for Chlamydia trachomatis click on the second column (NC_000117). In the new page that appears, select the Structural RNAs option from the feature table. A new page will open; from the format menu of the report shown on that page, select FASTA nucleotide and save the output in a plain text file.

Clusters of Orthologous Groups (COGs file)

This file is available through http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html. Select the genome of choice, e.g. for Chlamydia trachomatis click on the second column (NC_000117). In the new page that appears, the top right column contains the following list:

BLAST protein homologs:
COGs (Clusters of Orthologous Groups)
3D Structure (Sequences with known structure)
TaxMap (Sequences grouped by superkingdom)
TaxPlot (3-way genome comparison)
CDD(Conserved Domain Database)

Select the COGs (Clusters of Orthologous Groups) option, which opens a new page. On this new page the COGs functional classes are displayed for the selected organism.

	Code	COGs	Description
	J	113	Translation
	A	0	RNA processing and modification
	K	21	Transcription
	L	54	Replication, recombination and repair
	B	0	Chromatin structure and dynamics
	D	9	Cell cycle control, mitosis and meiosis
	Y	0	Nuclear structure
	V	3	Defense mechanisms
	T	12	Signal transduction mechanisms
	M	38	Cell wall/membrane biogenesis
	N	14	Cell motility
	Z	0	Cytoskeleton
	W	0	Extracellular structures
	U	19	Intracellular trafficking and secretion
	O	35	Posttranslational modification, protein turnover, chaperones
	C	41	Energy production and conversion
	G	35	Carbohydrate transport and metabolism
	E	51	Amino acid transport and metabolism
	F	16	Nucleotide transport and metabolism
	H	35	Coenzyme transport and metabolism
	I	34	Lipid transport and metabolism
	P	18	Inorganic ion transport and metabolism
	Q	3	Secondary metabolites biosynthesis, transport and catabolism
	R	60	General function prediction only
	S	30	Function unknown
	-	254	not in COGs

For each category, click on the number in the second column (COGs); a new page appears. Save this page as a .txt file (using the Save As option of your browser). Repeat this for all the categories; the resulting files must have the following filenames:

Amino acid transport and metabolism.txt
Carbohydrate transport and metabolism.txt
Cell cycle control, mitosis and meiosis.txt
Cell motility.txt
Cell wallmembrane biogenesis.txt
Chromatin structure and dynamics.txt
Coenzyme transport and metabolism.txt
Cytoskeleton.txt
Defense mechanisms.txt
Energy production and conversion.txt
Extracellular structures.txt
Function unknown.txt
General function prediction only.txt
Inorganic ion transport and metabolism.txt
Intracellular trafficking and secretion.txt
Lipid transport and metabolism.txt
Nuclear structure.txt
Nucleotide transport and metabolism.txt
Posttranslational modification, protein turnover, chaperones.txt
RNA processing and modification.txt
Replication, recombination and repair.txt
Secondary metabolites biosynthesis, transport and catabolism.txt
Signal transduction mechanisms.txt
Transcription.txt
Translation.txt
not in COGs.txt

These files must be put inside a single directory. The name of the directory is your choice.
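Since 26 files have to be saved by hand, a short check of the directory can catch a misspelled or missing filename. The sketch below assumes that each filename is the category description with any "/" dropped (as in "Cell wallmembrane biogenesis.txt" above) plus the .txt extension; the function names are hypothetical.

```python
from pathlib import Path

# Category descriptions as they appear in the COGs table above.
COG_CATEGORIES = [
    "Translation",
    "RNA processing and modification",
    "Transcription",
    "Replication, recombination and repair",
    "Chromatin structure and dynamics",
    "Cell cycle control, mitosis and meiosis",
    "Nuclear structure",
    "Defense mechanisms",
    "Signal transduction mechanisms",
    "Cell wall/membrane biogenesis",
    "Cell motility",
    "Cytoskeleton",
    "Extracellular structures",
    "Intracellular trafficking and secretion",
    "Posttranslational modification, protein turnover, chaperones",
    "Energy production and conversion",
    "Carbohydrate transport and metabolism",
    "Amino acid transport and metabolism",
    "Nucleotide transport and metabolism",
    "Coenzyme transport and metabolism",
    "Lipid transport and metabolism",
    "Inorganic ion transport and metabolism",
    "Secondary metabolites biosynthesis, transport and catabolism",
    "General function prediction only",
    "Function unknown",
    "not in COGs",
]


def cog_filename(description):
    """Build the expected filename; "/" is dropped because it cannot appear in a filename."""
    return description.replace("/", "") + ".txt"


def missing_cog_files(directory):
    """Return the expected COGs filenames that are absent from `directory`."""
    directory = Path(directory)
    return [
        cog_filename(d) for d in COG_CATEGORIES
        if not (directory / cog_filename(d)).is_file()
    ]
```

Calling missing_cog_files on your COGs directory returns an empty list when all 26 files are in place.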

Translation Table (GenBank file)

This file is available through http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html. Select the genome of choice, e.g. for Chlamydia trachomatis click on the second column (NC_000117). In the new page that appears, find the Genetic Code link under the circular map of the organism:

Organism: Chlamydia trachomatis
Genetic Code: 11
Lineage: Bacteria; Chlamydiae; Chlamydiales; Chlamydiaceae; Chlamydia.

Clicking on the 11 opens a new page:

11. The Bacterial and Plant Plastid Code (transl_table=11)

    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M---------------M------------MMMM---------------M------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Click here to change format

Select the Click here to change format link; the page then changes to:

11. The Bacterial and Plant Plastid Code (transl_table=11)

TTT  F Phe      TCT  S Ser      TAT  Y Tyr      TGT  C Cys  
TTC  F Phe      TCC  S Ser      TAC  Y Tyr      TGC  C Cys  
TTA  L Leu      TCA  S Ser      TAA  * Ter      TGA  * Ter  
TTG  L Leu i    TCG  S Ser      TAG  * Ter      TGG  W Trp  

CTT  L Leu      CCT  P Pro      CAT  H His      CGT  R Arg  
CTC  L Leu      CCC  P Pro      CAC  H His      CGC  R Arg  
CTA  L Leu      CCA  P Pro      CAA  Q Gln      CGA  R Arg  
CTG  L Leu i    CCG  P Pro      CAG  Q Gln      CGG  R Arg  

ATT  I Ile i    ACT  T Thr      AAT  N Asn      AGT  S Ser  
ATC  I Ile i    ACC  T Thr      AAC  N Asn      AGC  S Ser  
ATA  I Ile i    ACA  T Thr      AAA  K Lys      AGA  R Arg  
ATG  M Met i    ACG  T Thr      AAG  K Lys      AGG  R Arg  

GTT  V Val      GCT  A Ala      GAT  D Asp      GGT  G Gly  
GTC  V Val      GCC  A Ala      GAC  D Asp      GGC  G Gly  
GTA  V Val      GCA  A Ala      GAA  E Glu      GGA  G Gly  
GTG  V Val i    GCG  A Ala      GAG  E Glu      GGG  G Gly

Click here to change format

The region highlighted in red above must be saved (copy and paste) in a text file with a name of your choice.
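After pasting, it is worth confirming that the saved text still contains all 64 codons. The following sketch reads the four-column layout shown above into a dictionary; the regular expression and the function name parse_translation_table are assumptions made for illustration and say nothing about how GeneViTo itself reads the file.

```python
import re

# One table cell: codon, one-letter code, three-letter code, optional "i" (initiation) marker.
CELL = re.compile(r"\b([TCAG]{3})\s+([A-Z*])\s+([A-Za-z]{3})(\s+i\b)?")


def parse_translation_table(text):
    """Return {codon: (one_letter_amino_acid, is_initiation_codon)}."""
    table = {}
    for line in text.splitlines():
        for codon, amino_acid, _three_letter, start in CELL.findall(line):
            table[codon] = (amino_acid, bool(start))
    return table


sample = """\
TTT  F Phe      TCT  S Ser      TAT  Y Tyr      TGT  C Cys
TTG  L Leu i    TCG  S Ser      TAG  * Ter      TGG  W Trp
"""
table = parse_translation_table(sample)
print(table["TTG"])  # ('L', True)
```

For a complete table, len(parse_translation_table(text)) should be 64.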

Codon Usage (GenBank file)

This file is available through http://www.kazusa.or.jp/codon/. On the main page, select the letter that corresponds to the first letter of your genome of choice.

Alphabetical lists of all organisms

A  B  C  D  E  F  G  H  I  J  K  L  M

N  O  P  Q  R  S  T  U  V  W  X  Y  Z

Chloroplast  Mitochondrion

Others (initials are not capital)

e.g. for Chlamydia trachomatis you select C and in the new page you select Chlamydia trachomatis [gbbct]: 1160.

In the new page that appears you see this:

Chlamydia trachomatis [gbbct]: 1160 CDS's (394899 codons)

fields: [triplet] [frequency: per thousand] ([number])

UUU 30.0( 11828)  UCU 33.8( 13336)  UAU 19.9(  7853)  UGU 10.6(  4192)
UUC 17.0(  6717)  UCC 12.0(  4727)  UAC  9.9(  3914)  UGC  6.2(  2452)
UUA 32.0( 12628)  UCA  9.6(  3780)  UAA  1.6(   633)  UGA  0.4(   165)
UUG 19.6(  7756)  UCG  6.5(  2579)  UAG  0.9(   363)  UGG  9.3(  3690)

CUU 22.9(  9043)  CCU 23.8(  9395)  CAU 15.5(  6120)  CGU 12.6(  4985)
CUC 11.1(  4397)  CCC  5.5(  2155)  CAC  6.3(  2493)  CGC  7.6(  3009)
CUA 14.6(  5763)  CCA  9.9(  3917)  CAA 27.4( 10821)  CGA  9.4(  3697)
CUG 10.0(  3962)  CCG  3.9(  1523)  CAG 14.0(  5531)  CGG  3.9(  1559)

AUU 34.1( 13483)  ACU 17.9(  7076)  AAU 24.4(  9625)  AGU 10.0(  3967)
AUC 19.7(  7776)  ACC  8.8(  3491)  AAC 11.0(  4355)  AGC  8.7(  3426)
AUA 11.1(  4385)  ACA 18.1(  7154)  AAA 41.9( 16528)  AGA 11.9(  4717)
AUG 19.8(  7831)  ACG  8.1(  3192)  AAG 17.0(  6715)  AGG  2.4(   964)

GUU 25.2(  9956)  GCU 36.8( 14523)  GAU 35.0( 13809)  GGU 12.5(  4934)
GUC  9.8(  3882)  GCC 10.1(  3984)  GAC 10.7(  4210)  GGC  8.8(  3484)
GUA 17.7(  6992)  GCA 21.0(  8290)  GAA 41.8( 16525)  GGA 28.1( 11109)
GUG 12.8(  5036)  GCG  9.4(  3699)  GAG 23.3(  9182)  GGG 14.2(  5616)

Coding GC 41.53% 1st letter GC 51.56% 2nd letter GC 39.20% 3rd letter GC 33.85%

The region highlighted in red above must be saved (copy and paste) in a text file with a name of your choice.
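A quick way to verify the pasted codon usage text is to read every [triplet] [frequency] ([number]) field back and count the codons found. This is a minimal sketch under the assumption that the fields keep the layout shown above; parse_codon_usage is a hypothetical name.

```python
import re

# One field: triplet, frequency per thousand, and the absolute count in parentheses.
ENTRY = re.compile(r"\b([UCAG]{3})\s+(\d+(?:\.\d+)?)\(\s*(\d+)\)")


def parse_codon_usage(text):
    """Return {triplet: (frequency_per_thousand, count)}."""
    usage = {}
    for codon, per_thousand, count in ENTRY.findall(text):
        usage[codon] = (float(per_thousand), int(count))
    return usage


sample = """\
UUU 30.0( 11828)  UCU 33.8( 13336)  UAU 19.9(  7853)  UGU 10.6(  4192)
UUA 32.0( 12628)  UCA  9.6(  3780)  UAA  1.6(   633)  UGA  0.4(   165)
"""
usage = parse_codon_usage(sample)
print(usage["UGA"])  # (0.4, 165)
```

A complete file should yield 64 triplets.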

Webcutter (Restriction Enzymes file)

At the Webcutter website http://www.firstmarket.com/cutter/cut2.html, you must set the following options before running Webcutter:

Please enter a title for this sequence:

Paste the DNA sequence into the box below

Please select the type of analysis you would like
Linear sequence analysis
Circular sequence analysis
Find sites which may be introduced by silent mutagenesis

Please indicate how you would like the restriction sites displayed
Map of restriction sites
Table of sites, sorted alphabetically by enzyme name
Table of sites, sorted sequentially by base pair number

Please indicate which enzymes to include in the display
All enzymes
Enzymes not cutting
Enzymes cutting once
Enzymes cutting exactly times
Enzymes cutting at least times, and at most times
highlights for enzymes from the polylinker

Please indicate which enzymes to include in the analysis
All enzymes in the database
Only enzymes with recognition sites equal to or greater than bases long
Only the following enzymes:
Use the command, control, or shift key to select multiple entries

Here is a sample of Webcutter output:

AatI          7   9333 13482 19521 42397 48711     agg/cct            More info
                  56867 57843
AatII         3   1929 49019 80792                 gacgt/c            More info
Acc113I       12  6298 11712 17493 24282 31362     agt/act            More info
                  32362 38424 45260 76393 76399
                  79306 80868
Acc16I        17  3122 3333 17333 19198 20267      tgc/gca            More info
                  27062 28425 29882 40655 46328
                  59270 61994 63929 66818 68610
                  71292 74072

For an entire genome you must provide a single file with all the desired restriction enzymes in the format of the sample output above (the "yellow" region). Bear in mind that Webcutter accepts only a limited amount of sequence in a single run. When you run a whole genome as a series of sub-sequences, the cut positions (in bp) are therefore counted from the start of each sub-sequence, not from the start of the genome. For example, for a sub-sequence that corresponds to 1000-2000 bp of a genome, the output numbers the cut sites starting from position 1 bp. The GeneViTo file format requires a single file in the above format, in which every cut position corresponds to its actual position on the genome.
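The renumbering can be done with a small script. The sketch below assumes that each Webcutter output row starts with the enzyme name and the number of cuts, followed by the cut positions, and that continuation lines contain positions only, as in the sample above. The function names are hypothetical. Note that sites spanning the boundary between two sub-sequences are not found by Webcutter in the first place, so this sketch cannot recover them, and the merged result still has to be written back out in the Webcutter layout.

```python
def parse_webcutter(text):
    """Parse Webcutter table rows into {enzyme: [cut positions]}."""
    sites = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].isdigit():
            # Continuation line: more positions for the current enzyme.
            if current is not None:
                sites[current].extend(int(t) for t in tokens if t.isdigit())
        else:
            # New row: enzyme name, number of cuts, positions, site, "More info".
            current = tokens[0]
            numbers = [int(t) for t in tokens[1:] if t.isdigit()]
            sites.setdefault(current, []).extend(numbers[1:])  # skip the cut count
    return sites


def merge_chunks(chunks):
    """Merge outputs of several runs.

    `chunks` holds (start, text) pairs, where `start` is the 1-based genome
    position of the first base of the sub-sequence given to Webcutter.
    """
    merged = {}
    for start, text in chunks:
        for enzyme, positions in parse_webcutter(text).items():
            # Local position 1 corresponds to genome position `start`.
            merged.setdefault(enzyme, []).extend(p + start - 1 for p in positions)
    return {enzyme: sorted(positions) for enzyme, positions in merged.items()}


first = "AatII  2  10 20  gacgt/c  More info"
second = "AatII  1  5  gacgt/c  More info"
print(merge_chunks([(1, first), (1001, second)]))  # {'AatII': [10, 20, 1005]}
```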

PRED-CLASS results file

The PRED-CLASS algorithm is available through http://biophysics.biol.uoa.gr/PRED-CLASS/ .

Now, you can :

Run PRED-CLASS on a sequence

Browse detailed results obtained with the algorithm on several test sets

View lists of data used to train and evaluate PRED-CLASS

Go to the Biophysics Lab Homepage

Select Run PRED-CLASS on a sequence in order to run PRED-CLASS. In the new page that appears, insert the protein sequence in the provided area. This option runs one protein at a time. For an entire proteome, a batch run might be available upon request to the PRED-CLASS authors. In addition, PRED-CLASS results for 21 prokaryotic and 6 archaeal entire proteomes are available through http://biophysics.biol.uoa.gr/PRED-CLASS/Results/FULL/ (a README file provides further information).

The GeneViTo file format for PRED-CLASS results is that of the files in http://biophysics.biol.uoa.gr/PRED-CLASS/Results/FULL/. You must provide a plain .txt file with this format.

PRED-TMR2 & orienTM results file

The PRED-TMR2 algorithm is available through http://biophysics.biol.uoa.gr/PRED-TMR2/ .

Now, you can :

Run PRED-TMR2 on a sequence

Browse the results obtained with the algorithm

Go to the Biophysics Lab Homepage

Select Run PRED-TMR2 on a sequence in order to run PRED-TMR2. In the new page that appears, insert the protein sequence in the provided area. This option runs one protein at a time. For an entire proteome, a batch run might be available upon request to the PRED-TMR2 authors. In addition, PRED-TMR2 results for 7 entire proteomes are available through http://biophysics.biol.uoa.gr/PRED-TMR2/Results/index.html.

The orienTM algorithm is available through http://biophysics.biol.uoa.gr/orienTM/.

You have to select Execute orienTM on a sequence in order to run orienTM on a protein sequence with already defined transmembrane segments. You can also run PRED-TMR2 and orienTM successively.

The GeneViTo file format for PRED-TMR2 & orienTM results is as follows (in XML format):

<SEQ>

<NAME>AROB_CHLTR </NAME>

<RESULTS>

<TM>

<FROM>93</FROM>

<TO>112</TO>

<ORIEN>OUTWARDS</ORIEN>

</TM>

</RESULTS>

</SEQ>

Following the above-mentioned XML format, you can insert prediction results from various algorithms; e.g. inside the <TM> tag you can insert predicted transmembrane segments in the <FROM> </FROM> and <TO> </TO> tag groups. You must comply with the format strictly: the respective tags must be on the same lines as in the example (for instance, the <FROM> and </FROM> tags cannot be on different lines).

A <SEQ> tag can contain one <NAME> </NAME> tag group and one <RESULTS> </RESULTS> tag group. Inside the RESULTS tag group there can be several <TM> </TM> tag groups. Inside a TM group there can be several <FROM> </FROM> tags, followed by <TO> </TO> tags, followed by <ORIEN> </ORIEN> tags. If you do not have one of the elements, fill in a "-" between the tags.
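If your predictions come from another program, you can generate this layout with a few lines of code. The sketch below writes one <TM> group per segment and puts "-" where a value is missing; the function name format_seq is hypothetical. Whether GeneViTo also expects the blank lines between tags shown in the example is not stated here, so the sketch leaves them out as an assumption.

```python
def format_seq(name, segments):
    """Format one protein in the XML layout above.

    `segments` is a list of (from, to, orientation) tuples; a value of None
    is written as "-".
    """
    def value(x):
        return "-" if x is None else str(x)

    lines = ["<SEQ>", f"<NAME>{name}</NAME>", "<RESULTS>"]
    for start, end, orientation in segments:
        # Each opening and closing tag pair stays on a single line.
        lines += [
            "<TM>",
            f"<FROM>{value(start)}</FROM>",
            f"<TO>{value(end)}</TO>",
            f"<ORIEN>{value(orientation)}</ORIEN>",
            "</TM>",
        ]
    lines += ["</RESULTS>", "</SEQ>"]
    return "\n".join(lines)


print(format_seq("AROB_CHLTR", [(93, 112, "OUTWARDS")]))
```

For a whole proteome, the output of format_seq for each protein can be written one after the other into a single file.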

SIGNALP results file

The SIGNALP algorithm is available through http://www.cbs.dtu.dk/services/SignalP/. You have to run the entire proteome, either one protein at a time or as a batch, and provide a single file with the results in the typical SIGNALP format:

>DUT_CHLTR

SignalP-NN result:

>DUT_CHLTR length = 70
# Measure Position Value Cutoff signal peptide?
max. C 18 0.367 0.50 NO
max. Y 18 0.113 0.32 NO
max. S 4 0.433 0.90 NO
mean S 1-17 0.184 0.44 NO

SignalP-HMM result:

>DUT_CHLTR
Prediction: Non-secretory protein
Signal peptide probability: 0.015
Max cleavage site probability: 0.009 between pos. 23 and 24

>EFG_CHLTR

SignalP-NN result:

>EFG_CHLTR length = 70
# Measure Position Value Cutoff signal peptide?
max. C 41 0.105 0.50 NO
max. Y 41 0.083 0.32 NO
max. S 70 0.286 0.90 NO
mean S 1-40 0.084 0.44 NO

SignalP-HMM result:

>EFG_CHLTR
Prediction: Non-secretory protein
Signal peptide probability: 0.000
Max cleavage site probability: 0.000 between pos. -1 and 0
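When results from many runs are concatenated into one file, a short script can confirm that every protein has a complete record. The following sketch collects the SignalP-HMM prediction and signal peptide probability per protein, relying only on the line prefixes visible in the sample above; the function name signalp_hmm_predictions is hypothetical.

```python
def signalp_hmm_predictions(lines):
    """Return {protein: {"prediction": str, "probability": float}}."""
    results = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # Header lines look like ">DUT_CHLTR" or ">DUT_CHLTR length = 70".
            parts = line[1:].split()
            if parts:
                name = parts[0]
        elif name is not None and line.startswith("Prediction:"):
            results.setdefault(name, {})["prediction"] = line.split(":", 1)[1].strip()
        elif name is not None and line.startswith("Signal peptide probability:"):
            results.setdefault(name, {})["probability"] = float(line.split(":", 1)[1])
    return results


sample = [
    ">DUT_CHLTR",
    "SignalP-NN result:",
    ">DUT_CHLTR length = 70",
    "# Measure Position Value Cutoff signal peptide?",
    "max. C 18 0.367 0.50 NO",
    "SignalP-HMM result:",
    ">DUT_CHLTR",
    "Prediction: Non-secretory protein",
    "Signal peptide probability: 0.015",
    "Max cleavage site probability: 0.009 between pos. 23 and 24",
]
print(signalp_hmm_predictions(sample))
```

Proteins that appear in the proteome file but not in the returned dictionary are missing from the results file.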

Inserting the above files in GeneViTo

All files are input through this interface:

1. Selecting Essential files

From the File menu, choose the Prepare Genome option; the above interface appears.

If you click Prepare Genome without selecting any other feature, you will be prompted to select, through File Choosers, the Essential files mentioned at the beginning of this document. These files (.ptt, .ffn, .fna and the proteome in SWISS-PROT format) are required for the genome to be prepared. During this procedure you will also be prompted to provide a name for the project you are preparing, along with the location of the directory that will be created.

This procedure is standard and takes place each time you prepare a genome, regardless of which Additional files you have chosen.

2. Selecting Additional files

Depending on which additional files you gathered by following the procedures described above, select the corresponding cyan toggle buttons and then enter the corresponding files through File Choosers. Afterwards, push the Prepare Genome button, enter the Essential files and wait for your genome to be prepared.

In this last picture you can see the procedure of locating a file through a File Chooser.

AatI
AatII
Acc113I
Acc16I
Acc65I
AccB1I
AccB7I
AccBSI
AccI
AccII
AccIII
AciI
AclNI
AclWI
AcsI
AcyI
AfaI
AfeI
AflII
AflIII
AgeI
AhdI
AluI
Alw21I
Alw26I
Alw44I
AlwI
AlwNI
Ama87I
AocI
Aor51HI
ApaI
ApaLI
ApoI
AscI
AseI
AsnI
Asp700I
Asp718I
AspEI
AspHI
AspI
AspLEI
AspS9I
AsuI
AtsI
AvaI
AvaII
AviII
AvrII
BalI
BamHI
BanI
BanII
BanIII
BbeI
BbiII
BbrPI
BbsI
BbuI
Bbv12I
Bbv16II
BbvI
BcgI
BcgI
BclI
BcnI
BcoI
BfaI
BfrI
BglI
BglII
BlnI
BlpI
Bme18I
BmyI
BpiI
BpmI
Bpu1102I
Bpu14I
BpuAI
Bsa29I
BsaAI
BsaBI
BsaHI
BsaI
BsaJI
BsaMI
BsaOI
BsaWI
Bsc4I
BscI
Bse118I
Bse1I
Bse21I
Bse8I
BseAI
BseCI
BseDI
BseNI
BsePI
BseRI
BsgI
Bsh1236I
Bsh1285I
Bsh1365I
BshNI
BsiEI
BsiHKAI
BsiI
BsiMI
BsiSI
BsiWI
BsiYI
BslI
BsmAI
BsmBI
BsmFI
BsmI
BsoBI
BsoFI
Bsp106I
Bsp119I
Bsp120I
Bsp1286I
Bsp13I
Bsp1407I
Bsp143I
Bsp143II
Bsp1720I
Bsp19I
Bsp68I
BspCI
BspDI
BspEI
BspHI
BspLU11I
BspMI
BspTI
BspXI
BsrBI
BsrBRI
BsrDI
BsrFI
BsrGI
BsrI
BsrSI
BssAI
BssHII
BssSI
BssT1I
Bst1107I
Bst2UI
Bst71I
Bst98I
BstBI
BstD102I
BstDEI
BstDSI
BstEII
BstF5I
BstH2I
BstI
BstMCI
BstNI
BstOI
BstPI
BstSFI
BstSNI
BstUI
BstX2I
BstXI
BstYI
BstZI
Bsu15I
Bsu36I
BsuRI
Cac8I
CciNI
CelII
CfoI
Cfr10I
Cfr13I
Cfr42I
Cfr9I
CfrI
ClaI
CpoI
Csp45I
Csp6I
CspI
CviJI
CvnI
DdeI
DpnI
DpnII
DraI
DraII
DraIII
DrdI
DsaI
EaeI
EagI
Eam1104I
Eam1105I
EarI
Ecl136II
EclHKI
EclXI
Eco105I
Eco130I
Eco147I
Eco24I
Eco255I
Eco31I
Eco32I
Eco47I
Eco47III
Eco52I
Eco57I
Eco64I
Eco72I
Eco81I
Eco88I
Eco91I
EcoICRI
EcoNI
EcoO109I
EcoO65I
EcoRI
EcoRII
EcoRV
EcoT14I
EcoT22I
EheI
ErhI
Esp1396I
Esp3I
FauI
FauNDI
FbaI
FokI
FriOI
FseI
Fsp4HI
FspI
GsuI
HaeII
HaeIII
HapII
HgaI
HgiEI
HhaI
Hin1I
Hin6I
HinP1I
HincII
HindII
HindIII
HinfI
HpaI
HpaII
HphI
Hsp92I
Hsp92II
HspAI
ItaI
KasI
Kpn2I
KpnI
Ksp22I
Ksp632I
KspI
Kzo9I
LspI
MaeI
MaeII
MaeIII
MamI
MboI
MboII
MfeI
MflI
MluI
MluNI
MnlI
Mph1103I
MroI
MroNI
MscI
MseI
MslI
Msp17I
MspA1I
MspCI
MspI
MspR9I
MunI
Mva1269I
MvaI
MvnI
MwoI
NaeI
NarI
NciI
NcoI
NdeI
NdeII
NgoAIV
NgoMI
NheI
NlaIII
NlaIV
NotI
NruI
NsiI
NspBII
NspI
NspV
PacI
PaeI
PaeR7I
PalI
Pfl23II
PflMI
PinAI
Ple19I
PleI
PmaCI
Pme55I
PmeI
PmlI
Ppu10I
PpuMI
PshAI
PshBI
Psp124BI
Psp1406I
Psp5II
PspAI
PspALI
PspEI
PspLI
PspN4I
PspOMI
PstI
PstNHI
PvuI
PvuII
RcaI
RsaI
RsrII
SacI
SacII
SalI
SapI
Sau3AI
Sau96I
SbfI
ScaI
ScrFI
SduI
SexAI
SfaNI
SfcI
SfiI
Sfr274I
Sfr303I
SfuI
SgfI
SgrAI
SinI
SmaI
SmiI
SnaBI
SpeI
SphI
SplI
SrfI
Sse8387I
Sse9I
SseBI
SspBI
SspI
SstI
SstII
StuI
StyI
SunI
SwaI
TaqI
TfiI
ThaI
Tru1I
Tru9I
Tsp45I
Tsp509I
TspEI
TspRI
Tth111I
TthHB8I
Van91I
Vha464I
VneI
VspI
XbaI
XcmI
XhoI
XhoII
XmaI
XmaIII
XmnI
Zsp2I