PRED-GPCR: GPCRs Family Classification from Sequence alone

PRED-GPCR Help page.

Version 1.01

What do the different E-values mentioned in the Filtering options and the Results page mean?

PRED-GPCR, like Pfam, is based on profile Hidden Markov Model (HMM) searches implemented by the HMMER software package. HMMER uses empirical methods to estimate an E-value (expectation value) for the match between a query sequence and each profile HMM in the PRED-GPCR library. These E-values are referred to as "Individual motif E-values" in the PRED-GPCR system. An E-value measures how consistent the observed match is with a "pure chance" explanation: it is the number of hits, scoring at least as high as your sequence does against a single motif, that would be expected in a database of the given size made up of randomly generated sequences. E-values well below 1 are typically considered significant, so these E-values are in themselves a measure of the significance of the matches returned for a given query. (For a more in-depth discussion of profile HMMs and HMMER scores, see the documentation of the HMMER software package.)
However, the PRED-GPCR library includes more than one profile HMM for each GPCR family. What is needed, therefore, is a statistically valid method to combine the evidence from all the HMMs derived from the same family. This method is provided by the Qfast algorithm (Bailey et al. 1998), which is used in the PRED-GPCR system to produce a single E-value for each family (referred to as "Combined E-values" in the PRED-GPCR system).
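
To make the idea of combining evidence more concrete, the following is a minimal sketch, not the PRED-GPCR implementation. It uses the closed-form distribution of the product of n independent p-values, which is the quantity the Qfast algorithm is designed to compute. The function names, the conversion between E-values and p-values (E = database size x p), and the database size used below are all assumptions made for illustration.

```python
import math

def evalue_to_pvalue(evalue, db_size):
    """Assumed conversion: E-value = db_size * p-value (capped at 1)."""
    return min(1.0, evalue / db_size)

def combined_pvalue(pvalues):
    """P-value of the product of n independent p-values.

    Closed form: F_n(p) = p * sum_{i=0}^{n-1} (-ln p)^i / i!,
    where p is the product of the individual p-values.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    product = math.prod(pvalues)
    if product <= 0.0:
        return 0.0
    neg_log = -math.log(product)
    term = 1.0   # current term (-ln p)^i / i!, starting at i = 0
    total = 1.0  # running sum of the terms
    for i in range(1, n):
        term *= neg_log / i
        total += term
    return min(1.0, product * total)

def combined_family_evalue(motif_evalues, db_size):
    """Hypothetical helper: combine motif E-values into one family E-value."""
    pvals = [evalue_to_pvalue(e, db_size) for e in motif_evalues]
    return db_size * combined_pvalue(pvals)

# Two motifs of the same family, each only moderately significant on its own,
# searched against an assumed database of 1000 sequences.
print(combined_family_evalue([1e-3, 1e-3], db_size=1000))
```

With a single motif the combined value equals the individual one; with several motifs that all match reasonably well, the combined E-value becomes much smaller than any individual E-value.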

I can't decide what filtering options I should choose...

Well, that depends on what you are after.
The Combined E-value cutoff filters your results by combined family E-value. The default value for this field is 0.004, which has been determined to be the weighted Minimum Error Point on a test set of unseen examples. The Minimum Error Point is the E-value at which a classifier makes the fewest errors (false positives plus false negatives). However, you can use a higher combined E-value cutoff if you wish to broaden your search space.
The Individual motifs cutoff filters single motifs and can be set to one of the following:
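
As an illustration of the Minimum Error Point, here is a minimal sketch that scans candidate cutoffs on a labelled set and returns the one with the fewest errors. It is unweighted, whereas the PRED-GPCR default is described as a weighted Minimum Error Point whose weighting is not specified here. The function name and the example data are hypothetical, and a sequence is assumed to be predicted as a family member when its E-value is at or below the cutoff.

```python
def minimum_error_point(scored):
    """Return (cutoff, errors) minimising false positives + false negatives.

    scored: list of (evalue, is_member) pairs from a labelled data set.
    Only the observed E-values are tried as candidate cutoffs.
    """
    if not scored:
        raise ValueError("need at least one labelled example")
    best_cutoff, best_errors = None, None
    for cutoff in sorted({e for e, _ in scored}):
        false_pos = sum(1 for e, member in scored if e <= cutoff and not member)
        false_neg = sum(1 for e, member in scored if e > cutoff and member)
        errors = false_pos + false_neg
        if best_errors is None or errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff, best_errors

# Made-up labelled examples: (combined E-value, truly a family member?)
examples = [
    (1e-10, True), (1e-6, True), (1e-3, True),
    (5e-3, False), (0.02, True), (0.5, False), (2.0, False),
]
print(minimum_error_point(examples))
```

When several cutoffs give the same number of errors, this sketch keeps the strictest one.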

A motif-specific cutoff that uses discrete, predefined thresholds for each motif in the PRED-GPCR library to filter your results. These thresholds are empirical cutoffs, weighted between the E-value of the last True Positive (the last matching family member) and that of the first True Negative (the first matching sequence not belonging to the family) in a tuning data set.
A Global E-value cutoff that uses a common, user-defined threshold to filter all motifs. Again, you can use a loose Global E-value cutoff if you wish to include distant hits in your query results.
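
The two modes of the individual motifs cutoff can be sketched as follows. This is only an illustration of the filtering logic described above: the function name, the motif names and every threshold value are hypothetical and are not taken from the PRED-GPCR library.

```python
def filter_motif_hits(hits, motif_cutoffs=None, global_cutoff=None):
    """Keep motif hits whose E-value is below the applicable cutoff.

    hits: dict mapping motif name -> individual motif E-value.
    Pass motif_cutoffs (dict of per-motif thresholds) for the
    motif-specific mode, or global_cutoff (one number) for the global mode.
    """
    if (motif_cutoffs is None) == (global_cutoff is None):
        raise ValueError("give exactly one of motif_cutoffs or global_cutoff")
    if motif_cutoffs is not None:
        return {m: e for m, e in hits.items() if e < motif_cutoffs[m]}
    return {m: e for m, e in hits.items() if e < global_cutoff}

# Hypothetical motif hits and thresholds.
hits = {"motif_A": 1e-8, "motif_B": 3e-3, "motif_C": 0.4}
cutoffs = {"motif_A": 1e-5, "motif_B": 1e-3, "motif_C": 1e-2}

print(filter_motif_hits(hits, motif_cutoffs=cutoffs))  # strict, per motif
print(filter_motif_hits(hits, global_cutoff=1.0))      # loose, same for all
```

In this made-up example the motif-specific mode keeps only motif_A, while the loose global cutoff keeps all three hits, including the distant ones.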

As a general principle, keep in mind that there is a trade-off between selectivity and sensitivity for different E-value cutoffs. Stricter cutoffs sacrifice sensitivity in favor of selectivity, and vice versa.
The low complexity filter implements the CAST algorithm (Promponas et al. 2000), which detects low complexity regions and masks them selectively. This filter can improve the selectivity of the method: the sequence score takes into account all scoring domains, so a low-scoring domain repeated along the sequence could otherwise produce a false positive.

How should I evaluate my results?

Trusted results are ONLY those with a combined E-value below the weighted Minimum Error Point (see above) and with the E-values of the corresponding family motifs below the individual motif-specific E-value cutoffs. These matches are indicated with the "!" symbol in the Results page. Additionally, users are alerted to less significant matches with the "?" symbol (see the Example Output). All results that fail either of the criteria mentioned above while still producing significant E-values (less than 1 order of magnitude above the predefined cutoffs) are marginal and should be used with discretion. In addition, a sequence that does not produce significant results cannot safely be assumed to be a non-GPCR. There are GPCR families which have not been included in the PRED-GPCR classification system. This is the case for a few sparsely populated GPCR families and for some ill-characterised Orphan GPCR families.
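
The evaluation rules above can be summarised in a short sketch. It reflects one reading of the text, namely that a match is marginal when every E-value stays within one order of magnitude of its cutoff. The function name, motif names and motif thresholds are hypothetical; only the default combined cutoff of 0.004 comes from this page.

```python
def flag_family_match(combined_evalue, motif_evalues, motif_cutoffs,
                      combined_cutoff=0.004, margin=10.0):
    """Return "!" for a trusted match, "?" for a marginal one, "" otherwise.

    combined_evalue: combined family E-value.
    motif_evalues:   dict motif name -> individual motif E-value.
    motif_cutoffs:   dict motif name -> motif-specific cutoff (assumed values).
    margin:          factor of 10 = one order of magnitude above the cutoffs.
    """
    def passes(scale):
        # Both criteria must hold at the given scale of the cutoffs.
        return (combined_evalue < combined_cutoff * scale
                and all(e < motif_cutoffs[m] * scale
                        for m, e in motif_evalues.items()))

    if passes(1.0):
        return "!"
    if passes(margin):
        return "?"
    return ""

# Hypothetical motif and threshold.
cutoffs = {"motif_A": 1e-4}
print(flag_family_match(0.001, {"motif_A": 1e-5}, cutoffs))  # trusted
print(flag_family_match(0.02, {"motif_A": 1e-5}, cutoffs))   # marginal
print(flag_family_match(0.5, {"motif_A": 1e-5}, cutoffs))    # not flagged
```

An empty flag only means that no significant match was found; as noted above, it does not show that the sequence is a non-GPCR.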

Should I query the PRED-GPCR system with a sequence fragment?

Of course, if you feel lucky... Each motif in the PRED-GPCR library corresponds to a confined segment of the multiple sequence alignment of a protein family. Even if a protein family is represented in the PRED-GPCR library by more than one motif, these motifs could all correspond to regions of the protein other than the fragment you have in your hands. It is therefore quite probable that a fragment belonging to one of the GPCR families included in the PRED-GPCR system will not produce significant matches. So always keep in mind that the PRED-GPCR system is more effective when queried with whole sequences.

How are the family related Swiss-Prot and TrEMBL Entries gathered?

These entries are obtained automatically using the PRED-GPCR system. The Swiss-Prot and TrEMBL databases are regularly queried against the PRED-GPCR library. All sequences with a combined E-value below the weighted Minimum Error Point AND with motif matches below the individual motif-specific cutoffs are treated as trustworthy and assumed to be family members.


Biophysics & Bioinformatics Laboratory