University of Athens

Faculty of Biology

Dept. of Cell Biology and Biophysics

Liakopoulos, Th.D., Bagos, P.G., Alexopoulos, I.K. and Hamodrakas, S.J.

Department of Cell Biology and Biophysics,

Faculty of Biology, University of Athens

Panepistimiopolis, Athens 15701, Greece


A web tool to select biological sequences from a given set, with similarity/homology less than a user-defined level

Representative data sets are crucial in biological sequence analysis. Their applications include developing new predictive algorithms, testing the performance of existing methods and drawing statistical inferences based on amino acid or base composition. A data set used for such purposes is usually required to satisfy a predetermined redundancy level, meaning that it should not include any two sequences with similarity/homology higher than a predetermined value. We have created a web-based application that takes as input a set of N sequences and outputs a subset that meets a user-defined redundancy level. Initially, the algorithm runs an all-against-all BLAST alignment on the input data set and creates an NxN matrix of pairwise distances defined by the similarity percentages. Two sequences are considered neighbors when their alignment reaches both the similarity threshold and the minimum coverage length. In the next step, the algorithm removes the sequence with the largest number of neighbors, so that it is no longer counted as a neighbor of any other sequence in subsequent iterations. It then recomputes the number of neighbors of each remaining sequence and repeats the previous step until none of the remaining sequences has any neighbors. The user can specify the similarity (%) threshold and the minimum coverage length of the alignments. Sequences with a similarity below the threshold or a coverage smaller than the minimum length are not considered to be neighbors.
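
The iterative removal step described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function name select_nonredundant, the (id_a, id_b, similarity, aligned_length) tuple layout for the pairwise BLAST results, and the rule that ties are broken by input order are all assumptions made for illustration. Coverage is modeled here simply as the aligned length, which may differ from how the web tool measures it.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record for one pairwise alignment:
# (sequence id A, sequence id B, similarity in percent, aligned length).
Hit = Tuple[str, str, float, int]


def select_nonredundant(
    ids: List[str],
    hits: Iterable[Hit],
    min_similarity: float,
    min_coverage: int,
) -> List[str]:
    """Greedily drop the sequence with the most neighbors until none remain."""
    # Build the neighbor sets: a pair counts as neighbors only if it reaches
    # both the similarity threshold and the minimum coverage length.
    neighbors: Dict[str, Set[str]] = {s: set() for s in ids}
    for a, b, similarity, length in hits:
        if a == b or a not in neighbors or b not in neighbors:
            continue
        if similarity >= min_similarity and length >= min_coverage:
            neighbors[a].add(b)
            neighbors[b].add(a)

    while neighbors:
        # Sequence with the largest number of neighbors
        # (assumption: ties go to the sequence that appears first in ids).
        worst = max(neighbors, key=lambda s: len(neighbors[s]))
        if not neighbors[worst]:
            break  # no remaining sequence has a neighbor
        # Remove it and stop counting it as a neighbor of the others.
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)

    return [s for s in ids if s in neighbors]


if __name__ == "__main__":
    example_ids = ["A", "B", "C", "D"]
    example_hits = [
        ("A", "B", 90.0, 100),
        ("A", "C", 85.0, 120),
        ("B", "C", 40.0, 150),  # below the similarity threshold
        ("C", "D", 95.0, 20),   # below the minimum coverage length
    ]
    print(select_nonredundant(example_ids, example_hits, 50.0, 50))
    # prints ['B', 'C', 'D']
```

In this toy input, A has two neighbors (B and C) and is removed first; after that no remaining sequence has a neighbor, so B, C and D are kept. Removing the most connected sequence first tends to retain more sequences than discarding one member of each similar pair in arbitrary order.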