Protein Coding Regions Identification through the Modified Morlet Transform

Abstract
An important topic in biological sequence analysis is the identification of protein coding regions. This identification enables subsequent research into the meaning, description, or biological categorization of the analyzed organism. Currently, several methods combine pattern recognition with knowledge collected from training datasets of known genes or from comparison with genomic databases. Nonetheless, the accuracy of these methods is still far from satisfactory. New methods for processing DNA sequences and identifying genes can be built on search by content within such sequences. The periodic pattern of DNA in protein coding regions, called three-base periodicity (TBP), has been considered characteristic of coding regions; this phenomenon has not been observed in noncoding regions. Digital signal processing techniques provide a strong basis for identifying regions with TBP.
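To make TBP concrete, the following minimal sketch measures the spectral power of a DNA string at period 3 using a plain discrete Fourier coefficient. This is a classical signal processing measure and not the method proposed in this work; the function names (indicator_sequences, period3_power) and the toy sequences are assumptions introduced only for illustration.

```python
import cmath
import random


def indicator_sequences(dna):
    """Return one 0/1 sequence per base, marking where that base occurs."""
    return {b: [1 if c == b else 0 for c in dna] for b in "ACGT"}


def period3_power(dna):
    """Sum over the four bases of |Fourier coefficient at frequency 1/3|^2,
    divided by the sequence length."""
    total = 0.0
    for seq in indicator_sequences(dna).values():
        coeff = sum(x * cmath.exp(-2j * cmath.pi * n / 3)
                    for n, x in enumerate(seq))
        total += abs(coeff) ** 2
    return total / len(dna)


random.seed(0)
# Toy "coding-like" sequence (assumption): every codon starts with G,
# which creates a strong period-3 bias.
coding_like = "".join(
    "G" + random.choice("ACGT") + random.choice("ACGT") for _ in range(100)
)
# Toy "noncoding-like" sequence (assumption): bases drawn uniformly at random.
random_like = "".join(random.choice("ACGT") for _ in range(300))

print("period-3 power, coding-like:", period3_power(coding_like))
print("period-3 power, random-like:", period3_power(random_like))
```

In this toy setting the codon-biased sequence should show a much larger period-3 power than the uniformly random one, which is the kind of contrast a TBP-based detector relies on.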

In this work we introduce a new method for identifying protein coding regions with TBP, based on a new transform called the Modified Morlet Transform (MMT), which does not need to be trained on sequence databases. We use fixed binary mapping rules to create four binary sequences, each of which represents the positions of one nitrogenous base in the DNA sequence. Next, the MMT is applied at different scales to all binary sequences. The modulus of each normalized coefficient is projected onto the position axis. Projection onto the scale axis reveals which scales carry more signal energy across the positions. The result of the projection onto the position axis serves as the protein coding region identifier. These projection coefficients correspond to regions with TBP. We therefore threshold the coefficients to exclude positions where the associated energy is low.
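The steps above can be sketched roughly as follows. The abstract does not give the MMT formula, so this sketch substitutes a Gaussian-windowed complex exponential tuned to period 3 (a Morlet-like kernel) as a stand-in; the kernel, the normalization, the 50% threshold, and all function names are assumptions for illustration and should not be read as the actual MMT.

```python
import cmath
import math


def indicator_sequences(dna):
    """Fixed binary mapping: one 0/1 sequence per base."""
    return {b: [1.0 if c == b else 0.0 for c in dna] for b in "ACGT"}


def period3_kernel(scale):
    """Assumed stand-in kernel: Gaussian window times a period-3 complex
    exponential, truncated at 3 standard deviations."""
    half = 3 * scale
    return [math.exp(-0.5 * (k / scale) ** 2) * cmath.exp(2j * math.pi * k / 3)
            for k in range(-half, half + 1)]


def scalogram(dna, scales):
    """energy[i][n]: normalized coefficient modulus, summed over the four
    binary sequences, at scale index i and position n."""
    n_pos = len(dna)
    energy = []
    for s in scales:
        kernel = period3_kernel(s)
        half = len(kernel) // 2
        norm = sum(abs(w) for w in kernel)  # assumed normalization
        row = [0.0] * n_pos
        for seq in indicator_sequences(dna).values():
            for n in range(n_pos):
                acc = 0j
                for j, w in enumerate(kernel):
                    m = n + j - half
                    if 0 <= m < n_pos:  # ignore samples outside the sequence
                        acc += seq[m] * w.conjugate()
                row[n] += abs(acc) / norm
        energy.append(row)
    return energy


def project_and_threshold(energy, fraction=0.5):
    """Project onto the position and scale axes, then keep positions whose
    projected energy reaches `fraction` of the maximum (assumed rule)."""
    n_pos = len(energy[0])
    position_profile = [sum(row[n] for row in energy) for n in range(n_pos)]
    scale_profile = [sum(row) for row in energy]
    cutoff = fraction * max(position_profile)
    mask = [v >= cutoff for v in position_profile]
    return position_profile, scale_profile, mask


# Toy input (assumption): a period-3 repeat flanked by period-4 repeats.
toy = "ACGT" * 30 + "ACG" * 40 + "ACGT" * 30
profile, per_scale, mask = project_and_threshold(scalogram(toy, [10, 20]))
print("positions kept:", sum(mask), "of", len(toy))
```

The position projection acts as the detector, while the scale projection only reports which scales contribute most energy; the thresholding step then discards positions with comparatively low energy.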

The performance of the proposed transform was examined by analyzing synthetic and real DNA sequences. Preliminary results show that the MMT outperforms traditional methods, offering greater sensitivity to TBP and better discrimination of protein coding regions.

This work forms a main part of my master's thesis.


J. P. Mena-Chalco
Last modified: Mon Nov 28 21:10:09 BRST 2005