Bioinformatics tool for allergenicity prediction based on a novel descriptor fingerprint approach
Each amino acid in the protein sequences of the data sets was described by five E-descriptors, and the resulting strings were transformed into uniform-length vectors by auto-cross covariance (ACC) transformation.
The E-descriptors for the 20 naturally occurring amino acids, defined by Venkatarajan and Braun (J. Mol. Model (2001) 7:445–453), were derived by principal component analysis of a data matrix consisting of 237 physicochemical properties. The first principal component (E1) reflects the hydrophobicity of amino acids; the second (E2) – their size; the third (E3) – their helix-forming propensity; the fourth (E4) correlates with the relative abundance of amino acids; and the fifth (E5) is dominated by the β-strand forming propensity.
An auto-cross covariance (ACC) transformation was used to make the length of the proteins uniform. ACC is a protein sequence mining method developed by Wold et al. (Anal. Chim. Acta 1993; 277:239-253).
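The ACC step can be illustrated with a minimal sketch. It rests on several assumptions: the covariance sums are normalised by (n - lag), which is one common form of the transformation; the function name acc_transform and the table E_PLACEHOLDER are hypothetical; and the descriptor values are random placeholders, not the published E-descriptors. With 5 descriptors there are 25 descriptor pairs per lag, so a maximum lag of 15 (an assumed reading of the 25 x 15 variables) gives 375 values for a protein of any length.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder descriptor table: random numbers for illustration only,
# NOT the E-descriptor values published by Venkatarajan and Braun.
_rng = random.Random(0)
E_PLACEHOLDER = {aa: [_rng.gauss(0, 1) for _ in range(5)] for aa in AMINO_ACIDS}


def acc_transform(sequence, descriptors, max_lag):
    """Return a fixed-length ACC vector of d * d * max_lag values.

    For each lag and each ordered descriptor pair (j, k) the value is
    sum_i x[i][j] * x[i + lag][k] / (n - lag).
    Pairs with j == k are auto-covariances, the others cross-covariances.
    """
    x = [descriptors[aa] for aa in sequence]  # n rows of d descriptors
    n, d = len(x), len(x[0])
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    acc = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(x[i][j] * x[i + lag][k] for i in range(n - lag))
                acc.append(total / (n - lag))
    return acc
```

Two sequences of different lengths passed through acc_transform with the same max_lag yield vectors of the same length, which is what allows them to be stacked as rows of one matrix.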
The subsets of allergens and non-allergens were transformed into matrices with 25 x 15 variables each. The derived matrix consisted of 4854 rows (2427 allergens and 2427 non-allergens) and 25 x 15 columns. Each column was divided into 11 intervals and a 25 x 15 x 11-digit binary fingerprint was generated for each protein. A digit in the fingerprint equals 1 if the ACC value falls into the corresponding interval; otherwise, it equals 0. Thus, each protein has a unique binary fingerprint consisting of 25 x 15 ones and (25 x 15 x 11 – 25 x 15) zeros. Tanimoto coefficients were calculated for all protein pairs in the set. A protein was classified as an allergen or non-allergen according to the class of its partner in the pair with the highest Tanimoto coefficient.
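The fingerprint and nearest-neighbour steps can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the text does not say how the 11 intervals are chosen, so equal-width intervals between each column's minimum and maximum are assumed here, and all function names are hypothetical. Each fingerprint is stored as the set of positions of its 1-digits, so the Tanimoto coefficient is the number of shared positions divided by the number of positions set in either fingerprint.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width interval (assumed binning) containing value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp so the column maximum lands in the last bin


def fingerprints(matrix, n_bins=11):
    """Turn each row of ACC values into the set of its 'on' bit positions.

    Column c contributes exactly one bit, at position c * n_bins + interval.
    """
    n_cols = len(matrix[0])
    lows = [min(row[c] for row in matrix) for c in range(n_cols)]
    highs = [max(row[c] for row in matrix) for c in range(n_cols)]
    return [
        {c * n_bins + bin_index(row[c], lows[c], highs[c], n_bins)
         for c in range(n_cols)}
        for row in matrix
    ]


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def classify(query, reference_fps, reference_labels):
    """Return the label and similarity of the most similar reference fingerprint."""
    best = max(range(len(reference_fps)),
               key=lambda i: tanimoto(query, reference_fps[i]))
    return reference_labels[best], tanimoto(query, reference_fps[best])
```

Because every fingerprint has the same number of 1-digits (one per column), two proteins sharing c intervals out of m columns have a Tanimoto coefficient of c / (2m - c).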