Multi-view Methods for Protein Structure Comparison using Latent Dirichlet Allocation
  
Motivation: With protein structure databases expanding rapidly, efficiently retrieving structures similar to a given protein is an important problem. It involves two major issues: (i) an effective protein structure representation that captures the inherent relationships between fragments and facilitates efficient comparison between structures, and (ii) an effective framework that addresses different retrieval requirements. Recently, researchers proposed a vector space model of proteins using a bag-of-fragments representation (FragBag), which corresponds to the basic information retrieval model.
Results: In this paper we propose an improved representation of protein structures using the Latent Dirichlet Allocation (LDA) topic model. Another important requirement is to retrieve proteins whether they are close or remote homologs. To meet these diverse objectives, we propose a multi-viewpoint framework that combines multiple representations and retrieval techniques. We evaluate the proposed representation and retrieval framework on the benchmark dataset developed by Kolodny and co-workers. The results indicate that the proposed techniques outperform state-of-the-art methods.

Steps to recreate and run the experiments
  • Create a bag-of-words (BoW) representation from the protein structure dataset and a library of fragments. For each protein, the coordinates of the alpha-carbon atoms from its PDB file are compared against the three-dimensional coordinates of the library fragments. This is done along the whole length of the structure in overlapping windows: for each window, the closest matching fragment is identified and added to the BoW.
  • Construct term-frequency (TF), boolean, and term frequency-inverse document frequency (TF-IDF) vectors from the BoW representation. Use cosine similarity to compare proteins.
  • Fit an LDA model on the BoW representation. One of its outputs is the document (protein) × topic distribution. Use each protein's topic probability vector as its feature vector, and compute similarity between proteins using the asymmetric KL (Kullback-Leibler) divergence.
  • The similarities from the two models are combined as Lambda_1 * cosine similarity + Lambda_2 * KL similarity, where Lambda_1 and Lambda_2 each lie between 0 and 1 and sum to 1 (multi-view IR).
  • The topic distribution vector is also used for classification and clustering (with the Weka tool).
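
The similarity steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the code used in the experiments. All function names are hypothetical, the smoothed IDF formula is one common variant, and the mapping from KL divergence to a "KL similarity" (here exp(-KL)) is an assumption, since the steps do not say how the divergence is converted to a similarity. The LDA topic distributions are taken as given inputs; fitting the model itself is not shown.

```python
import numpy as np


def tfidf(counts):
    """Turn a (proteins x fragments) count matrix into TF-IDF vectors."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Number of proteins in which each fragment occurs at least once.
    doc_freq = np.count_nonzero(counts, axis=0)
    # Smoothed IDF (an assumed variant; the original weighting is not specified).
    idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0
    return counts * idf


def cosine_similarity(a, b):
    """Cosine of the angle between two fragment-count (or TF-IDF) vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def kl_divergence(p, q, eps=1e-12):
    """Asymmetric KL divergence KL(p || q) between two topic distributions."""
    # A small epsilon avoids log(0); both vectors are then renormalized.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def kl_similarity(p, q):
    """Map KL divergence into (0, 1]; exp(-KL) is an assumed choice."""
    return float(np.exp(-kl_divergence(p, q)))


def multi_view_similarity(bow_a, bow_b, topics_a, topics_b, lambda_1=0.5):
    """Lambda_1 * cosine similarity + Lambda_2 * KL similarity, Lambda_2 = 1 - Lambda_1."""
    if not 0.0 <= lambda_1 <= 1.0:
        raise ValueError("lambda_1 must lie in [0, 1]")
    lambda_2 = 1.0 - lambda_1
    return (lambda_1 * cosine_similarity(bow_a, bow_b)
            + lambda_2 * kl_similarity(topics_a, topics_b))


# Toy data: 3 proteins, 4 library fragments, 2 topics (made-up numbers).
bow_counts = np.array([[3, 0, 1, 0],
                       [2, 1, 0, 0],
                       [0, 0, 4, 2]])
topic_dists = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.1, 0.9]])

vectors = tfidf(bow_counts)
query = 0
for other in (1, 2):
    score = multi_view_similarity(vectors[query], vectors[other],
                                  topic_dists[query], topic_dists[other],
                                  lambda_1=0.4)
    print(f"protein {query} vs protein {other}: {score:.3f}")
```

Because the KL divergence is asymmetric, the score for (query, candidate) generally differs from the score for (candidate, query), so the argument order should match the retrieval direction. Lambda_1 can be tuned on held-out queries to trade off the two views.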