Multi-view Methods for Protein Structure Comparison using Latent Dirichlet Allocation
Motivation: With rapidly expanding protein structure databases,
efficiently retrieving structures similar to a given protein is an
important problem. It involves two major issues: (i) an effective protein
structure representation that captures the inherent relationships between
fragments and facilitates efficient comparison between structures, and
(ii) an effective framework to address different retrieval requirements.
Recently, researchers proposed a vector space model of proteins using
a bag-of-fragments representation (FragBag), which corresponds to the
basic information retrieval model.
Results: In this paper we propose an improved representation
of protein structures using the Latent Dirichlet Allocation (LDA) topic
model. Another important requirement is to retrieve proteins whether
they are close or remote homologs. To meet these
diverse objectives, we propose a multi-view framework
that combines multiple representations and retrieval techniques. We
evaluate the proposed representation and retrieval framework on
the benchmark dataset developed by Kolodny and co-workers. The
results indicate that the proposed techniques outperform
state-of-the-art methods.
Steps to recreate and run the experiments
- Create a bag-of-words (BoW) representation from the protein structure dataset and the library of fragments. For each protein, the coordinates of the alpha-carbon (C-alpha) atoms are read from the PDB file and compared against the three-dimensional representation of each library fragment. This is done over the whole length of the protein structure in overlapping windows: for each window, the closest matching fragment is identified and added to the BoW.
- Construct term-frequency (TF), boolean, and term frequency-inverse document frequency (TF-IDF) vectors from the BoW representation. Use cosine similarity (the cosine of the angle between two vectors) to compute the similarity between proteins.
- Build an LDA model on the BoW representation. One of its outputs is a protein-by-topic (document x topic) distribution. Use each protein's topic probability vector as its feature vector. Compute the similarity between proteins using the asymmetric KL divergence.
- Combine the similarities from the two models as Lambda_1 * cosine similarity + Lambda_2 * KL similarity, where Lambda_1 and Lambda_2 each lie between 0 and 1 and sum to 1. This is the multi-view IR step.
- The topic distribution vector is also used for classification and clustering (with the Weka tool).
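
The scoring steps above (cosine similarity on term-frequency vectors, asymmetric KL divergence on topic distributions, and the Lambda-weighted combination) can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the experiments: all function names are hypothetical, the fragment ids and topic vectors are made-up toy values, and fragment matching and LDA fitting are not shown. The steps do not say how the KL divergence is turned into a "KL similarity", so the sketch assumes exp(-D) for that mapping; any other decreasing function of the divergence could be substituted.

```python
import math
from collections import Counter


def tf_vector(bag, vocab_size):
    """Term-frequency vector: how often each fragment id occurs in one bag."""
    counts = Counter(bag)
    return [float(counts.get(i, 0)) for i in range(vocab_size)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-12):
    """Asymmetric KL divergence D(p || q) between two topic distributions.

    eps guards against log(0) and is an assumption of this sketch.
    """
    total = sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )
    return max(0.0, total)  # clamp tiny negative values caused by eps


def kl_similarity(p, q):
    """ASSUMPTION: map the divergence to a similarity in (0, 1] via exp(-D)."""
    return math.exp(-kl_divergence(p, q))


def multi_view_score(cos_sim, kl_sim, lambda_1=0.5):
    """Lambda_1 * cosine similarity + Lambda_2 * KL similarity, Lambda_2 = 1 - Lambda_1."""
    if not 0.0 <= lambda_1 <= 1.0:
        raise ValueError("lambda_1 must lie between 0 and 1")
    lambda_2 = 1.0 - lambda_1
    return lambda_1 * cos_sim + lambda_2 * kl_sim


# Toy data: bags of fragment ids from a 4-fragment library (hypothetical).
query_bag = [0, 2, 2, 3]
target_bag = [0, 2, 3, 3]
cos_sim = cosine_similarity(tf_vector(query_bag, 4), tf_vector(target_bag, 4))

# Toy per-protein topic distributions, as an LDA model might produce.
query_topics = [0.7, 0.2, 0.1]
target_topics = [0.6, 0.3, 0.1]
kl_sim = kl_similarity(query_topics, target_topics)

score = multi_view_score(cos_sim, kl_sim, lambda_1=0.5)
print(f"cosine={cos_sim:.4f} kl_similarity={kl_sim:.4f} combined={score:.4f}")
```

Because the KL divergence is asymmetric, `kl_similarity(query, target)` and `kl_similarity(target, query)` generally differ, so the order of the arguments matters when ranking database proteins against a query.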