Predicting Protein Abundance using Matrix Factorization and Biological Networks
Accurate protein abundance measurements are important for understanding cancer biology. However, due to the limitations of current mass spectrometry, proteomic datasets contain many missing values. Moreover, most existing datasets contain only genomic variation and mRNA expression information and provide no protein abundance information. To address these issues, the Clinical Proteomic Tumor Analysis Consortium (CPTAC) launched a community-based collaborative competition, the NCI-CPTAC DREAM Proteogenomics Challenge, to develop computational tools that extract protein abundance information from the available data. In this talk, we introduce our model that won first place in the first challenge problem, where participants were asked to predict the abundances of missing proteins from the abundances of the observed proteins. Our model uses a matrix factorization method to impute the missing values. We also present another model that took second place in the second challenge problem, where participants were asked to predict protein abundance using only genomic variation and mRNA expression data. We fitted this model using LASSO with additional biological features, including protein-protein interaction networks, gene regulatory networks inferred from the given mRNA expression data, and protein complex networks.
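
To make the imputation idea concrete, the following is a minimal sketch of generic low-rank matrix factorization fitted by stochastic gradient descent on the observed entries only. It is an illustration under stated assumptions, not the challenge-winning model: the function name impute_by_factorization, the rank, the learning rate, the regularization strength, and the use of None to mark missing values are all hypothetical choices made for this example.

```python
import random


def impute_by_factorization(matrix, rank=2, lr=0.02, reg=0.01, epochs=2000, seed=0):
    """Fill None entries of `matrix` (a list of rows) with a low-rank fit X ~ P Q^T.

    Hypothetical helper for illustration; hyperparameters are assumptions.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])

    # Small random initial factors: one latent vector per row and per column.
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]

    # Only observed entries contribute to the loss.
    observed = [
        (i, j, matrix[i][j])
        for i in range(n_rows)
        for j in range(n_cols)
        if matrix[i][j] is not None
    ]

    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, x in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = x - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)

    # Keep observed values as they are; fill missing ones with the low-rank prediction.
    return [
        [
            matrix[i][j]
            if matrix[i][j] is not None
            else sum(P[i][k] * Q[j][k] for k in range(rank))
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]
```

In this sketch the rows could stand for proteins and the columns for tumor samples, so each missing abundance is predicted from the latent factors learned on the measured entries. A practical pipeline would likely also center or scale the data and choose the rank by cross-validation.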