The Purdue Proteome Discovery Pipeline
Posted 16 Jun, 2009 in Series
| Contributor(s) | Ann Christine Catlin, Rosen Center for Advanced Computing; George Howlett, Purdue University |
|---|---|
| Abstract | Proteomics approaches enable interrogation of large numbers of molecules to provide a more comprehensive understanding of biological systems. High-throughput proteomics uses liquid chromatography-mass spectrometry (LC-MS) for data acquisition. Bioinformatic analysis tools are essential to manage and mine the resulting high-volume proteomics datasets. Data analysis is currently a bottleneck for many proteomics researchers because complete, freely accessible, ready-made analysis systems are not available. In addition, most analysis systems require input from an experienced bioinformatician. For proteomics to achieve its greatest possible impact in biology, data analysis must become more efficient and effective. The Proteome Discovery Pipeline, developed through a collaboration of Purdue University's Bindley Bioscience Center and Cyber Center, provides complete proteomics data analysis, including spectrum deconvolution, alignment, normalization, statistical significance tests, and pattern recognition. The Discovery Pipeline has been fully integrated into cceHUB and can access biological datasets from the cceHUB data repository as well as from a user's cceHUB home folders.
The overall framework for differential proteomics encompasses data pre-processing, protein identification, protein quantification, and analysis of protein networks. Data pre-processing includes mass spectral deconvolution and peak alignment. The protein identification component identifies the proteins corresponding to the analyzed peptides and reports the statistical significance of each identification. Peak normalization is required for protein quantification. Additional statistical significance tests discriminate differentially expressed proteins, and pattern recognition algorithms assist researchers in classifying detected and differentially expressed proteins. The Discovery Pipeline can process datasets in NetCDF, mzData, and mzXML formats.
The cceHUB Discovery Pipeline
Spectrum Deconvolution Tool
Peak Alignment Tool
Normalization Tool
The Normalization Tool allows multi-experiment analyses by normalizing the data for sample comparison. It attempts to quantitatively filter out overall peak intensity variations caused by experimental errors, such as systematic variation in the injection volume loaded onto the LC-MS system. Several normalization methods have been incorporated into the tool. One method chooses an analysis run as a reference and sequentially normalizes all others relative to this reference. The intensity ratio of each aligned peak pair in the reference and the sample is calculated. The normalization constant for the sample being considered is then taken as the median of the intensity ratios over all components shared by that sample and the reference sample. A second method normalizes the data by dividing the intensity at each m/z value by the average intensity of the entire spectrum. A third, the log-linear model method, assumes primarily multiplicative variation. The maximum likelihood and maximum a posteriori estimates of the parameters characterizing the multiplicative variation are derived to compute the scaling factors needed for normalization. All of these algorithms are implemented in the Normalization Tool so that the user can choose based on the nature of the data.
Significance Testing Tool
The Significance Testing Tool identifies peptide or metabolite peaks that either make significant contributions to the molecular profile of a sample or distinguish a group of samples from others. Some peaks may be present in multiple sample groups while differing in intensity between the groups. Such a quantitative difference describes a peak that is present in most (or all) of the samples but has different intensities between the groups. The standard two-sample t-test and the Wilcoxon-Mann-Whitney rank test are implemented in the Significance Testing Tool to compare the groups.
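As a rough illustration of the median-ratio normalization and the two-sample t-test described above, here is a minimal Python sketch. It is not the Discovery Pipeline's own code. The function names, the direction of the ratio (sample divided by reference), the equal-variance form of the t statistic, and all intensity values are assumptions made for the example. The rank test is not shown.

```python
from math import sqrt
from statistics import mean, median, variance


def median_ratio_factor(reference, sample):
    """Median of sample/reference intensity ratios over aligned peak pairs.

    Assumption: both lists are already aligned, so position i in each list
    refers to the same peak. Pairs with a non-positive intensity are skipped.
    """
    ratios = [s / r for r, s in zip(reference, sample) if r > 0 and s > 0]
    if not ratios:
        raise ValueError("no aligned peak pairs with positive intensity")
    return median(ratios)


def normalize_to_reference(reference, sample):
    """Rescale a sample so that its median ratio to the reference becomes 1."""
    factor = median_ratio_factor(reference, sample)
    return [s / factor for s in sample]


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic (equal-variance form) and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Made-up intensities for four aligned peaks; the sample run is roughly
# twice as intense as the reference, as if more material had been injected.
reference_run = [100.0, 200.0, 400.0, 800.0]
sample_run = [210.0, 390.0, 800.0, 1640.0]
normalized_run = normalize_to_reference(reference_run, sample_run)

# Made-up normalized intensities of one peak across two sample groups.
t_stat, dof = pooled_t_statistic([10.0, 12.0, 11.0, 13.0], [15.0, 17.0, 16.0, 18.0])
```

The median is used for the scaling factor because a few strongly changed peaks barely move it, whereas a mean ratio would be pulled toward them. Turning the t statistic into a p-value requires the t distribution with `dof` degrees of freedom, which a statistics library would normally supply.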
Pattern Recognition Tool
The Pattern Recognition Tool implements the two main categories of pattern recognition, supervised and unsupervised. Supervised systems require knowledge or data in which the outcome or classification is known ahead of time, so that the system can be trained to recognize and distinguish outcomes. Unsupervised systems cluster or group records without previous knowledge of the outcome or classification. The most frequently used unsupervised pattern recognition approach is principal component analysis (PCA). Other unsupervised methods include hierarchical clustering, k-means, and self-organizing maps (SOM).
Tool descriptions are extracted from the Discovery Pipeline journal article cited below. |
| Credits | The Bindley Bioscience Center, Purdue University; eEnterprise Center, Purdue University; Cyber Center, Purdue University |
| Citations | The Proteome Discovery Pipeline - A Data Analysis Pipeline for Mass-Spectrometry-Based Differential Proteomics. Catherine P. Riley, Erik S. Gough, Jing He, Shrinivas S. Jandhyala, Brad Kennedy, Seza Orcun, Mourad Ouzzani, Charles Buck, and Xiang Zhang. |
| Cite this work | If you reference this work in a publication, please cite as follows:
|