The Purdue Proteome Discovery Pipeline

Posted 16 Jun, 2009 in Series

Contributor(s) Ann Christine Catlin
Rosen Center for Advanced Computing

George Howlett
Purdue University
Abstract

Proteomics approaches enable interrogation of large numbers of molecules to provide a more comprehensive understanding of biological systems. High-throughput proteomics uses liquid chromatography-mass spectrometry (LC-MS) technology for data acquisition. Bioinformatic analysis tools are essential to manage and mine the resulting high-volume proteomics datasets. Data analysis is currently a bottleneck for many proteomics researchers because complete, freely accessible, ready-made systems are not available. In addition, most analysis systems require input from an experienced bioinformatician. For proteomics to achieve its greatest possible impact in biology, data analysis must become more efficient and effective.



The Proteome Discovery Pipeline, developed through a collaboration of Purdue University's Bindley Bioscience Center and Cyber Center, provides complete proteomics data analysis, including spectrum deconvolution, alignment, normalization, statistical significance tests, and pattern recognition. The Discovery Pipeline has been fully integrated into cceHUB and can access biological datasets from the cceHUB data repository as well as from a user's cceHUB home folders.


The overall framework for differential proteomics encompasses data pre-processing, protein identification, protein quantification, and analysis of protein networks. Data pre-processing includes mass spectral deconvolution and peak alignment. The protein identification component identifies the proteins corresponding to analyzed peptides and reports the statistical significance of each identification. Peak normalization is required for protein quantification. Additional statistical significance tests discriminate differentially expressed proteins, and pattern recognition algorithms assist researchers in classifying detected and differentially expressed proteins. The Discovery Pipeline can process datasets in NetCDF, mzData, and mzXML formats.

The cceHUB Discovery Pipeline

Spectrum Deconvolution Tool

The Spectrum Deconvolution Tool distinguishes signals arising from the real analyte from signals arising from contaminants or instrumental noise. This reduces data dimensionality, which benefits downstream statistical analysis. Spectrum deconvolution extracts peak information from thousands of raw mass spectra and reports it in a simple peak table.

The tool provides chemical noise filtering, charge state fitting, and de-isotoping for the analysis of complex peptide samples. Overlapping peptide signals in mass spectra are deconvoluted by correlating the observed spectrum with modeled peptide isotopic peak profiles. Isotopic peak profiles for peptides are generated in silico from a protein database, producing reference model distributions. The tool can also analyze metabolomics data generated on an LC-MS analytical platform. The newest version of the tool adds analysis of data from low-resolution MS instruments, deconvolution of overlapping mass spectral peaks, identification of doublets, and calculation of doublet ratios.
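The correlation step described above can be illustrated with a minimal sketch. It is not the tool's implementation: the function names, the two model profiles, and the observed intensities are all hypothetical, and Pearson correlation is assumed as the similarity measure because the source does not name one.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length intensity lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / (spread_x * spread_y)

def best_matching_profile(observed, model_profiles):
    """Return the name and score of the model profile most correlated with `observed`."""
    scores = {name: pearson(observed, profile)
              for name, profile in model_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical relative isotopic peak intensities (monoisotopic peak first).
model_profiles = {
    "light peptide": [1.00, 0.55, 0.20, 0.05],
    "heavy peptide": [0.60, 1.00, 0.85, 0.50],
}
# Hypothetical observed intensities at the four isotope positions.
observed = [980.0, 560.0, 190.0, 60.0]

best, score = best_matching_profile(observed, model_profiles)
print(best, round(score, 3))
```

In this toy case the observed cluster decays from the monoisotopic peak, so it correlates far better with the "light" profile. A real deconvolution would also have to fit charge state and separate overlapping clusters, which this sketch leaves out.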

Peak Alignment Tool

The Peak Alignment Tool aligns the same peptides (or metabolites) detected in different samples. Ideally, the same peptide or metabolite detected on the same analytical system should produce the same signal. For example, for a peptide measured on an LC-MS system, retention time and molecular weight should be the same in different samples. In practice, experimental variation means this is not always the case. Peak alignment recognizes peaks from the same molecule occurring in different samples among the millions of peaks detected during the course of an experiment.

The Peak Alignment Tool uses a two-step alignment approach. The first step addresses systematic retention time shift by recognizing and aligning significant peaks. A significant peak refers to a peak that is present in every sample and is the most intense peak in a certain m/z and retention time range. Discrete convolution is used in the second step to align overlapped peaks.
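A simplified sketch of the first step is given here. It assumes the significant peaks have already been selected, represents each as a hypothetical `(m/z, retention time)` pair, and models the systematic shift as a single constant offset. The actual tool may use a more flexible correction, and the second (discrete convolution) step is not shown.

```python
from statistics import median

def estimate_rt_shift(ref_landmarks, sample_landmarks, mz_tol=0.01):
    """Median retention-time offset between landmark peaks matched by m/z.

    Each landmark is an (mz, rt) tuple. `mz_tol` is an assumed matching
    tolerance, not a value taken from the pipeline.
    """
    offsets = []
    for ref_mz, ref_rt in ref_landmarks:
        # Candidate sample peaks whose m/z lies within the tolerance.
        candidates = [(abs(mz - ref_mz), rt)
                      for mz, rt in sample_landmarks
                      if abs(mz - ref_mz) <= mz_tol]
        if candidates:
            _, rt = min(candidates)  # closest m/z wins
            offsets.append(rt - ref_rt)
    if not offsets:
        raise ValueError("no landmark peaks could be matched")
    return median(offsets)

def correct_rt(peaks, shift):
    """Subtract the estimated shift from every peak's retention time."""
    return [(mz, rt - shift) for mz, rt in peaks]

# Hypothetical landmark peaks: (m/z, retention time in minutes).
reference = [(500.250, 10.0), (620.300, 20.0), (745.400, 30.0)]
sample = [(500.251, 10.5), (620.302, 20.4), (745.399, 30.6)]

shift = estimate_rt_shift(reference, sample)
aligned = correct_rt(sample, shift)
print(shift, aligned)
```

The median is used rather than the mean so that one mismatched landmark does not distort the estimated shift.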

Normalization Tool

The Normalization Tool allows multi-experiment analyses by normalizing the data for sample comparison. It attempts to quantitatively filter out overall peak intensity variations caused by experimental errors, such as systematic variation in the injection volumes loaded onto the LC-MS.

Several normalization methods have been incorporated into the tool. One method chooses an analysis run as a reference and sequentially normalizes all others relative to this reference. The intensity ratio of each aligned peak pair in the reference and the sample is calculated. The normalization constant for the sample is then taken as the median of these intensity ratios over all components shared by the sample and the reference. A second method normalizes the data by dividing the intensity at each m/z value by the average intensity of the entire spectrum. A third, the log-linear model method, assumes primarily multiplicative variation: maximum likelihood and maximum a posteriori estimates of the parameters characterizing that variation are derived to compute the scaling factors needed for normalization. All of these algorithms are implemented in the Normalization Tool so that the user can choose one based on the nature of the data.
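The first two methods can be sketched in a few lines. This is an illustration under stated assumptions rather than the tool's code: the function names and intensities are hypothetical, the ratio is taken as sample over reference, and the sample is divided by the resulting constant. The source does not specify the direction of the ratio.

```python
from statistics import median

def median_ratio_normalize(reference, sample):
    """Normalize `sample` against `reference` using the median intensity ratio.

    Both arguments are lists of intensities for the same aligned peaks,
    in the same order. Pairs with a non-positive intensity are skipped
    when computing the constant.
    """
    ratios = [s / r for r, s in zip(reference, sample) if r > 0 and s > 0]
    if not ratios:
        raise ValueError("no aligned peak pairs with positive intensity")
    constant = median(ratios)
    return constant, [s / constant for s in sample]

def mean_intensity_normalize(spectrum):
    """Divide every intensity by the average intensity of the spectrum."""
    average = sum(spectrum) / len(spectrum)
    return [x / average for x in spectrum]

# Hypothetical aligned peak intensities; the sample was loaded ~1.5x heavier.
reference = [100.0, 200.0, 400.0, 800.0]
sample = [150.0, 310.0, 590.0, 1200.0]

constant, normalized = median_ratio_normalize(reference, sample)
print(constant)      # normalization constant for this sample
print(normalized)    # sample intensities on the reference scale
print(mean_intensity_normalize([1.0, 2.0, 3.0]))
```

Taking the median of the ratios makes the constant robust to the minority of peaks that genuinely differ between the two runs, which is the property that makes this method suitable before differential analysis.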

Significance Testing Tool

The Significance Testing Tool identifies peptide or metabolite peaks that either make significant contributions to the molecular profile of a sample or distinguish a group of samples from others. Some peaks are present in most (or all) of the samples in several groups but differ in intensity between those groups. Such a peak represents a quantitative difference between the groups.

The standard two-sample t-test and the Wilcoxon-Mann-Whitney rank test are implemented in the Significance Testing Tool to compare groups.
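As a rough illustration of the two tests, the sketch below computes only the test statistics for one peak measured in two groups. It assumes the pooled-variance form of the t-test, uses hypothetical function names and intensities, and leaves out p-values, which require the t or U reference distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

def mann_whitney_u(a, b):
    """U statistic for group `a`: pairs where a beats b, ties count one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

# Hypothetical normalized intensities of one peak in two sample groups.
control = [10.0, 12.0, 11.0, 13.0]
treated = [20.0, 22.0, 21.0, 23.0]

t, df = pooled_t_statistic(control, treated)
u = mann_whitney_u(control, treated)
print(round(t, 3), df, u)
```

The rank-based U statistic depends only on the ordering of intensities, so it is less sensitive to outlying peaks than the t statistic, at the cost of some power when the data are close to normal.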

Pattern Recognition Tool

The Pattern Recognition Tool implements the two main categories of pattern recognition, supervised and unsupervised.

Supervised systems require knowledge or data in which the outcome or classification is known ahead of time, so that the system can be trained to recognize and distinguish outcomes. Unsupervised systems cluster or group records without previous knowledge of outcome or classification. The most frequently used unsupervised pattern recognition approach is principal component analysis (PCA). Other unsupervised methods include hierarchical clustering, k-means, and self-organizing maps (SOM).
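To make the unsupervised case concrete, here is a minimal PCA sketch that finds only the first principal component by power iteration on the covariance matrix. The data table, function name, and starting vector are assumptions for illustration. A production analysis would normally use a linear algebra library and extract several components.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return the first principal axis and each row's score along it.

    `rows` is a samples-by-peaks table of normalized intensities.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the peak intensities.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-symmetric start vector (an assumption of this sketch).
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # data have no variance along the current direction
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical table: three samples from each of two groups, three peaks.
table = [
    [10.0, 1.0, 5.0], [11.0, 1.2, 5.1], [9.5, 0.9, 4.9],   # group A
    [2.0, 8.0, 5.0], [2.5, 8.4, 5.2], [1.8, 7.9, 4.8],     # group B
]

axis, scores = first_principal_component(table)
print([round(x, 3) for x in axis])
print([round(s, 2) for s in scores])
```

No group labels are passed in, yet the scores of the two groups fall on opposite sides of zero, which is the sense in which PCA groups records without prior knowledge of classification. The sign of the axis is arbitrary, so only the separation is meaningful.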



Tool descriptions are extracted from the Discovery Pipeline journal article cited below.

Credits
The Bindley Bioscience Center, Purdue University
eEnterprise Center, Purdue University
Cyber Center, Purdue University

Citations
The Proteome Discovery Pipeline - A Data Analysis Pipeline for Mass-Spectrometry-Based Differential Proteomics. Catherine P. Riley, Erik S. Gough, Jing He, Shrinivas S. Jandhyala, Brad Kennedy, Seza Orcun, Mourad Ouzzani, Charles Buck, and Xiang Zhang.

Cite this work

If you reference this work in a publication, please cite as follows:

    The Proteome Discovery Pipeline - A Data Analysis Pipeline for Mass-Spectrometry-Based Differential Proteomics. Catherine P. Riley, Erik S. Gough, Jing He, Shrinivas S. Jandhyala, Brad Kennedy, Seza Orcun, Mourad Ouzzani, Charles Buck, and Xiang Zhang.
  • Ann Christine Catlin; George Howlett (2009), "The Purdue Proteome Discovery Pipeline," http://ccehub.org/resources/263.


Tags
  1. alignment
  2. deconvolution
  3. mass spectrometry
  4. normalization
  5. proteome discovery pipeline
  6. proteomics

In This Series

  1. Pattern Recognition for Normalized LC-MS Data

    01 Jul. 2009 | Tools | Contributor(s): Ann Christine Catlin, George Howlett

    This tool provides principal component analysis (PCA), linear discriminant analysis (LDA), and canonical discriminant analysis (CDA) for data clustering on aligned, normalized LC-MS datasets.

  2. Peak Alignment of LC-MS Data

    24 Jun. 2009 | Tools | Contributor(s): Ann Christine Catlin, George Howlett

    Peak alignment addresses retention time shift by recognizing and aligning significant peaks; it then uses discrete convolution to align overlapped peaks.

  3. Significance Testing of Normalized LC-MS Data

    23 Jun. 2009 | Tools | Contributor(s): Ann Christine Catlin

    Several statistical significance tests are employed to identify peptide or metabolite peaks that either make significant contributions to the molecular profile of a sample or distinguish a group of samples from others.

  4. Normalization of Aligned LC-MS Data

    18 Jun. 2009 | Tools | Contributor(s): Ann Christine Catlin, George Howlett

    Normalization attempts to quantitatively filter out overall peak intensity variations caused by experimental errors, such as systematic variation in the injection volumes loaded onto the LC-MS.

  5. Spectrum Deconvolution of LC-MS Data

    03 Dec. 2008 | Tools | Contributor(s): Ann Christine Catlin, George Howlett

    Spectral deconvolution differentiates analyte signals from contaminants or instrumental noise, and reduces data dimensionality to benefit downstream statistical analysis.