Protein Prospector Converter #39

Open
jcmaynard opened this issue Nov 17, 2022 · 13 comments

@jcmaynard

Hi,

Would it be possible to add a converter for Protein Prospector output?

Cheers!

devonjkohler added the "enhancement" label Oct 19, 2023

@devonjkohler
Contributor

Hi @jcmaynard

I haven't seen the output of Protein Prospector. Could you send me some data for this? I can look into the converter once I have the data.

Devon

@jcmaynard
Author

Hi Devon,

I can send you some Prospector output: one report with phospho data and one with global proteins. Where is the best place to send it?

Cheers,

Jason

@devonjkohler
Contributor

Hey @jcmaynard,

Would you be able to share them via email: [email protected]

Devon

@tonywu1999
Contributor

tonywu1999 commented Jul 6, 2024

Hi @jcmaynard

I have started writing the code for the Protein Prospector converter. Devon shared your dataset with me, but I am a little confused about the data format.

  • Where is the RAW file name located in the output? I saw a cell containing text that looks like a file name: Z20180606_YvA_TotalRPLC/SW201948rc2mc2mm. Is that the RAW file name? If so, what happens when you analyze a dataset with multiple MS runs (e.g. multiple TMT mixtures)? Would each MS run be its own TXT output file?
  • Which column represents the precursor charge? Is it z? With other tools, I usually see a column called Charge.

@jcmaynard
Author

Hi @tonywu1999

Here is the manual for Protein Prospector (specifically the section on data output): https://prospector.ucsf.edu/prospector/html/instruct/batchtagman.htm#search_compare

The first two rows have some cells that represent the Project Name: "Z20180606_YvA_TotalRPLC", and the search name "SW201948rc2mc2mm".

The data report lists the Peaklists used for the search under the header "Fraction". The charge state is under column header "z".

In the case of TMT10 or TMTpro, the intensity headers will have the same name for example "Int 127" for both 127N and 127C. The N isotope will always be first. I'm in the process of trying to get the Prospector Admin to change this.
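
As a minimal sketch of how a converter could cope with the duplicated headers, the snippet below appends N and C to reporter-ion headers that occur exactly twice, relying only on the ordering rule stated above (N first). The function name `label_tmt_isotopes` and the example header list are hypothetical and not part of Prospector or MSstatsConvert.

```python
from collections import Counter


def label_tmt_isotopes(headers):
    """Append N/C to reporter-ion headers that occur exactly twice.

    Assumption (from the description above): when "Int 127" appears
    twice, the first column is the N isotope and the second is the C.
    """
    counts = Counter(headers)
    seen = Counter()
    labeled = []
    for name in headers:
        if name.startswith("Int ") and counts[name] == 2:
            seen[name] += 1
            labeled.append(name + ("N" if seen[name] == 1 else "C"))
        else:
            # Unique headers (e.g. "Int 126") are left untouched.
            labeled.append(name)
    return labeled


# Hypothetical header row, trimmed to the relevant columns.
example = ["z", "Int 126", "Int 127", "Int 127", "Int 128", "Int 128", "Int 131"]
print(label_tmt_isotopes(example))
```

Headers that appear only once pass through unchanged, so the same function can be applied to TMT6 reports.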

There are a number of different options for reporting peptide mods in Prospector, the data I shared was just one of them. Mods can be split out into separate columns if that would be easier to parse.

Jason

@tonywu1999
Contributor

> The data report lists the Peaklists used for the search under the header "Fraction". The charge state is under column header "z".

Understood - z represents the precursor charge.

> The first two rows have some cells that represent the Project Name: "Z20180606_YvA_TotalRPLC", and the search name "SW201948rc2mc2mm".

I'm still confused about how to determine which run(s) produced this report. For example, in the attached example from another tool (MaxQuant), there is a column "Raw.file" that states which RAW file each row is associated with. I can't find an equivalent in the Prospector search file.

> In the case of TMT10 or TMTpro, the intensity headers will have the same name for example "Int 127" for both 127N and 127C. The N isotope will always be first. I'm in the process of trying to get the Prospector Admin to change this.

Could you clarify what you mean by this? How would one know when a measurement is associated with the N isotope vs the C isotope based on inspecting the input dataset?

> There are a number of different options for reporting peptide mods in Prospector, the data I shared was just one of them. Mods can be split out into separate columns if that would be easier to parse.

I think your initial dataset works well with how peptide mods are reported. Is there a certain setting that a user needs to select to display the mods in the current format (i.e. is this the default format)?

@jcmaynard
Author

Hi @tonywu1999,

Here is a breakdown of the column headers from the reports I sent @devonjkohler :
Column Headers

  • "" - The first column has no header. It is the protein rank that Prospector outputs. The only meaningful detail is that a hyphen in the rank, for example [2-3], indicates a homologous protein.
  • Uniq Pep -
  • Acc # - Uniprot Accession number
  • Gene
  • Num Unique - number of unique peptides matched to the protein
  • % Cov
  • Best Disc Score
  • Best Expect Val
  • M+H - Singly charged peptide mass
  • m/z - Precursor m/z value
  • z - precursor charge state
  • ppm - precursor mass error
  • Prev AA
  • DB Peptide - Peptide with no modifications shown
  • Peptide - Peptide with modifications present
  • Next AA
  • Protein Mods - All variable modifications present, with the protein amino acid number after the @ symbol. The number after = is the SLIP score (the modification site localization score). If a modification has a SLIP score below 6 (less than ~95% confidence), then alternative modification sites are shown separated by the "|" symbol.
  • Composition - This is a user defined column showing if a requested modification is present.
  • M Cl - # of missed cleavages
  • Fraction - Peaklist/raw file that the spectrum came from
  • RT - Retention time
  • Spectrum
  • MSMS Info - Scan number of the spectrum in the raw file
  • Int 126 - Reporter ion intensity
  • Int 127
  • Int 128
  • Int 129
  • Int 130
  • Int 131
  • Start - Protein amino acid position of the first amino acid in the peptide
  • Score - peptide score
  • Expect - expectation value from search
  • # in DB - number of times the peptide is found in the database
  • Protein MW
  • Species
  • Protein Name
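
To make the mapping from these headers to run, charge, and channel intensities concrete, here is a rough sketch. It assumes the report is tab-delimited text and that the header row can be found by looking for the "Fraction" and "z" columns, because the first rows hold the project and search names. The function names, the sample text, and the run name `run_01` are all hypothetical. Duplicate "Int" headers would collide in the dict, so this only suits reports where every header is unique.

```python
import csv
import io


def read_prospector_rows(text):
    """Yield one dict per data row of a tab-delimited report (assumed layout)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    for i, row in enumerate(rows):
        # The header row is the first row that names both columns.
        if "Fraction" in row and "z" in row:
            header = row
            break
    else:
        raise ValueError("no header row with 'Fraction' and 'z' found")
    for row in rows[i + 1:]:
        if any(row):  # skip blank lines
            yield dict(zip(header, row))


def summarize(record):
    """Pick out run, precursor charge, peptide, and reporter intensities."""
    intensities = {
        key: float(value)
        for key, value in record.items()
        if key.startswith("Int ") and value
    }
    return {
        "run": record["Fraction"],
        "charge": int(record["z"]),
        "peptide": record["Peptide"],
        "intensities": intensities,
    }


# Made-up sample: one metadata row, the header row, one PSM row.
sample = (
    "Z20180606_YvA_TotalRPLC\tSW201948rc2mc2mm\n"
    "z\tPeptide\tFraction\tInt 126\tInt 127\n"
    "2\tPEPTIDEK\trun_01\t1500.0\t1720.5\n"
)
for record in read_prospector_rows(sample):
    print(summarize(record))
```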

The reporter ion intensity columns for TMT with more than 6 channels will now carry an isotope label, for example "Int 127N" and "Int 127C".

Modifications can be reported in 5 different ways:

  • Off: Only the DB Peptide column is shown
  • Mods in Peptide: All mods are shown in the peptide in a column named "Peptide" (this is what is shown in the dataset you have)
  • Variable Mods only: Only the variable mods are shown in a separate column "Variable Mods" example: "Oxidation@11;Oxidation@12"
  • All Mods (1 column): All modifications are in one column named "Mods"
  • All Mods (2 Columns): Modifications are split between two columns: "Constant Mods" and "Variable Mods"

For the above settings, modifications are reported at the peptide level: Oxidation@11 refers to the 11th amino acid of the peptide. Protein modification is a separate column, discussed above. The default is "Variable Mods only", but "Mods in Peptide" or one of the "All Mods" options is used more often.
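
A small sketch of reading the "Variable Mods" format shown above ("Oxidation@11;Oxidation@12"), where each position is a 1-based index into the peptide. The function names and the example peptide are hypothetical. Entries whose position is not a plain integer are kept as strings and skipped when looking up residues, since the thread does not say how terminal mods are written.

```python
def parse_peptide_mods(mods):
    """Split 'Name@pos;Name@pos' into (name, position) tuples."""
    parsed = []
    for entry in filter(None, (part.strip() for part in mods.split(";"))):
        name, _, pos = entry.partition("@")
        # Keep non-numeric positions as text rather than guessing.
        parsed.append((name, int(pos) if pos.isdigit() else pos))
    return parsed


def modified_residues(db_peptide, mods):
    """Return (name, position, residue) for integer, in-range positions."""
    return [
        (name, pos, db_peptide[pos - 1])  # positions are 1-based
        for name, pos in parse_peptide_mods(mods)
        if isinstance(pos, int) and 1 <= pos <= len(db_peptide)
    ]


# Hypothetical peptide with methionines at positions 11 and 12.
print(modified_residues("AAAAAAAAAAMMK", "Oxidation@11;Oxidation@12"))
```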

The TMT modification names in Prospector are: TMT6plex, TMT10plex, and TMT16plex

I'm happy to set up a zoom or call to discuss if that would be helpful.

Cheers,

Jason

@tonywu1999
Contributor

@jcmaynard

Hi,

I'd be happy to discuss on a call. I think you answered all my questions, but I'm curious about how the modifications are reported and would like more clarity on that.

Could you email me at [email protected] and we can coordinate a time to discuss?

Thanks,
Tony

@tonywu1999
Contributor

@jcmaynard

In terms of the timeline for the MSstatsPTM converter, I anticipate it will be complete by the end of October. I had initially expected to finish earlier, but the MSstatsPTM code needs some refactoring before the Protein Prospector converter can be implemented.

So far, I created the converter from Protein Prospector to MSstatsTMT format, which is accessible at MSstatsConvert.

@jcmaynard
Author

jcmaynard commented Sep 5, 2024 via email

@tonywu1999
Contributor

tonywu1999 commented Sep 16, 2024

Adding notes here from my July meeting with Jason:

General Notes:

  • The Fraction column holds the RAW file name, even if there are multiple mixtures.
  • Shared proteins can show up - we should throw those out
  • All PSMs is a file. Users should use “Keep replicates” option.
  • There is ambiguity with how MSstatsTMT should handle which feature intensity to use in fractionation scenarios.

Slip Score Notes:

  • In "75=11", 11 is the SLIP score (localization score)
    • 75 is the position of the amino acid w.r.t. the protein
    • Slip score 6 = 95% confidence. (Protein Mods)
  • 75|77 - this means identification fell below 95% confidence, unclear which site got modified
    • Typically we should filter these values out.
  • Phospho&Phospho ((110 & (113 | 115)) | (112 & (113 | 115))) is ambiguous too.
  • Interesting cases:
    • Methylation - dimethylation vs. trimethylation can be reported.
    • We should not see two modifications on the same amino acid at the same time.
    • TMT labels are assumed to modify every lysine and the N-terminus (constant mods). The Protein Mods column reports variable mods at the protein level.
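
The filtering rule in these notes could be sketched roughly as follows. The entry format "Name@site=score", the ";" separator between entries, and the function name `confident_sites` are assumptions pieced together from the examples in this thread, not a confirmed specification of the Protein Mods column. Any entry containing "|" is treated as ambiguous and dropped, which also covers the nested Phospho&Phospho case.

```python
import re

# Assumed shape of an unambiguous entry, e.g. "Phospho@75=11".
SITE_WITH_SCORE = re.compile(r"^(?P<name>[^@]+)@(?P<site>\d+)=(?P<slip>\d+)$")


def confident_sites(protein_mods, min_slip=6):
    """Keep (name, site, slip) for entries localized with SLIP >= min_slip."""
    kept = []
    for entry in filter(None, (p.strip() for p in protein_mods.split(";"))):
        if "|" in entry:
            continue  # alternative sites listed: localization is ambiguous
        match = SITE_WITH_SCORE.match(entry)
        if match and int(match["slip"]) >= min_slip:
            kept.append((match["name"], int(match["site"]), int(match["slip"])))
    return kept


# Made-up Protein Mods value with one confident and two rejected entries.
print(confident_sites("Phospho@75=11;Phospho@110|112;Phospho@201=4"))
```

The default threshold of 6 follows the note that a SLIP score of 6 corresponds to about 95% confidence.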

@tonywu1999
Contributor

tonywu1999 commented Dec 16, 2024

@jcmaynard

Apologies for the delays. The PR has been merged and the converter is now available on GitHub. You can install the package from GitHub in the R console:

devtools::install_github("Vitek-Lab/MSstatsConvert", build_vignettes = TRUE)

We plan to push it to Bioconductor in the future. Please let me know if you have any problems.

tonywu1999 added this to the Beta Testing milestone Dec 16, 2024

@jcmaynard
Author

jcmaynard commented Dec 16, 2024 via email
