This is an example of loading in a search result from MaxQuant and analysing it in MSstats. We can do this using their shiny application, but doing it in R directly gives us more control over the process and lets us examine intermediate steps in the processing.
The data we are using comes from PRIDE accession PXD043985. It’s a yeast dataset which contains both whole proteome and an affinity pull down using the eIF2a protein. The interest here is to look at the proteins which are enriched in the pull down compared to the whole proteome.
We’re going to base this analysis on the MSstats package, but will supplement this with the standard tidyverse packages. We’ll use pheatmap for drawing heatmaps of the hits.
library(MSstats)
library(MSstatsConvert)
library(tidyverse)
library(pheatmap)
theme_set(theme_bw())
We are going to import two files from the MaxQuant output. These are:
evidence.txt: this is the main quantified file at the peptide level
proteinGroups.txt: this file provides protein level quantitation and shows how the peptides were combined
We’ll read in the evidence file first to look at some of the properties of the data, but we can then get MSstats to convert this into its standard format. We’re importing data from MaxQuant, but this would equally work with data from other search platforms.
We read in the file using the standard read_delim, but we use the same column name repair that read.delim would use, as MSstats expects the names to be in this format.
read_delim(
"evidence.txt",
name_repair = "universal"
) -> evidence
head(evidence)
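To see what the name repair actually does, here is a minimal sketch which applies it to a few headers written in the style MaxQuant uses. It assumes that vctrs::vec_as_names is the function which performs the "universal" repair behind read_delim, and compares the result with make.names, which is what read.delim applies in base R. The header vector is typed in by hand for illustration and isn't read from the real file.

```r
library(vctrs)

# A few headers in the style MaxQuant writes (typed by hand for illustration)
raw_names <- c("Raw file", "Retention time", "Modified sequence")

# The repair requested by name_repair = "universal"
universal_names <- vec_as_names(raw_names, repair = "universal", quiet = TRUE)

# The base R repair applied by read.delim (check.names = TRUE)
base_names <- make.names(raw_names, unique = TRUE)

universal_names

# Do the two approaches agree for these headers?
identical(universal_names, base_names)
```

For headers like these, which only contain spaces, both approaches replace the space with a dot. They aren't guaranteed to agree for every possible header, so it's worth checking the column names after import.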
We can also load the protein level information.
read_delim(
"proteinGroups.txt",
name_repair = "universal"
) -> protein_groups
head(protein_groups)
For the downstream analysis we also need to make up a tibble of annotations to say which group each file belongs to. There are only two groups here: the full proteomes and the affinity tag pull downs. We’ll make the annotation from the data in the evidence file.
We’re not doing a mass tagged experiment so we need to say that all of the samples are using light (L), i.e. normal masses.
evidence %>%
distinct(`Raw.file`) %>%
mutate(Condition = str_replace(Raw.file,"^.*-","")) %>%
mutate(Condition = str_replace(Condition,"TAP_Prot","Prot")) %>%
mutate(Condition = str_replace(Condition,"_Rep.","")) %>%
arrange(Raw.file) %>%
group_by(Condition) %>%
mutate(BioReplicate = 1:n()) %>%
ungroup() %>%
add_column(IsotopeLabelType="L") -> annotation
annotation
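The string replacements are easier to follow on a small example. This sketch runs the same condition and replicate steps on four invented file names. The names, and the toy_files and toy_annotation objects, are hypothetical and only mimic the pattern which the regular expressions assume; they are not the real file names from the dataset.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical raw file names, invented purely for illustration
toy_files <- tibble(
  Raw.file = c(
    "run01-TAP_Rep1", "run02-TAP_Rep2",
    "run03-TAP_Prot_Rep1", "run04-TAP_Prot_Rep2"
  )
)

toy_files %>%
  # Remove everything up to and including the last hyphen
  mutate(Condition = str_replace(Raw.file, "^.*-", "")) %>%
  # Shorten the whole proteome label
  mutate(Condition = str_replace(Condition, "TAP_Prot", "Prot")) %>%
  # Remove the replicate suffix (the . matches any single character)
  mutate(Condition = str_replace(Condition, "_Rep.", "")) %>%
  arrange(Raw.file) %>%
  group_by(Condition) %>%
  # Number the files within each condition
  mutate(BioReplicate = row_number()) %>%
  ungroup() -> toy_annotation

toy_annotation

# Sanity check: how many files ended up in each condition
toy_annotation %>% count(Condition)
```

Running the same count(Condition) on the real annotation is a quick way to confirm that every file ended up in one of the two expected groups.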
We’ve already looked at the QC of this data using PTXQC, but we can also look directly into the evidence and protein data to see what we’re working with. MSstats will do some filtering for us when the data is loaded, but the exact metrics which are used will vary between different search programs.
It’s good to see that we’re getting a nice even spread of peptides across the whole retention time range in each run. We can see this visually.
evidence %>%
ggplot(aes(x=Retention.time, colour=Raw.file)) +
geom_density()