Revision: 35042
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at November 1, 2010 13:58 by leomchugh
Initial Code
/* Learner.cpp :
Copyright (c) 2009, Leo McHugh, University of Sydney, Australia
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the <organization> nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY LEO MCHUGH ''AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL LEO MCHUGH BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This code requires input files from PreProcess.cpp
updates of this code can be found at: http://www.usyd.edu.au/sydneybioinformatics/research/software.shtml
Contact: [email protected]
*/
/*
PURPOSE AND USE OF THIS SOFTWARE:
This software is designed for use with PINNACLE. Once PINNACLE is run, it generates a list of hits along with a
confidence score for each, and prints out a list of identified fragments. An example is shown as a commented out
section at the very end of this file.
Learner.cpp uses these files to gather statistical information about fragmentation and to build and train a
neural network.
For version1.0, only limited information will be collected.
We calculate approximate probability functions for:
the parent mass error w.r.t mass
fragment mass erro w.r.t mass
probability of fragment identification in correct assignments w.r.t mass / charge
Neural Nets:
Inputs for predicting peak intensity for a fragment:
iontype either no frag, immoium, y or b ion (nofrag, immo, y1, a1, b1) = (0,1,3,2,4); // 5 reserved for z ions but not implemented
pcharge parent charge
fcharge fragment charge
plen the length of the parent peptide
prop_len the proportion of the distance from N to C term that the fragmentation site was at
pr_hkr proportion of HKR and R residues in the fragment over those in the unbroken peptide
n_x identity of AA on the n-term side of cleavage
c_x identity..... c-term
inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
(taken from working notes07)
==== Add later============
loss 0 if it's a 'native' ion, such as a y or b, 1 if it contains losses.
Output:
log of the intensity, scaled so that the max log(intensity) for the fragment is set to 100 // maybe use later 6 Fuzzy Logic NN outputs, 0, 20, 40 etc to 100.
*/
#include "stdafx.h"
#include <iostream>
#include <string>
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>
#include <list>
#include <time.h> // for random number generation
// Learner based libraries:
#include "../Utilities/Vector.h"
#include "../Utilities/Matrix.h"
#include "../Utilities/InputTargetDataSet.h"
#include "../MultilayerPerceptron/MultilayerPerceptron.h"
#include "../ObjectiveFunctional/MeanSquaredError.h"
#include "../TrainingAlgorithm/QuasiNewtonMethod.h"
using namespace std;
// ======================= FIXED PARAMETERS =====================
unsigned const MAX_SEQ_SIZE = 50; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
unsigned const MIN_SEQ_SIZE = 5; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
unsigned outputNodes = 1; // The number of output nodes in the fuzzy network
unsigned inputNodes = 8; // described above
double propTesting = 0.2; // Proportion of peptides reserved for testing.
double minZscore = 3.7; // minimum required Zscore for a 'hit' to be considered confident enough to be used for testing/training
unsigned const buckets = 50; // mass spectrum division into this many buckets of 100 da. for statistical learning part
bool fuzzyOutputs=false; // if true, splits the output to a fuzzy output, if false, there is a single output node.
// ======================= OBJECTS ==============================
// ============== MUST BE THE SAME AS IN PINNACLE.CPP ===========
struct fragmatch { // records a matching fragment in from an indentified peptide
unsigned intensity;
double obsmz;
double expmz; // expected mz
int fcharge; // fragment charge
string fragType; // ie y1+
string fragDesc; // decription of the fragment. ie: ........NQLLQFR
unsigned fragSite; // from N to c term. zero is no frag, 1 is first bion / last y ion etc,
double fragLOD; // LOD for this frag
};
struct format // converts a hit into a neural net input set
{
double pmasserror; // error in the pmass calculation
vector<vector<double>> fmzerror; // fragment mass error array <fmass,error>
vector<vector<double>> NNlines; // formatted lines to be appended to the NN training matrix.
};
struct hit {
string seq; // sequence of the matched peptide
double Zscore; // number of std devs above the random mean
int pcharge; // charge of the observed parent peptide
int basichits; // the number of no loss y and b ions
double prelimScore; // the proportion of the top 4N peaks covered by no loss y and b ions.
double obspmass; // observed pmass of the input spectrum (uncharged)
double exppmass; // theoretical pmass of the matched peptide (uncharged)
vector<fragmatch> peaks; // contains the top 4N peaks
vector<double> mods; // one entry for each AA from N to C term
double overallLOD; // LOD score generated by PINNACLE for this peptide match
};
struct learnedParameters { // holds the parameters learned by this file for passing to PINNACLE
// THIS STRUCTURE PASSES ANY LEARNED PARAMETERS TO PINNACLE
// IF YOU WANT TO INCLUDE MORE STATISTICAL LEARNING, ADD THE LEARNED PARAMETERS
// TO THIS OBJECT, AND UPDATE PINNACLE SO THAT IT CAN LOAD THE SAME OBJECT
// THIS OBJECT IS WRITTEN TO BINARY FILE AFTER LEARNING SO THAT PINNACLE CAN READ IT IN
double pmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for parent mass errors
double fmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for fragment mass errors
// FEEL FREE TO ADD MORE LEARNED PARAMETERS IN HERE (REMEMBERING TO UPDATE THE SAME OBJECT IN PINNACLE
};
// =================================== FILE LOCATIONS ================================================================
// location of the output of the PINNACLE run (this is a serial file writen by PINNACLE storing all the hits as objects)
char *inputFile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\output.ser"; // .ser (serial) - see PINNACLE for format details
// the below files are all outputs from this program. savedModel and learnedParameters store the neural network and
// other general learned parameters respectively and these 2 files must be written to the working directory of PINNACLE
// neural network related
char *tempTrainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\tempTrainingSet.dat"; // just contains instances
char *trainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingSet.dat";
char *testingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\testingSet.dat";
char *trainingHistory = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingHistory.dat";
char *savedModel = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Debug\\multilayerPerceptron.dat";
// other learned parameters (see object for details)
char *learnedParams = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\learnedParameters.bin"; // binary
// ===================================== FUNCTIONS ===========================
// LOAD IN AND DATA PREPARATION
void createTrainingFile();
hit loadInHit(ifstream&); // loads in the next hit
vector<string> processHit(hit, bool); // returns a block of lines which are converted into a NN input file (Flood style)
double getMean(vector<double>&); // returns the mean
double getStdDev(vector<double>&, double); // returns the standard deviation of input array, second arg is the precomputed vector mean.
void learnMassErrors(); // learns parent and fragment mass errors for this dataset
// TESTING OF THE MODEL
void loadAndProcessTrainingFile();
void testModel(); // compares against a flat spectrum.
double dotProduct(vector<double>, vector<double>); // return inner dot product of input vectors
double testPeptide(vector<double>); // returns the dot product of a flat and the predicted spect.
fragmatch findMatch(string, unsigned, hit); // (iontype, breakSite,matchArray)
vector<double> fuzzyOut(float, float); // produces a fuzzy output of the intensity
void trainingSetHeader(ofstream&,int,int,int,string,string);// sets up the header for the output training file
// OTHER HANDY FUNCTIONS
vector<double> normalise(vector<double>, double sumTotal); // takes an input vector and normalises to sum to 100
vector<double> convert2mods(string); //
vector<double> convert2doubleVect(string); // converts a space separated line into a vector of doubles (from test set)
vector<fragmatch> convert2peaks(string); // used during loading in to convert .ser file to a hit object
double convert2double(string); // converts a string to a double
int convert2int(string); // converts a string to an int
string convert2str(vector<double>); // converts doubles into a string for output
double encodeAA(char);
double countHKR(string&);
string stringify(double x); // convert double to string
// Neural network
Flood::Vector<double> getAttributes(fragmatch& fm,string seq, unsigned pcharge);
Flood::MultilayerPerceptron ANN;
void printFloodVector(Flood::Vector<double>& v);
// ============== DEBUGGING ===========================
int verbosity = 4;
int _tmain(int argc, _TCHAR* argv[])
{
// NEURAL NETWORK LEARNING
// reads the identified peptides and creates a NN training file
// createTrainingFile();
// imports the processed training file, sets the neural net parameters for the model and learning
// loadAndProcessTrainingFile();
// tests neural network model against the testing file generated by createTrainingFile.
testModel();
// STATISTICAL MACHINE LEARNING
// collects basic stats on hits to optimise search parameters.
// learnMassErrors(); // collects stats on parent and fragment mass erros.
return 0;
}
void learnMassErrors() { // learn the mass errors from the hits
cout << "Learning from parent and fragment mass errors from: " << inputFile << endl;
ifstream inputHits;
inputHits.open(inputFile); // load in the the input file in binary format
learnedParameters lp; // object for storing learned parameters
// we divide the mass range up into buckets of 100 Da, storing a mean and variance for each bucket
// PINNACLE will then use this information in a prob density function
vector<vector<double>> pmassbuckets; // stores the parent mass errors for each bin
vector<vector<double>> fmassbuckets; // stores the fragment mass errors for each bin
for (unsigned i=0; i < buckets; i++) {
vector<double> v;
pmassbuckets.push_back(v); // create elements (blank memory spacers for the time being)
fmassbuckets.push_back(v); // create elements
}
unsigned hitsLoaded=0;
while (!inputHits.eof()) {
hit h = loadInHit(inputHits);
if (h.seq.compare("END")==0) {cout << "hits loaded: " << hitsLoaded << endl; break;} // no more available (loadInHit returns 'END' at eof)
if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning
hitsLoaded++;
// RECORD PARENT MASS ERROR
unsigned index = (unsigned) floor(h.obspmass/100); // assign bucket (100 Da per bucket)
if (index > buckets-1) {index = buckets-1;} // prevent overflow
//double modmass =0; // account for mods
//for (unsigned i=0; i < h.mods.size(); i++) {modmass+=h.mods.at(i);}
double difference = h.obspmass-(h.exppmass);
if (fabs(difference) > 1) {
cout << "obs mass: " << h.obspmass << " expmass: " << h.exppmass << endl; // " modmass: " << modmass << endl;
cout << "unexpectedly high parent mass error - skip"; continue; // sometimes a peptide is IDed twice, each with different mods, keep searching..
}
pmassbuckets.at(index).push_back(difference); // record parent mass error
// RECORD FRAGMENT MASS ERRORS
for (unsigned i=0; i<h.peaks.size(); i++) {
if (h.peaks.at(i).obsmz == 0) {continue;} // no matching peak
unsigned index = (unsigned) floor(h.peaks.at(i).obsmz/100); // assign bucket (100 Da per bucket)
if (index > buckets-1) {index = buckets-1;} // prevent overflow
// using absolute error values for frag mass due to found bimodal
// distributions for some datasets. non absolute values would be
// better (but can't then use mean and std dev in probability functions
//cout << " fmass: " << h.peaks.at(i).obsmz << " error: " << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;
fmassbuckets.at(index).push_back(fabs(h.peaks.at(i).obsmz-h.peaks.at(i).expmz)); // record parent mass error
//if (fragError.at(1) < 0 && h.peaks.at(i).obsmz > 700) {cout << "UNDER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
//if (fragError.at(1) > 0 && h.peaks.at(i).obsmz > 700) {cout << " OVER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
}
} // end loading in all hits from file
inputHits.close();
// SHOW SOME OF THE DISTROS IN THE BUCKETS
unsigned testBucket = 8;
cout << "bucket" << testBucket << "={";
for (unsigned i=0; i < fmassbuckets.at(testBucket).size(); i++) {cout << fmassbuckets.at(testBucket).at(i) << ',';}
cout << "};\n";
exit(0);
// GET PARAMETERS FOR MEAN AND VARIANCE FOR EACH BUCKET
unsigned minDatapoints = 10; // the minimum number of values in a bucket for the calculation of mean and std dev
// ================ PARENT MASS ERRORS ============================================================
vector<vector<double>> pe; // pe = pmass errors for each bucket, stores <mean, stdDev>
for (unsigned bucket=0; bucket < buckets; bucket++) {
vector<double> v;
v.clear();
if (pmassbuckets.at(bucket).size() < minDatapoints) {pe.push_back(v); continue;} // too few values to calculate
double mean = getMean(pmassbuckets.at(bucket));
double stdDev = getStdDev(pmassbuckets.at(bucket), mean);
//cout << "massbucket: " << bucket << ": {";
//for (unsigned j=0; j<pmassbuckets.at(bucket).size(); j++) {cout << pmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl;
//cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl;
v.push_back(mean); v.push_back(stdDev);
pe.push_back(v);
}
// if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours.
// first back-propogate the values to leading empty buckets
for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors
if (pe.at(i).empty()) {continue;}
vector<double> v = pe.at(i); // first non-empty bucket
while (1) {pe.at(i) = v; if (i==0) {break;} i--;}
break;
}
// now propagate forward any values to any empty buckets
vector<double> v = pe.at(0); // first bucket's value (mean,stdDev)
for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors
if (pe.at(i).empty()) {pe.at(i)=v; continue;} // assign same values as bucket to the left
v = pe.at(i); // update
}
cout << "pmass errors: \n"; for (unsigned i=0; i < pe.size(); i++) {
if (pe.at(i).empty()) {cout << "i: " << i << "{}\n";}
else {cout << "bucket: " << i << " {" << pe.at(i).at(0) << ',' << pe.at(i).at(1) << "}\n";}
} cout << endl;
// load into object
for (unsigned i=0; i < buckets; i++) { // pe = parent errors
lp.pmassErrors[i][0] = pe.at(i).at(0); // mean
lp.pmassErrors[i][1] = pe.at(i).at(1); // standard Deviation
}
// ================ FRAGMENT MASS ERRORS ============================================================
vector<vector<double>> fe; // fe = fragment errors for each bucket, stores <mean, stdDev>
for (unsigned bucket=0; bucket < buckets; bucket++) {
vector<double> v;
v.clear();
if (fmassbuckets.at(bucket).size() < minDatapoints) {fe.push_back(v); continue;} // too few values to calculate
double mean = getMean(fmassbuckets.at(bucket));
double stdDev = getStdDev(fmassbuckets.at(bucket), mean);
//cout << "fmassbucket: " << bucket << ": {";
//for (unsigned j=0; j<fmassbuckets.at(bucket).size(); j++) {cout << fmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl;
//cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl;
v.push_back(mean); v.push_back(stdDev);
fe.push_back(v);
}
// if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours.
// first back-propogate the values to leading empty buckets
for (unsigned i=0; i < fe.size(); i++) { // pe = parent errors
if (fe.at(i).empty()) {continue;}
vector<double> v = fe.at(i); // first non-empty bucket
//cout << "first non-empty fragmass bucket: {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}" << endl;
while (1) {fe.at(i) = v; if (i==0) {break;} i--;}
break;
}
// now propagate forward any values to any empty buckets
v = fe.at(0); // first bucket's value (mean,stdDev)
for (unsigned i=0; i < fe.size(); i++) { // pe = parent errors
if (fe.at(i).empty()) {fe.at(i)=v; continue;} // assign same values as bucket to the left
v = fe.at(i); // update
}
cout << "fragment errors: \n"; for (unsigned i=0; i < fe.size(); i++) {
if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
else {cout << "bucket: " << i << " {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}\n";}
} cout << endl;
// output for Mathematica graphing - buckets of 100 da {{fragmass,AvgError,StdDev},{...}}
cout << "ferrors={"; for (unsigned i=0; i < fe.size(); i++) {
if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
else {cout << '{' << i*100 << ',' << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "},";}
} cout << '}' << endl;
// load into object
for (unsigned i=0; i < buckets; i++) { // pe = parent errors
lp.fmassErrors[i][0] = fe.at(i).at(0); // mean
lp.fmassErrors[i][1] = fe.at(i).at(1); // standard Deviation
}
ofstream out; // write object to file
out.open(learnedParams, ios::binary);
out.write((char *)(&lp),sizeof(lp)); // write object out to binary file (used by PINNACLE)
out.close();
cout << "quiting after learned parameters writen to file";
exit(0);
}
void testModel() {
// first load in the model
Flood::MultilayerPerceptron ANN; ANN.load(savedModel);
cout << "loading ANN from " << savedModel << endl;
ifstream testingSet;
testingSet.open(testingfile);
vector<double> improvements;
// load in a single instance from the test set
string line;
while (!testingSet.eof()) {
vector<vector<double>> instance; // instance of the matching hits for a given peptide
instance.clear();
line = "NULL";
while (!line.empty() && !testingSet.eof()) {
getline(testingSet, line );
if (!line.empty()) {
//cout << " pushing back line: " << line << " of size: " << convert2doubleVect(line).size() << endl;
instance.push_back(convert2doubleVect(line));
}
}
//cout << "instance loaded with " << instance.size() << endl;
vector<double> predicted;
vector<double> flat(instance.size(),1);
vector<double> observed;
for (unsigned i=0; i < instance.size(); i++) {
vector<double> input = instance.at(i);
double obsPeakInt = input.back(); // the correct output
observed.push_back(obsPeakInt);
input.pop_back(); // remove the last element (answer)
// need to convert a normal vector to a flood style vector
Flood::Vector<double> attributes(input.size());
for (unsigned j=0; j<input.size(); j++) {attributes[j] = input.at(j);}
Flood::Vector<double> fuzzyOutput = ANN.getOutput(attributes);
if (fuzzyOutputs) {
// take the highest node as the prediction of intensity (rounded to nearest 20%)
unsigned maxnode=0; double maxval=0;
for (int k=0; k < fuzzyOutput.getSize(); k++) {
if (fuzzyOutput[k] > maxval) {maxnode=k; maxval=fuzzyOutput[k];}
}
predicted.push_back(maxnode); // just take the node with the highest output
}
else {
//cout << "attributes: "; for (int k=0; k < attributes.getSize(); k++) {cout << attributes[k] << ' ';}
//cout << "predicted output: " << fuzzyOutput[0] << endl;
predicted.push_back(fuzzyOutput[0]); // just the raw single number coming out.
//double exponent = pow(2.17,fuzzyOutput[0]);
//predicted.push_back(exponent); // blow it back up
}
// print out the fuzzy nodes, max node and observed intensity
/*cout << "fuzzyOut " << i << ": ";
for (int k=0; k < fuzzyOutput.getSize(); k++) {
char buffer [50];
int n, a=5, b=3;
n=sprintf_s(buffer, 50, "%0.3f",fuzzyOutput[k]);
cout << buffer << " ";
}cout << "| " << maxnode*0.2 << " | " << obsPeakInt << endl;
*/
} // end prediction of each peak height
// end of single test peptide
// Normalise vectors
predicted = normalise(predicted,100); // normalises spectrum to a sum total of 100
flat = normalise(flat,100);
observed = normalise(observed,100);
// show normalized vectors
cout << "flat : {"; for (unsigned i=0; i < flat.size()-1; i++) {cout << flat.at(i) << ',';} cout << flat.back() << "}\n";
cout << "predicted: {"; for (unsigned i=0; i < predicted.size()-1; i++) {cout << predicted.at(i) << ',';} cout << predicted.back() << "}\n";
cout << "observed : {"; for (unsigned i=0; i < observed.size()-1; i++) {cout << observed.at(i) << ',';} cout << observed.back() << "}\n";
double flatDP = dotProduct(observed,flat);
double annDP = dotProduct(observed,predicted);
improvements.push_back(annDP/flatDP);
cout << "imp: " << annDP/flatDP << " flatDP: " << flatDP << " annDP: " << annDP << endl << endl;
//exit(0);
}
cout << "improvements={"; for (unsigned i=0; i < improvements.size(); i++) {cout << improvements.at(i) << ',';} cout << "};";
// use a dot product model.
// for the FOUND peaks, create a spectrum. then normalise
return;
}
double dotProduct(vector<double> v1, vector<double> v2) {
double dp=0;
if (v1.size() != v2.size()) {cout << "vectors not equal length!"; exit(1);}
for (unsigned i=0; i < v1.size(); i++) {
dp += v1.at(i)*v2.at(i);
}
return dp;
}
vector<double> normalise(vector<double> v, double sumTotal) {
double sum = 0;
for (unsigned i=0; i < v.size(); i++) {sum+=v.at(i);}
// make sum to 100
double scalar = sumTotal/sum;
for (unsigned i=0; i < v.size(); i++) {v.at(i)*=scalar;}
return v;
}
double testPeptide(vector<double> fragmatches) { // returns a dot product for the predicted and a flat spect.
return 0;
}
vector<double> convert2doubleVect (string line) { // converts a space separate line into a double vector
unsigned i=0; unsigned j=0;
vector<double> out;
while (j < line.size()) {
if (j == line.size()-1) { // last one
string tmp = line.substr(i,j-i+1); //cout << "element: " << tmp << endl;
double val = convert2double(tmp);
out.push_back(val);
}
if (line.at(j) != ' ' && j < line.size()) {j++; continue; }
string tmp = line.substr(i,j-i); //cout << "element: " << tmp << endl;
double val = convert2double(tmp);
out.push_back(val);
i=j; j++;
}
//exit(0);
return out;
}
void loadAndProcessTrainingFile() {
// LOAD IN TRAINING SET INTO NN FOR LEARNING>
Flood::InputTargetDataSet inputTargetDataSet ;
inputTargetDataSet.load(trainingfile);
// set up a new neural network
Flood::MultilayerPerceptron multilayerPerceptron(inputNodes,inputNodes,outputNodes); // input, middle layer, output - same number in middle layer as inputs
multilayerPerceptron.initFreeParametersNormal(0.0,1.0); // initialise the network within the chosen range
// set up the names of the variables (both in and out)
Flood::Vector<string> nameOfInputVariables = inputTargetDataSet.getNameOfInputVariables();
Flood::Vector<string> nameOfTargetVariables = inputTargetDataSet.getNameOfTargetVariables();
multilayerPerceptron.setNameOfInputVariables(nameOfInputVariables);
multilayerPerceptron.setNameOfOutputVariables(nameOfTargetVariables);
// get the parameters for the input data - bound in necessary
//Flood::Matrix<double> meanAndStandardDeviationOfInputData = inputTargetDataSet.getMeanAndStandardDeviationOfInputData();
Flood::Matrix<double> minimumAndMaximumOfInputData = inputTargetDataSet.getMinimumAndMaximumOfInputData();
Flood::Vector<double> minOfInputVariables(inputNodes); // these values are not inclusive, so min must be lower than any expected input
minOfInputVariables[0] = -0.01; // iontype
minOfInputVariables[1] = 0.99; // pcharge
minOfInputVariables[2] = 0.99; // fcharge
minOfInputVariables[3] = MIN_SEQ_SIZE; // plen
minOfInputVariables[4] = -0.01; // prop_len
minOfInputVariables[5] = -0.01; // pr_hkr
minOfInputVariables[6] = -0.01; // n_x
minOfInputVariables[7] = -0.01; // c_x
multilayerPerceptron.setMinimumOfInputVariables(minOfInputVariables);
Flood::Vector<double> maxOfInputVariables(inputNodes);
maxOfInputVariables[0] = 4.01; // iontype
maxOfInputVariables[1] = 3.01; // pcharge
maxOfInputVariables[2] = 2.01; // fcharge
maxOfInputVariables[3] = MAX_SEQ_SIZE; // plen
maxOfInputVariables[4] = 1.01; // prop_len
maxOfInputVariables[5] = 1.01; // pr_hkr
maxOfInputVariables[6] = 1.01; // n_x
maxOfInputVariables[7] = 1.01; // c_x
multilayerPerceptron.setMaximumOfInputVariables(maxOfInputVariables);
// set up the relevant max and min etc for the output neuron(s)
Flood::Matrix<double> minimumAndMaximumOfTargetData = inputTargetDataSet.getMinimumAndMaximumOfTargetData();
cout << "min encountered output: " << minimumAndMaximumOfTargetData[0][0] << endl;
cout << "max encountered output: " << minimumAndMaximumOfTargetData[1][0] << endl;
double minBuffer = minimumAndMaximumOfTargetData[0][0] - minimumAndMaximumOfTargetData[0][0]/10; // min plus some buffer for unexpectedly low results
double maxBuffer = minimumAndMaximumOfTargetData[1][0] + minimumAndMaximumOfTargetData[1][0]/10;
cout << "minBuffer: " << minBuffer << " maxBuffer: " << maxBuffer <<endl;
Flood::Vector<double> minOfOutputVariables(outputNodes);
minOfOutputVariables[0] = minBuffer; // 10% buffer under the min value encountered in training set
Flood::Vector<double> maxOfOutputVariables(outputNodes);
maxOfOutputVariables[0] = maxBuffer; // 10% buffer over the max value encountered in training set
multilayerPerceptron.setMinimumOfOutputVariables(minOfOutputVariables);
multilayerPerceptron.setMaximumOfOutputVariables(maxOfOutputVariables);
multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MinimumAndMaximum);
multilayerPerceptron.print();
Flood::MeanSquaredError meanSquaredError(&multilayerPerceptron,&inputTargetDataSet);
//multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MeanAndStandardDeviation);
Flood::QuasiNewtonMethod quasiNewtonMethod(&meanSquaredError);
quasiNewtonMethod.setEvaluationGoal(0.0);
quasiNewtonMethod.setGradientNormGoal(0.0001);
quasiNewtonMethod.setMaximumTime(320.0); // seconds
quasiNewtonMethod.setMaximumNumberOfEpochs(20);
quasiNewtonMethod.train();
quasiNewtonMethod.saveTrainingHistory(trainingHistory);
multilayerPerceptron.save(savedModel);
cout << "neural network saved to: " << savedModel << endl;
return;
}
void createTrainingFile() {
// READ IN IDENTIFIED PEPTIDES
cout << "Learning from: " << inputFile << endl;
ifstream inputHits;
inputHits.open(inputFile); // // load in the the input file in binary format
ofstream tempTrainingSet;
tempTrainingSet.open(tempTrainingfile);
ofstream trainingSet; // writes to immediately if we're not using fuzzy outputs (as it doesn't need to be balanced).
trainingSet.open(trainingfile);
ofstream testingSet;
testingSet.open(testingfile);
ofstream testTrainLog;
testTrainLog.open("testTrain.log");
vector<vector<double>> pmasses; // records <mass,error> for all parent masses
vector<vector<double>> fragmasses; // records <mass,error> for all fragment masses
vector<string> trainSet; // training set - saved in MultiLayerPerceptron input file
vector<string> testSet; // testing set - saved as a text list of inputs with correct output as last figure.
string line ;
int verbosity = 0;
double loaded = 0; // total
double testing = 0; // those used in testing
while (!inputHits.eof()) {
hit h = loadInHit(inputHits); loaded++;
if (h.seq.compare("END")==0) {break;} // no more available (loadInHit returns 'END' at eof)
if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning
bool train; // if set 1, used for training, else used for testing
if ((testing/loaded) <= propTesting) {train=0; testing++;} else {train=1;} // if not train, then test
cout << "processing " << h.seq << endl;
vector<string> instances = processHit(h,train); // creates attributes for training with this peptide
//for (unsigned i=0; i < instances.size(); i++) {
// cout << instances.at(i);
//}
if (train==0) { // peptide allocated for testing set - dump to file
for (unsigned i=0; i < instances.size(); i++) {testingSet << instances.at(i);}
testingSet << endl; // peptides separated by newlines.
testing++;
}
if (train==1) { // peptide allocated for testing set - save later writing
for (unsigned i=0; i < instances.size(); i++) {trainSet.push_back(instances.at(i));}
}
//if (processed > 50) {break;}
//break;
}
//cout << "training set size: " << trainSet.size(); exit(0);
string inputs = "ION_TYPE PCHARGE FCHARGE PLEN PROPLEN PRHKR N_X C_X"; //inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
string outputs = "LOG_INTEN_NORM"; // proportion of the total intensity
trainingSetHeader(trainingSet, trainSet.size(), inputNodes, outputNodes, inputs, outputs);
for (unsigned i=0; i < trainSet.size(); i++) {trainingSet << trainSet.at(i);}
return;
}
vector<string> processHit(hit h, bool train) { // set train to 1 if it's going to the training set, else it goes to the testing set
vector<Flood::Vector<double>> matrix; // a vector of Flood Style vectors
for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak
fragmatch fm = h.peaks.at(i);
Flood::Vector<double> attributes = getAttributes(h.peaks.at(i),h.seq,h.pcharge);
if (attributes.getSize()==0) {
cout << "no attributes generated for peak: " << i << " with " << fm.fragType << endl;
continue;
} // no identified matching fragment
//cout << "for hit: " << fm.fragType << " " << fm.fragDesc << " " << " int: " << fm.intensity;
//cout << " attributes="; printFloodVector(attributes);
matrix.push_back(attributes);
}
// PEAK PROCESSING GOES IN HERE
unsigned sumIntensities=0;
for (unsigned i=0; i < h.peaks.size(); i++) {sumIntensities+=h.peaks.at(i).intensity;}
//cout << "sumInt: " << sumIntensities << endl;
vector<double> processedIntensities;
for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak
fragmatch fm = h.peaks.at(i);
if (fm.expmz == 0) {continue;}
unsigned intensity = fm.intensity;
//double prop = ((double) intensity / (double) sumIntensities);
double processed = (double) intensity/ (double) sumIntensities;
if (processed>0.2) {processed=0.2;}
processedIntensities.push_back(processed);
}
//exit(0);
//processedIntensities = normalise(processedIntensities,100);
// convert the matrix to strings of attributes with the processed answer
vector<string> outBlock; // each string corresponds to a line in a training or testing set
for (unsigned i=0; i < matrix.size(); i++) {
Flood::Vector<double> in = matrix.at(i);
string out;
for (int j=0; j < in.getSize(); j++) {
out+=stringify(in[j]);
out+=string(" ");
}
out+=stringify(processedIntensities.at(i));
out+="\n";
outBlock.push_back(out);
}
return outBlock; /// just a list of all hits and misses (with fuzzy outputs).
}
void trainingSetHeader(ofstream& f, int samples, int inputs, int outputs, string inputNames, string outputNames) {
f << "% Testing Dataset for PINNACLE\n";
f << "NumberOfSamples:\n" << samples << endl;
f << "NumberOfInputVariables:\n" << inputs << endl;
f << "NumberOfTargetVariables:\n" << outputs << endl;
f << "NameOfInputVariables:\n" << inputNames << endl;
f << "NameOfTargetVariables:\n" << outputNames << endl;
f << "InputAndTargetData:\n";;
return;
}
string convert2str(vector<double> v) { // adds newline at end
string out;
for (unsigned i=0; i < v.size(); i++) {
char buffer [50]; int n;
n=sprintf_s (buffer, 50, "%4.6f", v.at(i));
out += string(buffer);
if (i<v.size()-1) {out+=" ";} else {out+="\n";}
}
//cout << "string out: " << out;
return out;
}
vector<double> fuzzyOut(float intensity, float maxpeak) { // maxpeak may also be the sumOfPeaks if normalising against total peak intensities.
vector<double> out (outputNodes,0);
double prop = (double) intensity/(double) maxpeak; // the proportional intensity of this peak w.r.t. the maximum in the spect
prop*=1000; // multiplier to make all values greater then zero (since we'll be taking a log)
// DIRECT OUTPUT - LOG of the proportion of the sum of intensities this fragment has. Gives us a nice normal curve for learning.
if (intensity==0) {out.at(0)=0;}
else {out.at(0) = log(prop); if (out.at(0) < 0) {out.at(0) =0;}} // don't want negative logs thanks
return out;
// FUZZY OUTPUT
//cout << "for proportion: " << prop << endl;
for (unsigned i=0; i < outputNodes; i++) {
double node = i*(1/((double) outputNodes-1)); //cout << "node " << i << " with value " << node << endl; // current node
double diff = fabs(prop-node);
double decay = 3*diff; // decay will be zero when bang on - tells the program how much weight to put on neighbouring nodes.
double value = 1-decay; if (value < 0){value=0;}// cout << "diff: " << diff << " decay: " << decay << " value: " << value << endl;
out.at(i) = value;
}
cout << "fuzzyOut: <"; for (unsigned i=0; i < outputNodes; i++) {cout << out.at(i) << ',';} cout << '>' << endl;
return out;
}
fragmatch findMatch(string iontype,unsigned breakSite,hit h) {
// always returns the first such hit. Ie ignores whether is a water loss etc.
// basically, scan through the list of matches, and if a hit is found, return it.
//cout << "searching for ion type: " << iontype << " with breakSite at " << breakSite << endl;
if (iontype.compare("nofrag")==0) { // parent peptide
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(0);
if (!f.fragType.substr(0,6).compare("nofrag")) {return f;}
}
}
if (iontype.compare("immo")==0) { // immonium fragmatches have a default fragSite of 0.
// here we just need to find a fragment match with the right description.
char aa = h.seq.at(breakSite); // this is the specifc immonium ion we're looking for
//cout << "looking for immo ion: " << aa << endl;
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i);
//cout << "looking at peak: " << f.fragType << " obsmz: " << f.obsmz << endl;
if (f.obsmz == 0) {continue;} // there is no match here.
if (f.fragType.at(0) == 'i' && f.fragDesc.at(f.fragDesc.size()-1) == aa) {
//cout << "this peak has been found: " << f.fragType << " intensity: " << f.intensity << endl;
return f;
}
}
}
if (iontype.substr(0,2).compare("y1") == 0) {
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i);
//cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl;
if (f.fragSite == breakSite) {
if (!f.fragType.substr(0,2).compare("y1")) {return f;}
}
}
}
if (iontype.substr(0,2).compare("b1") == 0) {
for (unsigned i=0; i < h.peaks.size(); i++) {
fragmatch f = h.peaks.at(i);
//cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl;
if (f.fragSite == breakSite) {
if (!f.fragType.substr(0,2).compare("b1")) {return f;}
}
}
}
// getting to here means we didn't find a matching peak.
fragmatch nomatch;
nomatch.obsmz = 0; // this how we define a missing peak.
return nomatch;
}
hit loadInHit(ifstream& file) {
// given the filestream to the output file, loads in a hit. Should be replaced with binary files.
// seq:pcharge:obspmass:exppmass:stdDevs:basicHits:prelimScore:LOD:{mod0:mod1..}:{match1,match2...}:overallLOD\n
// where match = [intensity:obsfmass:expfmass:type:description:LOD]
string line;
hit h;
getline( file, line );
if (file.eof()) {h.seq = "END"; return h;}
//cout << line;
unsigned i=0; unsigned j=0; while (line.at(j) != ':') {j++;}
h.seq = line.substr(i,j-i); // identified sequence
if (verbosity > 3) {cout << "seq loaded: " << h.seq << endl;}
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.pcharge = convert2int(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.obspmass = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.exppmass = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.Zscore = convert2int(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.basichits = convert2int(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.prelimScore = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.overallLOD = convert2double(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.mods = convert2mods(line.substr(i,j-i));
i=j+1; j=j+1; while (line.at(j) != '}') {j++;} h.peaks = convert2peaks(line.substr(i,j-i));
//cout << "exiting"; exit(0);
return h;
}
vector<double> convert2mods(string token) {
vector<double> mods;
//cout << "converting to mod vector<double>: " << token << endl;
unsigned i=1; unsigned j=2; // first char is '{'
while (j < token.size()) { // for each mod mass
if (token.at(j) != ',' || j < token.size()-2) {j++;}
string tmp = token.substr(i,j-i-1); //cout << "modmass: " << tmp << endl;
double modmass = convert2double(tmp);
mods.push_back(modmass);
i=j; j++;
}
return mods;
}
vector<fragmatch> convert2peaks(string token) {
vector<fragmatch> matches;
// first break the line into individual matches
vector<string> lines;
unsigned i=0; unsigned j=0;
while (j < token.size() && i < token.size()) {
while (token.at(i) != '[') {i++;}
while (token.at(j) != ']') {j++;}
//cout << "pushing back: " << token.substr(i+1,j-i-1) << endl;
string strfrag = token.substr(i+1,j-i-1); // cout << "parts: " << strfrag << endl;
unsigned k=0;
vector<unsigned> commaPos; // positions of the commas in the string
fragmatch f;
while (k < strfrag.size()) {if (strfrag.at(k) == ',') {commaPos.push_back(k);} k++;}
/*cout << token.at(i) << endl;
cout << "commaPos: "; for (unsigned p=0; p < commaPos.size(); p++) {cout << commaPos.at(p) << ',';} cout << endl;
cout << "intensity: " << strfrag.substr(0,commaPos.at(0)) << endl;
cout << "obsMz: " << strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1) << endl;
cout << "expMZ: " << strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1) << endl;
cout << "fcharge: " << strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1) << endl;
cout << "type: " << strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1) << endl;
cout << "desc: " << strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1) << endl;
cout << "fragLOD: " << strfrag.substr(commaPos.at(5)+1) << endl;*/
f.intensity = convert2int(strfrag.substr(0,commaPos.at(0)));
f.obsmz = convert2double(strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1));
f.expmz = convert2double(strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1));
f.fcharge = convert2int(strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1));
if (f.expmz == 0.0) {i=j; j++; continue;} // missing peak isn't of any use at the moment.
f.fragType = strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1);
f.fragDesc = strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1);
f.fragLOD = convert2double(strfrag.substr(commaPos.at(5)+1));
// calculate the fragmentation site. Zero is the whole peptide and we go from N to C term
if (f.fragType.at(0) == 'i' || f.fragType.at(0) == 'n') {f.fragSite = 0;} // no frag or immonium
else { // x/y/z or a/b/c ion
if (f.fragType.at(0) == 'y') {
unsigned site=1;
while (f.fragDesc.at(site) == '.') {site++;}
f.fragSite = site; //continue;
}
if (f.fragType.at(0) == 'a' || f.fragType.at(0) == 'b') {
unsigned site=1;
while (f.fragDesc.at(site) != '.') {site++;}
f.fragSite = site; //continue;
}
if (f.fragType.at(0) != 'a' && f.fragType.at(0) != 'b' && f.fragType.at(0) != 'y') {
cout << "fragType " << f.fragType << " not yet implemented" << endl;
i=j; j++; continue; // does not record frag site
}
}
matches.push_back(f); // add this peak
i=j; j++; // continue onto next fragment.
}
return matches;
}
double convert2double(string str) {
char * cstr;
cstr = new char [str.size()+1]; // convert to c_str
strcpy_s (cstr, str.size()+1, str.c_str());
double value = atof(cstr); //convert to double
return value;
}
int convert2int(string str) {
char * cstr;
cstr = new char [str.size()+1]; // convert to c_str
strcpy_s (cstr, str.size()+1, str.c_str());
int value = atoi(cstr); //convert to int
return value;
}
double encodeAA(char aa) { // encodes AA letter names into numbers
if (aa == 'X') {return 0.00;} // if there is no AA related here, ie N-term side of a b1 ion
if (aa == 'V') {return 0.00;} // hydrophobic side groups
if (aa == 'L') {return 0.05;}
if (aa == 'I') {return 0.10;}
if (aa == 'M') {return 0.15;}
if (aa == 'F') {return 0.20;}
if (aa == 'G') {return 0.25;} // in between
if (aa == 'A') {return 0.30;}
if (aa == 'S') {return 0.35;}
if (aa == 'T') {return 0.40;}
if (aa == 'Y') {return 0.45;}
if (aa == 'W') {return 0.50;}
if (aa == 'C') {return 0.55;}
if (aa == 'P') {return 0.60;}
if (aa == 'N') {return 0.65;} // hydrophilic side groups
if (aa == 'E') {return 0.70;}
if (aa == 'Q') {return 0.75;}
if (aa == 'D') {return 0.80;}
if (aa == 'H') {return 0.85;}
if (aa == 'K') {return 0.90;}
if (aa == 'R') {return 0.95;}
//std::cout << "EncodeAA -> unknown AA letter code: " << aa << std::endl;
return 0.0; // separate value for unknown
};
double countHKR(string& seq) { // count the number of H, K or R residues in the input sequence
double HKR=0;
for (unsigned i=0; i<seq.size(); i++) {
char aa = seq.at(i);
if (aa == 'H') {HKR++; continue;}
if (aa == 'K') {HKR++; continue;}
if (aa == 'R') {HKR++; continue;}
}
return HKR;
}
double getMean(vector<double>& v) { // returns the mean
double total=0;
for (unsigned i=0; i < v.size(); i++) {total+= v.at(i);}
return total/v.size();
}
double getStdDev(vector<double>& v, double mean){ // returns the standard deviation of input array, second arg is the precomputed vector mean.
double numerator=0;
for (unsigned i=0; i < v.size(); i++) {numerator+= pow((v.at(i)-mean),2);}
return pow(numerator/v.size(),0.5);
}
/*
EXAMPLE OF AN OUTPUT FILE WHICH MUST BE PROCESSED AND LEARNED:
Search Parameters:
parentTol=1
tol=0.5
minBasicHits=4
minPrelimScore=0.05
randomSamples=25
randomLODspread=20
reportMin=5
topCandidates=0.3
maxTopCandidates=20
newmetric=0
maxFragCharge=2
verbosity=1
devsAboveMeanForHit=2
collectRandomHits=0
MOD0 = iodacetamide (static) mass: 57 acting on: C
MOD1 = oxidation (variable) mass: 16 acting on: M H W
MOD2 = deamidation (variable) mass: 1 acting on: N Q
MOD0 MOD1 MOD2 Delta
0 0 0 0
0 1 0 16
0 2 0 32
1 0 0 57
1 1 0 73
1 2 0 89
2 0 0 114
2 1 0 130
2 2 0 146
loading peptide database: Forward Db: 1689106 unique peptides loaded from peptides.dat
loading peptide database: Reverse Db: 683090 unique peptides loaded from reverse.dat
DTAFiles\AGTEPSDAIR.mgf.dta
top ten candidates for pmass: 1015.49 DTAFiles\AGTEPSDAIR.mgf.dta
AGTEPSDAIR prelim: 0.652732 LOD: 25.9144
ASAEQSGGGPR prelim: 0.498535 LOD: 23.7549
ASAVSELSPR prelim: 0.630028 LOD: 23.7549
TFAHNSALR prelim: 0.559211 LOD: 22.6751
APQFTILAR prelim: 0.651944 LOD: 22.6751
ASAAPPHPLR prelim: 0.507099 LOD: 20.5156
TAEEKVSPR prelim: 0.501239 LOD: 19.4358
EDSVLNIAR prelim: 0.502028 LOD: 19.4358
TTSISPALAR prelim: 0.597014 LOD: 19.4358
TGDPQETLR prelim: 0.523944 LOD: 18.356
random mean: 9.58833 random StdDev: 2.37195
std Devs above mean: 6.88297
IDENTIFIED: AGTEPSDAIR
observed pmass: 1015.49 pcharge: 1
identified pmass: 1015.48
mods: {0,0,0,0,0,0,0,0,0,0,}
basicHits: 8
prelimScore: 0.652732
Top 4N peaks: AGTEPSDAIR
intensity obsMZ expMZ type seq LOD
2265 86.10 86.08 immo I 1.080
2049 70.07 70.05 immo P 1.080
1672 175.13 175.10 y1+ .........R 2.160
1532 359.26 359.15 b1+ AGTE...... 2.160
816 658.43 658.26 b1+ AGTEPSD... 2.160
751 74.07 74.05 immo T 1.080
746 87.09 0.00 0.000
576 112.09 0.00 0.000
460 69.88 70.05 immo P 1.080
453 102.06 102.04 immo E 1.080
390 85.84 86.08 immo I 1.080
390 110.09 0.00 0.000
385 84.07 0.00 0.000
382 371.19 0.00 0.000
378 300.15 0.00 0.000
370 72.09 72.04 b1+ A......... 2.160
355 230.15 230.11 b1+ AGT....... 2.160
348 159.11 0.00 0.000
322 185.12 0.00 0.000
321 101.09 101.06 a1+ AG........ 1.080
318 641.39 641.33 y1+-1NH3 ....PSDAIR 1.080
273 120.09 0.00 0.000
264 71.88 72.04 b1+ A......... 2.160
251 158.11 158.10 y1+-1NH3 .........R 1.080
242 157.13 0.00 0.000
229 109.71 0.00 0.000
210 136.09 0.00 0.000
209 129.11 129.06 b1+ AG........ 2.160
200 83.81 0.00 0.000
200 119.66 0.00 0.000
199 212.13 212.11 b1+-1H2O AGT....... 1.080
194 130.11 0.00 0.000
overall LOD score: 25.9144
DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
loading obs spect: 0
top ten candidates for pmass: 2882.55 DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
ALIPSSQNEVELGELLLSLNYLPSAGR prelim: 0.252294 LOD: 66.5238
QMNLNPMDSPHSPISPLPPTLSPQPR prelim: 0.10953 LOD: 34.8458
GETLGNIPLLAIGPAICLPGIAAIALARK prelim: 0.0974537 LOD: 26.9263
VSSDPAWAVEWIELPRGLSLSSLGSAR prelim: 0.0921176 LOD: 25.3424
DAAIPAPHAPSAPGCLASVKPGGWKGAAGR prelim: 0.0996068 LOD: 25.3424
MEDTDFVGWALDVLSPNLISTSMLGR prelim: 0.0916495 LOD: 23.7585
MALLLLSLGLSLIAAQEFDPHTVMQR prelim: 0.0923048 LOD: 23.7585
QALTPEQVSFCATHMQQYMDPRGR prelim: 0.0931474 LOD: 23.7585
NTSIRVADFGSATFDHEHHTTIVATR prelim: 0.0745179 LOD: 22.1746
DLLEDKILHSHLVDYFPEFDGPQR prelim: 0.107471 LOD: 22.1746
random mean: 11.9109 random StdDev: 4.05973
std Devs above mean: 13.4523
IDENTIFIED: ALIPSSQNEVELGELLLSLNYLPSAGR
observed pmass: 2882.55 pcharge: 1
identified pmass: 2882.47
mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
basicHits: 18
prelimScore: 0.252294
Top 4N peaks: ALIPSSQNEVELGELLLSLNYLPSAGR
intensity obsMZ expMZ type seq LOD
3138 95.95 0.00 0.000
307 487.20 487.24 y1+ ......................PSAGR 3.168
279 86.09 86.08 immo L 1.584
253 112.09 0.00 0.000
237 763.35 763.38 y1+ ....................YLPSAGR 3.168
198 175.09 175.10 y1+ ..........................R 3.168
174 600.31 600.32 y1+ .....................LPSAGR 3.168
171 70.05 70.05 immo P 1.584
155 470.23 470.24 y1+-1NH3 ......................PSAGR 1.584
151 1716.85 0.00 0.000
141 1416.73 1416.77 y1+ ..............LLLSLNYLPSAGR 3.168
141 1715.73 1715.91 y1+ ...........LGELLLSLNYLPSAGR 3.168
133 232.10 232.12 y1+ .........................GR 3.168
119 400.13 0.00 0.000
108 990.44 990.50 y1+ ..................LNYLPSAGR 3.168
104 185.09 185.12 b1+ AL......................... 3.168
103 542.20 0.00 0.000
102 877.33 877.42 y1+ ...................NYLPSAGR 3.168
98 303.16 303.16 y1+ ........................AGR 3.168
97 1303.62 1303.69 y1+ ...............LLSLNYLPSAGR 3.168
96 1943.86 1944.02 y1+ .........VELGELLLSLNYLPSAGR 3.168
95 471.21 0.00 0.000
95 871.25 0.00 0.000
92 1077.48 1077.53 y1+ .................SLNYLPSAGR 3.168
80 742.28 0.00 0.000
73 298.18 298.20 b1+ ALI........................ 3.168
71 643.19 0.00 0.000
71 860.35 860.42 y1+-1NH3 ...................NYLPSAGR 1.584
66 644.09 644.37 b1+-1NH3-2H2O ALIPSSQ.................... 1.584
66 743.40 0.00 0.000
63 756.27 0.00 0.000
60 187.09 0.00 0.000
58 391.19 0.00 0.000
58 449.25 0.00 0.000
58 2588.10 0.00 0.000
57 655.32 0.00 0.000
57 1602.76 1602.83 y1+ ............GELLLSLNYLPSAGR 3.168
56 497.16 0.00 0.000
54 228.08 0.00 0.000
54 859.82 859.42 y1+-1H2O ...................NYLPSAGR 1.584
53 288.63 0.00 0.000
53 289.17 0.00 0.000
51 243.16 0.00 0.000
51 320.48 0.00 0.000
51 409.64 0.00 0.000
47 2072.96 2073.06 y1+ ........EVELGELLLSLNYLPSAGR 3.168
46 1843.89 0.00 0.000
40 2187.00 2187.10 y1+ .......NEVELGELLLSLNYLPSAGR 3.168
36 2167.81 0.00 0.000
36 2404.79 0.00 0.000
1655 96.07 0.00 0.000
974 96.18 0.00 0.000
958 95.89 0.00 0.000
overall LOD score: 66.5238
DTAFiles\AQMDSSQLIHCR.mgf.dta
top ten candidates for pmass: 1444.43 DTAFiles\AQMDSSQLIHCR.mgf.dta
FDHNAIMKILDL prelim: 0.272246 LOD: 23.6381
MAEPGHSHHLSAR prelim: 0.381321 LOD: 21.0117
MPHLRLVLPITR prelim: 0.299034 LOD: 19.6984
VMSKSAADIIALAR prelim: 0.295296 LOD: 19.6984
IEHSTFDGLDRR prelim: 0.361541 LOD: 18.3852
MESDRLIISLPR prelim: 0.217111 LOD: 17.072
IDAMHGVMGPYVR prelim: 0.284083 LOD: 17.072
WQLNPGMYQHR prelim: 0.317101 LOD: 17.072
GTHPAIHDILALR prelim: 0.294206 LOD: 17.072
SPSSLLHDPALHR prelim: 0.294206 LOD: 15.7588
random mean: 10.2432 random StdDev: 3.34294
std Devs above mean: 4.00694
IDENTIFIED: FDHNAIMKILDL
observed pmass: 1444.43 pcharge: 1
identified pmass: 1444.66
mods: {0,0,0,0,0,0,16,0,0,0,0,0,}
basicHits: 5
prelimScore: 0.272246
Top 4N peaks: FDHNAIMKILDL
intensity obsMZ expMZ type seq LOD
2157 1000.35 0.00 0.000
1812 110.06 110.00 immo H 1.313
1360 175.08 0.00 0.000
1032 200.09 0.00 0.000
922 455.11 0.00 0.000
830 86.08 86.08 immo I 1.313
648 101.06 101.09 immo K 1.313
616 472.15 0.00 0.000
552 70.06 0.00 0.000
504 335.09 0.00 0.000
492 89.06 0.00 0.000
383 112.08 0.00 0.000
330 585.21 585.18 b1+ FDHNA....... 2.626
326 133.02 0.00 0.000
298 1289.38 0.00 0.000
284 229.11 0.00 0.000
279 87.04 87.04 immo N 1.313
265 913.36 0.00 0.000
263 826.31 0.00 0.000
254 104.05 104.04 immo M 1.313
240 698.29 698.26 b1+ FDHNAI...... 2.626
240 748.23 748.40 y1+ ......MKILDL 2.626
239 298.05 0.00 0.000
236 334.07 0.00 0.000
230 166.04 0.00 0.000
228 183.07 0.00 0.000
223 247.07 247.11 y1+ ..........DL 2.626
221 120.10 120.07 immo F 1.313
221 1424.88 0.00 0.000
213 1325.48 0.00 0.000
210 329.15 0.00 0.000
204 731.14 731.40 y1+-1NH3 ......MKILDL 1.313
203 1354.55 0.00 0.000
201 316.08 0.00 0.000
201 1008.42 0.00 0.000
199 568.22 568.18 b1+-1NH3 FDHNA....... 1.313
196 639.28 0.00 0.000
188 366.20 0.00 0.000
186 84.04 0.00 0.000
184 203.04 0.00 0.000
181 331.14 0.00 0.000
181 855.49 0.00 0.000
177 217.08 0.00 0.000
177 791.28 0.00 0.000
174 318.11 0.00 0.000
170 216.07 0.00 0.000
167 861.24 861.48 y1+ .....IMKILDL 2.626
165 531.15 0.00 0.000
overall LOD score: 23.6381
DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
top ten candidates for pmass: 1908.96 DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
ATAAPAGAPPQPQDLEFTK prelim: 0.464155 LOD: 54.2672
KPTNLGTSCLLRHLQR prelim: 0.338948 LOD: 27.1336
APQPQRSASPALPFWTR prelim: 0.201213 LOD: 24.6669
STGWYPQPQINWSDSK prelim: 0.355488 LOD: 24.6669
ALQSAGPFGPYIPGGPAPGR prelim: 0.303074 LOD: 23.4336
QVSAESSSSMNSNTPLVR prelim: 0.205814 LOD: 23.4336
GKQSPHHERPEPEPPR prelim: 0.211954 LOD: 22.2002
ALRPKPTLEKDLEEDR prelim: 0.218912 LOD: 20.9669
IVEVLGIPPAHILDQAPK prelim: 0.343929 LOD: 20.9669
GQELHKPETGCQQELR prelim: 0.197297 LOD: 20.9669
random mean: 12.1361 random StdDev: 4.61053
std Devs above mean: 9.138
IDENTIFIED: ATAAPAGAPPQPQDLEFTK
observed pmass: 1908.96 pcharge: 1
identified pmass: 1908.95
mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
basicHits: 15
prelimScore: 0.464155
Top 4N peaks: ATAAPAGAPPQPQDLEFTK
intensity obsMZ expMZ type seq LOD
6560 1299.53 1299.63 y1+ ........PPQPQDLEFTK 2.467
5095 63.86 0.00 0.000
5056 70.06 70.05 immo P 1.233
2954 977.39 977.47 y1+ ...........PQDLEFTK 2.467
2364 226.10 226.13 b1+-1H2O ATA................ 1.233
2318 323.13 0.00 0.000
1331 297.12 297.17 b1+-1H2O ATAA............... 1.233
1183 315.12 315.17 b1+ ATAA............... 2.467
1105 341.14 0.00 0.000
987 663.24 0.00 0.000
894 244.08 244.13 b1+ ATA................ 2.467
879 86.08 86.08 immo L 1.233
833 1595.70 1595.78 y1+ ....PAGAPPQPQDLEFTK 2.467
828 101.06 101.06 immo Q 1.233
701 611.25 611.32 b1+ ATAAPAGA........... 2.467
688 141.08 0.00 0.000
668 175.07 0.00 0.000
661 169.08 0.00 0.000
650 1427.57 1427.69 y1+ ......GAPPQPQDLEFTK 2.467
613 269.12 0.00 0.000
613 548.20 0.00 0.000
596 933.36 933.48 b1+ ATAAPAGAPPQ........ 2.467
588 905.35 905.48 a1+ ATAAPAGAPPQ........ 1.233
572 1281.52 1281.63 y1+-1H2O ........PPQPQDLEFTK 1.233
512 368.15 0.00 0.000
495 110.06 0.00 0.000
480 394.16 394.22 b1+-1H2O ATAAP.............. 1.233
480 1202.47 1202.58 y1+ .........PQPQDLEFTK 2.467
454 619.24 619.33 y1+-1H2O ..............LEFTK 1.233
447 270.13 0.00 0.000
439 776.31 0.00 0.000
437 1370.58 1370.67 y1+ .......APPQPQDLEFTK 2.467
433 98.05 0.00 0.000
414 365.16 0.00 0.000
403 72.07 72.04 b1+ A.................. 2.467
396 337.15 0.00 0.000
394 173.07 173.09 b1+ AT................. 2.467
386 243.11 0.00 0.000
370 126.03 0.00 0.000
367 708.29 708.37 b1+ ATAAPAGAP.......... 2.467
360 195.09 0.00 0.000
350 248.13 248.14 y1+ .................TK 2.467
330 112.06 0.00 0.000
330 115.07 0.00 0.000
329 212.10 0.00 0.000
313 540.22 540.28 b1+ ATAAPAG............ 2.467
311 209.06 0.00 0.000
311 295.13 0.00 0.000
310 167.09 0.00 0.000
309 906.38 0.00 0.000
303 155.07 155.09 b1+-1H2O AT................. 1.233
299 382.15 0.00 0.000
293 186.09 0.00 0.000
278 74.05 74.05 immo T 1.233
278 1052.46 0.00 0.000
273 583.21 583.32 a1+ ATAAPAGA........... 1.233
264 1088.41 1088.53 y1+-1NH3 ..........QPQDLEFTK 1.233
262 355.12 0.00 0.000
259 102.06 102.04 immo E 1.233
245 338.15 0.00 0.000
242 453.19 0.00 0.000
239 520.24 0.00 0.000
237 452.16 0.00 0.000
232 198.10 0.00 0.000
overall LOD score: 54.2672
*/
Flood::Vector<double> getAttributes(fragmatch& f,string seq, unsigned pcharge) {
//if (params.verbosity < 5) {cout << "getting attributes for frag " << f.type << " " << f.desc << endl; }
// given the seq, and the immonium fragment, returns the attribute vector to be fed into the neural net
vector<double> attributes; // holds the attributes for this fragment
// NO FRAGMENTATION - TYPE 0
if (f.fragType.substr(0,6).compare("nofrag")==0) {
attributes.push_back(0); // iontype 0 = unfragmented
attributes.push_back(pcharge);
attributes.push_back(f.fcharge);
attrib
Initial URL
Initial Description
Harvest paper: Citation: McHugh L, Arthur JW (2008) Computational methods for protein identification from mass spectrometry data. PLoSComputBiol 4(2): e12. doi:10.1371/journal.pcbi.0040012 PINNACLE automatically learns the statistical and neural network models for the Harvest algorithm. Contact [email protected] for a Visual Studio project solution.
Initial Title
PINNACLE - Automated Machine Learning Module for the Harvest peptide identification package
Initial Tags
Initial Language
C++