Revision: 35042
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at November 1, 2010 13:58 by leomchugh
Initial Code
/* Learner.cpp : Copyright (c) 2009, Leo McHugh, University of Sydney, Australia All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the <organization> nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY LEO MCHUGH ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL LEO MCHUGH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. This code requires input files from PreProcess.cpp updates of this code can be found at: http://www.usyd.edu.au/sydneybioinformatics/research/software.shtml Contact: [email protected] */ /* PURPOSE AND USE OF THIS SOFTWARE: This software is designed for use with PINNACLE. Once PINNACLE is run, it generates a list of hits along with a confidence score for each, and prints out a list of identified fragments. An example is shown as a commented out section at the very end of this file. Learner.cpp uses these files to gather statistical information about fragmentation and to build and train a neural network. For version1.0, only limited information will be collected. We calculate approximate probability functions for: the parent mass error w.r.t mass fragment mass erro w.r.t mass probability of fragment identification in correct assignments w.r.t mass / charge Neural Nets: Inputs for predicting peak intensity for a fragment: iontype either no frag, immoium, y or b ion (nofrag, immo, y1, a1, b1) = (0,1,3,2,4); // 5 reserved for z ions but not implemented pcharge parent charge fcharge fragment charge plen the length of the parent peptide prop_len the proportion of the distance from N to C term that the fragmentation site was at pr_hkr proportion of HKR and R residues in the fragment over those in the unbroken peptide n_x identity of AA on the n-term side of cleavage c_x identity..... c-term inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x (taken from working notes07) ==== Add later============ loss 0 if it's a 'native' ion, such as a y or b, 1 if it contains losses. Output: log of the intensity, scaled so that the max log(intensity) for the fragment is set to 100 // maybe use later 6 Fuzzy Logic NN outputs, 0, 20, 40 etc to 100. */ #include "stdafx.h" #include <iostream> #include <string> #include <fstream> #include <iostream> #include <sstream> #include <vector> #include <list> #include <time.h> // for random number generation // Learner based libraries: #include "../Utilities/Vector.h" #include "../Utilities/Matrix.h" #include "../Utilities/InputTargetDataSet.h" #include "../MultilayerPerceptron/MultilayerPerceptron.h" #include "../ObjectiveFunctional/MeanSquaredError.h" #include "../TrainingAlgorithm/QuasiNewtonMethod.h" using namespace std; // ======================= FIXED PARAMETERS ===================== unsigned const MAX_SEQ_SIZE = 50; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp unsigned const MIN_SEQ_SIZE = 5; // make sure this is the same as in preprocess.cpp and PINNACLE.cpp unsigned outputNodes = 1; // The number of output nodes in the fuzzy network unsigned inputNodes = 8; // described above double propTesting = 0.2; // Proportion of peptides reserved for testing. double minZscore = 3.7; // minimum required Zscore for a 'hit' to be considered confident enough to be used for testing/training unsigned const buckets = 50; // mass spectrum division into this many buckets of 100 da. for statistical learning part bool fuzzyOutputs=false; // if true, splits the output to a fuzzy output, if false, there is a single output node. // ======================= OBJECTS ============================== // ============== MUST BE THE SAME AS IN PINNACLE.CPP =========== struct fragmatch { // records a matching fragment in from an indentified peptide unsigned intensity; double obsmz; double expmz; // expected mz int fcharge; // fragment charge string fragType; // ie y1+ string fragDesc; // decription of the fragment. ie: ........NQLLQFR unsigned fragSite; // from N to c term. zero is no frag, 1 is first bion / last y ion etc, double fragLOD; // LOD for this frag }; struct format // converts a hit into a neural net input set { double pmasserror; // error in the pmass calculation vector<vector<double>> fmzerror; // fragment mass error array <fmass,error> vector<vector<double>> NNlines; // formatted lines to be appended to the NN training matrix. }; struct hit { string seq; // sequence of the matched peptide double Zscore; // number of std devs above the random mean int pcharge; // charge of the observed parent peptide int basichits; // the number of no loss y and b ions double prelimScore; // the proportion of the top 4N peaks covered by no loss y and b ions. double obspmass; // observed pmass of the input spectrum (uncharged) double exppmass; // theoretical pmass of the matched peptide (uncharged) vector<fragmatch> peaks; // contains the top 4N peaks vector<double> mods; // one entry for each AA from N to C term double overallLOD; // LOD score generated by PINNACLE for this peptide match }; struct learnedParameters { // holds the parameters learned by this file for passing to PINNACLE // THIS STRUCTURE PASSES ANY LEARNED PARAMETERS TO PINNACLE // IF YOU WANT TO INCLUDE MORE STATISTICAL LEARNING, ADD THE LEARNED PARAMETERS // TO THIS OBJECT, AND UPDATE PINNACLE SO THAT IT CAN LOAD THE SAME OBJECT // THIS OBJECT IS WRITTEN TO BINARY FILE AFTER LEARNING SO THAT PINNACLE CAN READ IT IN double pmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for parent mass errors double fmassErrors[buckets][2]; // mean and std dev for every bucket (100 Da) for fragment mass errors // FEEL FREE TO ADD MORE LEARNED PARAMETERS IN HERE (REMEMBERING TO UPDATE THE SAME OBJECT IN PINNACLE }; // =================================== FILE LOCATIONS ================================================================ // location of the output of the PINNACLE run (this is a serial file writen by PINNACLE storing all the hits as objects) char *inputFile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\output.ser"; // .ser (serial) - see PINNACLE for format details // the below files are all outputs from this program. savedModel and learnedParameters store the neural network and // other general learned parameters respectively and these 2 files must be written to the working directory of PINNACLE // neural network related char *tempTrainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\tempTrainingSet.dat"; // just contains instances char *trainingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingSet.dat"; char *testingfile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\testingSet.dat"; char *trainingHistory = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingHistory.dat"; char *savedModel = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Debug\\multilayerPerceptron.dat"; // other learned parameters (see object for details) char *learnedParams = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\learnedParameters.bin"; // binary // ===================================== FUNCTIONS =========================== // LOAD IN AND DATA PREPARATION void createTrainingFile(); hit loadInHit(ifstream&); // loads in the next hit vector<string> processHit(hit, bool); // returns a block of lines which are converted into a NN input file (Flood style) double getMean(vector<double>&); // returns the mean double getStdDev(vector<double>&, double); // returns the standard deviation of input array, second arg is the precomputed vector mean. void learnMassErrors(); // learns parent and fragment mass errors for this dataset // TESTING OF THE MODEL void loadAndProcessTrainingFile(); void testModel(); // compares against a flat spectrum. double dotProduct(vector<double>, vector<double>); // return inner dot product of input vectors double testPeptide(vector<double>); // returns the dot product of a flat and the predicted spect. fragmatch findMatch(string, unsigned, hit); // (iontype, breakSite,matchArray) vector<double> fuzzyOut(float, float); // produces a fuzzy output of the intensity void trainingSetHeader(ofstream&,int,int,int,string,string);// sets up the header for the output training file // OTHER HANDY FUNCTIONS vector<double> normalise(vector<double>, double sumTotal); // takes an input vector and normalises to sum to 100 vector<double> convert2mods(string); // vector<double> convert2doubleVect(string); // converts a space separated line into a vector of doubles (from test set) vector<fragmatch> convert2peaks(string); // used during loading in to convert .ser file to a hit object double convert2double(string); // converts a string to a double int convert2int(string); // converts a string to an int string convert2str(vector<double>); // converts doubles into a string for output double encodeAA(char); double countHKR(string&); string stringify(double x); // convert double to string // Neural network Flood::Vector<double> getAttributes(fragmatch& fm,string seq, unsigned pcharge); Flood::MultilayerPerceptron ANN; void printFloodVector(Flood::Vector<double>& v); // ============== DEBUGGING =========================== int verbosity = 4; int _tmain(int argc, _TCHAR* argv[]) { // NEURAL NETWORK LEARNING // reads the identified peptides and creates a NN training file // createTrainingFile(); // imports the processed training file, sets the neural net parameters for the model and learning // loadAndProcessTrainingFile(); // tests neural network model against the testing file generated by createTrainingFile. testModel(); // STATISTICAL MACHINE LEARNING // collects basic stats on hits to optimise search parameters. // learnMassErrors(); // collects stats on parent and fragment mass erros. return 0; } void learnMassErrors() { // learn the mass errors from the hits cout << "Learning from parent and fragment mass errors from: " << inputFile << endl; ifstream inputHits; inputHits.open(inputFile); // load in the the input file in binary format learnedParameters lp; // object for storing learned parameters // we divide the mass range up into buckets of 100 Da, storing a mean and variance for each bucket // PINNACLE will then use this information in a prob density function vector<vector<double>> pmassbuckets; // stores the parent mass errors for each bin vector<vector<double>> fmassbuckets; // stores the fragment mass errors for each bin for (unsigned i=0; i < buckets; i++) { vector<double> v; pmassbuckets.push_back(v); // create elements (blank memory spacers for the time being) fmassbuckets.push_back(v); // create elements } unsigned hitsLoaded=0; while (!inputHits.eof()) { hit h = loadInHit(inputHits); if (h.seq.compare("END")==0) {cout << "hits loaded: " << hitsLoaded << endl; break;} // no more available (loadInHit returns 'END' at eof) if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning hitsLoaded++; // RECORD PARENT MASS ERROR unsigned index = (unsigned) floor(h.obspmass/100); // assign bucket (100 Da per bucket) if (index > buckets-1) {index = buckets-1;} // prevent overflow //double modmass =0; // account for mods //for (unsigned i=0; i < h.mods.size(); i++) {modmass+=h.mods.at(i);} double difference = h.obspmass-(h.exppmass); if (fabs(difference) > 1) { cout << "obs mass: " << h.obspmass << " expmass: " << h.exppmass << endl; // " modmass: " << modmass << endl; cout << "unexpectedly high parent mass error - skip"; continue; // sometimes a peptide is IDed twice, each with different mods, keep searching.. } pmassbuckets.at(index).push_back(difference); // record parent mass error // RECORD FRAGMENT MASS ERRORS for (unsigned i=0; i<h.peaks.size(); i++) { if (h.peaks.at(i).obsmz == 0) {continue;} // no matching peak unsigned index = (unsigned) floor(h.peaks.at(i).obsmz/100); // assign bucket (100 Da per bucket) if (index > buckets-1) {index = buckets-1;} // prevent overflow // using absolute error values for frag mass due to found bimodal // distributions for some datasets. non absolute values would be // better (but can't then use mean and std dev in probability functions //cout << " fmass: " << h.peaks.at(i).obsmz << " error: " << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl; fmassbuckets.at(index).push_back(fabs(h.peaks.at(i).obsmz-h.peaks.at(i).expmz)); // record parent mass error //if (fragError.at(1) < 0 && h.peaks.at(i).obsmz > 700) {cout << "UNDER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;} //if (fragError.at(1) > 0 && h.peaks.at(i).obsmz > 700) {cout << " OVER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;} } } // end loading in all hits from file inputHits.close(); // SHOW SOME OF THE DISTROS IN THE BUCKETS unsigned testBucket = 8; cout << "bucket" << testBucket << "={"; for (unsigned i=0; i < fmassbuckets.at(testBucket).size(); i++) {cout << fmassbuckets.at(testBucket).at(i) << ',';} cout << "};\n"; exit(0); // GET PARAMETERS FOR MEAN AND VARIANCE FOR EACH BUCKET unsigned minDatapoints = 10; // the minimum number of values in a bucket for the calculation of mean and std dev // ================ PARENT MASS ERRORS ============================================================ vector<vector<double>> pe; // pe = pmass errors for each bucket, stores <mean, stdDev> for (unsigned bucket=0; bucket < buckets; bucket++) { vector<double> v; v.clear(); if (pmassbuckets.at(bucket).size() < minDatapoints) {pe.push_back(v); continue;} // too few values to calculate double mean = getMean(pmassbuckets.at(bucket)); double stdDev = getStdDev(pmassbuckets.at(bucket), mean); //cout << "massbucket: " << bucket << ": {"; //for (unsigned j=0; j<pmassbuckets.at(bucket).size(); j++) {cout << pmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl; //cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl; v.push_back(mean); v.push_back(stdDev); pe.push_back(v); } // if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours. // first back-propogate the values to leading empty buckets for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors if (pe.at(i).empty()) {continue;} vector<double> v = pe.at(i); // first non-empty bucket while (1) {pe.at(i) = v; if (i==0) {break;} i--;} break; } // now propagate forward any values to any empty buckets vector<double> v = pe.at(0); // first bucket's value (mean,stdDev) for (unsigned i=0; i < pe.size(); i++) { // pe = parent errors if (pe.at(i).empty()) {pe.at(i)=v; continue;} // assign same values as bucket to the left v = pe.at(i); // update } cout << "pmass errors: \n"; for (unsigned i=0; i < pe.size(); i++) { if (pe.at(i).empty()) {cout << "i: " << i << "{}\n";} else {cout << "bucket: " << i << " {" << pe.at(i).at(0) << ',' << pe.at(i).at(1) << "}\n";} } cout << endl; // load into object for (unsigned i=0; i < buckets; i++) { // pe = parent errors lp.pmassErrors[i][0] = pe.at(i).at(0); // mean lp.pmassErrors[i][1] = pe.at(i).at(1); // standard Deviation } // ================ FRAGMENT MASS ERRORS ============================================================ vector<vector<double>> fe; // fe = fragment errors for each bucket, stores <mean, stdDev> for (unsigned bucket=0; bucket < buckets; bucket++) { vector<double> v; v.clear(); if (fmassbuckets.at(bucket).size() < minDatapoints) {fe.push_back(v); continue;} // too few values to calculate double mean = getMean(fmassbuckets.at(bucket)); double stdDev = getStdDev(fmassbuckets.at(bucket), mean); //cout << "fmassbucket: " << bucket << ": {"; //for (unsigned j=0; j<fmassbuckets.at(bucket).size(); j++) {cout << fmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl; //cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl; v.push_back(mean); v.push_back(stdDev); fe.push_back(v); } // if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours. // first back-propogate the values to leading empty buckets for (unsigned i=0; i < fe.size(); i++) { // pe = parent errors if (fe.at(i).empty()) {continue;} vector<double> v = fe.at(i); // first non-empty bucket //cout << "first non-empty fragmass bucket: {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}" << endl; while (1) {fe.at(i) = v; if (i==0) {break;} i--;} break; } // now propagate forward any values to any empty buckets v = fe.at(0); // first bucket's value (mean,stdDev) for (unsigned i=0; i < fe.size(); i++) { // pe = parent errors if (fe.at(i).empty()) {fe.at(i)=v; continue;} // assign same values as bucket to the left v = fe.at(i); // update } cout << "fragment errors: \n"; for (unsigned i=0; i < fe.size(); i++) { if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";} else {cout << "bucket: " << i << " {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}\n";} } cout << endl; // output for Mathematica graphing - buckets of 100 da {{fragmass,AvgError,StdDev},{...}} cout << "ferrors={"; for (unsigned i=0; i < fe.size(); i++) { if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";} else {cout << '{' << i*100 << ',' << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "},";} } cout << '}' << endl; // load into object for (unsigned i=0; i < buckets; i++) { // pe = parent errors lp.fmassErrors[i][0] = fe.at(i).at(0); // mean lp.fmassErrors[i][1] = fe.at(i).at(1); // standard Deviation } ofstream out; // write object to file out.open(learnedParams, ios::binary); out.write((char *)(&lp),sizeof(lp)); // write object out to binary file (used by PINNACLE) out.close(); cout << "quiting after learned parameters writen to file"; exit(0); } void testModel() { // first load in the model Flood::MultilayerPerceptron ANN; ANN.load(savedModel); cout << "loading ANN from " << savedModel << endl; ifstream testingSet; testingSet.open(testingfile); vector<double> improvements; // load in a single instance from the test set string line; while (!testingSet.eof()) { vector<vector<double>> instance; // instance of the matching hits for a given peptide instance.clear(); line = "NULL"; while (!line.empty() && !testingSet.eof()) { getline(testingSet, line ); if (!line.empty()) { //cout << " pushing back line: " << line << " of size: " << convert2doubleVect(line).size() << endl; instance.push_back(convert2doubleVect(line)); } } //cout << "instance loaded with " << instance.size() << endl; vector<double> predicted; vector<double> flat(instance.size(),1); vector<double> observed; for (unsigned i=0; i < instance.size(); i++) { vector<double> input = instance.at(i); double obsPeakInt = input.back(); // the correct output observed.push_back(obsPeakInt); input.pop_back(); // remove the last element (answer) // need to convert a normal vector to a flood style vector Flood::Vector<double> attributes(input.size()); for (unsigned j=0; j<input.size(); j++) {attributes[j] = input.at(j);} Flood::Vector<double> fuzzyOutput = ANN.getOutput(attributes); if (fuzzyOutputs) { // take the highest node as the prediction of intensity (rounded to nearest 20%) unsigned maxnode=0; double maxval=0; for (int k=0; k < fuzzyOutput.getSize(); k++) { if (fuzzyOutput[k] > maxval) {maxnode=k; maxval=fuzzyOutput[k];} } predicted.push_back(maxnode); // just take the node with the highest output } else { //cout << "attributes: "; for (int k=0; k < attributes.getSize(); k++) {cout << attributes[k] << ' ';} //cout << "predicted output: " << fuzzyOutput[0] << endl; predicted.push_back(fuzzyOutput[0]); // just the raw single number coming out. //double exponent = pow(2.17,fuzzyOutput[0]); //predicted.push_back(exponent); // blow it back up } // print out the fuzzy nodes, max node and observed intensity /*cout << "fuzzyOut " << i << ": "; for (int k=0; k < fuzzyOutput.getSize(); k++) { char buffer [50]; int n, a=5, b=3; n=sprintf_s(buffer, 50, "%0.3f",fuzzyOutput[k]); cout << buffer << " "; }cout << "| " << maxnode*0.2 << " | " << obsPeakInt << endl; */ } // end prediction of each peak height // end of single test peptide // Normalise vectors predicted = normalise(predicted,100); // normalises spectrum to a sum total of 100 flat = normalise(flat,100); observed = normalise(observed,100); // show normalized vectors cout << "flat : {"; for (unsigned i=0; i < flat.size()-1; i++) {cout << flat.at(i) << ',';} cout << flat.back() << "}\n"; cout << "predicted: {"; for (unsigned i=0; i < predicted.size()-1; i++) {cout << predicted.at(i) << ',';} cout << predicted.back() << "}\n"; cout << "observed : {"; for (unsigned i=0; i < observed.size()-1; i++) {cout << observed.at(i) << ',';} cout << observed.back() << "}\n"; double flatDP = dotProduct(observed,flat); double annDP = dotProduct(observed,predicted); improvements.push_back(annDP/flatDP); cout << "imp: " << annDP/flatDP << " flatDP: " << flatDP << " annDP: " << annDP << endl << endl; //exit(0); } cout << "improvements={"; for (unsigned i=0; i < improvements.size(); i++) {cout << improvements.at(i) << ',';} cout << "};"; // use a dot product model. // for the FOUND peaks, create a spectrum. then normalise return; } double dotProduct(vector<double> v1, vector<double> v2) { double dp=0; if (v1.size() != v2.size()) {cout << "vectors not equal length!"; exit(1);} for (unsigned i=0; i < v1.size(); i++) { dp += v1.at(i)*v2.at(i); } return dp; } vector<double> normalise(vector<double> v, double sumTotal) { double sum = 0; for (unsigned i=0; i < v.size(); i++) {sum+=v.at(i);} // make sum to 100 double scalar = sumTotal/sum; for (unsigned i=0; i < v.size(); i++) {v.at(i)*=scalar;} return v; } double testPeptide(vector<double> fragmatches) { // returns a dot product for the predicted and a flat spect. return 0; } vector<double> convert2doubleVect (string line) { // converts a space separate line into a double vector unsigned i=0; unsigned j=0; vector<double> out; while (j < line.size()) { if (j == line.size()-1) { // last one string tmp = line.substr(i,j-i+1); //cout << "element: " << tmp << endl; double val = convert2double(tmp); out.push_back(val); } if (line.at(j) != ' ' && j < line.size()) {j++; continue; } string tmp = line.substr(i,j-i); //cout << "element: " << tmp << endl; double val = convert2double(tmp); out.push_back(val); i=j; j++; } //exit(0); return out; } void loadAndProcessTrainingFile() { // LOAD IN TRAINING SET INTO NN FOR LEARNING> Flood::InputTargetDataSet inputTargetDataSet ; inputTargetDataSet.load(trainingfile); // set up a new neural network Flood::MultilayerPerceptron multilayerPerceptron(inputNodes,inputNodes,outputNodes); // input, middle layer, output - same number in middle layer as inputs multilayerPerceptron.initFreeParametersNormal(0.0,1.0); // initialise the network within the chosen range // set up the names of the variables (both in and out) Flood::Vector<string> nameOfInputVariables = inputTargetDataSet.getNameOfInputVariables(); Flood::Vector<string> nameOfTargetVariables = inputTargetDataSet.getNameOfTargetVariables(); multilayerPerceptron.setNameOfInputVariables(nameOfInputVariables); multilayerPerceptron.setNameOfOutputVariables(nameOfTargetVariables); // get the parameters for the input data - bound in necessary //Flood::Matrix<double> meanAndStandardDeviationOfInputData = inputTargetDataSet.getMeanAndStandardDeviationOfInputData(); Flood::Matrix<double> minimumAndMaximumOfInputData = inputTargetDataSet.getMinimumAndMaximumOfInputData(); Flood::Vector<double> minOfInputVariables(inputNodes); // these values are not inclusive, so min must be lower than any expected input minOfInputVariables[0] = -0.01; // iontype minOfInputVariables[1] = 0.99; // pcharge minOfInputVariables[2] = 0.99; // fcharge minOfInputVariables[3] = MIN_SEQ_SIZE; // plen minOfInputVariables[4] = -0.01; // prop_len minOfInputVariables[5] = -0.01; // pr_hkr minOfInputVariables[6] = -0.01; // n_x minOfInputVariables[7] = -0.01; // c_x multilayerPerceptron.setMinimumOfInputVariables(minOfInputVariables); Flood::Vector<double> maxOfInputVariables(inputNodes); maxOfInputVariables[0] = 4.01; // iontype maxOfInputVariables[1] = 3.01; // pcharge maxOfInputVariables[2] = 2.01; // fcharge maxOfInputVariables[3] = MAX_SEQ_SIZE; // plen maxOfInputVariables[4] = 1.01; // prop_len maxOfInputVariables[5] = 1.01; // pr_hkr maxOfInputVariables[6] = 1.01; // n_x maxOfInputVariables[7] = 1.01; // c_x multilayerPerceptron.setMaximumOfInputVariables(maxOfInputVariables); // set up the relevant max and min etc for the output neuron(s) Flood::Matrix<double> minimumAndMaximumOfTargetData = inputTargetDataSet.getMinimumAndMaximumOfTargetData(); cout << "min encountered output: " << minimumAndMaximumOfTargetData[0][0] << endl; cout << "max encountered output: " << minimumAndMaximumOfTargetData[1][0] << endl; double minBuffer = minimumAndMaximumOfTargetData[0][0] - minimumAndMaximumOfTargetData[0][0]/10; // min plus some buffer for unexpectedly low results double maxBuffer = minimumAndMaximumOfTargetData[1][0] + minimumAndMaximumOfTargetData[1][0]/10; cout << "minBuffer: " << minBuffer << " maxBuffer: " << maxBuffer <<endl; Flood::Vector<double> minOfOutputVariables(outputNodes); minOfOutputVariables[0] = minBuffer; // 10% buffer under the min value encountered in training set Flood::Vector<double> maxOfOutputVariables(outputNodes); maxOfOutputVariables[0] = maxBuffer; // 10% buffer over the max value encountered in training set multilayerPerceptron.setMinimumOfOutputVariables(minOfOutputVariables); multilayerPerceptron.setMaximumOfOutputVariables(maxOfOutputVariables); multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MinimumAndMaximum); multilayerPerceptron.print(); Flood::MeanSquaredError meanSquaredError(&multilayerPerceptron,&inputTargetDataSet); //multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MeanAndStandardDeviation); Flood::QuasiNewtonMethod quasiNewtonMethod(&meanSquaredError); quasiNewtonMethod.setEvaluationGoal(0.0); quasiNewtonMethod.setGradientNormGoal(0.0001); quasiNewtonMethod.setMaximumTime(320.0); // seconds quasiNewtonMethod.setMaximumNumberOfEpochs(20); quasiNewtonMethod.train(); quasiNewtonMethod.saveTrainingHistory(trainingHistory); multilayerPerceptron.save(savedModel); cout << "neural network saved to: " << savedModel << endl; return; } void createTrainingFile() { // READ IN IDENTIFIED PEPTIDES cout << "Learning from: " << inputFile << endl; ifstream inputHits; inputHits.open(inputFile); // // load in the the input file in binary format ofstream tempTrainingSet; tempTrainingSet.open(tempTrainingfile); ofstream trainingSet; // writes to immediately if we're not using fuzzy outputs (as it doesn't need to be balanced). trainingSet.open(trainingfile); ofstream testingSet; testingSet.open(testingfile); ofstream testTrainLog; testTrainLog.open("testTrain.log"); vector<vector<double>> pmasses; // records <mass,error> for all parent masses vector<vector<double>> fragmasses; // records <mass,error> for all fragment masses vector<string> trainSet; // training set - saved in MultiLayerPerceptron input file vector<string> testSet; // testing set - saved as a text list of inputs with correct output as last figure. string line ; int verbosity = 0; double loaded = 0; // total double testing = 0; // those used in testing while (!inputHits.eof()) { hit h = loadInHit(inputHits); loaded++; if (h.seq.compare("END")==0) {break;} // no more available (loadInHit returns 'END' at eof) if (h.Zscore < minZscore) {continue;} // not confident enough hit to use for learning bool train; // if set 1, used for training, else used for testing if ((testing/loaded) <= propTesting) {train=0; testing++;} else {train=1;} // if not train, then test cout << "processing " << h.seq << endl; vector<string> instances = processHit(h,train); // creates attributes for training with this peptide //for (unsigned i=0; i < instances.size(); i++) { // cout << instances.at(i); //} if (train==0) { // peptide allocated for testing set - dump to file for (unsigned i=0; i < instances.size(); i++) {testingSet << instances.at(i);} testingSet << endl; // peptides separated by newlines. testing++; } if (train==1) { // peptide allocated for testing set - save later writing for (unsigned i=0; i < instances.size(); i++) {trainSet.push_back(instances.at(i));} } //if (processed > 50) {break;} //break; } //cout << "training set size: " << trainSet.size(); exit(0); string inputs = "ION_TYPE PCHARGE FCHARGE PLEN PROPLEN PRHKR N_X C_X"; //inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x string outputs = "LOG_INTEN_NORM"; // proportion of the total intensity trainingSetHeader(trainingSet, trainSet.size(), inputNodes, outputNodes, inputs, outputs); for (unsigned i=0; i < trainSet.size(); i++) {trainingSet << trainSet.at(i);} return; } vector<string> processHit(hit h, bool train) { // set train to 1 if it's going to the training set, else it goes to the testing set vector<Flood::Vector<double>> matrix; // a vector of Flood Style vectors for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak fragmatch fm = h.peaks.at(i); Flood::Vector<double> attributes = getAttributes(h.peaks.at(i),h.seq,h.pcharge); if (attributes.getSize()==0) { cout << "no attributes generated for peak: " << i << " with " << fm.fragType << endl; continue; } // no identified matching fragment //cout << "for hit: " << fm.fragType << " " << fm.fragDesc << " " << " int: " << fm.intensity; //cout << " attributes="; printFloodVector(attributes); matrix.push_back(attributes); } // PEAK PROCESSING GOES IN HERE unsigned sumIntensities=0; for (unsigned i=0; i < h.peaks.size(); i++) {sumIntensities+=h.peaks.at(i).intensity;} //cout << "sumInt: " << sumIntensities << endl; vector<double> processedIntensities; for (unsigned i=0; i < h.peaks.size(); i++) { // for each peak fragmatch fm = h.peaks.at(i); if (fm.expmz == 0) {continue;} unsigned intensity = fm.intensity; //double prop = ((double) intensity / (double) sumIntensities); double processed = (double) intensity/ (double) sumIntensities; if (processed>0.2) {processed=0.2;} processedIntensities.push_back(processed); } //exit(0); //processedIntensities = normalise(processedIntensities,100); // convert the matrix to strings of attributes with the processed answer vector<string> outBlock; // each string corresponds to a line in a training or testing set for (unsigned i=0; i < matrix.size(); i++) { Flood::Vector<double> in = matrix.at(i); string out; for (int j=0; j < in.getSize(); j++) { out+=stringify(in[j]); out+=string(" "); } out+=stringify(processedIntensities.at(i)); out+="\n"; outBlock.push_back(out); } return outBlock; /// just a list of all hits and misses (with fuzzy outputs). } void trainingSetHeader(ofstream& f, int samples, int inputs, int outputs, string inputNames, string outputNames) { f << "% Testing Dataset for PINNACLE\n"; f << "NumberOfSamples:\n" << samples << endl; f << "NumberOfInputVariables:\n" << inputs << endl; f << "NumberOfTargetVariables:\n" << outputs << endl; f << "NameOfInputVariables:\n" << inputNames << endl; f << "NameOfTargetVariables:\n" << outputNames << endl; f << "InputAndTargetData:\n";; return; } string convert2str(vector<double> v) { // adds newline at end string out; for (unsigned i=0; i < v.size(); i++) { char buffer [50]; int n; n=sprintf_s (buffer, 50, "%4.6f", v.at(i)); out += string(buffer); if (i<v.size()-1) {out+=" ";} else {out+="\n";} } //cout << "string out: " << out; return out; } vector<double> fuzzyOut(float intensity, float maxpeak) { // maxpeak may also be the sumOfPeaks if normalising against total peak intensities. vector<double> out (outputNodes,0); double prop = (double) intensity/(double) maxpeak; // the proportional intensity of this peak w.r.t. the maximum in the spect prop*=1000; // multiplier to make all values greater then zero (since we'll be taking a log) // DIRECT OUTPUT - LOG of the proportion of the sum of intensities this fragment has. Gives us a nice normal curve for learning. if (intensity==0) {out.at(0)=0;} else {out.at(0) = log(prop); if (out.at(0) < 0) {out.at(0) =0;}} // don't want negative logs thanks return out; // FUZZY OUTPUT //cout << "for proportion: " << prop << endl; for (unsigned i=0; i < outputNodes; i++) { double node = i*(1/((double) outputNodes-1)); //cout << "node " << i << " with value " << node << endl; // current node double diff = fabs(prop-node); double decay = 3*diff; // decay will be zero when bang on - tells the program how much weight to put on neighbouring nodes. double value = 1-decay; if (value < 0){value=0;}// cout << "diff: " << diff << " decay: " << decay << " value: " << value << endl; out.at(i) = value; } cout << "fuzzyOut: <"; for (unsigned i=0; i < outputNodes; i++) {cout << out.at(i) << ',';} cout << '>' << endl; return out; } fragmatch findMatch(string iontype,unsigned breakSite,hit h) { // always returns the first such hit. Ie ignores whether is a water loss etc. // basically, scan through the list of matches, and if a hit is found, return it. //cout << "searching for ion type: " << iontype << " with breakSite at " << breakSite << endl; if (iontype.compare("nofrag")==0) { // parent peptide for (unsigned i=0; i < h.peaks.size(); i++) { fragmatch f = h.peaks.at(0); if (!f.fragType.substr(0,6).compare("nofrag")) {return f;} } } if (iontype.compare("immo")==0) { // immonium fragmatches have a default fragSite of 0. // here we just need to find a fragment match with the right description. char aa = h.seq.at(breakSite); // this is the specifc immonium ion we're looking for //cout << "looking for immo ion: " << aa << endl; for (unsigned i=0; i < h.peaks.size(); i++) { fragmatch f = h.peaks.at(i); //cout << "looking at peak: " << f.fragType << " obsmz: " << f.obsmz << endl; if (f.obsmz == 0) {continue;} // there is no match here. if (f.fragType.at(0) == 'i' && f.fragDesc.at(f.fragDesc.size()-1) == aa) { //cout << "this peak has been found: " << f.fragType << " intensity: " << f.intensity << endl; return f; } } } if (iontype.substr(0,2).compare("y1") == 0) { for (unsigned i=0; i < h.peaks.size(); i++) { fragmatch f = h.peaks.at(i); //cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl; if (f.fragSite == breakSite) { if (!f.fragType.substr(0,2).compare("y1")) {return f;} } } } if (iontype.substr(0,2).compare("b1") == 0) { for (unsigned i=0; i < h.peaks.size(); i++) { fragmatch f = h.peaks.at(i); //cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl; if (f.fragSite == breakSite) { if (!f.fragType.substr(0,2).compare("b1")) {return f;} } } } // getting to here means we didn't find a matching peak. fragmatch nomatch; nomatch.obsmz = 0; // this how we define a missing peak. return nomatch; } hit loadInHit(ifstream& file) { // given the filestream to the output file, loads in a hit. Should be replaced with binary files. // seq:pcharge:obspmass:exppmass:stdDevs:basicHits:prelimScore:LOD:{mod0:mod1..}:{match1,match2...}:overallLOD\n // where match = [intensity:obsfmass:expfmass:type:description:LOD] string line; hit h; getline( file, line ); if (file.eof()) {h.seq = "END"; return h;} //cout << line; unsigned i=0; unsigned j=0; while (line.at(j) != ':') {j++;} h.seq = line.substr(i,j-i); // identified sequence if (verbosity > 3) {cout << "seq loaded: " << h.seq << endl;} i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.pcharge = convert2int(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.obspmass = convert2double(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.exppmass = convert2double(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.Zscore = convert2int(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.basichits = convert2int(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.prelimScore = convert2double(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.overallLOD = convert2double(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.mods = convert2mods(line.substr(i,j-i)); i=j+1; j=j+1; while (line.at(j) != '}') {j++;} h.peaks = convert2peaks(line.substr(i,j-i)); //cout << "exiting"; exit(0); return h; } vector<double> convert2mods(string token) { vector<double> mods; //cout << "converting to mod vector<double>: " << token << endl; unsigned i=1; unsigned j=2; // first char is '{' while (j < token.size()) { // for each mod mass if (token.at(j) != ',' || j < token.size()-2) {j++;} string tmp = token.substr(i,j-i-1); //cout << "modmass: " << tmp << endl; double modmass = convert2double(tmp); mods.push_back(modmass); i=j; j++; } return mods; } vector<fragmatch> convert2peaks(string token) { vector<fragmatch> matches; // first break the line into individual matches vector<string> lines; unsigned i=0; unsigned j=0; while (j < token.size() && i < token.size()) { while (token.at(i) != '[') {i++;} while (token.at(j) != ']') {j++;} //cout << "pushing back: " << token.substr(i+1,j-i-1) << endl; string strfrag = token.substr(i+1,j-i-1); // cout << "parts: " << strfrag << endl; unsigned k=0; vector<unsigned> commaPos; // positions of the commas in the string fragmatch f; while (k < strfrag.size()) {if (strfrag.at(k) == ',') {commaPos.push_back(k);} k++;} /*cout << token.at(i) << endl; cout << "commaPos: "; for (unsigned p=0; p < commaPos.size(); p++) {cout << commaPos.at(p) << ',';} cout << endl; cout << "intensity: " << strfrag.substr(0,commaPos.at(0)) << endl; cout << "obsMz: " << strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1) << endl; cout << "expMZ: " << strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1) << endl; cout << "fcharge: " << strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1) << endl; cout << "type: " << strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1) << endl; cout << "desc: " << strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1) << endl; cout << "fragLOD: " << strfrag.substr(commaPos.at(5)+1) << endl;*/ f.intensity = convert2int(strfrag.substr(0,commaPos.at(0))); f.obsmz = convert2double(strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1)); f.expmz = convert2double(strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1)); f.fcharge = convert2int(strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1)); if (f.expmz == 0.0) {i=j; j++; continue;} // missing peak isn't of any use at the moment. f.fragType = strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1); f.fragDesc = strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1); f.fragLOD = convert2double(strfrag.substr(commaPos.at(5)+1)); // calculate the fragmentation site. Zero is the whole peptide and we go from N to C term if (f.fragType.at(0) == 'i' || f.fragType.at(0) == 'n') {f.fragSite = 0;} // no frag or immonium else { // x/y/z or a/b/c ion if (f.fragType.at(0) == 'y') { unsigned site=1; while (f.fragDesc.at(site) == '.') {site++;} f.fragSite = site; //continue; } if (f.fragType.at(0) == 'a' || f.fragType.at(0) == 'b') { unsigned site=1; while (f.fragDesc.at(site) != '.') {site++;} f.fragSite = site; //continue; } if (f.fragType.at(0) != 'a' && f.fragType.at(0) != 'b' && f.fragType.at(0) != 'y') { cout << "fragType " << f.fragType << " not yet implemented" << endl; i=j; j++; continue; // does not record frag site } } matches.push_back(f); // add this peak i=j; j++; // continue onto next fragment. } return matches; } double convert2double(string str) { char * cstr; cstr = new char [str.size()+1]; // convert to c_str strcpy_s (cstr, str.size()+1, str.c_str()); double value = atof(cstr); //convert to double return value; } int convert2int(string str) { char * cstr; cstr = new char [str.size()+1]; // convert to c_str strcpy_s (cstr, str.size()+1, str.c_str()); int value = atoi(cstr); //convert to int return value; } double encodeAA(char aa) { // encodes AA letter names into numbers if (aa == 'X') {return 0.00;} // if there is no AA related here, ie N-term side of a b1 ion if (aa == 'V') {return 0.00;} // hydrophobic side groups if (aa == 'L') {return 0.05;} if (aa == 'I') {return 0.10;} if (aa == 'M') {return 0.15;} if (aa == 'F') {return 0.20;} if (aa == 'G') {return 0.25;} // in between if (aa == 'A') {return 0.30;} if (aa == 'S') {return 0.35;} if (aa == 'T') {return 0.40;} if (aa == 'Y') {return 0.45;} if (aa == 'W') {return 0.50;} if (aa == 'C') {return 0.55;} if (aa == 'P') {return 0.60;} if (aa == 'N') {return 0.65;} // hydrophilic side groups if (aa == 'E') {return 0.70;} if (aa == 'Q') {return 0.75;} if (aa == 'D') {return 0.80;} if (aa == 'H') {return 0.85;} if (aa == 'K') {return 0.90;} if (aa == 'R') {return 0.95;} //std::cout << "EncodeAA -> unknown AA letter code: " << aa << std::endl; return 0.0; // separate value for unknown }; double countHKR(string& seq) { // count the number of H, K or R residues in the input sequence double HKR=0; for (unsigned i=0; i<seq.size(); i++) { char aa = seq.at(i); if (aa == 'H') {HKR++; continue;} if (aa == 'K') {HKR++; continue;} if (aa == 'R') {HKR++; continue;} } return HKR; } double getMean(vector<double>& v) { // returns the mean double total=0; for (unsigned i=0; i < v.size(); i++) {total+= v.at(i);} return total/v.size(); } double getStdDev(vector<double>& v, double mean){ // returns the standard deviation of input array, second arg is the precomputed vector mean. double numerator=0; for (unsigned i=0; i < v.size(); i++) {numerator+= pow((v.at(i)-mean),2);} return pow(numerator/v.size(),0.5); } /* EXAMPLE OF AN OUTPUT FILE WHICH MUST BE PROCESSED AND LEARNED: Search Parameters: parentTol=1 tol=0.5 minBasicHits=4 minPrelimScore=0.05 randomSamples=25 randomLODspread=20 reportMin=5 topCandidates=0.3 maxTopCandidates=20 newmetric=0 maxFragCharge=2 verbosity=1 devsAboveMeanForHit=2 collectRandomHits=0 MOD0 = iodacetamide (static) mass: 57 acting on: C MOD1 = oxidation (variable) mass: 16 acting on: M H W MOD2 = deamidation (variable) mass: 1 acting on: N Q MOD0 MOD1 MOD2 Delta 0 0 0 0 0 1 0 16 0 2 0 32 1 0 0 57 1 1 0 73 1 2 0 89 2 0 0 114 2 1 0 130 2 2 0 146 loading peptide database: Forward Db: 1689106 unique peptides loaded from peptides.dat loading peptide database: Reverse Db: 683090 unique peptides loaded from reverse.dat DTAFiles\AGTEPSDAIR.mgf.dta top ten candidates for pmass: 1015.49 DTAFiles\AGTEPSDAIR.mgf.dta AGTEPSDAIR prelim: 0.652732 LOD: 25.9144 ASAEQSGGGPR prelim: 0.498535 LOD: 23.7549 ASAVSELSPR prelim: 0.630028 LOD: 23.7549 TFAHNSALR prelim: 0.559211 LOD: 22.6751 APQFTILAR prelim: 0.651944 LOD: 22.6751 ASAAPPHPLR prelim: 0.507099 LOD: 20.5156 TAEEKVSPR prelim: 0.501239 LOD: 19.4358 EDSVLNIAR prelim: 0.502028 LOD: 19.4358 TTSISPALAR prelim: 0.597014 LOD: 19.4358 TGDPQETLR prelim: 0.523944 LOD: 18.356 random mean: 9.58833 random StdDev: 2.37195 std Devs above mean: 6.88297 IDENTIFIED: AGTEPSDAIR observed pmass: 1015.49 pcharge: 1 identified pmass: 1015.48 mods: {0,0,0,0,0,0,0,0,0,0,} basicHits: 8 prelimScore: 0.652732 Top 4N peaks: AGTEPSDAIR intensity obsMZ expMZ type seq LOD 2265 86.10 86.08 immo I 1.080 2049 70.07 70.05 immo P 1.080 1672 175.13 175.10 y1+ .........R 2.160 1532 359.26 359.15 b1+ AGTE...... 2.160 816 658.43 658.26 b1+ AGTEPSD... 2.160 751 74.07 74.05 immo T 1.080 746 87.09 0.00 0.000 576 112.09 0.00 0.000 460 69.88 70.05 immo P 1.080 453 102.06 102.04 immo E 1.080 390 85.84 86.08 immo I 1.080 390 110.09 0.00 0.000 385 84.07 0.00 0.000 382 371.19 0.00 0.000 378 300.15 0.00 0.000 370 72.09 72.04 b1+ A......... 2.160 355 230.15 230.11 b1+ AGT....... 2.160 348 159.11 0.00 0.000 322 185.12 0.00 0.000 321 101.09 101.06 a1+ AG........ 1.080 318 641.39 641.33 y1+-1NH3 ....PSDAIR 1.080 273 120.09 0.00 0.000 264 71.88 72.04 b1+ A......... 2.160 251 158.11 158.10 y1+-1NH3 .........R 1.080 242 157.13 0.00 0.000 229 109.71 0.00 0.000 210 136.09 0.00 0.000 209 129.11 129.06 b1+ AG........ 2.160 200 83.81 0.00 0.000 200 119.66 0.00 0.000 199 212.13 212.11 b1+-1H2O AGT....... 1.080 194 130.11 0.00 0.000 overall LOD score: 25.9144 DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta loading obs spect: 0 top ten candidates for pmass: 2882.55 DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta ALIPSSQNEVELGELLLSLNYLPSAGR prelim: 0.252294 LOD: 66.5238 QMNLNPMDSPHSPISPLPPTLSPQPR prelim: 0.10953 LOD: 34.8458 GETLGNIPLLAIGPAICLPGIAAIALARK prelim: 0.0974537 LOD: 26.9263 VSSDPAWAVEWIELPRGLSLSSLGSAR prelim: 0.0921176 LOD: 25.3424 DAAIPAPHAPSAPGCLASVKPGGWKGAAGR prelim: 0.0996068 LOD: 25.3424 MEDTDFVGWALDVLSPNLISTSMLGR prelim: 0.0916495 LOD: 23.7585 MALLLLSLGLSLIAAQEFDPHTVMQR prelim: 0.0923048 LOD: 23.7585 QALTPEQVSFCATHMQQYMDPRGR prelim: 0.0931474 LOD: 23.7585 NTSIRVADFGSATFDHEHHTTIVATR prelim: 0.0745179 LOD: 22.1746 DLLEDKILHSHLVDYFPEFDGPQR prelim: 0.107471 LOD: 22.1746 random mean: 11.9109 random StdDev: 4.05973 std Devs above mean: 13.4523 IDENTIFIED: ALIPSSQNEVELGELLLSLNYLPSAGR observed pmass: 2882.55 pcharge: 1 identified pmass: 2882.47 mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,} basicHits: 18 prelimScore: 0.252294 Top 4N peaks: ALIPSSQNEVELGELLLSLNYLPSAGR intensity obsMZ expMZ type seq LOD 3138 95.95 0.00 0.000 307 487.20 487.24 y1+ ......................PSAGR 3.168 279 86.09 86.08 immo L 1.584 253 112.09 0.00 0.000 237 763.35 763.38 y1+ ....................YLPSAGR 3.168 198 175.09 175.10 y1+ ..........................R 3.168 174 600.31 600.32 y1+ .....................LPSAGR 3.168 171 70.05 70.05 immo P 1.584 155 470.23 470.24 y1+-1NH3 ......................PSAGR 1.584 151 1716.85 0.00 0.000 141 1416.73 1416.77 y1+ ..............LLLSLNYLPSAGR 3.168 141 1715.73 1715.91 y1+ ...........LGELLLSLNYLPSAGR 3.168 133 232.10 232.12 y1+ .........................GR 3.168 119 400.13 0.00 0.000 108 990.44 990.50 y1+ ..................LNYLPSAGR 3.168 104 185.09 185.12 b1+ AL......................... 3.168 103 542.20 0.00 0.000 102 877.33 877.42 y1+ ...................NYLPSAGR 3.168 98 303.16 303.16 y1+ ........................AGR 3.168 97 1303.62 1303.69 y1+ ...............LLSLNYLPSAGR 3.168 96 1943.86 1944.02 y1+ .........VELGELLLSLNYLPSAGR 3.168 95 471.21 0.00 0.000 95 871.25 0.00 0.000 92 1077.48 1077.53 y1+ .................SLNYLPSAGR 3.168 80 742.28 0.00 0.000 73 298.18 298.20 b1+ ALI........................ 3.168 71 643.19 0.00 0.000 71 860.35 860.42 y1+-1NH3 ...................NYLPSAGR 1.584 66 644.09 644.37 b1+-1NH3-2H2O ALIPSSQ.................... 1.584 66 743.40 0.00 0.000 63 756.27 0.00 0.000 60 187.09 0.00 0.000 58 391.19 0.00 0.000 58 449.25 0.00 0.000 58 2588.10 0.00 0.000 57 655.32 0.00 0.000 57 1602.76 1602.83 y1+ ............GELLLSLNYLPSAGR 3.168 56 497.16 0.00 0.000 54 228.08 0.00 0.000 54 859.82 859.42 y1+-1H2O ...................NYLPSAGR 1.584 53 288.63 0.00 0.000 53 289.17 0.00 0.000 51 243.16 0.00 0.000 51 320.48 0.00 0.000 51 409.64 0.00 0.000 47 2072.96 2073.06 y1+ ........EVELGELLLSLNYLPSAGR 3.168 46 1843.89 0.00 0.000 40 2187.00 2187.10 y1+ .......NEVELGELLLSLNYLPSAGR 3.168 36 2167.81 0.00 0.000 36 2404.79 0.00 0.000 1655 96.07 0.00 0.000 974 96.18 0.00 0.000 958 95.89 0.00 0.000 overall LOD score: 66.5238 DTAFiles\AQMDSSQLIHCR.mgf.dta top ten candidates for pmass: 1444.43 DTAFiles\AQMDSSQLIHCR.mgf.dta FDHNAIMKILDL prelim: 0.272246 LOD: 23.6381 MAEPGHSHHLSAR prelim: 0.381321 LOD: 21.0117 MPHLRLVLPITR prelim: 0.299034 LOD: 19.6984 VMSKSAADIIALAR prelim: 0.295296 LOD: 19.6984 IEHSTFDGLDRR prelim: 0.361541 LOD: 18.3852 MESDRLIISLPR prelim: 0.217111 LOD: 17.072 IDAMHGVMGPYVR prelim: 0.284083 LOD: 17.072 WQLNPGMYQHR prelim: 0.317101 LOD: 17.072 GTHPAIHDILALR prelim: 0.294206 LOD: 17.072 SPSSLLHDPALHR prelim: 0.294206 LOD: 15.7588 random mean: 10.2432 random StdDev: 3.34294 std Devs above mean: 4.00694 IDENTIFIED: FDHNAIMKILDL observed pmass: 1444.43 pcharge: 1 identified pmass: 1444.66 mods: {0,0,0,0,0,0,16,0,0,0,0,0,} basicHits: 5 prelimScore: 0.272246 Top 4N peaks: FDHNAIMKILDL intensity obsMZ expMZ type seq LOD 2157 1000.35 0.00 0.000 1812 110.06 110.00 immo H 1.313 1360 175.08 0.00 0.000 1032 200.09 0.00 0.000 922 455.11 0.00 0.000 830 86.08 86.08 immo I 1.313 648 101.06 101.09 immo K 1.313 616 472.15 0.00 0.000 552 70.06 0.00 0.000 504 335.09 0.00 0.000 492 89.06 0.00 0.000 383 112.08 0.00 0.000 330 585.21 585.18 b1+ FDHNA....... 2.626 326 133.02 0.00 0.000 298 1289.38 0.00 0.000 284 229.11 0.00 0.000 279 87.04 87.04 immo N 1.313 265 913.36 0.00 0.000 263 826.31 0.00 0.000 254 104.05 104.04 immo M 1.313 240 698.29 698.26 b1+ FDHNAI...... 2.626 240 748.23 748.40 y1+ ......MKILDL 2.626 239 298.05 0.00 0.000 236 334.07 0.00 0.000 230 166.04 0.00 0.000 228 183.07 0.00 0.000 223 247.07 247.11 y1+ ..........DL 2.626 221 120.10 120.07 immo F 1.313 221 1424.88 0.00 0.000 213 1325.48 0.00 0.000 210 329.15 0.00 0.000 204 731.14 731.40 y1+-1NH3 ......MKILDL 1.313 203 1354.55 0.00 0.000 201 316.08 0.00 0.000 201 1008.42 0.00 0.000 199 568.22 568.18 b1+-1NH3 FDHNA....... 1.313 196 639.28 0.00 0.000 188 366.20 0.00 0.000 186 84.04 0.00 0.000 184 203.04 0.00 0.000 181 331.14 0.00 0.000 181 855.49 0.00 0.000 177 217.08 0.00 0.000 177 791.28 0.00 0.000 174 318.11 0.00 0.000 170 216.07 0.00 0.000 167 861.24 861.48 y1+ .....IMKILDL 2.626 165 531.15 0.00 0.000 overall LOD score: 23.6381 DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta top ten candidates for pmass: 1908.96 DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta ATAAPAGAPPQPQDLEFTK prelim: 0.464155 LOD: 54.2672 KPTNLGTSCLLRHLQR prelim: 0.338948 LOD: 27.1336 APQPQRSASPALPFWTR prelim: 0.201213 LOD: 24.6669 STGWYPQPQINWSDSK prelim: 0.355488 LOD: 24.6669 ALQSAGPFGPYIPGGPAPGR prelim: 0.303074 LOD: 23.4336 QVSAESSSSMNSNTPLVR prelim: 0.205814 LOD: 23.4336 GKQSPHHERPEPEPPR prelim: 0.211954 LOD: 22.2002 ALRPKPTLEKDLEEDR prelim: 0.218912 LOD: 20.9669 IVEVLGIPPAHILDQAPK prelim: 0.343929 LOD: 20.9669 GQELHKPETGCQQELR prelim: 0.197297 LOD: 20.9669 random mean: 12.1361 random StdDev: 4.61053 std Devs above mean: 9.138 IDENTIFIED: ATAAPAGAPPQPQDLEFTK observed pmass: 1908.96 pcharge: 1 identified pmass: 1908.95 mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,} basicHits: 15 prelimScore: 0.464155 Top 4N peaks: ATAAPAGAPPQPQDLEFTK intensity obsMZ expMZ type seq LOD 6560 1299.53 1299.63 y1+ ........PPQPQDLEFTK 2.467 5095 63.86 0.00 0.000 5056 70.06 70.05 immo P 1.233 2954 977.39 977.47 y1+ ...........PQDLEFTK 2.467 2364 226.10 226.13 b1+-1H2O ATA................ 1.233 2318 323.13 0.00 0.000 1331 297.12 297.17 b1+-1H2O ATAA............... 1.233 1183 315.12 315.17 b1+ ATAA............... 2.467 1105 341.14 0.00 0.000 987 663.24 0.00 0.000 894 244.08 244.13 b1+ ATA................ 2.467 879 86.08 86.08 immo L 1.233 833 1595.70 1595.78 y1+ ....PAGAPPQPQDLEFTK 2.467 828 101.06 101.06 immo Q 1.233 701 611.25 611.32 b1+ ATAAPAGA........... 2.467 688 141.08 0.00 0.000 668 175.07 0.00 0.000 661 169.08 0.00 0.000 650 1427.57 1427.69 y1+ ......GAPPQPQDLEFTK 2.467 613 269.12 0.00 0.000 613 548.20 0.00 0.000 596 933.36 933.48 b1+ ATAAPAGAPPQ........ 2.467 588 905.35 905.48 a1+ ATAAPAGAPPQ........ 1.233 572 1281.52 1281.63 y1+-1H2O ........PPQPQDLEFTK 1.233 512 368.15 0.00 0.000 495 110.06 0.00 0.000 480 394.16 394.22 b1+-1H2O ATAAP.............. 1.233 480 1202.47 1202.58 y1+ .........PQPQDLEFTK 2.467 454 619.24 619.33 y1+-1H2O ..............LEFTK 1.233 447 270.13 0.00 0.000 439 776.31 0.00 0.000 437 1370.58 1370.67 y1+ .......APPQPQDLEFTK 2.467 433 98.05 0.00 0.000 414 365.16 0.00 0.000 403 72.07 72.04 b1+ A.................. 2.467 396 337.15 0.00 0.000 394 173.07 173.09 b1+ AT................. 2.467 386 243.11 0.00 0.000 370 126.03 0.00 0.000 367 708.29 708.37 b1+ ATAAPAGAP.......... 2.467 360 195.09 0.00 0.000 350 248.13 248.14 y1+ .................TK 2.467 330 112.06 0.00 0.000 330 115.07 0.00 0.000 329 212.10 0.00 0.000 313 540.22 540.28 b1+ ATAAPAG............ 2.467 311 209.06 0.00 0.000 311 295.13 0.00 0.000 310 167.09 0.00 0.000 309 906.38 0.00 0.000 303 155.07 155.09 b1+-1H2O AT................. 1.233 299 382.15 0.00 0.000 293 186.09 0.00 0.000 278 74.05 74.05 immo T 1.233 278 1052.46 0.00 0.000 273 583.21 583.32 a1+ ATAAPAGA........... 1.233 264 1088.41 1088.53 y1+-1NH3 ..........QPQDLEFTK 1.233 262 355.12 0.00 0.000 259 102.06 102.04 immo E 1.233 245 338.15 0.00 0.000 242 453.19 0.00 0.000 239 520.24 0.00 0.000 237 452.16 0.00 0.000 232 198.10 0.00 0.000 overall LOD score: 54.2672 */ Flood::Vector<double> getAttributes(fragmatch& f,string seq, unsigned pcharge) { //if (params.verbosity < 5) {cout << "getting attributes for frag " << f.type << " " << f.desc << endl; } // given the seq, and the immonium fragment, returns the attribute vector to be fed into the neural net vector<double> attributes; // holds the attributes for this fragment // NO FRAGMENTATION - TYPE 0 if (f.fragType.substr(0,6).compare("nofrag")==0) { attributes.push_back(0); // iontype 0 = unfragmented attributes.push_back(pcharge); attributes.push_back(f.fcharge); attrib
Initial URL
Initial Description
Harvest paper: Citation: McHugh L, Arthur JW (2008) Computational methods for protein identification from mass spectrometry data. PLoSComputBiol 4(2): e12. doi:10.1371/journal.pcbi.0040012 PINNACLE automatically learns the statistical and neural network models for the Harvest algorithm. Contact [email protected] for a Visual Studio project solution.
Initial Title
PINNACLE - Automated Machine Learning Module for the Harvest peptide identification package
Initial Tags
Initial Language
C++