PINNACLE - Automated Machine Learning Module for the Harvest peptide identification package

Revision: 35042

at November 1, 2010 13:58 by leomchugh

Initial Code

/* Learner.cpp :
Copyright (c) 2009, Leo McHugh, University of Sydney, Australia
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the name of the <organization> nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY LEO MCHUGH ''AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL LEO MCHUGH BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This code requires input files from PreProcess.cpp

updates of this code can be found at: http://www.usyd.edu.au/sydneybioinformatics/research/software.shtml

Contact: [email protected]

*/

/*
PURPOSE AND USE OF THIS SOFTWARE:

This software is designed for use with PINNACLE. Once PINNACLE is run, it generates a list of hits along with a
confidence score for each, and prints out a list of identified fragments. An example is shown as a commented out 
section at the very end of this file.
Learner.cpp uses these files to gather statistical information about fragmentation and to build and train a 
neural network. 

For version1.0, only limited information will be collected.

We calculate approximate probability functions for:
the parent mass error w.r.t mass
fragment mass erro w.r.t mass
probability of fragment identification in correct assignments w.r.t mass / charge

Neural Nets:

Inputs for predicting peak intensity for a fragment:
iontype		either no frag, immoium, y or b ion (nofrag, immo, y1, a1, b1) = (0,1,3,2,4); // 5 reserved for z ions but not implemented
pcharge		parent charge
fcharge		fragment charge
plen		the length of the parent peptide
prop_len    the proportion of the distance from N to C term that the fragmentation site was at
pr_hkr		proportion of HKR and R residues in the fragment over those in the unbroken peptide
n_x			identity of AA on the n-term side of cleavage
c_x			identity..... c-term

inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x

(taken from working notes07)

==== Add later============
loss		0 if it's a 'native' ion, such as a y or b, 1 if it contains losses. 

Output: 
log of the intensity, scaled so that the max log(intensity) for the fragment is set to 100 // maybe use later 6 Fuzzy Logic NN outputs, 0, 20, 40 etc to 100.


*/

#include "stdafx.h"
#include <iostream>
#include <string>
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>
#include <list>
#include <time.h>		// for random number generation

// Learner based libraries:
#include "../Utilities/Vector.h"
#include "../Utilities/Matrix.h"
#include "../Utilities/InputTargetDataSet.h"
#include "../MultilayerPerceptron/MultilayerPerceptron.h"
#include "../ObjectiveFunctional/MeanSquaredError.h"
#include "../TrainingAlgorithm/QuasiNewtonMethod.h"


using namespace std; 

// ======================= FIXED PARAMETERS =====================
unsigned const MAX_SEQ_SIZE = 50;   // make sure this is the same as in preprocess.cpp and PINNACLE.cpp
unsigned const MIN_SEQ_SIZE = 5;	// make sure this is the same as in preprocess.cpp and PINNACLE.cpp
unsigned outputNodes = 1;			// The number of output nodes in the fuzzy network
unsigned inputNodes = 8;			// described above
double propTesting = 0.2;			// Proportion of peptides reserved for testing.
double   minZscore = 3.7;				// minimum required Zscore for a 'hit' to be considered confident enough to be used for testing/training
unsigned const buckets = 50;		// mass spectrum division into this many buckets of 100 da. for statistical learning part
bool fuzzyOutputs=false;			// if true, splits the output to a fuzzy output, if false, there is a single output node. 
	
// ======================= OBJECTS ==============================
// ============== MUST BE THE SAME AS IN PINNACLE.CPP ===========
struct fragmatch {								// records a matching fragment in from an indentified peptide
	unsigned	intensity;
	double		obsmz;							
	double		expmz;							// expected mz
	int			fcharge;						// fragment charge
	string		fragType;						// ie y1+
	string		fragDesc;						// decription of the fragment. ie: ........NQLLQFR
	unsigned	fragSite;						// from N to c term. zero is no frag, 1 is first bion / last y ion etc, 
	double		fragLOD;						// LOD for this frag
};
struct format									// converts a hit into a neural net input set
{
	double pmasserror;							// error in the pmass calculation
	vector<vector<double>>	fmzerror;			// fragment mass error array <fmass,error>
	vector<vector<double>>	NNlines;			// formatted lines to be appended to the NN training matrix. 
};

struct hit {
	string seq;					// sequence of the matched peptide
	double Zscore;				// number of std devs above the random mean
	int pcharge;				// charge of the observed parent peptide
	int basichits;				// the number of no loss y and b ions
	double prelimScore;			// the proportion of the top 4N peaks covered by no loss y and b ions. 
	double obspmass;			// observed pmass of the input spectrum (uncharged)
	double exppmass;			// theoretical pmass of the matched peptide (uncharged)
	vector<fragmatch> peaks;	// contains the top 4N peaks
	vector<double> mods;		// one entry for each AA from N to C term
	double overallLOD;			// LOD score generated by PINNACLE for this peptide match
};
struct learnedParameters {		// holds the parameters learned by this file for passing to PINNACLE
	// THIS STRUCTURE PASSES ANY LEARNED PARAMETERS TO PINNACLE
	// IF YOU WANT TO INCLUDE MORE STATISTICAL LEARNING, ADD THE LEARNED PARAMETERS
	// TO THIS OBJECT, AND UPDATE PINNACLE SO THAT IT CAN LOAD THE SAME OBJECT
	// THIS OBJECT IS WRITTEN TO BINARY FILE AFTER LEARNING SO THAT PINNACLE CAN READ IT IN
	double pmassErrors[buckets][2];		// mean and std dev for every bucket (100 Da) for parent mass errors
	double fmassErrors[buckets][2];		// mean and std dev for every bucket (100 Da) for fragment mass errors

	// FEEL FREE TO ADD MORE LEARNED PARAMETERS IN HERE (REMEMBERING TO UPDATE THE SAME OBJECT IN PINNACLE

};
// =================================== FILE LOCATIONS ================================================================
// location of the output of the PINNACLE run (this is a serial file writen by PINNACLE storing all the hits as objects)
char *inputFile = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\output.ser"; // .ser (serial) - see PINNACLE for format details

// the below files are all outputs from this program. savedModel and learnedParameters store the neural network and 
// other general learned parameters respectively and these 2 files must be written to the working directory of PINNACLE

// neural network related
char *tempTrainingfile  = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\tempTrainingSet.dat"; // just contains instances
char *trainingfile  = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingSet.dat";
char *testingfile  = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\testingSet.dat";
char *trainingHistory  = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\Learner\\Debug\\trainingHistory.dat";
char *savedModel  = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Debug\\multilayerPerceptron.dat";

// other learned parameters (see object for details)
char *learnedParams  = "C:\\Users\\Leo McHugh\\Desktop\\PhD\\Algs\\PINNACLE\\Release\\learnedParameters.bin"; // binary

// ===================================== FUNCTIONS ===========================
// LOAD IN AND DATA PREPARATION
void createTrainingFile();
hit loadInHit(ifstream&);									// loads in the next hit
vector<string> processHit(hit, bool);				// returns a block of lines which are converted into a NN input file (Flood style)
double getMean(vector<double>&);							// returns the mean
double getStdDev(vector<double>&, double);					// returns the standard deviation of input array, second arg is the precomputed vector mean.
void learnMassErrors();										// learns parent and fragment mass errors for this dataset

// TESTING OF THE MODEL
void loadAndProcessTrainingFile();
void testModel();											// compares against a flat spectrum. 
double dotProduct(vector<double>, vector<double>);			// return inner dot product of input vectors
double testPeptide(vector<double>);							// returns the dot product of a flat and the predicted spect. 
fragmatch findMatch(string, unsigned, hit);					// (iontype, breakSite,matchArray)
vector<double> fuzzyOut(float, float);						// produces a fuzzy output of the intensity
void trainingSetHeader(ofstream&,int,int,int,string,string);// sets up the header for the output training file

// OTHER HANDY FUNCTIONS
vector<double> normalise(vector<double>, double sumTotal);	// takes an input vector and normalises to sum to 100
vector<double> convert2mods(string);						//
vector<double> convert2doubleVect(string);					// converts a space separated line into a vector of doubles (from test set)
vector<fragmatch> convert2peaks(string);					// used during loading in to convert .ser file to a hit object
double convert2double(string);								// converts a string to a double
int convert2int(string);									// converts a string to an int
string convert2str(vector<double>);							// converts doubles into a string for output
double encodeAA(char);
double countHKR(string&);
string stringify(double x);									// convert double to string

// Neural network
Flood::Vector<double> getAttributes(fragmatch& fm,string seq, unsigned pcharge);
Flood::MultilayerPerceptron ANN;
void printFloodVector(Flood::Vector<double>& v);


// ============== DEBUGGING ===========================
int verbosity = 4;

int _tmain(int argc, _TCHAR* argv[])
{
	// NEURAL NETWORK LEARNING
	// reads the identified peptides and creates a NN training file
	// createTrainingFile();  

	// imports the processed training file, sets the neural net parameters for the model and learning
	// loadAndProcessTrainingFile(); 

	// tests neural network model against the testing file generated by createTrainingFile. 
	 testModel();			

	// STATISTICAL MACHINE LEARNING
	// collects basic stats on hits to optimise search parameters. 
	// learnMassErrors();		// collects stats on parent and fragment mass erros. 
	

	return 0;
}
void learnMassErrors() {  // learn the mass errors from the hits
	cout << "Learning from parent and fragment mass errors from: " << inputFile << endl; 
	ifstream inputHits;
	inputHits.open(inputFile);				// load in the the input file in binary format
	
	learnedParameters lp;					// object for storing learned parameters

	// we divide the mass range up into buckets of 100 Da, storing a mean and variance for each bucket
	// PINNACLE will then use this information in a prob density function
	vector<vector<double>> pmassbuckets;		// stores the parent mass errors for each bin 
	vector<vector<double>> fmassbuckets;		// stores the fragment mass errors for each bin 
	for (unsigned i=0; i < buckets; i++) {
		vector<double> v; 
		pmassbuckets.push_back(v);	// create elements (blank memory spacers for the time being)
		fmassbuckets.push_back(v);	// create elements
	}
	
	unsigned hitsLoaded=0;
	while (!inputHits.eof()) {
		hit h = loadInHit(inputHits); 
		if (h.seq.compare("END")==0) {cout << "hits loaded: " << hitsLoaded << endl; break;}	// no more available (loadInHit returns 'END' at eof)	 
		if (h.Zscore < minZscore) {continue;}	// not confident enough hit to use for learning
		hitsLoaded++;
		
		// RECORD PARENT MASS ERROR
		unsigned index = (unsigned) floor(h.obspmass/100);								// assign bucket (100 Da per bucket)
		if (index > buckets-1) {index = buckets-1;} // prevent overflow
		//double modmass =0;																// account for mods
		//for (unsigned i=0; i < h.mods.size(); i++) {modmass+=h.mods.at(i);}
		double difference = h.obspmass-(h.exppmass); 
		if (fabs(difference) > 1) {
			cout << "obs mass: " << h.obspmass << " expmass: " << h.exppmass << endl; // " modmass: " << modmass << endl; 
			cout << "unexpectedly high parent mass error - skip"; continue; // sometimes a peptide is IDed twice, each with different mods, keep searching..
		}
		pmassbuckets.at(index).push_back(difference);	// record parent mass error
		
		// RECORD FRAGMENT MASS ERRORS
		for (unsigned i=0; i<h.peaks.size(); i++) {
			if (h.peaks.at(i).obsmz == 0) {continue;}									// no matching peak
			unsigned index = (unsigned) floor(h.peaks.at(i).obsmz/100);					// assign bucket (100 Da per bucket)
			if (index > buckets-1) {index = buckets-1;} // prevent overflow
			// using absolute error values for frag mass due to found bimodal 
			// distributions for some datasets. non absolute values would be 
			// better (but can't then use mean and std dev in probability functions
			//cout << " fmass: " << h.peaks.at(i).obsmz << " error: " << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl; 
			fmassbuckets.at(index).push_back(fabs(h.peaks.at(i).obsmz-h.peaks.at(i).expmz));	// record parent mass error
			//if (fragError.at(1) < 0 && h.peaks.at(i).obsmz > 700) {cout << "UNDER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
			//if (fragError.at(1) > 0 && h.peaks.at(i).obsmz > 700) {cout << " OVER: " << h.peaks.at(i).fragType << " " << h.peaks.at(i).fragDesc << ' ' << h.peaks.at(i).obsmz-h.peaks.at(i).expmz << endl;}
		}
	} // end loading in all hits from file
	inputHits.close();

	// SHOW SOME OF THE DISTROS IN THE BUCKETS
	unsigned testBucket = 8;
	cout << "bucket" << testBucket << "={";
	for (unsigned i=0; i < fmassbuckets.at(testBucket).size(); i++) {cout << fmassbuckets.at(testBucket).at(i) << ',';}
	cout << "};\n";

	exit(0);

	// GET PARAMETERS FOR MEAN AND VARIANCE FOR EACH BUCKET 
	unsigned minDatapoints = 10;	// the minimum number of values in a bucket for the calculation of mean and std dev

	// ================  PARENT MASS ERRORS ============================================================
	vector<vector<double>> pe;										// pe = pmass errors for each bucket, stores <mean, stdDev>
	for (unsigned bucket=0; bucket < buckets; bucket++) {
		vector<double> v;
		v.clear(); 
		if (pmassbuckets.at(bucket).size() < minDatapoints) {pe.push_back(v); continue;}		// too few values to calculate
		double mean = getMean(pmassbuckets.at(bucket));	
		double stdDev = getStdDev(pmassbuckets.at(bucket), mean);
		//cout << "massbucket: " << bucket << ": {";
		//for (unsigned j=0; j<pmassbuckets.at(bucket).size(); j++) {cout << pmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl; 
		//cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl; 
		v.push_back(mean); v.push_back(stdDev);
		pe.push_back(v);
	}

	// if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours. 
	// first back-propogate the values to leading empty buckets
	for (unsigned i=0; i < pe.size(); i++) {	// pe = parent errors
		if (pe.at(i).empty()) {continue;}
		vector<double> v = pe.at(i); // first non-empty bucket
		while (1) {pe.at(i) = v; if (i==0) {break;} i--;}
		break;
	}
	
	// now propagate forward any values to any empty buckets
	vector<double> v = pe.at(0);				// first bucket's value (mean,stdDev)
	for (unsigned i=0; i < pe.size(); i++) {	// pe = parent errors
		if (pe.at(i).empty()) {pe.at(i)=v; continue;}		// assign same values as bucket to the left
		v = pe.at(i);			// update
	}
	
	cout << "pmass errors: \n"; for (unsigned i=0; i < pe.size(); i++) {
		if (pe.at(i).empty()) {cout << "i: " << i << "{}\n";}
		else {cout << "bucket: " << i << " {" << pe.at(i).at(0) << ',' << pe.at(i).at(1) << "}\n";}
	} cout << endl;

	// load into object 
	for (unsigned i=0; i < buckets; i++) {		// pe = parent errors
		lp.pmassErrors[i][0] = pe.at(i).at(0);		// mean
		lp.pmassErrors[i][1] = pe.at(i).at(1);		// standard Deviation
	}


	// ================  FRAGMENT MASS ERRORS ============================================================
	vector<vector<double>> fe;										// fe = fragment errors for each bucket, stores <mean, stdDev>
	for (unsigned bucket=0; bucket < buckets; bucket++) {
		vector<double> v;
		v.clear(); 
		if (fmassbuckets.at(bucket).size() < minDatapoints) {fe.push_back(v); continue;}		// too few values to calculate
		double mean = getMean(fmassbuckets.at(bucket));	
		double stdDev = getStdDev(fmassbuckets.at(bucket), mean);
		//cout << "fmassbucket: " << bucket << ": {";
		//for (unsigned j=0; j<fmassbuckets.at(bucket).size(); j++) {cout << fmassbuckets.at(bucket).at(j) << ',';} cout << "}" << endl; 
		//cout << "bucket: " << bucket << " mean: " << mean << " stdDev: " << stdDev << endl; 
		v.push_back(mean); v.push_back(stdDev);
		fe.push_back(v);
	}

	// if there are not enough values in some buckets, they will be empty. In this case we just give them the values of their neighbours. 
	// first back-propogate the values to leading empty buckets
	for (unsigned i=0; i < fe.size(); i++) {	// pe = parent errors
		if (fe.at(i).empty()) {continue;}
		vector<double> v = fe.at(i); // first non-empty bucket
		//cout << "first non-empty fragmass bucket: {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}" << endl;  
		while (1) {fe.at(i) = v; if (i==0) {break;} i--;}
		break;
	}
	
	// now propagate forward any values to any empty buckets
	v = fe.at(0);				// first bucket's value (mean,stdDev)
	for (unsigned i=0; i < fe.size(); i++) {	// pe = parent errors
		if (fe.at(i).empty()) {fe.at(i)=v; continue;}		// assign same values as bucket to the left
		v = fe.at(i);										// update
	}
	
	cout << "fragment errors: \n"; for (unsigned i=0; i < fe.size(); i++) {
		if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
		else {cout << "bucket: " << i << " {" << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "}\n";}
	} cout << endl;

	// output for Mathematica graphing  - buckets of 100 da  {{fragmass,AvgError,StdDev},{...}}
	cout << "ferrors={"; for (unsigned i=0; i < fe.size(); i++) {
		if (fe.at(i).empty()) {cout << "i: " << i << "{}\n";}
		else {cout << '{' << i*100 << ',' << fe.at(i).at(0) << ',' << fe.at(i).at(1) << "},";}
	} cout << '}' << endl;


	// load into object 
	for (unsigned i=0; i < buckets; i++) {		// pe = parent errors
		lp.fmassErrors[i][0] = fe.at(i).at(0);		// mean
		lp.fmassErrors[i][1] = fe.at(i).at(1);		// standard Deviation
	}



	ofstream out;								// write object to file
	out.open(learnedParams, ios::binary);
	out.write((char *)(&lp),sizeof(lp));		// write object out to binary file (used by PINNACLE)
	out.close();
	
	cout << "quiting after learned parameters writen to file"; 
	exit(0);
}
void testModel() {
	// first load in the model
	Flood::MultilayerPerceptron ANN; ANN.load(savedModel); 
	cout << "loading ANN from " << savedModel << endl; 

	ifstream testingSet;
	testingSet.open(testingfile);
	vector<double> improvements; 

	// load in a single instance from the test set
	string line; 

	while (!testingSet.eof()) {
		vector<vector<double>> instance; // instance of the matching hits for a given peptide
		instance.clear(); 
		line = "NULL"; 
		while (!line.empty() && !testingSet.eof()) {
			getline(testingSet, line );
			if (!line.empty()) {
				//cout << " pushing back line: " << line << " of size: " << convert2doubleVect(line).size() << endl; 
				instance.push_back(convert2doubleVect(line));
			}
		}
		//cout << "instance loaded with " << instance.size() << endl; 

		vector<double> predicted;
		vector<double> flat(instance.size(),1);
		vector<double> observed;

		for (unsigned i=0; i < instance.size(); i++) {
			vector<double> input = instance.at(i);
			double obsPeakInt = input.back();	// the correct output
			observed.push_back(obsPeakInt);		
			input.pop_back();				// remove the last element (answer)
			// need to convert a normal vector to a flood style vector
			Flood::Vector<double> attributes(input.size());
			for (unsigned j=0; j<input.size(); j++) {attributes[j] = input.at(j);}


			Flood::Vector<double> fuzzyOutput = ANN.getOutput(attributes); 

			if (fuzzyOutputs) {
				// take the highest node as the prediction of intensity (rounded to nearest 20%)
				unsigned maxnode=0; double maxval=0;
				for (int k=0; k < fuzzyOutput.getSize(); k++) { 				
					if (fuzzyOutput[k] > maxval) {maxnode=k; maxval=fuzzyOutput[k];}				
				}
				predicted.push_back(maxnode); // just take the node with the highest output
			}
			else {
				//cout << "attributes: "; for (int k=0; k < attributes.getSize(); k++) {cout << attributes[k] <<  ' ';} 
				//cout << "predicted output: " << fuzzyOutput[0] << endl; 
				predicted.push_back(fuzzyOutput[0]);  // just the raw single number coming out.
				//double exponent = pow(2.17,fuzzyOutput[0]); 
				//predicted.push_back(exponent);	// blow it back up
			}


			// print out the fuzzy nodes, max node and observed intensity
			/*cout << "fuzzyOut " << i << ": "; 
			for (int k=0; k < fuzzyOutput.getSize(); k++) {
				char buffer [50];
				int n, a=5, b=3;
				n=sprintf_s(buffer, 50, "%0.3f",fuzzyOutput[k]);
				cout << buffer << " ";
			}cout << "| " << maxnode*0.2 << " | " << obsPeakInt << endl; 	
			*/
		} // end prediction of each peak height

		// end of single test peptide
		// Normalise vectors
		predicted	= normalise(predicted,100); // normalises spectrum to a sum total of 100
		flat		= normalise(flat,100);
		observed	= normalise(observed,100);

		// show normalized vectors
		cout << "flat     : {"; for (unsigned i=0; i < flat.size()-1; i++) {cout << flat.at(i) << ',';} cout << flat.back() << "}\n";
		cout << "predicted: {"; for (unsigned i=0; i < predicted.size()-1; i++) {cout << predicted.at(i) << ',';} cout << predicted.back() << "}\n";
		cout << "observed : {"; for (unsigned i=0; i < observed.size()-1; i++) {cout << observed.at(i) << ',';} cout << observed.back() << "}\n";
		
		double flatDP	= dotProduct(observed,flat);
		double annDP	= dotProduct(observed,predicted);
		improvements.push_back(annDP/flatDP);

		cout << "imp: " << annDP/flatDP << " flatDP: " << flatDP << " annDP: " << annDP << endl << endl; 

		//exit(0);
	
	}

	
	cout << "improvements={"; for (unsigned i=0; i < improvements.size(); i++) {cout << improvements.at(i) << ',';} cout << "};";

	// use a dot product model. 
	// for the FOUND peaks, create a spectrum. then normalise



	return;
}
double dotProduct(vector<double> v1, vector<double> v2) {
	double dp=0;
	if (v1.size() != v2.size()) {cout << "vectors not equal length!"; exit(1);}
	for (unsigned i=0; i < v1.size(); i++) {
		dp += v1.at(i)*v2.at(i);
	}
	return dp; 
}

vector<double> normalise(vector<double> v, double sumTotal) {
	double sum = 0;
	for (unsigned i=0; i < v.size(); i++) {sum+=v.at(i);}
	// make sum to 100
	double scalar = sumTotal/sum;
	for (unsigned i=0; i < v.size(); i++) {v.at(i)*=scalar;}
	return v; 

}
double testPeptide(vector<double> fragmatches) { // returns a dot product for the predicted and a flat spect.

	return 0;
}
vector<double> convert2doubleVect (string line) { // converts a space separate line into a double vector
	unsigned i=0; unsigned j=0;
	vector<double> out; 
	while (j < line.size()) { 
		if (j == line.size()-1) {  // last one
			string tmp = line.substr(i,j-i+1); //cout << "element: " << tmp << endl; 
			double val = convert2double(tmp);
			out.push_back(val);
		}
		if (line.at(j) != ' ' && j < line.size()) {j++; continue; }   
		string tmp = line.substr(i,j-i); //cout << "element: " << tmp << endl; 
		double val = convert2double(tmp);
		out.push_back(val);
		i=j; j++;
	}
	//exit(0); 
	return out; 

}
void loadAndProcessTrainingFile() {

	// LOAD IN TRAINING SET INTO NN FOR LEARNING>
	Flood::InputTargetDataSet inputTargetDataSet ;
	inputTargetDataSet.load(trainingfile);

	// set up a new neural network
	Flood::MultilayerPerceptron multilayerPerceptron(inputNodes,inputNodes,outputNodes);  // input, middle layer, output - same number in middle layer as inputs
	multilayerPerceptron.initFreeParametersNormal(0.0,1.0);		// initialise the network within the chosen range
	
	// set up the names of the variables (both in and out)
	Flood::Vector<string> nameOfInputVariables = inputTargetDataSet.getNameOfInputVariables();
	Flood::Vector<string> nameOfTargetVariables = inputTargetDataSet.getNameOfTargetVariables();
	multilayerPerceptron.setNameOfInputVariables(nameOfInputVariables);
	multilayerPerceptron.setNameOfOutputVariables(nameOfTargetVariables);

	
	
	
	// get the parameters for the input data - bound in necessary
	//Flood::Matrix<double> meanAndStandardDeviationOfInputData	= inputTargetDataSet.getMeanAndStandardDeviationOfInputData();
	Flood::Matrix<double> minimumAndMaximumOfInputData	= inputTargetDataSet.getMinimumAndMaximumOfInputData();

	Flood::Vector<double> minOfInputVariables(inputNodes);	 // these values are not inclusive, so min must be lower than any expected input
	minOfInputVariables[0] = -0.01;				// iontype
	minOfInputVariables[1] =  0.99;				// pcharge
	minOfInputVariables[2] =  0.99;				// fcharge
	minOfInputVariables[3] = MIN_SEQ_SIZE;		// plen
	minOfInputVariables[4] = -0.01;				// prop_len
	minOfInputVariables[5] = -0.01;				// pr_hkr
	minOfInputVariables[6] = -0.01;				// n_x
	minOfInputVariables[7] = -0.01;				// c_x
	multilayerPerceptron.setMinimumOfInputVariables(minOfInputVariables);

	Flood::Vector<double> maxOfInputVariables(inputNodes);	
	maxOfInputVariables[0] = 4.01;  // iontype
	maxOfInputVariables[1] = 3.01;  // pcharge
	maxOfInputVariables[2] = 2.01;  // fcharge
	maxOfInputVariables[3] = MAX_SEQ_SIZE;  // plen
	maxOfInputVariables[4] = 1.01;  // prop_len
	maxOfInputVariables[5] = 1.01;  // pr_hkr
	maxOfInputVariables[6] = 1.01;  // n_x
	maxOfInputVariables[7] = 1.01;  // c_x
	multilayerPerceptron.setMaximumOfInputVariables(maxOfInputVariables);

	
	// set up the relevant max and min etc for the output neuron(s)
	Flood::Matrix<double> minimumAndMaximumOfTargetData	= inputTargetDataSet.getMinimumAndMaximumOfTargetData();
	cout << "min encountered output: " << minimumAndMaximumOfTargetData[0][0] << endl; 
	cout << "max encountered output: " << minimumAndMaximumOfTargetData[1][0] << endl; 
	double minBuffer = minimumAndMaximumOfTargetData[0][0] - minimumAndMaximumOfTargetData[0][0]/10;  // min plus some buffer for unexpectedly low results
	double maxBuffer = minimumAndMaximumOfTargetData[1][0] + minimumAndMaximumOfTargetData[1][0]/10;
	cout << "minBuffer: " << minBuffer << " maxBuffer: " << maxBuffer <<endl; 

	Flood::Vector<double> minOfOutputVariables(outputNodes);
	minOfOutputVariables[0] =  minBuffer;	// 10% buffer under the min value encountered in training set
	Flood::Vector<double> maxOfOutputVariables(outputNodes);
	maxOfOutputVariables[0] =  maxBuffer;	// 10% buffer over the max value encountered in training set
	multilayerPerceptron.setMinimumOfOutputVariables(minOfOutputVariables);
	multilayerPerceptron.setMaximumOfOutputVariables(maxOfOutputVariables);

	multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MinimumAndMaximum);
	multilayerPerceptron.print();
	
	Flood::MeanSquaredError meanSquaredError(&multilayerPerceptron,&inputTargetDataSet);
	//multilayerPerceptron.setPreAndPostProcessingMethod(Flood::MultilayerPerceptron::MeanAndStandardDeviation);
	Flood::QuasiNewtonMethod quasiNewtonMethod(&meanSquaredError);
	quasiNewtonMethod.setEvaluationGoal(0.0);
	quasiNewtonMethod.setGradientNormGoal(0.0001);
	quasiNewtonMethod.setMaximumTime(320.0); // seconds
	quasiNewtonMethod.setMaximumNumberOfEpochs(20);
	quasiNewtonMethod.train();


	quasiNewtonMethod.saveTrainingHistory(trainingHistory);
	multilayerPerceptron.save(savedModel);
	cout << "neural network saved to: " << savedModel << endl; 


	return;
}
void createTrainingFile() { 
	// READ IN IDENTIFIED PEPTIDES
	cout << "Learning from: " << inputFile << endl; 
	ifstream inputHits;
	inputHits.open(inputFile); // // load in the the input file in binary format
	
	ofstream tempTrainingSet;
	tempTrainingSet.open(tempTrainingfile);

	ofstream trainingSet;		// writes to immediately if we're not using fuzzy outputs (as it doesn't need to be balanced).
	trainingSet.open(trainingfile);

	ofstream testingSet;
	testingSet.open(testingfile);

	ofstream testTrainLog;
	testTrainLog.open("testTrain.log");	

	vector<vector<double>> pmasses;			// records <mass,error> for all parent masses
	vector<vector<double>> fragmasses;		// records <mass,error> for all fragment masses

	vector<string> trainSet;	// training set - saved in MultiLayerPerceptron input file
	vector<string> testSet;		// testing set - saved as a text list of inputs with correct output as last figure. 
	
	string line ;
	int verbosity = 0;
	double loaded = 0;	// total
	double testing	  = 0; // those used in testing
	while (!inputHits.eof()) { 
		hit h = loadInHit(inputHits); loaded++;
		if (h.seq.compare("END")==0) {break;}	// no more available (loadInHit returns 'END' at eof)	
		if (h.Zscore < minZscore) {continue;}	// not confident enough hit to use for learning
		bool train; // if set 1, used for training, else used for testing
		if ((testing/loaded) <= propTesting) {train=0; testing++;} else {train=1;}  // if not train, then test
		cout << "processing " << h.seq << endl; 

		vector<string> instances = processHit(h,train);					// creates attributes for training with this peptide		
		
		//for (unsigned i=0; i < instances.size(); i++) {
		//	cout << instances.at(i);		
		//}

		if (train==0) { // peptide allocated for testing set - dump to file
			for (unsigned i=0; i < instances.size(); i++) {testingSet << instances.at(i);}
			testingSet << endl; // peptides separated by newlines. 
			testing++;
		}
		if (train==1) { // peptide allocated for testing set - save later writing
			for (unsigned i=0; i < instances.size(); i++) {trainSet.push_back(instances.at(i));}
		}
		//if (processed > 50) {break;}
		//break; 
	}

	//cout << "training set size: " << trainSet.size(); exit(0); 
		
	string inputs  = "ION_TYPE PCHARGE FCHARGE PLEN PROPLEN PRHKR N_X C_X";  //inputs: iontype, pcharge, fcharge, plen, prop_len, pr_hkr, n_x, c_x
	string outputs = "LOG_INTEN_NORM";		// proportion of the total intensity

	trainingSetHeader(trainingSet, trainSet.size(), inputNodes, outputNodes, inputs, outputs);		
	for (unsigned i=0; i < trainSet.size(); i++) {trainingSet << trainSet.at(i);}
	return;
}
vector<string> processHit(hit h, bool train) {	// set train to 1 if it's going to the training set, else it goes to the testing set 
	vector<Flood::Vector<double>> matrix;			// a vector of Flood Style vectors
	for (unsigned i=0; i < h.peaks.size(); i++) {  // for each peak
		fragmatch fm = h.peaks.at(i);  
		Flood::Vector<double> attributes = getAttributes(h.peaks.at(i),h.seq,h.pcharge); 
		if (attributes.getSize()==0) {
			cout << "no attributes generated for peak: " << i << " with " << fm.fragType << endl; 
			continue;
		} // no identified matching fragment
		//cout << "for hit: " << fm.fragType << " " << fm.fragDesc << " " << " int: " << fm.intensity; 
		//cout << " attributes="; printFloodVector(attributes); 	
		matrix.push_back(attributes);
	}

	// PEAK PROCESSING GOES IN HERE
	unsigned sumIntensities=0;
	for (unsigned i=0; i < h.peaks.size(); i++) {sumIntensities+=h.peaks.at(i).intensity;}

	//cout << "sumInt: " << sumIntensities << endl; 
	
	vector<double> processedIntensities;
	for (unsigned i=0; i < h.peaks.size(); i++) {  // for each peak
		fragmatch fm = h.peaks.at(i); 
		if (fm.expmz == 0) {continue;}
		unsigned intensity = fm.intensity;
		//double prop = ((double) intensity / (double) sumIntensities);
		double processed = (double) intensity/ (double) sumIntensities;
		if (processed>0.2) {processed=0.2;}
		processedIntensities.push_back(processed);
	}
	//exit(0); 
	
	//processedIntensities = normalise(processedIntensities,100);

	// convert the matrix to strings of attributes with the processed answer
	vector<string> outBlock;  // each string corresponds to a line in a training or testing set
	for (unsigned i=0; i < matrix.size(); i++) { 
		Flood::Vector<double> in = matrix.at(i);
		string out;
		for (int j=0; j < in.getSize(); j++) {
			out+=stringify(in[j]);
			out+=string(" ");
		}
		out+=stringify(processedIntensities.at(i));
		out+="\n";
		outBlock.push_back(out); 
	}
	return outBlock;  /// just a list of all hits and misses (with fuzzy outputs). 
}
void trainingSetHeader(ofstream& f, int samples, int inputs, int outputs, string inputNames, string outputNames) {
	f << "% Testing Dataset for PINNACLE\n";
	f << "NumberOfSamples:\n" << samples << endl;
	f << "NumberOfInputVariables:\n" << inputs << endl;
	f << "NumberOfTargetVariables:\n" << outputs << endl;
	f << "NameOfInputVariables:\n" << inputNames << endl;
	f << "NameOfTargetVariables:\n" << outputNames << endl;
	f << "InputAndTargetData:\n";;
	return;
}
string convert2str(vector<double> v) { // adds newline at end
	string out;
	for (unsigned i=0; i < v.size(); i++) {
		char buffer [50]; int n;
		n=sprintf_s (buffer, 50, "%4.6f", v.at(i));
		out += string(buffer);
		if (i<v.size()-1) {out+=" ";} else {out+="\n";}

	}
	//cout << "string out: " << out;

	return out; 

}
vector<double> fuzzyOut(float intensity, float maxpeak) {  // maxpeak may also be the sumOfPeaks if normalising against total peak intensities.
	vector<double> out (outputNodes,0); 
	double prop = (double) intensity/(double) maxpeak; // the proportional intensity of this peak w.r.t. the maximum in the spect
	prop*=1000;	// multiplier to make all values greater then zero (since we'll be taking a log)

	// DIRECT OUTPUT  - LOG of the proportion of the sum of intensities this fragment has. Gives us a nice normal curve for learning. 
	if (intensity==0) {out.at(0)=0;} 
	else {out.at(0) = log(prop); if (out.at(0) < 0) {out.at(0) =0;}}  // don't want negative logs thanks
	return out;
	
	// FUZZY OUTPUT
	//cout << "for proportion: " << prop << endl;
	for (unsigned i=0; i < outputNodes; i++) {
		double node = i*(1/((double) outputNodes-1));  //cout << "node " << i << " with value " << node << endl; // current node
		double diff = fabs(prop-node);
		double decay = 3*diff;  // decay will be zero when bang on - tells the program how much weight to put on neighbouring nodes. 
		double value = 1-decay; if (value < 0){value=0;}// cout << "diff: " << diff << " decay: " << decay << " value: " << value << endl; 
		out.at(i) = value;
	}
	cout << "fuzzyOut: <"; for (unsigned i=0; i < outputNodes; i++) {cout << out.at(i) << ',';} cout << '>' << endl;
	
	return out;
}
fragmatch findMatch(string iontype,unsigned breakSite,hit h) {
	// always returns the first such hit. Ie ignores whether is a water loss etc.

	// basically, scan through the list of matches, and if a hit is found, return it. 
	//cout << "searching for ion type: " << iontype << " with breakSite at " << breakSite <<  endl;
	
	if (iontype.compare("nofrag")==0) { // parent peptide
		for (unsigned i=0; i < h.peaks.size(); i++) {
			fragmatch f = h.peaks.at(0);
			if (!f.fragType.substr(0,6).compare("nofrag")) {return f;}
		}
	}
	if (iontype.compare("immo")==0) {  // immonium fragmatches have a default fragSite of 0.
		// here we just need to find a fragment match with the right description.
		char aa = h.seq.at(breakSite); // this is the specifc immonium ion we're looking for
		//cout << "looking for immo ion: " << aa << endl;
		for (unsigned i=0; i < h.peaks.size(); i++) {		
			fragmatch f = h.peaks.at(i);
			//cout << "looking at peak: " << f.fragType << " obsmz: " << f.obsmz << endl; 
			if (f.obsmz == 0) {continue;} // there is no match here. 
			if (f.fragType.at(0) == 'i' && f.fragDesc.at(f.fragDesc.size()-1) == aa) {
				//cout << "this peak has been found: " << f.fragType << " intensity: " << f.intensity << endl;  	
				return f;
			}
		}
	}  
	if (iontype.substr(0,2).compare("y1") == 0) {
		for (unsigned i=0; i < h.peaks.size(); i++) {
			fragmatch f = h.peaks.at(i);
			//cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl; 
			if (f.fragSite == breakSite) {
				if (!f.fragType.substr(0,2).compare("y1")) {return f;}
			}
		}
	}
	if (iontype.substr(0,2).compare("b1") == 0) {
		for (unsigned i=0; i < h.peaks.size(); i++) {
			fragmatch f = h.peaks.at(i);
			//cout << "\tlooking at frag " << f.fragDesc << " with breakSite at: " << f.fragSite << endl; 
			if (f.fragSite == breakSite) {
				if (!f.fragType.substr(0,2).compare("b1")) {return f;}
			}
		}
	}

	// getting to here means we didn't find a matching peak. 
	fragmatch nomatch;
	nomatch.obsmz = 0; // this how we define a missing peak.
	return nomatch;

}

hit loadInHit(ifstream& file) { 
	// given the filestream to the output file, loads in a hit. Should be replaced with binary files. 
	// seq:pcharge:obspmass:exppmass:stdDevs:basicHits:prelimScore:LOD:{mod0:mod1..}:{match1,match2...}:overallLOD\n
	// where match = [intensity:obsfmass:expfmass:type:description:LOD]	
	string line; 
	hit h;
	
    getline( file, line ); 	
	if (file.eof()) {h.seq = "END"; return h;}

	//cout << line; 	
	unsigned i=0; unsigned j=0; while (line.at(j) != ':') {j++;}
	h.seq = line.substr(i,j-i);					// identified sequence
	if (verbosity > 3) {cout << "seq loaded: " << h.seq << endl;}
	
	
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.pcharge		= convert2int(line.substr(i,j-i));
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.obspmass		= convert2double(line.substr(i,j-i));
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.exppmass		= convert2double(line.substr(i,j-i));
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.Zscore			= convert2int(line.substr(i,j-i));
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.basichits		= convert2int(line.substr(i,j-i));
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.prelimScore	= convert2double(line.substr(i,j-i));
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.overallLOD		= convert2double(line.substr(i,j-i));
	
	i=j+1; j=j+1; while (line.at(j) != ':') {j++;} h.mods			= convert2mods(line.substr(i,j-i)); 
	i=j+1; j=j+1; while (line.at(j) != '}') {j++;} h.peaks			= convert2peaks(line.substr(i,j-i));

	//cout << "exiting"; exit(0); 
	return h;
}

vector<double> convert2mods(string token) {
	vector<double> mods;
	//cout << "converting to mod vector<double>: " << token << endl;
	unsigned i=1; unsigned j=2;  // first char is '{' 
	while (j < token.size()) { // for each mod mass
		if (token.at(j) != ',' || j < token.size()-2) {j++;} 
		string tmp = token.substr(i,j-i-1); //cout << "modmass: " << tmp << endl; 
		double modmass = convert2double(tmp);
		mods.push_back(modmass);
		i=j; j++;
		
	}
	return mods;
}

vector<fragmatch> convert2peaks(string token) { 
	vector<fragmatch> matches;
	// first break the line into individual matches
	vector<string> lines;
	unsigned i=0; unsigned j=0;
	while (j < token.size() && i < token.size()) {
		while (token.at(i) != '[') {i++;}
		while (token.at(j) != ']') {j++;}
		//cout << "pushing back: " << token.substr(i+1,j-i-1) << endl; 
		string strfrag = token.substr(i+1,j-i-1); // cout << "parts: " << strfrag << endl; 
		unsigned k=0; 
		vector<unsigned> commaPos; // positions of the commas in the string		
		fragmatch f;
		while (k < strfrag.size()) {if (strfrag.at(k) == ',') {commaPos.push_back(k);} k++;}
		
		/*cout << token.at(i) << endl; 
		cout << "commaPos: ";  for (unsigned p=0; p < commaPos.size(); p++) {cout << commaPos.at(p) << ',';} cout << endl; 
		cout << "intensity: " << strfrag.substr(0,commaPos.at(0)) << endl; 
		cout << "obsMz: " << strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1) << endl;
		cout << "expMZ: " << strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1) << endl;
		cout << "fcharge: " << strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1) << endl;
		cout << "type: " << strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1) << endl;		
		cout << "desc: " << strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1) << endl;
		cout << "fragLOD: " << strfrag.substr(commaPos.at(5)+1) << endl;*/
		
		f.intensity = convert2int(strfrag.substr(0,commaPos.at(0)));
		f.obsmz = convert2double(strfrag.substr(commaPos.at(0)+1,commaPos.at(1)-commaPos.at(0)-1));
		f.expmz = convert2double(strfrag.substr(commaPos.at(1)+1,commaPos.at(2)-commaPos.at(1)-1));
		f.fcharge = convert2int(strfrag.substr(commaPos.at(2)+1,commaPos.at(3)-commaPos.at(2)-1));

		if (f.expmz == 0.0) {i=j; j++; continue;} // missing peak isn't of any use at the moment. 

		f.fragType = strfrag.substr(commaPos.at(3)+1,commaPos.at(4)-commaPos.at(3)-1);
		f.fragDesc = strfrag.substr(commaPos.at(4)+1,commaPos.at(5)-commaPos.at(4)-1);
		f.fragLOD = convert2double(strfrag.substr(commaPos.at(5)+1));

		// calculate the fragmentation site. Zero is the whole peptide and we go from N to C term
		if (f.fragType.at(0) == 'i' || f.fragType.at(0) == 'n') {f.fragSite = 0;} // no frag or immonium 
		else { // x/y/z or a/b/c ion
			if (f.fragType.at(0) == 'y') {
				unsigned site=1; 
				while (f.fragDesc.at(site) == '.') {site++;}
				f.fragSite = site; //continue;
			}
			if (f.fragType.at(0) == 'a' || f.fragType.at(0) == 'b') {
				unsigned site=1; 
				while (f.fragDesc.at(site) != '.') {site++;}
				f.fragSite = site; //continue;
			}
			if (f.fragType.at(0) != 'a' && f.fragType.at(0) != 'b' && f.fragType.at(0) != 'y') {
				cout << "fragType " << f.fragType << " not yet implemented" << endl;
				i=j; j++; continue; // does not record frag site
			}

		}
		
		matches.push_back(f); // add this peak
	
		i=j; j++; // continue onto next fragment. 
	}
	return matches;
}

double convert2double(string str) {
	char * cstr;
	cstr = new char [str.size()+1];				// convert to c_str
	strcpy_s (cstr, str.size()+1, str.c_str());
	double value = atof(cstr);					//convert to double
	return value;
}
int convert2int(string str) {
	char * cstr;
	cstr = new char [str.size()+1];				// convert to c_str
	strcpy_s (cstr, str.size()+1, str.c_str());
	int value = atoi(cstr);					//convert to int
	return value;
}
double encodeAA(char aa) {  // encodes AA letter names into numbers
	if (aa == 'X') {return 0.00;} // if there is no AA related here, ie N-term side of a b1 ion
	if (aa == 'V') {return 0.00;} // hydrophobic side groups
	if (aa == 'L') {return 0.05;}
	if (aa == 'I') {return 0.10;}
	if (aa == 'M') {return 0.15;}
	if (aa == 'F') {return 0.20;}
	if (aa == 'G') {return 0.25;} // in between
	if (aa == 'A') {return 0.30;}
	if (aa == 'S') {return 0.35;}
	if (aa == 'T') {return 0.40;}
	if (aa == 'Y') {return 0.45;}
	if (aa == 'W') {return 0.50;}
	if (aa == 'C') {return 0.55;}
	if (aa == 'P') {return 0.60;}
	if (aa == 'N') {return 0.65;} // hydrophilic side groups
	if (aa == 'E') {return 0.70;}
	if (aa == 'Q') {return 0.75;}
	if (aa == 'D') {return 0.80;}
	if (aa == 'H') {return 0.85;}
	if (aa == 'K') {return 0.90;}
	if (aa == 'R') {return 0.95;}
	//std::cout << "EncodeAA -> unknown AA letter code: " << aa << std::endl; 
	return 0.0; // separate value for unknown

};
double countHKR(string& seq) { // count the number of H, K or R residues in the input sequence
	double HKR=0;
	for (unsigned i=0; i<seq.size(); i++) {
		char aa = seq.at(i);
		if (aa == 'H') {HKR++; continue;}
		if (aa == 'K') {HKR++; continue;}
		if (aa == 'R') {HKR++; continue;}
	}
	return HKR;
}
double getMean(vector<double>& v) {  // returns the mean
	double total=0;
	for (unsigned i=0; i < v.size(); i++) {total+= v.at(i);}
	return total/v.size(); 
}
double getStdDev(vector<double>& v, double mean){				// returns the standard deviation of input array, second arg is the precomputed vector mean.
	double numerator=0;
	for (unsigned i=0; i < v.size(); i++) {numerator+= pow((v.at(i)-mean),2);}
	return pow(numerator/v.size(),0.5); 
}


/*
EXAMPLE OF AN OUTPUT FILE WHICH MUST BE PROCESSED AND LEARNED:

Search Parameters:
parentTol=1
tol=0.5
minBasicHits=4
minPrelimScore=0.05
randomSamples=25
randomLODspread=20
reportMin=5
topCandidates=0.3
maxTopCandidates=20
newmetric=0
maxFragCharge=2
verbosity=1
devsAboveMeanForHit=2
collectRandomHits=0
MOD0 = iodacetamide (static) mass: 57 acting on: C 
MOD1 = oxidation (variable) mass: 16 acting on: M H W 
MOD2 = deamidation (variable) mass: 1 acting on: N Q 

MOD0	MOD1	MOD2	Delta
0	0	0	0
0	1	0	16
0	2	0	32
1	0	0	57
1	1	0	73
1	2	0	89
2	0	0	114
2	1	0	130
2	2	0	146
loading peptide database: Forward Db: 1689106 unique peptides loaded from peptides.dat
loading peptide database: Reverse Db: 683090 unique peptides loaded from reverse.dat
DTAFiles\AGTEPSDAIR.mgf.dta
top ten candidates for pmass: 1015.49 DTAFiles\AGTEPSDAIR.mgf.dta
	AGTEPSDAIR	prelim: 0.652732	LOD: 25.9144
	ASAEQSGGGPR	prelim: 0.498535	LOD: 23.7549
	ASAVSELSPR	prelim: 0.630028	LOD: 23.7549
	TFAHNSALR	prelim: 0.559211	LOD: 22.6751
	APQFTILAR	prelim: 0.651944	LOD: 22.6751
	ASAAPPHPLR	prelim: 0.507099	LOD: 20.5156
	TAEEKVSPR	prelim: 0.501239	LOD: 19.4358
	EDSVLNIAR	prelim: 0.502028	LOD: 19.4358
	TTSISPALAR	prelim: 0.597014	LOD: 19.4358
	TGDPQETLR	prelim: 0.523944	LOD: 18.356
random mean: 9.58833 random StdDev: 2.37195
std Devs above mean: 6.88297
IDENTIFIED: AGTEPSDAIR
observed   pmass: 1015.49 pcharge: 1
identified pmass: 1015.48
mods: {0,0,0,0,0,0,0,0,0,0,}
basicHits: 8
prelimScore: 0.652732
Top 4N peaks: AGTEPSDAIR
intensity  obsMZ   expMZ  type                seq            LOD
     2265   86.10   86.08 immo                         I   1.080
     2049   70.07   70.05 immo                         P   1.080
     1672  175.13  175.10 y1+                 .........R   2.160
     1532  359.26  359.15 b1+                 AGTE......   2.160
      816  658.43  658.26 b1+                 AGTEPSD...   2.160
      751   74.07   74.05 immo                         T   1.080
      746   87.09    0.00                                  0.000
      576  112.09    0.00                                  0.000
      460   69.88   70.05 immo                         P   1.080
      453  102.06  102.04 immo                         E   1.080
      390   85.84   86.08 immo                         I   1.080
      390  110.09    0.00                                  0.000
      385   84.07    0.00                                  0.000
      382  371.19    0.00                                  0.000
      378  300.15    0.00                                  0.000
      370   72.09   72.04 b1+                 A.........   2.160
      355  230.15  230.11 b1+                 AGT.......   2.160
      348  159.11    0.00                                  0.000
      322  185.12    0.00                                  0.000
      321  101.09  101.06 a1+                 AG........   1.080
      318  641.39  641.33 y1+-1NH3            ....PSDAIR   1.080
      273  120.09    0.00                                  0.000
      264   71.88   72.04 b1+                 A.........   2.160
      251  158.11  158.10 y1+-1NH3            .........R   1.080
      242  157.13    0.00                                  0.000
      229  109.71    0.00                                  0.000
      210  136.09    0.00                                  0.000
      209  129.11  129.06 b1+                 AG........   2.160
      200   83.81    0.00                                  0.000
      200  119.66    0.00                                  0.000
      199  212.13  212.11 b1+-1H2O            AGT.......   1.080
      194  130.11    0.00                                  0.000
overall LOD score: 25.9144

DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
loading obs spect: 0
top ten candidates for pmass: 2882.55 DTAFiles\ALIPSSQNEVELGELLLSLNYLPSAGR.mgf.dta
	ALIPSSQNEVELGELLLSLNYLPSAGR	prelim: 0.252294	LOD: 66.5238
	QMNLNPMDSPHSPISPLPPTLSPQPR	prelim: 0.10953	LOD: 34.8458
	GETLGNIPLLAIGPAICLPGIAAIALARK	prelim: 0.0974537	LOD: 26.9263
	VSSDPAWAVEWIELPRGLSLSSLGSAR	prelim: 0.0921176	LOD: 25.3424
	DAAIPAPHAPSAPGCLASVKPGGWKGAAGR	prelim: 0.0996068	LOD: 25.3424
	MEDTDFVGWALDVLSPNLISTSMLGR	prelim: 0.0916495	LOD: 23.7585
	MALLLLSLGLSLIAAQEFDPHTVMQR	prelim: 0.0923048	LOD: 23.7585
	QALTPEQVSFCATHMQQYMDPRGR	prelim: 0.0931474	LOD: 23.7585
	NTSIRVADFGSATFDHEHHTTIVATR	prelim: 0.0745179	LOD: 22.1746
	DLLEDKILHSHLVDYFPEFDGPQR	prelim: 0.107471	LOD: 22.1746
random mean: 11.9109 random StdDev: 4.05973
std Devs above mean: 13.4523
IDENTIFIED: ALIPSSQNEVELGELLLSLNYLPSAGR
observed   pmass: 2882.55 pcharge: 1
identified pmass: 2882.47
mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
basicHits: 18
prelimScore: 0.252294
Top 4N peaks: ALIPSSQNEVELGELLLSLNYLPSAGR
intensity  obsMZ   expMZ  type                seq            LOD
     3138   95.95    0.00                                                   0.000
      307  487.20  487.24 y1+                 ......................PSAGR   3.168
      279   86.09   86.08 immo                                          L   1.584
      253  112.09    0.00                                                   0.000
      237  763.35  763.38 y1+                 ....................YLPSAGR   3.168
      198  175.09  175.10 y1+                 ..........................R   3.168
      174  600.31  600.32 y1+                 .....................LPSAGR   3.168
      171   70.05   70.05 immo                                          P   1.584
      155  470.23  470.24 y1+-1NH3            ......................PSAGR   1.584
      151 1716.85    0.00                                                   0.000
      141 1416.73 1416.77 y1+                 ..............LLLSLNYLPSAGR   3.168
      141 1715.73 1715.91 y1+                 ...........LGELLLSLNYLPSAGR   3.168
      133  232.10  232.12 y1+                 .........................GR   3.168
      119  400.13    0.00                                                   0.000
      108  990.44  990.50 y1+                 ..................LNYLPSAGR   3.168
      104  185.09  185.12 b1+                 AL.........................   3.168
      103  542.20    0.00                                                   0.000
      102  877.33  877.42 y1+                 ...................NYLPSAGR   3.168
       98  303.16  303.16 y1+                 ........................AGR   3.168
       97 1303.62 1303.69 y1+                 ...............LLSLNYLPSAGR   3.168
       96 1943.86 1944.02 y1+                 .........VELGELLLSLNYLPSAGR   3.168
       95  471.21    0.00                                                   0.000
       95  871.25    0.00                                                   0.000
       92 1077.48 1077.53 y1+                 .................SLNYLPSAGR   3.168
       80  742.28    0.00                                                   0.000
       73  298.18  298.20 b1+                 ALI........................   3.168
       71  643.19    0.00                                                   0.000
       71  860.35  860.42 y1+-1NH3            ...................NYLPSAGR   1.584
       66  644.09  644.37 b1+-1NH3-2H2O       ALIPSSQ....................   1.584
       66  743.40    0.00                                                   0.000
       63  756.27    0.00                                                   0.000
       60  187.09    0.00                                                   0.000
       58  391.19    0.00                                                   0.000
       58  449.25    0.00                                                   0.000
       58 2588.10    0.00                                                   0.000
       57  655.32    0.00                                                   0.000
       57 1602.76 1602.83 y1+                 ............GELLLSLNYLPSAGR   3.168
       56  497.16    0.00                                                   0.000
       54  228.08    0.00                                                   0.000
       54  859.82  859.42 y1+-1H2O            ...................NYLPSAGR   1.584
       53  288.63    0.00                                                   0.000
       53  289.17    0.00                                                   0.000
       51  243.16    0.00                                                   0.000
       51  320.48    0.00                                                   0.000
       51  409.64    0.00                                                   0.000
       47 2072.96 2073.06 y1+                 ........EVELGELLLSLNYLPSAGR   3.168
       46 1843.89    0.00                                                   0.000
       40 2187.00 2187.10 y1+                 .......NEVELGELLLSLNYLPSAGR   3.168
       36 2167.81    0.00                                                   0.000
       36 2404.79    0.00                                                   0.000
     1655   96.07    0.00                                                   0.000
      974   96.18    0.00                                                   0.000
      958   95.89    0.00                                                   0.000
overall LOD score: 66.5238

DTAFiles\AQMDSSQLIHCR.mgf.dta
top ten candidates for pmass: 1444.43 DTAFiles\AQMDSSQLIHCR.mgf.dta
	FDHNAIMKILDL	prelim: 0.272246	LOD: 23.6381
	MAEPGHSHHLSAR	prelim: 0.381321	LOD: 21.0117
	MPHLRLVLPITR	prelim: 0.299034	LOD: 19.6984
	VMSKSAADIIALAR	prelim: 0.295296	LOD: 19.6984
	IEHSTFDGLDRR	prelim: 0.361541	LOD: 18.3852
	MESDRLIISLPR	prelim: 0.217111	LOD: 17.072
	IDAMHGVMGPYVR	prelim: 0.284083	LOD: 17.072
	WQLNPGMYQHR	prelim: 0.317101	LOD: 17.072
	GTHPAIHDILALR	prelim: 0.294206	LOD: 17.072
	SPSSLLHDPALHR	prelim: 0.294206	LOD: 15.7588
random mean: 10.2432 random StdDev: 3.34294
std Devs above mean: 4.00694
IDENTIFIED: FDHNAIMKILDL
observed   pmass: 1444.43 pcharge: 1
identified pmass: 1444.66
mods: {0,0,0,0,0,0,16,0,0,0,0,0,}
basicHits: 5
prelimScore: 0.272246
Top 4N peaks: FDHNAIMKILDL
intensity  obsMZ   expMZ  type                seq            LOD
     2157 1000.35    0.00                                    0.000
     1812  110.06  110.00 immo                           H   1.313
     1360  175.08    0.00                                    0.000
     1032  200.09    0.00                                    0.000
      922  455.11    0.00                                    0.000
      830   86.08   86.08 immo                           I   1.313
      648  101.06  101.09 immo                           K   1.313
      616  472.15    0.00                                    0.000
      552   70.06    0.00                                    0.000
      504  335.09    0.00                                    0.000
      492   89.06    0.00                                    0.000
      383  112.08    0.00                                    0.000
      330  585.21  585.18 b1+                 FDHNA.......   2.626
      326  133.02    0.00                                    0.000
      298 1289.38    0.00                                    0.000
      284  229.11    0.00                                    0.000
      279   87.04   87.04 immo                           N   1.313
      265  913.36    0.00                                    0.000
      263  826.31    0.00                                    0.000
      254  104.05  104.04 immo                           M   1.313
      240  698.29  698.26 b1+                 FDHNAI......   2.626
      240  748.23  748.40 y1+                 ......MKILDL   2.626
      239  298.05    0.00                                    0.000
      236  334.07    0.00                                    0.000
      230  166.04    0.00                                    0.000
      228  183.07    0.00                                    0.000
      223  247.07  247.11 y1+                 ..........DL   2.626
      221  120.10  120.07 immo                           F   1.313
      221 1424.88    0.00                                    0.000
      213 1325.48    0.00                                    0.000
      210  329.15    0.00                                    0.000
      204  731.14  731.40 y1+-1NH3            ......MKILDL   1.313
      203 1354.55    0.00                                    0.000
      201  316.08    0.00                                    0.000
      201 1008.42    0.00                                    0.000
      199  568.22  568.18 b1+-1NH3            FDHNA.......   1.313
      196  639.28    0.00                                    0.000
      188  366.20    0.00                                    0.000
      186   84.04    0.00                                    0.000
      184  203.04    0.00                                    0.000
      181  331.14    0.00                                    0.000
      181  855.49    0.00                                    0.000
      177  217.08    0.00                                    0.000
      177  791.28    0.00                                    0.000
      174  318.11    0.00                                    0.000
      170  216.07    0.00                                    0.000
      167  861.24  861.48 y1+                 .....IMKILDL   2.626
      165  531.15    0.00                                    0.000
overall LOD score: 23.6381

DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
top ten candidates for pmass: 1908.96 DTAFiles\ATAAPAGAPPQPQDLEFTK.mgf.dta
	ATAAPAGAPPQPQDLEFTK	prelim: 0.464155	LOD: 54.2672
	KPTNLGTSCLLRHLQR	prelim: 0.338948	LOD: 27.1336
	APQPQRSASPALPFWTR	prelim: 0.201213	LOD: 24.6669
	STGWYPQPQINWSDSK	prelim: 0.355488	LOD: 24.6669
	ALQSAGPFGPYIPGGPAPGR	prelim: 0.303074	LOD: 23.4336
	QVSAESSSSMNSNTPLVR	prelim: 0.205814	LOD: 23.4336
	GKQSPHHERPEPEPPR	prelim: 0.211954	LOD: 22.2002
	ALRPKPTLEKDLEEDR	prelim: 0.218912	LOD: 20.9669
	IVEVLGIPPAHILDQAPK	prelim: 0.343929	LOD: 20.9669
	GQELHKPETGCQQELR	prelim: 0.197297	LOD: 20.9669
random mean: 12.1361 random StdDev: 4.61053
std Devs above mean: 9.138
IDENTIFIED: ATAAPAGAPPQPQDLEFTK
observed   pmass: 1908.96 pcharge: 1
identified pmass: 1908.95
mods: {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,}
basicHits: 15
prelimScore: 0.464155
Top 4N peaks: ATAAPAGAPPQPQDLEFTK
intensity  obsMZ   expMZ  type                seq            LOD
     6560 1299.53 1299.63 y1+                 ........PPQPQDLEFTK   2.467
     5095   63.86    0.00                                           0.000
     5056   70.06   70.05 immo                                  P   1.233
     2954  977.39  977.47 y1+                 ...........PQDLEFTK   2.467
     2364  226.10  226.13 b1+-1H2O            ATA................   1.233
     2318  323.13    0.00                                           0.000
     1331  297.12  297.17 b1+-1H2O            ATAA...............   1.233
     1183  315.12  315.17 b1+                 ATAA...............   2.467
     1105  341.14    0.00                                           0.000
      987  663.24    0.00                                           0.000
      894  244.08  244.13 b1+                 ATA................   2.467
      879   86.08   86.08 immo                                  L   1.233
      833 1595.70 1595.78 y1+                 ....PAGAPPQPQDLEFTK   2.467
      828  101.06  101.06 immo                                  Q   1.233
      701  611.25  611.32 b1+                 ATAAPAGA...........   2.467
      688  141.08    0.00                                           0.000
      668  175.07    0.00                                           0.000
      661  169.08    0.00                                           0.000
      650 1427.57 1427.69 y1+                 ......GAPPQPQDLEFTK   2.467
      613  269.12    0.00                                           0.000
      613  548.20    0.00                                           0.000
      596  933.36  933.48 b1+                 ATAAPAGAPPQ........   2.467
      588  905.35  905.48 a1+                 ATAAPAGAPPQ........   1.233
      572 1281.52 1281.63 y1+-1H2O            ........PPQPQDLEFTK   1.233
      512  368.15    0.00                                           0.000
      495  110.06    0.00                                           0.000
      480  394.16  394.22 b1+-1H2O            ATAAP..............   1.233
      480 1202.47 1202.58 y1+                 .........PQPQDLEFTK   2.467
      454  619.24  619.33 y1+-1H2O            ..............LEFTK   1.233
      447  270.13    0.00                                           0.000
      439  776.31    0.00                                           0.000
      437 1370.58 1370.67 y1+                 .......APPQPQDLEFTK   2.467
      433   98.05    0.00                                           0.000
      414  365.16    0.00                                           0.000
      403   72.07   72.04 b1+                 A..................   2.467
      396  337.15    0.00                                           0.000
      394  173.07  173.09 b1+                 AT.................   2.467
      386  243.11    0.00                                           0.000
      370  126.03    0.00                                           0.000
      367  708.29  708.37 b1+                 ATAAPAGAP..........   2.467
      360  195.09    0.00                                           0.000
      350  248.13  248.14 y1+                 .................TK   2.467
      330  112.06    0.00                                           0.000
      330  115.07    0.00                                           0.000
      329  212.10    0.00                                           0.000
      313  540.22  540.28 b1+                 ATAAPAG............   2.467
      311  209.06    0.00                                           0.000
      311  295.13    0.00                                           0.000
      310  167.09    0.00                                           0.000
      309  906.38    0.00                                           0.000
      303  155.07  155.09 b1+-1H2O            AT.................   1.233
      299  382.15    0.00                                           0.000
      293  186.09    0.00                                           0.000
      278   74.05   74.05 immo                                  T   1.233
      278 1052.46    0.00                                           0.000
      273  583.21  583.32 a1+                 ATAAPAGA...........   1.233
      264 1088.41 1088.53 y1+-1NH3            ..........QPQDLEFTK   1.233
      262  355.12    0.00                                           0.000
      259  102.06  102.04 immo                                  E   1.233
      245  338.15    0.00                                           0.000
      242  453.19    0.00                                           0.000
      239  520.24    0.00                                           0.000
      237  452.16    0.00                                           0.000
      232  198.10    0.00                                           0.000
overall LOD score: 54.2672

*/

Flood::Vector<double> getAttributes(fragmatch& f,string seq, unsigned pcharge) {
	
	//if (params.verbosity < 5) {cout << "getting attributes for frag " << f.type << " " << f.desc << endl; }
	// given the seq, and the immonium fragment, returns the attribute vector to be fed into the neural net
	vector<double> attributes;			// holds the attributes for this fragment


	// NO FRAGMENTATION - TYPE 0
	if (f.fragType.substr(0,6).compare("nofrag")==0) {
		attributes.push_back(0);	// iontype 0 = unfragmented
		attributes.push_back(pcharge);
		attributes.push_back(f.fcharge);
		attrib

Initial URL

Initial Description

Harvest paper: 
Citation: McHugh L, Arthur JW (2008) Computational methods for protein identification from mass spectrometry data. PLoSComputBiol 4(2): e12. doi:10.1371/journal.pcbi.0040012

PINNACLE automatically learns the statistical and neural network models for the Harvest algorithm.

Contact [email protected] for a Visual Studio project solution.

Initial Title

PINNACLE - Automated Machine Learning Module for the Harvest peptide identification package

Initial Tags

Initial Language

C++

Choose a language for easy browsing: