Revision: 61174
at December 2, 2012 07:43 by webonomic


Initial Code
data myOUT;
input term doc count;
datalines;
1 1 1
1 3 1
1 4 1
2 2 1
2 3 2
3 1 2
3 3 2
3 4 1
4 2 2
4 4 1
5 3 2
;
run;
 
proc sort data=myOUT;
by  doc term;
run;
 
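data docbyterm;
set myOUT;
by doc;
array t{5} t1-t5;   /* one column per term id (1..5) */
retain t1-t5;       /* carry values across the rows of one doc */
if first.doc then do;
   do i=1 to 5;
      t{i}=0;       /* reset all term counts at the start of each doc */
   end;
end;
t{term}=count;      /* place this row's count in its term column */
if last.doc then do;
   output;          /* write one row per document */
end;
keep doc t1-t5;
run;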
 
 
proc corr data=docbyterm cov outp=cooccur sscp;
var t1-t5;
run;

Initial URL
https://communities.sas.com/thread/6327?start=0&tstart=0

Initial Description
Text Miner stores the term-by-document frequency matrix in a compressed form, with one row per term and document pair and its count. You will find this OUT data set in the project data directory of your Text Miner run; its label includes the string "OUT". Because a 30,000-document collection can have 500,000 to a million distinct terms, be sure to restrict the terms of interest with a start list. The code here is an example of building the co-occurrence matrix: it expands the compressed representation into an uncompressed document-by-term table and then computes the co-occurrence counts with proc corr and the sscp option.
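With raw counts, each SSCP entry is the sum over documents of the product of the two term counts, so a document where both terms appear twice contributes 4. If what you want is the number of documents in which two terms appear together, one option is to turn the counts into 0/1 flags first. The following is a minimal sketch of that idea, not part of the original post. It assumes docbyterm has one row per document with columns t1-t5; the data set names docbyterm_bin, cooccur_bin and cooccur_docs are made up for illustration.

```sas
/* Sketch only: docbyterm_bin, cooccur_bin and cooccur_docs are hypothetical names. */
data docbyterm_bin;
   set docbyterm;
   array t{5} t1-t5;
   do i = 1 to 5;
      t{i} = (t{i} > 0);   /* 1 if the term occurs in the document, else 0 */
   end;
   drop i;
run;

/* SSCP of the 0/1 flags: entry (i,j) is the number of documents containing both terms */
proc corr data=docbyterm_bin sscp outp=cooccur_bin noprint;
   var t1-t5;
run;

/* Keep only the SSCP rows for the term variables.                     */
/* The filter on _name_ skips an Intercept row if the output has one.  */
data cooccur_docs;
   set cooccur_bin;
   where _type_ = 'SSCP' and upcase(_name_) ne 'INTERCEPT';
   keep _name_ t1-t5;
run;
```

In cooccur_docs the diagonal entries are then the number of documents that contain each term, and the off-diagonal entries are the pairwise document co-occurrence counts.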

Initial Title
Co Word Analysis with SAS

Initial Language
SAS