This package is intended for processing very large data sets
via shell pipelines. The programs do not store the data.
They are responses to the challenge: can one perform some of
the standard computations of statistical data analysis
(autocorrelation of a scalar time-series, covariance matrix
of a set of vectors, and least-squares polynomials) if one
receives the data points one at a time, and must process each one
and throw it away before receiving the next?
Of course, all this must be done while preserving numerical
stability. The three C programs I provide seem to achieve these
aims for the three specific problems mentioned.
The ideas could be relevant more generally to stream computing
and distributed data analysis; see e.g.
Version 1.2 is 64-bit clean. A new feature is that the covariance
program takes no arguments: the vector length n is determined from
the first line of the input.
tar zvxf pipemath-1.2.tgz; cd pipemath-1.2; make
Lines in the data file starting with # are ignored.
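The comment rule can be sketched in C as follows. This is an illustration only, not code from the package: the names is_comment_line and count_noncomment_lines and the 4096-byte line buffer are assumptions, and the sketch takes "starting with #" to mean that # is the very first character of the line.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the line is a comment
   (its first character is '#'), else 0. */
int is_comment_line(const char *line)
{
    return line[0] == '#';
}

/* Hypothetical helper: counts the non-comment lines on `in`,
   holding only one line in memory at a time.
   Assumes no line is longer than the buffer; a longer line
   would be read in several pieces and counted more than once. */
long count_noncomment_lines(FILE *in)
{
    char buf[4096];
    long count = 0;
    while (fgets(buf, sizeof buf, in) != NULL)
        if (!is_comment_line(buf))
            count++;
    return count;
}
```

A reader loop in the same style would parse each non-comment line, update its running sums, and then reuse the buffer for the next line.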
Computes the autocorrelation function of a scalar time series.
Usage: cat datafile | autocorrelation [maxlag=20 [stride=1 [dt=1]]]
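One way to compute an autocorrelation function in a single pass is sketched below. It is not taken from the package sources and may differ from what the autocorrelation program does. It keeps only the last MAXLAG+1 samples in a ring buffer and accumulates lagged products, subtracting the first sample from every value to limit cancellation. The struct and function names are hypothetical, MAXLAG is fixed at compile time, and the stride and dt arguments are not modelled.

```c
#define MAXLAG 20   /* assumed fixed here; mirrors maxlag=20 in the usage line */

/* Running state: O(MAXLAG) memory regardless of the series length. */
struct acorr {
    long   n;                 /* samples seen so far */
    double offset;            /* first sample, subtracted from every value */
    double ring[MAXLAG + 1];  /* the last MAXLAG+1 shifted samples */
    double sum;               /* sum of shifted samples */
    double prod[MAXLAG + 1];  /* prod[k] = sum over t of y[t] * y[t-k] */
};

void acorr_init(struct acorr *a)
{
    int k;
    a->n = 0;
    a->offset = 0.0;
    a->sum = 0.0;
    for (k = 0; k <= MAXLAG; k++) {
        a->ring[k] = 0.0;
        a->prod[k] = 0.0;
    }
}

/* Consume one sample; the caller may discard it afterwards. */
void acorr_push(struct acorr *a, double x)
{
    int k;
    double y;
    if (a->n == 0)
        a->offset = x;
    y = x - a->offset;
    a->ring[a->n % (MAXLAG + 1)] = y;
    for (k = 0; k <= MAXLAG && k <= a->n; k++)
        a->prod[k] += y * a->ring[(a->n - k) % (MAXLAG + 1)];
    a->sum += y;
    a->n++;
}

/* Autocorrelation at lag k, normalized so that lag 0 gives 1.
   Uses the overall mean for both factors, which is a simplification.
   Returns 0 if there is too little data or the series is constant. */
double acorr_get(const struct acorr *a, int k)
{
    double mean, c0, ck;
    if (k < 0 || k > MAXLAG || a->n <= k)
        return 0.0;
    mean = a->sum / a->n;
    c0 = a->prod[0] / a->n - mean * mean;
    ck = a->prod[k] / (a->n - k) - mean * mean;
    return c0 > 0.0 ? ck / c0 : 0.0;
}
```

A caller would run acorr_init once, call acorr_push for each value read from standard input, and print acorr_get for each lag at end of input.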
Computes the covariance matrix of a set of n-vectors.
Usage: cat datafile | covariance
or: covariance < datafile
Each line of datafile has an n-vector. The value of n is determined
by the number of items on the first line. All subsequent lines must have
the same number of items.
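A covariance matrix can be accumulated one vector at a time with a Welford-style update of the running mean and co-moment sums. The sketch below shows that technique; it is not claimed to be the method used by the covariance program. The names are hypothetical, the dimension DIM is fixed at compile time rather than read from the first line, and the division by n-1 (sample covariance) is an assumption.

```c
#define DIM 2   /* hypothetical fixed dimension for this sketch */

struct cov {
    long   n;             /* vectors seen so far */
    double mean[DIM];     /* running mean of each component */
    double c[DIM][DIM];   /* sums of (x_i - mean_i) * (x_j - mean_j) */
};

void cov_init(struct cov *s)
{
    int i, j;
    s->n = 0;
    for (i = 0; i < DIM; i++) {
        s->mean[i] = 0.0;
        for (j = 0; j < DIM; j++)
            s->c[i][j] = 0.0;
    }
}

/* Consume one vector; the caller may discard it afterwards. */
void cov_push(struct cov *s, const double x[DIM])
{
    double delta[DIM];
    int i, j;
    s->n++;
    for (i = 0; i < DIM; i++) {
        delta[i] = x[i] - s->mean[i];    /* deviation from the old mean */
        s->mean[i] += delta[i] / s->n;   /* mean now includes x */
    }
    /* old-mean deviation times new-mean deviation keeps the sums exact */
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            s->c[i][j] += delta[i] * (x[j] - s->mean[j]);
}

/* Sample covariance of components i and j (assumed n-1 normalization). */
double cov_get(const struct cov *s, int i, int j)
{
    return s->n > 1 ? s->c[i][j] / (s->n - 1) : 0.0;
}
```

Because only the mean and the DIM x DIM sums are kept, memory use does not grow with the number of input lines, and no large sums of raw products are subtracted from each other.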
Fits a least-squares polynomial.
Usage: cat datafile | lsqpoly [degree=1]
Each line of datafile has an x,y pair and an optional weight.
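For the default degree of 1, a weighted straight-line fit can be accumulated in one pass from running weighted means and co-moments. The sketch below illustrates that idea only; it does not handle higher degrees and is not claimed to be how lsqpoly works. The names are hypothetical, and it assumes that the weight multiplies the squared residual and that a missing weight means 1.

```c
struct linfit {
    double w;     /* total weight so far */
    double mx;    /* weighted mean of x */
    double my;    /* weighted mean of y */
    double sxx;   /* weighted sum of (x - mx)^2 */
    double sxy;   /* weighted sum of (x - mx) * (y - my) */
};

void linfit_init(struct linfit *f)
{
    f->w = f->mx = f->my = f->sxx = f->sxy = 0.0;
}

/* Consume one point with weight wt (pass 1.0 when no weight is given).
   Points with a non-positive weight are ignored. */
void linfit_push(struct linfit *f, double x, double y, double wt)
{
    double dx, dy;
    if (wt <= 0.0)
        return;
    f->w += wt;
    dx = x - f->mx;                     /* deviations from the old means */
    dy = y - f->my;
    f->mx += (wt / f->w) * dx;          /* means now include the point */
    f->my += (wt / f->w) * dy;
    f->sxx += wt * dx * (x - f->mx);
    f->sxy += wt * dx * (y - f->my);
}

/* Writes the fitted line y = slope * x + intercept.
   Returns 0 on success, -1 if all x values seen were identical
   (or no points were seen). */
int linfit_solve(const struct linfit *f, double *slope, double *intercept)
{
    if (f->sxx <= 0.0)
        return -1;
    *slope = f->sxy / f->sxx;
    *intercept = f->my - *slope * f->mx;
    return 0;
}
```

Working with deviations from the running means avoids forming large sums of x*x and x*y directly, which is where a naive one-pass fit tends to lose precision.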
sudo make install