lu-approximation.core

-main

(-main & args)
Fits a multiple linear regression model to the data, with resampling,
and reports estimates of the approximation accuracy.

Saves the results to CSV files
in the root execution directory.

Arguments: sample-path
           values-path
           n-rep

The original sample is read from the CSV file at [sample-path].
The target values are read from the CSV file at [values-path].
The first row should contain the variable labels.
The first column contains the values of the response y.
In the [values-path] file, the second column contains the group id of
each value.
The remaining columns contain the values of the explanatory variables.
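
The column layout can be illustrated with a short sketch. It is written in Python rather than the project's Clojure and is not the tool's own reader; the function name `parse_table`, the `has_group` flag, and the sample data are hypothetical. It assumes that only the [values-path] file carries a group id column.

```python
import csv
import io


def parse_table(text, has_group=False):
    """Split CSV text into (labels, y, groups, X).

    has_group=True corresponds to the [values-path] layout, where the
    second column holds the group id; groups is None otherwise.
    """
    rows = list(csv.reader(io.StringIO(text)))
    labels, body = rows[0], rows[1:]          # first row: variable labels
    first_x = 2 if has_group else 1           # first explanatory column
    y = [float(r[0]) for r in body]           # first column: response y
    groups = [r[1] for r in body] if has_group else None
    X = [[float(v) for v in r[first_x:]] for r in body]
    return labels, y, groups, X


# Made-up data, for illustration only.
sample_csv = "y,x1,x2\n1.0,0.5,2.0\n2.0,1.5,1.0\n"
values_csv = "y,group,x1,x2\n1.2,a,0.4,2.1\n2.4,b,1.6,0.9\n"
```

Calling `parse_table(values_csv, has_group=True)` returns the group ids separately from the explanatory values, so the regression only ever sees the numeric columns.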


Model:
       y=Xb+eps,
       eps ~ F(0,sigma^2).

       y: [n x 1] vector of the response (in [sample-path]).
       X: [n x (p+1)] matrix of the explanatory variables (in [sample-path]).
       b: [(p+1) x 1] vector of the regression coefficients.
       eps: [n x 1] vector of the independent and
            identically distributed errors with common distribution F
            having mean 0 and finite variance sigma^2.
       n: number of observations (in [sample-path]).
       p: number of explanatory variables in the input file.

Assumptions: error terms are independent and identically distributed.

Regression coefficients are estimated using ordinary least squares (OLS).
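
As an illustration of the OLS step, the following Python sketch solves the normal equations (X'X)b = X'y by Gauss-Jordan elimination. It is not the project's implementation. The function name `ols_fit` is hypothetical, and the sketch assumes that the extra column in the [n x (p+1)] matrix X is a leading column of ones for the intercept.

```python
def ols_fit(X, y):
    """Return OLS coefficients [b0, b1, ..., bp] for y = Xb + eps.

    X is a list of n rows with p explanatory values each. A leading
    column of ones is added (assumption: the model has an intercept).
    """
    A = [[1.0] + list(row) for row in X]
    q = len(A[0])
    # Augmented normal-equation system [X'X | X'y].
    M = [[sum(r[i] * r[j] for r in A) for j in range(q)]
         + [sum(r[i] * yi for r, yi in zip(A, y))]
         for i in range(q)]
    for col in range(q):
        # Partial pivoting: pick the largest remaining entry in the column.
        pivot = max(range(col, q), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the column from every other row.
        for r in range(q):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][q] / M[i][i] for i in range(q)]
```

Solving the normal equations directly is adequate for a small, well-conditioned example; a QR-based solver is usually preferred for numerical stability on real data.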


Accuracy (single value):
       pho_j=|y~_j-y^_j|,
       y^_j=X'_j*b,  j=1,...,N.

       y~_j: j-th element of the [N x 1] vector of the true observed
             values (in [values-path]).
       y^_j: j-th element of the [N x 1] vector of the fitted values
             from the model.
       X'_j: j-th row of the [N x (p+1)] matrix of the explanatory
             variables (in [values-path]).
       b: here, the OLS estimate of the regression coefficients.
       N: number of observations (in [values-path]).
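
The single-value accuracy is an absolute prediction error, which a short Python sketch can make concrete. The names `predict` and `abs_errors` are hypothetical, and the coefficient vector is assumed to be ordered as [intercept, b1, ..., bp].

```python
def predict(X_new, b):
    """Fitted values y^_j = b0 + b1*x_j1 + ... + bp*x_jp for each row."""
    return [b[0] + sum(bk * xk for bk, xk in zip(b[1:], row))
            for row in X_new]


def abs_errors(y_true, y_fit):
    """pho_j = |y~_j - y^_j| for every observation j."""
    return [abs(t - f) for t, f in zip(y_true, y_fit)]
```

For example, with coefficients `[1.0, 2.0, -3.0]` and a row `[1.0, 1.0]`, the fitted value is 1 + 2 - 3 = 0, so an observed value of 0.5 gives an error of 0.5.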

Accuracy (in subset):
       pho(k,S)=min{pho- >= 0 : #{i in S | pho_i <= pho-} >= k*m}.

       pho(k,S): the (100*k)-th percentile of the accuracy sample.
       k: percentile level, belongs to [0,1].
       S: subset of values (subset in [values-path]).
       m: number of values in S.
       #: denotes the number of elements in the set.

Accuracy estimates: pho(Q_1,S)=pho(0.25,S),
                    pho(Q_2,S)=pho(0.50,S),
                    pho(Q_3,S)=pho(0.75,S),
                    pho_max(S)=pho(1,S).

Accuracy estimates are calculated for each group in [values-path].
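
The percentile definition and the per-group estimates can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: the names `pho_percentile` and `accuracy_by_group` and the result keys are hypothetical. For a sorted accuracy sample, the smallest value with at least k*m elements not exceeding it is the element at position ceil(k*m).

```python
import math
from collections import defaultdict


def pho_percentile(errors, k):
    """Smallest pho >= 0 such that at least k*m errors are <= pho."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    ordered = sorted(errors)
    need = math.ceil(k * len(ordered))   # required count of errors <= pho
    # With need == 0 every pho >= 0 qualifies, so the minimum is 0.
    return 0.0 if need == 0 else ordered[need - 1]


def accuracy_by_group(errors, groups):
    """Quartile and maximum accuracy estimates for each group id."""
    by_group = defaultdict(list)
    for e, g in zip(errors, groups):
        by_group[g].append(e)
    return {g: {"Q1": pho_percentile(v, 0.25),
                "Q2": pho_percentile(v, 0.50),
                "Q3": pho_percentile(v, 0.75),
                "max": pho_percentile(v, 1.0)}
            for g, v in by_group.items()}
```

Because the result is always an element of the sample (or 0 for k = 0), no interpolation between neighbouring values takes place, which matches the counting definition above.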


Confidence intervals (bootstrapping):
   output: lu-approximation/accuracy-bootstrap.csv
           lu-approximation/accuracy-sample.csv

   estimates:
           pho(Q_1), pho(Q_2), pho(Q_3), pho_max,
           each calculated from [n-rep] replications.
   out: mean with the 95% percentile confidence interval.

   bootstrap scheme: percentile bootstrap (Efron & Tibshirani, 1993)
                     over the sorted replications,
                     left border - value at the position given by the
                     largest integer not greater than alpha/2*[n-rep],
                     right border - value at the position given by the
                     smallest integer not less than (1-alpha/2)*[n-rep].

   significance level: alpha=0.05 (confidence level 1-alpha=0.95).
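
The border positions can be illustrated with a minimal Python sketch. It is not the project's implementation, and the function name `percentile_ci` is hypothetical. Two points are assumptions: positions are treated as 1-based and clamped to [1, n-rep], and a small tolerance protects the floor and ceiling against floating-point noise.

```python
import math


def percentile_ci(replications, alpha=0.05):
    """Return (mean, lower, upper) for a list of bootstrap replications."""
    ordered = sorted(replications)
    n_rep = len(ordered)
    if n_rep == 0:
        raise ValueError("need at least one replication")
    tol = 1e-9  # assumption: guard against floating-point noise
    # Largest integer not greater than alpha/2 * n_rep.
    left = math.floor(alpha / 2 * n_rep + tol)
    # Smallest integer not less than (1 - alpha/2) * n_rep.
    right = math.ceil((1 - alpha / 2) * n_rep - tol)
    # Assumption: positions are 1-based and clamped to [1, n_rep].
    left = min(max(left, 1), n_rep)
    right = min(max(right, 1), n_rep)
    mean = sum(ordered) / n_rep
    return mean, ordered[left - 1], ordered[right - 1]
```

With 10000 replications and alpha=0.05, the borders are the values at positions 250 and 9750 of the sorted replications.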


## Usage

     $ lein run sample-x1-x2-x3.csv cells-x1-x2-x3.csv 10000


## References:
     [1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife.
         Annals of Statistics, 7(1): 1-26. DOI:10.1214/aos/1176344552.
     [2] Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap.
         New York: Chapman and Hall.