-main
(-main & args)
Fits a multiple linear regression model to the data with resampling
and reports estimates of the approximation accuracy.
Saves the results as CSV files
under the root execution directory.
Arguments: sample-path
values-path
n-rep
The original sample is read from the CSV file at [sample-path].
Target values are read from the CSV file at [values-path].
The first row should contain variable labels.
The first column contains values of the response y.
The second column in the [values-path] file contains the group id for
the given sample value.
Remaining columns contain values of the explanatory variables.
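The file layout described above can be sketched as a small reader. This is an illustrative Python sketch, not the program's own code: the function names read_sample and read_values and the inline file contents are hypothetical, and it assumes that [sample-path] has no group-id column while [values-path] carries it in the second column.

```python
import csv
import io

def read_sample(text):
    """Parse a [sample-path]-style CSV: header row, then y, x1, ..., xp."""
    rows = list(csv.reader(io.StringIO(text)))
    labels, data = rows[0], rows[1:]
    y = [float(r[0]) for r in data]
    X = [[float(v) for v in r[1:]] for r in data]
    return labels, y, X

def read_values(text):
    """Parse a [values-path]-style CSV: header row, then y, group id, x1, ..., xp."""
    rows = list(csv.reader(io.StringIO(text)))
    labels, data = rows[0], rows[1:]
    y = [float(r[0]) for r in data]
    groups = [r[1] for r in data]
    X = [[float(v) for v in r[2:]] for r in data]
    return labels, y, groups, X

# Hypothetical file contents, for illustration only.
sample_csv = "y,x1,x2\n1.0,0.5,2.0\n2.0,1.5,1.0\n"
values_csv = "y,group,x1,x2\n1.5,a,0.7,1.8\n2.5,b,1.2,1.1\n"

_, y, X = read_sample(sample_csv)
_, y_true, groups, X_new = read_values(values_csv)
```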
Model:
y=Xb+eps,
eps ~ F(0,sigma^2).
y: [n x 1] vector of the response (in [sample-path]).
X: [n x (p+1)] matrix of the explanatory variables (in [sample-path]).
b: [(p+1) x 1] vector of the regression coefficients.
eps: [n x 1] vector of the independent and
identically distributed errors with common distribution F
having mean 0 and finite variance sigma^2.
n: number of observations (in [sample-path]).
p: number of explanatory variables in the input file.
Assumptions: error terms are independent and identically distributed.
Regression coefficients are estimated using ordinary least squares (OLS).
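As a rough illustration of the OLS step, the sketch below solves the normal equations (X'X)b = X'y with plain Gauss-Jordan elimination. It is written in Python for brevity and is not the program's implementation; the name ols_fit is hypothetical, and it assumes that the extra column in the [n x (p+1)] matrix is a leading column of ones for the intercept. A production solver would more likely use a QR decomposition, which is numerically more stable.

```python
def ols_fit(X, y):
    """Return OLS coefficients b (intercept first) solving (A'A) b = A'y,
    where A is X with a leading column of ones (assumed intercept)."""
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    # Build the augmented normal-equation system [A'A | A'y].
    M = [[sum(r[i] * r[j] for r in A) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(A, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(k):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]

# Toy data generated exactly by y = 1 + 2*x1 - 3*x2 (no noise).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in X]
b = ols_fit(X, y)
```

Because the toy response is noise-free, the fitted coefficients should agree with (1, 2, -3) up to rounding error.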
Accuracy (single value):
pho_j=|y~_j-y^_j|,
y^_j=X'_j*b.
y~_j: j-th element of the [N x 1] vector of the true observed values (in [values-path]).
y^_j: j-th element of the [N x 1] vector of the fitted values from the model.
X'_j: j-th row of the [N x (p+1)] matrix of the explanatory variables (in [values-path]).
N: number of observations (in [values-path]).
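The single-value accuracy is the absolute difference between an observed target value and the model's fitted value. A minimal Python sketch, assuming the coefficient vector stores the intercept first; the names predict and abs_errors and all numbers are hypothetical:

```python
def predict(X_new, b):
    """Fitted values y^_j = b0 + sum_i b_i * x_ji for each row of X_new."""
    return [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in X_new]

def abs_errors(y_true, y_fit):
    """Single-value accuracy pho_j = |y~_j - y^_j|."""
    return [abs(t - f) for t, f in zip(y_true, y_fit)]

# Hypothetical coefficients and target data, for illustration only.
b = [1.0, 2.0, -3.0]
X_new = [[1.0, 1.0], [2.0, 0.0]]
y_true = [0.5, 4.0]

y_fit = predict(X_new, b)
pho = abs_errors(y_true, y_fit)
```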
Accuracy (in subset):
pho(k,S)=min{pho- >= 0 : #{i in S | pho_i <= pho-} >= k*m}.
pho(k,S): the (100*k)-th percentile of the accuracy sample.
k: percentile level, belongs to [0,1].
S: subset of values (subset in [values-path]).
m: number of values in S.
#: denotes the number of elements in the set.
Accuracy estimates: pho(Q_1,S)=pho(0.25,S),
pho(Q_2,S)=pho(0.50,S),
pho(Q_3,S)=pho(0.75,S),
pho_max(S)=pho(1,S).
Accuracy estimates are calculated for each group in [values-path].
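The subset accuracy defined above is the smallest value that at least k*m of the errors in S do not exceed, which for k > 0 is the ceil(k*m)-th smallest error. The Python sketch below illustrates this reading and the per-group quartile estimates; the names pho and accuracy_by_group and the data are hypothetical, and this is not the program's code.

```python
import math
from collections import defaultdict

def pho(k, errors):
    """Smallest value v >= 0 such that at least k*m of the errors are <= v."""
    s = sorted(errors)
    # round() guards against floating-point noise in k*m before ceil.
    need = math.ceil(round(k * len(s), 9))
    return 0.0 if need == 0 else s[need - 1]

def accuracy_by_group(errors, groups):
    """Quartile and maximum accuracy estimates for each group id."""
    by_group = defaultdict(list)
    for e, g in zip(errors, groups):
        by_group[g].append(e)
    return {g: {"Q1": pho(0.25, v), "Q2": pho(0.50, v),
                "Q3": pho(0.75, v), "max": pho(1.0, v)}
            for g, v in by_group.items()}

# Hypothetical absolute errors and group ids, for illustration only.
errors = [0.4, 0.1, 0.3, 0.2, 1.0, 0.5]
groups = ["a", "a", "a", "a", "b", "b"]
estimates = accuracy_by_group(errors, groups)
```

Note that this definition always returns one of the observed errors (or 0 for k = 0); it does not interpolate between neighbouring values as some percentile conventions do.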
Confidence intervals (bootstrapping):
output: lu-approximation/accuracy-bootstrap.csv
lu-approximation/accuracy-sample.csv
estimates:
pho(Q_1), pho(Q_2), pho(Q_3), pho_max
all calculated after [n-rep] replications.
out: mean with 95% percentile confidence interval.
bootstrap scheme: percentile bootstrap (Efron & Tibshirani, 1993),
lower bound - value of the sorted replications at the position
of the largest integer not greater than alpha/2*[n-rep],
upper bound - value of the sorted replications at the position
of the smallest integer not less than (1-alpha/2)*[n-rep].
significance level: alpha=0.05 (confidence level 1-alpha=0.95).
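The percentile interval can be sketched as follows. This Python illustration makes several assumptions that the text does not state: positions are counted from 1 in the sorted replications, a position of 0 is clamped to 1, and observations are resampled with replacement. The names percentile_ci and bootstrap, the generic statistic argument, and the data are hypothetical; the program itself would recompute the accuracy estimates on each replication.

```python
import math
import random

def percentile_ci(replicates, alpha=0.05):
    """Percentile interval: floor/ceil positions in the sorted replicates."""
    s = sorted(replicates)
    n = len(s)
    # round() guards against floating-point noise such as 249.99999999999997.
    lo = max(math.floor(round(alpha / 2 * n, 9)), 1)      # 1-based position
    hi = min(math.ceil(round((1 - alpha / 2) * n, 9)), n)  # 1-based position
    return s[lo - 1], s[hi - 1]

def bootstrap(data, statistic, n_rep, seed=0):
    """Recompute the statistic on n_rep resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_rep)]

# Hypothetical absolute errors; the statistic here is the maximum error.
data = [0.1, 0.4, 0.2, 0.9, 0.3]
reps = bootstrap(data, max, n_rep=1000)
mean = sum(reps) / len(reps)
lower, upper = percentile_ci(reps)
```

With 100 sorted replications and alpha=0.05 this rule picks positions 2 and 98, since alpha/2*100 = 2.5 rounds down to 2 and (1-alpha/2)*100 = 97.5 rounds up to 98.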
## Usage
$ lein run sample-x1-x2-x3.csv cells-x1-x2-x3.csv 10000
## References:
[1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife.
Annals of Statistics, 7(1): 1-26. DOI:10.1214/aos/1176344552.
[2] Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap.
New York: Chapman and Hall.