regression-tests.csv

bootstrap-accuracy

(bootstrap-accuracy path values-path do-resampling)
Runs bootstrapping to estimate accuracy.
    path: A path to the source csv file with the original
          sample.
    values-path: A path to the csv file with values
                 from another sample.
    do-resampling: Method to get random resampling with
                   replacement.
 returns: A hash-map with csv file names and content.

 Accuracy (single value):
        pho_j=|y~_j-y^_j|,
        y^_j=X'_j*b.

        y~_j: [N x 1] vector of the true observed values (in [values-path]).
        y^_j: [N x 1] vector of the fitted values from the model.
        X'_j: [N x (p+1)] matrix of the explanatory variables (in [values-path]).
        N: number of observations (in [values-path]).

 Accuracy (in subset):
        pho(k,S)=argmin_(pho- >=0)[#{pho_i <= pho- | i in S} >= km].

        pho(k,S): the (100 x k)-th percentile of the accuracy sample.
        k: belongs to [0,1].
        S: subset of values (subset in [values-path]).
        m: number of values in S.
        #: denotes the number of elements in the set.

 Accuracy estimates: pho(Q_1,S)=pho(0.25,S),
                     pho(Q_2,S)=pho(0.50,S),
                     pho(Q_3,S)=pho(0.75,S),
                     pho_max(S)=pho(1,S).

 Accuracy estimates are calculated for each group in [values-path].
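
The definitions above can be illustrated with a minimal sketch. The function names `abs-errors` and `accuracy-percentile` and the toy numbers are hypothetical and are not part of the library; the sketch assumes 0 < k <= 1 and a non-empty subset S.

```clojure
(defn abs-errors
  "pho_j = |y~_j - y^_j| for paired observed and fitted values."
  [observed fitted]
  (mapv (fn [y y-hat] (Math/abs (double (- y y-hat)))) observed fitted))

(defn accuracy-percentile
  "Smallest error pho- such that at least k*m of the m errors are <= pho-.
   Assumes 0 < k <= 1 and a non-empty collection of errors."
  [k errors]
  (let [sorted (vec (sort errors))
        m      (count sorted)
        ;; 1-based rank ceil(k*m), kept within [1, m]
        rank   (-> (Math/ceil (* k m)) long (max 1) (min m))]
    (nth sorted (dec rank))))

;; Toy observed and fitted values (illustrative only).
(def errors (abs-errors [1.0 2.0 1.1 1.4] [1.2 1.5 1.1 2.4]))

(accuracy-percentile 0.25 errors) ; pho(Q_1,S) => 0.0
(accuracy-percentile 0.75 errors) ; pho(Q_3,S) => 0.5
(accuracy-percentile 1    errors) ; pho_max(S) => 1.0
```

Taking the element at rank ceil(k*m) of the sorted errors is one direct reading of the argmin definition; the library may handle ties or k=0 differently.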

 Confidence intervals (bootstrapping):
    estimates:
            pho(Q_1), pho(Q_2), pho(Q_3), pho_max
            all calculated after resampling with replacement.
    out: mean with 95% percentile confidence interval.

    bootstrap scheme: percentile bootstrap (Efron & Tibshirani, 1993),
                      left border - value at position of the largest
                      integer not greater than alpha/2*[n-rep],
                      right border - value at position of the smallest
                      integer not less than (1-alpha/2)*[n-rep].

    significance level: alpha=0.05 (95% confidence level).
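
A minimal sketch of the percentile scheme described above. The name `percentile-ci` is hypothetical, and treating positions as 1-based and clamping them to [1, n-rep] is an assumption; with it, the sketch reproduces the p-25-percent row of the usage example from the quartile-1 column of accuracy-sample.csv.

```clojure
(defn percentile-ci
  "Percentile interval and mean for a collection of bootstrap statistics.
   Positions are 1-based and clamped to [1, n-rep] (an assumption)."
  [alpha stats]
  (let [sorted (vec (sort stats))
        n-rep  (count sorted)
        clamp  (fn [pos] (-> pos long (max 1) (min n-rep)))
        ;; largest integer not greater than alpha/2 * n-rep
        left   (clamp (Math/floor (* (/ alpha 2) n-rep)))
        ;; smallest integer not less than (1 - alpha/2) * n-rep
        right  (clamp (Math/ceil (* (- 1 (/ alpha 2)) n-rep)))]
    {:lower (nth sorted (dec left))
     :upper (nth sorted (dec right))
     :mean  (/ (reduce + sorted) n-rep)}))

;; quartile-1 column of accuracy-sample.csv from the usage example
(percentile-ci 0.05 [0.732594 0.097270 0.158486 0.272683 0.269425])
;; :lower is 0.09727, :upper is 0.732594, :mean is approximately 0.306092
```

With only five replications the interval is simply the minimum and maximum of the sample; the borders move inward as the number of replications grows.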

## Usage

    (require '[regression-tests.csv :refer :all])

    ;;   sample.csv
    ;; y,x1,x2,x3
    ;; 1.1,1,1,4
    ;; 1,2,3,2
    ;; 0.95,2,2,3
    ;; 1.15,1.5,1.5,1.5
    ;; 2.1,3,3.1,5
    ;; 2.05,3.5,3,5.5
    ;; 3,4,3,6
    ;; 3.01,3.8,2.5,6.3
    ;; 3.02,3.9,2.7,6.5
    ;; 2.9,4.2,3.4,6

    ;;   values.csv
    ;; y,urban,x1,x2,x3
    ;; 1,1,1,1,1
    ;; 2,1,1,1,1
    ;; 1.1,1,1.1,1.3,1.1
    ;; 1.4,1,2,1,3
    ;; 3,1,3,1,2
    ;; 2.1,1,1,1,2
    ;; 2.4,1,3,1,3
    ;; 1.3,1,1,0,1
    ;; 1.5,1,2,1,0
    ;; 3.4,1,3,4,1
    ;; 0.1,1,3,1,1
    ;; 0,1,1,1,1
    ;; 1,1,1,1,1

    (bootstrap-accuracy "sample.csv"
                        "values.csv"
                        (fn [indexes]
                          (->> (map #(map (partial nth indexes) %&)
                                    [0 1 2 0 4 5 6 7 8 6]
                                    [0 1 5 3 5 5 6 7 0 9]
                                    [0 1 2 3 5 5 6 7 8 9]
                                    [0 1 2 3 4 5 8 7 8 9])
                               (apply mapv vector))))
    => {:ci
        {"accuracy-bootstrap.csv"
         '(["id" "group-id" "95-percent-ci-1" "95-percent-ci-2" "mean"]
           ["p-25-percent-1" "1" "0.097270" "0.732594" "0.306091"]
           ["p-25-percent-all" "all" "0.097270" "0.732594" "0.306091"]
           ["p-50-percent-1" "1" "0.391024" "0.914979" "0.569725"]
           ["p-50-percent-all" "all" "0.391024" "0.914979" "0.569725"]
           ["p-75-percent-1" "1" "1.020029" "1.450733" "1.226903"]
           ["p-75-percent-all" "all" "1.020029" "1.450733" "1.226903"]
           ["p-max-1" "1" "2.215106" "3.011073" "2.525272"]
           ["p-max-all" "all" "2.215106" "3.011073" "2.525272"]
           ["p-min-1" "1" "0.078317" "0.128894" "0.098244"]
           ["p-min-all" "all" "0.078317" "0.128894" "0.098244"])}
        :samples
        {"accuracy-sample.csv"
         '(["min" "quartile-1" "quartile-2" "quartile-3" "max"]
           ["0.097507" "0.732594" "0.914979" "1.450733" "2.466349"]
           ["0.094261" "0.097270" "0.674088" "1.020029" "3.011073"]
           ["0.128894" "0.158486" "0.474098" "1.131372" "2.708930"]
           ["0.092243" "0.272683" "0.394438" "1.264116" "2.224903"]
           ["0.078317" "0.269425" "0.391024" "1.268268" "2.215106"])}}
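
The fixed index vectors in the call above keep the example reproducible. For real use, a randomized do-resampling function might look like the sketch below. The name `make-resampler` is hypothetical, and the contract (a collection of row indexes in, one index vector per replication out) is inferred from the usage example rather than stated by the library.

```clojure
(defn make-resampler
  "Returns a function that draws n-rep resamples with replacement
   from the collection of indexes it is given."
  [n-rep]
  (fn [indexes]
    (let [pool (vec indexes)
          n    (count pool)]
      ;; one vector of n randomly chosen indexes per replication
      (vec (repeatedly n-rep
                       (fn [] (vec (repeatedly n #(rand-nth pool)))))))))

;; Four resamples of ten row indexes; the result is random.
((make-resampler 4) (range 10))

;; Hypothetical call with 1000 replications:
;; (bootstrap-accuracy "sample.csv" "values.csv" (make-resampler 1000))
```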


## References
    [1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife.
        Annals of Statistics, 7(1): 1-26. DOI: 10.1214/aos/1176344552.
    [2] Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap.
        New York: Chapman and Hall.

bootstrap-independence-tests

(bootstrap-independence-tests path neighbours-path do-resampling)
Runs bootstrapping to get estimates for the independence tests.
    path: A path to the source csv file with the original
          sample.
    neighbours-path: A path to the source csv file with
                     neighbours data.
    do-resampling: Method to get random resampling with
                   replacement.
 returns: A hash-map with csv file names and content.

 Bootstrap hypothesis testing on spatial autocorrelation

     estimates: Moran's I (Moran, 1950), Geary's C (Geary, 1954) coefficients.
     out: mean with 95% percentile confidence interval, p-value.

     bootstrapping: each bootstrap sample (pairs) is drawn from the original residuals with replacement.

     significance level: alpha=0.05 (95% confidence level).

     The equal-tail p-value of the two-tailed test is twice the minimum of
     1) the proportion of bootstrap statistics less than or equal to the test statistic
     (computed on the original sample) and 2) the proportion of bootstrap statistics greater
     than the test statistic (computed on the original sample). Equal-tailed means that
     the probability of a value falling in the left tail of the interval equals the probability
     of it falling in the right tail (Efron and Tibshirani, 1993).
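
The p-value rule can be sketched as follows. The name `equal-tail-p-value` and the input numbers are hypothetical; how the library centres or pools its bootstrap statistics is not specified here, so this is not expected to reproduce the p-values in the usage example.

```clojure
(defn equal-tail-p-value
  "2 * min(share of bootstrap statistics <= t,
           share of bootstrap statistics >  t)."
  [t boot-stats]
  (let [n     (count boot-stats)
        lower (/ (count (filter #(<= % t) boot-stats)) n)
        upper (/ (count (filter #(> % t) boot-stats)) n)]
    (double (* 2 (min lower upper)))))

;; Toy data: 2 of 10 values are <= 0.3 and 8 are greater,
;; so the result is 2 * min(0.2, 0.8) = 0.4.
(equal-tail-p-value 0.3 [0.1 0.2 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1])
```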

     The original neighbours matrix of spatial proximity is row-normalized: each weight in
     row i is divided by the number of neighbours of the i-th observation.
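
A minimal sketch of the row normalization together with the textbook forms of Moran's I and Geary's C. The function names and the four-point toy data are hypothetical, and the exact scaling used by the library is an assumption. Observations are keyed by id in maps because the neighbours file refers to ids rather than row positions.

```clojure
(defn row-normalize
  "Turns {i [neighbour ids]} into {i {j w_ij}},
   with w_ij = 1 / (number of neighbours of i)."
  [neighbours]
  (into {}
        (for [[i js] neighbours
              :when (seq js)]
          [i (zipmap js (repeat (/ 1.0 (count js))))])))

(defn morans-i-and-gearys-c
  "values: {id value}; weights: output of row-normalize."
  [values weights]
  (let [n     (count values)
        mean  (/ (reduce + (vals values)) n)
        dev   (fn [i] (- (values i) mean))
        ss    (reduce + (map #(let [d (dev %)] (* d d)) (keys values)))
        pairs (for [[i row] weights, [j w] row] [i j w])
        s0    (reduce + (map last pairs))   ; sum of all weights
        cross (reduce + (map (fn [[i j w]] (* w (dev i) (dev j))) pairs))
        diff  (reduce + (map (fn [[i j w]]
                               (let [d (- (values i) (values j))] (* w d d)))
                             pairs))]
    {:morans-i (* (/ n s0) (/ cross ss))
     :gearys-c (* (/ (dec n) (* 2 s0)) (/ diff ss))}))

;; Four points on a line with values 1..4 (toy data).
(def toy-weights (row-normalize {0 [1], 1 [0 2], 2 [1 3], 3 [2]}))
(morans-i-and-gearys-c {0 1.0, 1 2.0, 2 3.0, 3 4.0} toy-weights)
;; :morans-i is approximately 0.4, :gearys-c is approximately 0.3
```

In this toy case neighbouring values are similar, so Moran's I is positive and Geary's C is below 1, both indicating positive spatial autocorrelation.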

## Usage

    (require '[regression-tests.csv :refer :all])

    ;;   sample.csv
    ;; y,x1,x2,x3
    ;; 1.1,1,1,4
    ;; 1,2,3,2
    ;; 0.95,2,2,3
    ;; 1.15,1.5,1.5,1.5
    ;; 2.1,3,3.1,5
    ;; 2.05,3.5,3,5.5
    ;; 3,4,3,6
    ;; 3.01,3.8,2.5,6.3
    ;; 3.02,3.9,2.7,6.5
    ;; 2.9,4.2,3.4,6

    ;;   neighbours.csv
    ;; y,n1,n2,n3,n4
    ;; 0,1,5,,
    ;; 1,0,,,
    ;; 2,,,
    ;; 33,9,,,
    ;; 4,,,
    ;; 5,0,7,8,9
    ;; 6,,,
    ;; 7,5,,,
    ;; 8,5,,,
    ;; 9,33,5,,


    (bootstrap-independence-tests "sample.csv"
                                  "neighbours.csv"
                                  (fn [indexes]
                                    (->> (map #(map (partial nth indexes) %&)
                                              [0 1 2 0 4 5 6 7 8 6]
                                              [0 1 5 3 5 5 6 7 0 9]
                                              [0 1 2 3 5 5 6 7 8 9]
                                              [0 1 2 3 4 5 8 7 8 9])
                                         (apply mapv vector))))
    => {"independence-tests-bootstrap.csv"
        '(["statistics" "95-percent-ci-1" "95-percent-ci-2" "mean" "p-value"]
          ["morans-i-test" "-0.497198" "-0.184985" "-0.340847" "0.400000"]
          ["geary-c-test" "0.953672" "2.352850" "1.514549" "0.800000"])
        "morans-i-test-sample.csv"
        '(["value"]
          ["-0.453250"]
          ["-0.497198"]
          ["-0.184985"]
          ["-0.298643"]
          ["-0.270161"])
        "geary-c-test-sample.csv"
        '(["value"]
          ["2.352850"]
          ["1.380878"]
          ["0.953672"]
          ["1.457173"]
          ["1.428174"])}

## References
    [1] Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap.
        New York: Chapman and Hall.
    [2] Geary, R. (1954). The Contiguity Ratio and Statistical Mapping.
        The Incorporated Statistician, 5(3): 115-145. DOI: 10.2307/2986645.
    [3] Moran, P. (1950). Notes on Continuous Stochastic Phenomena.
        Biometrika, 37(1-2): 17-23. DOI: 10.2307/2332142.
    [4] Lin, K.-P., Long, Z.-H., & Ou, B. (2011). The Size and Power of Bootstrap Tests for Spatial Dependence in a Linear Regression Model.
        Computational Economics, 38(2): 153-171. DOI: 10.1007/s10614-010-9224-0.

bootstrap-regression

(bootstrap-regression path do-resampling)
Runs bootstrapping to obtain estimates
 from the regression model.
    path: A path to the source csv file with the original
          sample.
    do-resampling: Method to get random resampling with
                   replacement.
 returns: A hash-map with csv file names and content.

 Confidence intervals (bootstrapping):
    estimates:
            b_0, b_1, ..., b_p;
            R-square, MSE (mean square error).
    out: mean with 95% percentile confidence interval.

    bootstrap scheme: percentile bootstrap (Efron & Tibshirani, 1993),
                      left border - value at position of the largest
                      integer not greater than alpha/2*[n-rep],
                      right border - value at position of the smallest
                      integer not less than (1-alpha/2)*[n-rep].

## Usage

    (require '[regression-tests.csv :refer :all])

    ;;   sample.csv
    ;; y,x1,x2,x3
    ;; 1.1,1,1,4
    ;; 1,2,3,2
    ;; 0.95,2,2,3
    ;; 1.15,1.5,1.5,1.5
    ;; 2.1,3,3.1,5
    ;; 2.05,3.5,3,5.5
    ;; 3,4,3,6
    ;; 3.01,3.8,2.5,6.3
    ;; 3.02,3.9,2.7,6.5
    ;; 2.9,4.2,3.4,6

    (bootstrap-regression "sample.csv"
                          (fn [indexes]
                            (->> (map #(map (partial nth indexes) %&)
                                      [0 1 2 0 4 5 6 7 8 6]
                                      [0 1 5 3 5 5 6 7 0 9]
                                      [0 1 2 3 5 5 6 7 8 9]
                                      [0 1 2 3 4 5 8 7 8 9])
                                 (apply mapv vector))))
    => {"regression-stat-bootstrap.csv"
        '(["statistics" "95-percent-ci-1" "95-percent-ci-2" "mean"]
          ["x1" "0.509545" "1.008406" "0.810853"]
          ["x2" "-0.636473" "-0.117615" "-0.410165"]
          ["x3" "-0.014289" "0.262771" "0.098158"]
          ["intercept" "-0.387295" "0.736617" "0.238792"]
          ["r-squared" "0.940166" "0.957364" "0.948597"]
          ["mse" "0.051196" "0.074844" "0.062514"])}


## References
    [1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife.
        Annals of Statistics, 7(1): 1-26. DOI: 10.1214/aos/1176344552.
    [2] Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap.
        New York: Chapman and Hall.

run-permutation-tests

(run-permutation-tests path do-shuffle)
Runs permutation tests.
    path: A path to the source csv file with the original
          sample.
    do-shuffle: Method to get random permutations.
 returns: A hash-map with csv file names and content.

 Hypothesis testing (permutation tests):
    1) Overall model significance - exact permutation test on R-square.
        H0: b_1=b_2=...=b_p=0.

        out: approximate p-value calculated from the permutations,
              with a 95% normal-approximation interval.

    2) Significance of the i-th coefficient - approximate permutation
       test (Freedman & Lane, 1983) on t-statistic.
        H0: b_i=0.

        out: approximate p-value calculated from the permutations,
              with a 95% normal-approximation interval.
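
A minimal sketch of a 95% normal-approximation interval for a p-value. The name `p-value-interval` is hypothetical. The critical value z = 1.96 and the choice of n are assumptions inferred from the usage example: with n = 4, the number of values in permutation_r2_sample.csv, the sketch reproduces the bounds of the overall-test-r2 row.

```clojure
(defn p-value-interval
  "p +/- z * sqrt(p * (1 - p) / n), with z = 1.96 for a 95% interval."
  [p n]
  (let [z          1.96
        half-width (* z (Math/sqrt (/ (* p (- 1 p)) n)))]
    [(- p half-width) (+ p half-width)]))

(p-value-interval 0.25 4)
;; approximately [-0.174352 0.674352]
```

The interval is not truncated to [0, 1], which is why the lower bound can be negative when the number of permutations is small.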

## Usage

    (require '[regression-tests.csv :refer :all])

    ;;   sample.csv
    ;; y,x1,x2,x3
    ;; 1.1,1,1,4
    ;; 1,2,3,2
    ;; 0.95,2,2,3
    ;; 1.15,1.5,1.5,1.5
    ;; 2.1,3,3.1,5
    ;; 2.05,3.5,3,5.5
    ;; 3,4,3,6
    ;; 3.01,3.8,2.5,6.3
    ;; 3.02,3.9,2.7,6.5
    ;; 2.9,4.2,3.4,6

    (run-permutation-tests "sample.csv"
                           (fn [indexes]
                             (->> (map #(map (partial nth indexes) %&)
                                       [0 2 1 8 3 4 5 6 9 7]
                                       [5 0 1 3 2 4 8 6 7 9]
                                       [0 2 3 5 4 1 6 7 8 9])
                                  (apply mapv vector))))
    => {:tests
        {"permutation_tests.csv"
         '(["test" "p-value" "lower-bound-ci" "upper-bound-ci"]
           ["overall-test-r2" "0.250000" "-0.174352" "0.674352"]
           ["x1-test-t-stat" "0.000000" "0.000000" "0.000000"]
           ["x2-test-t-stat" "0.000000" "0.000000" "0.000000"]
           ["x3-test-t-stat" "0.250000" "-0.174352" "0.674352"])}
        :samples
        {"permutation_r2_sample.csv"
         '(["value"]
           ["0.946715"]
           ["0.748369"]
           ["0.797480"]
           ["0.699669"])}}
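
The fixed permutations in the call above keep the example reproducible. A randomized do-shuffle function might look like the sketch below. The name `make-shuffler` is hypothetical, and the contract (a collection of row indexes in, one permuted index vector per permutation out) is inferred from the usage example rather than stated by the library.

```clojure
(defn make-shuffler
  "Returns a function that produces n-perm random permutations
   of the collection of indexes it is given."
  [n-perm]
  (fn [indexes]
    ;; clojure.core/shuffle returns a randomly permuted vector
    (vec (repeatedly n-perm #(shuffle indexes)))))

;; Three random permutations of ten row indexes.
((make-shuffler 3) (range 10))

;; Hypothetical call with 999 permutations:
;; (run-permutation-tests "sample.csv" (make-shuffler 999))
```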

## References
    [1] Anderson, M. (2001). Permutation tests for univariate or multivariate analysis of variance and regression.
        Canadian Journal of Fisheries and Aquatic Sciences, 58(3): 626-639. DOI: 10.1139/f01-004.
    [2] Freedman, D., & Lane, D. (1983). A Nonstochastic Interpretation of Reported Significance Levels.
        Journal of Business & Economic Statistics, 1(4): 292-298. DOI: 10.2307/1391660.