The R package ‘modelIntegration’ aggregates several probability distributions into a single integrated one. Suppose that several independent methods are used to observe a deterministic element and that each method represents it as a probability distribution. We then have a family of probability distributions providing alternative descriptions of the same object, and the problem is how to combine the information from these prior estimates. The package implements the posterior integration method [Kryazhimskiy, 2013]. For comparison, it also implements simple averaging of the input distributions.
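Posterior integration combines the priors into their normalized pointwise product, \begin{equation} p(z)=\frac{p_1(z)p_2(z)\cdots p_n(z)}{\sum_{z'\in Z}p_1(z')p_2(z')\cdots p_n(z')} \end{equation} where \(p_1,p_2,\dots,p_n\) are the prior distributions on \(Z\) associated with the methods \(1,\dots,n\), and \(Z\) is a finite set with more than one element. The denominator must be non-zero, that is, the priors must share at least one outcome of positive probability.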
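As an illustration, here is a minimal base-R sketch of posterior integration, assuming it amounts to the pointwise product of the priors renormalized to sum to one. This assumption is consistent with the product(example) output shown later in this document, where 0.75·0.75/(0.75·0.75 + 0.25·0.25) = 0.9. The helper name posterior_product is hypothetical and is not part of modelIntegration.

```r
# Hypothetical helper (not part of modelIntegration): normalized pointwise
# product of prior distributions given on the same finite set of outcomes.
posterior_product <- function(priors) {
  stopifnot(length(priors) >= 1)
  prod_z <- Reduce(`*`, priors)  # pointwise product p_1(z) * ... * p_n(z)
  total <- sum(prod_z)           # normalizing constant
  if (total == 0) stop("the priors share no outcome of positive probability")
  prod_z / total
}

# Two illustrative priors on a three-element set (made-up numbers)
p1 <- c(0.2, 0.5, 0.3)
p2 <- c(0.1, 0.6, 0.3)
integrated <- posterior_product(list(p1, p2))
print(integrated)
```

Outcomes on which the priors agree are reinforced, so the integrated distribution is typically more concentrated than any single prior.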
Alternatively, prior estimates can be combined using simple averaging. This approach describes the distribution of outcomes of a random test in which one of the priors is chosen at random with probability \(1/n\) and an outcome is then drawn from the chosen prior. Namely, \begin{equation} p(z)=\frac{p_1(z)+p_2(z)+ \dots +p_n(z)}{n} \end{equation}

To explore the basic usage of modelIntegration, we’ll start with the built-in forest_npp and forest_npp90 data frames. These datasets contain probability distribution tables for the net primary production (NPP) of forest ecosystems in seven bioclimatic zones in Russia, reported in [Kryazhimskiy et al., 2015]. The documentation of the datasets is available via the ?forest_npp and ?forest_npp90 calls.
dim(forest_npp)
## [1] 1131 17
colnames(forest_npp)
##  [1] "npp"                        "LEA_Tundra"
##  [3] "LEA_Tundra_Northern_Taiga"  "LEA_Middle_Taiga"
##  [5] "LEA_Southern_Taiga"         "LEA_Temperate"
##  [7] "LEA_Steppe"                 "LEA_Deserts"
##  [9] "LEA_Total"                  "DGVM_Tundra"
## [11] "DGVM_Tundra_Northern_Taiga" "DGVM_Middle_Taiga"
## [13] "DGVM_Southern_Taiga"        "DGVM_Temperate"
## [15] "DGVM_Steppe"                "DGVM_Deserts"
## [17] "DGVM_Total"
The main function of the modelIntegration package is integrate. It accepts several representations of probability distributions. Discrete distributions are supplied through the pdfs argument, which supports a ‘table-based’ format. A continuous distribution is discretized using its cdf, supplied in cdfs. In this case, each bin center equals the value of the corresponding outcome, and the bin width is determined from the subsequent outcome values in the range. The common range of the random variables (shared by all prior distributions) is set in the vals argument.
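The discretization step can be sketched in base R as follows. This is only one plausible reading of the description above, not the package's actual code: it assumes that bin edges lie halfway between neighbouring outcome values, that the two outer bins extend by half the adjacent spacing, and that the resulting probabilities are renormalized. The helper name discretize_cdf and the grid are hypothetical.

```r
# Hypothetical helper (not part of modelIntegration): turn a cdf into a
# table of probabilities on the outcome grid `vals` (sorted, length >= 2).
discretize_cdf <- function(vals, cdf) {
  n <- length(vals)
  h <- diff(vals)  # spacing between neighbouring outcome values
  # Assumed bin edges: midpoints between outcomes, outer edges extended
  # by half of the adjacent spacing
  edges <- c(vals[1] - h[1] / 2,
             vals[-n] + h / 2,
             vals[n] + h[n - 1] / 2)
  p <- diff(cdf(edges))  # probability mass falling in each bin
  p / sum(p)             # renormalize the mass truncated outside the grid
}

vals <- seq(0, 400, by = 10)  # illustrative grid, not taken from forest_npp90
p <- discretize_cdf(vals, function(x) pnorm(x, mean = 202, sd = 52))
print(head(data.frame(x = vals, prob = p)))
```

Once discretized this way, a continuous prior can be handled exactly like a table-based one, since both are probability vectors over the same vals.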
example1 <- integrate(
  vals = forest_npp[, 1],
  pdfs = as.list(forest_npp[c("LEA_Tundra", "DGVM_Tundra")]))
summary(example1)
##        Product  Average
## mean 189.29034 213.6184
## std   42.78502  74.0616
example2 <- integrate(
  vals = forest_npp90[, 1],
  pdfs = as.list(forest_npp90["LEA_Tundra"]),
  cdfs = list("DGVM_Tundra" = function(x) pnorm(x, mean = 202, sd = 52)))
summary(example2)
##        Product   Average
## mean 183.73562 212.92005
## std   43.87124  79.16872
The two integrated estimates can be accessed with the product and average calls, respectively. The package also provides descriptive statistics for the integrated distributions and the priors through the statistics call.
example <- integrate(c(1, 2), list(c(0.75, 0.25), c(0.75, 0.25)))
product(example)
##   x prob
## 1 1  0.9
## 2 2  0.1
average(example)
##   x prob
## 1 1 0.75
## 2 2 0.25
statistics(example)
##             P1        P2 Product   Average
## mean 1.2500000 1.2500000     1.1 1.2500000
## std  0.4330127 0.4330127     0.3 0.4330127
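The numbers in this small example can be checked by hand in base R, without the package. The sketch below assumes that the reported std is the probability-weighted (population) standard deviation, which matches the values printed above. The helper name dist_stats is hypothetical.

```r
vals <- c(1, 2)
p1 <- c(0.75, 0.25)
p2 <- c(0.75, 0.25)

prod_p <- p1 * p2 / sum(p1 * p2)  # normalized product: 0.5625 / 0.625 and 0.0625 / 0.625
avg_p  <- (p1 + p2) / 2           # simple average of the two priors

# Hypothetical helper: mean and probability-weighted standard deviation
# of a discrete distribution p on the outcomes x
dist_stats <- function(x, p) {
  m <- sum(x * p)
  c(mean = m, std = sqrt(sum(p * (x - m)^2)))
}

print(rbind(Product = dist_stats(vals, prod_p),
            Average = dist_stats(vals, avg_p)))
```

Because the two priors coincide here, averaging returns the prior unchanged, whereas the product sharpens it from 0.75 to 0.9 on the first outcome.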
[1] Kryazhimskiy, A.V. (2013). Posterior integration of independent stochastic estimates. IIASA Interim Report. IR-13-006.
[2] Kryazhimskiy, A.V. (2016). Posteriori integration of probabilities. Elementary theory. Theory of Probability and its Applications, 60(1): 62-87.
[3] Kryazhimskiy, A., Rovenskaya, E., Shvidenko, A., Gusti, M., Shchepashchenko, D. & Veshchinskaya, V. (2015). Towards harmonizing competing models: Russian forests’ net primary production case study. Technological Forecasting & Social Change, 98: 245-254.