The package hdpGLM makes it easy to estimate semi-parametric regression models, and summarize and visualize the results. The package is useful for many purposes:
The hdpGLM works similarly to linear regression models. It estimates coefficients of linear regressions, including generalized linear models, such as logit coefficients. At the same time, it searches for latent clusters in the data and estimates the linear coefficients for those clusters. The result of the estimation is K vectors of linear coefficients, one vector for each cluster. If there are no clusters in the data, it returns estimated coefficients similar to those of classical regression models.
The clustering procedure is based on a hierarchical semi-parametric Bayesian model proposed in Ferrari (2020). The model, called Hierarchical Dirichlet process Generalized Linear Models (hdpGLM), can be used to deal with latent heterogeneity in different situations, including those that emerge due to unobserved variables. It deals with the latent heterogeneity in two ways: (1) it finds latent clusters that are better described by different regression models, and (2) it estimates the coefficients of those models. The hdpGLM can also be used with hierarchical data to estimate latent heterogeneity in multiple contexts and check whether the clusters are context-dependent (see an example in section Estimating Context-dependent latent heterogeneity).
The linear model is estimated by sampling from the posterior distribution using a Gibbs sampler. Non-linear models (e.g., logit with binary outcomes) use Hamiltonian Monte Carlo within Gibbs. The algorithms are presented in Ferrari (2020).
Why should we estimate clusters of linear regressions instead of fitting a single regression model?
First, it improves the predictive performance of the regression model and keeps the interpretability of the regression coefficients. hdpGLM is much more flexible than traditional regression and produces monotonically lower root mean square error (see Ferrari (2020) for details).
Second, latent heterogeneity emerges when there are omitted variables in the estimation of regression models. The hdpGLM can be used to estimate marginal effects even when interactions were omitted. It recovers the linear coefficients of each latent group.
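The second point can be illustrated with a small base-R simulation. The sketch below does not use hdpGLM; the data, the variable names, and the slopes (2 and -1.5) are assumptions chosen only for illustration. It shows that a single pooled regression reports a slope somewhere between the two group-specific slopes, which hides the fact that the effect has opposite signs in the two groups.

```r
## Illustrative simulation (assumed values, not the package's data)
set.seed(123)
n <- 2000
gender     <- rep(c("female", "male"), each = n / 2)   # the variable we pretend is unobserved
inequality <- rnorm(n)
slope      <- ifelse(gender == "female", 2, -1.5)      # opposite effects by group
support    <- -4 + slope * inequality + rnorm(n)
sim <- data.frame(support, inequality, gender)

## one regression for everyone: the interaction with gender is omitted
pooled <- coef(lm(support ~ inequality, data = sim))["inequality"]

## one regression per group: only possible when gender is observed
by_group <- sapply(split(sim, sim$gender),
                   function(d) coef(lm(support ~ inequality, data = d))["inequality"])

print(pooled)
print(by_group)
```

With the group label observed, separate regressions recover both slopes. The hdpGLM targets the case in which that label is not available.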
The function hdpGLM estimates a semi-parametric Bayesian regression model. The syntax is similar to other R functions such as lm(), glm(), and lmer().
Here is a toy example. Suppose we are studying how income inequality affects support for policies that help alleviate poverty in a given country A. Suppose further that (1) the effect of inequality varies between groups of people: for some people, inequality increases support for welfare policies, but for others, it decreases welfare policy support; and (2) we don’t know which individual belongs to which group. The data set welfare contains simulated data for this example.
## loading and looking at the data
welfare = read.csv2('welfare.csv')
head(welfare)
#> support inequality income ideology
#> 1 -18.5649610 0.3392724 0.1425111 1.9225985
#> 2 -9.3905812 -0.9906646 -0.5117102 0.2483346
#> 3 0.9276234 -2.2318510 -0.3856288 -1.3619216
#> 4 -12.3594498 -3.0079501 -0.9440585 -0.2088675
#> 5 -2.4834411 0.1000455 0.8322192 0.1321378
#> 6 -11.4187853 -0.9543883 -0.8810503 0.2916444
Now, suppose that inequality increases support for welfare only among women, but it decreases support among men. We didn’t collect data on gender (male versus female). We could estimate the hdpGLM and recover the coefficients even if gender wasn’t observed. The package provides a function called hdpGLM, which estimates a semi-parametric Bayesian generalized linear model using a Dirichlet mixture. Let’s estimate the model. The example uses few iterations in the MCMC, but in real applications, one should use a much larger number.
library(hdpGLM)
#>
#> ## ===============================================================
#> ## Hierarchial Dirichlet Process Generalized Linear Model (hdpGLM)
#> ## ===============================================================
#>
#> Author: Diogo Ferrari
#> Usage : https://github.com/DiogoFerrari/hdpGLM
#>
#> Attaching package: 'hdpGLM'
#> The following object is masked _by_ '.GlobalEnv':
#>
#> welfare
## estimating the model
mcmc = list(burn.in=10,  ## MCMC burn-in period
            n.iter =500) ## number of MCMC iterations to keep
mod = hdpGLM(support ~ inequality + income + ideology, data=welfare,
             mcmc=mcmc)
## printing the outcome
summary(mod)
#>
#> --------------------------------
#> dpGLM model object
#>
#> Maximum number of clusters activated during the estimation: 12
#> Number of MCMC iterations: 500
#> burn-in: 10
#> --------------------------------
#>
#> Summary statistics of clusters with data points
#>
#> --------------------------------
#> Coefficients for cluster 1 (cluster label 1)
#>
#> Post.Mean Post.Median HPD.lower HPD.upper
#> 1 (Intercept) -3.8148162 -3.8164491 -3.8822155 -3.740043
#> 2 inequality -1.5221475 -1.5205087 -1.5963770 -1.457560
#> 3 income 3.8811833 3.8843698 3.8039093 3.935357
#> 4 ideology -8.2536625 -8.2555289 -8.3270029 -8.181216
#> 5 sigma 0.9845142 0.9857298 0.8896836 1.103283
#>
#> --------------------------------
#> Coefficients for cluster 2 (cluster label 3)
#>
#> Post.Mean Post.Median HPD.lower HPD.upper
#> 1 (Intercept) -3.867265 -3.8661659 -3.926398 -3.795238
#> 2 inequality 2.002829 2.0041347 1.939848 2.070552
#> 3 income 3.841769 3.8406393 3.780147 3.911862
#> 4 ideology -8.306318 -8.3085463 -8.380315 -8.225235
#> 5 sigma 1.000120 0.9967094 0.919052 1.090124
#>
#> --------------------------------
The summary function prints the result in a tidy format. The header of each block in the summary shows the label of the estimated cluster. The column Post.Mean is the average of the posterior distribution for each linear coefficient in each cluster.
The function classify can be used to classify the data points into clusters based on the estimation.
welfare_clustered = classify(welfare, mod)
head(welfare_clustered)
#> Cluster support inequality income ideology
#> 1 3 -18.5649610 0.3392724 0.1425111 1.9225985
#> 2 3 -9.3905812 -0.9906646 -0.5117102 0.2483346
#> 3 3 0.9276234 -2.2318510 -0.3856288 -1.3619216
#> 4 3 -12.3594498 -3.0079501 -0.9440585 -0.2088675
#> 5 1 -2.4834411 0.1000455 0.8322192 0.1321378
#> 6 3 -11.4187853 -0.9543883 -0.8810503 0.2916444
tail(welfare_clustered)
#> Cluster support inequality income ideology
#> 1995 1 -1.5230053 1.055855140 -0.7295937 -0.7067871
#> 1996 1 0.4814892 0.582588091 2.0051082 0.3090389
#> 1997 1 -14.1929956 0.391164197 -0.9607449 0.7765482
#> 1998 1 -8.2396789 0.074437376 1.2020300 1.0874928
#> 1999 1 -23.1583753 0.434223018 -0.6176438 2.0387294
#> 2000 3 -7.2075582 0.008355317 -0.4538951 0.2268072
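Once the observations carry a cluster label, ordinary data-frame tools can be used to inspect the groups. The sketch below is a minimal illustration in base R. It does not call classify(); instead it builds a small stand-in data frame (called clustered here, a hypothetical name) with a Cluster column laid out like the output above, and all values in it are simulated for illustration.

```r
## Stand-in for the output of classify(): same column layout as above,
## but the values are simulated for illustration only.
set.seed(42)
n <- 400
clustered <- data.frame(
  Cluster    = sample(c(1, 3), n, replace = TRUE),
  inequality = rnorm(n)
)
clustered$support <- ifelse(clustered$Cluster == 1, -1.5, 2) * clustered$inequality + rnorm(n)

## how many observations were assigned to each cluster
cluster_sizes <- table(clustered$Cluster)
print(cluster_sizes)

## average of each variable within each cluster
cluster_means <- aggregate(cbind(support, inequality) ~ Cluster,
                           data = clustered, FUN = mean)
print(cluster_means)
```

The same two calls should work on welfare_clustered, since it also stores the assigned label in a column named Cluster.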
There is a series of built-in functions, with various options, to plot the results. Two of those options are separate, which plots the posterior samples for each cluster separately, and ncols, which controls how many columns to use for the panels in the figure (to see more, run help(plot.hdpGLM) and help(plot.dpGLM)).
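A call using these two options might look like the sketch below. It assumes that mod is the fitted object from the estimation above and that the plot method accepts separate and ncols as described in the text; the value 4 for ncols is only illustrative. The call is guarded so that the snippet does nothing when the package or the fitted object is unavailable.

```r
## Plotting options named in the text; the value of ncols is illustrative
plot_args <- list(separate = TRUE, ncols = 4)

## `mod` is assumed to be the object returned by hdpGLM() above
if (requireNamespace("hdpGLM", quietly = TRUE) && exists("mod")) {
  do.call(plot, c(list(mod), plot_args))
}
```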
To continue the previous toy example, suppose that we are analyzing data from many countries, and we suspect that the latent heterogeneity is different in each country. The effect of inequality on support for welfare may be gender-specific only in some countries (contexts). Or maybe the way it is gender-specific varies from country to country. Suppose we didn’t have data on gender, but we collect information on countries’ gender gap in welfare provision. Let’s look at this new data set.
## loading and looking at the data
welfare = read.csv2('welfare2.csv')
head(welfare)
#> support inequality income ideology country gap
#> 1 -18.5649610 0.3392724 0.1425111 1.9225985 0 0.1
#> 2 -9.3905812 -0.9906646 -0.5117102 0.2483346 0 0.1
#> 3 0.9276234 -2.2318510 -0.3856288 -1.3619216 0 0.1
#> 4 -12.3594498 -3.0079501 -0.9440585 -0.2088675 0 0.1
#> 5 -2.4834411 0.1000455 0.8322192 0.1321378 0 0.1
#> 6 -11.4187853 -0.9543883 -0.8810503 0.2916444 0 0.1
tail(welfare)
#> support inequality income ideology country gap
#> 3195 0.3190583 -0.7504798 -0.7839583 0.92300705 4 -0.8280808
#> 3196 -1.3837239 0.6620435 -1.5566268 0.05634618 4 -0.8280808
#> 3197 -1.3820016 -0.4298706 -1.0945688 0.71559078 4 -0.8280808
#> 3198 0.6878775 0.5450604 2.6175887 -1.94844469 4 -0.8280808
#> 3199 -7.9282930 1.7846004 1.6755823 1.29160208 4 -0.8280808
#> 3200 -1.7472485 0.5030992 -0.5395479 0.20109879 4 -0.8280808
The variable country indicates the country (context) of the observation, and the variable gap the gender gap in welfare provision in the respective country. The estimation is similar to the previous example, but now there is a second formula for the context-level variables. Again, the example below uses few iterations in the MCMC, but in practical applications, one needs to increase that number.
## estimating the model
mcmc = list(burn.in=1,  ## MCMC burn-in period
            n.iter =50) ## number of MCMC iterations to keep
mod = hdpGLM(support ~ inequality + income + ideology,
             support ~ gap,
             data=welfare, mcmc=mcmc)
summary(mod)
#>
#> --------------------------------
#> hdpGLM Object
#>
#> Maximum number of clusters activated during the estimation: 1
#> Number of MCMC iterations: 50
#> Burn-in: 1
#>
#> Number of contexts : 5
#>
#> Number of clusters (summary across contexts):
#>
#> Average Std.Dev Median Min. Max.
#> 1 3.4 1.81659 3 1 6
#> --------------------------------
#>
#>
#> Summary statistics of clusters with data points in each context
#>
#> --------------------------------
#> Coefficients and clusters for context 1
#>
#> Post.Mean Post.Median HPD.lower HPD.upper Cluster
#> 1 (Intercept) -3.9703857 -3.9088602 -4.3136759 -3.7449763 1
#> 2 inequality -1.1678118 -1.4929021 -1.6892712 0.2069014 1
#> 3 income 3.8746050 3.8920977 3.6665169 4.0073460 1
#> 4 ideology -8.3003145 -8.3097882 -8.4225423 -8.1791706 1
#> 5 (Intercept) -2.7858157 -3.2766123 -3.9038344 0.1025201 10
#> 6 inequality -0.5949714 -0.7725635 -1.4366145 0.4298386 10
#> 7 income 3.8688530 3.8896672 3.4387990 4.1678861 10
#> 8 ideology -8.4181133 -8.4297295 -8.7376207 -8.0837074 10
#> 9 (Intercept) -4.0185349 -3.7977486 -8.4870297 -3.0357129 11
#> 10 inequality -1.7368642 -1.6590399 -2.9671067 -1.2723547 11
#> 11 income 3.6354299 3.8647564 -0.7181249 4.1446288 11
#> 12 ideology -8.0198111 -8.1340929 -8.8288753 -4.7604710 11
#> 13 (Intercept) -3.8560019 -3.8858519 -4.0897553 -2.6062820 13
#> 14 inequality 2.2345539 2.0360860 1.9614206 3.4594386 13
#> 15 income 3.6595638 3.8217045 1.8282796 3.9618816 13
#> 16 ideology -8.0209598 -8.3457199 -8.4180289 -4.4478659 13
#>
#> --------------------------------
#> Coefficients and clusters for context 2
#>
#> Post.Mean Post.Median HPD.lower HPD.upper Cluster
#> 1 (Intercept) 0.1620877 0.1418698 0.006242296 0.35545544 1
#> 2 inequality -0.5750285 -0.5393623 -0.752053679 -0.35212974 1
#> 3 income -0.2960043 -0.2889277 -0.441237773 -0.19598778 1
#> 4 ideology -1.8221645 -1.8273171 -1.977270382 -1.69788643 1
#> 5 (Intercept) 1.2980439 1.3858959 0.325120328 1.71530882 2
#> 6 inequality 1.3868443 1.2600503 0.981485715 2.79728055 2
#> 7 income 1.2302749 1.1823241 0.820918212 1.81309179 2
#> 8 ideology -1.2227460 -1.3003651 -1.660159449 0.12959421 2
#> 9 (Intercept) 1.5243006 1.2725827 0.653948159 5.11199032 3
#> 10 inequality 1.1074910 0.9703921 0.806177260 3.32317204 3
#> 11 income 1.1286699 0.8806087 0.536008597 6.75046047 3
#> 12 ideology -1.9807688 -1.8760201 -3.177512197 -1.67512670 3
#> 13 (Intercept) 0.8671735 0.9557446 -0.215199115 2.29154317 5
#> 14 inequality 0.1942800 0.2928247 -4.601062375 2.16643095 5
#> 15 income 0.8664719 0.8827436 0.066049967 2.22690581 5
#> 16 ideology -1.4699645 -1.1973285 -3.840152882 5.23861132 5
#> 17 (Intercept) 0.2424943 0.2922051 -0.581309242 1.04398439 6
#> 18 inequality 0.4749571 0.4725006 -0.489294235 1.26953955 6
#> 19 income 0.9564180 0.9731193 -0.092592190 1.69775576 6
#> 20 ideology -0.6065115 -0.6104234 -1.097744339 0.13863474 6
#> 21 (Intercept) -0.2244346 -0.2268487 -0.424459060 0.05456221 7
#> 22 inequality -1.3556697 -1.3066656 -2.224102139 -0.97491508 7
#> 23 income -0.5580188 -0.6336097 -0.943978617 0.31204573 7
#> 24 ideology -2.4791434 -2.3382237 -4.160265159 -2.13719249 7
#>
#> --------------------------------
#> Coefficients and clusters for context 4
#>
#> Post.Mean Post.Median HPD.lower HPD.upper Cluster
#> 1 (Intercept) 0.1489821 0.1359429 -0.006905322 0.50790734 2
#> 2 inequality -1.1078865 -1.1006213 -1.278098440 -0.89866613 2
#> 3 income -2.5300662 -2.5442136 -2.669712701 -2.31897443 2
#> 4 ideology -0.1360577 -0.1027092 -0.349021115 0.04589737 2
#>
#> --------------------------------
#> Coefficients and clusters for context 3
#>
#> Post.Mean Post.Median HPD.lower HPD.upper Cluster
#> 1 (Intercept) -0.05840748 -0.2726197 -0.4716340 1.49753949 3
#> 2 inequality -0.53135500 -0.5712764 -0.7566605 -0.07186253 3
#> 3 income -4.21120868 -4.1336759 -4.9820642 -3.90801613 3
#> 4 ideology 0.65857951 0.6082059 0.3266539 1.44840386 3
#> 5 (Intercept) 0.79997325 0.8776810 -0.4131583 1.89347650 4
#> 6 inequality 0.37912078 0.6269679 -0.9902230 1.29767813 4
#> 7 income -3.10885208 -3.0563370 -3.8242099 -2.73764690 4
#> 8 ideology 0.94088493 1.0004378 -0.1671213 2.18625227 4
#> 9 (Intercept) -0.17811647 -0.1573437 -0.5129998 0.10357201 5
#> 10 inequality 0.53324477 0.4807321 0.1375071 0.91825855 5
#> 11 income -3.08849597 -3.0540346 -3.9179301 -2.61128702 5
#> 12 ideology 2.29666857 2.2564111 2.0019751 2.65896424 5
#>
#> --------------------------------
#> Coefficients and clusters for context 5
#>
#> Post.Mean Post.Median HPD.lower HPD.upper Cluster
#> 1 (Intercept) 0.04984748 -0.1148491 -0.3749028 1.45411660 4
#> 2 inequality 0.84199218 0.9963766 -0.6860094 1.29938176 4
#> 3 income -0.06235565 -0.2193475 -0.6079892 2.16749029 4
#> 4 ideology -1.84745003 -1.9158276 -2.2302656 -1.07440839 4
#> 5 (Intercept) 0.75914853 0.3726648 -0.1657321 3.46494744 6
#> 6 inequality -3.16889752 -3.1356287 -4.0815019 -2.56581412 6
#> 7 income -0.56827206 -0.5695946 -1.1303822 0.02114409 6
#> 8 ideology -1.84604021 -2.2094811 -2.5353045 2.01061716 6
#> 9 (Intercept) -0.16620031 0.1427532 -4.7978466 0.61366963 8
#> 10 inequality -2.34857790 -2.3204492 -2.9144084 -1.83436620 8
#> 11 income 0.20717495 0.1923471 -0.3082516 0.88522169 8
#> 12 ideology -2.21262923 -2.3064404 -2.7686858 -0.45375310 8
#>
#> --------------------------------
#> Context-level coefficients:
#> Description Post.Mean HPD.lower HPD.upper
#> 1 Intercept of beta[0] -0.34771184 -2.915214 2.123203
#> 2 Intercept of beta[1] -0.21221191 -2.673362 1.748879
#> 3 Intercept of beta[2] -0.23624496 -2.596123 1.440000
#> 4 Intercept of beta[3] -1.04754888 -3.775422 1.133972
#> 5 Effect of gap on beta[0] -0.30462827 -2.399219 1.465704
#> 6 Effect of gap on beta[1] 0.05611329 -1.843939 2.449528
#> 7 Effect of gap on beta[2] -0.61543082 -3.681783 1.609011
#> 8 Effect of gap on beta[3] 0.77982256 -2.283581 2.589473
#>
#> --------------------------------
The summary contains more information now. The column Cluster indicates the estimated clusters, and each block of coefficients refers to one country (context). The second part of the summary ($tau), printed under "Context-level coefficients", shows the marginal effect of the context-level feature (gap). Details of the interpretation can be found in Ferrari (2020).
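One way to read the context-level coefficients is as a regression of each linear coefficient on the context-level feature. The sketch below is plain arithmetic in base R under that assumption: it takes the posterior means printed above for beta[1] (which we take to be the coefficient on inequality, since beta[0] would be the intercept) and computes the expected value of that coefficient for two values of gap that appear in the data. The variable names are hypothetical, and the linear form is an assumption; see Ferrari (2020) for the exact model.

```r
## Posterior means copied from the "Context-level coefficients" table above
tau_intercept_beta1 <- -0.21221191  # Intercept of beta[1]
tau_gap_beta1       <-  0.05611329  # Effect of gap on beta[1]

## two values of gap that appear in the data shown above
gap <- c(0.1, -0.8280808)

## assumed linear context-level model:
## E[beta_1 | gap] = intercept + effect * gap
expected_beta1 <- tau_intercept_beta1 + tau_gap_beta1 * gap
print(expected_beta1)
```

Because the HPD intervals for these coefficients are wide and include zero in this short run, the computed values mainly illustrate the mechanics rather than a substantive finding.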
There is a series of built-in functions to visualize the output. The function plot_tau() displays the estimation of the effect of the context-level variables.
The function plot_pexp_beta() displays the association between the context-level features and the latent heterogeneity in the effect of the linear coefficients in each context. The parameter smooth.line plots a line representing the linear association between the context-level feature (gap) and the posterior averages of the marginal effects in each cluster. The parameter ncol.beta controls the number of columns in the figure for the panels. For more options, see help(plot_pexp_beta).
plot_pexp_beta(mod, smooth.line=TRUE, ncol.beta=2)
#>
#>
#> Generating plots ...
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the hdpGLM package.
#> Please report the issue at <https://github.com/DiogoFerrari/hdpGLM/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> `geom_smooth()` using formula = 'y ~ x'
#> `geom_smooth()` using formula = 'y ~ x'