Abstracts and Proceedings, 2018 Stata User Group Meeting

*Literate programming: Using log2markup, basetable, and matrixtools*

Niels Henrik Bruun, Aarhus University

During the last decade, there have been several attempts to integrate comments and statistical output in Stata, underscoring the importance of literate programming. I present a recent development based on three integrated packages: log2markup, basetable, and matrixtools.

log2markup transforms a commented log file into a document in a markup language of the user's choice, such as LaTeX, HTML, or Markdown. One of the features of log2markup is that it reads the output of Stata commands as part of the markup language itself.

One command where this is beneficial is basetable, an interactive command that makes it easy to build the typical first or base table of data summaries, for example, for articles. The output can be set to match the style of the markup language used in the comments. I briefly demonstrate its usability.

Another set of Stata commands I will present is in the Stata package matrixtools. Here, the basic command matprint makes it easy to print matrix content in the desired markup style. Several other matrixtools commands use matprint, such as sumat, an extension of the Stata command summarize. sumat computes summary statistics, including new ones such as the number of unique values, and returns all results in a matrix (also for string variables). Statistics can be grouped by a categorical variable. Another such command is crossmat, a wrapper for the Stata command tabulate that returns all output in matrices. Finally, the command metadata collects metadata from the current dataset, a noncurrent dataset, or all datasets in a folder (including subfolders, if requested).
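A minimal sketch of this workflow, using Stata's auto dataset (the calls are illustrative; consult each command's help file for the exact options):

```stata
sysuse auto, clear

* Summary statistics (including unique values) returned in a matrix
sumat price mpg weight

* Cross-tabulation: a wrapper for -tabulate- that returns matrices
crossmat foreign rep78

* Print any Stata matrix in the chosen markup style
matrix A = (1.234, 5.678 \ 9.012, 3.456)
matprint A
```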


*Exploring marginal treatment effects: Flexible estimation using Stata*

Martin Eckhoff Andresen, Statistics Norway

Well-known instrumental-variables (IV) estimators identify treatment effects in settings with selection on levels. In settings that also exhibit selection on gains, the treatment effects for the compliers identified by IV might be very different from those for other populations of interest. Under stronger separability assumptions, the marginal treatment effects (MTE) framework allows us to estimate the whole distribution of treatment effects. I introduce the framework and theory behind MTE and present the new package mtefe, which uses several estimation methods to fit MTE models in Stata. The package provides important improvements and flexibility over existing packages such as margte (Brave and Walstrum 2014) and calculates various treatment-effect parameters based on the results.
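A sketch of a typical call (the variable names, instrument, and polynomial degree are illustrative; see the package documentation for the full option set):

```stata
* Fit an MTE model with a second-degree polynomial specification
* (y: outcome, x: controls, d: treatment, z: instrument -- all illustrative)
mtefe y x (d = z), polynomial(2)
```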


*Calculating polarization indices for population subgroups using Stata*

Jan Zwierzchowski, SGH Warsaw School of Economics

In recent years, more attention has been focused on the effects of economic growth and inequality changes on income polarization, as well as on changes in the share of the middle income class. Most of the literature that deals with this issue focuses on polarization indices. However, the polarization indices proposed by researchers allow only for an assessment of polarization in the whole population and do not actually explain the decline of middle-class shares in certain countries.

This presentation proposes a class of median relative polarization (MRP) partial indices, which allow for a comprehensive assessment of income distribution changes (polarization or convergence) in any given subpopulation, particularly the lower-, middle-, and upper-income class groups. The proposed class of indices is further generalized to allow for an assessment of polarization in certain cohort groups when operating on panel-data sources. I wrote a new Stata command that operationalizes the proposed polarization indices.

Polarization indices for the lower-, middle-, and upper-income groups in the 2005–2015 period have been calculated using panel data for Poland (Social Diagnosis Panel Survey dataset). Despite the lack of polarization in the whole population, there was a slight convergence of incomes in the lower- and middle-income groups and a significant polarization of incomes in the upper-income group. This means that, on average, the incomes of the lowest and middle earners tend to converge toward the median, while the incomes of the richest part of the population grow ever higher.


*Calibrating survey weights in Stata*

Jeff Pitblado, StataCorp

Calibration is a method for adjusting the sampling weights, often to account for nonresponse and underrepresented groups in the population. Another benefit of calibration is smaller variance estimates compared with estimates using unadjusted weights. Stata implements two methods for calibration: the raking-ratio method and the generalized regression method. Stata supports calibration for the estimation of totals, ratios, and regression models. Calibration is also supported by each survey variance estimation method implemented in Stata. In this presentation, I will show how to use calibration in survey data analysis using Stata.
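In recent Stata versions, calibration is declared as part of the survey design. A hedged sketch (all variable names are illustrative, and the exact totals() specification should be checked against the svyset documentation):

```stata
* Declare the survey design with raking-ratio calibration
* (psu, stratum, wt, agegrp, and pop_totals are all illustrative)
svyset psu [pweight=wt], strata(stratum) rake(i.agegrp, totals(pop_totals))

* Estimation commands then use the calibrated weights automatically
svy: mean income
```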


*Estimation and inference for quantiles and measures of inequality with survey data*

Philippe Van Kerm, Luxembourg Institute for Social and Economic Research

Stata is the software of choice for many analysts of household surveys, in particular for poverty and inequality analysis. No dedicated suite of commands comes bundled with the software, but many community-contributed commands are freely available for estimating various types of indices. This presentation introduces a set of new tools that complement and significantly upgrade some existing packages. The key feature of the new packages is their ability to leverage Stata's built-in capacity for dealing with survey design features (via the svy prefix), resampling methods (via the bootstrap, jackknife, or permute prefix), multiply imputed data (via mi), and various postestimation commands for testing purposes.


*Introduction to Bayesian analysis using Stata*

Chuck Huber, StataCorp

Bayesian analysis has become a popular tool for many statistical applications. Yet many data analysts have little training in the theory of Bayesian analysis or the software used to fit Bayesian models. This presentation provides an intuitive introduction to the concepts of Bayesian analysis and demonstrates how to fit Bayesian models using Stata. No prior knowledge of Bayesian analysis is necessary. Specific topics include the relationship between the likelihood function and the prior and posterior distributions, Markov chain Monte Carlo (MCMC) using the Metropolis–Hastings algorithm, and how to use Stata's bayes prefix to fit Bayesian models.
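The bayes prefix turns a standard estimation command into a Bayesian model fit by MCMC; a minimal example with default priors, plus a variant with a user-specified prior:

```stata
sysuse auto, clear

* Bayesian linear regression with default priors, fit by MCMC
bayes: regress mpg weight

* The same model with an informative normal prior on the weight coefficient
bayes, prior({mpg:weight}, normal(0, 10)): regress mpg weight
```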


*merlin: Mixed effects regression for linear and non-linear models*

Michael J. Crowther, University of Leicester

merlin can do a lot of things: linear regression, Weibull survival models, three-level logistic models, multivariate joint models of multiple longitudinal outcomes, and joint models of recurrent events and survival. merlin can do things I haven't even thought of yet. I will take a single dataset, attempt to show you the full range of merlin's capabilities, and present some of the new features following its rise from the ashes of megenreg. There will even be some surprises.
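Two of the simpler cases listed above can be sketched as follows (variable names are illustrative, and more complex models add further equations to the same syntax):

```stata
* Linear regression in merlin: one equation with a Gaussian family
merlin (y x1 x2, family(gaussian))

* Weibull survival model (stime: survival time, died: event indicator)
merlin (stime trt, family(weibull, failure(died)))
```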


*Producing up-to-date survival estimates from prognostic models using temporal recalibration*

Sarah Booth, University of Leicester

Period analysis is a method used in survival analysis that applies delayed-entry techniques so that only the most recent data are included. It has been shown to produce more up-to-date survival predictions than the standard method of cohort analysis. However, period analysis reduces the sample size, which leads to greater uncertainty in the parameter estimates.

Temporal recalibration combines the advantages of cohort and period analysis. A cohort model is fitted and then recalibrated using a period analysis sample. The parameter estimates are constrained to be the same, but the baseline hazard function can vary, which allows any improvements in survival to be captured. This method could therefore be useful for prognostic models, since it enables more up-to-date survival predictions to be produced.

In this talk, I show the differences between the cohort, recalibrated, and period analysis models and compare the survival estimates they produce. This involves using stset to define the period analysis sample and stpm2 to fit and recalibrate flexible parametric survival models.
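The period analysis sample can be defined through delayed entry in stset, after which the flexible parametric model is fitted with stpm2; a sketch with illustrative variable names and an illustrative period window:

```stata
* Restrict follow-up to a recent period window via delayed entry
* (datediag, dateexit, dead, and the window dates are illustrative)
stset dateexit, fail(dead) origin(datediag) ///
    entry(time mdy(1,1,2010)) exit(time mdy(12,31,2014)) scale(365.25)

* Flexible parametric survival model fitted on the period sample
stpm2 agegrp2-agegrp4, df(5) scale(hazard)
```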


*Analyzing time-to-event data in the presence of competing risks within the flexible parametric modeling framework. What tools are available in Stata? Which one to use and when?*

Sarwar Islam Mozumder, University of Leicester

In a typical survival analysis, the time to an event of interest is studied. For example, in cancer studies, researchers often wish to analyze a patient's time to death since diagnosis. Similar applications exist in economics and engineering. Often, however, the event of interest is not distinguished by cause. Although this may sometimes be useful, in many situations it does not paint the entire picture and restricts the analysis. More commonly, the event may occur because of different causes, which better reflects real-world scenarios. For instance, if the event of interest is death due to cancer, it is also possible for the patient to die of other causes, in which case the time at which the patient would have died of cancer is never observed. These are known as competing causes of death, or competing risks.

In a competing risks analysis, interest lies in the cause-specific cumulative incidence function (CIF). This can be calculated by either:

(1) transforming (all) cause-specific hazards, or

(2) exploiting its direct relationship with the subdistribution hazards.

Adopting approach (1), cause-specific CIFs can be obtained within the flexible parametric modeling framework using the stpm2 postestimation command stpm2cif. Alternatively, because competing risks are a special case of a multistate model, an equivalent model can be fitted using the multistate package. To estimate cause-specific CIFs using approach (2), stpm2 can be used with time-dependent censoring weights, which are calculated on restructured data using stcrprep.

The above methods involve some form of data augmentation. Instead, estimation on individual-level data may be preferred because of computational advantages. This is possible using either approach (1) or (2) with stpm2cr.

In this presentation, I provide an overview of these various tools, and I discuss which of these to use and when.


*Standardized survival curves and related measures from flexible parametric survival models*

Paul C. Lambert, University of Leicester and Karolinska Institutet

In observational studies with time-to-event outcomes, we expect that there will be confounding and would usually adjust for confounders in a survival model. From such models, an adjusted hazard ratio comparing exposed and unexposed subjects is often reported. This is fine, but hazard ratios can be difficult to interpret, are not collapsible, and pose further problems when one tries to interpret them as causal effects. Risks are much easier to interpret than rates, so quantifying the difference on the survival scale can be desirable.

In Stata, stcurve gives survival curves after fitting a model where certain covariates can be given specific values, but those not specified are given mean values. Thus, it gives a prediction for an individual who happens to have the mean values of each covariate and may not reflect the average survival in the population. An alternative is to use standardization to estimate marginal effects, where the regression model is used to predict the survival curve for unexposed and exposed subjects at all combinations of other covariates included in the model. These predictions are then averaged to give marginal effects.

I present stpm2_standsurv to obtain various standardized measures after fitting a flexible parametric survival model. As well as estimating standardized survival curves, the command can estimate the marginal hazard function, the standardized restricted mean survival time, and centiles of the standardized survival curve. Contrasts can be made between any of these measures (differences, ratios). A user-defined function can be given for more complex contrasts.
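A sketch of the workflow after fitting a flexible parametric model (the covariate names and options are illustrative; see the command's help file for the exact contrast syntax):

```stata
* Fit a flexible parametric survival model with a binary exposure
stpm2 exposed age, df(4) scale(hazard)

* Time points at which to predict (illustrative: 0 to 10 years)
range tt 0 10 101

* Standardized survival curves for unexposed vs. exposed,
* with their difference and confidence intervals
stpm2_standsurv, at1(exposed 0) at2(exposed 1) timevar(tt) ///
    contrast(difference) ci
```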
