The 2023 Northern European Stata Conference


The 2023 Northern European Stata Conference will be held at the Karolinska Institutet in Stockholm on September 1, 2023. The venue is lecture room Richard Doll, Eugeniahemmet, Maria Aspmans gata 30, 1st floor.

This meeting will provide Stata users the opportunity to exchange ideas, experiences, and information on new applications of Stata. Anyone interested in using Stata is welcome. Representatives from StataCorp will attend, and there will be the usual open panel discussion with Stata developers.

If you want to attend the meeting, please send an email to containing your name, affiliation, and contact details.



09:00–09:05 Welcome!

09:05–09:30 Kit Baum

Boston College, USA

Drivers of COVID-19 deaths in the United States: A two-stage modeling approach

We offer a two-stage (time-series and cross-section) econometric modeling approach to examine the drivers behind the spread of COVID-19 deaths across counties in the United States. Our empirical strategy exploits the availability of two years (January 2020 through January 2022) of daily data on the number of confirmed deaths and cases of COVID-19 in the 3,000 U.S. counties of the 48 contiguous states and the District of Columbia. In the first stage of the analysis, we use daily time-series data on COVID-19 cases and deaths to fit mixed models of deaths against lagged confirmed cases for each county. Because the resulting coefficients are county specific, they relax the homogeneity assumption that is implicit when the analysis is performed using geographically aggregated cross-section units. In the second stage of the analysis, we assume that these county estimates are a function of economic and sociodemographic factors that are taken as fixed over the course of the pandemic. Here we employ the novel one-covariate-at-a-time variable-selection algorithm proposed by Chudik et al. (2018) to guide the choice of regressors.


09:30–09:55 Robert Thiesmeier

Karolinska Institutet, Sweden

Estimation of two-stage models in individual participant data meta-analysis with missing data

Individual participant data (IPD) meta-analyses often have missing data and are analyzed in two steps: estimates are first obtained within each individual study and then averaged across studies. The current mi suite of commands for dealing with missing data does not allow a two-stage approach in estimating regression models. Therefore, we introduce a new command, -twostage-, which fits two-stage regression models for IPD meta-analysis with missing data. -twostage- has been developed to accommodate systematically and sporadically missing data in IPD meta-analysis. We first briefly describe the challenges of missing data in IPD meta-analysis and then illustrate applications of the -twostage- command in the context of health-related studies.
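
As a rough illustration of the two-step logic described in this abstract (and not of the -twostage- command itself, whose syntax and missing-data handling are not shown here), the following Python sketch estimates a regression slope within each study and then combines the study-specific slopes with inverse-variance weights. The toy data, the function names, and the choice of a fixed-effect weighted average are all assumptions made for illustration; the sketch uses complete data only.

```python
import math


def ols_slope(x, y):
    """Stage 1: slope and its standard error from a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def pool_fixed_effect(estimates):
    """Stage 2: inverse-variance weighted average of (estimate, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical toy data: exposure x and outcome y in three small studies.
studies = {
    "study_a": ([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]),
    "study_b": ([1, 2, 3, 4, 5, 6], [1.5, 4.2, 5.8, 8.4, 9.7, 12.3]),
    "study_c": ([2, 4, 6, 8], [4.4, 7.9, 12.5, 15.8]),
}

stage_one = [ols_slope(x, y) for x, y in studies.values()]
pooled, pooled_se = pool_fixed_effect(stage_one)
print(f"pooled slope = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

A random-effects weighting could replace the fixed-effect average in the second stage; the fixed-effect version is used here only because it is the shortest to write down.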


09:55–10:20 Nicola Orsini

Karolinska Institutet, Sweden

Imputation of systematic missing data in individual participant data meta-analysis

Answering research questions in light of multiple studies is challenged by one or more variables being 100% unobserved by design, also known as systematic missing data. The current imputation methods implemented in -mi-, however, are mainly suited for one study and sporadically missing data. Our aim is to introduce a new user-defined imputation method within -mi impute- capable of handling the main features of individual participant data (IPD) meta-analysis. Realistic simulated studies will be used to illustrate the logic and practice of imputing systematic missing data.


10:20–10:50 Coffee break


10:50–11:15 Haghish Ebad Fardzadeh

University of Oslo, Norway

Single imputation and multiple imputation with machine learning

Abstract forthcoming.


11:15–11:40 Matteo Bottai

Karolinska Institutet, Sweden

A command for estimating regression parameters for the maximum agreement predictor

The talk presents -mareg-, a command for estimating the coefficients of maximum agreement regression models for an outcome variable given predictors. Recently introduced by Bottai et al. (The American Statistician, 2022, 76:4, 313-321), maximum agreement regression maximizes the concordance correlation between the prediction and the observed outcome, rather than the Pearson correlation coefficient maximized by ordinary linear regression. The syntax of the command is nearly identical to that of -regress-, which fits least-squares regression. The talk shows the features of the command and its possible applications through real data examples.
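
To make the distinction between the two correlation measures concrete, the Python sketch below computes the Pearson correlation and the concordance correlation coefficient in its usual form, 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), for a prediction that is perfectly linear in the outcome but shifted by a constant. It does not reproduce -mareg- or its estimation method; the data and function names are hypothetical.

```python
import math


def _moments(x, y):
    """Means, variances, and covariance of x and y, using 1/n denominators."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


def pearson(x, y):
    """Pearson correlation: measures linear association only."""
    _, _, var_x, var_y, cov_xy = _moments(x, y)
    return cov_xy / math.sqrt(var_x * var_y)


def concordance(x, y):
    """Concordance correlation: also penalizes shifts in location and scale."""
    mean_x, mean_y, var_x, var_y, cov_xy = _moments(x, y)
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical outcome and a prediction that is off by a constant of 1.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [value + 1.0 for value in observed]

print("Pearson:    ", round(pearson(observed, predicted), 3))
print("Concordance:", round(concordance(observed, predicted), 3))
```

The shifted prediction keeps a perfect Pearson correlation but loses agreement, which is the sense in which the two criteria can favor different predictors.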


11:40–12:05 Nils Henrik Bruun

Aalborg University Hospital, Denmark

Regression to the mean and randomized control trials with continuous outcomes

Measurement errors in a study make "regression to the mean" occur to different degrees. To remedy the "regression to the mean" effect in randomized control trials, one should measure the continuous outcome before randomization and adjust for the baseline outcome value in the analysis. This adjustment requires the use of regression constraints. The adjustment leads to smaller standard errors. After presenting a real case, I introduce the concept of "regression to the mean." Then I introduce the relation between "regression to the mean" and the intra-class correlation and the measurement error. Using the case, I compare the estimates from several approaches in randomized control trials. Here, I demonstrate the use of constraints. Knowing the intra-class correlation in power calculations will lead to a smaller required number of observations, i.e., higher power. Hence, randomized control trials should report the intra-class correlation.
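
The link between "regression to the mean" and the intra-class correlation can be illustrated with a small Python sketch. It assumes a classical measurement-error model that is not spelled out in the abstract: each measurement is a stable true value plus independent error with constant variance, there is no real change between measurements, and the expected follow-up is linear in the baseline. All numbers and names are hypothetical.

```python
def intraclass_correlation(var_true, var_error):
    """Share of the total variance that is due to true between-subject variation."""
    return var_true / (var_true + var_error)


def expected_follow_up(baseline, population_mean, var_true, var_error):
    """Expected second measurement given the first, with no true change.

    Under the assumed model, the deviation from the population mean shrinks
    by a factor equal to the intra-class correlation.
    """
    icc = intraclass_correlation(var_true, var_error)
    return population_mean + icc * (baseline - population_mean)


# Hypothetical values: population mean 120, true variance 100, error variance 25.
icc = intraclass_correlation(100.0, 25.0)
follow_up = expected_follow_up(150.0, 120.0, 100.0, 25.0)
print(f"ICC = {icc:.2f}, expected follow-up for a baseline of 150 = {follow_up:.1f}")
```

The larger the measurement error, the lower the intra-class correlation and the stronger the pull of extreme baseline values toward the mean.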


12:05–13:10 Lunch break


13:10–14:10 Enrique Pinzon

StataCorp, USA

Heterogeneous difference-in-differences estimation

Treatment effects might differ over time and across groups that are treated at different points in time (treatment cohorts). In Stata 18, we introduced two commands that estimate treatment effects that vary over time and cohort. For repeated cross-sectional data, we have -hdidregress-. For panel data, we have -xthdidregress-. Both commands let you graph the evolution of treatment effects over time. They also allow you to aggregate treatment effects within cohort and time and visualize these effects. I will show you how both commands work and briefly discuss the theory underlying them.


14:10–14:35 Nurgul Batyrbekova

Karolinska Institutet, Sweden

Modelling hazard rates with multiple timescales: An application study

There are situations when we need to model multiple timescales in survival analysis. A usual approach would involve fitting Cox or Poisson models to a time-split dataset. However, this leads to large datasets and can be computationally intensive when fitting models, especially if interest lies in displaying how the estimated hazard rate or survival changes continuously along multiple timescales. Flexible parametric survival models on the log hazard scale are an alternative method when modelling data with multiple timescales. This can be achieved by using the Stata package -stmt-, where one of the timescales is chosen to be the primary timescale and the other timescale(s) is (are) specified by using an offset option. Through a case study, I will demonstrate this method and provide examples of graphical representations.


14:35–15:00 Coffee break


15:00–15:25 Alessandro Gasparini

Red Door Analytics AB, Sweden

Hierarchical survival models: Estimation, prediction, interpretation

Hierarchical time-to-event data is common across various research domains. In the medical field, for instance, patients are often nested within hospitals and regions, while in education, students are nested within schools. In these settings, the outcome is typically measured at the individual level, with covariates recorded at any level of the hierarchy. This hierarchical structure poses unique challenges and necessitates appropriate analytical approaches. Traditional methods, like the widely used Cox model, assume the independence of study subjects, disregarding the inherent correlations among subjects nested within the same higher-level unit (such as a hospital). Consequently, failing to account for the multilevel structure and within-cluster correlation can yield biased and inefficient results. To address these issues, one can use mixed effects models, which incorporate both population-level fixed effects and cluster-specific random effects at various levels of the hierarchy. Stata users can leverage several powerful commands to fit hierarchical survival models, such as -mestreg- and -stmixed-. With this presentation, we introduce and demonstrate the use of these commands, including a range of post-estimation predictions. Moreover, we delve into measures that quantify the impact of the hierarchical structure, commonly referred to as contextual effects in the literature, and discuss the interpretation of model-based predictions, focusing on the difference between conditional and marginal effects.


15:25–15:50 Caroline Weibull

Karolinska Institutet & Red Door Analytics AB, Sweden

Modelling excess mortality comparing to a control population: A combined additive and relative hazards model

In this work, we propose a flexible parametric excess hazard model on the log hazard scale, incorporating a modelled expected rate from a control population (e.g., matched comparators). Covariate effects are assumed to be multiplicative within both the expected hazard and the excess hazard, while the presence of disease among the studied group has an additive effect, hence the excess hazard. By modelling the expected rate, we can appropriately allow for uncertainty. The model is extended to include time-dependent effects, multiple timescales, and more. Following estimation, we quantify results through the prediction of the survival, hazard, and cumulative incidence functions, as well as transformations of these, and crucially with associated confidence intervals on all measures. The proposed method has been implemented in the Stata package -stexcess-.


15:50–16:15 Michael Crowther

Red Door Analytics AB, Sweden

Health technology assessment and Stata: Reviewing the old and coding the new

Health Technology Assessment (HTA) utilizes a wide variety of statistical methods to evaluate clinical and cost effectiveness of treatments, including survival analysis and meta-analysis. In this talk, I will briefly review some of the available features in Stata that we have developed over the years, with a focus on their use in HTA, and describe some ongoing work to improve their applicability in such settings. This will include 1) flexible survival modelling with -merlin-, 2) Markov, semi-Markov, and non-Markov multi-state modelling with -multistate-, and 3) efficient and generalizable individual patient simulation with -survsim-. Finally, I will introduce some new tools, such as the -maic- command for conducting matching-adjusted indirect comparisons, and a new prefix command for -stmerlin-, providing Bayesian flexible survival models.


16:15–17:00 Open discussion with Stata developers

Yulia Marchenko, Vice President of Statistics and Data Science, StataCorp

Enrique Pinzon, Associate Director of Econometrics, StataCorp

Contribute to the Stata community by sharing your feedback with StataCorp's developers. From feature improvements to bug fixes and new ways to analyze data, we want to hear how Stata can be made better for our users.



Scientific committee

Matteo Bottai, matteo.bottai@ki.se, Unit of Biostatistics, National Institute of Environmental Medicine, Karolinska Institutet.

Nicola Orsini, Biostatistics Team, Department of Global Public Health, Karolinska Institutet.

Caroline Weibull, Division of Clinical Epidemiology, Department of Medicine, Karolinska Institutet.


Logistics organizers

The meeting is jointly organized by the Biostatistics Team at the Department of Global Public Health and Metrika Consulting.

Metrika is the distributor of Stata in Northern Europe -- the Nordic and Baltic regions, and Russia. For further information, please visit or contact us at