Feature highlights 

We are excited to introduce you to the new features in Stata 16. Below, we list some highlights of the release, and we tell you a little more about the first 13 of them.

1. Lasso-based machine learning

2. Reporting

3. Meta-analysis

4. Choice models

5. Python integration

6. New in Bayesian analysis—Multiple chains, predictions, and more

7. Panel-data ERMs

8. Import data from SAS and SPSS

9. Nonparametric series regression

10. Multiple datasets in memory

11. Sample-size analysis for confidence intervals

12. Nonlinear DSGE models

13. Multiple-group IRT models

14. xtheckman

15. Multiple-dose pharmacokinetic modeling

16. Heteroskedastic ordered probit models

17. Graph sizes in printer points, centimeters, and inches

18. Numerical integration

19. Linear programming

20. Stata in Korean

21. Mac interface now supports Dark Mode and native tabbed windows

22. Do-file Editor—Autocompletion and more syntax highlighting


1. Lasso-based machine learning

Lasso is a machine-learning technique used for model selection, prediction, and inference.

The new lasso command selects “optimal” predictors for continuous, count, and binary outcomes using deviances from linear, Poisson, logit, or probit regression models. For instance, if you type

. lasso linear y x1-x500

lasso will select a subset of the specified covariates—say, x2, x10, x11, and x21. You can then use the standard predict command to obtain predictions of y.

If you instead have a binary or count outcome, you can use lasso logit, lasso probit, or lasso poisson in the same way. And if you prefer to select variables using the elastic net or square-root lasso method, you can use the elasticnet or sqrtlasso command.
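
To make "selecting a subset of the covariates" concrete, here is a minimal sketch of the idea behind a linear lasso, written in plain Python. It is not Stata's implementation: the function names (soft_threshold, lasso_cd), the toy data, and the fixed penalty lam are assumptions made for illustration, and the sketch omits the intercept, standardization, and data-driven choice of the penalty.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * RSS + lam * sum(|b_j|), no intercept."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every covariate's contribution except j's.
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            zj = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / zj
    return b


# Hypothetical toy data: y tracks the first covariate; the second is noise.
X = [[-1.5, 0.5], [-0.5, -1.0], [0.0, 1.0], [0.5, -0.5], [1.5, 0.0]]
y = [-3.1, -0.9, 0.1, 1.0, 2.9]

coefs = lasso_cd(X, y, lam=0.5)
# Covariates whose coefficient was not shrunk all the way to zero are "selected".
selected = [j for j, bj in enumerate(coefs) if bj != 0.0]
print(coefs, selected)
```

The penalty sets the coefficient on the noise covariate to exactly zero, which is what makes lasso a variable-selection method rather than only a shrinkage method.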

Sometimes, variable selection or prediction is the final goal of lasso. Other times, you are interested in estimating and testing coefficients. Stata 16 provides 11 commands that allow you to estimate coefficients, standard errors, and confidence intervals and to perform tests for variables of interest while using lasso methods to select from among potential control variables. The commands are

dsregress, dslogit, dspoisson, poregress, pologit, popoisson, poivregress, xporegress, xpologit, xpopoisson, and xpoivregress.

The ds commands perform double-selection lasso, the po commands perform partialing-out lasso, and the xpo commands perform cross-fit partialing-out lasso. They do this for models with continuous, binary, and count outcomes. They can even handle endogenous covariates in models for continuous outcomes. The literature currently discusses many methods for lasso-based inference. We make some of these methods available so that researchers can select their favorite. In fact, there are even more lasso-based methods of inference in the literature, and often researchers may use the tools available in lasso, sqrtlasso, and elasticnet to implement other methods.

The lasso and elasticnet commands are standard lasso tools often requested for variable selection and prediction. The lasso tools for inference implement newer methods developed primarily by econometricians. However, these inference methods will be popular in all disciplines because they provide a method for testing and interpreting coefficients on variables of interest.

You can easily learn all about the lasso features in the new Lasso Reference Manual.


2. Reporting

Stata’s reporting features allow you to create Word, PDF, Excel, and HTML documents that incorporate Stata results and graphs with formatted text and tables. Regardless of the type of document you create, you can rely on Stata’s integrated versioning features to ensure that your reports are reproducible.

Want dynamic reports that are updated as your data change? Stata’s reporting features make this easy too. Rerun the command or do-file that created your report with the updated dataset, and all Stata results in the report are updated automatically.

Stata 16 has new and improved reporting features, of course, but just as importantly, all of Stata's reporting features are now documented in a new Reporting Reference Manual. The manual includes many new examples that demonstrate workflows and provide guidance on customizing the Word, PDF, Excel, and HTML documents you create using Stata.

New reporting features in Stata 16:

- The dyndoc and markdown commands now create Word documents in addition to the HTML documents they previously created. Now, you can easily incorporate full Stata output and graphs with Markdown-formatted text to create customized Word documents.

- The Do-file Editor now provides syntax highlighting for Markdown language elements.

- The putdocx command now lets you include headers, footers, and page numbers. It also makes it easier to write large blocks of text.

- The html2docx command converts HTML documents, including CSS, to Word documents.

- The docx2pdf command converts Word documents to PDFs.


3. Meta-analysis

Stata 16 has a new suite of commands for performing meta-analysis. This suite lets you explore and combine the results from different studies. For instance, if you have collected results from 20 studies about the effect of a particular drug on blood pressure, you can summarize these studies and estimate the overall effect using meta-analysis.

The new meta suite is broad, but what sets it apart is its simplicity. You can type, for instance,

. meta set effectsize stderr

to declare precomputed effect sizes or use meta esize to compute effects from summary data. With this, you can perform random-effects, fixed-effects, or common-effect meta-analysis.

To estimate an overall effect size and its confidence interval, obtain heterogeneity statistics, and more, you simply type

. meta summarize

And visualizing the results is as easy as typing

. meta forestplot
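
As a rough illustration of what "an overall effect size and its confidence interval" means, the sketch below pools precomputed effect sizes with inverse-variance weights, which is the textbook common-effect estimator. The function name pool_inverse_variance and the three study values are made up for the example; it does not reproduce the output of meta summarize, and it covers only the common-effect case.

```python
import math


def pool_inverse_variance(effects, std_errs):
    """Common-effect pooled estimate, its SE, a 95% normal CI, and Cochran's Q."""
    weights = [1.0 / se ** 2 for se in std_errs]  # more precise studies count more
    total = sum(weights)
    theta = sum(w * e for w, e in zip(weights, effects)) / total
    se_theta = math.sqrt(1.0 / total)
    ci = (theta - 1.96 * se_theta, theta + 1.96 * se_theta)
    # Cochran's Q: weighted squared deviations of the studies from the pooled value.
    q = sum(w * (e - theta) ** 2 for w, e in zip(weights, effects))
    return theta, se_theta, ci, q


# Hypothetical effect sizes and standard errors from three studies.
effects = [0.10, 0.30, 0.20]
std_errs = [0.10, 0.10, 0.10]

theta, se_theta, ci, q = pool_inverse_variance(effects, std_errs)
print(theta, se_theta, ci, q)
```

Because the three studies here are equally precise, the pooled estimate is simply their average; with unequal standard errors, the more precise studies would pull the estimate toward themselves.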

But the meta suite provides much more.

Meta-regression and subgroup analysis allow you to evaluate the heterogeneity of studies. These are available via meta regress and meta forestplot, subgroup() or meta summarize, subgroup().

You can investigate potential publication bias. Check visually for funnel-plot asymmetry using meta funnelplot; formally test for funnel-plot asymmetry using meta bias; and assess publication bias using the trim-and-fill method with meta trimfill.

You can even perform cumulative meta-analysis with meta summarize, cumulative().

All the meta-analysis features are documented in the new Meta-analysis Reference Manual.


4. Choice models

Stata 16 introduces a new, unified suite of commands for modeling choice data. We have added new commands for summarizing choice data. We renamed and improved existing commands for fitting choice models. We even added a new command for fitting mixed logit models for panel data. And we document them together in the new Choice Models Reference Manual.

And here’s the best part: margins now works after fitting choice models. This means you can now easily interpret the results of your choice models. While the coefficients estimated in choice models are often almost uninterpretable, margins allows you to ask and answer very specific questions based on your results. Say that you are modeling choice of transportation. You can answer questions such as

- What proportion of travelers are expected to choose air travel?

- How does the probability of traveling by car change for each additional $10,000 in income?

- If wait times at the airport increase by 30 minutes, how does this affect the choice of each mode of transportation?

What else is new? You now cmset your data before fitting a choice model. For instance,

. cmset personid transportmethod

Then, you use cmsummarize, cmchoiceset, cmtab, and cmsample to explore, summarize, and look for potential problems in your data.

And you use cm estimation commands to fit one of the following choice models:

- cmclogit: conditional logit (McFadden’s choice) model

- cmmixlogit: mixed logit model

- cmxtmixlogit: panel-data mixed logit model

- cmmprobit: multinomial probit model

- cmroprobit: rank-ordered probit model

- cmrologit: rank-ordered logit model

cmxtmixlogit is completely new in Stata 16, and it fits mixed logit models for panel data.
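
Questions such as "what proportion of travelers are expected to choose air travel?" are questions about predicted choice probabilities. As a sketch of where those probabilities come from in a conditional logit model, the Python snippet below converts the utilities of the alternatives one traveler faces into probabilities. The alternatives and utility values are hypothetical, and choice_probabilities is an illustrative name, not a Stata or Python library function.

```python
import math


def choice_probabilities(utilities):
    """Conditional logit: P(alt) = exp(u_alt) / sum over alternatives of exp(u)."""
    m = max(utilities.values())  # subtract the maximum for numerical stability
    expu = {alt: math.exp(u - m) for alt, u in utilities.items()}
    total = sum(expu.values())
    return {alt: v / total for alt, v in expu.items()}


# Hypothetical utilities for one traveler's four transportation alternatives.
utilities = {"air": 1.0, "train": 0.5, "bus": 0.0, "car": 1.0}

probs = choice_probabilities(utilities)
print(probs)
```

Averaging such probabilities over the individuals in a sample gives an expected share for each alternative, which is the kind of quantity the questions above ask about.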


5. Python integration

In Stata 16, you can embed and execute Python code from within Stata. Stata's new python command allows you to easily call Python from Stata and output Python results within Stata.

You can invoke Python interactively or in do-files and ado-files so that you can leverage Python's extensive language features. You can also execute a Python script file (.py) directly through Stata.

In addition, we introduced the Stata Function Interface (sfi) Python module, which provides a bi-directional connection between Stata and Python. This module lets you access Stata's current dataset, frames, macros, scalars, matrices, value labels, characteristics, global Mata matrices, and more.

All of this means that you can now use any Python package directly within Stata. For instance, you can use Matplotlib to draw 3-dimensional graphs. You can use NumPy for numerical computations. You can use Scrapy to scrape data from the web. You can access additional machine-learning techniques such as neural networks and support vector machines through TensorFlow and scikit-learn. And much more.

Finally, Stata’s Do-file Editor now includes syntax highlighting for the Python language.


6. New in Bayesian analysis—Multiple chains, predictions, and more

Multiple chains: Bayesian inference based on an MCMC (Markov chain Monte Carlo) sample is valid only if the Markov chain has converged. One way we can evaluate this convergence is to simulate and compare multiple chains.

The new nchains() option can be used with both the bayes: prefix and the bayesmh command. For instance, you type

. bayes, nchains(4): regress y x1 x2

and four chains will be produced. The chains will be combined to produce a more accurate final result. Before interpreting the result, however, you can compare the chains graphically to evaluate convergence. You can also evaluate convergence using the Gelman–Rubin convergence diagnostic that is now reported by bayes: regress and other Bayesian estimation commands when multiple chains are simulated. When you are concerned about nonconvergence, you can investigate further using the bayesstats grubin command to obtain individual Gelman–Rubin diagnostics for each parameter in your model.
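
To give a feel for what a Gelman–Rubin diagnostic measures, the following Python sketch computes the basic textbook version for one parameter: it compares the variance between chains with the variance within chains. The function name gelman_rubin and the tiny chains are assumptions for illustration, and the statistic Stata reports may include further corrections that are not shown here.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    w = mean([variance(c) for c in chains])  # average within-chain variance
    b = n * variance(chain_means)            # between-chain variance
    var_hat = (n - 1) / n * w + b / n        # pooled estimate of posterior variance
    return math.sqrt(var_hat / w)


# Hypothetical draws: two chains that overlap, and two that sit far apart.
mixed = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]]
stuck = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 indicate that the chains agree with one another; values well above 1, as for the second pair of chains, signal that the chains have not converged to the same distribution.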

Bayesian predictions: Bayesian predictions are simulated values from the posterior predictive distribution. These predictions are useful for checking model fit and for predicting out-of-sample observations. After you fit a model with bayesmh, you can use bayespredict to compute these simulated values or functions of them and save those in a new Stata dataset. For instance, you can type

. bayespredict (ymin:@min({_ysim})) (ymax:@max({_ysim})), saving(yminmax)

to compute minimums and maximums of the simulated values. You can then use other postestimation commands such as bayesgraph to obtain summaries of the predictions.

The dataset created by bayespredict may include thousands of simulated values for each observation in your dataset. Sometimes, you do not need all of these individual values. To instead obtain posterior summaries such as posterior means or medians, you can use bayespredict, pmean or bayespredict, pmedian. Alternatively, you may be interested in a random sample of the simulated values. You can use, for instance, bayesreps, nreps(100) to obtain 100 replicates.

Finally, you may want to evaluate model goodness of fit using posterior predictive p-values, also known as PPPs or as Bayesian predictive p-values. PPPs measure agreement between observed and replicated data and can be computed using the new bayesstats ppvalues command. For instance, using our earlier example

. bayesstats ppvalues {ymin} {ymax} using yminmax
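
Conceptually, a posterior predictive p-value is the share of replicated datasets whose test statistic is at least as large as the statistic in the observed data. The Python sketch below shows that calculation for the minimum; the function name ppp and the replicated values are hypothetical and are not output from bayespredict.

```python
def ppp(replicated_stats, observed_stat):
    """Share of replicated test statistics at least as large as the observed one."""
    hits = sum(1 for t in replicated_stats if t >= observed_stat)
    return hits / len(replicated_stats)


# Hypothetical minimums from eight replicated datasets, and the observed minimum.
ymin_reps = [1.2, 0.8, 1.5, 0.9, 1.1, 0.7, 1.3, 1.0]
ymin_obs = 1.0

p_min = ppp(ymin_reps, ymin_obs)
print(p_min)
```

A value near 0 or 1 suggests the model has trouble reproducing that feature of the data, while a value near 0.5 suggests agreement between observed and replicated data.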


7. Panel-data ERMs

Extended regression models (ERMs) were a big new feature last release. The ERM commands fit models that account for three common problems that arise in observational data—endogenous covariates, sample selection, and treatment—either alone or in combination.

In Stata 16, we introduce the xteregress, xteintreg, xteprobit, and xteoprobit commands for fitting panel-data ERMs. This means ERMs can now account for the three problems we mentioned above and for within-panel correlation. These new commands fit random-effects linear, interval, probit, and ordered probit regression models. They allow random effects in one or all equations, and they allow random effects to be correlated across equations.


8. Import data from SAS and SPSS

With Stata 16’s new import sas and import spss commands, you can now import data stored in SAS (.sas7bdat) and SPSS (.sav) formats. The dialog boxes make it easy to explore the data before importing them and, if desired, to select a subset of variables and observations to load into Stata.

In addition, with the new import sasxport8 and export sasxport8 commands, you can import and export SAS XPORT Version 8 Transport files. The existing import sasxport and export sasxport commands work with SAS XPORT Version 5 Transport files and have been renamed import sasxport5 and export sasxport5.


9. Nonparametric series regression

Stata 16's new npregress series command fits nonparametric series regressions that approximate the mean of the dependent variable using polynomials, B-splines, or splines of the covariates. This means that you do not need to specify any predetermined functional form. You specify only which covariates you wish to include in your model. For instance, type

. npregress series wineoutput rainfall temperature i.irrigation

Instead of reporting coefficients, npregress series reports effects, meaning average marginal effects for continuous variables and contrasts for categorical variables. The results might be that the average marginal effect of rainfall is 1 and the contrast for irrigation is 2. This contrast can be interpreted as the average treatment effect of irrigation.

Because this is a nonparametric regression, the unknown mean is approximated by a series function of the covariates. And yet we can still obtain the inferences that we could from a parametric model. We just use margins. We could type

. margins irrigation, at(temperature=(40(5)90))

and obtain a table of the expected effect of having irrigation at temperatures of 40, 45, ..., 90 degrees. And we could graph the result using marginsplot.

Even more, npregress series can fit partially parametric (semiparametric) models.


10. Multiple datasets in memory

You can now load multiple datasets into memory. You type

. use people

and people.dta is loaded into memory. Next, you type

. frame create counties

. frame counties: use counties

and you have two datasets in memory. people.dta is in the frame named default, and counties.dta is in the frame named counties. Your current frame is still default. Most Stata commands use the data in the current frame. For example, if you typed

. list

then people.dta would be listed. If you typed

. frame counties: list

then counties.dta would be listed. Or you could make counties the current frame by typing

. frame change counties

and list will now list the counties data.

Navigating frames is easy and so is linking them. Imagine that both datasets have a variable named countycode that identifies counties in the same way. Type

. frlink m:1 countycode, frame(counties)

and each person in the default frame is linked to a county in the counties frame. This means you can now use the frget command to copy variables from the counties frame to the current frame. Or you can use the frval() function to directly access the values of variables in the counties frame. For instance, if we have each individual’s income in the default frame and median county income in the counties frame, we can generate a new variable containing relative income by typing

. generate rel_income = income / frval(counties, median_income)

This is just the beginning. While this example uses only two frames, you can have up to 100 frames in memory at once, and you can have many links among those frames.
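
If you think in terms of general-purpose programming, an m:1 link behaves like a keyed lookup: many people point to one county through countycode. The Python sketch below mimics the relative-income calculation with a dictionary lookup. The records are invented for illustration, and this is an analogy for the linkage, not a description of how frames are implemented.

```python
# Hypothetical "people" data: many rows may share one countycode.
people = [
    {"name": "Ann", "countycode": 1, "income": 50000.0},
    {"name": "Bob", "countycode": 2, "income": 30000.0},
    {"name": "Cy", "countycode": 1, "income": 75000.0},
]

# Hypothetical "counties" data: exactly one row per countycode.
counties = {
    1: {"median_income": 50000.0},
    2: {"median_income": 40000.0},
}

for person in people:
    # Follow the link from the person to the matching county, if there is one.
    county = counties.get(person["countycode"])
    person["rel_income"] = (
        person["income"] / county["median_income"] if county else None
    )

rel = [p["rel_income"] for p in people]
print(rel)
```

Each person picks up the median income of their own county, and the county data are stored once rather than repeated on every person's row.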


11. Sample-size analysis for confidence intervals

The new ciwidth command performs Precision and Sample Size (PrSS) analysis, which is sample-size analysis for confidence intervals (CIs). This method is used when you are planning a study and you want to optimally allocate resources when CIs are to be used for inference. Said differently, you use this method when you want to estimate the sample size required to achieve the desired precision of a CI in a planned study.

ciwidth produces sample sizes, precision, and more that are required for the

- CI for one mean

- CI for one variance

- CI for two independent means

- CI for two paired means

The control panel interface lets you select the analysis type and input assumptions to obtain desired results.

ciwidth allows results to be displayed in customizable tables and graphs.

ciwidth also provides facilities for you to add your own methods.
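
To see the kind of calculation involved, consider the simplest case: a CI for one mean when the standard deviation is treated as known. The CI width is 2 × z × sd / sqrt(n), so solving for n gives the required sample size. The Python sketch below does this; the function name n_for_ci_width and the input values are assumptions for illustration, and ciwidth's own computations may differ (for example, by accounting for an unknown standard deviation).

```python
import math
from statistics import NormalDist


def n_for_ci_width(sd, width, level=0.95):
    """Smallest n so a known-sd normal CI for one mean is no wider than width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided critical value
    return math.ceil((2 * z * sd / width) ** 2)


# Hypothetical planning values: sd of 10 and a target CI width of 4.
n = n_for_ci_width(sd=10.0, width=4.0)
n_narrow = n_for_ci_width(sd=10.0, width=2.0)
print(n, n_narrow)
```

Halving the target width roughly quadruples the required sample size, which is why precision targets are worth settling before data collection begins.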


12. Nonlinear DSGE models

Stata 15 introduced the dsge command for fitting linear DSGE models, which are time-series models used in economics and finance. These models are an alternative to traditional forecasting models. Both attempt to explain aggregate economic phenomena, but DSGE models do this on the basis of models derived from microeconomic theory.

New in Stata 16, the dsgenl command fits nonlinear DSGE models. Most DSGE models are nonlinear, and this means that you no longer need to linearize them by hand. When you enter equations into dsgenl, it linearizes them for you.

After estimating the parameters of your model with dsgenl, you can obtain the transition and policy matrices; determine the model’s steady state; estimate variables’ variances, covariances, and autocovariances implied by the system of equations; and create and graph impulse–response functions.


13. Multiple-group IRT models

IRT models explore the relationship between a latent (unobserved) trait and items that measure aspects of the trait. This often arises in standardized testing where the trait of interest is ability, such as mathematical ability. A set of items (test questions) is designed, and the responses measure this unobserved trait.

Stata’s irt commands fit 1-, 2-, and 3-parameter logistic models. They also fit graded response, nominal response, partial credit, and rating scale models, and any combination of them. And after fitting a model, irtgraph graphs item-characteristic curves, test characteristic curves, item information functions, and test information functions.

New in Stata 16, the irt commands allow comparisons across groups. Take any of the existing irt commands, add a group(varname) option, and fit the corresponding multiple-group model. For instance, type

. irt 2pl item1-item10, group(female)

and fit a two-group 2PL model.

Group-specific means and variances of the latent trait will be estimated. Group-specific difficulty and discrimination parameters can also be estimated for one or more items. With constraints, you can specify exactly which parameters are allowed to vary and which parameters are constrained to be equal across groups.

You can even use likelihood-ratio tests to compare models with and without constraints to perform an IRT model-based test of differential item functioning.
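
For readers new to IRT, the 2PL model mentioned above says that the probability of a correct response depends on the latent trait through a discrimination parameter and a difficulty parameter. The Python sketch below evaluates that item-characteristic curve under the standard logistic form; the function name icc_2pl and the parameter values are hypothetical, not estimates from any dataset.

```python
import math


def icc_2pl(theta, a, b):
    """2PL probability of a correct response: logistic in a * (theta - b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: discrimination a = 1.5, difficulty b = 0.0.
curve = [round(icc_2pl(t, a=1.5, b=0.0), 3) for t in (-2.0, 0.0, 2.0)]
print(curve)
```

The probability equals 0.5 when the trait equals the item's difficulty and rises with the trait. In a multiple-group model, allowing an item's a or b to differ across groups is what lets the item behave differently for people with the same trait level.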


And more!