# I ran a negative binomial regression - and you can too!

Updated: Oct 15, 2020

During my Master's and PhD, the focus when it came to methodology and methods was research design rather than techniques. So, when it comes to statistics, I'm usually left floating in the ether, figuring things out on my own. This includes taking courses, asking people, buying/downloading books, and looking at online tutorials.

For one of my studies, I came across negative binomial regression, a technique pointed out to me by an acquaintance. I had a huge dataset with many, many, **many zeros** that referenced the **occurrence of an event**: in this case, whether a legislative action was on women's representation or not. Long story short, the statistical analyses just weren't working because the effect size was way too small. This actually meant that what I had was a **Poisson distribution**. To make this easier to visualise, I'll take the example of my own dataset:

Unit of analysis = legislative action (bill, speech, etc.)

Occurrence of event = in women's representation (yes/no)

Independent variables = author, author's gender, party, etc.

So, it kinda looks like this:

| ID | NAME | GENDER | PARTY | OCCURRENCE |
|---|---|---|---|---|
| 0000 | Author1 | 0 | Party_code | 0 |
| 0001 | Author2 | 1 | Party_code | 1 |

Anyway, while an NBR is not that hard to do as it turns out, it can be tricky, and most tutorials annoyingly use THE PERFECT DATA. In addition, they tend to focus on certain aspects and leave others out. After putting all that information together for myself, I decided to write my own __step-by-step__ how-to for **SPSS**, in the hope that I can provide some clarity for others. I am not a methodologist or a statistician - I'm a political scientist who learns the methods and techniques she needs when the research question or the syllabus demands it. So, if you are an expert and find any errors in the following, just message me and we'll talk it out.

1. First things first, the best check for a Poisson distribution that I found was to use:

**Nonparametric tests > Legacy dialogs > One-Sample K-S**

When the dialog box opens, add your dependent variable and click on the Poisson option. Your result should look like this:

I apologise for the fact that my result was perfect. The number to look at is the Asymp. Sig.: it tells you whether your distribution is Poisson or not. A high value means the test does not reject the Poisson distribution, and a result of 1,000 is the ultimate answer for that.

Ok, you got your result. This means that, as it is, your dataset will not give you significant results. What you **can** do is reshape it and change the way your DV is counted. By reshaping, I mean changing the unit of analysis. Some choose to cluster the events by month, legislative term, or year. I did try that, but in my case it made the dataset vulnerable to two issues: the increase in women representatives and the increase in actions overall over time. So I clustered it by representative, which still allowed me to use time as a variable and show the actions of individual people.

2. To reshape my dataset I used **pivot tables** in Excel. It was the easiest possible solution. Work with what you got.
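If you'd rather script this step than click through a pivot table, the same reshaping can be sketched in a few lines of Python: collapse one row per legislative action into one row per author, counting how many actions were on women's representation. The records and field names below are illustrative only, modelled on the toy table above, not the real dataset.

```python
# Illustrative action-level records: one row per legislative action.
actions = [
    {"id": "0000", "name": "Author1", "gender": 0, "party": "A", "occurrence": 0},
    {"id": "0001", "name": "Author2", "gender": 1, "party": "B", "occurrence": 1},
    {"id": "0002", "name": "Author2", "gender": 1, "party": "B", "occurrence": 1},
    {"id": "0003", "name": "Author1", "gender": 0, "party": "A", "occurrence": 0},
    {"id": "0004", "name": "Author2", "gender": 1, "party": "B", "occurrence": 0},
]


def cluster_by_author(rows):
    """Collapse action-level rows into one row per author with counts."""
    clustered = {}
    for row in rows:
        # Create the author's row the first time we see them.
        entry = clustered.setdefault(
            row["name"],
            {
                "name": row["name"],
                "gender": row["gender"],
                "party": row["party"],
                "total_actions": 0,
                "occurrence_count": 0,  # this becomes the count DV
            },
        )
        entry["total_actions"] += 1
        entry["occurrence_count"] += row["occurrence"]
    return list(clustered.values())


per_author = cluster_by_author(actions)
for row in per_author:
    print(row)
```

The `occurrence_count` column is the new dependent variable: a count per representative instead of a yes/no per action.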

3. This is where things get interesting: open the clustered file back in SPSS in order to run the analysis.

**Analyze > Generalized Linear Models > Generalized Linear Models**

3.1 *Type of model*:

- if you click on negative binomial, it'll automatically set dispersion at one

- if you click on custom (at the bottom of the window: under distribution pick negative binomial, set the link function to Log, and choose estimate value), SPSS will estimate the dispersion, which is what most books and tutorials suggest
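Before picking between these two options, a rough look at the reshaped count variable can help: in a Poisson distribution the variance equals the mean, so a variance well above the mean suggests overdispersion, which is the situation the negative binomial is meant for. Here is a minimal Python sketch of that check. It is not something SPSS runs for you, and the counts and the function name are made up for illustration.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Return sample variance divided by mean for a list of counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("all counts are zero; the ratio is undefined")
    return variance(counts) / m


# Made-up counts per representative: many zeros, a few very active people.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 9, 0, 0, 14]

ratio = dispersion_ratio(counts)
print(f"mean = {mean(counts):.2f}, variance/mean = {ratio:.2f}")
if ratio > 1:
    print("variance exceeds the mean: possible overdispersion")
```

A ratio close to 1 is what a Poisson distribution would give; how far above 1 counts as "too much" is a judgment call, so treat this as a hint rather than a test.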

3.2 *Response*:

- add your dependent variable (which, of course, is a count variable)

3.3 *Predictors*:

- add categorical variables to factor field (use options to pick the reference category)

- add scale/count variables to covariates field

3.4 *Model*:

- this is where you add the variables to the model, moving them from the left to the right

- here you can pick which variables will be part of the model, leaving some of them in the box on the left; this is helpful when comparing how the model reacts to the addition/removal of certain variables

- set to *main effects* to get the *estimated marginal means*

- keep intercept checked

3.5 *Estimation*:

- if the model comes out and says something like "it has reached the maximum step-halving", this is where you fix it! Just change that from 5 to any number larger than 5 until it works

- in the Covariance Matrix section, click on *Robust estimator* for robust standard errors

3.6 *Statistics*:

- click on *Include exponential parameter estimates* - this is the big kahuna! a.k.a. incidence rate ratio or risk ratio or likelihood

3.7 *EM Means*:

- this is where you select estimated means

- add the categorical variables of interest to the box on the right

- click on compute means for linear predictor and display overall estimated mean

3.8 *Save*:

- this is where you can get all kinds of checks; suggestions are as follows

- predicted value of mean response (which are the predicted counts for the DV)

- Cook's distance (which flags outliers and other influential cases)

- standardized Pearson

- standardized deviance

**Interpretation!**

So, all the values are pointless if we don't know how to use them properly. For instance, outliers were necessary for me, because they pointed out the strength that only one or a few MPs have - so I never removed them. Keep in mind that you know your theory and what you need, and make your decisions based on that.

**Goodness-of-fit values**

Values such as AIC and BIC are good only in comparison to other AICs and BICs. If you're running only one model, sure, report them, but it doesn't really matter unless you run more than one and can show the difference between them.
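To make the "only in comparison" point concrete, here is a small Python sketch using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters and n the number of cases. The log-likelihoods, parameter counts, and model names are hypothetical numbers, not output from my models.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits of two models to the same 400 cases.
n_obs = 400
models = {
    "gender only": {"log_likelihood": -512.4, "n_params": 3},
    "gender + party": {"log_likelihood": -498.9, "n_params": 6},
}

for name, m in models.items():
    print(
        f"{name}: AIC = {aic(m['log_likelihood'], m['n_params']):.1f}, "
        f"BIC = {bic(m['log_likelihood'], m['n_params'], n_obs):.1f}"
    )

# The model with the smallest AIC is the preferred one of the set.
best = min(
    models,
    key=lambda name: aic(models[name]["log_likelihood"], models[name]["n_params"]),
)
print("lowest AIC:", best)
```

Neither number means anything alone; the comparison only works between models fitted to the same data.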

Value/df for deviance and Pearson Chi-square shouldn't go above 2.
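For anyone curious where the Pearson Value/df comes from, this is a rough Python sketch. It assumes the usual negative binomial variance function, mu + alpha * mu^2 (alpha being the dispersion parameter), and residual degrees of freedom equal to the number of cases minus the number of estimated parameters. The observed and fitted values are invented, and the function name is mine.

```python
def pearson_chi2_per_df(observed, fitted, n_params, alpha):
    """Pearson chi-square divided by residual df for a negative binomial fit."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    df = len(observed) - n_params
    if df <= 0:
        raise ValueError("need more cases than estimated parameters")
    # Each squared residual is scaled by the model-implied variance.
    chi2 = sum(
        (y - mu) ** 2 / (mu + alpha * mu ** 2)
        for y, mu in zip(observed, fitted)
    )
    return chi2 / df


# Invented example: observed counts and the means a model predicted for them.
observed = [0, 0, 1, 3, 0, 7, 2, 0]
fitted = [0.4, 0.6, 1.2, 2.5, 0.3, 4.8, 1.9, 0.5]

value_per_df = pearson_chi2_per_df(observed, fitted, n_params=2, alpha=1.0)
print(f"Pearson chi-square / df = {value_per_df:.2f}")
```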

**Omnibus test**

This is the significance of the model! A Sig = ,000 means that the hypothesis of an intercept-only model is rejected!

**Parameter estimates**

B: the beta here serves pretty much to show the direction of the relationship

Sig: significance of the independent variable for the model

Exp(B): back to the big kahuna, which are *incidence rate ratios* for count variables or *risk ratios* for categorical variables; essentially, you can phrase these responses as "greater likelihood of" or "expected number of times", which is the number stated there.

So, consider that we have **y** number of actions on women's representation. If a factor has a risk ratio of 1.09, this means y x 1.09, therefore, not that much impact. If Exp(B) is smaller than 1, your beta will be negative, which means the factor diminishes the outcome of the response variable. It is also possible to use a percentage by multiplying Exp(B) by 100 and then subtracting 100, so: 1.09 x 100 = 109, and 109 - 100 = 9, so a 9% impact.
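The arithmetic above is easy to script if you have many coefficients to convert. A minimal Python sketch, where the function name and the example B values are made up for illustration:

```python
import math


def irr_summary(b):
    """Turn a coefficient B into Exp(B) and a percentage change."""
    irr = math.exp(b)            # Exp(B), the incidence rate ratio
    pct_change = (irr - 1) * 100  # same as Exp(B) * 100 - 100
    return irr, pct_change


# A B whose Exp(B) is 1.09, matching the example in the text.
b = math.log(1.09)
irr, pct = irr_summary(b)
print(f"B = {b:.4f}, Exp(B) = {irr:.2f}, change = {pct:+.1f}%")

# A negative B gives Exp(B) below 1, i.e. a decrease in the expected count.
neg_irr, neg_pct = irr_summary(-0.5)
print(f"B = -0.5, Exp(B) = {neg_irr:.2f}, change = {neg_pct:+.1f}%")
```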

Standard errors: how precisely each coefficient is estimated (smaller is better, meaning less uncertainty around B)

EM Means: the covariates are fixed at their mean, and the predicted mean of the response for each level of the factor is shown with its 95% confidence interval.

This is definitely a compilation of A LOT of things. Some were from the SPSS website itself.

I guess this is as close to a bibliography as is possible:

https://www3.nd.edu/~rwilliam/

https://stats.idre.ucla.edu/spss/dae/negative-binomial-regression/

Fávero, L. P., & Belfiore, P. (2017). *Manual de Análise de Dados—Estatística e Modelagem Multivariada com Excel®, SPSS® e Stata®* (1st ed.). GEN LTC.

Denham, B. E. (2017). Poisson and Negative Binomial Regression. In *Categorical statistics for communication research*. John Wiley & Sons.

Hilbe, J. M. (2011). *Negative binomial regression* (2nd ed.). Cambridge University Press.

Pallant, J. (2010). *SPSS survival manual: A step by step guide to data analysis using SPSS* (4th ed.). McGraw Hill.

Okay, that's it! This was oddly fun and liberating to figure out and I hope writing it down and posting it helps someone else!