One of the most widely used procedures in ARIMA modeling is the Box-Jenkins algorithm.

The Box-Jenkins algorithm consists of four major steps:

1. Model identification

2. Model selection

3. Model diagnostic

4. Model forecasting

**STEP 1. MODEL IDENTIFICATION - Run Sequence Plot**

The first step in the analysis is to generate a run sequence plot of the data. A run sequence plot gives us a general picture of the data set to be analyzed, and from it we can graphically assess stationarity (i.e., constant location and scale), the presence of outliers, and seasonal patterns.

Although Box-Jenkins models can estimate seasonal components, the analyst must specify the seasonal period (for example, 12 for monthly data). Seasonal components are common in economic time series; they are less common in engineering and scientific data. Here the run sequence plot is a sequence of four plots of the data set (data plot, auto-correlation plot, histogram, and probability density plot). Each plot highlights a particular characteristic of the data set to be analyzed.

**Step 1.1. Data Plot**

We can draw the following conclusions from the run sequence plot:

- Whether the data show a strong positive correlation (in which case a regression model, linear or exponential, would be the better prognostic model).
- Or whether the data show no significant trend or obvious seasonal pattern.

**Step 1.2. Auto-correlation Plot of Original Data**

Auto-correlation plots (Box and Jenkins, pp. 28-32) are a commonly-used tool for checking randomness in a data set. This randomness is ascertained by computing auto-correlations for data values at varying time lags. If random, such auto-correlations should be near zero for any and all time-lag separations. If non-random, then one or more of the auto-correlations will be significantly non-zero.

The auto-correlation plot has a 95% confidence band, which is constructed based on the assumption that the process is a moving average process. The auto-correlation plot shows that the sample auto-correlations are very strong and positive and decay very slowly.
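The sample auto-correlations behind such a plot are straightforward to compute. The sketch below, in plain Python, evaluates them for a toy trending series together with the approximate 95% white-noise band (±1.96/√n, a large-sample approximation); the series and function names are ours, not from the original analysis.

```python
from math import sqrt

def acf(x, max_lag):
    """Sample auto-correlations r_1 .. r_max_lag of a series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy trending series
r = acf(series, 3)
band = 1.96 / sqrt(len(series))  # approximate 95% white-noise band
print([round(v, 3) for v in r], round(band, 3))
```

A lag whose coefficient falls outside ±`band` would be drawn as "significant" on the plot; for a trending series like this one, the low lags fall well outside the band.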

**Step 1.3. Histogram Plot of Original Data**

The histogram shows that the response appears reasonably symmetric, but with a bimodal distribution. If the histogram does not follow a normal distribution, the data are more likely to be non-random.

**Step 1.4. Probability Density Plot of Original Data**

The normal probability plot of the bimodally distributed data (left figure) shows some curvature, indicating that distributions other than the normal may provide a better fit. For comparison, the right figure shows the normal probability plot of normally distributed data.

**Step 1.5. Statistical Tests for Determining Non-Stationarity**

Sometimes it is not possible to tell whether the data set is stationary just by looking at the auto-correlation plots. Two other statistical tests check for stationarity of the data set.

The first is the Augmented Dickey-Fuller test (`adf.test` in R). If the p-value from this test is p > 0.05, the data are non-stationary; if p < 0.05, the data are stationary and no differencing is needed.

The second is the Kwiatkowski-Phillips-Schmidt-Shin test (`kpss.test` in R). If the p-value from this test is p < 0.05, the data are non-stationary.
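The two decision rules above can be combined into a small helper. This is only a sketch of the text's p-value rules (alpha = 0.05); the function name and the example p-values are hypothetical.

```python
def needs_differencing(adf_p, kpss_p, alpha=0.05):
    """Combine the two tests per the rules above:
    ADF:  p > alpha -> non-stationary (differencing needed)
    KPSS: p < alpha -> non-stationary (differencing needed)"""
    return adf_p > alpha or kpss_p < alpha

# Hypothetical p-values for illustration:
print(needs_differencing(adf_p=0.40, kpss_p=0.01))  # both tests flag non-stationarity -> True
print(needs_differencing(adf_p=0.01, kpss_p=0.20))  # both indicate stationarity -> False
```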

**Step 2. MODEL SELECTION**

In order to forecast the data set accurately, we need to determine the best-fitting model.

**Step 2.1. Create a Stationary Data Set (Differenced Data Set)**

If we detected non-stationarity in the previous step, we need to remove it from the data by differencing. In this step we create a new data set (d1) by shifting the original data set by lag 1 and subtracting the shifted copy from the original. In this way we have differenced the data. If non-stationarity is not present, we skip this step and our model automatically becomes an ARMA(p, q) model.
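The differencing step described above amounts to subtracting the lag-shifted series from the original. A minimal sketch in Python (the toy series is ours):

```python
def difference(x, lag=1):
    """Difference a series: d[t] = x[t] - x[t - lag]."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

original = [2.0, 4.0, 7.0, 11.0, 16.0]  # toy non-stationary (trending) series
d1 = difference(original)               # lag-1 differenced data set
print(d1)                               # [2.0, 3.0, 4.0, 5.0]
```

Note that the differenced series is one point shorter per difference taken.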

**Step 2.2. Run Sequence Plot of the Differenced Data**

In this step we generate the run sequence plot of the differenced data. It should show no obvious trends or seasonality.

**Step 2.2.1. Auto-correlation Plot of the Differenced Data**

In order to determine whether the differenced data set is stationary, we generate auto-correlation plots of the differenced data. If the auto-correlation plot shows no significant lags, the differenced data are less auto-correlated than the original data and the data set can be modeled as ARIMA(0,1,0). However, if the auto-correlation plot shows a decaying trend of significant lags, we may need to difference the data one more time and generate auto-correlation plots of the newly created data set (d2). If the auto-correlation plots of d2 show no significant lags, the data set can be modeled as ARIMA(0,2,0). If d2 shows only a few significant lags, we need to investigate the data set further to determine the proper ARIMA model. If the data set stays non-stationary (decaying auto-correlation lags) even after a third differencing, we recommend using some other prognostic tool to model the data.

**Step 2.3. Statistical Procedure to Determine Number of Differencing**

As was the case with testing for stationarity, sometimes the run sequence or auto-correlation plots cannot tell us how many differences are needed to make the series stationary. In that case we can run unit root tests to determine the least number of differences required to pass the test at level alpha. The tests we used to determine the minimum number of differences needed are the Kwiatkowski-Phillips-Schmidt-Shin and Augmented Dickey-Fuller tests.

The R function that evaluates the number of differences needed is `ndiffs(x, alpha=0.05, test=c("kpss", "adf", "pp"))`. However, we should be careful in applying the results of these statistical tests to our ARIMA modeling: the test gives us a finite number of differences needed to make the data set stationary, but that number might be unacceptable for meaningful ARIMA forecasting. Many statisticians recommend not differencing data more than 3 times to make them stationary; if more differencing is needed, they recommend not using an ARIMA model to forecast the data set.
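The idea behind `ndiffs` (including the advice to stop at 3 differences) can be sketched as a loop that differences until a supplied stationarity check passes. The `is_stationary` predicate below is a stand-in for the KPSS/ADF tests, and the variance-based check used in the demonstration is purely illustrative:

```python
from statistics import pvariance

def difference(x):
    return [x[t] - x[t - 1] for t in range(1, len(x))]

def ndiffs_sketch(x, is_stationary, max_d=3):
    """Return the smallest number of differences (0..max_d) after which
    is_stationary(series) is True, or None if max_d is not enough."""
    series = list(x)
    for d in range(max_d + 1):
        if is_stationary(series):
            return d
        series = difference(series)
    return None

# Toy check: a quadratic trend becomes constant after two differences.
flat = lambda s: pvariance(s) < 1e-9  # stand-in stationarity test
print(ndiffs_sketch([t * t for t in range(10)], flat))  # 2
```

When `ndiffs_sketch` returns `None`, the text's advice applies: a model other than ARIMA should be considered.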

**Step 2.4. Determining AR (Autoregressive) and MA (Moving Average) Parameters**

The next step is to examine the auto-correlation and partial auto-correlation plots of the differenced data. In order to detect which AR(p) and MA(q) model is appropriate, we examine the properties and relationship of these plots. Table 1 shows the properties of the auto-correlation function (ACF) and partial auto-correlation function (PACF) plots for ARMA models.

**Step 2.4.1. Auto-correlation Plot of the Differenced Data**

In this case, the auto-correlation plot of the differenced data with a 95% confidence band shows that only the auto-correlation at lag 1 is significant. The plot also shows that the remaining auto-correlation coefficients look like damped sine waves. This indicates an AR(1) model. However, to confirm the AR(1) model, the partial auto-correlation plot should show two things:

- Only the lag 1 coefficient should be significant.
- The plot should be finite, and the coefficients should cut off after lag 1 (Table 1).

**Step 2.4.2. Partial Auto-correlation Plot of the Differenced Data**

The partial auto-correlation plot of the differenced data shows that, instead of only the first lag being significant, the first two lags are significant (crossing the 95% confidence band). This part of the plot (red circle) suggests an AR(2) model. However, we can also see that the rest of the lag coefficients (green box) do not cut off after p lags, but damp like a sine wave, suggesting that an ARIMA model is needed to accurately forecast the series.
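The partial auto-correlations behind such a plot can be computed from the sample auto-correlations with the Durbin-Levinson recursion. A self-contained sketch (the toy series and names are ours):

```python
def sample_acf(x, max_lag):
    """Sample auto-correlations r_1 .. r_max_lag."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    den = sum(v * v for v in d)
    return [sum(d[t] * d[t - k] for t in range(k, n)) / den
            for k in range(1, max_lag + 1)]

def pacf(x, max_lag):
    """Partial auto-correlations via the Durbin-Levinson recursion."""
    r = sample_acf(x, max_lag)
    phi = [[0.0] * (max_lag + 1) for _ in range(max_lag + 1)]
    phi[1][1] = r[0]
    out = [r[0]]                      # PACF at lag 1 equals ACF at lag 1
    for k in range(2, max_lag + 1):
        num = r[k - 1] - sum(phi[k - 1][j] * r[k - 1 - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1][j] * r[j - 1] for j in range(1, k))
        phi[k][k] = num / den
        for j in range(1, k):
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
        out.append(phi[k][k])
    return out

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print([round(v, 3) for v in pacf(series, 2)])
```

As on the plots discussed above, PACF coefficients outside the ±1.96/√n band would be drawn as significant.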

As we have seen from the previous example, if the data are from an ARIMA(p,d,0) or ARIMA(0,d,q) model, the ACF and PACF plots can help determine the value of p or q, as suggested in Table 1. However, if the ACF and PACF plots fall under the rules in the third column of Table 1, we need a different approach to find the best forecasting model.

One way to do this is to run a number of ARIMA models with different p and q combinations and determine which model scores lowest on one of the selected statistical criteria. A few criteria are available for comparing the forecast results of different ARIMA(p,d,q) models. The most used are Akaike's Information Criterion (AIC), corrected Akaike's Information Criterion (AICc), and the Bayesian Information Criterion (BIC). The lowest value on these criteria indicates the best ARIMA model.
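As a sketch of this selection procedure: using one common form of the AIC for least-squares fits, AIC = n·ln(SSE/n) + 2k, we can compare candidate models by their residual sums of squares and parameter counts. The SSE values below are hypothetical, chosen only to illustrate the comparison:

```python
from math import log

def aic(n, sse, k):
    """One common least-squares form of Akaike's Information Criterion:
    k estimated parameters, n points, residual sum of squares sse."""
    return n * log(sse / n) + 2 * k

# Hypothetical fits of three candidate models to n = 100 points:
candidates = {
    "ARIMA(1,1,0)": aic(100, sse=52.0, k=1),
    "ARIMA(0,1,1)": aic(100, sse=55.0, k=1),
    "ARIMA(1,1,1)": aic(100, sse=50.0, k=2),
}
best = min(candidates, key=candidates.get)
print(best, round(candidates[best], 2))
```

Note how the extra parameter of ARIMA(1,1,1) is penalized by the 2k term; it wins here only because its (hypothetical) SSE is enough lower.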

**Step 3. MODEL DIAGNOSTIC**

Once the model is selected, the randomness of the residuals (the error terms, e.g., y_t − y_{t−1} = ε_t) needs to be checked. The randomness of the residuals is checked using a Portmanteau test (the Ljung-Box test, `Box.test` in R). If the p-value is > 0.05, the residuals are uncorrelated (in other words, the residuals are white noise). If p < 0.05, the residuals are correlated, and the procedure should be repeated until an ARIMA model with uncorrelated residuals is found. The corrected model should then be used for forecasting.
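The statistic behind this Portmanteau test is Q = n(n+2) · Σ r_k²/(n−k) over lags k = 1..h, compared against a chi-square critical value with h degrees of freedom. A plain-Python sketch on deliberately correlated toy "residuals" (the series and names are ours):

```python
def ljung_box_q(resid, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n-k)."""
    n = len(resid)
    m = sum(resid) / n
    d = [v - m for v in resid]
    den = sum(v * v for v in d)
    q = 0.0
    for k in range(1, h + 1):
        r_k = sum(d[t] * d[t - k] for t in range(k, n)) / den
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q

resid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # clearly correlated "residuals"
q = ljung_box_q(resid, h=1)
# The chi-square 95th percentile with 1 degree of freedom is about 3.84;
# q above that value means the residuals are correlated (not white noise).
print(round(q, 3), q > 3.84)
```

R's `Box.test` converts Q into the p-value the text refers to; here we compare Q to the critical value directly to avoid needing a chi-square CDF.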

As we have done before, we can check the randomness of the residuals graphically with a run sequence plot of the residuals. The residuals' randomness (presence of white noise) is determined by looking at auto-correlation plots (no lags beyond lag 0 should be significant), distribution plots (a normal distribution should be present), and probability plots (a straight line).

**Step 4. FORECASTING AND FITTING THE MODEL**

The last step in this procedure is fitting the forecasted model to the original data and plotting the forecasted model with confidence bands.
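For the simplest case, ARIMA(0,1,0) (a random walk), the point forecast at every horizon is just the last observed value, and the 95% confidence band widens with the square root of the horizon. The sketch below covers only this random-walk case, not the general ARIMA forecast; the toy series is ours:

```python
from math import sqrt

def random_walk_forecast(x, horizon):
    """Point forecasts and 95% bands for an ARIMA(0,1,0) (random walk) model:
    the h-step forecast is the last observation, with variance growing as h."""
    d = [x[t] - x[t - 1] for t in range(1, len(x))]
    mean_d = sum(d) / len(d)
    sigma = sqrt(sum((v - mean_d) ** 2 for v in d) / (len(d) - 1))
    last = x[-1]
    return [(last, last - 1.96 * sigma * sqrt(h), last + 1.96 * sigma * sqrt(h))
            for h in range(1, horizon + 1)]

series = [10.0, 11.0, 10.5, 11.5, 12.0]
for point, lo, hi in random_walk_forecast(series, 3):
    print(point, round(lo, 2), round(hi, 2))
```

Plotting these widening bands around the point forecasts gives exactly the "forecast with confidence bands" picture this step describes.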

If you are interested in hands-on experience implementing ARIMA modeling, please follow this link, or visit our EDUCATION page.