Chapter 4 Data processing
4.1 Creating the NTS database
The edited survey data was prepared for analysis and reporting before being delivered to DfT. This chapter outlines the protocols followed during post-processing based on methods and scripts originally created by DfT.
The data for the survey year was imported into an annual database where a series of processing tasks were carried out to prepare the data for analysis. The database was divided into separate sections, as shown in Table 4.1 below.
Table 4.1: NTS SQL database schemas
Name | Purpose |
---|---|
Data | For importing questionnaire data |
Param | For parameterising data |
DVLA | For processing data from the DVLA database |
Imptn | For imputing data |
Weights | For importing the weighting data |
Random | For storing random numbers used in imputations |
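The schema structure in Table 4.1 can be set up with a short T-SQL script. The sketch below is illustrative only; it follows the schema names in the table, but the database name is a placeholder and the script is not the delivered DfT code.

```sql
-- Illustrative sketch: create an annual NTS database and the schemas listed
-- in Table 4.1. The database name is a placeholder.
CREATE DATABASE NTS_Annual;
GO
USE NTS_Annual;
GO
CREATE SCHEMA Data;     -- imported questionnaire data
GO
CREATE SCHEMA Param;    -- parameterised variables
GO
CREATE SCHEMA DVLA;     -- DVLA vehicle data processing
GO
CREATE SCHEMA Imptn;    -- imputed variables
GO
CREATE SCHEMA Weights;  -- weighting data
GO
CREATE SCHEMA Random;   -- random numbers used in imputations
GO
```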
In addition to creating the NTS annual database, two across-years databases were added to or amended as required:
- the NTS_Info database, which stores information such as the Retail Prices Index (RPI), school and bank holiday data and concessionary travel schemes
- the NTS_Lookup database, which stores the lookup tables used to attach description labels during analysis.
4.2 Importing the questionnaire data into tables
The metadata documents were used to create SQL scripts to import the questionnaire data into the NTS annual database. A script was created for each analysis level of the database (PSU, Household, Vehicle, Individual, LDJ, Trip and Stage) and for each multi-coded question. This stage was automated using SQL Server Integration Services (SSIS), which created the tables and imported the data files without the need for user intervention.
Each record of each table was assigned a unique identifier during the import process. Once the import was complete the identifiers were cascaded down to the lower levels, allowing tables to be linked using a single identifier field.
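As a minimal illustration of how the cascaded identifiers are used, the sketch below joins three analysis levels on a single identifier field at each step. The table and column names are assumptions for illustration rather than the actual NTS schema.

```sql
-- Sketch: lower-level tables carry the identifier of their parent level, so
-- any two levels join on one field. Table and column names are assumptions.
SELECT h.HouseholdID,
       i.IndividualID,
       t.TripID
FROM Data.Household  AS h
JOIN Data.Individual AS i ON i.HouseholdID  = h.HouseholdID
JOIN Data.Trip       AS t ON t.IndividualID = i.IndividualID;
```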
4.3 Parameterisation
Parameterisation is the process of converting variables into a format that is more useful for analysis. Before the parameterisation routines were run, year-on-year changes to variables were identified and the routines amended to deal with the changes. Updates to the routines were reflected in the post-processing documentation. Many of the variables were passed unprocessed into the parameterised tables. For the remainder, several different transformations were applied, such as:
- creating a banded version of continuous variables
- combining several variables into a single analysis variable
- creating summary variables
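As an illustration of the first transformation, a banded version of a continuous variable can be derived with a CASE expression when the parameterised table is built. The variable and band boundaries below are hypothetical and only show the pattern the parameterisation scripts follow.

```sql
-- Sketch: derive a banded age variable for the parameterised Individual table.
-- The column names and band boundaries are assumptions for illustration.
SELECT IndividualID,
       Age,
       CASE
           WHEN Age <= 16             THEN 1  -- 0 to 16
           WHEN Age BETWEEN 17 AND 29 THEN 2  -- 17 to 29
           WHEN Age BETWEEN 30 AND 59 THEN 3  -- 30 to 59
           ELSE 4                             -- 60 and over
       END AS AgeBand
FROM Data.Individual;
```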
Some variables that were parameterised were themselves used in the construction of subsequent variables, so these were created first. This stage was also automated using SSIS to run the parameterisation scripts in the required order. The relationship between the data in the import and parameterisation tables was recorded in the dependency documentation.
As part of this process, data from the DVLA database was linked to vehicles for which a registration mark was provided.
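A sketch of that linkage is shown below, assuming the DVLA extract is keyed on registration mark; the table and column names are illustrative assumptions.

```sql
-- Sketch: attach DVLA-derived details to survey vehicles that supplied a
-- registration mark. Table and column names are assumptions.
SELECT v.VehicleID,
       v.RegistrationMark,
       d.FuelType,
       d.EngineCapacity
FROM Data.Vehicle AS v
LEFT JOIN DVLA.VehicleRecord AS d
       ON d.RegistrationMark = v.RegistrationMark
WHERE v.RegistrationMark IS NOT NULL;
```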
4.4 Imputation
Several variables underwent an imputation process where missing values were derived by looking at other known data. Again, this stage of the post-processing was automated using SSIS, which ran the imputation routines in a specific order due to the dependencies between variables. A variety of techniques were used in the imputation routines. Each routine was documented individually giving details of the methods used. Some routines required the use of random numbers to determine how cases should be allocated. These routines used random number tables that were created at the beginning of the process and retained, so that the results would be repeatable should the imputations need to be carried out again.
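The pattern of drawing random numbers once, storing them in the Random schema and re-using them is sketched below. The imputed variable, the 50/50 allocation and the table and column names are assumptions for illustration only.

```sql
-- Sketch: store one uniform random draw per case needing imputation, then
-- allocate from the stored draws so a re-run gives the same result.
-- Variable names, table names and the 50/50 split are assumptions.
CREATE TABLE Random.IndividualDraws (IndividualID int PRIMARY KEY, Draw float);

INSERT INTO Random.IndividualDraws (IndividualID, Draw)
SELECT IndividualID, RAND(CHECKSUM(NEWID()))   -- per-row uniform random number
FROM   Imptn.Individual
WHERE  TicketType IS NULL;                     -- cases needing imputation

UPDATE i
SET    i.TicketType = CASE WHEN r.Draw < 0.5 THEN 1 ELSE 2 END
FROM   Imptn.Individual AS i
JOIN   Random.IndividualDraws AS r ON r.IndividualID = i.IndividualID
WHERE  i.TicketType IS NULL;
```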
4.5 Adding weights
The weighting data was imported into the NTS annual database. Each set of weights was imported into a separate table, as shown in Table 4.2 below.
Table 4.2: Importing weights
Name | Level | Table |
---|---|---|
Interview sample weights | Household | Weights.Interview Sample |
Fully responding weights | Household | Weights.FC Sample |
Long-distance journey weights | LDJ | Weights.LDJ |
Short walk weights | Trip | Weights.Short Walks |
Diary drop off weights | Trip | Weights.Diary |
Self-completion weights | Individual | Weights.Self Completion |
4.6 Creating trip and stage numerics
To enable analysis of trip and stage level data with the correct handling of short walks and series of calls, the following grossing factors were attached to the trip and stage imputation tables, as shown in Table 4.3 below.
Table 4.3: Grossing factors for trip and stage imputation tables
Table | Variable | Description |
---|---|---|
Imptn.Stage | SSXSC | Number of stages, grossed for short walks, excluding ‘series of calls’ trips |
Imptn.Stage | SD | Stage distance travelled, grossed for short walks |
Imptn.Stage | STTXSC | Travelling time grossed for short walks, excluding ‘series of calls’ trips |
Imptn.Trip | JJXSC | Number of trips, grossed for short walks, excluding ‘series of calls’ trips |
Imptn.Trip | JD | Trip distance travelled, grossed for short walks |
Imptn.Trip | JOTXSC | Overall trip time, grossed for short walks, excluding ‘series of calls’ trips |
Imptn.Trip | JTTXSC | Travelling time, grossed for short walks, excluding ‘series of calls’ |
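The derivations of these variables sit in the processing scripts; as a purely hypothetical sketch of the pattern, JJXSC could be built by counting each trip once, grossing by a short walk weight and zeroing series-of-calls trips. The weight and flag column names below are assumptions.

```sql
-- Hypothetical sketch of the JJXSC pattern: trips are grossed for short walks
-- and series-of-calls trips are excluded. W_ShortWalk and SeriesOfCalls are
-- assumed column names, not the actual NTS variables.
SELECT TripID,
       CASE WHEN SeriesOfCalls = 1 THEN 0
            ELSE W_ShortWalk
       END AS JJXSC
FROM Imptn.Trip;
```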
4.7 Combining long-distance journey data
Due to the infrequency of longer distance trips, additional long-distance journey (LDJ) data is collected for the week preceding the placement interview. To allow analysis of all long-distance trips, these LDJ records were combined with diary trips of over 50 miles into a single table.
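A sketch of the combination step is shown below, assuming distances are held in miles; the table and column names other than JD (see Table 4.3) are assumptions.

```sql
-- Sketch: stack the recall-week LDJ records with diary trips of over 50 miles
-- into a single long-distance table. Names other than JD are assumptions.
SELECT LDJID          AS JourneyID,
       Distance,
       'Recall week'  AS Source
FROM Imptn.LDJ
UNION ALL
SELECT TripID         AS JourneyID,
       JD             AS Distance,
       'Travel diary' AS Source
FROM Imptn.Trip
WHERE JD > 50;   -- diary trips of over 50 miles
```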
4.8 Creating household income semi-deciles and quintiles
To allow analysis of trip behaviour by income on a comparable basis, households were categorised into income bands based on a measure of household affluence known as equivalised household income. Equivalisation adjusts a household's stated income to take account of the household's size and composition, and was carried out using the McClements scale.
Incomes were also adjusted for inflation to facilitate analysis across time periods; this adjustment used the RPI value for the month in which the interview was carried out.
Household income bands were converted to values using the median values of the corresponding bands from the Family Resources Survey 2021 to 2022 (the same data source as was used for NTS 2022).
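A sketch of how the banded values could be equivalised, adjusted for inflation and then cut into semi-deciles and quintiles is given below. The McClements factor, the form of the RPI adjustment and all table and column names are assumptions for illustration; only the use of twenty and five equal-sized groups reflects the section above.

```sql
-- Sketch: equivalise income, adjust it for inflation, then assign income
-- semi-deciles and quintiles with NTILE. Names and the exact adjustment
-- formula are assumptions.
WITH AdjustedIncome AS (
    SELECT h.HouseholdID,
           (h.IncomeBandMedian / h.McClementsFactor)  -- equivalised income
           * (rpi.BaseIndex / rpi.MonthIndex)         -- inflation adjustment
           AS RealEquivIncome
    FROM Param.Household AS h
    JOIN NTS_Info.dbo.RPI AS rpi
      ON rpi.YearMonth = h.InterviewYearMonth
)
SELECT HouseholdID,
       NTILE(20) OVER (ORDER BY RealEquivIncome) AS IncomeSemiDecile,
       NTILE(5)  OVER (ORDER BY RealEquivIncome) AS IncomeQuintile
FROM AdjustedIncome;
```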
4.9 Adding holidays data
The holidays database was extended to incorporate dates up to the end of March 2024, using data supplied by DfT. Prior to 2016 this data was provided for each local authority, but this level of detail is no longer available. Consequently, the school holiday dates from 2016 onwards represent the national average.
A code to indicate holiday status (that is, weekend, bank holiday, school holiday or term time) was then added to each day record in the annual NTS database to enable analysis of trip data by travel day type.
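A sketch of the day-type coding is shown below, assuming a day-level table in the annual database and bank and school holiday lookups in NTS_Info; the code values and all names are illustrative.

```sql
-- Sketch: classify each travel day as weekend, bank holiday, school holiday
-- or term time. Table names and code values are assumptions.
UPDATE d
SET d.DayType = CASE
        WHEN DATENAME(weekday, d.TravelDate) IN ('Saturday', 'Sunday') THEN 1  -- weekend
        WHEN bh.HolidayDate IS NOT NULL THEN 2                                 -- bank holiday
        WHEN sh.HolidayDate IS NOT NULL THEN 3                                 -- school holiday
        ELSE 4                                                                 -- term time
    END
FROM Imptn.Day AS d
LEFT JOIN NTS_Info.dbo.BankHolidays   AS bh ON bh.HolidayDate = d.TravelDate
LEFT JOIN NTS_Info.dbo.SchoolHolidays AS sh ON sh.HolidayDate = d.TravelDate;
```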
4.10 Adding concessionary travel data
The following variables from the DfT’s latest annual concessionary travel survey were added to the NTS_Info database for each Local Authority, as shown in Table 4.4 below.
Table 4.4: Concessionary travel data variables
Variable | Description |
---|---|
ConcTravElig | Eligibility for elderly person concessionary travel scheme |
ConcTravFare | Type of bus fare concession (free since 2008) |
ConcTravTimes | Times offered for concessionary bus travel |
ConcTravAreas | Areas offered for concessionary bus travel (national concession since 2008) |
ConcTravOther | Any other concessions offered to elderly people |
ConcTravModes | Any additional modes offered to elderly people (multi-coded) |
These variables were then appended to the records of the Household table of the annual NTS database using the Local Authority code to link to the relevant data.
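A sketch of that append step is shown below, joining on the Local Authority code; the household and lookup column names are assumptions, and the multi-coded ConcTravModes variable is assumed to be held in its own table.

```sql
-- Sketch: append the single-coded concessionary travel variables to household
-- records via the Local Authority code. Column names are assumptions.
UPDATE h
SET h.ConcTravElig  = c.ConcTravElig,
    h.ConcTravFare  = c.ConcTravFare,
    h.ConcTravTimes = c.ConcTravTimes,
    h.ConcTravAreas = c.ConcTravAreas,
    h.ConcTravOther = c.ConcTravOther
FROM Param.Household AS h
JOIN NTS_Info.dbo.ConcessionaryTravel AS c
  ON c.LocalAuthorityCode = h.LocalAuthorityCode;
```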