Harmonizing Variable Names


We are using the SPSS .sav files as a data resource from GESIS. The SPSS files are combining data with metadata labels. Furthermore, GESIS often includes metadata, including constants, into the SPSS file. The label_normalize() function is used to create a first suggestion for a variable name using the GESIS SPSS file’s variable labels.

The normalization rules can be reviewed in the source code of the normalize_names() function.

The variable names are following the underscore_separatednaming convention which is often called the snake_case_convention. We are using prefix and suffix conventions for easy separation and detachments of certain special variables and types.

  • we are removing all special characters that may be understood as regular expressions, such as %, ], and all whitespace.

  • We use Eurostat abbreviations similar to some Eurostat metadata vocabularies for greater than, lesser than and similar quasi-arithmetic expressions, and pct for percent, percentage or the % sign.

Prefix conventions

  • id_: easy selection and detachments of constant metadata, such as the document object identifier or the version number of the data file; except for the uniqid variable, which is an observation / row identifier in almost all files.

  • mc_ : multiple choice question items. These are logically related variables that are given in the questionnaire in tabular format, but they are coded both in SPSS and R item by item as separate variables (columns). For the same (structured) question, we always use the mc_this_question_item_1, mc_this_question_item_2, mc_this_question_item_dk naming convention, where the _dk suffix contains the indication that the respondent rejected the entire complex question.

  • is_ : indicator questions, which may created from binary answer options, or from multiple-choice like questions without clear structure.

These questions can easily be selected for example with the tidyselect::starts_with(“mc_”) selector which is widely used in the tidyverse.

Suffix conventions

Currently we are only using suffixes for questionnaire variations. The most likely variation is related to the fact that some fieldwork areas are not part of the core Eurobarometer territory, and for some reason they separated in the original data file, because the respondent received a shorter, or slightly modified questionnaire. In these cases, some of the questions are the same as in the core questionnaire, and can be imputed to the core question.

  • _tcc: Question variation for members of the Turkish Cypriot community

  • _mk: Question variation used in North Macedonia.

Preferred expressions

With often used variables, we try to use the nearest lower_snake_case version of the regularly used questionnaire name without alphanumeric IDs, because we believe that the researchers of Eurobarometer are familiar with these names. We use this for example with age_exact, type_of_community and some other often used variables.

Often, repeating (trend) questions can be found under several, often dissimilar labels. In these cases, we are choosing a preferred version of the variable name, and we are creating metadata tables to bring other variations to this format.

Less frequent questions and variables

In less frequently used questions, questionnaire items we are creating programmatic variable names using roughly the convention of rOpenSci. In these cases the creation of time series or data panels is less likely, or requires more work, because the questions or the questionnaire items are not the same. Our approach in this case is to create metadata tables for naming that are following topics. We hope to find contributors who are familiar with certain topics, and can organize the similar but not exactly the same variables into topical groups with similar variable names, such as variables related to trust in institutions or climate change. In these cases we follow general rules:

  • The variable name starts with a more general concept and goes towards a specificity

  • When applicable, we prefer the rOpenSci object_verb() organization (see: Function and argument naming), but overrule it for convenience or avoiding awkward variable names.

  • Only fully identical questionnaire items have identical names

  • Similar questionnaire items have similarly structured names for easier selection and comparison

  • We try to avoid extremely long variable names

The preferred variable names are stored in a topical metadata table that contains standardized topical keywords, and serves as a data and questionnaire map to the researcher.

Example Metadata Table

This metadata table is created by not_included/vignette_vocabulary_examples.R. It is a good practice to create the metadata tables programmatically for reproducibility.

After the modification of the source code of this R file, data-raw/create_categorical_var.R can update the data file. The documentation of the metadata table should be updated in R/data-category_label_2.R

Similar metadata tables should help the naming for similar variables (for example, in this case, often used two-level categories) or for topics (such as trust in institutions for not two-level variables, or air pollution and climate change related questions.)

data("categorical_variables_2", package = "eurobarometer")

categorical_variables_2 %>%
  kable() %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 ) %>%
    add_header_above(c("Filtering" = 3,
                       "Preferred Term" = 1,
                       "Keywords" = 2)
Preferred Term
filename r_name normalized_names canonical_names keyword_1 geo_qualifier
ZA6643_v4-0-0.sav qa8a_1 trust_in_institutions_written_press trust_in_written_press written_press NA
ZA6643_v4-0-0.sav qa8a_1 trust_in_institutions_written_press trust_in_written_press written_press NA
ZA6643_v4-0-0.sav qa8a_1 trust_in_institutions_written_press trust_in_written_press written_press NA
ZA6643_v4-0-0.sav qa8a_1 trust_in_institutions_written_press trust_in_written_press written_press NA
ZA6643_v4-0-0.sav qa8b_1 trust_in_institutions_written_press_tcc trust_in_written_press_tcc written_press tcc
ZA6643_v4-0-0.sav qa13_3 european_central_bank_trust trust_in_european_central_bank european_central_bank NA
ZA6928_v1-0-0.sav qa14_1 european_parliament_trust trust_in_european_parliament european_parliament NA
ZA6928_v1-0-0.sav qa14_3 european_central_bank_trust trust_in_european_central_bank european_central_bank NA
ZA7576_v1-0-0.sav qa14_1 european_parliament_trust trust_in_european_parliament european_parliament NA

We must decide early on the column names of the metadata table itself, i.e. filename, r_name, normalized_names, canonical_names, keyword_1, geo_qualifier.