At this point this workflow is purely hypothetical: the functions are not written, and they are not evaluated.

Acquiring the data from GESIS

I do not see a lot of sense in automating this, because the new GESIS interface requires a lot of approvals and interaction. However, I think that the files should be in a single folder. Maybe this could be a tempdir() but I am not very enthusiastic about it because it causes documentation issues.

Reading in the files

Working with native SPSS files is extremely slow, and I think it would make sense to first read and re-save them as .rds files or .rda files.
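The re-saving step could be sketched as follows. This is only an illustration under stated assumptions: cache_as_rds() is a hypothetical helper, not a function of the package. In practice the reader would be an SPSS import function such as haven::read_spss; the demonstration uses a small CSV file and read.csv() so that it runs with base R only.

```r
## Hypothetical helper (not part of the package): read a file once,
## keep an .rds copy next to it, and reuse that copy afterwards.
cache_as_rds <- function(path, reader) {
  rds_path <- paste0(tools::file_path_sans_ext(path), ".rds")
  if (!file.exists(rds_path)) {
    saveRDS(reader(path), rds_path)   # the slow import happens only once
  }
  readRDS(rds_path)
}

## Demonstration with a CSV stand-in for an SPSS file (invented data).
data_dir <- file.path(tempdir(), "gesis_files")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "example_survey.csv")
write.csv(data.frame(uniqid = c("1", "2"), w1 = c(0.9, 1.1)),
          csv_path, row.names = FALSE)

first_read  <- cache_as_rds(csv_path, read.csv)  # imports and caches
second_read <- cache_as_rds(csv_path, read.csv)  # loads the .rds copy
```

Keeping the .rds copies in the same single folder as the original files means that later sessions never have to touch the slow SPSS import again.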

import_file_names <- c(
  ## names of the survey files to import
)

my_survey_list <- read_surveys(
  import_file_names, .f = 'read_example_file' )

Analyse the metadata in the surveys

my_metadata <- gesis_metadata_create( my_survey_list )
names(my_metadata)
#>  [1] "filename"             "qb"                   "var_name_orig"       
#>  [4] "var_label_orig"       "var_label_norm"       "var_name_suggested"  
#>  [7] "length_cat_range"     "length_na_range"      "length_total_range"  
#> [10] "n_categories"         "factor_levels"        "valid_range"         
#> [13] "na_levels"            "class_orig"           "conversion_suggested"

The my_metadata table is rather large, so let’s review random samples from a few columns.

The variable names need to be harmonized, and you will get suggestions on how to do it. Only a very small subset of the entire table is shown here:

filename       var_name_orig  var_label_orig                           var_name_suggested
ZA7576_sample  isocntry       COUNTRY CODE - ISO 3166                  country_code_iso_3166
ZA7576_sample  p1             DATE OF INTERVIEW                        date_of_interview
ZA7576_sample  qa14_3         EUROPEAN CENTRAL BANK - TRUST            european_central_bank_trust
ZA7576_sample  filename       filename                                 filename
ZA7576_sample  d7             MARITAL STATUS                           marital_status
ZA7576_sample  w1             WEIGHT RESULT FROM TARGET (REDRESSMENT)  weight_result_from_target_redressment
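The suggested names in this table look like snake_case versions of the original labels. A minimal sketch of such a normalization is shown here; suggest_var_name() is a hypothetical helper written for illustration, and the package's own rules may well differ (for example, filename and doi are clearly treated specially).

```r
## Hypothetical helper: turn a variable label into a snake_case name.
suggest_var_name <- function(label) {
  name <- tolower(label)
  name <- gsub("[^a-z0-9]+", "_", name)  # runs of other characters become "_"
  gsub("^_+|_+$", "", name)              # drop leading and trailing "_"
}

## Labels taken from the table above.
labels_orig <- c("COUNTRY CODE - ISO 3166",
                 "EUROPEAN CENTRAL BANK - TRUST",
                 "WEIGHT RESULT FROM TARGET (REDRESSMENT)")
suggested <- suggest_var_name(labels_orig)
suggested
```

For these three labels the sketch reproduces the var_name_suggested values shown in the table.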

The metadata can help identify questionnaire item types, and it suggests a conversion to R classes. Again, only a very small subset of the entire table is shown below:

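library(dplyr)
library(knitr)
library(kableExtra)

## subset_metadata_1 is a small subset of my_metadata
subset_metadata_1 %>%
  arrange(var_name_suggested) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = TRUE,
    font_size = 10
  )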

var_label_norm                         var_name_suggested                     conversion_suggested  length_cat_range  length_na_range
country_code_iso_3166                  country_code_iso_3166                  character             0                 0
date_of_interview                      date_of_interview                      haven_labelled        21                0
digital_object_identifier              doi                                    character             0                 0
european_central_bank_trust            european_central_bank_trust            harmonized_labelled   2                 1
filename                               filename                               character             0                 0
marital_status                         marital_status                         haven_labelled        15                0
weight_result_from_target_redressment  weight_result_from_target_redressment  numeric               0                 0

qb                var_name_orig  var_name_suggested                     conversion_suggested
id                doi            doi                                    character
id                filename       filename                               character
metadata          isocntry       country_code_iso_3166                  character
metadata          p1             date_of_interview                      haven_labelled
socio-demography  d7             marital_status                         haven_labelled
trust             qa14_3         european_central_bank_trust            harmonized_labelled
weights           w1             weight_result_from_target_redressment  numeric

  • The id groups serve only identification purposes, and will be used in the skeleton of the panel.
  • The metadata group relates to information about the responses. We include here the country ID because it determines the weight(s) to be used.
  • The weights group shows the weights calculated by Kantar or GESIS.
  • The socio-demography group shows identified socio-demography variables for which eurobarometer provides a built-in harmonization tool. There may be other variables that the researcher would like to add to this group by overriding the qb value of certain question ids.
  • The trust group relates to a variable group covered by our example harmonization table trust_table.
  • The not_identified group contains variables for which we do not offer a full-scale built-in harmonization. However, some of our helper functions do help to harmonize these variables, too, but they require more manual programming work by the user.
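Overriding the qb value, as mentioned for the socio-demography group, could look like the sketch below. It works on a toy metadata table: d7 and qa14_3 come from the example above, while d_hypothetical is an invented variable name used only for illustration.

```r
## Toy metadata table; "d_hypothetical" is an invented variable name.
toy_metadata <- data.frame(
  var_name_orig = c("d7", "d_hypothetical", "qa14_3"),
  qb            = c("socio-demography", "not_identified", "trust"),
  stringsAsFactors = FALSE
)

## Move selected variables into the socio-demography question block.
override_vars <- c("d_hypothetical")
to_override <- toy_metadata$var_name_orig %in% override_vars
toy_metadata$qb[to_override] <- "socio-demography"

table(toy_metadata$qb)
```

Because the harmonization functions select variables by their qb value, such an override would be enough to route a variable into a different block.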

Creating a skeleton panel

## filter out the most basic and omnipresent id variables, and the
## most basic weights, w1 and its projected version wex

my_panel <- panel_create(
  survey_list = my_survey_list,
  ## must be at least two, and one must be the uniqid
  ## of the file or row_id
  id_vars = c("uniqid", "doi")
)

names(my_panel)
#> [1] "panel_id" "uniqid"   "doi"

Let’s have a look at 6 randomly selected rows:

panel_id                            uniqid     doi
1638_320000120_doi_10_4232_1_12847  320000120  doi:10.4232/1.12847
3026_350000010_doi_10_4232_1_13393  350000010  doi:10.4232/1.13393
201_200002000_doi_10_4232_1_13393   200002000  doi:10.4232/1.13393
4058_390002749_doi_10_4232_1_12847  390002749  doi:10.4232/1.12847
3294_350000508_doi_10_4232_1_13393  350000508  doi:10.4232/1.13393
142_999999999_doi_10_4232_1_12847   999999999  doi:10.4232/1.12847

This should return a very basic file for joining, a data.frame/tibble with

  • a truly unique id;
  • id elements for joining in individual survey data.

In this example, uniqid and filename must be present in all imported files. The filename was added by read_surveys().
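How the truly unique id might be put together can be sketched as follows. make_panel_id() is a hypothetical helper, and the assumption that the first component of panel_id is a row number is only inferred from the sample rows shown earlier in this section; the uniqid and doi values are taken from the same rows.

```r
## Hypothetical helper: build a unique panel_id from a row number
## (an assumption), the survey's uniqid, and the doi of the file.
make_panel_id <- function(row_id, uniqid, doi) {
  doi_clean <- gsub("[^A-Za-z0-9]+", "_", doi)  # "doi:10.4232/1.12847" -> "doi_10_4232_1_12847"
  paste(row_id, uniqid, doi_clean, sep = "_")
}

skeleton <- data.frame(
  row_id = c(1638, 201),
  uniqid = c("320000120", "200002000"),  # ids kept as character
  doi    = c("doi:10.4232/1.12847", "doi:10.4232/1.13393"),
  stringsAsFactors = FALSE
)
skeleton$panel_id <- make_panel_id(skeleton$row_id, skeleton$uniqid, skeleton$doi)
skeleton$panel_id
```

Combining the file-level doi with the within-file uniqid is what keeps the id unique when the same uniqid occurs in several surveys.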

Harmonizing various aspects of the survey

## The ids are all harmonized to character values; they are
## not consistent in the original SPSS files.
id_panel        <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "id",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Weights are harmonized to numeric.
weight_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "weights",
  id_vars = c("uniqid", "doi"),
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Metadata is harmonized to various classes, but mainly character.
metadata_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "metadata",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

## Demography panel is harmonized uniquely, but this is not well
## developed yet.
demography_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "socio-demography",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

Value Harmonization

The trust panel contains harmonized labelled variables. They can be converted to a harmonized numeric value with consistent binary values and consistent treatment of inappropriate and declined values.

This currently works only with binary variables; the rest are converted to character.

trust_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "trust",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested" )

The trust_vector below contains the harmonized numeric values and harmonized labels alongside the original values and the original labels. This means that all conversion is reversible.

trust_vector <- unique(trust_panel$council_of_the_eu_trust)
str(trust_vector)
#>  dbl+lbl [1:5] 99998,     1,     0, 99999,    NA
#>  @ labels     : Named num [1:4] 99998 1 0 99999
#>   ..- attr(*, "names")= chr [1:4] "decline" "tend_to_trust" "tend_not_to_trust" "inap"
#>  @ labels_orig: Named num [1:4] 99998 1 0 99999
#>   ..- attr(*, "names")= chr [1:4] "DK" "Tend to trust" "Tend not to trust" "Inap. (not 1 in eu28)"
#>  @ num_orig   : Named num [1:4] 99998 1 0 99999
#>   ..- attr(*, "names")= chr [1:4] "3" "1" "2" "9"
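The reversibility can be illustrated with a base R sketch that mimics the attributes printed above. harmonize_binary_sketch() is a hypothetical stand-in, not the package's implementation; the code mapping (3 = DK, 1 = tend to trust, 2 = tend not to trust, 9 = inap) is read off the num_orig attribute in the output.

```r
## Hypothetical stand-in: recode original codes to harmonized values and
## keep the original codes and labels as attributes.
harmonize_binary_sketch <- function(x) {
  ## original code (name) -> harmonized value, as in num_orig above
  num_orig <- c("3" = 99998, "1" = 1, "2" = 0, "9" = 99999)
  harmonized <- unname(num_orig[as.character(x)])
  attr(harmonized, "labels") <- c(decline = 99998, tend_to_trust = 1,
                                  tend_not_to_trust = 0, inap = 99999)
  attr(harmonized, "labels_orig") <- c("DK" = 99998, "Tend to trust" = 1,
                                       "Tend not to trust" = 0,
                                       "Inap. (not 1 in eu28)" = 99999)
  attr(harmonized, "num_orig") <- num_orig
  harmonized
}

original_codes <- c(1, 2, 3, 9, NA)
harmonized <- harmonize_binary_sketch(original_codes)

## Reverse the conversion using only the stored attribute.
num_orig <- attr(harmonized, "num_orig")
restored <- as.numeric(names(num_orig)[match(harmonized, num_orig)])
```

Since the original codes travel with the vector, no external lookup table is needed to undo the harmonization.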

The harmonize_to_numeric() function correctly gives the numeric value, considering the two sources of missingness.

trust_panel %>%
  mutate_at( vars(-all_of("panel_id")), harmonize_to_numeric ) %>%
  summarize_if ( is.numeric, mean, na.rm=TRUE) %>%
  tidyr::pivot_longer( cols = everything())

name                                                value
council_of_the_eu_trust                             0.5603632
european_central_bank_trust                         0.4880052
european_commission_trust                           0.6032808
european_council_trust                              0.5801233
european_parliament_trust                           0.6386884
trust_in_institutions_army                          0.6897696
trust_in_institutions_european_union                0.5148342
trust_in_institutions_european_union_tcc            0.5376344
trust_in_institutions_justice_legal_system          0.5038724
trust_in_institutions_media                         0.3939707
trust_in_institutions_media_tcc                     0.2600000
trust_in_institutions_national_government           0.4133733
trust_in_institutions_national_parliament           0.3929345
trust_in_institutions_police                        0.6622064
trust_in_institutions_political_parties             0.2315917
trust_in_institutions_political_partiess_tcc        0.1503268
trust_in_institutions_public_administration         0.5322263
trust_in_institutions_reg_local_public_authorities  0.5670477
trust_in_institutions_united_nations                0.5674447
trust_in_institutions_united_nations_tcc            0.5212187
trust_in_institutions_political_parties_tcc         0.1425577

It is likely that your computer’s memory will not be enough to left_join these data tables, so bind them in long format:

panels <- list(
  id_panel
  ## add the other harmonized panels here
)

long_panels <- lapply(
  panels,
  function(x) tidyr::pivot_longer(x, -all_of("panel_id"))
)

panel_long <- do.call(rbind, long_panels)

my_panel <- panel_long %>%
  tidyr::pivot_wider() %>%
  left_join(id_panel, by = 'panel_id')

The advantage of this workflow is that we can separately work on the demography_panel, metadata_panel, trust_panel.

The harmonize_qb in turn systematically takes care of variable types, such as multiple_choice_questions, two-level_factors, three-level_factors, numeric, character for constants, etc.

This approach is not sensitive to missing questions. If some trust questions are present in all files, and others only in a few, you will still get a full panel.
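Why the long format tolerates missing questions can be seen in a small sketch. It uses a base R stand-in for tidyr::pivot_longer() (the hypothetical to_long() helper) so that it runs without extra packages; the two toy surveys and their variable names are invented for illustration.

```r
## Base R stand-in for tidyr::pivot_longer(); toy data with invented names.
to_long <- function(df, id = "panel_id") {
  value_cols <- setdiff(names(df), id)
  data.frame(
    panel_id = rep(df[[id]], times = length(value_cols)),
    name     = rep(value_cols, each = nrow(df)),
    value    = unlist(df[value_cols], use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

survey_a <- data.frame(panel_id = c("a_1", "a_2"),
                       trust_police = c(1, 0), trust_media = c(0, 0))
survey_b <- data.frame(panel_id = c("b_1", "b_2"),
                       trust_police = c(1, 1))  # trust_media was not asked

## Long tables have the same three columns, so they can always be stacked.
panel_long <- do.call(rbind, lapply(list(survey_a, survey_b), to_long))

## Back to wide format: combinations that were never observed become NA.
panel_wide <- tapply(panel_long$value,
                     list(panel_long$panel_id, panel_long$name),
                     function(v) v[1])
panel_wide
```

The respondents of the second survey simply get NA for the question they were never asked, and the panel is still complete in its rows.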


# Print the harmonized and the original value labels
attr(trust_panel$council_of_the_eu_trust, "labels")
#>           decline     tend_to_trust tend_not_to_trust              inap 
#>             99998                 1                 0             99999
attr(trust_panel$council_of_the_eu_trust, "labels_orig")
#>                    DK         Tend to trust     Tend not to trust 
#>                 99998                     1                     0 
#> Inap. (not 1 in eu28) 
#>                 99999
# Summarize them as factors
summary ( labelled::to_factor(trust_panel$council_of_the_eu_trust))
#>           decline     tend_to_trust tend_not_to_trust              inap 
#>              1318              2098              1646              1517 
#>              NA's 
#>              5789
# Summarize them as numeric
summary ( harmonize_to_numeric(trust_panel$council_of_the_eu_trust))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    0.00    0.00    1.00    0.56    1.00    1.00    8624
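The two summaries above are consistent with each other, which can be checked with a short base R sketch. to_numeric_sketch() is a hypothetical stand-in for the conversion, assuming that both the declined (99998) and the inappropriate (99999) codes are treated as missing; the counts are taken from the factor summary.

```r
## Hypothetical stand-in for the conversion: declined and inappropriate
## answers become NA, the binary answers stay 0/1.
to_numeric_sketch <- function(x, missing_codes = c(99998, 99999)) {
  x <- as.numeric(x)
  x[x %in% missing_codes] <- NA
  x
}

## Rebuild the vector from the counts in the factor summary above.
trust_values <- c(rep(99998, 1318), rep(1, 2098), rep(0, 1646),
                  rep(99999, 1517), rep(NA, 5789))
trust_numeric <- to_numeric_sketch(trust_values)

n_missing  <- sum(is.na(trust_numeric))          # 1318 + 1517 + 5789
trust_mean <- mean(trust_numeric, na.rm = TRUE)  # 2098 / (2098 + 1646)
```

Under this assumption the sketch gives 8624 missing values and a mean of about 0.56, matching the numeric summary.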

You can simply revert to the original value labels:

labelled::val_labels(
  trust_panel$council_of_the_eu_trust ) <- attr(
    trust_panel$council_of_the_eu_trust, "labels_orig")

summary ( labelled::to_factor(trust_panel$council_of_the_eu_trust))
#>                    DK         Tend to trust     Tend not to trust 
#>                  1318                  2098                  1646 
#> Inap. (not 1 in eu28)                  NA's 
#>                  1517                  5789

Work Documentation

num_value_range <- unique(trust_panel$trust_in_institutions_media)
#> [1]     1 99998     0 99999    NA
num_value_range[!is.na(num_value_range)]
#> <labelled<double>[4]>
#> [1]     1 99998     0 99999
#> Labels:
#>  value             label
#>      1     tend_to_trust
#>  99998           decline
#>      0 tend_not_to_trust
#>  99999              inap

summarize_values <- data.frame(
  harmonized_numeric = labelled::val_labels(
    trust_panel$trust_in_institutions_media),
  original_numeric = attr(trust_panel$trust_in_institutions_media, "num_orig"),
  original_labels = names(attr(trust_panel$trust_in_institutions_media, "labels_orig"))
)

                   harmonized_numeric  original_numeric  original_labels
tend_to_trust      1                   1                 Tend to trust
decline            99998               99998             DK
tend_not_to_trust  0                   0                 Tend not to trust
inap               99999               99999             Inap. (CY-TCC in isocntry)

Published results

From my correspondence with GESIS, we could have the following outputs:

  • When we first finish a big batch and include it in the panel, we create a trend file that will be published on GESIS as a separate data publication with its own DOI. We can place it, for example, on Zenodo or figshare, too.

  • We can also create a methodological publication from the standard steps we took.

  • We can produce publications of the topical trend elements, probably with other researchers who know more about the topic, such as climate change.

Therefore, we can first exploit whatever we create ourselves, while the workflow remains transparent and reproducible. Other users can clear up other elements, and if their work is good, we can ask them to contribute it to the package.

  • We can also see how Eurobarometer-specific we are, and modify the package or create a variant for other surveys. I think we will not be very Eurobarometer-specific.