At this point this workflow is purely hypothetical: the functions are not written yet, and they are not evaluated.
I do not see much sense in automating this, because the new GESIS interface requires a lot of approvals and interaction. However, I think that the files should be in a single folder. Maybe this could be a tempdir(), but I am not very enthusiastic about it because it causes documentation issues.
Working with native SPSS files is extremely slow, and I think it would make sense to first read and re-save them as .rds files or .rda files.
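The re-saving step could look like the minimal sketch below. It assumes that the SPSS files carry a .sav extension; the helper name resave_as_rds() is hypothetical and not part of the package. haven::read_sav() is the default reader, and the .rds copy is written next to the original file.

```r
# Hypothetical helper (not a package function): cache one SPSS file as .rds
# so that later sessions can skip the slow SPSS import.
resave_as_rds <- function(sav_path, reader = haven::read_sav) {
  stopifnot(grepl("\\.sav$", sav_path, ignore.case = TRUE))
  rds_path <- sub("\\.sav$", ".rds", sav_path, ignore.case = TRUE)
  if (!file.exists(rds_path)) {
    # Import once, then store the result in R's native serialization format
    saveRDS(reader(sav_path), file = rds_path)
  }
  rds_path
}

# Usage, assuming all downloaded files sit in one folder (the path is an assumption):
# sav_files <- list.files("data-raw/gesis", pattern = "\\.sav$", full.names = TRUE)
# rds_files <- vapply(sav_files, resave_as_rds, character(1))
# my_survey_list <- lapply(rds_files, readRDS)
```

Because the helper skips files that already have an .rds copy, it can be re-run safely after each new download.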
my_metadata <- gesis_metadata_create(my_survey_list)
names(my_metadata)
#>  [1] "filename"             "qb"                   "var_name_orig"
#>  [4] "var_label_orig"       "var_label_norm"       "var_name_suggested"
#>  [7] "length_cat_range"     "length_na_range"      "length_total_range"
#> [10] "n_categories"         "factor_levels"        "valid_range"
#> [13] "na_levels"            "class_orig"           "conversion_suggested"
The my_metadata table is rather large, so let’s review random samples from a few columns.
The variable names need to be harmonized, and you will get suggestions on how to do it. Only a very small subset of the entire table is shown here:
| filename | var_name_orig | var_label_orig | var_name_suggested |
|---|---|---|---|
| ZA7576_sample | isocntry | COUNTRY CODE - ISO 3166 | country_code_iso_3166 |
| ZA7576_sample | p1 | DATE OF INTERVIEW | date_of_interview |
| ZA7576_sample | doi | DIGITAL OBJECT IDENTIFIER | doi |
| ZA7576_sample | qa14_3 | EUROPEAN CENTRAL BANK - TRUST | european_central_bank_trust |
| ZA7576_sample | w1 | WEIGHT RESULT FROM TARGET (REDRESSMENT) | weight_result_from_target_redressment |
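To illustrate how a suggested name can be derived from an original label, here is a minimal base R sketch. The helper suggest_var_name() is hypothetical and is not necessarily the rule the package applies; it merely reproduces the suggested names shown in the table above.

```r
# Hypothetical helper: derive a snake_case name from an original variable label.
# This is an illustration only, not the algorithm used by the package.
suggest_var_name <- function(var_label) {
  x <- tolower(var_label)
  x <- gsub("[^a-z0-9]+", "_", x)  # any run of other characters becomes "_"
  gsub("^_+|_+$", "", x)           # drop leading and trailing underscores
}

suggest_var_name(c(
  "COUNTRY CODE - ISO 3166",
  "WEIGHT RESULT FROM TARGET (REDRESSMENT)"
))
```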
The metadata can help to identify questionnaire item types, and it suggests a conversion to R classes. Again, only a very small subset of the entire table is shown below:
subset_metadata_1 %>%
  arrange(var_name_suggested) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    fixed_thead = TRUE,
    font_size = 10
  )
The id group serves only identification purposes and will be used in the skeleton of the panel.
The metadata group relates to information about the responses. We include the country ID here because it determines the weight(s) to be used.
The weights group shows the weights calculated by Kantar or GESIS.
The socio-demography group shows identified socio-demography variables for which eurobarometer provides a built-in harmonization tool. There may be other variables that the researcher would like to add to this group by overriding the qb value of certain question ids.
The trust group relates to a variable group covered by our example harmonization table.
The not_identified group contains variables for which we do not offer a full-scale built-in harmonization. Some of our helper functions do help to harmonize these variables too, but they require more manual programming work by the user.
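Overriding the qb value could be as simple as the sketch below. The toy table and its group assignments are illustrative assumptions, and some_question_id is a placeholder rather than a real Eurobarometer variable.

```r
# Toy excerpt of the metadata table; the qb assignments are illustrative
# assumptions, not the output of gesis_metadata_create().
toy_metadata <- data.frame(
  var_name_orig = c("isocntry", "qa14_3", "w1", "some_question_id"),
  qb = c("metadata", "trust", "weights", "not_identified"),
  stringsAsFactors = FALSE
)

# Move the placeholder question into the socio-demography group
override <- toy_metadata$var_name_orig %in% c("some_question_id")
toy_metadata$qb[override] <- "socio-demography"

table(toy_metadata$qb)
```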
## filter out the most basic and omnipresent id variables, and the
## most basic weights, w1 and its projected version wex
my_panel <- panel_create(
  survey_list = my_survey_list,
  ## must be at least two, and one must be the uniqid
  ## of the file or row_id
  id_vars = c("uniqid", "doi")
)
names(my_panel)
#> [1] "panel_id" "uniqid"   "doi"
Let’s have a look at 6 randomly selected rows:
This should return a very basic file for joining, a data.frame/tibble with
* a truly unique id
* id elements for joining in individual survey data; in this example, filename must be present in all imported files. The filename was added by
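A rough base R sketch of such a skeleton follows. make_panel_skeleton() is a hypothetical stand-in; the real panel_create() may construct panel_id differently, and the toy surveys use placeholder doi values.

```r
# Hypothetical stand-in for the skeleton step; the real panel_create() may
# build panel_id differently.
make_panel_skeleton <- function(survey_list, id_vars = c("uniqid", "doi")) {
  ids <- lapply(survey_list, function(survey) {
    stopifnot(all(id_vars %in% names(survey)))
    # Harmonize every id to character before stacking
    as.data.frame(lapply(survey[id_vars], as.character),
                  stringsAsFactors = FALSE)
  })
  skeleton <- do.call(rbind, ids)
  # Assumption: the combination of the id variables identifies a row uniquely
  skeleton$panel_id <- do.call(paste, c(unname(as.list(skeleton)), sep = "_"))
  stopifnot(!anyDuplicated(skeleton$panel_id))
  skeleton <- skeleton[, c("panel_id", id_vars), drop = FALSE]
  rownames(skeleton) <- NULL
  skeleton
}

toy_surveys <- list(
  data.frame(uniqid = 1:2, doi = "doi:placeholder-a", trust = c(1, 2)),
  data.frame(uniqid = 1:2, doi = "doi:placeholder-b", trust = c(2, 1))
)
names(make_panel_skeleton(toy_surveys))
#> [1] "panel_id" "uniqid"   "doi"
```

The two toy files reuse the same uniqid values, which is why the sketch pastes all id variables together before checking uniqueness.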
## The id's are all harmonized to character value, they are
## not consistent in the original SPSS files.
id_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "id",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested"
)

## Weights are harmonized to numeric.
weight_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "weights",
  id_vars = c("uniqid", "doi"),
  var_name = "var_name_suggested",
  conversion = "conversion_suggested"
)

## Metadata is harmonized to various classes, but mainly character.
metadata_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "metadata",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested"
)

## Demography panel is harmonized uniquely, but this is not well
## developed yet.
demography_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  id_vars = c("uniqid", "doi"),
  question_block = "socio-demography",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested"
)
The trust panel contains harmonized labelled variables. They can be converted to a harmonized numeric value with consistent binary values and consistent treatment of missing values.
This currently works only with binary variables; the rest are converted to character.
trust_panel <- harmonize_qb_vars(
  survey_list = my_survey_list,
  metadata = my_metadata,
  question_block = "trust",
  var_name = "var_name_suggested",
  conversion = "conversion_suggested"
)
The trust_vector below contains the harmonized numeric values and harmonized labels alongside the original values and the original labels. This means that every conversion is reversible.
trust_vector <- unique(trust_panel$council_of_the_eu_trust)
str(trust_vector)
#>  dbl+lbl [1:5] 99998, 1, 0, 99999, NA
#>  @ labels     : Named num [1:4] 99998 1 0 99999
#>   ..- attr(*, "names")= chr [1:4] "decline" "tend_to_trust" "tend_not_to_trust" "inap"
#>  @ labels_orig: Named num [1:4] 99998 1 0 99999
#>   ..- attr(*, "names")= chr [1:4] "DK" "Tend to trust" "Tend not to trust" "Inap. (not 1 in eu28)"
#>  @ num_orig   : Named num [1:4] 99998 1 0 99999
#>   ..- attr(*, "names")= chr [1:4] "3" "1" "2" "9"
The harmonize_to_numeric() function returns the correct numeric value, taking the two sources of missingness into account.
trust_panel %>%
  mutate_at(vars(-all_of("panel_id")), harmonize_to_numeric) %>%
  summarize_if(is.numeric, mean, na.rm = TRUE) %>%
  tidyr::pivot_longer(cols = everything()) %>%
  kable()
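The idea behind the numeric conversion can be sketched in a few lines of base R. to_numeric_sketch() is a hypothetical stand-in, not the package's harmonize_to_numeric(); the codes 99998 (decline) and 99999 (inap) are taken from the labels shown above.

```r
# Hypothetical stand-in for harmonize_to_numeric(); the codes 99998 (decline)
# and 99999 (inap) are taken from the labels shown above.
to_numeric_sketch <- function(x, na_codes = c(99998, 99999)) {
  x <- as.numeric(x)        # drops the label attributes
  x[x %in% na_codes] <- NA  # both sources of missingness become NA
  x
}

mean(to_numeric_sketch(c(99998, 1, 0, 99999, NA)), na.rm = TRUE)
#> [1] 0.5
```

Without this step the special codes would enter the mean as very large numbers, so the share of trusting respondents would be meaningless.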
It is likely that your computer’s memory will not be enough to
left_join these data tables, so bind them in long format:
panels <- list(
  id_panel, trust_panel, demography_panel,
  weight_panel, metadata_panel
)
long_panels <- lapply(
  panels,
  function(x) tidyr::pivot_longer(x, -all_of("panel_id"))
)
panel_long <- do.call(rbind, long_panels)
my_panel <- panel_long %>%
  tidyr::pivot_wider() %>%
  left_join(id_panel, by = "panel_id")
The advantage of this workflow is that we can separately work on the
The harmonize_qb in turn systematically takes care of variable types, such as multiple_choice_questions, two-level_factors, three-level_factors, numeric, character for constants, etc.
This approach is not sensitive to missing questions. If some trust questions are present in all files, and others only in a few, you will still get a full panel.
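The following toy example shows why stacking in long format tolerates missing questions. It is a base R sketch that stands in for tidyr::pivot_longer() and tidyr::pivot_wider(); the panels and the variable names trust_x and trust_y are made up for illustration.

```r
# Two toy panels; the second lacks one of the trust questions.
panel_a <- data.frame(panel_id = c("a_1", "a_2"),
                      trust_x = c(1, 0), trust_y = c(0, 1))
panel_b <- data.frame(panel_id = c("b_1", "b_2"),
                      trust_x = c(1, 1))

# Base R stand-in for tidyr::pivot_longer(): one row per id and variable
to_long <- function(x) {
  vars <- setdiff(names(x), "panel_id")
  do.call(rbind, lapply(vars, function(v) {
    data.frame(panel_id = x$panel_id, name = v, value = x[[v]],
               stringsAsFactors = FALSE)
  }))
}

panel_long <- do.call(rbind, lapply(list(panel_a, panel_b), to_long))

# Base R stand-in for tidyr::pivot_wider(): absent combinations become NA
panel_wide <- tapply(panel_long$value,
                     list(panel_long$panel_id, panel_long$name),
                     function(v) v[1])
panel_wide
```

The rows of the second panel get NA for trust_y instead of causing an error, so the full panel is still returned.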
# Print the harmonized and the original value labels
labelled::val_labels(trust_panel$council_of_the_eu_trust)
#>           decline     tend_to_trust tend_not_to_trust              inap
#>             99998                 1                 0             99999
attr(trust_panel$council_of_the_eu_trust, "labels_orig")
#>                    DK         Tend to trust     Tend not to trust
#>                 99998                     1                     0
#> Inap. (not 1 in eu28)
#>                 99999
# Summarize them as factors
summary(labelled::to_factor(trust_panel$council_of_the_eu_trust))
#>           decline     tend_to_trust tend_not_to_trust              inap
#>              1318              2098              1646              1517
#>              NA's
#>              5789

# Summarize them as numeric
summary(harmonize_to_numeric(trust_panel$council_of_the_eu_trust))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>    0.00    0.00    1.00    0.56    1.00    1.00    8624
You can simply revert to the original value labels:
num_value_range <- unique(trust_panel$trust_in_institutions_media)
as.numeric(num_value_range)
#> [1]     1 99998     0 99999    NA
num_value_range[!is.na(num_value_range)]
#> <labelled<double>[4]>
#> [1]     1 99998     0 99999
#>
#> Labels:
#>  value             label
#>      1     tend_to_trust
#>  99998           decline
#>      0 tend_not_to_trust
#>  99999              inap

summarize_values <- data.frame(
  harmonized_numeric = labelled::val_labels(
    trust_panel$trust_in_institutions_media),
  original_numeric = attr(
    trust_panel$trust_in_institutions_media, "num_orig"),
  original_labels = names(attr(
    trust_panel$trust_in_institutions_media, "labels_orig"))
)
| | harmonized_numeric | original_numeric | original_labels |
|---|---|---|---|
| tend_to_trust | 1 | 1 | Tend to trust |
| tend_not_to_trust | 0 | 0 | Tend not to trust |
| inap | 99999 | 99999 | Inap. (CY-TCC in isocntry) |
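Since the original codes and labels travel with the vector as attributes, reverting is a simple lookup. The sketch below rebuilds a toy vector with the attributes printed earlier by str() for council_of_the_eu_trust; revert_to_original() is a hypothetical helper, not a package function.

```r
# Toy vector carrying the same attributes as the str() output shown earlier
trust_toy <- structure(
  c(99998, 1, 0, 99999, NA),
  labels      = c(decline = 99998, tend_to_trust = 1,
                  tend_not_to_trust = 0, inap = 99999),
  labels_orig = c("DK" = 99998, "Tend to trust" = 1,
                  "Tend not to trust" = 0, "Inap. (not 1 in eu28)" = 99999),
  num_orig    = c("3" = 99998, "1" = 1, "2" = 0, "9" = 99999)
)

# Hypothetical helper: look up the original code and label of each value
revert_to_original <- function(x) {
  num_orig    <- attr(x, "num_orig")
  labels_orig <- attr(x, "labels_orig")
  values <- as.numeric(x)
  data.frame(
    harmonized     = values,
    original_code  = as.numeric(names(num_orig)[match(values, num_orig)]),
    original_label = names(labels_orig)[match(values, labels_orig)],
    stringsAsFactors = FALSE
  )
}

revert_to_original(trust_toy)
```

The original SPSS codes are stored as the names of num_orig, which is why the helper reads names() rather than the values themselves.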
From my correspondence with GESIS, we could have the following outputs from there:
When we first finish a big batch and include it in the panel, we create a trend file that will be published on GESIS as a separate data publication with its own doi. We can also place it, for example, on Zenodo or figshare.
We can also create a methodological publication from the standard steps we made.
We can produce publications of the topical trend elements, probably with other researchers who know more about the topic, such as climate change.
Therefore, we have the possibility to first exploit whatever we create, and the workflow remains transparent and reproducible. Other users can clear up other elements, and if their work is good, we can ask them to contribute their part to the package.