Week 7
Cleaning and Data
Analysis in II

Soci—269

Sakeef M. Karim
Amherst College

AN INTRODUCTION TO QUANTITATIVE SOCIOLOGY—CULTURE & POWER

Data Wrangling III—
October 15th

A Brief Update

New Deadline for Coding Assignment in

Coding Assignment Deadline

Your first coding assignment is now due by 8:00 PM on Tuesday, November 4th.

Assignment instructions are available online.

Getting Started

Are We All on the Same Page?

Launch RStudio and execute the following code:

load(url("https://github.com/sakeefkarim/soci-269-f25/raw/refs/heads/main/data/week%207/week7.RData"))
A Question
Click your Environment tab. What do you notice?

Are We All on the Same Page?

Here is a list of the objects available in your Global Environment:

ls()
[1] "all_regions"       "can_mex_usa_long"  "canada"           
[4] "gss_2010"          "mexico"            "truncated_regions"
[7] "usa"              
Another Question
What does each of these objects correspond to?

A Solution

We’ll Be Using this Data Once Again

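# Assumption: the tidyverse is installed, and `vdem` is already loaded
# (e.g., from the vdemdata package).
library(tidyverse)

vdem <- vdem |> as_tibble()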

high_level <- c("v2x_polyarchy", "v2x_libdem", "v2x_partipdem", "v2x_delibdem", "v2x_egaldem")

high_level_names <- c("electoral", "liberal", "participatory", "deliberative", "egalitarian")

can_mex_usa <- vdem |> as_tibble() |> 
                       # Here, we (i) select (and rename) the first column; 
                       # (ii) select the year variable; then 
                       # (iii) select all the variables in the high_level vector.
                       select(country = 1, year, all_of(high_level)) |> 
                       # Here, we rename all the high_level variables using the 
                       # high_level_names vector.
                       rename_with(~ high_level_names, all_of(high_level)) |> 
                       # We're performing "per-operation" grouping here---within
                       # the mutate() function (by year). We're then relocating the
                       # grouped variable we created and ensuring that it appears after
                       # "electoral."
                       mutate(electoral_global_avg = mean(electoral, na.rm = TRUE),
                              .by = year, .after = electoral) |> 
                       # Isolating Canada, the US and Mexico + years in the
                       # 21st century:
                       filter(str_detect(country, "Can|United St|Mex"),
                              year >= 2000) |> 
                       # Arranging countries alphabetically:
                       arrange(country)

can_mex_usa
# A tibble: 75 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.843                0.491   0.766         0.594
 2 Canada   2001     0.838                0.494   0.76          0.587
 3 Canada   2002     0.831                0.505   0.753         0.578
 4 Canada   2003     0.831                0.513   0.753         0.574
 5 Canada   2004     0.83                 0.513   0.758         0.571
 6 Canada   2005     0.829                0.518   0.756         0.571
 7 Canada   2006     0.834                0.521   0.765         0.571
 8 Canada   2007     0.834                0.517   0.766         0.572
 9 Canada   2008     0.832                0.521   0.762         0.57 
10 Canada   2009     0.833                0.523   0.764         0.57 
# ℹ 65 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>

Introduction to dplyr
Combining Data Frames

dplyr::bind_rows()

We can use bind_rows() to combine observations—or rows—from different data frames.

Quick Exercise

In the next 5-10 minutes, try to recreate can_mex_usa using …

  • The data you have stored in your Environment.
  • bind_rows().

Note: If you’re running into issues, fear not—the answer’s on the next slide.

dplyr::bind_rows()

# The Solution:

can_mex_usa <- bind_rows(canada, mexico, usa)

can_mex_usa
# A tibble: 75 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.843                0.491   0.766         0.594
 2 Canada   2001     0.838                0.494   0.76          0.587
 3 Canada   2002     0.831                0.505   0.753         0.578
 4 Canada   2003     0.831                0.513   0.753         0.574
 5 Canada   2004     0.83                 0.513   0.758         0.571
 6 Canada   2005     0.829                0.518   0.756         0.571
 7 Canada   2006     0.834                0.521   0.765         0.571
 8 Canada   2007     0.834                0.517   0.766         0.572
 9 Canada   2008     0.832                0.521   0.762         0.57 
10 Canada   2009     0.833                0.523   0.764         0.57 
# ℹ 65 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>
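
A minimal sketch of how bind_rows() behaves when the inputs do not share exactly the same columns. The toy data frames north and south are hypothetical (they are not part of week7.RData), and the values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frames (not the course data).
north <- tibble(country = c("Canada", "USA"), electoral = c(0.84, 0.82))
south <- tibble(country = "Mexico", electoral = 0.65, liberal = 0.45)

# Columns are matched by name. `liberal` does not exist in `north`,
# so it is filled with NA for those rows. `.id` adds a column that
# records which input each row came from.
stacked <- bind_rows(north = north, south = south, .id = "source")

stacked
```

Because columns are matched by name rather than by position, the order of the columns in each input does not matter.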

dplyr::bind_cols()

Another Mini-Exercise

Okay, that wasn’t so hard. Let’s try to use bind_cols() to append a new variable (e_regionpol from truncated_regions) to our data.

A Quick Question

Will this work?

can_mex_usa |> bind_cols(truncated_regions)

dplyr::bind_cols()

# The Solution:

can_mex_usa_1 <- can_mex_usa |> bind_cols(# Avoiding duplicate columns:
                                          truncated_regions |> 
                                          select(-c(1:2))) |> 
                                # Relocating e_regionpol variable
                                relocate(e_regionpol, .after = country)

can_mex_usa_1
# A tibble: 75 × 9
   country e_regionpol  year electoral electoral_global_avg liberal
   <chr>    <hvn_lbll> <dbl>     <dbl>                <dbl>   <dbl>
 1 Canada            2  2000     0.843                0.491   0.766
 2 Canada            2  2001     0.838                0.494   0.76 
 3 Canada            2  2002     0.831                0.505   0.753
 4 Canada            2  2003     0.831                0.513   0.753
 5 Canada            2  2004     0.83                 0.513   0.758
 6 Canada            2  2005     0.829                0.518   0.756
 7 Canada            2  2006     0.834                0.521   0.765
 8 Canada            2  2007     0.834                0.517   0.766
 9 Canada            2  2008     0.832                0.521   0.762
10 Canada            2  2009     0.833                0.523   0.764
# ℹ 65 more rows
# ℹ 3 more variables: participatory <dbl>, deliberative <dbl>,
#   egalitarian <dbl>
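
A small sketch of two things to watch for with bind_cols(): duplicated column names are repaired rather than merged, and rows are paired by position, not by a key. The toy objects scores and regions are hypothetical, and the region codes are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data (not the course data).
scores  <- tibble(country = c("Canada", "Mexico"), electoral = c(0.84, 0.65))

# Same countries, but stored in a different row order.
regions <- tibble(country = c("Mexico", "Canada"), region = c(2, 5))

# Binding everything keeps both `country` columns and renames them
# (country...1 and country...3) instead of merging them.
naive <- suppressMessages(bind_cols(scores, regions))
names(naive)

# Dropping the duplicate column avoids the name clash, but rows are
# still paired by position: Canada receives Mexico's region code here.
dropped <- scores |> bind_cols(regions |> select(-country))
dropped
```

This position-based pairing may explain why the e_regionpol value printed for Canada above differs from the one in the left_join() solution later in the deck, though the row order of truncated_regions is an assumption here.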

dplyr::left_join()

What happens when we try to bind can_mex_usa with all_regions using the bind_cols() function?

can_mex_usa |> bind_cols(all_regions)

The powerful *_join() family of verbs from dplyr is especially useful when we’re stitching together data frames of different dimensions (e.g., different numbers of rows).
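
A minimal sketch of why bind_cols() is the wrong tool when the row counts differ. The toy objects scores and lookup are hypothetical stand-ins for can_mex_usa and all_regions, whose actual dimensions are not shown here.

```r
library(dplyr)

# Hypothetical toy data: 3 rows versus 4 rows.
scores <- tibble(country = c("Canada", "Mexico", "USA"),
                 electoral = c(0.84, 0.65, 0.82))

lookup <- tibble(country = c("Canada", "Mexico", "USA", "Brazil"),
                 region = c(5, 2, 5, 2))

# bind_cols() needs inputs with the same number of rows, so this call
# raises an error. tryCatch() captures the message instead of stopping.
bind_error <- tryCatch(
  bind_cols(scores, lookup),
  error = function(e) conditionMessage(e)
)

bind_error
```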

dplyr::left_join()

Yet Another Mini-Exercise

  • Try to use left_join() to attach the e_regionpol variable from all_regions to our original can_mex_usa data frame.

  • Store your new object as can_mex_usa_2 and relocate e_regionpol so that it appears right after country.

dplyr::left_join()

# The Solution:

can_mex_usa_2 <- can_mex_usa |> left_join(all_regions) |> 
                                relocate(e_regionpol, .after = country)

can_mex_usa_2
# A tibble: 75 × 9
   country e_regionpol  year electoral electoral_global_avg liberal
   <chr>    <hvn_lbll> <dbl>     <dbl>                <dbl>   <dbl>
 1 Canada            5  2000     0.843                0.491   0.766
 2 Canada            5  2001     0.838                0.494   0.76 
 3 Canada            5  2002     0.831                0.505   0.753
 4 Canada            5  2003     0.831                0.513   0.753
 5 Canada            5  2004     0.83                 0.513   0.758
 6 Canada            5  2005     0.829                0.518   0.756
 7 Canada            5  2006     0.834                0.521   0.765
 8 Canada            5  2007     0.834                0.517   0.766
 9 Canada            5  2008     0.832                0.521   0.762
10 Canada            5  2009     0.833                0.523   0.764
# ℹ 65 more rows
# ℹ 3 more variables: participatory <dbl>, deliberative <dbl>,
#   egalitarian <dbl>
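
The solution above lets left_join() pick the join keys automatically (every column name the two data frames share). A hedged sketch of the same idea with the keys spelled out, using hypothetical toy data; that country and year are the shared keys in all_regions is an assumption.

```r
library(dplyr)

# Hypothetical toy data (not the course data).
scores <- tibble(country = c("Canada", "Canada", "Mexico"),
                 year = c(2000, 2001, 2000),
                 electoral = c(0.843, 0.838, 0.600))

# The lookup table has an extra country and a different row order.
regions <- tibble(country = c("Mexico", "Canada", "Canada", "Brazil"),
                  year = c(2000, 2000, 2001, 2000),
                  region = c(2, 5, 5, 2))

# Rows are matched on the key columns, not on position. Every row of
# `scores` is kept; unmatched rows of `regions` (Brazil) are dropped.
joined <- scores |>
  left_join(regions, by = join_by(country, year)) |>
  relocate(region, .after = country)

joined
```

Stating the keys with join_by() also silences the message that left_join() prints when it has to guess them.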

Introduction to dplyr
Extending Verbs

dplyr::*if

We can extend filter() so that rows are extracted based on the values of all their columns …

… or any of their columns:
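
The code for this slide is not reproduced here. A minimal sketch, assuming the intended helpers are dplyr's if_all() and if_any(), with hypothetical toy data and an arbitrary 0.7 threshold:

```r
library(dplyr)

# Hypothetical toy data (not the course data).
scores <- tibble(country = c("A", "B", "C"),
                 electoral = c(0.90, 0.80, 0.30),
                 liberal = c(0.85, 0.40, 0.20))

# Keep rows where ALL of the selected columns exceed 0.7.
all_high <- scores |>
  filter(if_all(c(electoral, liberal), \(x) x > 0.7))

# Keep rows where ANY of the selected columns exceeds 0.7.
any_high <- scores |>
  filter(if_any(c(electoral, liberal), \(x) x > 0.7))

all_high
any_high
```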

Note: Keep clicking, or press the space bar on your keyboard, to advance through the slide deck.

dplyr::rename_with

By using rename_with(), we can rename multiple columns at once.

rename_with() accepts a variety of functions, including ones that simply return a character vector of new names.
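
A short sketch of three ways to call rename_with(), using a hypothetical one-row data frame whose column names mimic the V-Dem variables seen earlier:

```r
library(dplyr)

# Hypothetical toy data (values are made up).
scores <- tibble(v2x_polyarchy = 0.84, v2x_libdem = 0.77, year = 2000)

# 1. Apply an existing function to every column name.
upper <- scores |> rename_with(toupper)

# 2. Apply an anonymous function to a subset of columns.
stripped <- scores |>
  rename_with(\(x) sub("^v2x_", "", x), starts_with("v2x_"))

# 3. Supply a character vector of new names via a lambda. The vector
#    must have one name per selected column, in the same order.
renamed <- scores |>
  rename_with(~ c("electoral", "liberal"),
              all_of(c("v2x_polyarchy", "v2x_libdem")))

names(upper)
names(stripped)
names(renamed)
```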


dplyr::*across()

We can transform the values in more than one column by deploying across().

We can embed functions within across(), too.

Moreover, we can use across() to generate quick descriptive summaries.
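
A minimal sketch of the three uses of across() described above, with hypothetical toy data. The rescaling by 100 and the choice of mean() and sd() are illustrative assumptions, not necessarily what the original slide showed.

```r
library(dplyr)

# Hypothetical toy data (not the course data).
scores <- tibble(country = c("Canada", "Canada", "Mexico"),
                 electoral = c(0.843, 0.838, 0.600),
                 liberal = c(0.766, 0.760, 0.400))

# Transform several columns at once.
rescaled <- scores |>
  mutate(across(c(electoral, liberal), \(x) x * 100))

# Embed another function, here applied to every numeric column.
rounded <- scores |>
  mutate(across(where(is.numeric), \(x) round(x, 1)))

# Quick descriptive summaries: one column per variable-function pair.
summary_tbl <- scores |>
  summarise(across(c(electoral, liberal), list(mean = mean, sd = sd)),
            .by = country)

summary_tbl
```

The named list passed to across() determines the suffixes of the new columns (e.g., electoral_mean, electoral_sd).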


Introduction to dplyr
Reshaping Data

Wide to Long and Back Again

We can modify the orientation of our data using pivot_longer()

… and pivot_wider().
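
A small sketch of a round trip between wide and long formats, using hypothetical toy data. The column names measure and score echo the exercise that follows, but this is not the exercise solution.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in wide format: one column per measure.
wide <- tibble(country = c("Canada", "Mexico"),
               year = c(2000, 2000),
               electoral = c(0.843, 0.600),
               liberal = c(0.766, 0.400))

# Wide to long: one row per country-year-measure combination.
long <- wide |>
  pivot_longer(cols = c(electoral, liberal),
               names_to = "measure",
               values_to = "score")

# Long back to wide: one column per value of `measure`.
back <- long |>
  pivot_wider(names_from = measure, values_from = score)

long
back
```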


Today’s Exercise

Try to produce the following data frame:

# A tibble: 375 × 5
   country region                            year measure       score
   <chr>   <fct>                            <dbl> <chr>         <dbl>
 1 Canada  North America and Western Europe  2000 electoral      84.3
 2 Canada  North America and Western Europe  2000 liberal        76.6
 3 Canada  North America and Western Europe  2000 participatory  59.4
 4 Canada  North America and Western Europe  2000 deliberative   75.7
 5 Canada  North America and Western Europe  2000 egalitarian    71.2
 6 Canada  North America and Western Europe  2001 electoral      83.8
 7 Canada  North America and Western Europe  2001 liberal        76  
 8 Canada  North America and Western Europe  2001 participatory  58.7
 9 Canada  North America and Western Europe  2001 deliberative   75.2
10 Canada  North America and Western Europe  2001 egalitarian    70.5
# ℹ 365 more rows

The can_mex_usa_long data frame should be available in your Environment.

Optional Exercise

If you’re done, play around with gss_2010 (cf. Healy 2023).

You can explore the variables here:

See You on Monday

References

Coppedge, Michael, John Gerring, Carl Henrik Knutsen, Staffan I. Lindberg, Jan Teorell, et al. 2025. “V-Dem Country-Year Dataset V15.”
Healy, Kieran Joseph. 2023. gssr: General Social Survey Data for Use in R.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Sebastopol, CA: O’Reilly.