Module 4 · Lesson 1 of 2 · 7/10 in the course · ~12 min

Filtering and Selecting Data

The Tidyverse is a collection of R packages designed for data science that share a common philosophy, grammar, and data structures. The core package for data manipulation in this suite is called dplyr.

dplyr introduces a set of functions ("verbs") that make data frame manipulation highly intuitive.

The Pipe Operator (`%>%` or `|>`)

The Tidyverse philosophy centers on chaining operations together. The pipe operator %>% (from the magrittr package, re-exported by dplyr) and the native pipe |> (introduced in R 4.1) take the result of the expression on their left and pass it as the first argument to the function on their right. This avoids deeply nested function calls and keeps your workspace free of temporary variables.

Code

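library(dplyr)  # provides select(), filter(), and %>%

# Assumes df is a data frame with 'name' and 'age' columns

# Without pipe: read inside-out (select runs first, then filter)
filter(select(df, name, age), age > 20)

# With pipe: read top to bottom
df %>%
  select(name, age) %>%
  filter(age > 20)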
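
To make the equivalence concrete, here is a minimal sketch that runs the nested call and both pipe styles on a small data frame. The sample data is hypothetical (invented for illustration), and the native pipe line assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical sample data; the column names mirror the lesson's examples.
df <- data.frame(
  name = c("Ana", "Ben", "Cleo", "Dev"),
  age  = c(19, 34, 27, 45),
  stringsAsFactors = FALSE
)

# Nested call: select() runs first, then filter()
nested <- filter(select(df, name, age), age > 20)

# magrittr pipe (re-exported by dplyr)
magrittr_piped <- df %>%
  select(name, age) %>%
  filter(age > 20)

# Native pipe (requires R >= 4.1)
native_piped <- df |>
  select(name, age) |>
  filter(age > 20)

# All three forms produce the same data frame
identical(nested, magrittr_piped)  # TRUE
identical(nested, native_piped)    # TRUE
```

The three results match because each pipe simply rewrites `x %>% f(y)` or `x |> f(y)` as `f(x, y)`.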

dplyr Verbs for Filtering and Selecting

The three fundamental verbs for selecting, filtering, and ordering the data in a data frame are:

1. `select()`

Selects specific columns from a data frame. You can list the column names to keep or use the - prefix to exclude columns.

Code

# Select the 'name' and 'salary' columns
select(df, name, salary)

# Remove the 'address' column
select(df, -address)
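
As an illustration, the sketch below applies both forms of select() to a small data frame and inspects the resulting column names. The data values are hypothetical; only the column names come from the examples above.

```r
library(dplyr)

# Hypothetical sample data for illustration
df <- data.frame(
  name    = c("Ana", "Ben", "Cleo"),
  salary  = c(52000, 61000, 48000),
  address = c("12 Elm St", "9 Oak Ave", "3 Pine Rd"),
  stringsAsFactors = FALSE
)

# Keep only the listed columns
kept <- select(df, name, salary)
names(kept)       # "name" "salary"

# Drop one column and keep everything else
dropped <- select(df, -address)
names(dropped)    # "name" "salary"

# Columns come back in the order you list them
reordered <- select(df, salary, name)
names(reordered)  # "salary" "name"
```

Note that select() never changes the rows, only which columns are returned and in what order.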

2. `filter()`

Keeps only the rows for which every logical condition evaluates to TRUE. Conditions separated by commas are combined with logical AND.

Code

# Filter rows where age is greater than 30
filter(df, age > 30)

# Filter using multiple conditions (logical AND)
filter(df, age > 30, department == "HR")
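
The following sketch shows how these conditions behave on a small data frame, including a row with a missing age. The data is hypothetical, and the `|` (logical OR) example is added here only as a contrast to the comma form.

```r
library(dplyr)

# Hypothetical sample data; Dev's age is deliberately missing
df <- data.frame(
  name       = c("Ana", "Ben", "Cleo", "Dev"),
  age        = c(28, 41, 35, NA),
  department = c("IT", "HR", "IT", "HR"),
  stringsAsFactors = FALSE
)

# Single condition: rows where the test is NA are dropped as well
older <- filter(df, age > 30)
older$name   # "Ben" "Cleo"

# Comma-separated conditions are combined with AND
both <- filter(df, age > 30, department == "HR")
both$name    # "Ben"

# The same filter written with an explicit &
same <- filter(df, age > 30 & department == "HR")
identical(both, same)  # TRUE

# Use | when either condition is enough
either <- filter(df, age > 30 | department == "IT")
either$name  # "Ana" "Ben" "Cleo"
```

Dev never appears in the output because a comparison with NA yields NA rather than TRUE, and filter() keeps only rows where the condition is TRUE.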

3. `arrange()`

Sorts rows based on the values of one or more columns. The sorting order is ascending by default. To sort in descending order, wrap the column name in desc().

Code

# Sort by age (ascending)
arrange(df, age)

# Sort by salary (descending)
arrange(df, desc(salary))
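
A short sketch of these sorting rules on a small data frame follows. The data is hypothetical, and the last call shows how a second column can break ties in the first.

```r
library(dplyr)

# Hypothetical sample data with ties in both age and salary
df <- data.frame(
  name   = c("Ana", "Ben", "Cleo", "Dev"),
  age    = c(35, 28, 35, 41),
  salary = c(52000, 61000, 48000, 61000),
  stringsAsFactors = FALSE
)

# Ascending by age (the default)
by_age <- arrange(df, age)
by_age$age             # 28 35 35 41

# Descending by salary
by_salary_desc <- arrange(df, desc(salary))
by_salary_desc$salary  # 61000 61000 52000 48000

# Descending by salary, then ascending by age to break the salary tie
tie_broken <- arrange(df, desc(salary), age)
tie_broken$name        # "Ben" "Dev" "Ana" "Cleo"
```

arrange() changes only the order of the rows; every row and column of the input is still present in the result.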

Try it yourself

Exercise 1: Select columns

Given the data frame df, select the name and age columns using select() and save the result in df_selected.

Hint: Use df_selected <- select(df, name, age)

Exercise 2: Filter rows

Filter the rows of the data frame df where the age column is strictly greater than 18, saving the result in df_adults.

Hint: Use filter(df, age > 18) and assign the result to df_adults.

Exercise 3: Chain operations with the pipe

Use the pipe operator %>% to chain operations: first filter df, keeping only the records where age > 18, and then select the name column. Save the result in res.

Hint: Write res <- df %>% filter(age > 18) %>% select(name)

Exercise 4: Sort rows

Sort the rows of the data frame df by the salary column in descending order using arrange() and desc(). Save the result in df_sorted.

Hint: Use arrange(desc(salary)) inside a pipeline, or call arrange(df, desc(salary)) directly.

Exercise 5: Build a complete pipeline

Write a complete pipeline on df: filter the records where department equals 'IT', select the name and salary columns, and sort the result by salary (ascending). Save the final result in res.

Hint: Use the pipe %>% to chain filter(department == 'IT'), select(name, salary), and arrange(salary).
