Exploration and Aggregation

R was created primarily as a statistical environment, so it provides many built-in functions to analyze, summarize, and understand the distribution of data in vectors or tables.

Basic Statistical Functions

R allows you to easily calculate the main measures of center and spread on numeric vectors:

mean(x): Calculates the arithmetic mean of the elements.
median(x): Calculates the median (the middle value).
sd(x): Calculates the sample standard deviation (a measure of how spread out the data is around the mean).
min(x) and max(x): Return the minimum and maximum values respectively.

Code

values <- c(10, 12, 15, 18, 20, 22)

avg <- mean(values)
dispersion <- sd(values)

cat("Mean:", avg, "- Std Dev:", round(dispersion, 2), "\n")
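The other functions in the list can be tried on the same vector. The following is a minimal sketch, not part of the original lesson: the variable names (middle, smallest, largest, manual_sd) are chosen for illustration. It also recomputes the standard deviation by hand, to show that sd() divides by n - 1 rather than n.

```r
# Same values as the lesson's example vector
values <- c(10, 12, 15, 18, 20, 22)

middle   <- median(values)   # even length: average of the two middle values
smallest <- min(values)
largest  <- max(values)

# Sample standard deviation written out: sqrt(sum((x - mean)^2) / (n - 1))
n <- length(values)
manual_sd <- sqrt(sum((values - mean(values))^2) / (n - 1))

cat("Median:", middle, "- Min:", smallest, "- Max:", largest, "\n")
cat("sd() matches the manual formula:",
    isTRUE(all.equal(sd(values), manual_sd)), "\n")
```

Because the vector has an even number of elements, the median is the average of 15 and 18, that is 16.5.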

Handling Missing Data (`NA`)

In the real world, datasets often contain missing values, represented in R by the special value NA (Not Available).

If you calculate the mean, or most other summaries such as sum(), sd(), median(), min(), or max(), on a vector that contains even a single NA, R returns NA as the result. To ignore missing values, pass the logical argument na.rm = TRUE (short for "NA remove"):

Code

temperatures <- c(18, 21, NA, 24, 19)

# This will return NA
print(mean(temperatures))

# This will ignore the NA and perform the calculation on valid values
print(mean(temperatures, na.rm = TRUE))
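Before removing missing values it is often useful to know how many there are. The sketch below is illustrative only: the readings vector and the variable names are assumptions, not part of the lesson. It counts the NA entries with is.na() and shows that na.rm = TRUE behaves the same way in other summary functions.

```r
# Hypothetical data with two missing readings
readings <- c(18, 21, NA, 24, NA, 19)

# is.na() returns TRUE/FALSE per element; sum() counts the TRUE values
missing_count <- sum(is.na(readings))
valid_count   <- sum(!is.na(readings))

# na.rm = TRUE is accepted by other summary functions too
total   <- sum(readings, na.rm = TRUE)
highest <- max(readings, na.rm = TRUE)
avg     <- mean(readings, na.rm = TRUE)

cat("Missing:", missing_count, "- Valid:", valid_count, "\n")
cat("Sum:", total, "- Max:", highest, "- Mean:", avg, "\n")
```

With na.rm = TRUE the mean is computed over the four valid readings only, so the divisor is 4, not 6.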

Global Summary with `summary()`

The summary() function gives a quick statistical overview of a vector or of every column in a data frame. For numeric data it shows the minimum, first quartile, median, mean, third quartile, and maximum:

Code

prices <- c(5, 10, 15, 20, 100)
print(summary(prices))
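As a rough illustration of the data frame case, the sketch below calls summary() on a small made-up table. The orders data frame and its column names are assumptions for this example. It also extracts single values from the result by name.

```r
# Hypothetical data frame with one numeric and one character column
orders <- data.frame(
  price = c(5, 10, 15, 20, 100),
  item  = c("pen", "book", "bag", "lamp", "chair")
)

# On a data frame, summary() reports each column separately
print(summary(orders))

# For a numeric vector the result is named, so values can be picked out
price_summary <- summary(orders$price)
price_median  <- price_summary[["Median"]]
price_mean    <- price_summary[["Mean"]]

cat("Median:", price_median, "- Mean:", price_mean, "\n")
```

Here the single large value 100 pulls the mean (30) well above the median (15), which is the kind of skew that summary() makes easy to spot.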

Aggregating Data with `aggregate`

When working with data frames, it is common to calculate summary statistics (such as mean or sum) for specific groups or categories. The aggregate() function in R simplifies this process:

Code

df <- data.frame(
  department = c("HR", "IT", "HR", "IT"),
  salary = c(3000, 4500, 3200, 4800)
)

# Compute the mean salary per department
avg_salaries <- aggregate(salary ~ department, data = df, FUN = mean)
print(avg_salaries)

The formula salary ~ department means "summarize the salary variable separately for each value of the department variable". The FUN = mean argument specifies the function applied to each group, here the arithmetic mean.
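The same formula interface extends to more than one grouping variable and to other summary functions. The following sketch assumes a made-up staff data frame with an extra office column. Neither the data nor the variable names come from the lesson.

```r
# Hypothetical data: the office column is an invented second grouping variable
staff <- data.frame(
  department = c("HR", "IT", "HR", "IT", "IT"),
  office     = c("Rome", "Rome", "Milan", "Milan", "Milan"),
  salary     = c(3000, 4500, 3200, 4800, 5000)
)

# Two grouping variables: one row per department/office combination in the data
by_dept_office <- aggregate(salary ~ department + office, data = staff, FUN = mean)
print(by_dept_office)

# A different summary function: highest salary per department
max_by_dept <- aggregate(salary ~ department, data = staff, FUN = max)
print(max_by_dept)
```

Variables on the right-hand side of the formula are joined with +, and any function that reduces a vector to a single value can be passed as FUN.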

Advanced Filtering with `subset`

The subset() function provides a readable way to extract rows and columns from a data frame without writing square-bracket indexing. Column names can be used directly in the condition:

Code

employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# Keep only active employees older than 28
active_employees <- subset(employees, active == TRUE & age > 28)
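subset() can also pick columns through its select argument. The sketch below is illustrative, with variable names chosen for this example. It filters rows and columns in one call and compares the result with the equivalent square-bracket expression.

```r
# Same illustrative data frame as in the lesson
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# Filter rows by condition and keep only two columns via select
older_staff <- subset(employees, age > 28, select = c(name, age))
print(older_staff)

# The same selection written with square brackets
older_staff_brackets <- employees[employees$age > 28, c("name", "age")]
print(isTRUE(all.equal(older_staff, older_staff_brackets)))
```

One difference worth knowing: subset() drops rows where the condition evaluates to NA, whereas logical bracket indexing returns a row of NA values for them.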

Try it yourself

Exercise 1

Given the vector temperatures, calculate the mean and save it in the variable mean_temp. Then calculate the median and save it in median_temp.

Hint: Use mean_temp <- mean(temperatures) and median_temp <- median(temperatures)

Exercise 2

Given the scores vector containing an NA value, calculate the standard deviation of the vector excluding missing values, and save it in the sd_scores variable.

Hint: Use the sd() function with the na.rm = TRUE argument: sd_scores <- sd(scores, na.rm = TRUE)

Exercise 3

Given the data frame df, obtain the statistical summary of the values column and save it in the stats_summary variable.

Hint: Extract the column using the $ operator and pass it to summary(): stats_summary <- summary(df$values)

Exercise 4

Given the data frame sales, use the aggregate() function to calculate the sum (FUN = sum) of the revenue column grouped by the region column. Save the result in the region_sales variable.

Hint: Use region_sales <- aggregate(revenue ~ region, data = sales, FUN = sum)

Exercise 5

Given the data frame products, use the subset() function to keep only the products that belong to the 'Electronics' category (category == 'Electronics') and have a price greater than 100 (price > 100). Save the result in the expensive_electronics variable.

Hint: Use expensive_electronics <- subset(products, category == 'Electronics' & price > 100)
