eLearner.app
Module 3 · Lesson 2 of 2 · 6/10 in the course · ~15 min

Exploration and Aggregation

R was created primarily as a statistical environment, so it provides many built-in functions to analyze, summarize, and understand the distribution of data in vectors or tables.

Basic Statistical Functions

R allows you to easily calculate the main measures of center and spread on numeric vectors:

  • mean(x): Calculates the arithmetic mean of the elements.
  • median(x): Calculates the median (the middle value).
  • sd(x): Calculates the sample standard deviation (a measure of how spread out the data is, computed with an n - 1 denominator).
  • min(x) and max(x): Return the minimum and maximum values respectively.
Code
values <- c(10, 12, 15, 18, 20, 22)

avg <- mean(values)
dispersion <- sd(values)

cat("Mean:", avg, "- Std Dev:", round(dispersion, 2), "\n")
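The remaining functions from the list can be tried on the same values vector. The sketch below also recomputes the standard deviation by hand to show that sd() divides by n - 1; the names n and manual_sd are introduced here for illustration only.

```r
values <- c(10, 12, 15, 18, 20, 22)

# Median of an even-length vector: the average of the two middle values
median(values)          # (15 + 18) / 2 = 16.5

# Smallest and largest element
min(values)             # 10
max(values)             # 22

# sd() divides by n - 1 (sample standard deviation), not by n
n <- length(values)
manual_sd <- sqrt(sum((values - mean(values))^2) / (n - 1))

# all.equal() compares with a numeric tolerance, which is safer than ==
all.equal(manual_sd, sd(values))   # TRUE
```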

Handling Missing Data (NA)

Real-world datasets often contain missing values, which R represents with the special value NA (Not Available).

If a vector contains even a single NA, mean() and most other aggregates (sum(), sd(), median(), min(), max()) return NA, because the result cannot be determined without knowing the missing value. To ignore missing values, pass the logical argument na.rm = TRUE (short for "NA remove"):

Code
temperatures <- c(18, 21, NA, 24, 19)

# This will return NA
print(mean(temperatures))

# This will ignore the NA and perform the calculation on valid values
print(mean(temperatures, na.rm = TRUE))
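Before removing missing values it is often useful to know how many there are. The following sketch, using the same temperatures vector, counts the NAs with is.na() and shows that na.rm = TRUE behaves the same way in other aggregates; n_missing, max_temp, sd_temp and valid are illustrative names, not part of the lesson.

```r
temperatures <- c(18, 21, NA, 24, 19)

# is.na() returns a logical vector marking the missing positions
is.na(temperatures)        # FALSE FALSE  TRUE FALSE FALSE

# Summing a logical vector counts its TRUE values
n_missing <- sum(is.na(temperatures))   # 1

# na.rm = TRUE works the same way for other aggregates
max_temp <- max(temperatures, na.rm = TRUE)   # 24
sd_temp  <- sd(temperatures, na.rm = TRUE)

# Alternative: drop the NAs first, then compute as usual
valid <- temperatures[!is.na(temperatures)]
all.equal(mean(valid), mean(temperatures, na.rm = TRUE))   # TRUE
```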

Global Summary with summary()

The summary() function gives a quick statistical overview. For a numeric vector it reports the minimum, first quartile, median, mean, third quartile, and maximum; for a data frame it produces a summary of each column:

Code
prices <- c(5, 10, 15, 20, 100)
print(summary(prices))
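The result of summary() can also be stored and inspected element by element. The sketch below reuses the prices vector; the items data frame and its quantity column are hypothetical, added only to show how summary() behaves on a data frame.

```r
prices <- c(5, 10, 15, 20, 100)
price_summary <- summary(prices)

# Elements of the result can be read by name
price_summary["Median"]   # 15
price_summary["Mean"]     # 30: pulled upward by the single value 100

# Applied to a data frame, summary() reports on each column separately
# (`items` is a hypothetical data frame made up for this example)
items <- data.frame(
  price = prices,
  quantity = c(3, 1, 4, 2, 1)
)
summary(items)
```

Comparing the mean with the median in this way is a quick check for outliers: here one unusually high price moves the mean to twice the median.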

Aggregating Data with aggregate

When working with data frames, it is common to calculate summary statistics (such as mean or sum) for specific groups or categories. The aggregate() function in R simplifies this process:

Code
df <- data.frame(
  department = c("HR", "IT", "HR", "IT"),
  salary = c(3000, 4500, 3200, 4800)
)

# Compute the mean salary for each department
avg_salaries <- aggregate(salary ~ department, data = df, FUN = mean)
print(avg_salaries)

The formula salary ~ department means "summarize the salary variable, grouped by the department variable". The FUN = mean argument specifies the function applied to each group, here the arithmetic mean. The result is a data frame with one row per department.
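The same formula interface accepts other summary functions and more than one grouping variable. The sketch below is an illustration on made-up data: the staff data frame and its role column are assumptions, not part of the lesson's example.

```r
# Hypothetical data for illustration; the `role` column is not in the lesson's df
staff <- data.frame(
  department = c("HR", "IT", "HR", "IT", "IT"),
  role       = c("Junior", "Junior", "Senior", "Senior", "Senior"),
  salary     = c(3000, 4500, 3200, 4800, 5200)
)

# Total salary per department: same formula, different FUN
total_by_dept <- aggregate(salary ~ department, data = staff, FUN = sum)
print(total_by_dept)

# Two grouping variables: one row per department/role combination
avg_by_dept_role <- aggregate(salary ~ department + role, data = staff, FUN = mean)
print(avg_by_dept_role)
```

Grouping variables are joined with + on the right-hand side of the formula; only combinations that actually occur in the data appear in the result.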

Advanced Filtering with subset

The subset() function provides a readable way to extract rows (and, through its select argument, columns) from a data frame. Column names can be used directly in the condition, without repeating the data frame name as bracket indexing requires:

Code
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# Keep only the active employees older than 28
active_employees <- subset(employees, active == TRUE & age > 28)
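To make the comparison with square brackets concrete, the sketch below filters the same employees data frame both ways and also keeps only two columns. The names with_subset and with_brackets are illustrative.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# subset(): column names are used directly; select chooses the columns to keep
with_subset <- subset(employees, active & age > 28, select = c(name, age))

# Bracket indexing: every column must be prefixed with employees$
with_brackets <- employees[employees$active & employees$age > 28, c("name", "age")]

print(with_subset)    # only Charlie (age 29) satisfies both conditions
```

The two forms return the same rows here because the data has no missing values. When the condition evaluates to NA for some row, subset() drops that row, while bracket indexing returns a row filled with NA.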

Try it yourself

Exercise#r.m3.l2.e1

Given the vector temperatures, calculate the mean and save it in the variable mean_temp. Then calculate the median and save it in median_temp.

Hint: use mean_temp <- mean(temperatures) and median_temp <- median(temperatures)

Exercise#r.m3.l2.e2

Given the scores vector containing an NA value, calculate the standard deviation of the vector excluding missing values, and save it in the sd_scores variable.

Hint: use the sd() function with the na.rm = TRUE argument: sd_scores <- sd(scores, na.rm = TRUE)

Exercise#r.m3.l2.e3

Given the data frame df, obtain the statistical summary of the values column and save it in the stats_summary variable.

Hint: extract the column with the $ operator and pass it to summary(): stats_summary <- summary(df$values)

Exercise#r.m3.l2.e4

Given the data frame sales, use the aggregate() function to calculate the sum (FUN = sum) of the revenue column grouped by the region column. Save the result in the region_sales variable.

Hint: use region_sales <- aggregate(revenue ~ region, data = sales, FUN = sum)

Exercise#r.m3.l2.e5

Given the data frame products, use the subset() function to filter only the products that belong to the 'Electronics' category (category == 'Electronics') and have a price greater than 100 (price > 100). Save the result in the expensive_electronics variable.

Hint: use expensive_electronics <- subset(products, category == 'Electronics' & price > 100)