Exploration and Aggregation
R was created primarily as a statistical environment, so it provides many built-in functions to analyze, summarize, and understand the distribution of data in vectors or tables.
Basic Statistical Functions
R allows you to easily calculate the main measures of center and spread on numeric vectors:
mean(x): Calculates the arithmetic mean of the elements.
median(x): Calculates the median (the middle value).
sd(x): Calculates the standard deviation (a measure of how spread out the data is).
min(x) and max(x): Return the minimum and maximum values, respectively.
values <- c(10, 12, 15, 18, 20, 22)
avg <- mean(values)
dispersion <- sd(values)
cat("Mean:", avg, "- Std Dev:", round(dispersion, 2), "\n")
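As a cross-check on what sd() measures, the sketch below recomputes the standard deviation by hand for the same values vector. It relies on the fact that R's sd() uses the sample formula (dividing by n - 1); the names mid, n, and manual_sd are introduced here only for illustration.

```r
values <- c(10, 12, 15, 18, 20, 22)

# Center: mean and median
avg <- mean(values)
mid <- median(values)   # 16.5, the average of the two middle values (15 and 18)

# Spread: sum of squared deviations from the mean, divided by n - 1
n <- length(values)
manual_sd <- sqrt(sum((values - avg)^2) / (n - 1))

# all.equal() compares numbers while tolerating floating-point rounding
print(all.equal(manual_sd, sd(values)))   # TRUE

# Extremes
cat("Min:", min(values), "- Max:", max(values), "\n")
```

With an even number of elements, median() averages the two central values, which is why the result here is not one of the original elements.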
Handling Missing Data (NA)
Real-world datasets often contain missing values, which R represents with the special value NA (Not Available).
Aggregate functions such as mean(), median(), and sd() propagate missing values: if the vector contains even a single NA, the result is NA. To ignore missing values, pass the logical argument na.rm = TRUE (short for "NA remove"):
temperatures <- c(18, 21, NA, 24, 19)
# This will return NA
print(mean(temperatures))
# This will ignore the NA and perform the calculation on valid values
print(mean(temperatures, na.rm = TRUE))
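Before dropping missing values, it is often useful to know how many there are. The following sketch, which reuses the temperatures vector, counts them with is.na() and shows that na.rm = TRUE is accepted by other aggregates as well. The names missing_count and valid are illustrative.

```r
temperatures <- c(18, 21, NA, 24, 19)

# is.na() flags each missing element; summing the flags counts them
missing_count <- sum(is.na(temperatures))
print(missing_count)                        # 1

# na.rm = TRUE works the same way in other aggregates
print(median(temperatures, na.rm = TRUE))   # 20
print(max(temperatures, na.rm = TRUE))      # 24

# Alternative: drop the NA values first, then aggregate as usual
valid <- temperatures[!is.na(temperatures)]
print(mean(valid))                          # 20.5
```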
Global Summary with summary()
The summary() function gives a quick statistical overview in a single call. For a numeric vector it reports the minimum, first quartile, median, mean, third quartile, and maximum; for a data frame it produces a summary of each column:
prices <- c(5, 10, 15, 20, 100)
print(summary(prices))
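The result of summary() on a numeric vector can also be stored and queried by name, as in the sketch below. The items data frame and its category column are made up for this example to show the per-column behaviour on a data frame.

```r
prices <- c(5, 10, 15, 20, 100)
s <- summary(prices)

# Individual statistics can be read by name
print(s[["Median"]])   # 15
print(s[["Mean"]])     # 30

# On a data frame, summary() reports each column separately
# (items is a hypothetical data frame used only for illustration)
items <- data.frame(
  price = prices,
  category = c("A", "B", "A", "B", "A")
)
print(summary(items))
```

Note how the single large value (100) pulls the mean up to 30 while the median stays at 15, which is one reason to look at both.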
Aggregating Data with aggregate()
When working with data frames, it is common to calculate summary statistics (such as mean or sum) for specific groups or categories. The aggregate() function in R simplifies this process:
df <- data.frame(
department = c("HR", "IT", "HR", "IT"),
salary = c(3000, 4500, 3200, 4800)
)
# Compute the mean salary for each department
avg_salaries <- aggregate(salary ~ department, data = df, FUN = mean)
print(avg_salaries)
The formula salary ~ department means "summarize the salary variable, grouped by the department variable". The FUN = mean argument specifies that the arithmetic mean is calculated for each group. The result is a new data frame with one row per department.
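Changing the summary statistic only requires changing FUN. The sketch below reuses the same df and shows a built-in function (sum) and an anonymous function; the names total_salaries and salary_gap are illustrative.

```r
df <- data.frame(
  department = c("HR", "IT", "HR", "IT"),
  salary = c(3000, 4500, 3200, 4800)
)

# Total salary per department: HR = 6200, IT = 9300
total_salaries <- aggregate(salary ~ department, data = df, FUN = sum)
print(total_salaries)

# FUN can also be a function you write yourself; here, the gap between
# the highest and lowest salary in each group (HR = 200, IT = 300)
salary_gap <- aggregate(salary ~ department, data = df,
                        FUN = function(x) max(x) - min(x))
print(salary_gap)
```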
Advanced Filtering with subset()
The subset() function extracts rows (and, through its select argument, columns) from a data frame. Conditions are written with bare column names, which is usually easier to read than bracket indexing:
employees <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 34, 29),
active = c(TRUE, FALSE, TRUE)
)
# Keep only active employees older than 28
active_employees <- subset(employees, active == TRUE & age > 28)
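For comparison, here is a sketch of the same filter combined with column selection, next to the equivalent bracket-indexing form. It uses the employees data frame from above; same_rows is an illustrative name, and the condition is written as active rather than active == TRUE because the column is already logical.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# Rows: active employees older than 28; columns: name and age only
active_employees <- subset(employees, active & age > 28, select = c(name, age))
print(active_employees)   # only Charlie (age 29) matches

# Equivalent bracket indexing: every column needs the employees$ prefix
same_rows <- employees[employees$active & employees$age > 28, c("name", "age")]
print(same_rows)
```

One practical difference: subset() treats an NA in the condition as FALSE and drops that row, whereas bracket indexing with an NA condition returns a row filled with NA.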
Try it yourself
Given the vector temperatures, calculate the mean and save it in the variable mean_temp. Then calculate the median and save it in median_temp.
Hint: Use mean_temp <- mean(temperatures) and median_temp <- median(temperatures)
Given the scores vector containing an NA value, calculate the standard deviation of the vector excluding missing values, and save it in the sd_scores variable.
Hint: Use the sd() function with the na.rm = TRUE argument: sd_scores <- sd(scores, na.rm = TRUE)
Given the data frame df, obtain the statistical summary of the values column and save it in the stats_summary variable.
Hint: Extract the column using the $ operator and pass it to summary(): stats_summary <- summary(df$values)
Given the data frame sales, use the aggregate() function to calculate the sum (FUN = sum) of the revenue column grouped by the region column. Save the result in the region_sales variable.
Hint: Use region_sales <- aggregate(revenue ~ region, data = sales, FUN = sum)
Given the data frame products, use the subset() function to filter only the products that belong to the 'Electronics' category (category == 'Electronics') and have a price greater than 100 (price > 100). Save the result in the expensive_electronics variable.
Hint: Use expensive_electronics <- subset(products, category == 'Electronics' & price > 100)