
Exploration and Aggregation

R was created primarily as a statistical environment, so it provides many built-in functions to analyze, summarize, and understand the distribution of data in vectors or tables.

Basic Statistical Functions

R allows you to easily calculate the main measures of center and spread on numeric vectors:

  • mean(x): Calculates the arithmetic mean of the elements.
  • median(x): Calculates the median (the middle value).
  • sd(x): Calculates the sample standard deviation (a measure of how spread out the data is).
  • min(x) and max(x): Return the minimum and maximum values respectively.
Code
values <- c(10, 12, 15, 18, 20, 22)

avg <- mean(values)
dispersion <- sd(values)

cat("Mean:", avg, "- Std Dev:", round(dispersion, 2), "\n")
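
The other functions in the list are called the same way. The sketch below reuses the values vector from the example (redefined so the block runs on its own); the variable names mid, lowest, highest, and manual_sd are illustrative only. The last part checks that base R's sd() divides by n - 1, i.e. it is the sample standard deviation.

```r
values <- c(10, 12, 15, 18, 20, 22)

mid <- median(values)    # even length: average of the two middle values
lowest <- min(values)
highest <- max(values)

# sd() divides the sum of squared deviations by n - 1, not n
n <- length(values)
manual_sd <- sqrt(sum((values - mean(values))^2) / (n - 1))

cat("Median:", mid, "- Min:", lowest, "- Max:", highest, "\n")
cat("sd() matches the manual formula:",
    isTRUE(all.equal(sd(values), manual_sd)), "\n")
```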

Handling Missing Data (NA)

In the real world, datasets often contain missing values, represented in R by the special value NA (Not Available).

If you call mean(), sd(), or a similar aggregate such as sum(), median(), min(), or max() on a vector that contains even a single NA, R returns NA as the result. To ignore missing values, pass the logical argument na.rm = TRUE (short for "NA remove"):

Code
temperatures <- c(18, 21, NA, 24, 19)

# This will return NA
print(mean(temperatures))

# This will ignore the NA and perform the calculation on valid values
print(mean(temperatures, na.rm = TRUE))
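
Before dropping missing values, it often helps to know how many there are. A minimal sketch using base R's is.na(), with the same temperatures vector as above; the names missing_count, valid_mean, and valid_max are illustrative and not part of the lesson.

```r
temperatures <- c(18, 21, NA, 24, 19)

# is.na() returns a logical vector; summing it counts the TRUE entries
missing_count <- sum(is.na(temperatures))

# na.rm = TRUE is accepted by other aggregate functions as well
valid_mean <- mean(temperatures, na.rm = TRUE)
valid_max <- max(temperatures, na.rm = TRUE)

cat("Missing:", missing_count, "- Mean:", valid_mean, "- Max:", valid_max, "\n")
```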

Global Summary with summary()

The summary() function provides a quick statistical overview of a vector or an entire data frame. For numeric data it shows the minimum, first quartile, median, mean, third quartile, and maximum:

Code
prices <- c(5, 10, 15, 20, 100)
print(summary(prices))
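
summary() can also be applied to a whole data frame, in which case it reports each column separately. The sketch below assumes a hypothetical orders data frame that is not part of the lesson. For a numeric column that contains NA, the result also includes a count of missing values.

```r
# Hypothetical data, used only for illustration
orders <- data.frame(
  quantity = c(1, 3, 2, 5),
  price = c(9.5, NA, 12, 20)
)

# One block of statistics per column
print(summary(orders))

# The summary of a single numeric vector can be indexed by name
price_summary <- summary(orders$price)
print(price_summary["Median"])
```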

Aggregating Data with aggregate

When working with data frames, it is common to calculate summary statistics (such as mean or sum) for specific groups or categories. The aggregate() function in R simplifies this process:

Code
df <- data.frame(
  department = c("HR", "IT", "HR", "IT"),
  salary = c(3000, 4500, 3200, 4800)
)

# Calculate the mean salary per department
avg_salaries <- aggregate(salary ~ department, data = df, FUN = mean)
print(avg_salaries)

The formula salary ~ department means "summarize the salary variable, grouped by the department variable". The FUN = mean argument specifies the function applied to each group, here the arithmetic mean.
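
The same formula interface extends to more than one grouping variable by adding terms on the right-hand side. A sketch assuming a hypothetical staff data frame with an extra site column (neither appears in the lesson), this time using FUN = sum:

```r
# Hypothetical data, used only for illustration
staff <- data.frame(
  department = c("HR", "IT", "HR", "IT", "IT"),
  site = c("Rome", "Rome", "Milan", "Milan", "Milan"),
  salary = c(3000, 4500, 3200, 4800, 5000)
)

# Total salary for every department/site combination present in the data
totals <- aggregate(salary ~ department + site, data = staff, FUN = sum)
print(totals)
```

The result has one row per combination that occurs in the data, with the grouping columns first and the aggregated salary last.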

Advanced Filtering with subset

The subset() function provides a readable way to extract rows and columns from a data frame, as an alternative to square-bracket indexing:

Code
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# Keep only active employees older than 28
active_employees <- subset(employees, active == TRUE & age > 28)
print(active_employees)
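
subset() can also restrict the columns through its select argument, which covers the column half of "rows and columns". A short sketch reusing the employees data (redefined so the block runs by itself); the name older_employees is illustrative.

```r
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 34, 29),
  active = c(TRUE, FALSE, TRUE)
)

# Keep rows where age > 28, and only the name and age columns
older_employees <- subset(employees, age > 28, select = c(name, age))
print(older_employees)
```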

Try it yourself

Exercise #r.m3.l2.e1

Given the vector temperatures, calculate the mean and save it in the variable mean_temp. Then calculate the median and save it in median_temp.

Hint

Use mean_temp <- mean(temperatures) and median_temp <- median(temperatures)


Exercise #r.m3.l2.e2

Given the scores vector containing an NA value, calculate the standard deviation of the vector excluding missing values, and save it in the sd_scores variable.

Hint

Use the sd() function with the na.rm = TRUE argument: sd_scores <- sd(scores, na.rm = TRUE)


Exercise #r.m3.l2.e3

Given the data frame df, obtain the statistical summary of the values column and save it in the stats_summary variable.

Hint

Extract the column using the $ operator and pass it to summary(): stats_summary <- summary(df$values)


Exercise #r.m3.l2.e4

Given the data frame sales, use the aggregate() function to calculate the sum (FUN = sum) of the revenue column grouped by the region column. Save the result in the region_sales variable.

Hint

Use: region_sales <- aggregate(revenue ~ region, data = sales, FUN = sum)


Exercise #r.m3.l2.e5

Given the data frame products, use the subset() function to filter only the products that belong to the 'Electronics' category (category == 'Electronics') and have a price greater than 100 (price > 100). Save the result in the expensive_electronics variable.

Hint

Use: expensive_electronics <- subset(products, category == 'Electronics' & price > 100)
