# Package names
packages <- c("tidyverse", "tidyquant", "tidymodels", "glue", "scales", "ggthemes")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}
# Load all packages
invisible(lapply(packages, library, character.only = TRUE))
Collecting, Organizing, and Manipulating Financial Data - Replication
About this Document
This file replicates the code discussed in the live-session lectures of the Practical Applications in Quantitative Finance course. To ensure you can run the code without issues, please install and load all required packages beforehand. It is always good practice to replicate this Quarto document and experiment by editing the parameters. At the end of this report, you will find a suggestion on how to tweak it: try making some changes on your own!
In this lecture, we will be working with daily stock price data from the Magnificent Seven (AAPL, GOOG, MSFT, NVDA, TSLA, AMZN, and META). I have already downloaded the data for you using the tidyquant package, which allows us to pull stock price data for multiple securities in a convenient format. You can hit the Download button to get a sense of what the data looks like, or download it directly on eClass® (file name: M7.csv). Before you start, make sure to follow the instructions from our previous replication to set up your working directory correctly.
Loading packages
As we get started, we will load all packages referenced on our official website.
Note that you could easily get around this by installing and loading all necessary packages with a simpler syntax:
#Install if not already available - these lines are commented so that R does not attempt to install them every time
#install.packages('tidyverse')
#install.packages('tidyquant')
#install.packages('tidymodels')
#install.packages('glue')
#install.packages('scales')
#install.packages('ggthemes')
#Load
library(tidyverse)
library(tidyquant)
library(tidymodels)
library(glue)
library(scales)
library(ggthemes)
Using dplyr, the data manipulation package in the tidyverse
The dplyr package is one of the core packages in the tidyverse and is designed for efficient and readable data manipulation. It provides a set of functions (also called “verbs”) that make working with data frames (or tibbles) intuitive and expressive. Key features:
- Filter rows: filter()
- Select columns: select()
- Mutate (create new columns): mutate()
- Summarize data: summarize()
- Group operations: group_by()
- Join tables: left_join(), right_join(), inner_join(), full_join()
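The join verbs are not exercised in this lecture, so here is a minimal, hypothetical sketch of how left_join() behaves (the tickers and sector labels below are made up for illustration):

```r
library(dplyr)

#Hypothetical price table and sector lookup table
prices  <- data.frame(symbol = c("AAPL", "MSFT", "FORD"),
                      adjusted = c(75.09, 160.62, 9.30))
sectors <- data.frame(symbol = c("AAPL", "MSFT"),
                      sector = c("Tech", "Tech"))

#left_join() keeps every row of `prices`; rows without a match get NA in `sector`
left_join(prices, sectors, by = "symbol")
```

By contrast, inner_join() would drop the FORD row entirely, since it has no match in sectors.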
To get started with our exercises, we will refer to the M7.csv file that has been provided. After setting the working directory to the file's location, we load the data using the read.csv() function:
#Apply function to the data
M7 = read.csv('M7.csv')
#Show the first observations
head(M7)
symbol date open high low close volume adjusted
1 AAPL 2020-01-02 74.0600 75.1500 73.7975 75.0875 135480400 72.79604
2 AAPL 2020-01-03 74.2875 75.1450 74.1250 74.3575 146322800 72.08828
3 AAPL 2020-01-06 73.4475 74.9900 73.1875 74.9500 118387200 72.66272
4 AAPL 2020-01-07 74.9600 75.2250 74.3700 74.5975 108872000 72.32098
5 AAPL 2020-01-08 74.2900 76.1100 74.2900 75.7975 132079200 73.48434
6 AAPL 2020-01-09 76.8100 77.6075 76.5500 77.4075 170108400 75.04523
1. The mutate() function
The mutate() function adds new variables that are functions of existing variables:

mutate(.data,                           #The object on which you are performing the calculations
       new_variable_1 = var1 * 2,       #Can use basic operations...
       new_variable_2 = median(var2),   #...or predefined functions
       variable_3 = as.character(var3)  #And can be used to modify existing variables
)
- This function takes the .data argument you provided (in your example, the Data object)…
- It sequentially creates the columns you asked for and places them to the right of your data.frame (or tibble)
- You can use any function, predefined or custom, inside mutate()
- It can also modify existing columns (if the name you use matches an existing column)
Exercise: use the high and low columns to create a new column, mid, defined as the average of the daily high and low prices.
#Apply function to the data
M7 = mutate(M7, mid = (high + low) / 2)
#Show the first observations
head(M7)
symbol date open high low close volume adjusted mid
1 AAPL 2020-01-02 74.0600 75.1500 73.7975 75.0875 135480400 72.79604 74.47375
2 AAPL 2020-01-03 74.2875 75.1450 74.1250 74.3575 146322800 72.08828 74.63500
3 AAPL 2020-01-06 73.4475 74.9900 73.1875 74.9500 118387200 72.66272 74.08875
4 AAPL 2020-01-07 74.9600 75.2250 74.3700 74.5975 108872000 72.32098 74.79750
5 AAPL 2020-01-08 74.2900 76.1100 74.2900 75.7975 132079200 73.48434 75.20000
6 AAPL 2020-01-09 76.8100 77.6075 76.5500 77.4075 170108400 75.04523 77.07875
2. The select() function
The select() function selects (and optionally renames) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns):
select(.data,                   #The object on which you are performing the operations
       variable_3,              #Can reorder columns
       variable_1,
       variable_2:variable_4,   #Matches position patterns
       where(is.numeric)        #Can select all columns that match a given pattern
)
- This function takes the .data argument you provided (in your example, the Data object)…
- And selects only the columns you’ve asked for
- You can also use select(.data, -variable) to remove a variable
- It keeps the structure of the data.frame intact - no rows are affected
The select() function also comes with a handy companion of selectors, which are functions that help you cherry-pick columns in a concise way, rather than hardcoding them altogether:
- : for selecting a range of consecutive variables
- starts_with(): starts with a string
- ends_with(): ends with a string
- contains(): contains a string
- matches(): matches a regular expression
- where(): applies a function to all variables and selects those for which the function returns TRUE
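As a quick sketch of some of these selectors in action (using a small stand-in data frame with the same column names as M7):

```r
library(dplyr)

#Stand-in with the same column names as M7
df <- data.frame(symbol = "AAPL", date = "2020-01-02",
                 open = 74.06, high = 75.15, low = 73.80,
                 close = 75.09, volume = 135480400, adjusted = 72.80)

select(df, open:close)         #Range of consecutive columns: open, high, low, close
select(df, starts_with("h"))   #Columns whose name starts with "h": high
select(df, contains("ol"))     #Columns whose name contains "ol": symbol, volume
select(df, where(is.numeric))  #All numeric columns (everything except symbol and date)
```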
Exercise: select only the symbol, date, volume, and adjusted columns, in that order.
#Apply function to the data
M7 = select(M7, symbol, date, volume, adjusted)
#Show the first observations
head(M7)
symbol date volume adjusted
1 AAPL 2020-01-02 135480400 72.79604
2 AAPL 2020-01-03 146322800 72.08828
3 AAPL 2020-01-06 118387200 72.66272
4 AAPL 2020-01-07 108872000 72.32098
5 AAPL 2020-01-08 132079200 73.48434
6 AAPL 2020-01-09 170108400 75.04523
3. The filter() function
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, a row must produce a value of TRUE for all conditions:
filter(.data,              #The object on which you are performing the operations
       variable_1 > 10,                                          #Simple arithmetic operators
       variable_2 %in% c('AAPL','MSFT','FORD'),                  #Pattern search
       !(variable_3 %in% c('Boston','Mass','Silicon Valley')),   #Negate pattern search
       variable_4 >= 10 & variable_3 <= 4 | is.na(variable_4)    #AND and OR conditions
)
- This function takes the .data argument you provided (in your example, the Data object)…
- And filters the rows based on the conditions outlined
- You can use any function, predefined or custom, inside filter()
- It returns a subset of the whole object, keeping the columns and the data structure intact
Exercise: filter for observations that occurred in 2025 only. You can use the year() function on the date variable to retrieve the year.
#Apply function to the data
M7 = filter(M7, year(date) == 2025)
#Show the first observations
head(M7)
symbol date volume adjusted
1 AAPL 2025-01-02 55740700 243.85
2 AAPL 2025-01-03 40244100 243.36
3 AAPL 2025-01-06 45045600 245.00
4 AAPL 2025-01-07 40856000 242.21
5 AAPL 2025-01-08 37628900 242.70
6 AAPL 2025-01-10 61710900 236.85
4. The arrange() function
The arrange() function reorders the rows of a data frame by the values of selected columns:
#Some options, always in the same format: the object you are rearranging + the reordering scheme
arrange(.data, variable_1)                   #Ascending by variable_1
arrange(.data, variable_1, variable_2)       #Ascending by variable_1, then variable_2
arrange(.data, variable_2, variable_1)       #Ascending by variable_2, then variable_1
arrange(.data, variable_1, desc(variable_2)) #Ascending by variable_1, then descending by variable_2
- This function takes the .data argument you provided (in your example, the Data object)…
- And reorders the rows of your data.frame (or tibble)
- This can be useful for visualization, but also for applying position-dependent functions, like lag(), lead(), head(), and tail()
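To see why row order matters for these position-dependent functions, here is a minimal sketch with made-up prices: lag() simply shifts rows, so it only yields "yesterday's price" if the data is already in chronological order:

```r
library(dplyr)

#Made-up two-day price series, deliberately out of order
df <- data.frame(date  = as.Date(c("2025-01-03", "2025-01-02")),
                 price = c(101, 100))

#Without arranging, lag() pairs each price with whatever row happens to precede it
mutate(df, prev = lag(price))

#After arranging chronologically, prev really is the previous day's price
mutate(arrange(df, date), prev = lag(price))
```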
Exercise: arrange the dataset by descending date (newest to oldest) and then by symbol.
#Apply function to the data
M7 = arrange(M7, desc(date), symbol)
#Show the first observations
head(M7)
symbol date volume adjusted
1 AAPL 2025-01-29 45486100 239.36
2 AMZN 2025-01-29 26091700 237.07
3 GOOG 2025-01-29 12287800 197.18
4 META 2025-01-29 21377800 676.49
5 MSFT 2025-01-29 23581400 442.33
6 NVDA 2025-01-29 467120600 123.70
5. The summarize() function
The summarise() - or summarize() - function creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics you have specified.
summarize(.data,              #The object on which you are performing the operations
          new_variable_1 = mean(var1, na.rm = TRUE),   #Average of var1, removing NA values
          new_variable_2 = median(var2, na.rm = TRUE), #Median of var2, removing NA values
          new_variable_3 = n_distinct(var2)            #Number of unique values of var2
)
- This function takes the .data argument you provided (in your example, the Data object)…
- And reshapes the data.frame (or tibble) according to the aggregation functions
- As the name suggests, it is used to summarize a table
Exercise: summarize the dataset by creating an average column, defined as the average of the adjusted prices. You can use the mean() function to get the average. Use the option na.rm = TRUE inside mean() to make sure that NA values are disregarded.
#Apply function to the data
Summary = summarize(M7, average = mean(adjusted, na.rm = TRUE))
#Show the first observations
head(Summary)
average
1 322.1803
6. Slice and dice through group_by()
The group_by() function takes an existing table and converts it into a grouped table, where operations are performed “by group”. Using ungroup() removes the grouping.
Data = group_by(Data, v1, v2, v3)
Data = summarize(Data, avg = mean(x, na.rm = TRUE))
- This function takes the .data argument you provided (in your example, the Data object)…
- And creates the avg variable by taking the average of x within each tuple defined by the grouping variables (in this case, v1, v2, and v3)
- It returns a grouped data frame, with the result of avg displayed for each unique combination of v1, v2, and v3
The group_by() function is part of the dplyr package and is used to create grouped data frames. It is commonly used in combination with summarize(), mutate(), and other dplyr functions to perform operations within groups.
Important
After grouping, it’s often necessary to ungroup the data to prevent unintended behavior in subsequent operations:

Data = group_by(Data, v1, v2, v3)
Data = summarize(Data, avg = mean(x, na.rm = TRUE))
Data = ungroup(Data)
Let’s try the latest summarize() call again, but now grouping the data by symbol first:
#Apply function to the data
M7 = group_by(M7, symbol)
Summary = summarize(M7, average = mean(adjusted, na.rm = TRUE))
#Show the first 10 observations
head(Summary, 10)
# A tibble: 7 × 2
symbol average
<chr> <dbl>
1 AAPL 234.
2 AMZN 227.
3 GOOG 196.
4 META 625.
5 MSFT 430.
6 NVDA 137.
7 TSLA 405.
7. Pipe your way through the code with %>%
The dplyr verbs, in isolation, are a great tool for data analysts, but what really makes them shine is the glue between them. The pipe operator (%>% or |>) is a key feature of the magrittr package (included in the tidyverse) and is widely used in R, especially together with dplyr, for improving code readability and structuring data transformation workflows. Key benefits include:
✅ Improved Readability – The sequence of transformations is clear.
✅ No Need for Temporary Variables – Each step directly passes its result to the next function.
✅ Avoids Nesting – No deeply nested function calls.
The pipe operator allows you to pass the result of one function as the first argument to the next function, making code more readable and eliminating the need for nested function calls. To show its functionality in action, in the code chunk below, both parts of the code produce the exact same result, but the latter, using the pipe operator, is much simpler to read:
#Instead of
Data = read.csv('Data.csv')                                     #Start with the data
Data = mutate(Data, new_var_1 = var_1 * 10)                     #Mutate
Data = select(Data, var_1, var_2, new_var_1, where(is.numeric)) #Select
Data = filter(Data, new_var_1 > 5)                              #Filter
Data = arrange(Data, new_var_1, desc(var_2))                    #Arrange
Data = summarize(Data, new_var = mean(new_var_1, na.rm = TRUE)) #Summarize
#Do
Data = read.csv('Data.csv') %>%                            #Start with the data
  mutate(new_var_1 = var_1 * 10) %>%                       #Mutate
  select(var_1, var_2, new_var_1, where(is.numeric)) %>%   #Select
  filter(new_var_1 > 5) %>%                                #Filter
  arrange(new_var_1, desc(var_2)) %>%                      #Arrange
  summarize(new_var = mean(new_var_1, na.rm = TRUE))       #Summarize
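The same chaining also works with base R's native pipe |> (available since R 4.1), which requires no extra packages. A minimal sketch:

```r
#The native pipe passes the left-hand side as the first argument, just like %>%
c(1, 4, 9) |> sqrt() |> sum()   #Equivalent to sum(sqrt(c(1, 4, 9))), i.e. 6
```

One small difference: |> requires an explicit function call on the right-hand side, so you must write sqrt() rather than just sqrt.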
Hands-on Exercise
On January \(25^{th}\), Chinese startup DeepSeek disrupted the tech stock market as investors reassessed the likely future investment in Artificial Intelligence hardware. As part of your work as a buy-side analyst, you were asked to analyze how the Magnificent 7 performed after the DeepSeek announcement. To that end, follow the instructions and answer the following question: which stock suffered the most during January 2025?
- To answer this question, you will be using all the dplyr verbs you’ve practiced so far
- Furthermore, you will also be using some common base R and other tidyverse functions, like lag(), prod(), as.Date() and drop_na()
The expected result is a data.frame object that shows, for each symbol, the monthly return for January 2025, ordered from lowest-to-highest.
The data, stored in M7.csv, can be loaded using read.csv('M7.csv'). You can download it using the link shown in Slide 4.
- Select only the symbol, date, and adjusted columns, and arrange the dataset from oldest to newest
- Mutate your date variable, making sure to read it as a Date object using as.Date()
- Create a Year variable and filter only on observations happening in 2025. You can use the year() function to retrieve the year of a given Date column
- Group the data by symbol
- Create, for each symbol, a Return variable defined as \(P_{t}/P_{t-1}\), where \(t\) refers to a date. You can use the lag() function for this
- You will see that lag() produces an NA whenever you try to lag the first observation. To make sure your data does not contain any NA, call drop_na()
- Create, for each symbol, a Cum_Return variable defined as the cumulative return. Compounded returns over time can be written as \(\small \prod(1+R_t)=(1+R_1)\times(1+R_2)\times...\times(1+R_t)\). For this, you can use the cumprod() function
- Pick the latest observation from each symbol and arrange the table from lowest-to-highest return. The function slice_tail(n=x) retrieves the bottom x observations, whereas slice_head(n=y) retrieves the top y
Solution walkthrough

#Read the Data
M7 %>%
  #Select only the columns of interest
  select(symbol, date, adjusted) %>%
  #Make sure date is read as a Date object
  mutate(date = as.Date(date)) %>%
  #Filter for observations happening in 2025
  filter(year(date) == 2025) %>%
  #Arrange in chronological order
  arrange(date) %>%
  #Group by symbol to perform the calculations
  group_by(symbol) %>%
  #Create the return
  mutate(Return = adjusted / lag(adjusted, default = NA)) %>%
  #Remove NAs before doing the cumulative product
  drop_na() %>%
  mutate(Cum_Return = cumprod(Return) - 1) %>%
  #Select the latest observation from each symbol
  slice_tail(n = 1) %>%
  #Select symbol, date, and cumulative return
  select(symbol, date, Cum_Return) %>%
  #Arrange from lowest-to-highest
  arrange(Cum_Return)
# A tibble: 7 × 3
# Groups: symbol [7]
symbol date Cum_Return
<chr> <date> <dbl>
1 NVDA 2025-01-29 -0.106
2 AAPL 2025-01-29 -0.0184
3 TSLA 2025-01-29 0.0259
4 GOOG 2025-01-29 0.0344
5 MSFT 2025-01-29 0.0567
6 AMZN 2025-01-29 0.0765
7 META 2025-01-29 0.129
This code processes the stock price data in M7 using the dplyr package. It calculates the cumulative return for each stock (symbol) in the year 2025, then selects the latest available observation per stock and sorts them from lowest to highest cumulative return.
- Read the Data. M7 is assumed to be a data frame or tibble containing stock data already in your session. You can use read.csv() and store it in an R object. The pipe operator (%>%) is used to chain functions together.
- Select Relevant Columns. Keeps only the relevant columns for the analysis:
  - symbol → the stock ticker
  - date → the trading date
  - adjusted → the adjusted closing price (used for return calculations)

Making sure the select() function is applied as one of the first steps can facilitate data wrangling, as it shrinks the dataset for the upcoming operations.
- Ensure date is a Date object. The code converts the date column to a Date object to enable time-based filtering and calculations, like year().
- Filter Data for 2025. The code uses year(date) == 2025 (from the lubridate package, loaded together with the tidyverse) to keep only data from 2025.
- Sort Data in Chronological Order. The code ensures that stock prices are arranged earliest to latest for correct return calculations.
- Group by Stock. Using group_by(symbol) ensures that return calculations are performed for each stock separately.
- Calculate Daily Returns. After the dataset is grouped, we use the mutate() function to create our return metric:
\[ Return=\dfrac{P_{t}}{P_{t-1}} \]
- Uses lag(adjusted) to get the previous day’s adjusted price
- The first row in each group will have NA (because there is no previous price)
Because of that, we also need a call to drop_na() to make sure that, when we multiply these indices, we are not including NA values.
The cumprod() function, which calculates the cumulative product of a series, multiplies values sequentially. However, if there are missing values (NA) in the sequence, cumprod() propagates NA to all subsequent values. This can corrupt the entire computation.
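A minimal base-R illustration of this propagation:

```r
#One NA poisons every subsequent cumulative product
cumprod(c(1.10, NA, 0.95))   #Returns 1.1, then NA for every position afterwards
```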
For example, omitting the drop_na() step in the solution code would produce NA all over Cum_Return:
#Read the Data
M7 %>%
  #Select only the columns of interest
  select(symbol, date, adjusted) %>%
  #Make sure date is read as a Date object
  mutate(date = as.Date(date)) %>%
  #Filter for observations happening in 2025
  filter(year(date) == 2025) %>%
  #Arrange in chronological order
  arrange(date) %>%
  #Group by symbol to perform the calculations
  group_by(symbol) %>%
  #Create the return
  mutate(Return = adjusted / lag(adjusted, default = NA)) %>%
  #Note: drop_na() is deliberately omitted before the cumulative product
  mutate(Cum_Return = cumprod(Return) - 1)
# A tibble: 126 × 5
# Groups: symbol [7]
symbol date adjusted Return Cum_Return
<chr> <date> <dbl> <dbl> <dbl>
1 AAPL 2025-01-02 244. NA NA
2 AMZN 2025-01-02 220. NA NA
3 GOOG 2025-01-02 191. NA NA
4 META 2025-01-02 599. NA NA
5 MSFT 2025-01-02 419. NA NA
6 NVDA 2025-01-02 138. NA NA
7 TSLA 2025-01-02 379. NA NA
8 AAPL 2025-01-03 243. 0.998 NA
9 AMZN 2025-01-03 224. 1.02 NA
10 GOOG 2025-01-03 193. 1.01 NA
# ℹ 116 more rows
- Calculate Cumulative Returns. With the series of daily returns in place, the cumulative return over time can be retrieved by compounding each individual return:

\[
\text{Cumulative Return}_{t=1\rightarrow T}= (1+R_1)\times(1+R_2)\times(1+R_3)\times...\times(1+R_T)\equiv\prod_{t=1}^{T}(1+R_t)
\]

To perform this calculation, the code uses cumprod(Return), which multiplies returns over time. In the end, we also subtract 1 to express the result as a percentage return.
- Select the Latest Observation Per Stock. The slice_tail() function keeps only the last row (i.e., the most recent date) for each stock. Note that this behavior is only possible because our data has been grouped by symbol in the previous steps. In an ungrouped case, slice_tail() would retrieve the latest observation considering the data as a whole - in this case, META's cumulative return.
- Keep Only Key Columns and Rearrange. After we are done creating the relevant variables, we can use the select() function to keep only the columns of interest: symbol, date, and Cum_Return, and use the arrange() function to sort observations in ascending order of cumulative return (i.e., lowest-to-highest).
Try doing some edits on your own!
Try thinking about changes you could make to improve the readability of the code or the analysis. A couple of edits that can be made include, but are not limited to:
- Adding more time periods to the analysis
- Increasing the set of assets to include tech firms other than the Magnificent Seven
- Calculating volatility metrics using the var() or sd() functions
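As a sketch of that last suggestion, per-symbol volatility could be computed with the same grouped workflow used above (the returns below are made up for illustration, and sd() is base R's standard deviation function):

```r
library(dplyr)

#Hypothetical daily gross returns for two symbols
returns <- data.frame(symbol = c("AAPL", "AAPL", "NVDA", "NVDA"),
                      Return = c(1.01, 0.99, 1.05, 0.90))

#One volatility figure per symbol, thanks to group_by()
returns %>%
  group_by(symbol) %>%
  summarize(volatility = sd(Return, na.rm = TRUE))
```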
Play around with these concepts to get familiar with all the data manipulation tools that come with dplyr!