There’s the new cool world of project management: agile, scrumbling, cool. And there’s the old, sluggish way of project management, using stuff like Gantt diagrams. Let’s stick to the old world and come up with a Gantt diagram.
The Gantt diagram itself is no big deal: just some horizontal lines referring to dates. Somewhat more interesting is populating a raw data frame in a way that allows for convenient plotting.
Say we start with a basic dataframe (from a CSV file) that contains the following columns:
library(tidyverse)
## Warning: package 'dplyr' was built under R version 3.5.1
gant_data_raw <- read_csv2("https://data-se.netlify.com/download/gantt.txt")
head(gant_data_raw)
## # A tibble: 6 x 6
##   Section Task              Previous_Event      Status Start_Date Duration
##   <chr>   <chr>             <chr>               <chr>  <date>        <int>
## 1 Inhalte Neue Inhalte ent… <NA>                open   2019-03-01        6
## 2 Inhalte Inhalte weiteren… Neue Inhalte entwi… open   NA                6
## 3 Inhalte Anpassungen       Inhalte weiterentw… open   NA                3
## 4 Apps    Apps konzipieren  Anpassungen         open   NA                3
## 5 Apps    Apps programmier… Apps konzipieren    open   NA                3
## 6 Apps    Feedback-Tools k… Apps programmieren  open   NA                3
Only Task, Previous_Event and Duration matter here. In addition, we need an overall start date (“2019-03-01” in this case), which sits in Start_Date of the first row. Each subsequent task is assumed to start right when its preceding event ends.
Our job is to compute the start date and end date of each task, given that we know the initial start date and the durations (in months). As said, this procedure is based on the assumption that there is a frictionless and gapless sequence of tasks.
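To make the chaining concrete, here is a small sketch of the arithmetic for the first two tasks, using base R only. The helper `add_months()` is a hypothetical name introduced for illustration; it is not part of the original code.

```r
# Hypothetical helper: add a whole number of months to a Date (base R only).
add_months <- function(date, n) {
  seq(date, by = paste(n, "months"), length.out = 2)[2]
}

project_start <- as.Date("2019-03-01")

# Task 1 starts at the project start and runs for 6 months.
end_1 <- add_months(project_start, 6)

# Task 2 starts exactly when task 1 ends and runs for another 6 months.
start_2 <- end_1
end_2 <- add_months(start_2, 6)

print(c(end_1, end_2))
```

With the first-of-month start date used here, task 1 ends on 2019-09-01 and task 2 on 2020-03-01.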
Consider this function to populate the data:
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
prepare_gant_data <- function(gantdf) {
  # Given one initial project start date and the task durations (in months),
  # this function computes the start and end dates of each task
  # input: raw data (as read from the CSV file)
  # output: populated/prepared gantt data suitable for plotting

  # add some more columns to the data frame
  gantdf$End_Date <- as.Date(NA)
  gantdf$ID <- seq_len(nrow(gantdf))

  # initialize the data population: compute first end date
  gantdf$End_Date[1] <- gantdf$Start_Date[1] + months(gantdf$Duration[1])

  # now loop over each successive task (skipped if there is only one row)
  for (i in seq_len(nrow(gantdf))[-1]) {
    # for each task, we need to find its start date
    # the start date is the *end date* of the event defined as its "ancestor",
    # in other words, the end date of the respective "Previous_Event"
    previous_event <- gantdf$Previous_Event[i]
    # row index of the first task whose name equals the previous event
    previous_event_pos <- match(previous_event, gantdf$Task)
    gantdf$Start_Date[i] <- gantdf$End_Date[previous_event_pos]
    gantdf$End_Date[i] <- gantdf$Start_Date[i] + months(gantdf$Duration[i])
  }
  return(gantdf)
}
Run it:
gantdf <- prepare_gant_data(gantdf = gant_data_raw)
That’s the workhorse.
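Since the function looks up each predecessor by name, a quick sanity check on the raw data may save some head-scratching. The following is a minimal sketch on a made-up toy data frame (the task names "A", "B", "C" are illustrative only); it flags rows whose Previous_Event is missing from Task or does not sit in an earlier row.

```r
# Made-up stand-in for the raw data; only the two relevant columns.
toy <- data.frame(
  Task = c("A", "B", "C"),
  Previous_Event = c(NA, "B", "B"),
  stringsAsFactors = FALSE
)
toy$Previous_Event[2] <- "A"  # B follows A, C follows B

# For each row: index of the task that is named as its predecessor.
pred_row <- match(toy$Previous_Event, toy$Task)

# A row is problematic if it names a predecessor that cannot be found
# or that is not located in an earlier row.
has_pred <- !is.na(toy$Previous_Event)
bad <- has_pred & (is.na(pred_row) | pred_row >= seq_len(nrow(toy)))

print(pred_row)
print(toy$Task[bad])  # names of problematic tasks, if any
```

If `bad` is all FALSE, every task that names a predecessor points to an earlier row.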
Now let’s plot. Before that, it comes in handy to compute some convenience variables:
project_start <- gantdf$Start_Date[1]
project_end <- max(gantdf$End_Date)
project_duration <- interval(project_start, project_end)
project_duration_months <- project_duration %/% months(1)
## Note: method with signature 'Timespan#Timespan' chosen for function '%/%',
## target signature 'Interval#Period'.
## "Interval#ANY", "ANY#Period" would also be valid
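The integer division above counts the whole months between project start and project end. As a rough cross-check without lubridate, one could count monthly steps in base R. The two dates below are illustrative values, not read from the data set.

```r
# Illustrative dates (assumed, not taken from the CSV file).
project_start <- as.Date("2019-03-01")
project_end <- as.Date("2020-03-01")

# Monthly steps from start that do not go past the end date.
month_steps <- seq(project_start, project_end, by = "month")

# The first element is the start itself, so subtract one.
whole_months <- length(month_steps) - 1
print(whole_months)
```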
Here’s the actual plot:
my_breaks <- seq(as.Date(project_start), as.Date(project_start + years(3)), by = "1 year")
gantdf %>%
  ggplot() +
  aes(y = reorder(Task, -ID), yend = reorder(Task, -ID),
      x = Start_Date, xend = End_Date,
      color = factor(Section)) +
  geom_segment(size = 3) +  # with ggplot2 >= 3.4.0, use linewidth = 3 instead of size
  theme_bw() +
  scale_x_date(date_labels = "%Y-%m", breaks = my_breaks,
               limits = c(project_start, project_start + years(3))) +
  labs(caption = paste0("Duration [months]: ", project_duration_months),
       x = "Time",
       y = "Work packages",
       color = "") +
  theme(legend.position = "bottom")
Note that the data population function (prepare_gant_data) assumes that a previous task appears in an earlier row of the data frame. That’s because the data frame is populated row by row: we cannot refer to a later row, because its start and end dates would still be empty.
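If the rows cannot be guaranteed to be in order, one possible workaround is to sweep over the data repeatedly until no further task can be resolved. The following is only a sketch under that idea, written in base R. `populate_any_order()` and `add_months()` are hypothetical names, and the toy data is made up.

```r
# Hypothetical helper: add a whole number of months to a Date (base R only).
add_months <- function(date, n) {
  seq(date, by = paste(n, "months"), length.out = 2)[2]
}

# Hypothetical variant of the population step that tolerates any row order.
populate_any_order <- function(df) {
  df$End_Date <- as.Date(NA)
  repeat {
    progressed <- FALSE
    for (i in seq_len(nrow(df))) {
      if (!is.na(df$End_Date[i])) next  # already resolved
      if (is.na(df$Start_Date[i])) {
        p <- match(df$Previous_Event[i], df$Task)
        # predecessor unknown or not resolved yet: try again in the next sweep
        if (is.na(p) || is.na(df$End_Date[p])) next
        df$Start_Date[i] <- df$End_Date[p]
      }
      df$End_Date[i] <- add_months(df$Start_Date[i], df$Duration[i])
      progressed <- TRUE
    }
    if (!progressed) break  # nothing left to resolve (or unresolvable rows remain)
  }
  df
}

# Made-up toy data with the last task listed first.
toy <- data.frame(
  Task = c("C", "A", "B"),
  Previous_Event = c("B", NA, "A"),
  Start_Date = as.Date(c(NA, "2019-03-01", NA)),
  Duration = c(3, 6, 6),
  stringsAsFactors = FALSE
)
res <- populate_any_order(toy)
print(res)
```

Rows that can never be resolved (say, a misspelled predecessor or a circular dependency) simply keep NA dates, so it is worth checking for remaining NAs afterwards.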