---
title: "Swimming"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: lux
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

<style>
.chart-title { /* chart_title */
    font-size: 20px;
    }
body{ /* Normal */ 
      font-size: 18px; 
      }
</style>

```{r setup, include=FALSE}
library(flexdashboard)
```

EDA
===

Column {data-width=500 .tabset}
-----------------------------------------------------------------------

### Introduction to the Data Set


**Analysis of Top 200 Swimming Times World-Wide**


```{r glimpse}
pacman::p_load(corrplot, conflicted, DT, knitr, plotly, tidyverse, countrycode)
conflicts_prefer(dplyr::filter)
conflicts_prefer(plotly::layout)
#conflict_prefer(dplyr::select)
swimming <- read_csv("Swimming database.csv")
names(swimming) <- make.names(names(swimming))
datatable(swimming[1:100,])
```

### Variables

This is what all of the variables included in this dataset mean: 

* event_name: The name of the swimming event where the race occurred 

* swim_time: The time the athlete achieved to get onto the best 200 times

* swim_date: Date when the event occurred

* event_description: The event that the swimmers participated in

* team_code: The code of the country where the team is from

* team_name: The country the swimmer swims for

* athlete_full_name: The name of the athlete

* gender: The gender of the athlete

* athlete_birth_date: The date of birth of the athlete

* rank_order: The rank of the swim within the top 200 times for its event

* city: The city where the race took place

* country_code: The code of the country where the race took place

* duration_hh_mm_ss_ff: The full time in hours, minutes, seconds, and fractions of a second
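
To compare times across events, it helps to turn a duration string into plain seconds. The sketch below assumes the `hh:mm:ss.ff` layout suggested by the variable name; `duration_to_seconds()` and the example strings are hypothetical and are not taken from the data set.

```r
# Convert a duration string such as "00:01:52.98" into seconds.
# Assumption: the field is laid out as hours:minutes:seconds.fraction.
duration_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

# Made-up example durations (not rows from the data set)
example_durations <- c("00:00:46.86", "00:01:52.98", "00:14:31.02")
example_seconds <- duration_to_seconds(example_durations)
example_seconds
```

If the real column uses a different layout, the split character and the arithmetic would need to change to match it.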

### Methods and Questions

For this project, I used R in RStudio. I started with exploratory data analysis to understand what the project could cover, then cleaned and adjusted the data set and produced figures that help answer the questions I am interested in.

I am interested in looking at different events that swimmers participate in and whether age, gender, and nationality have anything to do with better or worse swimming times. 

Column {data-width=350}
-----------------------------------------------------------------------

### Goals For This Project

For this project, I want to understand which countries are strongest in the sport of swimming. Most of the project compares teams by how many of their swims appear in the top 200 times. I also look at the ages of the top swimmers to see where there are gaps or unusual values, and I use box plots to visualize how much the top times vary within a few events that interest me. Finally, because I live in the United States, I want to see how we compare to other countries in different events.

### Why This Data Set Interests Me

I love swimming and swam for 9 years: 4 years on a club team in elementary and middle school, 4 years in high school, and one year in college. Swimming has been an integral part of my life, and some of my closest friends came from the team, which is one reason I stuck with it for as long as I did. I was never amazing at the sport, but being able to spend time with some of my favorite people at school made it all worth it. By looking into this dataset, I want to investigate some of the events I swam in high school and see how much faster these athletes are than I ever was.

Team Analysis
===

Column {data-width=500}
-----------------------------------------------------------------------

### Count of Top 10 Countries

```{r countries}
# Count rows (one row per swim) for each team and keep the ten largest
Team_Names <- swimming %>%
  count(Team.Name, sort = TRUE) %>%
  slice_head(n = 10)

# reorder() sorts the bars by count instead of alphabetically
p <- ggplot(Team_Names, aes(y = reorder(Team.Name, n), x = n, text = paste0(n, " times"))) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(x = "Count",
       y = "Team Name",
       title = "Number of Top 200 Times per Team")

ggplotly(p, tooltip = "text")
```

Column {data-width=500}
-----------------------------------------------------------------------

### Top 200 Times Analysis

With 1360 entries in the top 200 times, the United States of America has more than any other team. Each entry is one swim, so a swimmer who appears several times is counted several times. Possible reasons for this lead include better funding throughout our programs, the size of the team that generally goes to the Olympics, a swimming legacy that must be upheld, better facilities, or simply the size of the population. Other countries with a large number of entries are Australia with 637, Japan with 427, and the People's Republic of China with 377.
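
Because one swimmer can hold several of the top 200 times, entries per team and swimmers per team are different numbers. A minimal sketch of the difference, using made-up rows; the column name `Athlete.Full.Name` is an assumption about how `make.names()` renders the athlete column.

```r
# Toy data standing in for the real columns (values are made up)
toy <- data.frame(
  Team.Name = c("USA", "USA", "USA", "Australia", "Australia", "Japan"),
  Athlete.Full.Name = c("A", "A", "B", "C", "D", "E")
)

# Entries per team: every row is one swim
entries <- table(toy$Team.Name)

# Unique athletes per team: drop repeated team/athlete pairs first
athletes <- table(unique(toy[, c("Team.Name", "Athlete.Full.Name")])$Team.Name)

team_summary <- data.frame(team = names(entries),
                           entries = as.vector(entries),
                           athletes = as.vector(athletes[names(entries)]))
team_summary
```

In this toy example the USA has three entries but only two swimmers, which is the kind of gap the real counts could hide.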

World View
=====

Column {data-width=500}
--------

### World View

```{r map}
# Harmonise team names so countrycode() and the map data recognise them.
# "Club" is treated as USA. "ROC" is the Russian Olympic Committee
# designation used at the Tokyo 2020 Games, so it is mapped to Russia.
swimming$Team.Name <- recode(swimming$Team.Name,
                             "Chinese Taipei" = "Taiwan",
                             "Club" = "USA",
                             "German Democratic Republic" = "Germany",
                             "Great Britain" = "United Kingdom",
                             "Hong Kong, China" = "China",
                             "People's Republic of China" = "China",
                             "ROC" = "Russia",
                             "Republic of Korea" = "South Korea",
                             "Russian Federation" = "Russia",
                             "United States of America" = "USA")

# Attach a continent to every swim (used again on the Important Events page)
swimming <- swimming %>%
  mutate(continents = countrycode(Team.Name, "country.name", "continent"))

# Number of top 200 times per team
swim_counts <- swimming %>%
  count(Team.Name, continents)

# Number of teams per continent
swim_counts %>%
  ggplot(aes(x = continents)) +
  geom_bar(fill = "blue")

# map_data("world") labels the United Kingdom "UK", so rename it for the join
World_Map <- map_data("world")

swimming_map <- swim_counts %>%
  mutate(region = recode(Team.Name, "United Kingdom" = "UK")) %>%
  inner_join(World_Map, by = "region")

# Grey base map with the teams in the data set shaded by their number of times
p <- ggplot(World_Map) +
  geom_polygon(aes(x = long, y = lat, group = group, text = paste0("Country: ", region)), fill = "lightgrey") +
  geom_polygon(data = swimming_map, aes(x = long, y = lat, group = group, fill = n,
                                        text = paste0("Country: ", Team.Name, "\n", n, " times"))) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Top 200 times") +
  theme_void()

ggplotly(p, tooltip = "text")
```

Top 10 Per Event
==========

Column {data-width=500 .tabset}
-----------------------------------------------------------------------

### Men's 100 Breaststroke

```{r M Breast}
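M.Breaststroke <- filter(swimming, Event.description == "Men 100 Breaststroke LCM Male")
# head() keeps the first ten rows, which are the ten fastest swims
# as long as the file is sorted by rank within each event
Top10MBreast <- head(M.Breaststroke, 10)
datatable(Top10MBreast)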
```

### Women's 100 Breaststroke 

```{r W Breast}
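W.Breaststroke <- filter(swimming, Event.description == "Women 100 Breaststroke LCM Female")
# head() keeps the first ten rows, which are the ten fastest swims
# as long as the file is sorted by rank within each event
Top10WBreast <- head(W.Breaststroke, 10)
datatable(Top10WBreast)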
```

Column {data-width=500 }
-----------------------------------------------------------------------

### Top 10 Analysis

It is fascinating that a single athlete has dominated the men's 100 breaststroke and holds all of the top 10 times, with no one able to beat him. It is also unexpected that he is from the United Kingdom, which is not the team with the most entries on these charts and is not even among the ten teams with the most entries. One thing worth looking at next is how the top times in specific events are distributed across countries.

The women's top 10 is more evenly split, with a good number of swimmers from the Americas and Europe and even one from South Africa. Lilly King stands out with 4 of the 10 spots, including first place. Again, it seems that country is an important thing to look at event by event, since the events differ so much from each other.
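
One way to quantify how dominated a top 10 is would be to count the spots held by each athlete and by each team. The sketch below uses a made-up top 10 with placeholder names; the column name `Athlete.Full.Name` is an assumption, and none of the rows come from the data set.

```r
# Made-up top 10 for one event; athlete names are placeholders
top10 <- data.frame(
  Athlete.Full.Name = c("Swimmer A", "Swimmer B", "Swimmer A", "Swimmer C",
                        "Swimmer A", "Swimmer B", "Swimmer D", "Swimmer A",
                        "Swimmer E", "Swimmer C"),
  Team.Name = c("USA", "Russia", "USA", "South Africa", "USA", "Russia",
                "Italy", "USA", "Canada", "South Africa")
)

# Number of top 10 spots per athlete and per team, largest first
spots_by_athlete <- sort(table(top10$Athlete.Full.Name), decreasing = TRUE)
spots_by_team <- sort(table(top10$Team.Name), decreasing = TRUE)

spots_by_athlete
spots_by_team
```

Running the same two lines on each event's real top 10 would show at a glance which events are dominated by one swimmer or one country.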


Age
==========

Column {data-width=500}
-----------------------------------------------------------------------

### Age of Swimmer

``` {r Age}
library(date)
# Parse both date columns with date::as.date(), then extract the calendar year
swimming$Athlete.birth.date <- as.date(swimming$Athlete.birth.date)
swimming$Swim.date <- as.date(swimming$Swim.date)
swimming <- swimming %>%
  mutate(birth.year = as.numeric(format(as.Date(Athlete.birth.date, format = "%d/%m/%Y"), "%Y")),
         swim.year = as.numeric(format(as.Date(Swim.date, format = "%d/%m/%Y"), "%Y")),
         # Difference of calendar years, so the age is one year too high
         # whenever the swim took place before that year's birthday
         age.at.event = swim.year - birth.year)

#ggplot(swimming, aes(x = age.at.event)) +
#  geom_histogram(fill = "#007991", color = "black") +
#  labs(x = "Age at Time of Event")



# Use plot_ly
plot_ly(data = swimming, 
        x = ~age.at.event, 
        type = "histogram", 
        marker = list(color = "#007991"),
        name = "Age Distribution") %>%
  layout(xaxis = list(title = "Age at Time of Event"))
```

Column {data-width=500}
-----------------------------------------------------------------------

### Typical Age of Fast Swimmer

As shown in the histogram, the ages form a single-peaked distribution rather than a flat one. The most common age (the mode) is 22, with a count of 633. There are still some outliers around 14, 33, and 34. The top times are concentrated in a fairly narrow age range, which suggests there is a peak period in a swimmer's life when they excel.
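
The tallest bar of a histogram is the mode, which is not necessarily the mean. A minimal sketch of the difference, using a short made-up vector of ages rather than the real `age.at.event` column:

```r
# Made-up ages with a peak at 22 and a few older outliers
ages <- c(14, 18, 19, 20, 21, 22, 22, 22, 23, 24, 25, 27, 33, 34)

# Mode: the age with the highest count
age_counts <- table(ages)
modal_age <- as.numeric(names(age_counts)[which.max(age_counts)])

# Mean: pulled upward here by the ages in the thirties
mean_age <- mean(ages)

c(mode = modal_age, mean = mean_age)
```

Applying the same steps to `swimming$age.at.event` would show how far the mean sits from the peak at 22.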

### Outliers and Significance

``` {r outlier}
fourteen <- filter(swimming, age.at.event == 14)
fifteen <- filter(swimming, age.at.event == 15)
```

As seen in the histogram, the minimum age is 14, which is remarkable considering these are the best 200 times in the entire world. Notable athletes from the United States include Katie Ledecky and Katie Grimes, who both reached the top 200 times at only 15 years old. In 2022, David Popovici from Romania set the world record in the 100 meter freestyle at just 17 years old (18 by the year-based age used in this dashboard, because the swim came before his birthday that year). This was the inspiration for the dataset and the reason it was made in the first place, so it was important to highlight this event.
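
Subtracting birth year from swim year can overstate an age by one year. The sketch below shows one way to compute a completed-years age from full dates instead; `age_in_years()` is a hypothetical helper, and the two dates are David Popovici's publicly reported birth date and record swim date, not values read from the data set.

```r
# Age in completed years on the day of the event (hypothetical helper)
age_in_years <- function(birth, event) {
  birth <- as.POSIXlt(birth)
  event <- as.POSIXlt(event)
  years <- event$year - birth$year
  # TRUE when the event falls before that year's birthday
  before_birthday <- (event$mon < birth$mon) |
    (event$mon == birth$mon & event$mday < birth$mday)
  years - as.integer(before_birthday)
}

birth_date <- as.Date("2004-09-15")
swim_date <- as.Date("2022-08-13")

year_difference <- as.numeric(format(swim_date, "%Y")) -
  as.numeric(format(birth_date, "%Y"))
exact_age <- age_in_years(birth_date, swim_date)

c(year_difference = year_difference, exact_age = exact_age)
```

The year difference is 18 while the completed-years age is 17, so ages near the edges of the histogram may shift by one bar under the exact calculation.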

Important Events
===

Column {data-width=500 .tabset}
-----------------------------------------------------------------------

### Men Free

``` {r M Freestyle}
M.Freestyle <- filter(swimming, Event.description == "Men 100 Freestyle LCM Male")
M.Freestyle$Swim.time <- as.numeric(M.Freestyle$Swim.time)
ggplot(M.Freestyle, aes(x = Swim.time, y = continents)) +
  geom_boxplot(fill = "#77AF9C") +
  labs(title = "Top 200 Men's 100 Freestyle Times",
       x = "Swim Time",
       y = "Continents")
```

### Women Free

``` {r W Freestyle}
W.Freestyle <- filter(swimming, Event.description == "Women 100 Freestyle LCM Female")
W.Freestyle$Swim.time <- as.numeric(W.Freestyle$Swim.time)
ggplot(W.Freestyle, aes(x = Swim.time, y = continents)) +
  geom_boxplot(fill = "#77AF9C") +
  labs(title = "Top 200 Women's 100 Freestyle Times",
       x = "Swim Time",
       y = "Continents")
```

### Men Age

``` {r Men Age}
ggplot(M.Freestyle, aes(x = Swim.time, y = age.at.event)) +
  geom_point(color = "darkblue") +
  labs(title = "Age of Top 200 Men's 100 Freestyle",
       x = "Swim Time",
       y = "Age at Time of Event")
```

### Women Age

```{r W Age}
ggplot(W.Freestyle, aes(x = Swim.time, y = age.at.event)) +
  geom_point(color = "darkblue") +
  labs(title = "Age of Top 200 Women's 100 Freestyle",
       x = "Swim Time",
       y = "Age at Time of Event")
```



Column {data-width=500}
-----------------------------------------------------------------------

### Boxplot Analysis

For the men's 100 freestyle, most times cluster around 47.7 seconds, with the fastest closer to 46.9. The boxplot shows a lot of outliers, and one of these is the Romanian swimmer David Popovici. Looking at the continents, there is only one time from Africa, while Europe shows the most variety, the fastest times, and many outliers. The box for the Americas is very wide, which is especially interesting.

The women's 100 freestyle times sit around 54 seconds, with some outliers around 51.5. The best swim time is from Fran Halsall, who swims for Great Britain. Again, very few times come from Africa, or the African times are all nearly identical. Either way, Europe had the best time in this event too, with an outlier at approximately 51.5. Interestingly, there is also a very large spread in the times from Oceania, which includes Australia.
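
The points drawn beyond the whiskers of a boxplot follow a fixed rule: anything more than 1.5 times the interquartile range outside the box is flagged. A minimal sketch of that rule using base R's `boxplot.stats()`, with made-up times rather than the real freestyle data:

```r
# Made-up 100 freestyle times in seconds, with one unusually fast swim
times <- c(47.2, 47.5, 47.6, 47.7, 47.7, 47.8, 47.9, 48.0, 46.2)

# boxplot.stats() returns the five-number summary and the flagged outliers
stats <- boxplot.stats(times)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values more than 1.5 * IQR beyond the hinges
```

Applying `boxplot.stats()` to the times of each continent would list exactly which swims appear as outlier points in the boxplots.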

### Scatterplot Analysis

The ages of the swimmers in these two events do not appear to be related to their times. I believe this holds throughout swimming as well, since there is a wide range of ages in this study. As seen in the earlier histogram, there are many more 22-year-olds than swimmers of any other age, yet that does not match the top times in the men's and women's 100 freestyle: the men's top time came from an 18-year-old and the women's from a 24-year-old.
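
A scatterplot impression like this can be checked with a correlation coefficient. The sketch below uses made-up times and ages in place of `M.Freestyle`, so its result says nothing about the real data; it only shows the calls involved.

```r
# Made-up swim times (seconds) and ages, standing in for one event
free <- data.frame(
  Swim.time = c(46.9, 47.2, 47.4, 47.5, 47.6, 47.7, 47.8, 47.9),
  age.at.event = c(18, 26, 21, 24, 19, 28, 22, 23)
)

# Pearson correlation between age and time: values near 0 mean no linear link
r <- cor(free$Swim.time, free$age.at.event)

# cor.test() adds a p-value and confidence interval for the same coefficient
age_time_test <- cor.test(free$Swim.time, free$age.at.event)

r
age_time_test$p.value
```

A coefficient close to zero with a large p-value on the real event data would support the reading that age and time are unrelated within the top 200.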



Conclusion
===

Column {data-width=500 .tabset}
-----------------------------------------------------------------------

### Discussion

I believe that this analysis project was a success, as I was able to draw several interpretations from the data that helped me answer some of my major questions. One discovery was the large discrepancy between the continents that were analyzed: Europe was one of the most active continents in terms of swimmers making the top 200 times, but the USA has by far the most entries of any single team.

Secondly, when looking at the age of the swimmers, the distribution clearly peaked at 22, the most common age. Though a lot of claims can be made about the ideal time in a swimmer's life to excel at the sport, it turned out that younger swimmers, 18 or close to it, have recently been taking first place in some events. This might be a new trend that the peak age will shift towards.

Thirdly, as far as gender goes, men post faster times than women, especially at the highest levels of Olympic swimming. This reflects physiological differences between the male and female body rather than anything else. It was never really an argument that I was trying to make, since women and men already swim in separate events.

### Sources

https://blog.myswimpro.com/2021/08/10/why-does-the-usa-win-so-many-medals-in-swimming/

https://data.world/romanian-data/swimming-dataset-top-200-world-times

Column {data-width=500}
-----------------------------------------------------------------------

### Limitations

One limitation of this study was the large number of events, which made it hard to analyze all of them for trends. If I were able to analyze every event by its rankings at the same time, it would be much easier to see whether the patterns described in the discussion section are specific to some events or apply to all of swimming.
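
One way to look at every event at once is to summarise by event instead of filtering one event at a time. A minimal sketch with made-up rows for two events; the column names follow the ones used earlier in this dashboard, and the values are not from the data set.

```r
# Made-up rows for two events
toy <- data.frame(
  Event.description = rep(c("Men 100 Freestyle LCM Male",
                            "Women 100 Freestyle LCM Female"), each = 4),
  Swim.time = c(46.9, 47.3, 47.6, 47.8, 51.7, 52.3, 52.9, 53.1),
  age.at.event = c(18, 24, 22, 27, 24, 20, 22, 29)
)

# Median time and median age for every event in one call
by_event <- aggregate(cbind(Swim.time, age.at.event) ~ Event.description,
                      data = toy, FUN = median)
by_event
```

The same call on the full data set would give one row per event, which could then be compared or plotted side by side.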

### Future Studies

I believe that working through all of the events would give a more complete picture. This could mean answering the questions posed at the beginning for every event and then comparing the events to each other. I was able to find out some things about certain events, but it would take a lot longer to get through all of them.