Exploratory Data Analysis

After downloading the dataset here, the first step is to do some exploratory data analysis.

Let’s load the packages we will need to do the analysis.

library(countrycode)
library(geosphere)
library(ggplot2)
library(magrittr)
library(maps)
library(tidyverse)

Now, let’s read the dataset and take a look at the first rows.

hopper <- read_csv('datasets/puzzle.csv', col_names = c("lat", "long"))
head(hopper)
## # A tibble: 6 x 2
##     lat  long
##   <dbl> <dbl>
## 1 0.397  2.11
## 2 0.397  2.11
## 3 0.397  2.11
## 4 0.397  2.11
## 5 0.397  2.11
## 6 0.397  2.11

The dataset contains two columns lat (representing the latitude) and long (representing the longitude), let’s get now the numbers of rows.

nrow(hopper)
## [1] 1024

Now that we know a little bit more about the dataset (1024 rows (representing coordinates) and 2 columns lat and long), let’s plot the coordinates.

Basic Plot

Let’s plot the coordinates to see what it could represent.

ggplot(hopper, aes(x = long, y = lat)) +
  geom_point()

Since the dataset is from Hopper there is a good chance that these points represent airports. Let’s use a map to see it more clearly.

Map

Before using a map, let’s convert the coordinates to radians.

hopper %<>%
  mutate(
    lat = lat * (180 / pi),
    long = long * (180 / pi)
  )

Now, we can display the converted coordinates using a map.

world_map <- map_data("world")
map <- ggplot() +
  geom_polygon(data = world_map, aes(x = long, y = lat, group = group),
               fill = "lightgray", colour = "black")
map + geom_point(data = hopper, aes(x = long, y = lat), color = "red")

Using the map we can have a better visualization of the airports. Let’s do something fun and try to use a different dataset to have more informations about these airports.

OurAirports dataset

In order to get more informations about the airports, i will use the OurAirports dataset.

First step, let’s import the dataset and have a look at what informations we have about airports.

ourairports <- read_csv('datasets/airports.csv', col_types = cols_only(
  name = col_character(), latitude_deg = col_double(),
  longitude_deg = col_double(), municipality = col_character(),
  iata_code = col_character(), iso_country = col_character()
))
str(ourairports)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 55113 obs. of  6 variables:
##  $ name         : chr  "Total Rf Heliport" "Aero B Ranch Airport" "Lowell Field" "Epps Airpark" ...
##  $ latitude_deg : num  40.1 38.7 59.9 34.9 35.6 ...
##  $ longitude_deg: num  -74.9 -101.5 -151.7 -86.8 -91.3 ...
##  $ iso_country  : chr  "US" "US" "US" "US" ...
##  $ municipality : chr  "Bensalem" "Leoti" "Anchor Point" "Harvest" ...
##  $ iata_code    : chr  NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols_only(
##   ..   id = col_skip(),
##   ..   ident = col_skip(),
##   ..   type = col_skip(),
##   ..   name = col_character(),
##   ..   latitude_deg = col_double(),
##   ..   longitude_deg = col_double(),
##   ..   elevation_ft = col_skip(),
##   ..   continent = col_skip(),
##   ..   iso_country = col_character(),
##   ..   iso_region = col_skip(),
##   ..   municipality = col_character(),
##   ..   scheduled_service = col_skip(),
##   ..   gps_code = col_skip(),
##   ..   iata_code = col_character(),
##   ..   local_code = col_skip(),
##   ..   home_link = col_skip(),
##   ..   wikipedia_link = col_skip(),
##   ..   keywords = col_skip()
##   .. )

I will use the latitude_deg and longitude_deg columns to join with the lat and long columns of the previous dataset.

However, I will encounter a problem: the value of the airports’ coordinates will be different from the two datasets.

The solution is the following code:

grid <- expand.grid(1:nrow(hopper), 1:nrow(ourairports))
distances <- matrix(distGeo(hopper[grid[,1], c("long", "lat")],
                            ourairports[grid[,2], c("longitude_deg",
                                                    "latitude_deg")]),
             nrow = nrow(hopper))
infos <- ourairports[apply(distances, 1, function(x) {
  c <- which(x <= 2500)
  if (length(c) == 0) NA else c[1]
}), ]

rm(list = c("grid", "distances"))

The idea is to find the shortest distance between two points (coordinates from the hopper dataset and from the ourAirport dataset). That way I can “match” the coordinates between the two datasets and get the airports’ informations.

Here I get the shortest airport from a distance of 2500 meters or less. I tested different values but 2500 was the best one to “match” all the coordinates from the hopper dataset.

I use the function rm to delete the grid and distances since I don’t need them later on.

Next, I combine the infos dataset to the hopper dataset using cbind. Using the ISO code column and the countrycode function, I get the country name.

Some countries are not found with the countrycode function, so I manually add the values.

I also filter out airports without IATA code because I will not able to merge them later with the other datasets.

Finally, I check the updated dataset with the head function.

hopper %<>%
  cbind(infos) %>%
  select(-latitude_deg, -longitude_deg)
hopper$country <- countrycode(hopper$iso_country, "iso2c", "country.name")
hopper %<>%
  filter(!is.na(iata_code)) %>%
  mutate(
    country = ifelse(is.na(country) & iata_code == "PRN", "Kosovo", country),
    country = ifelse(is.na(country) & iata_code == "WDH", "Namibia", country))
head(hopper)
##      lat    long            name iso_country municipality iata_code
## 1 22.755 121.102 Taitung Airport          TW Taitung City       TTT
## 2 22.755 121.102 Taitung Airport          TW Taitung City       TTT
## 3 22.755 121.102 Taitung Airport          TW Taitung City       TTT
## 4 22.755 121.102 Taitung Airport          TW Taitung City       TTT
## 5 22.755 121.102 Taitung Airport          TW Taitung City       TTT
## 6 22.755 121.102 Taitung Airport          TW Taitung City       TTT
##   country
## 1  Taiwan
## 2  Taiwan
## 3  Taiwan
## 4  Taiwan
## 5  Taiwan
## 6  Taiwan

With this updated dataset, let’s explore the new informations we have in our hands.

Top 10 Most used Airports

hopper %>%
  count(name, municipality, country, name = "occurences") %>%
  arrange(desc(occurences)) %>%
  top_n(10)
## # A tibble: 11 x 4
##    name                       municipality           country     occurences
##    <chr>                      <chr>                  <chr>            <int>
##  1 Taitung Airport            Taitung City           Taiwan             101
##  2 Atatürk International Air~ Istanbul               Turkey               2
##  3 Düsseldorf Airport         Düsseldorf             Germany              2
##  4 Francisco de Sá Carneiro ~ Porto                  Portugal             2
##  5 Learmonth Airport          Exmouth                Australia            2
##  6 Naples International Airp~ Nápoli                 Italy                2
##  7 Palma De Mallorca Airport  Palma De Mallorca      Spain                2
##  8 Pinto Martins Internation~ Fortaleza              Brazil               2
##  9 Sofia Airport              Sofia                  Bulgaria             2
## 10 Southend Airport           Southend               United Kin~          2
## 11 Tri-Cities Regional TN/VA~ Bristol/Johnson/Kings~ United Sta~          2

Taitung Airport is the only airport that appears more than two times. Seems to be a very popular among the Hopper users.

Top 10 departure countries

hopper %>%
  count(country, name = "occurences") %>%
  arrange(desc(occurences)) %>%
  top_n(10)
## # A tibble: 10 x 2
##    country        occurences
##    <chr>               <int>
##  1 United States         332
##  2 Taiwan                104
##  3 Mexico                 35
##  4 Canada                 28
##  5 Brazil                 24
##  6 China                  20
##  7 Spain                  19
##  8 United Kingdom         19
##  9 India                  17
## 10 Italy                  17

OpenFlights datasets (airlines and routes)

We have the informations about the airport but it would be more interesting to also have the airlines and routes from these airports.

With all these informations combine we can build a dashboard to have a better understanding and visualization.

First step, let’s import these two new datasets available on OpenFlights.

routes <- read_csv('datasets/routes.dat',
                   col_names = c("AirlineId","Airline_OpenFlightsId","OriginId",
                                 "Origin_OpenFlightsId", "DestinationId",
                                 "Destination_OpenFlightsId","Codeshare",
                                 "Stops","Equipment"),
                   na = c('\\N', "", "NA"),
                   col_types = cols(
                     Codeshare = col_skip(),
                     Stops = col_skip(),
                     Equipment = col_skip()))
airlines <- read_csv('datasets/airlines.dat',
                     col_names = c("Airline_OpenFlightsId", "Airline_Name",
                                   "Airline_Alias","Airline_IataCode",
                                   "Airline_IcaoCode","Airline_Callsign",
                                   "Airline_CountryOfBase","Airline_Active"),
                     na = c('\\N', "", "NA"))

Before join the different datasets together, I group the airports by existing IATA code.

ourairports %<>%
  select(destinationName = name, iata_code,
         destinationLat = latitude_deg, destinationLong = longitude_deg) %>%
  filter(!is.na(iata_code), iata_code != "0") %>%
  group_by(iata_code) %>%
  slice(1) %>%
  ungroup()

In order to have our final dataset, we need to join all the different datasets together. Finally, I save the final dataset as a csv file.

hopper %<>%
  inner_join(routes, by = c("iata_code" = "OriginId")) %>%
  inner_join(ourairports, by = c("DestinationId" = "iata_code")) %>%
  select(-AirlineId) %>%
  inner_join(airlines) %>%
  select(-Origin_OpenFlightsId, -Destination_OpenFlightsId, -Airline_OpenFlightsId)
## Joining, by = "Airline_OpenFlightsId"
write_excel_csv(hopper, "datasets/hopper.csv")

This csv file is used on this dashboard where you can see the different airports with routes and airlines.