After downloading the dataset here, the first step is to do some exploratory data analysis.
Let’s load the packages we will need to do the analysis.
library(countrycode)
library(geosphere)
library(ggplot2)
library(magrittr)
library(maps)
library(tidyverse)
Now, let’s read the dataset and take a look at the first rows.
hopper <- read_csv('datasets/puzzle.csv', col_names = c("lat", "long"))
head(hopper)
## # A tibble: 6 x 2
## lat long
## <dbl> <dbl>
## 1 0.397 2.11
## 2 0.397 2.11
## 3 0.397 2.11
## 4 0.397 2.11
## 5 0.397 2.11
## 6 0.397 2.11
The dataset contains two columns lat (representing the latitude) and long (representing the longitude), let’s get now the numbers of rows.
nrow(hopper)
## [1] 1024
Now that we know a little bit more about the dataset (1024 rows (representing coordinates) and 2 columns lat and long), let’s plot the coordinates.
Let’s plot the coordinates to see what it could represent.
ggplot(hopper, aes(x = long, y = lat)) +
geom_point()
Since the dataset is from Hopper there is a good chance that these points represent airports. Let’s use a map to see it more clearly.
Before using a map, let’s convert the coordinates to radians.
hopper %<>%
mutate(
lat = lat * (180 / pi),
long = long * (180 / pi)
)
Now, we can display the converted coordinates using a map.
world_map <- map_data("world")
map <- ggplot() +
geom_polygon(data = world_map, aes(x = long, y = lat, group = group),
fill = "lightgray", colour = "black")
map + geom_point(data = hopper, aes(x = long, y = lat), color = "red")
Using the map we can have a better visualization of the airports. Let’s do something fun and try to use a different dataset to have more informations about these airports.
In order to get more informations about the airports, i will use the OurAirports dataset.
First step, let’s import the dataset and have a look at what informations we have about airports.
ourairports <- read_csv('datasets/airports.csv', col_types = cols_only(
name = col_character(), latitude_deg = col_double(),
longitude_deg = col_double(), municipality = col_character(),
iata_code = col_character(), iso_country = col_character()
))
str(ourairports)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 55113 obs. of 6 variables:
## $ name : chr "Total Rf Heliport" "Aero B Ranch Airport" "Lowell Field" "Epps Airpark" ...
## $ latitude_deg : num 40.1 38.7 59.9 34.9 35.6 ...
## $ longitude_deg: num -74.9 -101.5 -151.7 -86.8 -91.3 ...
## $ iso_country : chr "US" "US" "US" "US" ...
## $ municipality : chr "Bensalem" "Leoti" "Anchor Point" "Harvest" ...
## $ iata_code : chr NA NA NA NA ...
## - attr(*, "spec")=
## .. cols_only(
## .. id = col_skip(),
## .. ident = col_skip(),
## .. type = col_skip(),
## .. name = col_character(),
## .. latitude_deg = col_double(),
## .. longitude_deg = col_double(),
## .. elevation_ft = col_skip(),
## .. continent = col_skip(),
## .. iso_country = col_character(),
## .. iso_region = col_skip(),
## .. municipality = col_character(),
## .. scheduled_service = col_skip(),
## .. gps_code = col_skip(),
## .. iata_code = col_character(),
## .. local_code = col_skip(),
## .. home_link = col_skip(),
## .. wikipedia_link = col_skip(),
## .. keywords = col_skip()
## .. )
I will use the latitude_deg and longitude_deg columns to join with the lat and long columns of the previous dataset.
However, I will encounter a problem: the value of the airports’ coordinates will be different from the two datasets.
The solution is the following code:
grid <- expand.grid(1:nrow(hopper), 1:nrow(ourairports))
distances <- matrix(distGeo(hopper[grid[,1], c("long", "lat")],
ourairports[grid[,2], c("longitude_deg",
"latitude_deg")]),
nrow = nrow(hopper))
infos <- ourairports[apply(distances, 1, function(x) {
c <- which(x <= 2500)
if (length(c) == 0) NA else c[1]
}), ]
rm(list = c("grid", "distances"))
The idea is to find the shortest distance between two points (coordinates from the hopper dataset and from the ourAirport dataset). That way I can “match” the coordinates between the two datasets and get the airports’ informations.
Here I get the shortest airport from a distance of 2500 meters or less. I tested different values but 2500 was the best one to “match” all the coordinates from the hopper dataset.
I use the function rm to delete the grid and distances since I don’t need them later on.
Next, I combine the infos dataset to the hopper dataset using cbind. Using the ISO code column and the countrycode function, I get the country name.
Some countries are not found with the countrycode function, so I manually add the values.
I also filter out airports without IATA code because I will not able to merge them later with the other datasets.
Finally, I check the updated dataset with the head function.
hopper %<>%
cbind(infos) %>%
select(-latitude_deg, -longitude_deg)
hopper$country <- countrycode(hopper$iso_country, "iso2c", "country.name")
hopper %<>%
filter(!is.na(iata_code)) %>%
mutate(
country = ifelse(is.na(country) & iata_code == "PRN", "Kosovo", country),
country = ifelse(is.na(country) & iata_code == "WDH", "Namibia", country))
head(hopper)
## lat long name iso_country municipality iata_code
## 1 22.755 121.102 Taitung Airport TW Taitung City TTT
## 2 22.755 121.102 Taitung Airport TW Taitung City TTT
## 3 22.755 121.102 Taitung Airport TW Taitung City TTT
## 4 22.755 121.102 Taitung Airport TW Taitung City TTT
## 5 22.755 121.102 Taitung Airport TW Taitung City TTT
## 6 22.755 121.102 Taitung Airport TW Taitung City TTT
## country
## 1 Taiwan
## 2 Taiwan
## 3 Taiwan
## 4 Taiwan
## 5 Taiwan
## 6 Taiwan
With this updated dataset, let’s explore the new informations we have in our hands.
hopper %>%
count(name, municipality, country, name = "occurences") %>%
arrange(desc(occurences)) %>%
top_n(10)
## # A tibble: 11 x 4
## name municipality country occurences
## <chr> <chr> <chr> <int>
## 1 Taitung Airport Taitung City Taiwan 101
## 2 Atatürk International Air~ Istanbul Turkey 2
## 3 Düsseldorf Airport Düsseldorf Germany 2
## 4 Francisco de Sá Carneiro ~ Porto Portugal 2
## 5 Learmonth Airport Exmouth Australia 2
## 6 Naples International Airp~ Nápoli Italy 2
## 7 Palma De Mallorca Airport Palma De Mallorca Spain 2
## 8 Pinto Martins Internation~ Fortaleza Brazil 2
## 9 Sofia Airport Sofia Bulgaria 2
## 10 Southend Airport Southend United Kin~ 2
## 11 Tri-Cities Regional TN/VA~ Bristol/Johnson/Kings~ United Sta~ 2
Taitung Airport is the only airport that appears more than two times. Seems to be a very popular among the Hopper users.
hopper %>%
count(country, name = "occurences") %>%
arrange(desc(occurences)) %>%
top_n(10)
## # A tibble: 10 x 2
## country occurences
## <chr> <int>
## 1 United States 332
## 2 Taiwan 104
## 3 Mexico 35
## 4 Canada 28
## 5 Brazil 24
## 6 China 20
## 7 Spain 19
## 8 United Kingdom 19
## 9 India 17
## 10 Italy 17
We have the informations about the airport but it would be more interesting to also have the airlines and routes from these airports.
With all these informations combine we can build a dashboard to have a better understanding and visualization.
First step, let’s import these two new datasets available on OpenFlights.
routes <- read_csv('datasets/routes.dat',
col_names = c("AirlineId","Airline_OpenFlightsId","OriginId",
"Origin_OpenFlightsId", "DestinationId",
"Destination_OpenFlightsId","Codeshare",
"Stops","Equipment"),
na = c('\\N', "", "NA"),
col_types = cols(
Codeshare = col_skip(),
Stops = col_skip(),
Equipment = col_skip()))
airlines <- read_csv('datasets/airlines.dat',
col_names = c("Airline_OpenFlightsId", "Airline_Name",
"Airline_Alias","Airline_IataCode",
"Airline_IcaoCode","Airline_Callsign",
"Airline_CountryOfBase","Airline_Active"),
na = c('\\N', "", "NA"))
Before join the different datasets together, I group the airports by existing IATA code.
ourairports %<>%
select(destinationName = name, iata_code,
destinationLat = latitude_deg, destinationLong = longitude_deg) %>%
filter(!is.na(iata_code), iata_code != "0") %>%
group_by(iata_code) %>%
slice(1) %>%
ungroup()
In order to have our final dataset, we need to join all the different datasets together. Finally, I save the final dataset as a csv file.
hopper %<>%
inner_join(routes, by = c("iata_code" = "OriginId")) %>%
inner_join(ourairports, by = c("DestinationId" = "iata_code")) %>%
select(-AirlineId) %>%
inner_join(airlines) %>%
select(-Origin_OpenFlightsId, -Destination_OpenFlightsId, -Airline_OpenFlightsId)
## Joining, by = "Airline_OpenFlightsId"
write_excel_csv(hopper, "datasets/hopper.csv")
This csv file is used on this dashboard where you can see the different airports with routes and airlines.