  • Assigned: Tuesday, February 16, 2016
  • Due: By the beginning of class Thursday, February 25, 2016
  • Submit via GitHub

Web Scraping

You’ll be scraping data from this website, which lists incidents involving commercial aircraft by year.

Part a (Extra Credit)

Write a scraper that produces a pandas DataFrame containing the following columns:

  • When the accident occurred (year, month, and day - use a datetime object)
  • The short text description (everything to the right of the date)
  • The link to the detail page.
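As a rough sketch of the target shape for Part a, the snippet below builds such a DataFrame from rows that have already been pulled out of the page. The sample rows, the base URL, the column names, and the "Month day, Year" date format are all assumptions made for illustration; the real page may be laid out differently.

```python
from urllib.parse import urljoin

import pandas as pd

# Hypothetical base URL and rows: (date text, description, relative link).
# The real site's address, markup, and date format may differ.
BASE_URL = "https://example.com/incidents/"
rows = [
    ("January 5, 1950", "Example description A", "/detail/a"),
    ("March 12, 1950", "Example description B", "/detail/b"),
]

df = pd.DataFrame(rows, columns=["date", "description", "link"])

# Convert the date text to datetime values; anything that does not
# match the assumed format becomes NaT instead of raising an error.
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y", errors="coerce")

# Turn relative links into absolute ones so they can be requested later.
df["link"] = df["link"].apply(lambda href: urljoin(BASE_URL, href))
```

Using `errors="coerce"` is one possible choice here: it keeps the scrape running when a date is malformed, at the cost of having to check for missing dates afterwards.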

If you choose not to do the extra credit, you can start from this CSV file, which contains the description and the link for each page. Read it into a pandas DataFrame as a starting point for part b.

Part b

Now write code that follows each link and scrapes additional content from the detail page for each individual crash. How will you rate limit your requests to the target web server? Once you have implemented rate limiting, scrape the content in the right column of each detail page and put it in a DataFrame:

  • Number of passengers
  • Number of crew
  • Number of fatalities
  • Number of survivors
  • Registration
  • Flight origin
  • Destination
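One simple way to approach the rate-limiting question is a fixed pause between consecutive requests. The sketch below is only one option; `fetch_all`, its parameters, and the one-second default are hypothetical names and values, not part of the assignment. The function that actually downloads a page is passed in as `fetch`, so the pacing logic can be tried without touching the network.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the page content.
    sleep is injectable so the pacing can be checked without real waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the `requests` library installed, a call might look like `fetch_all(links, lambda u: requests.get(u, timeout=10).text)`. A fixed delay is the bluntest form of rate limiting; whether one second is polite enough for the target server is a judgment call.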

If there are multiple values for passengers, just save the first one for simplicity. Similarly, if there are no entries (e.g. for registration in the first link), you can simply fill that entry in the DataFrame with 'No data'.
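The "first value, else 'No data'" rule can be sketched as follows. This assumes each detail page has already been parsed into a dictionary mapping a field name to the list of strings found for it; that intermediate shape, the field names, and the `to_record` helper are hypothetical.

```python
import pandas as pd

# Assumed column names for the seven requested fields.
FIELDS = ["passengers", "crew", "fatalities", "survivors",
          "registration", "origin", "destination"]


def to_record(scraped):
    """Keep the first value found for each field, or 'No data' if none."""
    record = {}
    for field in FIELDS:
        values = scraped.get(field) or []
        record[field] = values[0] if values else "No data"
    return record


# Made-up parse results for two detail pages.
pages = [
    {"passengers": ["12", "14"], "crew": ["3"], "fatalities": ["15"],
     "survivors": ["0"], "origin": ["City A"], "destination": ["City B"]},
    {"passengers": [], "crew": ["2"], "registration": ["AB-123"]},
]

details = pd.DataFrame([to_record(p) for p in pages], columns=FIELDS)
```

Here the first page has two passenger values and no registration, so its row holds `"12"` and `"No data"` respectively.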

Part c

Which were the 5 deadliest aviation incidents? Report the number of fatalities and the flight origin for each.
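One possible approach, sketched on made-up data: because missing entries are stored as the string 'No data', the fatalities column has to be converted to numbers before ranking. The column names `fatalities` and `origin` are assumptions.

```python
import pandas as pd

# Made-up values; 'No data' marks a missing entry as described in part b.
df = pd.DataFrame({
    "fatalities": ["10", "No data", "250", "3", "75", "120", "41"],
    "origin": ["A", "B", "C", "D", "E", "F", "G"],
})

# Non-numeric entries such as 'No data' become NaN and are ranked last.
fatal = pd.to_numeric(df["fatalities"], errors="coerce")

# Take the five rows with the largest fatality counts.
top5 = df.assign(fatalities=fatal).nlargest(5, "fatalities")[["fatalities", "origin"]]
```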

Part d

Which flight origin has had the highest number of aviation incidents in the last 25 years?
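A minimal sketch of one way to answer this, again on made-up data. It assumes "the last 25 years" is counted back from the assignment date and that rows with an origin of 'No data' should be left out; both are interpretations, not requirements stated above.

```python
import pandas as pd

# Made-up incidents with assumed column names.
df = pd.DataFrame({
    "date": pd.to_datetime(["1985-06-01", "1995-03-10", "2001-11-12",
                            "2010-01-25", "2014-07-17"]),
    "origin": ["X", "Y", "Y", "No data", "Z"],
})

# Assumed reference point: 25 years before the assignment date.
cutoff = pd.Timestamp("2016-02-16") - pd.DateOffset(years=25)

# Keep recent incidents that have a known origin.
recent = df[(df["date"] >= cutoff) & (df["origin"] != "No data")]

# Count incidents per origin and pick the most frequent one.
counts = recent["origin"].value_counts()
top_origin = counts.idxmax()
```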

Part e

Save this DataFrame as JSON and commit it to your repo, along with the notebook or Python code used to do this assignment.
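For the JSON step, `DataFrame.to_json` is one option. The sketch below uses a made-up single-row frame and returns the JSON as a string; passing a file path as the first argument (for example a hypothetical `crashes.json`) writes it to disk instead. The `orient` and `date_format` settings are choices, not requirements.

```python
import json

import pandas as pd

# Made-up row with assumed column names.
df = pd.DataFrame({
    "date": pd.to_datetime(["2001-11-12"]),
    "origin": ["Y"],
    "fatalities": [7],
})

# One JSON object per row, with dates written as ISO 8601 strings.
json_text = df.to_json(orient="records", date_format="iso")

# Parse it back with the standard library to inspect the structure.
records = json.loads(json_text)
```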