• Assigned: Friday, February 7, 2016
  • Due: Before class, February 16, 2016
  • Submit via GitHub

Introduction

In this assignment you will use SQL to explore some large datasets from the Environmental Protection Agency's Resource Conservation and Recovery Act programs. All the tables are in the cfpp database whose connection information is found in a Chalk announcement.

To get credit for your answer, you must show in full the code that produced the answer.

Question 1: Handlers Exploration

The hhandler table contains registration forms for hazardous waste handlers. See the data dictionary here.

(a) How many forms have been received by the EPA?

(b) How many facilities have registered? Hint: use the count(distinct ) function

(c) How many forms were received per year over the last 5 years?

Use a GROUP BY.

Question 2: Evaluations

The cmecomp3 table contains a list of evaluations (inspections) of these handlers. See thedata dictionary here.

(a) How many evaluations are there?

(b) How many evaluations found violations?

(c) What proportion of evaluations found violations?

(d) Which five handler_ids have been found in violation the most times? How many times? Also find these handlers' site names in the hhandlers table.

Hint: Use a GROUP BY and an ORDER BY DESC.

Question 3: Industries

The North American Industry Classification System is a system used by federal agencies to classify a business according to its industry. The naics table contains this information as retrieved from here. Start by skimming this file.

(a) How many different naics codes are there? How many six-digit industry classifications are there? How many two-digit classifications are there? These determine the sectors as described here.

(b) The hnaics table contains naics codes for some handlers. How many handlers have naics codes? How many don't?

(c) Join the hnaics table with the naics table and use a GROUP BY to determine which the number of facilities in each sector. Which sector has the most hazardous-waste handlers? The least?

Hint: You can get the digit naics code from the naics_code using this expression: substring(naics_code for 2) || '----'

Hint: group by naics_description to get the description instead of the code.

(d) Create a temporary table called hnaics containing unique pairs of handler ids and sector descriptions.

Hint: Use a GROUP BY to ensure only unique pairs.

Note: We'll discuss creating temporary tables in class on Thursday.

(e) Join hsectors to cmecomp3, to determine for each sector, the number of handlers evaluated, the number of evaluations, the number of violations, and the proportion of evaluations finding violations. Which sector has the most violations? The highest proportion of evaluations finding violations?