In [10]:
import pandas as pd
%matplotlib inline
pd.options.display.max_rows = 6
In [11]:
crimes = pd.read_csv('~/www/Crimes_-_2001_to_present.csv', parse_dates=['Date'])
In [12]:
socio = pd.read_csv('Census_Data_-_Selected_socioeconomic_indicators_in_Chicago__2008___2012.csv')

Question 1: Crime counts and socioeconomics

Download the crime data for all of the year 2015. Also download the socioeconomic data.

(a) Calculate the number of crimes in each Community Area in 2015.

In [13]:
crime_counts = pd.DataFrame({'Crime Count': 
            crimes.groupby('Community Area')['ID'].count()})
crime_counts
Out[13]:
Crime Count
Community Area
0 2
1 3525
2 3063
... ...
75 2055
76 1621
77 2213

78 rows × 1 columns
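
The file name above suggests the download may span more than one year, and the notebook does not show a filter on the date. If that applies, a year filter before counting might look like the sketch below. The toy rows are invented for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy stand-in for the crimes table; these rows are made up.
toy_crimes = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Date': pd.to_datetime(['2014-12-31 23:50', '2015-01-01 00:10',
                            '2015-06-15 12:00', '2015-06-15 18:30',
                            '2016-01-01 00:05']),
    'Community Area': [1, 1, 2, 1, 2],
})

# Keep only rows dated in 2015, then count rows per Community Area.
in_2015 = toy_crimes[toy_crimes['Date'].dt.year == 2015]
toy_counts = in_2015.groupby('Community Area').size().rename('Crime Count')
print(toy_counts)
```

`size()` counts rows directly, so it does not depend on the `ID` column being free of missing values.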

(b) Sort the Community Areas by 2015 crime count. Which Community Area (by name) has the highest crime count? Which has the lowest?

In [14]:
crimes_socio = crimes.merge(socio, 
        left_on='Community Area', right_on='Community Area Number')


crimes_socio.groupby('COMMUNITY AREA NAME')['ID'].count().sort_values()
Out[14]:
COMMUNITY AREA NAME
Edison Park          254
Burnside             380
Forest Glen          444
                   ...  
South Shore         8932
Near North Side     8944
Austin             17050
Name: ID, dtype: int64
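
Reading the answer off the two ends of the sorted output works; `idxmin` and `idxmax` return the same names without sorting. The sketch below uses only the six rows visible in the output above, so it is an illustration rather than a rerun on the full table.

```python
import pandas as pd

# Counts copied from the rows shown in the sorted output above.
shown = pd.Series({'Edison Park': 254, 'Burnside': 380, 'Forest Glen': 444,
                   'South Shore': 8932, 'Near North Side': 8944, 'Austin': 17050})

lowest = shown.idxmin()    # index label of the smallest count
highest = shown.idxmax()   # index label of the largest count
print(lowest, highest)
```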

(c) Create a table whose rows are days in the year and whose columns are the crime counts for the 77 Community Areas. Select a few Community Areas that interest you and plot their time series.

In [15]:
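def day(t):
    # Truncate a timestamp to midnight so all crimes on one date share a key.
    return t.replace(hour=0, minute=0, second=0)

crimes_socio['Day'] = crimes_socio['Date'].apply(day)

# Equivalent, with an inline lambda instead of a named function:
crimes_socio['Day'] = crimes_socio['Date'].apply(
        lambda t: t.replace(hour=0, minute=0, second=0))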
In [16]:
community_crimes_by_day = crimes_socio.groupby(['COMMUNITY AREA NAME', 'Day'])['ID'].count().unstack('COMMUNITY AREA NAME')
In [17]:
community_crimes_by_day[['Hyde Park', 'Englewood', 'Washington Park']].plot(figsize=(10,5))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0389553d50>
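
The day-by-area table can also be built without a Python-level `apply`. The sketch below assumes the `Date` column is already a datetime column and uses `dt.normalize()` to truncate to midnight. The toy rows are invented. Passing `fill_value=0` to `unstack` is an optional choice that turns days with no recorded crime into zeros rather than NaN.

```python
import pandas as pd

# Toy rows, made up for illustration; column names mirror the notebook.
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Date': pd.to_datetime(['2015-01-01 08:00', '2015-01-01 21:15',
                            '2015-01-02 03:30', '2015-01-02 09:45']),
    'COMMUNITY AREA NAME': ['Hyde Park', 'Englewood', 'Hyde Park', 'Hyde Park'],
})

# Vectorized truncation to midnight.
toy['Day'] = toy['Date'].dt.normalize()

# Rows are days, columns are Community Areas, cells are crime counts.
by_day = (toy.groupby(['Day', 'COMMUNITY AREA NAME'])['ID'].count()
             .unstack('COMMUNITY AREA NAME', fill_value=0))
print(by_day)
```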

(d) By joining with the socioeconomic data, create a scatter plot of crime counts against per capita income. Summarize the relationship in words.

In [18]:
crime_counts_socio = socio.merge(
        crime_counts, left_on='Community Area Number', right_index=True)
crime_counts_socio.plot(kind='scatter', x='PER CAPITA INCOME ', y='Crime Count')
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f038a1c9b90>
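
A correlation coefficient is one way to back up a verbal summary of the scatter plot. The sketch below shows the call on invented numbers, so its result says nothing about the Chicago data. On the real table the two columns would be `'PER CAPITA INCOME '` (with its trailing space) and `'Crime Count'`.

```python
import pandas as pd

# Invented values, chosen to be perfectly linear; not real Chicago data.
toy = pd.DataFrame({'PER CAPITA INCOME ': [10000, 20000, 30000, 40000],
                    'Crime Count': [4000, 3000, 2000, 1000]})

# Pearson correlation (the default method of Series.corr).
r = toy['PER CAPITA INCOME '].corr(toy['Crime Count'])
print(r)
```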

Question 2: Community Area populations

Download the census block population data and the Community Area tracts mapping.

In [20]:
block_populations = pd.read_csv('Population_by_2010_Census_Block.csv')
In [21]:
community_tracts = pd.read_csv('community_tracts.csv')

(a) Join these together using the fact that the last six digits of the tract id in the mapping data correspond to the first six digits of the block id. However, the data portal has a bug: if the block id starts with a zero, that digit is missing!

In [22]:
community_tracts
Out[22]:
tract_id community_id
0 17031842400 44
1 17031840300 59
2 17031841100 34
... ... ...
798 17031130300 13
799 17031292200 29
800 17031630900 63

801 rows × 2 columns

In [23]:
# Restore the missing leading zero by padding to 10 digits
# (there are many ways to do this).
block_populations['CENSUS BLOCK'] = \
    block_populations['CENSUS BLOCK'].apply(lambda b: '%010d' % b)
# The first six digits of the padded block id identify the tract.
block_populations['CENSUS TRACT'] = block_populations['CENSUS BLOCK'].str[:6]
In [26]:
community_tracts['CENSUS TRACT'] = \
    community_tracts.tract_id.astype(str).str[-6:]
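
As the comment above says, there are many ways to pad. One alternative is the string method `zfill`, sketched here on two ids. The first block id is hypothetical and has lost its leading zero. The second tract id is taken from the mapping table shown earlier, and the block id paired with it is made up.

```python
import pandas as pd

# Block ids as integers; 101001000 stands for 0101001000 with the zero dropped.
blocks = pd.Series([101001000, 8424001002])
tracts = pd.Series([17031010100, 17031842400])

# Pad to 10 characters, then take the first six as the tract key.
block_key = blocks.astype(str).str.zfill(10).str[:6]
# The last six digits of the tract id form the matching key.
tract_key = tracts.astype(str).str[-6:]

print(block_key.tolist(), tract_key.tolist())
```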

(b) Calculate the total population in each Community Area.

In [27]:
block_tract_populations = community_tracts.merge(
            block_populations, on='CENSUS TRACT')
In [28]:
community_populations = pd.DataFrame({'population': block_tract_populations
            .groupby(' community_id')['TOTAL POPULATION'].sum()})
community_populations
Out[28]:
population
community_id
1 54991
2 71942
3 56362
... ...
75 22544
76 12756
77 56521

77 rows × 1 columns
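
An inner merge silently drops tracts or blocks that fail to match, which would undercount population. One way to check is an outer merge with `indicator=True`, sketched below on invented rows. The frame and column names here are simplified stand-ins for the notebook's tables.

```python
import pandas as pd

# Invented rows: tract 842400 has no blocks, block tract 999999 has no mapping.
tracts = pd.DataFrame({'CENSUS TRACT': ['010100', '842400'],
                       'community_id': [1, 44]})
blocks = pd.DataFrame({'CENSUS TRACT': ['010100', '010100', '999999'],
                       'TOTAL POPULATION': [120, 80, 50]})

# The _merge column records whether each row matched on both sides.
check = tracts.merge(blocks, on='CENSUS TRACT', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']

# Sum population only over rows that matched.
matched = check[check['_merge'] == 'both']
totals = matched.groupby('community_id')['TOTAL POPULATION'].sum()
print(unmatched)
print(totals)
```

If `unmatched` is empty on the real data, the inner merge used above loses nothing.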

Question 3: Crime rates

Using your answer to Question 2, calculate the crime rate (defined as crime count per thousand residents) for the city in 2015. Then re-answer 1(a)-(d) with crime count replaced by crime rate. Summarize your findings in words.

(a), (b)

In [29]:
crime_counts_population = crime_counts.merge(community_populations, left_index=True, right_index=True)
In [30]:
crime_counts_population['Crime Rate'] = crime_counts_population['Crime Count'] / crime_counts_population['population']*1000
crime_counts_population_socio = crime_counts_population.merge(socio, left_index=True, right_on='Community Area Number')
In [31]:
crime_counts_population_socio[['Crime Rate', 'COMMUNITY AREA NAME']].sort_values('Crime Rate')
Out[31]:
Crime Rate COMMUNITY AREA NAME
8 22.704925 Edison Park
11 23.989626 Forest Glen
73 31.896507 Mount Greenwood
... ... ...
26 256.673312 East Garfield Park
36 290.681502 Fuller Park
25 322.426532 West Garfield Park

77 rows × 2 columns
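
As a worked check of the rate formula, the sketch below recomputes the rate for the five Community Areas whose count and population both appear in the outputs above (areas 1, 2, 75, 76, 77). It is an arithmetic illustration on those visible rows only.

```python
import pandas as pd

# Crime counts and populations copied from the outputs shown earlier.
visible = pd.DataFrame(
    {'Crime Count': [3525, 3063, 2055, 1621, 2213],
     'population': [54991, 71942, 22544, 12756, 56521]},
    index=pd.Index([1, 2, 75, 76, 77], name='Community Area'))

# Crimes per thousand residents.
visible['Crime Rate'] = visible['Crime Count'] / visible['population'] * 1000
print(visible.sort_values('Crime Rate'))
```

Among these five, area 76 has the lowest count but the highest rate, which shows how normalizing by population can reorder the ranking.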

(c)

In [32]:
community_crime_rate_by_day = community_crimes_by_day / crime_counts_population_socio.set_index('COMMUNITY AREA NAME')['population'] * 1000
In [33]:
community_crime_rate_by_day[['Hyde Park', 'Englewood', 'Washington Park']].plot(figsize=(10,5))
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0386462a50>
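
The division in part (c) relies on pandas alignment: dividing a DataFrame by a Series matches the Series index against the DataFrame's column labels, regardless of order. A minimal sketch with invented area names `A` and `B`:

```python
import pandas as pd

# Invented daily counts for two hypothetical areas.
counts_by_day = pd.DataFrame({'A': [2, 4], 'B': [3, 0]},
                             index=pd.to_datetime(['2015-01-01', '2015-01-02']))
# Populations listed in a different order than the columns.
population = pd.Series({'B': 1500, 'A': 2000})

# Each column is divided by the population with the matching label.
rate_by_day = counts_by_day / population * 1000
print(rate_by_day)
```

A column whose name is missing from the Series would come out as all NaN, so mismatched area names are worth checking for.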

(d)

In [44]:
# Keep income on the x axis, as in 1(d), so the two scatter plots are comparable.
crime_counts_population_socio.plot(kind='scatter', x='PER CAPITA INCOME ', y='Crime Rate')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0383b30e50>

Question 4: Crime and Police Stations

Download the police stations data.

In [47]:
stations = pd.read_csv('Police_Stations.csv')

(a) Extract the latitudes and longitudes of the police stations (found in the LOCATION column) as floats into their own columns called 'Station Latitude' and 'Station Longitude', respectively.

In [50]:
stations['Station Latitude'] = stations['LOCATION'].apply(lambda l: l[l.rfind('(')+1:l.rfind(',')]).astype(float)
stations['Station Longitude'] = stations['LOCATION'].apply(lambda l: l[l.rfind(',')+1:l.rfind(')')]).astype(float)
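
An alternative to slicing with `rfind` is a regular expression with `str.extract`, which pulls both numbers in one pass. The sketch assumes each LOCATION value ends with a parenthesized "(latitude, longitude)" pair, as the slicing code above implies. The sample strings are invented.

```python
import pandas as pd

# Hypothetical LOCATION strings: an address followed by "(lat, lon)".
location = pd.Series(['1 Example St\nChicago, IL\n(41.85, -87.65)',
                      '2 Sample Ave\nChicago, IL\n(41.80, -87.70)'])

# Two capture groups, one per coordinate; extract returns one column per group.
pattern = r'\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)'
coords = location.str.extract(pattern).astype(float)
coords.columns = ['Station Latitude', 'Station Longitude']
print(coords)
```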

(b) Join the crime data with the stations on police district. Hint: the station district is a text field (because one of them is 'Headquarters') so you'll need to convert the crime district to the same.

In [51]:
crimes['DISTRICT'] = crimes.District.astype(str)
In [52]:
crimes_stations = crimes.merge(stations, on='DISTRICT')
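
One thing to watch when converting the district to text: if the crime `District` column was read as floats (which pandas does when an integer column has missing values), `astype(str)` yields strings like `'11.0'` that will not match `'11'`. Whether that applies here is not visible in the notebook, so the sketch below only illustrates the pitfall and one possible workaround on invented values.

```python
import pandas as pd

# Invented districts; the missing value forces a float dtype.
crime_district = pd.Series([11.0, 2.0, None])
station_district = pd.Series(['11', '2', 'Headquarters'])

# Direct conversion keeps the trailing ".0".
naive = crime_district.astype(str)

# Going through the nullable integer dtype drops it.
clean = crime_district.astype('Int64').astype(str)

print(naive.tolist())
print(clean.tolist())
```

Comparing the number of rows in `crimes_stations` with the number in `crimes` is a quick way to see whether the join dropped anything.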

(c) Define a function which calculates the distance in kilometers between two points (latitude, longitude) using the Pythagorean theorem.

In [53]:
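def distance(row):
    # 95 km per degree is a rough average for Chicago: one degree of latitude is
    # about 111 km and one degree of longitude is about 83 km at this latitude.
    return 95*( (row['Latitude']-row['Station Latitude'])**2 + (row['Longitude'] - row['Station Longitude'])**2)**.5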

(d) Calculate the distance between each crime and its district police station.

In [54]:
crimes_stations['Distance'] = crimes_stations.apply(distance, axis=1)
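
Row-wise `apply` is slow on a table with many rows. The sketch below is a vectorized variant that also scales latitude and longitude separately instead of using the single constant 95. It is still the Pythagorean (flat-earth) approximation. The function name `distance_km`, the constant 111.2 km per degree, and the toy coordinates are all choices made for this illustration.

```python
import numpy as np
import pandas as pd

KM_PER_DEG_LAT = 111.2  # approximate length of one degree of latitude

def distance_km(lat1, lon1, lat2, lon2):
    """Flat-earth distance in km between points given in degrees."""
    mean_lat = np.radians((lat1 + lat2) / 2)
    dy = (lat1 - lat2) * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dx = (lon1 - lon2) * KM_PER_DEG_LAT * np.cos(mean_lat)
    return np.sqrt(dx**2 + dy**2)

# Invented coordinates near Chicago.
toy = pd.DataFrame({'Latitude': [41.90, 41.80],
                    'Longitude': [-87.65, -87.60],
                    'Station Latitude': [41.80, 41.80],
                    'Station Longitude': [-87.65, -87.70]})

# Whole columns are passed in, so no per-row Python call is needed.
toy['Distance'] = distance_km(toy['Latitude'], toy['Longitude'],
                              toy['Station Latitude'], toy['Station Longitude'])
print(toy['Distance'])
```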

(e) Plot a histogram of crime count against distance to district police station. Summarize the relationship in words.

In [55]:
crimes_stations['Distance'].plot(kind='hist', bins=50)
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0383b0ac50>