In [1]:
import pandas as pd
pd.options.display.max_rows = 6
%matplotlib inline
In [2]:
crimes = pd.read_csv('Crimes_-_2001_to_present.csv', parse_dates=['Date'])

Group By

Given a key column, one can partition a table into groups:

In [3]:
crimes50_by_community = crimes.head(50).groupby('Community Area')
crimes50_by_community.groups
Out[3]:
{1: [26],
 6: [3, 13],
 8: [28, 38, 45, 49],
 13: [14],
 14: [1, 43],
 19: [30],
 22: [0, 27],
 23: [32],
 24: [31],
 25: [5, 24, 42],
 28: [10, 11, 22, 36],
 29: [33, 40],
 30: [16, 25],
 32: [4],
 33: [37],
 37: [48],
 41: [2],
 43: [7, 8, 20],
 49: [29],
 56: [23],
 58: [21, 46],
 61: [47],
 63: [9],
 67: [15, 19],
 69: [17],
 70: [41, 44],
 71: [6, 34],
 75: [12],
 76: [18],
 77: [35, 39]}
In [4]:
crimes_by_community = crimes.groupby('Community Area')

Aggregate

We can aggregate a column over the groups using an aggregate function.

  • Built-in aggregations can be named by string: 'sum', 'mean', 'median', 'count', 'std', 'nunique'.
  • You can also pass an arbitrary function that takes a group's values as a vector and returns a scalar.
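A custom aggregate might look like this (a toy sketch on hypothetical data, not the crimes table):

```python
import pandas as pd

# Toy data: two groups of values.
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b'],
                   'val': [1, 3, 10, 12, 17]})

# An arbitrary aggregate: receives each group's values, returns one scalar.
def value_range(values):
    return values.max() - values.min()

ranges = df.groupby('key')['val'].agg(value_range)
print(ranges)
```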
In [5]:
community_crime_count = crimes_by_community['ID'].agg('count')
community_crime_count
Out[5]:
Community Area
1     259
2     263
3     263
     ... 
75    118
76    112
77    166
Name: ID, dtype: int64
In [6]:
community_crime_count.plot(kind='bar', figsize=(12,5))
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199dd22e90>

[bar chart: crime count per community area]

Daily Timeseries

In [7]:
def to_day(timestamp):
    # Truncate a timestamp to midnight of its day.
    return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)

crimes['Day'] = crimes['Date'].apply(to_day)
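With a parsed datetime column, the same truncation can also be done without apply, via the .dt accessor (a sketch on hypothetical timestamps; normalize() resets each timestamp to midnight):

```python
import pandas as pd

# Hypothetical timestamps with time-of-day components.
dates = pd.Series(pd.to_datetime(['2015-12-01 14:32:05',
                                  '2015-12-01 23:59:59',
                                  '2015-12-02 00:15:00']))

# Vectorized equivalent of to_day: drop the time-of-day component.
days = dates.dt.normalize()
print(days)
```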
In [8]:
crimes_by_day = crimes.groupby('Day')
In [9]:
crimes_by_day['ID'].agg('count').plot()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199d631290>

[line plot: citywide daily crime count]

What about a time series for a single community area?

In [10]:
crimes[crimes['Community Area'] == 41].groupby('Day')['ID'].agg('count').plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199d396e50>

[line plot: daily crime count for community area 41]

Multiple Group By

To get a series for each community area we could just loop over them. But there is a better way.

groupby can take multiple keys, in this case community area and day. The result has a (row) MultiIndex.

In [11]:
crimes_by_community_day = crimes.groupby(['Community Area', 'Day'])
crimes_by_community_day_count = crimes_by_community_day['ID'].agg('count')
In [12]:
crimes_by_community_day_count
Out[12]:
Community Area  Day       
1               2015-12-01    14
                2015-12-02    12
                2015-12-03     6
                              ..
77              2015-12-28     8
                2015-12-29     4
                2015-12-30     5
Name: ID, dtype: int64

We can sum the Arrest column (a boolean) the same way, then take the quotient to get the arrest proportion per community area and day:

In [13]:
crimes_by_community_day_arrests = crimes_by_community_day['Arrest'].agg('sum')
In [14]:
community_day_arrest_prop = crimes_by_community_day_arrests / crimes_by_community_day_count
community_day_arrest_prop
Out[14]:
Community Area  Day       
1               2015-12-01    0.214286
                2015-12-02    0.166667
                2015-12-03    0.166667
                                ...   
77              2015-12-28    0.250000
                2015-12-29    0.500000
                2015-12-30    0.000000
dtype: float64

How can we plot several of these time series together? By unstacking the MultiIndex.
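On a small example, unstack pivots one level of a MultiIndex into the columns (toy data, not the crimes table):

```python
import pandas as pd

# A MultiIndex Series with one (group, day) pair missing: ('b', 'y').
s = pd.Series([1, 2, 3],
              index=pd.MultiIndex.from_tuples(
                  [('a', 'x'), ('a', 'y'), ('b', 'x')],
                  names=['group', 'day']))

# Pivot the 'group' level into columns; the missing pair becomes NaN.
wide = s.unstack('group')
print(wide)
```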

In [15]:
community_day_arrest_prop.unstack('Community Area')
Out[15]:
Community Area 1 2 3 4 5 6 7 8 9 10 ... 68 69 70 71 72 73 74 75 76 77
Day
2015-12-01 0.214286 0.133333 0.100000 0.375 0.200000 0.076923 0.000000 0.000000 NaN 0.000000 ... 0.333333 0.318182 0.111111 0.285714 0.0 0.000000 0.000000 0.000000 0.400000 0.0
2015-12-02 0.166667 0.000000 0.100000 0.250 0.000000 0.000000 0.000000 0.173913 0 0.000000 ... 0.173913 0.307692 0.000000 0.347826 0.0 0.500000 0.500000 0.166667 0.166667 0.0
2015-12-03 0.166667 0.181818 0.333333 0.000 0.000000 0.166667 0.000000 0.031250 0 0.000000 ... 0.250000 0.210526 0.142857 0.125000 0.0 0.285714 0.000000 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-12-29 0.000000 0.285714 0.083333 0.000 0.333333 0.125000 0.000000 0.217391 NaN 0.000000 ... 0.277778 0.200000 0.142857 0.277778 NaN 0.125000 0.333333 0.000000 0.000000 0.5
2015-12-30 0.000000 0.250000 0.428571 0.000 0.000000 0.272727 0.133333 0.222222 NaN 0.333333 ... 0.181818 0.260870 0.000000 0.142857 0.5 0.125000 0.000000 0.000000 0.250000 0.0
2015-12-31 NaN NaN NaN NaN NaN 0.000000 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

31 rows × 77 columns

There are missing values after reshaping: a NaN means there were no crimes in that area on that day, so we fill them with fillna(0).

In [16]:
community_arrest_timeseries = community_day_arrest_prop.unstack('Community Area')
community_arrest_timeseries.fillna(0, inplace=True)
community_arrest_timeseries
Out[16]:
Community Area 1 2 3 4 5 6 7 8 9 10 ... 68 69 70 71 72 73 74 75 76 77
Day
2015-12-01 0.214286 0.133333 0.100000 0.375 0.200000 0.076923 0.000000 0.000000 0 0.000000 ... 0.333333 0.318182 0.111111 0.285714 0.0 0.000000 0.000000 0.000000 0.400000 0.0
2015-12-02 0.166667 0.000000 0.100000 0.250 0.000000 0.000000 0.000000 0.173913 0 0.000000 ... 0.173913 0.307692 0.000000 0.347826 0.0 0.500000 0.500000 0.166667 0.166667 0.0
2015-12-03 0.166667 0.181818 0.333333 0.000 0.000000 0.166667 0.000000 0.031250 0 0.000000 ... 0.250000 0.210526 0.142857 0.125000 0.0 0.285714 0.000000 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2015-12-29 0.000000 0.285714 0.083333 0.000 0.333333 0.125000 0.000000 0.217391 0 0.000000 ... 0.277778 0.200000 0.142857 0.277778 0.0 0.125000 0.333333 0.000000 0.000000 0.5
2015-12-30 0.000000 0.250000 0.428571 0.000 0.000000 0.272727 0.133333 0.222222 0 0.333333 ... 0.181818 0.260870 0.000000 0.142857 0.5 0.125000 0.000000 0.000000 0.250000 0.0
2015-12-31 0.000000 0.000000 0.000000 0.000 0.000000 0.000000 0.000000 0.000000 0 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.0

31 rows × 77 columns

Now we can plot multiple community area timeseries:

In [17]:
community_arrest_timeseries[[40,41,42]].plot()
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199d22d110>

[line plot: daily arrest proportion for community areas 40, 41, and 42]

Another dataset

Let's look at Chicago's affordable housing dataset.

In [18]:
housing = pd.read_csv('Affordable_Rental_Housing_Developments.csv')
In [19]:
housing
Out[19]:
Community Area Name Community Area Number Property Type Property Name Address Zip Code Phone Number Management Company Units X Coordinate Y Coordinate Latitude Longitude Location
0 Albany Park 14 Senior Mayfair Commons 4444 W. Lawrence Ave. 60630 773-205-7862 Metroplex, Inc. 97 1145674.753818 1931569.979045 41.968224 -87.739747 4444 W. Lawrence Ave.\n(41.9682242321, -87.739...
1 Albany Park 14 Senior Senior Suites of Ravenswood Manor 2800 W. Montrose Ave. 60618 773-509-9333 Senior Lifestyle Corp. 80 1156898.513945 1929148.460973 41.961359 -87.698544 2800 W. Montrose Ave.\n(41.9613586525, -87.698...
2 Albany Park 14 Senior Darul Amaan Senior Living 4814-58 N. Kedzie Ave. 60625 773-463-3004 East Lake Management & Development Corp. 56 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase I 6227-29 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 67 1186722.404223 1863775.146697 41.781408 -87.605807 6227-29 S. Cottage Grove Ave.\n(41.7814080037,...
262 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase II 6129 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 33 1186094.258818 1864209.572438 41.782505 -87.590962 6129 S. Cottage Grove Ave.\n(41.7825052194, -8...
263 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1864312.842731 41.782803 -87.593261 (41.7828034603, -87.5932613147)

264 rows × 14 columns

Join

What if we want to know the crime count in the community area of each housing development?
We can find this using what's called a join, or merge.
First let's turn the community_crime_count Series into a DataFrame:

In [20]:
community_area_crime = pd.DataFrame({'Crime Count': community_crime_count})
community_area_crime
Out[20]:
Crime Count
Community Area
1 259
2 263
3 263
... ...
75 118
76 112
77 166

77 rows × 1 columns

In [21]:
housing_crime = housing.merge(community_area_crime, 
                              left_on='Community Area Number', right_index=True)
housing_crime
Out[21]:
Community Area Name Community Area Number Property Type Property Name Address Zip Code Phone Number Management Company Units X Coordinate Y Coordinate Latitude Longitude Location Crime Count
0 Albany Park 14 Senior Mayfair Commons 4444 W. Lawrence Ave. 60630 773-205-7862 Metroplex, Inc. 97 1145674.753818 1931569.979045 41.968224 -87.739747 4444 W. Lawrence Ave.\n(41.9682242321, -87.739... 195
1 Albany Park 14 Senior Senior Suites of Ravenswood Manor 2800 W. Montrose Ave. 60618 773-509-9333 Senior Lifestyle Corp. 80 1156898.513945 1929148.460973 41.961359 -87.698544 2800 W. Montrose Ave.\n(41.9613586525, -87.698... 195
2 Albany Park 14 Senior Darul Amaan Senior Living 4814-58 N. Kedzie Ave. 60625 773-463-3004 East Lake Management & Development Corp. 56 NaN NaN NaN NaN NaN 195
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 Woodlawn 42 Senior Butler Apartments 6146 S. Kenwood Ave. 60637 773-955-9290 Woodlawn Comm. Dev. Corp. 17 1182676.846895 1862140.114893 41.776909 -87.603954 6146 S. Kenwood Ave.\n(41.7769092523, -87.6039... 265
261 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase I 6227-29 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 67 1186722.404223 1863775.146697 41.781408 -87.605807 6227-29 S. Cottage Grove Ave.\n(41.7814080037,... 265
262 Woodlawn 42 Multifamily The Grant at Woodlawn Pk Phase II 6129 S. Cottage Grove Ave. 60637 312-283-0030 Preservation Housing Management LLC 33 1186094.258818 1864209.572438 41.782505 -87.590962 6129 S. Cottage Grove Ave.\n(41.7825052194, -8... 265

260 rows × 15 columns

Another aggregate

We can pass other functions to aggregate:

In [22]:
community_housing = pd.DataFrame({
    'Affordable Housing Units': housing.groupby('Community Area Number')['Units'].agg('sum')
})
community_housing
Out[22]:
Affordable Housing Units
Community Area Number
1 517
2 307
3 940
... ...
71 352
73 142
77 504

53 rows × 1 columns

In [23]:
housing_crime_aggregate = community_housing.join(community_area_crime)
housing_crime_aggregate.shape
Out[23]:
(53, 2)

Join types

Why does the above join only have entries for 53 community areas?
Because DataFrame.join defaults to a left join (merge defaults to an inner join). There are four types of joins:

  • left
  • right
  • inner
  • outer

In this case we want an outer join.
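The difference in row counts is easy to see on toy frames (hypothetical data):

```python
import pandas as pd

left = pd.DataFrame({'v1': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'v2': [10, 30]}, index=['a', 'c'])

# inner keeps only shared keys; outer keeps every key from both sides,
# filling the unmatched side with NaN.
inner = left.merge(right, left_index=True, right_index=True, how='inner')
outer = left.merge(right, left_index=True, right_index=True, how='outer')
print(inner.shape)  # (1, 2): only 'a'
print(outer.shape)  # (3, 2): 'a', 'b', 'c'
```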

In [24]:
housing_crime_aggregate = community_housing.merge(
        community_area_crime, left_index=True, right_index=True, how='outer')
housing_crime_aggregate.shape
Out[24]:
(77, 2)
In [25]:
housing_crime_aggregate
Out[25]:
Affordable Housing Units Crime Count
1 517 259
2 307 263
3 940 263
... ... ...
75 NaN 118
76 NaN 112
77 504 166

77 rows × 2 columns

Rows with no matching row on the other side of the merge are filled with missing values. In our case those should be zeros, so we can fillna(0).

In [26]:
housing_crime_aggregate.fillna(0, inplace=True)
housing_crime_aggregate.plot(kind='scatter', x='Affordable Housing Units', y='Crime Count')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199d028ad0>