In [57]:
import pandas as pd
pd.options.display.max_rows = 8
%matplotlib inline





















Types

You can get the type of a python variable using the type function

In [58]:
x = 1
print(type(x))
<type 'int'>

In [59]:
y = 'Eric'
print(type(y))
<type 'str'>

In [60]:
z = True
print(type(z))
<type 'bool'>

In [61]:
l = [1, 2, 3]
print(type(l))
<type 'list'>





















Type Conversion

Also recall that we can cast between different types, for example:

In [62]:
print(3 + 1)
4

In [63]:
print(str(3) + str(1))
31





















Custom Functions: Definition

We can also define our own functions or methods. This is good practice because it makes the programmer's job easier and the resulting program less error-prone.

In [64]:
def print_type(var):
    print(str(var) + ' is type' + str(type(var)))

Where:

  • def is the keyword for defining a method
  • print_type is the method
  • var is the argument





















Custom Functions: Calling

Now we can call our function just like we would any other python function:

In [65]:
for var in [x,y,z,l]:
    print_type(var)
1 is type<type 'int'>
Eric is type<type 'str'>
True is type<type 'bool'>
[1, 2, 3] is type<type 'list'>





















Custom Functions: Returning

Functions can also have a return value, for example the type() function returns the type, the int() function returns the cast integer, etc. We can write our own functions that return values like so:

In [66]:
def return_max(x,y):
    if x > y:
        return x
    else:
        return y
In [67]:
z = return_max(1,2) # now z is the max of 1 and 2
print(z)
2





















Objects

Besides the above built-in types, python has many other types. What about the data we read from our csv file?

In [68]:
df = pd.read_csv('Current_Employee_Names__Salaries__and_Position_Titles.csv')
df
Out[68]:
Name Position Title Department Employee Annual Salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $88968.00
1 AARON, JEFFERY M POLICE OFFICER POLICE $80778.00
2 AARON, KARINA POLICE OFFICER POLICE $80778.00
3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES $84780.00
... ... ... ... ...
32178 ZYMANTAS, MARK E POLICE OFFICER POLICE $83616.00
32179 ZYRKOWSKI, CARLO E POLICE OFFICER POLICE $86520.00
32180 ZYSKOWSKI, DARIUSZ CHIEF DATA BASE ANALYST DoIT $110352.00
32181 NaN NaN NaN NaN

32182 rows × 4 columns





















Objects

In [69]:
print(type(df))
<class 'pandas.core.frame.DataFrame'>

A "kind" of object (like int or float or DataFrame) is called a class or a type. An instance or example of a type (1, 'Eric', df) are called objects.

Objects have members, including functions and attributes.





















Objects: Methods

In [70]:
df.dropna(inplace=True)
df
Out[70]:
Name Position Title Department Employee Annual Salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $88968.00
1 AARON, JEFFERY M POLICE OFFICER POLICE $80778.00
2 AARON, KARINA POLICE OFFICER POLICE $80778.00
3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES $84780.00
... ... ... ... ...
32177 ZYGOWICZ, PETER J POLICE OFFICER POLICE $86520.00
32178 ZYMANTAS, MARK E POLICE OFFICER POLICE $83616.00
32179 ZYRKOWSKI, CARLO E POLICE OFFICER POLICE $86520.00
32180 ZYSKOWSKI, DARIUSZ CHIEF DATA BASE ANALYST DoIT $110352.00

32181 rows × 4 columns





















Objects: Methods

In [71]:
df.sort_values(by='Department')
Out[71]:
Name Position Title Department Employee Annual Salary
2812 BREWER, JAMIE ADMINISTRATIVE ASST II ADMIN HEARNG $63708.00
3019 BROWN, ELOUISE V ADMINISTRATIVE ASST II ADMIN HEARNG $66684.00
3338 BURAGE, LORRIE A CUSTOMER ACCOUNT REPRESENTATIVE ADMIN HEARNG $55428.00
4236 CASSELLA, CATHERINE ADMINISTRATIVE ASST II ADMIN HEARNG $60780.00
... ... ... ... ...
32104 ZIZUMBO, DANIEL POOL MOTOR TRUCK DRIVER WATER MGMNT $71780.80
32128 ZOTTA, SANDINO MECHANICAL ENGINEER IV WATER MGMNT $104736.00
32157 ZUNO, ERIK LABORER - APPRENTICE WATER MGMNT $63232.00
32170 ZWARYCZ, THOMAS J POOL MOTOR TRUCK DRIVER WATER MGMNT $71780.80

32181 rows × 4 columns





















Objects: Documentation

Learn about methods by reading the pandas documentation

In [72]:
df.sort_values('Department', ascending=False)
Out[72]:
Name Position Title Department Employee Annual Salary
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $88968.00
4 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT $104736.00
18 ABDUL-KARIM, MUHAMMAD A ENGINEERING TECHNICIAN VI WATER MGMNT $106104.00
29 ABRAHAM, GIRLEY T CIVIL ENGINEER IV WATER MGMNT $104736.00
... ... ... ... ...
28206 TAKATA, NANCY M EXEC ADMINISTRATIVE ASST II ADMIN HEARNG $67224.00
30552 WEINERT, DARCI N SENIOR HEARING OFFICER ADMIN HEARNG $66696.00
30907 WILKINS, THADDEUS SENIOR ADMINISTRATIVE LAW OFFICER ADMIN HEARNG $91980.00
31532 WOODRIDGE, RACHENETTE ADMINISTRATIVE ASST II ADMIN HEARNG $58020.00

32181 rows × 4 columns





















Objects: Attributes

In addition to functions, objects have (data) attributes:

In [73]:
df.shape
Out[73]:
(32181, 4)
In [74]:
df.columns
Out[74]:
Index([u'Name', u'Position Title', u'Department', u'Employee Annual Salary'], dtype='object')





















DataFrames

DataFrames are tables. They have rows and columns.

The columns are indexed by their names. We can list the columns using the attribute columns:

In [75]:
print(df.columns)
Index([u'Name', u'Position Title', u'Department', u'Employee Annual Salary'], dtype='object')

The rows are indexed by the attribute index:

In [76]:
print(df.index)
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            32171, 32172, 32173, 32174, 32175, 32176, 32177, 32178, 32179,
            32180],
           dtype='int64', length=32181)





















Python Indexing

We saw last class that we can select a column from our tables as follows:

In [77]:
departments = df['Department']
print(departments)
0             WATER MGMNT
1                  POLICE
2                  POLICE
3        GENERAL SERVICES
               ...       
32177              POLICE
32178              POLICE
32179              POLICE
32180                DoIT
Name: Department, dtype: object

A single column has its own type, called Series, which is like a column.

In [78]:
print(type(departments))
<class 'pandas.core.series.Series'>





















Pandas Indexing

This is called indexing and is a generalization of the simpler indexing that we use with lists. All indexing in python is denoted using square brackets [].

In [79]:
l = ['apple','banana', 'orange']
l[2]
Out[79]:
'orange'

For lists the index is an integer, between 0 and len(list)-1.

In [80]:
d = {'a': 1, 'b': 2, 'c': 3}
d['a']
Out[80]:
1





















Pandas Indexing: Columns

We can also select multiple columns by indexing with a list of column names:

In [81]:
df[['Name', 'Department']]
Out[81]:
Name Department
0 AARON, ELVIA J WATER MGMNT
1 AARON, JEFFERY M POLICE
2 AARON, KARINA POLICE
3 AARON, KIMBERLEI R GENERAL SERVICES
... ... ...
32177 ZYGOWICZ, PETER J POLICE
32178 ZYMANTAS, MARK E POLICE
32179 ZYRKOWSKI, CARLO E POLICE
32180 ZYSKOWSKI, DARIUSZ DoIT

32181 rows × 2 columns





















Pandas Indexing: Rows

We also saw that indexes can select a subset of rows as follows:

In [82]:
df[df.Department == 'POLICE']
Out[82]:
Name Position Title Department Employee Annual Salary
1 AARON, JEFFERY M POLICE OFFICER POLICE $80778.00
2 AARON, KARINA POLICE OFFICER POLICE $80778.00
9 ABBATE, TERRY M POLICE OFFICER POLICE $86520.00
11 ABBOTT, LYNISE M CLERK III POLICE $43920.00
... ... ... ... ...
32174 ZYDEK, BRYAN POLICE OFFICER POLICE $80778.00
32177 ZYGOWICZ, PETER J POLICE OFFICER POLICE $86520.00
32178 ZYMANTAS, MARK E POLICE OFFICER POLICE $83616.00
32179 ZYRKOWSKI, CARLO E POLICE OFFICER POLICE $86520.00

13570 rows × 4 columns





















Pandas Indexing: Rows

How does this work? What is the index in the expression df[df.Department == 'POLICE']?

In [83]:
df['Department'] == 'POLICE'
Out[83]:
0        False
1         True
2         True
3        False
         ...  
32177     True
32178     True
32179     True
32180    False
Name: Department, dtype: bool

This is called a boolean series.

  • It is a series of the same index (rows) as df.
  • For each (row) index value (in this case 0 through 32180) the value of the series is a boolean (True or False)

in this case the booleans indicate whether the Department column is equal to 'Police'. When we pass it as an index we are selecting the rows for which the series is True.





















Applying functions to a Series

In [84]:
salaries = df['Employee Annual Salary'].copy()
salaries
Out[84]:
0         $88968.00
1         $80778.00
2         $80778.00
3         $84780.00
            ...    
32177     $86520.00
32178     $83616.00
32179     $86520.00
32180    $110352.00
Name: Employee Annual Salary, dtype: object





















String manipulation

In [85]:
salary = salaries[0]
print_type(salary)
$88968.00 is type<type 'str'>

How to make this a number? Remove the first character and cast to float

In [86]:
t = float(salary[1:])
print_type(t)
88968.0 is type<type 'float'>





















Manipulating a Series

Want to do this for every salary? Could just loop over and assign:

In [87]:
for i in range(len(salaries)):
    salaries[i] = float(salaries[i][1:])
salaries
Out[87]:
0         88968
1         80778
2         80778
3         84780
          ...  
32177     86520
32178     83616
32179     86520
32180    110352
Name: Employee Annual Salary, dtype: object





















Manipulating a Series

Alternatively define a function to do the conversion and rewrite the loop:

In [88]:
def salary_to_float(salary):
    return float(salary[1:])

salaries = df['Employee Annual Salary'].copy()
for i in range(len(salaries)):
    salaries[i] = salary_to_float(salaries[i])
salaries
Out[88]:
0         88968
1         80778
2         80778
3         84780
          ...  
32177     86520
32178     83616
32179     86520
32180    110352
Name: Employee Annual Salary, dtype: object





















Applying a function to a Series

Better yet, we could use the Series.apply function, which applies a given function to every element in a Series (every entry in a given column):

In [89]:
df['Employee Annual Salary'].apply(salary_to_float)
Out[89]:
0         88968
1         80778
2         80778
3         84780
          ...  
32177     86520
32178     83616
32179     86520
32180    110352
Name: Employee Annual Salary, dtype: float64

Finally we can reassign the original column to our converted column:

In [90]:
df['Employee Annual Salary Numeric'] = df['Employee Annual Salary'].apply(salary_to_float)
df
Out[90]:
Name Position Title Department Employee Annual Salary Employee Annual Salary Numeric
0 AARON, ELVIA J WATER RATE TAKER WATER MGMNT $88968.00 88968
1 AARON, JEFFERY M POLICE OFFICER POLICE $80778.00 80778
2 AARON, KARINA POLICE OFFICER POLICE $80778.00 80778
3 AARON, KIMBERLEI R CHIEF CONTRACT EXPEDITER GENERAL SERVICES $84780.00 84780
... ... ... ... ... ...
32177 ZYGOWICZ, PETER J POLICE OFFICER POLICE $86520.00 86520
32178 ZYMANTAS, MARK E POLICE OFFICER POLICE $83616.00 83616
32179 ZYRKOWSKI, CARLO E POLICE OFFICER POLICE $86520.00 86520
32180 ZYSKOWSKI, DARIUSZ CHIEF DATA BASE ANALYST DoIT $110352.00 110352

32181 rows × 5 columns





















Datetimes

Dates and times are important for data analysis. Let's look again at the Chicago crime dataset.

In [91]:
df_crimes = pd.read_csv('Crimes_-_2001_to_present.csv', parse_dates=['Date'])
df_crimes
Out[91]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
0 10367002 HZ103159 2016-01-03 23:57:00 016XX W 77TH ST 0486 BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE False True ... 17 71 08B 1166811 1853653 2016 01/10/2016 08:46:55 AM 41.753985 -87.664263 (41.753985135, -87.664263469)
1 10366954 HZ103177 2016-01-03 23:45:00 003XX W 42ND ST 0930 MOTOR VEHICLE THEFT THEFT/RECOVERY: AUTOMOBILE STREET False False ... 3 37 07 1174795 1877110 2016 01/10/2016 08:46:55 AM 41.818180 -87.634306 (41.818179533, -87.634305993)
2 10366945 HZ103160 2016-01-03 23:44:00 055XX N LINCOLN AVE 502R OTHER OFFENSE VEHICLE TITLE/REG OFFENSE STREET True False ... 40 4 26 1157929 1936966 2016 01/10/2016 08:46:55 AM 41.982789 -87.694541 (41.982789367, -87.69454123)
3 10367534 HZ103498 2016-01-03 23:36:00 081XX S ESSEX AVE 1320 CRIMINAL DAMAGE TO VEHICLE STREET False False ... 7 46 14 1194260 1851569 2016 01/10/2016 08:46:55 AM 41.747637 -87.563742 (41.747636668, -87.563741805)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1726 10365595 HZ100414 2016-01-01 00:01:00 020XX N NEWLAND AVE 1310 CRIMINAL DAMAGE TO PROPERTY RESIDENCE False False ... 36 25 14 1129760 1912773 2016 01/10/2016 08:46:55 AM 41.916931 -87.798697 (41.916930792, -87.798697185)
1727 10368855 HZ104913 2016-01-01 00:01:00 085XX S SANGAMON ST 2820 OTHER OFFENSE TELEPHONE THREAT RESIDENCE False False ... 21 71 26 NaN NaN 2016 01/10/2016 08:46:55 AM NaN NaN NaN
1728 10365598 HZ101287 2016-01-01 00:01:00 056XX S NARRAGANSETT AVE 1320 CRIMINAL DAMAGE TO VEHICLE STREET False False ... 23 56 14 1134682 1866432 2016 01/10/2016 08:46:55 AM 41.789678 -87.781707 (41.789678467, -87.781707212)
1729 10366666 HZ102516 2016-01-01 00:01:00 043XX N SPAULDING AVE 1320 CRIMINAL DAMAGE TO VEHICLE STREET False False ... 33 16 14 NaN NaN 2016 01/10/2016 08:46:55 AM NaN NaN NaN

1730 rows × 22 columns





















Datetime series

In [92]:
df_crimes['Date']
Out[92]:
0      2016-01-03 23:57:00
1      2016-01-03 23:45:00
2      2016-01-03 23:44:00
3      2016-01-03 23:36:00
               ...        
1726   2016-01-01 00:01:00
1727   2016-01-01 00:01:00
1728   2016-01-01 00:01:00
1729   2016-01-01 00:01:00
Name: Date, dtype: datetime64[ns]





















Timestamp object

In [93]:
date = df_crimes['Date'][0]
print_type(date)
2016-01-03 23:57:00 is type<class 'pandas.tslib.Timestamp'>

In [94]:
print(date.year)
print(date.month)
print(date.day)
print(date.dayofweek)
2016
1
3
6





















Applying dayofweek

In [95]:
def dayofweek(date):
    return date.dayofweek
In [96]:
df_crimes['Date'].apply(dayofweek)
Out[96]:
0       6
1       6
2       6
3       6
       ..
1726    4
1727    4
1728    4
1729    4
Name: Date, dtype: int64





















Applying dayofweek

In [97]:
df_crimes['DayOfWeek'] = df_crimes['Date'].apply(dayofweek)
In [98]:
df_crimes
Out[98]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location DayOfWeek
0 10367002 HZ103159 2016-01-03 23:57:00 016XX W 77TH ST 0486 BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE False True ... 71 08B 1166811 1853653 2016 01/10/2016 08:46:55 AM 41.753985 -87.664263 (41.753985135, -87.664263469) 6
1 10366954 HZ103177 2016-01-03 23:45:00 003XX W 42ND ST 0930 MOTOR VEHICLE THEFT THEFT/RECOVERY: AUTOMOBILE STREET False False ... 37 07 1174795 1877110 2016 01/10/2016 08:46:55 AM 41.818180 -87.634306 (41.818179533, -87.634305993) 6
2 10366945 HZ103160 2016-01-03 23:44:00 055XX N LINCOLN AVE 502R OTHER OFFENSE VEHICLE TITLE/REG OFFENSE STREET True False ... 4 26 1157929 1936966 2016 01/10/2016 08:46:55 AM 41.982789 -87.694541 (41.982789367, -87.69454123) 6
3 10367534 HZ103498 2016-01-03 23:36:00 081XX S ESSEX AVE 1320 CRIMINAL DAMAGE TO VEHICLE STREET False False ... 46 14 1194260 1851569 2016 01/10/2016 08:46:55 AM 41.747637 -87.563742 (41.747636668, -87.563741805) 6
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1726 10365595 HZ100414 2016-01-01 00:01:00 020XX N NEWLAND AVE 1310 CRIMINAL DAMAGE TO PROPERTY RESIDENCE False False ... 25 14 1129760 1912773 2016 01/10/2016 08:46:55 AM 41.916931 -87.798697 (41.916930792, -87.798697185) 4
1727 10368855 HZ104913 2016-01-01 00:01:00 085XX S SANGAMON ST 2820 OTHER OFFENSE TELEPHONE THREAT RESIDENCE False False ... 71 26 NaN NaN 2016 01/10/2016 08:46:55 AM NaN NaN NaN 4
1728 10365598 HZ101287 2016-01-01 00:01:00 056XX S NARRAGANSETT AVE 1320 CRIMINAL DAMAGE TO VEHICLE STREET False False ... 56 14 1134682 1866432 2016 01/10/2016 08:46:55 AM 41.789678 -87.781707 (41.789678467, -87.781707212) 4
1729 10366666 HZ102516 2016-01-01 00:01:00 043XX N SPAULDING AVE 1320 CRIMINAL DAMAGE TO VEHICLE STREET False False ... 16 14 NaN NaN 2016 01/10/2016 08:46:55 AM NaN NaN NaN 4

1730 rows × 23 columns





















Complex Queries: Logical Operations on Series

Given boolean series there are three basic logical operations: and (&), or (|) and not (~). We can use these on boolean series to do more complex queries of our table.

In [107]:
df['Employee Annual Salary Numeric'] > 100000
Out[107]:
0        False
1        False
2        False
3        False
         ...  
32177    False
32178    False
32179    False
32180     True
Name: Employee Annual Salary Numeric, dtype: bool





















Logical operation on series

In [106]:
~(df['Employee Annual Salary Numeric'] > 100000)
Out[106]:
0         True
1         True
2         True
3         True
         ...  
32177     True
32178     True
32179     True
32180    False
Name: Employee Annual Salary Numeric, dtype: bool





















Row indexing

In [108]:
df[df['Employee Annual Salary Numeric'] > 100000]
Out[108]:
Name Position Title Department Employee Annual Salary Employee Annual Salary Numeric
4 ABAD JR, VICENTE M CIVIL ENGINEER IV WATER MGMNT $104736.00 104736
18 ABDUL-KARIM, MUHAMMAD A ENGINEERING TECHNICIAN VI WATER MGMNT $106104.00 106104
20 ABDULLAH, KEVIN LIEUTENANT FIRE $110370.00 110370
28 ABOUELKHEIR, HASSAN A SENIOR PROGRAMMER/ANALYST DoIT $104736.00 104736
... ... ... ... ... ...
32128 ZOTTA, SANDINO MECHANICAL ENGINEER IV WATER MGMNT $104736.00 104736
32130 ZOVKO, RICHARD A HOISTING ENGINEER - MECHANIC GENERAL SERVICES $104208.00 104208
32158 ZUPAN, BILL M LIEUTENANT-EMT FIRE $106524.00 106524
32180 ZYSKOWSKI, DARIUSZ CHIEF DATA BASE ANALYST DoIT $110352.00 110352

3919 rows × 5 columns





















Logical operations on series

In [109]:
# Parentheses are essential for logical operations like below!
over100k_and_police = (df['Employee Annual Salary Numeric'] > 100000) & (df['Department'] == 'POLICE')
print(over100k_and_police)
0        False
1        False
2        False
3        False
         ...  
32177    False
32178    False
32179    False
32180    False
dtype: bool





















Indexing

In [103]:
df[over100k_and_police]
Out[103]:
Name Position Title Department Employee Annual Salary Employee Annual Salary Numeric
1 AARON, JEFFERY M POLICE OFFICER POLICE $80778.00 80778
2 AARON, KARINA POLICE OFFICER POLICE $80778.00 80778
9 ABBATE, TERRY M POLICE OFFICER POLICE $86520.00 86520
11 ABBOTT, LYNISE M CLERK III POLICE $43920.00 43920
... ... ... ... ... ...
32174 ZYDEK, BRYAN POLICE OFFICER POLICE $80778.00 80778
32177 ZYGOWICZ, PETER J POLICE OFFICER POLICE $86520.00 86520
32178 ZYMANTAS, MARK E POLICE OFFICER POLICE $83616.00 83616
32179 ZYRKOWSKI, CARLO E POLICE OFFICER POLICE $86520.00 86520

13570 rows × 5 columns