layout:true
Data Analysis with Python
-- class:center, middle # Data Analysis With Python - - - ## Instructors: Eric Brelsford and Reshama Shaikh ### Follow along at: http://bit.ly/data-analysis-python
See the code at: http://bit.ly/data-analysis-python-code ![img-center-50](images/datapolitan.png) --- class:center,middle # Welcome --- # Introduce Yourself to Your Neighbor + Who you are + Where you work + What you've done with code (any code) --- # What to Expect Today ??? + Facilitator sets key expectations for the class -- + Introduction to Python -- + Why use Python? -- + Using Python in Data Analysis -- + Getting Familiar: Python Syntax + Jupyter Notebook -- + 311 Data Analysis -- + Presentations! --- # What Not To Expect Today -- + Becoming a Python expert
-- (That takes much longer than we have) -- + Becoming a data analytics pro
-- (Also not possible in the time we have) -- + Becoming a visualization wizard
-- (Yep, you guessed it) --- # Also... -- + We don't have all the answers -- + You won't either when you do this yourself -- + Feel free to share if you know something -- + But pay attention to how we find the answers we're looking for -- + That's really how you'll learn --- # Housekeeping -- + We’ll have one 15-minute break in the morning and a 15-minute break in the afternoon -- + We'll have 1 hour for lunch -- + Class will start promptly after break -- + Feel free to use the bathroom if you need to during class -- + Please take any phone conversations into the hall so as not to disrupt the class ??? + Expectations and norms are established with the class --- # What is Analysis? ??? + Facilitator frames the class discussion with a basic definition of analysis leading into the need for tools + Emphasize the importance of why over the tools (i.e., don't have a solution in search of a problem) -- >“Analysis is simply the pursuit of understanding, usually through detailed inspection or comparison” ## - [Carter Hewgley](https://www.linkedin.com/in/carterhewgley), Senior Advisor for Family & Homeless Services, Department of Human Services, District of Columbia --- # But Sometimes We Have Challenges ![img-center-80](images/wrong_tool.jpg) Source: Justin Baeder, [link to original](https://www.flickr.com/photos/justinbaeder/5317820857) ??? + Facilitator emphasizes data isn't always easy to work with --- # Challenges We Face ??? + Facilitator ideally prompts participants to describe their challenges with data + Students identify key challenges and (if possible) how they see Python helping with those challenges -- + Data is hard to get (web scraping) -- + Data is too messy -- + Data is too large -- + Operations are too complex -- + Need to repeat (or automate) tasks -- + Need to document analytical steps ??? 
+ Establish key reasons for using a programming language --- # Getting Started + We're going to use a [Notebook](http://jupyter.org/) -- + A web application that runs code and displays visualizations and explanatory text -- + Can save notebooks and share them with others -- + Look in your packet for the weblink to start your own Notebook -- + Start a new Python 3 Notebook ![img-right-30](images/notebook_start_rev.png) --- # Using Jupyter Notebook ??? + Even if you're fluent with keyboard shortcuts, don't use them. Model the interface elements the students should use. -- ![img-right-45](images/notebook_run.png) + Type code into a cell and run the cell -- + You can also press `Shift` + `Enter` to run a cell -- ![img-right-40](images/notebook_output.png) + Any output will print below -- + You can type as much or as little code as you'd like -- + You can re-run the cell as many times as you'd like -- + Each notebook is a self-contained program (you can't share variables between notebooks) -- + Great for experimenting, prototyping, or simple scripts --- name:notebook-exercise # Your Turn + Change the code to print your name + Re-run to make sure it prints properly + Have fun -> you won't break anything + Here are [extra tips](#notebook-tips) to explore ??? + Facilitator provides a confidence-building exercise for students to become comfortable with the environment --- # What is Syntax? ??? + Facilitator prompts students to think about the role of grammar in spoken language as a bridge to understanding the importance of correct syntax in programming -- ![img-center-80](images/syntax.png) .caption[Image Credit: AnonMoos, Public Domain via [Wikipedia](https://commons.wikimedia.org/wiki/File%3ABasic_constituent_structure_analysis_English_sentence.svg)] --- # Python Syntax ??? 
+ Facilitator connects the grammar metaphor to the basic elements of Python syntax -- + [Variables](https://www.tutorialspoint.com/python/python_variable_types.htm) hold some value -- + You can think of them like the subject of a sentence -- + We create variables and assign a value using the `=` sign ```python apples = 1 # assign the value 1 to the variable `apples` ("apples is 1") pears = 2 # assign the value 2 to the variable `pears` ("pears is 2") ``` -- + We can perform operations with mathematical operators ```python # add the values of apples and pears, and assign to variable fruits # ("fruits is apples and pears") fruits = apples + pears ``` --- # Mathematical operators + `+` addition + `-` subtraction + `*` multiplication + `/` division --- # Your Turn + According to the [EPA](https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-typical-passenger-vehicle), a typical car emits 404 grams of CO₂ per mile driven + If someone drives 12 miles to work and 12 miles home, what are the typical emissions created for a work week? + What are the emissions for three work weeks? + What are the emissions in kilograms? + Remember to use descriptive names for your variables! --- # Functions ??? + Facilitator introduces pre-defined functions in Python + With classes experienced in Excel, references to Excel functions can be helpful -- + We can use built-in functions for operations ```python print(fruits) # print the value of the variable `fruits` ``` -- + This is a pre-defined operation (like `=SUM` in Excel) -- + Functions are like verbs describing an action -- + Unfortunately, there aren't many [built-in functions](https://docs.python.org/3/library/functions.html) in basic Python --- exclude:true # Benefits of Functions + Make your code more modular + Can write your own functions + For now, we'll use functions created by others + These are usually made available through packages --- # Packages -- + Provide a set of pre-constructed functions -- + Usually for some specific task
-- (accessing databases, creating charts, scraping the web) -- + If you need to do something, there's probably a package for it -- + This helps expand the core functionality of Python -- + Even some basic things require packages
-- (like [`math`](https://docs.python.org/3/library/math.html) and [`time`](https://docs.python.org/3/library/time.html)) -- + Import packages using the keyword [`import`](https://docs.python.org/3/tutorial/modules.html) --- # Important Packages for Exercises + [`pandas`](http://pandas.pydata.org/) provides data analytics functionality similar to a spreadsheet -- + [`matplotlib`](http://matplotlib.org/) provides data visualization functionality -- ```python import pandas as pd import matplotlib.pyplot as plt ``` -- + We alias them to make it easier to distinguish functions --- # Old Faithful Exercise - Data Download #### `http://training.datapolitan.com/data-analysis-python/data/faithful.csv` ![img-center-70](images/old_faithful.jpeg) .caption[Image Credit: Astroval1, [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0) via [Wikimedia Commmons](https://commons.wikimedia.org/wiki/File%3ABig_Dipper_Ursa_Major_over_Old_Faithful_geyser_Yellowstone_National_Park_Wyoming_Astrophotography.jpg)] --- # Analyzing the Data + Import the data ```python df = pd.read_csv('http://training.datapolitan.com/data-analysis-python/data/faithful.csv',index_col=0) ``` -- + Inspect the data ```python df.head() # Show the first 5 rows of data df.tail() # Show the last 5 rows of data ``` -- + Count the number of rows ```python df.count() # Count the number of non-null values in each column ``` --- # Analyzing the Data + Find the range of values ```python df.max() # Find the maximum value df.min() # Find the minimum value ``` -- + Find the mean (average) ```python df.mean() # Find the mean value of all non-null columns ``` -- + Find the median (middle) ```python df.median() # Find the median value ``` --- # Analyzing the Data + Or do it all together ```python df.describe() # Provide summary statistics for all columns ``` --- # Referencing Columns -- + To reference a particular column, you use the following syntax: -- ```python df['Column Name'] # Example of the syntax for referencing 
# a single column ``` -- + For example, we can get the eruptions column: -- ```python df['eruptions'] ``` -- + We can then call functions on just that column, such as `mean()`: -- ```python df['eruptions'].mean() # Calculate the mean of the eruptions column ``` --- # Visualizing the Data -- + To visualize the data, we first need to import the `matplotlib` package ```python import matplotlib.pyplot as plt # Imports the visualization package %matplotlib inline # This tells Jupyter to show images inside notebook cells # instead of in a separate window ``` -- + (You can look up all the `%magic` commands [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html?highlight=magic).) -- ![img-right-40](images/faithful_chart.png) + Then we can create a simple histogram using the `hist()` function ```python df.hist() # Create a histogram ``` --- # Visualizing the Data + We can create a simple scatter plot that shows the values plotted against each other ```python df.plot(kind='scatter', x='waiting', y='eruptions') # Create a scatter plot ``` ![img-center-60](images/faithful_scatter.png) --- # Breaking Down the Code ```python df.plot(kind='scatter', x='waiting', y='eruptions') # Create a scatter plot ``` -- + `df` is our DataFrame object (the noun) -- + We call the `plot` function (the verb) -- + We give Python three parameters telling it how to plot (adverbs that modify the verb) -- + "_Create a `scatter` plot with the parameters `x='waiting', y='eruptions'`_" -- + We'll experiment more with this later --- # Your Turn ??? 
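+ If participants ask what "changing an attribute" could look like, one possible sketch (made-up stand-in values, not the real faithful.csv; `color` and `title` are just two of `plot()`'s optional parameters):

```python
# Hypothetical variation for the facilitator to demo, on invented numbers
import matplotlib
matplotlib.use('Agg')  # draw off-screen so this also runs outside Jupyter
import pandas as pd

df = pd.DataFrame({
    'eruptions': [3.6, 1.8, 3.3, 2.3, 4.5],  # invented values
    'waiting': [79, 54, 74, 62, 85],          # invented values
})

# Same scatter plot as the slides, with two optional attributes changed
ax = df.plot(kind='scatter', x='waiting', y='eruptions',
             color='green', title='Old Faithful eruptions')
```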
+ Release exercise for participants to practice skills, with an emphasis on changing the given code and learning the content + Facilitator reviews what they changed and anything they learned before putting them on break -- + Starting from the code we just ran, experiment with altering it to get a different result -- + Look at the documentation for [`plot()`](http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.DataFrame.plot.html) and [visualizing in pandas](http://pandas.pydata.org/pandas-docs/version/0.19.2/visualization.html#visualization-scatter) for ideas -- + Find at least one attribute to change --- # 5 Data Analytics Tasks ??? + Facilitator introduces the 5 main tasks of analysis + The goal is to dispel the "black box" sense of analytics as something mystical and unknowable -- 1. Aggregating -- 2. Sorting -- 3. Filtering -- 4. Manipulating -- 5. Visualizing --- # Filtering + Sometimes you will want to reduce the number of rows you are working with -- + You can filter the rows in a DataFrame that match a condition -- + What do you think this would do? ```python df['eruptions'] > 5 ``` -- + Try it! --- # Filtering ```python df['eruptions'] > 5 ``` -- + Returns a `boolean` (`True`/`False`) value for each row -- + If `True`, the row matches. If `False`, it does not. -- + You don't usually write a condition like this on its own -- + Instead: ```python df[df['eruptions'] > 5] ``` --- # Filtering + The template for filtering is: ```python df[condition] ``` -- + The `condition` usually also uses `df` to compare a column to a value -- + So you will often see `df` twice: ```python df[df['eruptions'] > 5] ``` -- + We'll talk more about filtering this afternoon --- # Sorting + Helps you order data and see which rows are at the high and low ends of a range -- + We can sort by the `eruptions` column like this: ```python df.sort_values('eruptions') ``` -- + Try it out -- + What do you notice? 
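???
+ What they should notice: `sort_values` returns a new DataFrame ordered lowest to highest, carrying the original row labels along. A minimal sketch on made-up values (not the real dataset):

```python
import pandas as pd

df = pd.DataFrame({'eruptions': [3.6, 1.8, 4.5]})

# sort_values returns a new, sorted DataFrame; df itself is unchanged
df_sorted = df.sort_values('eruptions')

print(df_sorted['eruptions'].tolist())  # [1.8, 3.6, 4.5]
print(df_sorted.index.tolist())         # [1, 0, 2] -- original labels, reordered
```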
--- # Sorting -- + By default `sort_values` sorts the rows from lowest to highest -- + This is sometimes referred to as *ascending* order -- + You can flip the order and get the data from highest to lowest: ```python df.sort_values('eruptions', ascending=False) ``` --- class:center, middle # Wrap-up ## Tell us one thing you learned this morning --- name:morning-break class:center, middle # 15 Min Break ![img-center-50](images/xkcd_python.png) Source: http://xkcd.com/353/ --- name:python-intro ![img-right-40](https://www.python.org/static/community_logos/python-logo-master-v3-TM-flattened.png) # What is Python? ??? + Facilitator solicits basic description of Python from class + Allows facilitator to gauge level of understanding/awareness in the room + If all else fails, remind them they convinced someone to let them take the class. What did they tell them? -- + [Open-source](https://opensource.com/resources/what-open-source) programming language -- + Can run in standalone scripts or [full applications](http://www.hartmannsoftware.com/Blog/Articles_from_Software_Fans/Most-Famous-Software-Programs-Written-in-Python) -- + Can easily do data analysis and visualization -- + Can also do other programming tasks --- # Python vs. Excel ??? + Facilitator compares Python directly to Excel for context (assuming most participants are well-acquainted with Excel) -- + Python is a _programming language_ while Excel is an _application_ -- + Python can work with much larger datasets than Excel -- + Python can perform more complex operations than Excel -- + Python commands can be easily saved, re-run, and automated -- + Python doesn't have the icons, animations, and wizards of Excel --- exclude:true # Python Facts + Named after Monty Python's Flying Circus + Very flexible and highly readable + Can work on the command line and in scripts --- class:center,middle # What Can Be Done With Programming Languages Like Python? ??? 
+ Provide a real-world example of using a programming language to answer a need in city government --- # New Orleans Distributes Smoke Alarms ![img-center-40](images/neworleansfire.jpeg) .caption[Image Credit: Michael Barnett [CC BY-SA 2.5](http://creativecommons.org/licenses/by-sa/2.5), via Wikimedia Commons] ??? + New Orleans was able to dramatically improve the distribution of smoke alarms with data from the US Census Bureau and better using their own administrative data, brought together, cleaned, and analyzed in code (actually R) + [The code is made available online](https://github.com/enigma-io/smoke-signals-model) + For more information, see [this presentation](https://nola.gov/performance-and-accountability/reports/nolalytics-reports/full-report-on-analytics-informed-smoke-alarm-outr/) --- # Targeted Outreach Saves Lives ![img-center-90](images/nolasmokealarm.png) .caption[Image Credit: City of New Orleans, via [nola.gov](http://nola.gov/performance-and-accountability/nolalytics/files/full-report-on-analytics-informed-smoke-alarm-outr/)] ??? + Provide a real-world example of using a programming language to answer a need in city government --- # Targeted Outreach Saves Lives ![img-center-90](images/nolaimpact.png) .caption[Image Credit: City of New Orleans, via [nola.gov](http://nola.gov/performance-and-accountability/nolalytics/files/full-report-on-analytics-informed-smoke-alarm-outr/)] ??? 
+ Provide a real-world example of using a programming language to answer a need in city government --- class:center, middle # Let's load another dataset --- # 311 Data -- + Open the [`311_exercise.ipynb`](311_exercise.ipynb) notebook to follow along -- + This notebook loads the 311 data from `https://s3.amazonaws.com/datapolitan-training-files/311_Requests_Oct15_Nov20.csv` -- + Ignore the warning you get ([more on that warning](http://stackoverflow.com/a/27232309/1808021)) -- ![img-center-90](images/error.png) --- # 311 Data -- + In a moment you will inspect the data based on the techniques we used with the Old Faithful dataset -- + This [data dictionary](https://data.cityofnewyork.us/api/views/erm2-nwe9/files/68b25fbb-9d30-486a-a571-7115f54911cd?download=true&filename=311_SR_Data_Dictionary_2018.xlsx) explains each column --- # Summary Statistics on 311 Data -- + How many columns are in the data? -- + How many rows are in the data? -- + What is the time range of the data? -- + What is the result of running `describe()`? -- + Try creating a histogram -- + Work together and help each other out -- + **Stop** when you get to "Which borough has the most complaints?"—we haven't covered that yet! --- # Data Structures -- + Remember the `df` object? -- + It's called a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) -- + It's a basic data structure that stores our data -- + Data structures are a way of organizing data in a computer program -- + The method of organizing data differs across types, allowing for different applications -- + Python has [lots of different data structures](http://interactivepython.org/runestone/static/pythonds/index.html) that we aren't going to talk about ??? + Briefly introduce the concept of data structures + Demystify the idea for further study --- # Data Types -- + Run `df.dtypes` -- + What are the results? -- + What do you think they mean? ??? 
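+ A standalone illustration of dtypes that can be typed live if the 311 data is unavailable (invented values; whole numbers load as `int64`, numbers with decimal points as `float64`):

```python
import pandas as pd

# Columns of whole numbers get dtype int64; decimals get float64
df = pd.DataFrame({'eruptions': [3.6, 1.8], 'waiting': [79, 54]})
print(df.dtypes)
```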
+ Introduce the concept of data types --- # Data Types -- + Python has various containers that define the type of data -- + Excel has the same (think dates, numbers, text, percentages, etc) -- + In our example there are `int64` and `float64` -- + What do you think these are? ??? + Overview of data types in the context of Excel operations they should be familiar with + Start introduction with numeric types --- # Integers and Floats -- + Both are [numeric types](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) -- + Integers are whole numbers (decimals get dropped) -- + Floats include the decimal points ??? + Introduce data types and the different operations + Key reference: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex --- # Other Types of Interest -- + [Boolean (`True` and `False`)](http://www.pythonforbeginners.com/basics/boolean/) -- + [Strings](https://docs.python.org/3/library/stdtypes.html#string-methods) -- + [Datetime Objects](http://buddylindsey.com/python-date-and-datetime-objects-getting-to-know-them/) -- ## [And you can convert between types](https://www.digitalocean.com/community/tutorials/how-to-convert-data-types-in-python-3) -- ## [More on Python Data Types here](https://www.digitalocean.com/community/tutorials/understanding-data-types-in-python-3) ??? + Discuss other key data types + Provide links to additional information --- # Visualizing the 311 data -- + Histograms are a little disappointing for the 311 data -- ![img-center-50](images/311_hist.png) -- + Why? -- + What might be a better way to visualize this data? --- # Making simple maps -- + Have we done anything today that might get us close to *mapping* this data? -- + Specifically, these columns: ![img-center-30](images/311_coords.png) --- # Making simple maps -- + Why not use a scatterplot? 
-- ```python df.plot(kind='scatter', x='Longitude', y='Latitude') ``` -- ![img-center-50](images/311_map.png) --- # GeoPandas -- + If you wanted to map data other than points, you'll want to try [GeoPandas](https://geopandas.org/) -- + We won't get to it today, but we encourage you to try it --- # Today's Analysis ![img-center-90](images/derelictvan.png) .center[Derelict Vehicles Across NYC] ??? + Facilitator provides frame for class -> anecdotal to the global, is this a real problem? "Let's go to the data..." -- .center[But we're going to do some practice first] -- + What borough generates the most complaints? -- + What department gets the most complaints? -- + What does this look like? --- class:center, middle # Wrap-Up ## Tell us one thing you learned since the break --- name:lunch class:center,middle # Lunch ![img-center-60](images/xkcd_automation.png) Source: https://xkcd.com/1319/ --- class:center,middle # Welcome Back --- # Going Deeper into the Data -- + Which borough has the most complaints? -- ![img-center-60](images/311_borough.png) ??? + Students will conduct the same commands from Faithful with 311 exercise + Students will hit the roadblocks + Can't run summary statistics + Exercise will be run through script showing comments (not on slide) + Script will mirror the Faithful with intention of not working --- # Aggregating Data -- + One of our data analytics tasks -- + What is a PivotTable in Excel and when do you use it? -- ![img-center-100](images/pivot_table.png) -- + `groupby` works similarly for making summaries --- # Aggregating Data -- + Use `groupby` like this: ```python df.groupby(['Column you want to group by']) ``` -- + In this case ```python df.groupby(['Borough']) ``` -- + This makes a new, grouped, DataFrame. 
Let's put this in a variable: -- ```python df_by_borough = df.groupby(['Borough']) ``` --- # Aggregating Data -- + Then we can count the rows within each group -- + Select the column to count: -- ```python df_by_borough['Unique Key'] ``` -- + And call `count()` like usual: -- ```python df_by_borough['Unique Key'].count() ``` -- ![img-center-50](images/borough_count_table.png) --- # Aggregating Data -- + Let's put that in its own variable: -- ```python df_by_borough_count = df_by_borough['Unique Key'].count() ``` -- + How would we sort this? -- ```python df_by_borough_count.sort_values(ascending=False) ``` -- ![img-center-50](images/borough_count_table_sorted.png) --- # Aggregating Data -- + We can put the counted and sorted data in its own variable too: -- ```python df_sorted_by_borough_count = df_by_borough_count.sort_values(ascending=False) ``` -- + What if we want to plot the result? -- + We just add a `.plot()`: -- ```python df_sorted_by_borough_count.plot(kind='bar') ``` -- + Try it! 
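???
+ If the 311 download is unavailable, the same group/count/sort/plot steps can be rehearsed on a tiny made-up frame (the column names mirror the 311 data, but the rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'Unique Key': [1, 2, 3, 4, 5],
    'Borough': ['QUEENS', 'BRONX', 'QUEENS', 'BROOKLYN', 'QUEENS'],
})

df_by_borough = df.groupby(['Borough'])                    # group by borough
df_by_borough_count = df_by_borough['Unique Key'].count()  # count rows per group
df_sorted_by_borough_count = df_by_borough_count.sort_values(ascending=False)

print(df_sorted_by_borough_count.index[0])  # QUEENS
print(df_sorted_by_borough_count.iloc[0])   # 3
```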
--- # Aggregating Data ```python # Group by borough df_by_borough = df.groupby(['Borough']) # Count rows by group df_by_borough_count = df_by_borough['Unique Key'].count() # Sort by count df_sorted_by_borough_count = df_by_borough_count.sort_values(ascending=False) # Make a bar chart df_sorted_by_borough_count.plot(kind='bar') ``` --- name:function-chain # Function Chaining -- + An alternative to making variables -- + We can connect operations using a dot between functions -- + Python executes these from left to right (like we read) -- + We can group, count, then sort all in one line: ```python df.groupby(['Borough'])['Unique Key'].count().sort_values(ascending=False) ``` --- # Function Chaining ```python df.groupby(['Borough'])['Unique Key'].count().sort_values(ascending=False) ``` -- + We first run the `groupby()` function to aggregate the values -- + Then count the results using the `count()` function -- + Then sort the result by values using the `sort_values()` function --- # Plotting Service Requests by Borough + Using function chaining: ```python df.groupby(['Borough'])['Unique Key'].count()\ .sort_values(ascending=False)\ .plot(kind='bar') ``` -- + (The `\` mark allows us to break the code onto multiple lines) ??? + Specify bar colors with the `color` kwarg of `.plot()` + An example using a different format for each of the six bars: ```python color=['r', # short for 'red' 'chartreuse', # named color '#FFFF00', # hex RGB '0.75', # grayscale (0.8, 0, 0.8), # tuple RGB (0, 0, 1, 0.5)] # tuple RGBA ``` + Note that if this question does come up, start with the simplest options: ```python color='blue' # all one color color=['red', 'green', 'purple', 'orange', 'brown', 'yellow'] ``` --- # Function Chaining vs. Storing Intermediate Results -- + Is there a difference? When should you use one or the other? 
-- + Function Chaining is compact and saves memory, but hides the steps in between -- + Storing intermediate results can make clutter, but it's easier to debug -- + Our recommendation: Use both! -- + Chain functions up to the important points that you want to inspect or reuse --- # Your Turn + Pick a column to aggregate the 311 data on + Group, count, and sort the data with that column + Try plotting the result --- class:center,middle # More Filtering --- # Selecting Columns + Remember to select a column of data, we do this: ```python df['column1'] ``` -- + To select multiple columns, we do this: ```python df[['column1', 'column2', 'column3']] # select columns 1, 2, and 3 ``` -- + We can assign this subset to another dataframe ```python df2 = df[['column1', 'column2', 'column3']] ``` -- + Identify key columns and assign them to a new dataframe -- + Hint: Use `df.columns` to get exact column names ??? + Discuss how to subset the columns + Prepare to introduce data structures in Python --- # Selecting Rows (Filtering) + Remember filtering from the Old Faithful dataset? 
-- ```python df[df['eruptions'] > 5] ``` -- + Recall that `df['eruptions'] > 5` is the condition we are filtering on -- + We can use other numerical operators such as: + `>=` (greater than or equal to) + `<` (less than) + `<=` (less than or equal to) --- # Selecting Rows (Filtering) + We can also select rows where a column is exactly equal to some value -- ```python df[df['Borough'] == 'QUEENS'] ``` -- + `==` means "equivalent to" while `=` is for assigning a variable's value -- + We can also assign this subset to another dataframe ```python df_queens = df[df['Borough'] == 'QUEENS'] ``` --- # Selecting Rows (Filtering) + And we can filter on multiple columns ```python df[(df['Borough'] == 'QUEENS') & (df['Complaint Type'] == 'Rodent')] ``` -- + `&`: both conditions must be true -- + In this case, `Borough` must be `QUEENS`, `Complaint Type` must be `Rodent` -- + When using multiple conditions like this, wrap them in `()` -- + More abstractly: ```python df[(condition1) & (condition2)] ``` --- # Selecting Rows (Filtering) -- + You can also use `|`, which selects rows where *either* condition is true -- ```python df[(df['Borough'] == 'QUEENS') | (df['Borough'] == 'BRONX')] ``` -- + Pandas has [other ways](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method) to filter if you're curious! --- # Your Turn ### Filter the 311 data twice: 1. Find the complaints in Brooklyn 1. Find the complaints in your ZIP code --- # Selecting Rows + Filtering the 311 data for noise complaints: ```python df[df['Complaint Type'] == 'Noise'] ``` -- + We can count these ```python df[df['Complaint Type'] == 'Noise']['Unique Key'].count() # 6523 ``` -- ![img-right-40](images/noise_count_table_box.png) + But -- + How do we match these? 
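???
+ A minimal sketch of exact vs. fuzzy matching, on invented complaint types rather than real 311 rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Complaint Type': ['Noise', 'Noise - Street/Sidewalk', 'Rodent'],
})

exact = df[df['Complaint Type'] == 'Noise']             # only the first row
fuzzy = df[df['Complaint Type'].str.contains('Noise')]  # both Noise rows

print(len(exact), len(fuzzy))  # 1 2
```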
--- # Selecting Rows - Fuzzy Matching + To do fuzzy matching, we need to use a function ```python df[df['Complaint Type'].str.contains('Noise')] # fuzzy match on "Noise" ``` -- + `str` contains a set of useful [string-related functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary) -- + It has a function called `contains()` that will do a fuzzy match -- + We pass it the string we're looking for ('Noise') -- + We could also use the function [`startswith()`](https://www.tutorialspoint.com/python/string_startswith.htm) --- # Try It Out ![img-right-40](images/derelict_veh_complaints_box.png) + Filter your data for both categories of derelict vehicles + Count the number of complaints by borough + Count the number of complaints by community board ??? + Students should find that Queens has the most (2387) ```python df[df['Complaint Type'].str.contains('Derelict Vehicle')]\ .groupby(['Borough'])['Unique Key']\ .count() ``` + Students should find that 12 Queens has the most (447) ```python df[df['Complaint Type'].str.contains('Derelict Vehicle')]\ .groupby(['Community Board'])['Unique Key']\ .count().sort_values(ascending=False) ``` --- name:afternoon-break class:center,middle # 15 Min Break ![img-center-80](images/xkcd_here_to_help.png) Source: https://xkcd.com/1831/ --- # Final Exercise ??? + Facilitator prompts students to think of an interesting/engaging question + Students share them with the class and students are given the opportunity to pair up + Students can move around the classroom to work together in pairs or small groups + Facilitators and TA work with students to answer their questions with additional resources or skills as necessary, but learning should be student driven ("We've given you enough to be dangerous...") -- + Take a moment and think about what you've looked at -- + What questions are you interested in exploring? -- + Take a moment to write them down in your workbook --- name:final-exercise # Your Turn ??? 
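+ common Q: which ZIP code has the most complaints? Same pattern as the borough count; a sketch on invented rows (in the real 311 data the column should be `Incident Zip` — have students confirm with `df.columns`):

```python
import pandas as pd

# Stand-in frame; the real exercise uses the loaded 311 DataFrame instead
df = pd.DataFrame({
    'Unique Key': [1, 2, 3],
    'Incident Zip': ['11201', '11375', '11375'],
})

top_zips = df.groupby(['Incident Zip'])['Unique Key'].count()\
    .sort_values(ascending=False)
print(top_zips.index[0])  # 11375
```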
+ common Q: top five complaint types? ``` df.groupby(['Complaint Type']).count()['Unique Key'].sort_values(ascending=False).head() ``` + common Q: how to get hour of the day? ``` col = pd.to_datetime(df['Closed Date']) col.dt.hour ``` -- ![img-right-40](images/new_notebook.png) + Open a new Jupyter Notebook -- + Load the data and begin exploring your question -- + Try to find a story in the data (if you can) -- + But mostly just have fun -- + We're here to help -- + Here's [a list of functions](#pandas-reference) --- exclude:true # old exercise + filter for a different complaint type + Find the borough, ZIP code, or community board with the most complaints + Try grouping by another column (Location Type or Agency) --- # Debugging -- + Everyone gets errors all the time -- + It's just a matter of how complex they are
-- _And fixing them_ -- + **Syntax errors** -> using the wrong instructions -- + **Semantic errors** -> doing the wrong things -- + When in doubt, take a breath, try breaking things apart into smaller pieces, review the documentation, and search for help ??? + Students will be introduced to key concepts in identifying and resolving errors + This will be done with a lecture/discussion leading into an exercise + Class exercise finding errors in code -> slide with code snippets in Markdown with errors + deal with issue of correctness --- # Exercise -- + Switch seats with a neighbor -- + Walk through your neighbor's Python code -- + Do the steps make sense? Is it well documented? -- + If there are issues with the code, try to debug it and verify results ??? + Students will examine another student's code, run the code, and fix any errors + Students will have a better understanding of how to think in code + Goal is to get students talking to each other about their code + have documentation at end of slides --- class:center,middle #
Click to submit your work
---
class:middle,center
# Show and tell

???
+ Students will review select code examples
+ Goal is to model a collaborative process for data analysis
+ Time buffer for end of class

---
class:center,middle
# WRAP-UP

---
# What We've Covered
--
+ Basic Python syntax
--
+ Working in Jupyter
--
+ Opening a dataset
--
+ Exploring a dataset
--
+ Visualizing a dataset
--
+ What else?

---
# What We Haven't Covered
--
+ All the data structures in Python
--
+ More packages
--
and there are [a lot of packages](https://pypi.python.org/pypi/?)
--
+ How to be [Pythonic](http://stackoverflow.com/questions/25011078/what-does-pythonic-mean)
--
+ How to use APIs
--
+ So much more...

---
# Remember
--
+ Python is a powerful tool for cleaning, analyzing, and visualizing data
--
+ Integrating it into your workflow takes practice
--
+ Google is your friend
--
+ [Anaconda](https://www.anaconda.com/distribution/) makes it easy to get started (and you should be able to install it on your work computer)

---
# Key Links
--
+ [Download Python 3 (Anaconda distribution)](https://www.anaconda.com/distribution/)
--
+ [Download exercise files from this class](http://bit.ly/data-analysis-python-code)
--
+ [Jupyter Quick Start Guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/)

---
# Learning and Practicing More with Python
--
+ [Beginner's Python Tutorial](https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial) - A good way to get started with basic tasks
--
+ [Python for Data Analysis](https://www.oreilly.com/library/view/python-for-data/9781491957653/) - The textbook on using `pandas` for data analysis
--
+ [Real Python Tutorials](https://realpython.com/) - Great tutorials, videos, and mini-courses on Python (mix of free and paid content)
--
+ [Whirlwind Tour of Python](http://nbviewer.jupyter.org/github/jakevdp/WhirlwindTourOfPython/blob/master/Index.ipynb) - A fast-paced introduction to essential components of Python for those with a background in programming

---
# Other Useful Resources
--
+ [An A-Z of Useful Python Tricks](https://medium.freecodecamp.org/an-a-z-of-useful-python-tricks-b467524ee747) - Helpful tidbits to make things easier (and more fun)
--
+ [23 great pandas codes for Data Scientists](https://towardsdatascience.com/23-great-pandas-codes-for-data-scientists-cca5ed9d8a38) - Common tasks in `pandas` with code samples
--
+ [Stack Overflow](https://stackoverflow.com/questions) - One of the best Q&A sites for technology
--
+ [Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) - How to be "Pythonic"
--
+ [Class handout](workbook.pdf)
--
+ [Datapolitan Training Classes](https://www.datapolitan.com)

---
# Contact Information

## Eric Brelsford
+ Email: hi[at]ebrelsford[dot]com

## Reshama Shaikh
+ Email:

---
class:center, middle
# THANK YOU!

---
name:notebook-tips
# Jupyter Notebook Tips

(Back to [exercise](#notebook-exercise))

![img-right-45](images/notebook_ribbon.png)

+ Click on "Untitled" to rename the notebook
+ Add new cells with the "+"
+ Cut cells with the scissors
+ Rearrange cells with the up and down arrows
+ If a cell looks weird or doesn't do anything, make sure that it's a "code" cell
+ The circle in the upper right tells you if Python is running
+ Cells also get numbered after they run (e.g. `In [1]`)

---
name:pandas-reference
# Key `pandas` Functions (1/2)

(Back to [exercise](#final-exercise))

+ [`read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) - import file from CSV ([`read_excel()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html))
+ [`head()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) & [`tail()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) - first and last 5 rows of DataFrame
+ [`count()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) - count of non-null values in column
+ [`max()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html) & [`min()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html) - max and min values in column
+ [`mean()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) & [`median()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html) - mean and median values of numbers in column

---
# Key `pandas` Functions (2/2)

(Back to [exercise](#final-exercise))

+ [`describe()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) - summary statistics for DataFrame
+ [`hist()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html) - create a histogram of values
+ [`groupby()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) - group values together in DataFrame
+ [`sort_values()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) - sort by values
+ [`unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) - get unique values in a column
+ [`nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html) - get number of unique values in a column

---
exclude:true
class:middle,center
# Data structures

---
exclude:true
# What is a data structure?

+ A way of organizing data in a computer program
+ The method of organizing data differs across types, allowing for different applications
+ Important to use the right data structure for the proper task

---
exclude:true
class:center,middle
# Data structures in Python

---
exclude:true
# Lists

+ The most versatile data structure in Python
+ Can contain multiple data types
+ Can be sliced or added to
+ Are mutable (can be changed)
+ Access individual items by index

![img-center-40](images/list1.png)

---
exclude:true
# List functions

+ append(_object_) - add object to the end of the list
+ extend(_list_) - extend the original list with the elements of the specified list
+ insert(_index, object_) - insert object at specified index
+ pop(_[index]_) - remove and return the element at index (defaults to the last value in the list)
+ remove(_value_) - delete the first occurrence of value from the list (no return)

---
exclude:true
# List functions

+ reverse() - reverse the order of the list in place
+ sort() - sort the list
+ count(_value_) - count the number of occurrences of value
+ index(_value_) - find and return the index of the first occurrence of value

## For more information, see [the documentation](https://docs.python.org/2/library/functions.html#list)

---
exclude:true
# When to use a list

+ When order matters
+ When you can look up the value using a simple numerical index
+ When your data might be changed, removed, or extended
+ When your data doesn’t need to be unique

---
exclude:true
# Your Turn

+ Find the Location Type that has the most rodent complaints
+ We'll be around to help

---
exclude:true
# Exploratory Data Analysis

+ Goal -> Discover patterns in the data
+ Understand the context
+ Summarize fields
+ Use graphical representations of the data
+ Explore outliers

#### Tukey, J.W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley

---
exclude:true
# Sets

+ An unordered collection with no duplicate values
+ Created using the built-in “set” type
+ More limited in the types of objects that can be included (must be hashable, no lists)

![img-center-55](images/set1.png)

---
exclude:true
# When to use sets

+ When you only need unique values
+ When the data types you’re working with are relatively basic (hashable)
+ When your data changes
+ When you need to manipulate your sets mathematically (set supports operations like union, intersection, difference, etc.)

---
exclude:true
# Tuples

+ Immutable (unchangeable) data structure similar to a list
+ Can contain elements of different types (don’t need to be hashable)

![img-center-60](images/tuple1.png)

---
exclude:true
# When to use tuples

+ When your data doesn’t change
+ When performance is important (tuples provide better performance because of their immutability)

---
exclude:true
# Dictionaries

+ Stores data in key-value pairs
+ Provides lookup based on custom keys (instead of numerical indexes)
+ Keys must be unique

![img-center-100](images/dict1.png)

---
exclude:true
# When to use a dictionary

+ When you need to look up values by a custom key
+ When you need a fast way to look up values
+ When your data needs to be modified

---
exclude:true
# Things to remember with a dictionary

+ Key-value pairs aren’t necessarily stored in order (since Python 3.7, regular dictionaries preserve insertion order; use `collections.OrderedDict` on older versions if key order is important)
+ `collections.defaultdict` is a more flexible implementation for creating a dictionary and adding values

## For more information, [check this out](http://code.tutsplus.com/articles/advanced-python-data-structures--net-32748)

---
exclude:true
# Another example

## What are the inputs, operation, and output here?

---
exclude:true
class:middle,center
# We often write algorithms in functions

##

---
exclude:true
class:middle,center
# We often write algorithms in functions

## So what's a function?

---
exclude:true
# Functions

+ Block of organized, reusable code
+ Ideally perform a single action
+ Make your code more modular

---
exclude:true
# When do I write a function?

+ If you find yourself repeating the same code sequence, it’s time to write a function
+ [**D**on’t **R**epeat **Y**ourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
+ Helps make your code more readable and easier to maintain

---
exclude:true
# Function Example

+ Try this

```python
def add(x, y):
    return x + y
```

![img-center-70](images/add_algo0.png)

---
exclude:true
# Function Elements

![img-center-70](images/add_algo1.png)
![img-center-70](images/add_algo2.png)

---
exclude:true
# Function Elements

![img-center-70](images/add_algo1.png)
![img-center-70](images/add_algo3.png)

---
exclude:true
# Function Elements

![img-center-70](images/add_algo1.png)
![img-center-70](images/add_algo4.png)

---
exclude:true
# Function Elements

![img-center-70](images/add_algo1.png)
![img-center-70](images/add_algo5.png)

---
exclude:true
# Function Elements

![img-center-70](images/add_algo1.png)
![img-center-70](images/add_algo6.png)

---
exclude:true
# Function Elements

![img-center-70](images/add_algo1.png)
![img-center-70](images/add_algo7.png)
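---
name:pandas-example
# Key `pandas` Functions: A Quick Sketch

A minimal sketch of the key `pandas` functions from the reference slides, run on a tiny made-up DataFrame (the column names and values here are invented for illustration; they are not taken from the 311 dataset):

```python
import pandas as pd

# Tiny, invented stand-in for a 311-style dataset
df = pd.DataFrame({
    "complaint_type": ["Rodent", "Noise", "Rodent", "Rodent", "Noise"],
    "borough": ["BRONX", "QUEENS", "BRONX", "BROOKLYN", "BRONX"],
})

df.head()                             # first rows of the DataFrame
n_boroughs = df["borough"].nunique()  # number of unique boroughs: 3

# Count complaints per type, then sort from most to least common
counts = df.groupby("complaint_type")["complaint_type"].count()
counts = counts.sort_values(ascending=False)
```

After running this, `counts` puts "Rodent" first (3 complaints) and "Noise" second (2). The same pattern (`groupby` + `count` + `sort_values`) works unchanged on a DataFrame loaded with `read_csv()`.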
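---
exclude:true
# Data Structures: A Quick Sketch

A minimal example pulling the four data structures above together (the complaint values are invented for illustration):

```python
# list: ordered, mutable, indexable
complaints = ["Rodent", "Noise", "Rodent"]
complaints.append("Graffiti")      # lists can grow

# set: unique values only, no duplicates
unique_types = set(complaints)     # {"Rodent", "Noise", "Graffiti"}

# tuple: immutable, fixed-size record
point = (40.7128, -74.0060)        # (latitude, longitude)

# dict: lookup by custom key; here, counting occurrences
counts = {}
for c in complaints:
    counts[c] = counts.get(c, 0) + 1  # .get() defaults to 0 for unseen keys
```

`counts["Rodent"]` is 2 here; `collections.Counter(complaints)` does the same counting in one line.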