layout:true
Data Analysis with Python
-- class:center, middle # Data Analysis With Python - - - ## Instructors: Alfred Lee and Mark Yarish ### Follow along at: http://bit.ly/data-analysis-python
See the code at: http://bit.ly/data-analysis-python-code  --- class:center,middle # Welcome --- # Introduce Yourself to Your Neighbor + Who you are + Where you work + What you've done with code (any code) --- # What to Expect Today ??? + Facilitator sets key expectations for the class -- + Introduction to Python -- + Why use Python? -- + Using Python in Data Analysis -- + Getting Familiar: Python Syntax + Jupyter Notebook -- + 311 Data Analysis -- + Presentations! --- # What Not To Expect Today -- + Becoming a Python expert
-- (That takes much longer than we have) -- + Becoming a data analytics pro
-- (Also not possible in the time we have) -- + Becoming a visualization wizard
-- (Yep, you guessed it) --- # Also... -- + We don't have all the answers -- + You won't either when you do this yourself -- + Feel free to share if you know something -- + But pay attention to how we find the answers we're looking for -- + That's really how you'll learn --- # Housekeeping -- + We’ll have one 15 minute break in the morning and a 15 minute break in the afternoon -- + We'll have 1 hour for lunch -- + Class will start promptly after break -- + Feel free to use the bathroom if you need during class -- + Please take any phone conversations into the hall to not disrupt the class ??? + Expectations and norms are established with the class --- # What is Analysis? ??? + Facilitator frames the class discussion with basic definition of analysis leading into the need for tools + Emphasize the importance of why over the tools (ie don't have a solution in search of a problem) -- >“Analysis is simply the pursuit of understanding, usually through detailed inspection or comparison” ## - [Carter Hewgley](https://twitter.com/CarterHewgley), (Former) Director of Analytics, [Center for Government Excellence](https://govex.jhu.edu/) --- # But Sometimes We Have Challenges  Source: Justin Baeder, [link to original](https://www.flickr.com/photos/justinbaeder/5317820857) ??? + Facilitator emphasizes data isn't always easy to work with --- # Challenges We Face ??? + Facilitator ideally prompts participants to describe their challenges with data + Students identify key challenges and (if possible) how they see Python helping with those challenges -- + Data is hard to get (web scraping) -- + Data is too messy -- + Data is too large -- + Operation is too complex -- + Need to repeat (or automate) tasks -- + Need to document analytical steps ??? + Establish key reasons for using a programming language --- name:python-intro  # What is Python? ??? 
+ Facilitator solicits a basic description of Python from the class
+ Allows facilitator to gauge level of understanding/awareness in the room
+ If all else fails, remind them they convinced someone to let them take the class. What did they tell them?
--
+ [Open-source](https://opensource.com/resources/what-open-source) programming language
--
+ Can run in standalone scripts or [full applications](http://www.hartmannsoftware.com/Blog/Articles_from_Software_Fans/Most-Famous-Software-Programs-Written-in-Python)
--
+ Can easily do data analysis and visualization
--
+ Can also do other programming tasks
---
# Python vs. Excel
???
+ Facilitator compares Python directly to Excel for context (assuming most participants are well-acquainted with Excel)
--
+ Python is a _programming language_ while Excel is an _application_
--
+ Python can work with much larger datasets than Excel
--
+ Python can perform more complex operations than Excel
--
+ Python commands can be easily saved, re-run, and automated
--
+ Python doesn't have the icons, animations, and wizards of Excel
---
class:center,middle
# What Can Be Done With Programming Languages Like Python?
???
+ Provide a real-world example of using a programming language to answer a need in city government
---
# New Orleans Distributes Smoke Alarms
.caption[Image Credit: Michael Barnett [CC BY-SA 2.5](http://creativecommons.org/licenses/by-sa/2.5), via Wikimedia Commons]
???
+ New Orleans was able to dramatically improve the distribution of smoke alarms with data from the US Census Bureau and better using their own administrative data, brought together, cleaned, and analyzed in code (actually R) + [The code is made available online](https://github.com/enigma-io/smoke-signals-model) + For more information, see [this presentation](https://nola.gov/performance-and-accountability/reports/nolalytics-reports/full-report-on-analytics-informed-smoke-alarm-outr/) --- # Targeted Outreach Saves Lives  .caption[Image Credit: City of New Orleans, via [nola.gov](http://nola.gov/performance-and-accountability/nolalytics/files/full-report-on-analytics-informed-smoke-alarm-outr/)] ??? + Provide a real-world example of using a programming language to answer a need in city government --- # Targeted Outreach Saves Lives  .caption[Image Credit: City of New Orleans, via [nola.gov](http://nola.gov/performance-and-accountability/nolalytics/files/full-report-on-analytics-informed-smoke-alarm-outr/)] ??? + Provide a real-world example of using a programming language to answer a need in city government --- # Today's Analysis  .center[Derelict Vehicles Across NYC] ??? + Facilitator provides frame for class -> anecdotal to the global, is this a real problem? "Let's go to the data..." 
-- .center[But we're going to do some practice first] --- exclude:true # Python Facts + Comes from Monty Python and Flying Circus + Very flexible and highly readable + Can work on the commandline and in scripts --- # Getting Started + We're going to use a [Notebook](http://jupyter.org/) -- + Web application that runs code, as well as visualizations and explanatory text -- + Can save notebooks and share them with others -- + Look in your packet for the weblink to start your own Notebook --  + Start a new Python 3 Notebook --- # Using Jupyter Notebook --  + Type code into a block and run the block -- + You can also press `Shift` + `Enter` to run a block --  + The output (if any) will print below -- + You can type as much or as little code as you'd like -- + You can re-run the block as many times as you'd like -- + Each webpage is a self-contained program (you can't share variables between webpages) -- + Great for experimenting, prototyping, or simple scripts --- # Your Turn + Change the code to print your name + Re-run to make sure it prints properly + Have fun -> you won't break anything ??? + Facilitator provides a confidence building exercise for students to become comfortable with the environment --- # What is Syntax? ??? + Facilitator prompts students to think about the role of grammar in spoken language as a bridge to understanding the importance of correct syntax in programming --  .caption[Image Credit: AnonMoos, Public Domain via [Wikipedia](https://commons.wikimedia.org/wiki/File%3ABasic_constituent_structure_analysis_English_sentence.svg)] --- # Python Syntax ??? 
+ Facilitator connects the grammar metaphor to the basic elements of Python syntax
--
+ [Variables](https://www.tutorialspoint.com/python/python_variable_types.htm) hold some value
--
+ You can think of them like the subject of a sentence
--
+ We create variables and assign a value using the `=` sign
```python
a = 1 # assign the value 1 to the variable a ("a is 1")
b = 2 # assign the value 2 to the variable b ("b is 2")
```
--
+ We can perform operations with mathematical operators
```python
c = a + b # add the values of a and b, and assign the result to variable c
          # ("c is a plus b")
```
---
# Your Turn
+ Calculate the sum of `5` and `7`, assign the result to a variable, and print the result
+ If you want to see an interactive example of what's going on, try [the Python Tutor](http://pythontutor.com/live.html#mode=edit)
---
# Functions
???
+ Facilitator introduces pre-defined functions in Python
+ With classes experienced in Excel, references to Excel functions can be helpful
--
+ We can use built-in functions for operations
```python
print(c) # print the value of variable c
```
--
+ This is a pre-defined operation (like `=SUM` in Excel)
--
+ Functions are like verbs describing an action
--
+ Unfortunately, there aren't many [built-in functions](https://docs.python.org/3/library/functions.html) in basic Python
---
exclude:true
# Benefits of Functions
+ Make your code more modular
+ Can write your own functions
+ For now, we'll use functions created by others
+ These are usually made available through packages
---
# Packages
--
+ Provide a set of pre-constructed functions
--
+ Usually for some specific task
-- (accessing databases, creating charts, scraping the web) -- + If you need to do something, there's probably a package for it -- + This helps expand the core functionality of Python -- + Even some basic things require packages
-- (like [`math`](https://docs.python.org/3/library/math.html) and [`time`](https://docs.python.org/3/library/time.html))
--
+ Import packages using the keyword [`import`](https://docs.python.org/3/tutorial/modules.html)
---
# Important Packages for Exercises
+ [`pandas`](http://pandas.pydata.org/) provides data analytics functionality similar to a spreadsheet
--
+ [`matplotlib`](http://matplotlib.org/) provides data visualization functionality
--
```python
import pandas as pd
import matplotlib.pyplot as plt
```
--
+ We alias them (`pd` and `plt`) so their functions are quicker to type and easy to tell apart
---
# Old Faithful Exercise - Data Download
#### `http://training.datapolitan.com/data-analysis-python/data/faithful.csv`
.caption[Image Credit: Astroval1, [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0) via [Wikimedia Commons](https://commons.wikimedia.org/wiki/File%3ABig_Dipper_Ursa_Major_over_Old_Faithful_geyser_Yellowstone_National_Park_Wyoming_Astrophotography.jpg)]
---
# Analyzing the Data
+ Import the data
```python
df = pd.read_csv('http://training.datapolitan.com/data-analysis-python/data/faithful.csv',index_col=0)
```
--
+ Inspect the data
```python
df.head() # Show the first 5 rows of data
df.tail() # Show the last 5 rows of data
```
--
+ Count the number of rows
```python
df.count() # Count the number of non-null values in each column
```
---
# Analyzing the Data
+ Find the range of values
```python
df.max() # Find the maximum value in each column
df.min() # Find the minimum value in each column
```
--
+ Find the mean (average)
```python
df.mean() # Find the mean of each numeric column (null values are skipped)
```
--
+ Find the median (middle)
```python
df.median() # Find the median value of each numeric column
```
--
+ Or do it all together
```python
df.describe() # Provide summary statistics for each numeric column
```
---
# Referencing Columns
--
+ To reference a particular column, you use the following syntax:
--
```python
df['Column Name'] # Example of the syntax for referencing a single column
```
--
+ We can then call functions on just
that column
--
+ For example, to just get the mean for `eruptions`
```python
df['eruptions'].mean() # Calculate the mean of the eruptions column
```
---
# Visualizing the Data
--
+ To visualize the data, we need to first import the `matplotlib` package
```python
import matplotlib.pyplot as plt # Imports the visualization package
# Tell Jupyter to show plots in the webpage (the comment sits on its own line so the magic command still runs)
%matplotlib inline
```
--
+ Then we can create a simple histogram using the `hist()` function
```python
df.hist() # Create a histogram
```
---
# Visualizing the Data
+ We can create a simple scatter plot that shows the values plotted against each other
```python
df.plot(kind='scatter',x='waiting',y='eruptions') # Create a scatter plot
```
---
# Breaking Down the Code
```python
df.plot(kind='scatter',x='waiting',y='eruptions') # Create a scatter plot
```
--
+ `df` is our dataframe object (the noun)
--
+ We call the `plot` function (the verb)
--
+ We give Python three parameters telling it how to plot (adverbs that modify the verb)
--
+ "_Create a `scatter` plot with the parameters `x='waiting',y='eruptions'`_"
--
+ We'll experiment more with this later
---
# Your Turn
???
+ Release exercise for participants to practice skills, with an emphasis on changing the given code and learning the content
+ Facilitator reviews what they changed and anything they learned before putting them on break
--
+ Starting from the code we just ran, experiment with changing it to get a different result
--
+ Look at the documentation for [`plot()`](http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.DataFrame.plot.html) and [visualizing in pandas](http://pandas.pydata.org/pandas-docs/version/0.19.2/visualization.html#visualization-scatter) for ideas
--
+ Find at least one attribute to change
---
name:morning-break
class:center, middle
# 15 Min Break
Source: http://xkcd.com/353/
---
class:center,middle
# Just Some Quick Points
---
# Data Structures
--
+ Remember the `df` object?
-- + It's called a [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) -- + It's a basic data structure that stores our data -- + Data structures are a way of organizing data in a computer program -- + The method of organizing data differs across types, allowing for different applications -- + Python has [lots of different data structures](http://interactivepython.org/runestone/static/pythonds/index.html) that we aren't going to talk about ??? + Briefly introduce the concept of data structures + Demystify the idea for further study --- # 311 Data Exercise + Load the data from `https://s3.amazonaws.com/datapolitan-training-files/311_Requests_Oct15_Nov20.csv` + Don't add the `index_col=0` that we had before + Ignore the warning you get ([more on that warning](http://stackoverflow.com/a/27232309/1808021))  + Inspect the data based on the techniques we just used + You can use the [`311_exercise.ipynb`](code\311_exercise.ipynb) file as we go along --- # Summary Statistics on 311 Data -- + How many columns are in the data? -- + How many rows are in the data? -- + What is the time range of the data? -- + What is the result of running `describe()`? -- + Try creating a histogram -- + Work together and help each other out --- class:center, middle # Wrap-Up --- name:lunch class:center,middle # Lunch  Source: https://xkcd.com/1319/ --- class:center,middle # Welcome Back --- # Data Types -- + Run `df.dtypes` -- + What are the results? -- + What do you think they mean? ??? + Introduce the concept of data types --- # Data Types -- + Python has various containers that define the type of data -- + Excel has the same (think dates, numbers, text, percentages, etc) -- + In our example there are `int64` and `float64` -- + What do you think these are? ??? 
+ Overview of data types in the context of Excel operations they should be familiar with + Start introduction with numeric types --- # Integers and Floats -- + Both are [numeric types](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) -- + Integers are whole numbers (decimals get dropped) -- + Floats include the decimal points ??? + Introduce data types and the different operations + Key reference: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex --- # Other Types of Interest -- + [Boolean (`True` and `False`)](http://www.pythonforbeginners.com/basics/boolean/) -- + [Strings](https://docs.python.org/3/library/stdtypes.html#string-methods) -- + [Datetime Objects](http://buddylindsey.com/python-date-and-datetime-objects-getting-to-know-them/) -- ## [And you can convert between types](https://www.digitalocean.com/community/tutorials/how-to-convert-data-types-in-python-3) -- ## [More on Python Data Types here](https://www.digitalocean.com/community/tutorials/understanding-data-types-in-python-3) ??? + Discuss other key data types + Provide links to additional information --- class:center,middle # Algorithms --- # What is an Algorithm? -- + Step by step instructions for any task -- + Inputs = Ingredients -- + Operation = Baking Instructions -- + Output = Dessert -- + Each step in the operation must be clearly defined and work the same with any input ??? + Emphasize algorithm as process + Demystify and empower students --- # [Example](https://www.linkedin.com/pulse/way-algorithm-claudia-perlich)  --- .red[# Inputs]  --- .red[# Operation]  --- class:middle .red[# Output]  ##### Credit: Flickr user [Patent and the Pantry](https://www.flickr.com/photos/26412869@N03/). Image used under a [Creative Commons BY-NC-ND 2.0 license](https://creativecommons.org/licenses/by-nc-nd/2.0/legalcode) ??? 
+ define algorithm as instructions for cake + Lecture with images --- # Data Structures and Algorithms  --- class:center,middle # That's as technical as we'll be getting today
(I promise)  Source: https://xkcd.com/1667/ --- class:center,middle # Let's Get Back to the Data --- # Summary Statistics on 311 Data + How many columns are in the data? + How many rows are in the data? + What is the time range of the data? + What is the result of running `describe()`? + Try creating a histogram + Work together and help each other out --- # Going Deeper into the Data + Which borough has the most complaints? --  ??? + Students will conduct the same commands from Faithful with 311 exercise + Students will hit the roadblocks + Can't run summary statistics + Exercise will be run through script showing comments (not on slide) + Script will mirror the Faithful with intention of not working --- # Aggregating Data -- + What is a PivotTable in Excel and when do you use it? -- + `GroupBy` is the same for making summaries -- ```python df.groupby('Column you want to group')['Column you want to count'].count() ``` -- + In this case ```python df.groupby('Borough')['Unique Key'].count() ``` --  --- # Aggregating Data + To sort the list ```python df.groupby('Borough')['Unique Key'].count().sort_values(ascending=False) ``` --  -- + This is called **function chaining** --- # Function Chaining -- + We can string operations together using the [dot method](http://reeborg.ca/docs/oop_py_en/oop.html) -- + This means we can chain operations using a simple dot between operations -- + Python executes these from left to right (like we read) -- + Example: ```python df.groupby('Borough')['Unique Key'].count().sort_values(ascending=False) ``` --- # Function Chaining ```python df.groupby('Borough')['Unique Key'].count().sort_values(ascending=False) ``` -- + We first run the `groupby()` function to aggregate the values -- + Then the results using the `count()` function -- + Then sort the result by values using the `sort_values()` function -- + What if we want to plot the result? 
-- + We just add a `.plot()` (with some parameters) -- + Try it yourself --- # Plotting Service Requests by Borough ```python df.groupby('Borough')['Unique Key']\ .count()\ .sort_values(ascending=False)\ .plot(kind='bar',ylim=(0,75000),\ title='Count of NYC 311 Service Requests by Borough') ``` ## The `\` mark just lets me break the code lines so it'll fit --- class:center,middle # Filtering and Sorting --- # Selecting Columns + Remember to select a column of data, we do this: ```python df['column1'] ``` -- + To select multiple columns, we do this: ```python df[['column1','column2','column3']] ``` -- + We can assign this subset to another dataframe ```python df2 = df[['column1','column2','column3']] ``` -- + Identify key columns and assign them to a new dataframe -- + Hint: Use `df.columns` to get exact column names ??? + Discuss how to subset the columns + Prepare to introduce data structures in Python --- # Selecting Rows ```python df[df['Column Name'] == 'Value'] ``` -- + `df['Column Name'] == 'Value'` is the test we use to filter on -- + `==` means "equivalent to" -- + `=` is for assigning value -- + We can also assign this subset to another dataframe ```python df3 = df[df['Column Name'] == 'Value'] ``` --- # Your Turn + Find all the complaints in Brooklyn + After that, find complaints in your ZIP code --- # Selecting Rows + For example ```python df[df['Complaint Type']=='Noise'] ``` -- + This will filter for Complaint Type "Noise" -- + We can count these ```python df[df['Complaint Type']=='Noise']['Unique Key'].count() # 6523 ``` --  + But -- + How do we match these? 
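---
# Selecting Rows - Exact Match
+ A minimal sketch of what the `==` test does, assuming a tiny made-up table (the `toy` DataFrame and its four rows are hypothetical, not records from the 311 file)

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 file (not real records)
toy = pd.DataFrame({
    'Unique Key': [1, 2, 3, 4],
    'Complaint Type': ['Noise', 'Noise - Residential', 'Rodent', 'Noise - Vehicle'],
})

# Exact match: keeps only rows whose Complaint Type is exactly 'Noise'
exact = toy[toy['Complaint Type'] == 'Noise']
exact_count = exact['Unique Key'].count()

# Counting every complaint type lists the labels the exact match left out
type_counts = toy['Complaint Type'].value_counts()

print(exact_count)
print(type_counts)
```

+ In this toy table the exact test keeps a single row, even though three labels mention noise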
--- # Selecting Rows - Fuzzy Matching + To do a fuzzy matching, we need to chain another function ```python df[df['Complaint Type'].str.contains('Noise')] ``` -- + `str` is the string [object in Python](https://jeffknupp.com/blog/2014/06/18/improve-your-python-python-classes-and-object-oriented-programming/) -- + It has a function called `contains()` that will do a fuzzy match -- + We pass it the parameter we're looking for 'Noise' -- + We could also use the function [`startswith()`](https://www.tutorialspoint.com/python/string_startswith.htm) --- # Try It Out  + Filter your data for both categories of derelict vehicles + Count the number of complaints by borough + Count the number of complaints by community district ??? + Students should find that Queens has the most (2387) ```python df[df['Complaint Type'].str.contains('Derelict Vehicle')]\ .groupby('Borough')['Unique Key']\ .count() ``` + Students should find that 12 Queens has the most (447) ```python df[df['Complaint Type'].str.contains('Derelict Vehicle')]\ .groupby('Community Board')['Unique Key']\ .count().sort_values(ascending=False) ``` --- name:afternoon-break class:center,middle # 15 Min Break  Source: https://xkcd.com/1831/ --- # Final Exercise ??? + Facilitator prompts students to think of an interesting/engaging question + Students share them with the class and students are given the opportunity to pair up + Students can move around the classroom to work together in pairs or small groups + Facilitators and TA work with students to answer their questions with additional resources or skills as necessary, but learning should be student driven ("We've given you enough to be dangerous...") -- + Take a moment and think about what you've looked at -- + What questions are you interested in exploring? 
-- + Take a moment to write them down in your workbook --- name:final-exercise # Your Turn --  + Open a new Jupyter Notebook -- + Load the data and begin exploring your question -- + Try to find a story in the data (if you can) -- + But mostly just have fun -- + We're here to help -- + Here's [a list of functions](#pandas-reference) --- exclude:true # old exercise + filter for a different complaint type + Find the borough, ZIP code, or community board with the most complaints + Try grouping by another column (Location Type or Agency) --- # Debugging -- + Everyone makes mistakes -- + It's just a matter of how serious they are
-- _And fixing them_ -- + **Semantic errors** -> doing the wrong things -- + **Syntax errors** -> using the wrong instructions -- + When in doubt, review the documentation ??? + Students will be introduced to key concepts in identifying and resolving errors + This will be done with a lecture/discussion leading into an exercise + Class exercise finding errors in code -> slide with code snippets in Markdown with errors + deal with issue of correctness --- # Exercise + Debug your neighbor's Python code and verify results ??? + Students will examine another student's code, run the code, and fix any errors + Students will have a better understanding of how to think in code + Goal is to get students talking to each other about their code + have documentation at end of slides --- class:center,middle #
Click to submit your work
--- class:middle,center # Code Review ??? + Students will review select code examples + Goal is to model a collaborative process for data analysis + Time buffer for end of class --- name:pandas-reference # Key `pandas` Functions (Back to [exercise](#final-exercise)) + [`read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) - import file from CSV ([`read_excel()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html)) + [`head()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) & [`tail()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) - first and last 5 rows of DataFrame + [`count()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) - count of all rows in column + [`max()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html) & [`min()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html) - max and min values in column + [`mean()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) & [`median()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html) - mean and median values of numbers in column + [`describe()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) - summary statistics for DataFrame + [`hist()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html) - create a histogram of values + [`groupby()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) - group values together in DataFrame + [`sort_values()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) - sort by values --- class:center,middle # WRAP-UP --- # What We've Covered -- + Basic Python syntax -- + Working in Jupyter -- + Opening a dataset -- + Exploring a dataset -- + Visualizing a 
dataset -- + What else? --- # What We Haven't Covered -- + All the data structures in Python -- + Deep dive into algorithms -- + More Packages -- and there are [a lot of packages](https://pypi.python.org/pypi/?) -- even [ones for working with spatial data](http://geopandas.org/) -- + How to be [Pythonic](http://stackoverflow.com/questions/25011078/what-does-pythonic-mean) -- + How to use APIs -- + So much more... --- # Remember -- + Python is a powerful tool for cleaning, analyzing, and visualizing data -- + Integrating it into your workflow takes practice and a commitment to not giving up (Google is your friend) -- + Distributions like [Anaconda](https://www.continuum.io/downloads) make it easy to get started (and you should be able to install it on your work computer) -- + It's best if you just start off with Python 3 (what we've been using) -- + For more on the difference, [check this out](https://www.digitalocean.com/community/tutorials/python-2-vs-python-3-practical-considerations-2) --- # To Learn More -- + [Python for Data Analysis](http://shop.oreilly.com/product/0636920050896.do) - The textbook on using `pandas` for data analysis -- + [Beginner's Python Tutorial](https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial) - A good way to get started with basic tasks -- + [Top 20 Python Libraries for Data Science in 2018](https://www.datasciencecentral.com/profiles/blogs/top-20-python-libraries-for-data-science-in-2018) -- + [An A-Z of Useful Python Tricks](https://medium.freecodecamp.org/an-a-z-of-useful-python-tricks-b467524ee747) - Helpful tidbits to make things easier (and more fun, like printing emojis in Python) -- + [23 great pandas codes for Data Scientists](https://towardsdatascience.com/23-great-pandas-codes-for-data-scientists-cca5ed9d8a38) - Common tasks in `pandas` with code samples --- # Other Resources -- + [Stack Overflow](http://stackoverflow.com/) - One of the best Q&A sites for technical questions of all kinds -- + [Style Guide for 
Python Code](https://www.python.org/dev/peps/pep-0008/) - How to be "Pythonic"
--
+ [Class handout](workbook.pdf)
--
+ [Datapolitan Training Classes](http://training.datapolitan.com)
---
# Contact Information
## Alfred Lee
+ LinkedIn: https://www.linkedin.com/in/leealfred/
+ Twitter: [@Alphrabet](https://twitter.com/Alphrabet)
## Mark Yarish
+ Email: mark[dot]yarish[at]gmail[dot]com
---
class:center, middle
# THANK YOU!
---
exclude:true
class:middle,center
# Data structures
---
exclude:true
# What is a data structure?
+ A way of organizing data in a computer program
+ The method of organizing data differs across types, allowing for different applications
+ Important to use the right data structure for the proper task
---
exclude:true
class:center,middle
# Data structures in Python
---
exclude:true
# Lists
+ The most versatile data structure in Python
+ Can contain multiple data types
+ Can be sliced or added to
+ Are mutable (can be changed)
+ Access individual items by index
---
exclude:true
# List functions
+ append(_object_) - add object to the end of the list
+ extend(_list_) - extend the original list with the elements of the specified list
+ insert(_index, object_) - insert object at the specified index
+ pop(_[index]_) - remove the element at index and return it (defaults to the last value in the list)
+ remove(_value_) - delete the first occurrence of value from the list (no return)
---
exclude:true
# List functions
+ reverse() - reverse the order of the list in place
+ sort() - sort the list in place
+ count(_value_) - count the number of occurrences of value
+ index(_value_) - return the index of the first occurrence of value
## For more information, see [the documentation](https://docs.python.org/2/library/functions.html#list)
---
exclude:true
# When to use a list
+ When order matters
+ When you can look up the value using a simple numerical index
+ When your data might be changed, removed, or extended
+ When your data doesn’t need to be unique
---
exclude:true
# Your Turn
+ Find the Location Type that has the most rodent
complaints
+ We'll be around to help
---
exclude:true
# Exploratory Data Analysis
+ Goal -> Discover patterns in the data
+ Understand the context
+ Summarize fields
+ Use graphical representations of the data
+ Explore outliers
#### Tukey, J.W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley
---
exclude:true
# Sets
+ An unordered collection with no duplicate values
+ Created using the built-in `set()` function
+ More limited in the types of objects that can be included (must be hashable, no lists)
---
exclude:true
# When to use sets
+ When you only need unique values
+ When the data types you’re working with are relatively basic (hashable)
+ When your data changes
+ When you need to manipulate your sets mathematically (set supports operations like union, intersection, difference, etc)
---
exclude:true
# Tuples
+ Immutable (unchangeable) data structure similar to a list
+ Can contain elements of different types (don’t need to be hashable)
---
exclude:true
# When to use tuples
+ When your data doesn’t change
+ When performance is important (tuples provide better performance because of their immutability)
---
exclude:true
# Dictionaries
+ Stores data in key-value pairs
+ Provides lookup based on custom keys (instead of numerical indexes)
+ Keys must be unique
---
exclude:true
# When to use a dictionary
+ When you need to look up values by a custom key
+ When you need a fast way to look up values
+ When your data needs to be modified
---
exclude:true
# Things to remember with a dictionary
+ Since Python 3.7, dictionaries keep key-value pairs in insertion order (in older versions, use `collections.OrderedDict` if key order is important)
+ `collections.defaultdict` supplies a default value for missing keys, which makes it easier to build up a dictionary as you add values
## For more information, [check this out](http://code.tutsplus.com/articles/advanced-python-data-structures--net-32748)
---
exclude:true
# Another example
## What are the inputs, operation, and output here?
---
exclude:true
class:middle,center
# We often write algorithms in functions
##
---
exclude:true
class:middle,center
# We often write algorithms in functions
## So what's a function?
---
exclude:true
# Functions
+ Block of organized, reusable code
+ Ideally perform a single action
+ Make your code more modular
---
exclude:true
# When do I write a function?
+ If you find yourself repeating the same code sequence, it’s time to write a function
+ [**D**on’t **R**epeat **Y**ourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
+ Helps make your code more readable and easier to maintain
---
exclude:true
# Function Example
+ Try this
```python
def add(x,y):
    return x + y
```
---
exclude:true
# Function Elements
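---
exclude:true
# Function Example - Calling It
+ A minimal sketch of defining and then calling the `add` function from the earlier example; the inputs `5` and `7` and the variable name `total` are assumptions chosen for illustration

```python
def add(x, y):
    """Return the sum of the two inputs."""
    return x + y

# Inputs go in as arguments; the output comes back as the return value
total = add(5, 7)
print(total)
```

+ Here `x` and `y` are the inputs, `x + y` is the operation, and the returned value is the output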