Kaggle Courses - Pandas Course

On to the next Kaggle course: Pandas. I've worked a good amount with Pandas, probably more than I have with pure Python. Since Pandas is used to build the DataFrames that feed machine learning models, I am going through this course as a refresher on Pandas basics before moving on to the Data Visualization and Machine Learning courses. I am disappointed in the order in which Kaggle presents its courses: I believe in the "build upon what you know" learning style, and I am not sure the current ordering follows it.

Thoughts

"Creating, Reading and Writing" Pandas

This is a quick-and-dirty introduction to Pandas. It focuses on the basics of what a table is and how one can be created from a dictionary of the form {"column name": [list of values]}, which becomes a DataFrame in which each key is a column and each list supplies that column's row values. It also introduces the Pandas Series, describes a DataFrame as a combination of Series, and introduces indexing. The lesson teaches how to read a CSV into a DataFrame, and the exercise teaches how to write a DataFrame to a CSV.
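To make the dictionary-to-DataFrame idea concrete, here is a minimal sketch. The fruit sales data and the variable names are made up for illustration, and an in-memory buffer stands in for a real CSV file path.

```python
import io

import pandas as pd

# Hypothetical data: each dict key becomes a column,
# and each list holds that column's row values.
fruits = pd.DataFrame(
    {"apples": [30, 41], "bananas": [21, 34]},
    index=["2017 sales", "2018 sales"],
)

# A Series is a single labeled column; a DataFrame behaves
# like several Series that share one index.
apples = fruits["apples"]

# Write the DataFrame to CSV and read it back.
# io.StringIO is used here instead of a file on disk.
buffer = io.StringIO()
fruits.to_csv(buffer)
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col=0)
```

With a real file, the same calls would take a path instead of the buffer, for example fruits.to_csv("fruits.csv").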

"Indexing, Selecting & Assigning"

The lesson teaches how to grab specific data based on indices and columns. I really like how it teaches the loc and iloc indexers and the differences between them, such as iloc using the standard exclusive range while loc is inclusive (so iloc[0:1000] returns rows 0-999, but loc[0:1000] returns rows 0-1000). It also teaches conditional selection of data from a DataFrame; I did not know about the df.column_name.isin(['list of values']) method, so it was new and exciting to learn how to select rows whose column value is "in" a list of values! This lesson also covers assigning values, but does not cover the SettingWithCopyWarning you may see when assigning through chained indexing on a filtered DataFrame, such as df[df['column name'] == 'old value']['column name'] = 'new value', rather than through a single call like df.loc[df['column name'] == 'old value', 'column name'] = 'new value'.
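The loc versus iloc difference is easiest to see on a tiny table. This is a sketch with made-up data and column names, assuming the default 0-based integer index so that labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical data with the default integer index 0..4.
df = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US", "France"],
    "points": [87, 90, 92, 85, 88],
})

# iloc is position-based and excludes the end of the range: rows 0 and 1.
by_position = df.iloc[0:2]

# loc is label-based and includes the end label: rows 0, 1 and 2.
by_label = df.loc[0:2]

# isin keeps the rows whose column value appears in the given list.
italy_or_france = df[df["country"].isin(["Italy", "France"])]

# Assign to a subset of rows in one .loc call (row mask, column label)
# rather than chaining two indexing operations.
df.loc[df["country"] == "US", "points"] = 86
```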

"Summary Functions and Maps"

A quick lesson on the df.describe() method and a couple of other column summary methods. This lesson also succinctly teaches how to use the df.column.map() and df.apply() methods and how they work together, as well as how to apply operations between columns. The exercise introduces the idxmax() method; I will want to learn more about it, as the exercise does not do a great job of explaining how it works, but it seems extremely useful. The exercises also do a great job of providing challenging but basic problems on how to use the map and apply methods.
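Here is a small sketch of the methods from this lesson side by side, using made-up data. As far as I can tell, idxmax() returns the index label of the largest value, which is what makes it handy for looking up the matching row.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "points": [80, 90, 100],
    "price": [10.0, 30.0, 25.0],
})

# describe() summarizes one column (count, mean, min, max, ...).
summary = df["points"].describe()

# map() transforms a Series one value at a time.
mean_points = df["points"].mean()
centered = df["points"].map(lambda p: p - mean_points)

# apply() with axis="columns" calls the function once per row.
ratio_by_apply = df.apply(lambda row: row["points"] / row["price"], axis="columns")

# The same result using a direct operation between columns.
ratio = df["points"] / df["price"]

# idxmax() gives the index label of the maximum value,
# which can then be passed to loc to fetch that row's title.
best_label = ratio.idxmax()
best_title = df.loc[best_label, "title"]
```

The column operation is usually the faster choice; map and apply are there for logic that cannot be written as simple arithmetic on columns.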

"Grouping and Sorting"

This is a nice, small lesson on grouping and multi-indexing, as well as some sorting operations. I like that it continues to show how the apply method can be used; in this lesson it acts as an aggregate function and/or a filter. The introduction to multi-indexing was also informative. In my own work I had never used multi-indexing until I saw a coworker use it, simply because I did not know it existed, so learning about it was good, because you can't use what you don't know exists.
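A minimal sketch of grouping, multi-indexing, and sorting, again with made-up data. It shows apply only in its aggregate role; the column names and values are my own assumptions, not the course's dataset.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "US"],
    "province": ["Tuscany", "Sicily", "Bordeaux", "Bordeaux", "California"],
    "points": [87, 92, 90, 88, 85],
})

# Several aggregates per group at once.
stats = df.groupby("country")["points"].agg(["count", "min", "max"])

# apply() as a custom aggregate: the point spread within each country.
spread = df.groupby("country")["points"].apply(lambda s: s.max() - s.min())

# Grouping by two columns produces a MultiIndex on the result.
counts = df.groupby(["country", "province"])["points"].agg(["count", "max"])
has_multi_index = isinstance(counts.index, pd.MultiIndex)

# reset_index() flattens the MultiIndex back into ordinary columns,
# and sort_values() orders the rows by a column.
flat = counts.reset_index()
ranked = flat.sort_values(by="count", ascending=False)
```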

"Data Types and Missing Values"

Not much to add here. Teaches basic isnull(), fillna(), and replace() methods. Important but very simple concepts here.
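For reference, the three methods look like this in practice. The data and the replacement values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
df = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "price": [10.0, np.nan, 25.0],
    "reviewer": ["old_name", "other", "old_name"],
})

# isnull() builds a boolean mask that selects the rows with a missing price.
missing_price = df[df["price"].isnull()]

# fillna() replaces missing values, here with a sentinel string
# and with the column mean.
df["country"] = df["country"].fillna("Unknown")
df["price"] = df["price"].fillna(df["price"].mean())

# replace() swaps one non-missing value for another.
df["reviewer"] = df["reviewer"].replace("old_name", "new_name")
```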

"Renaming and Combining"

The lesson teaches how to use rename() on columns and on specific index/row labels, and how to use the rename_axis() method to give a name to the row and column axes. I am disappointed by this lesson. It briefly covers the concat() function, which appends the data from one DataFrame to the end of another along a specified axis. The lesson also mentions the merge() method but does not tell us how it functions. It briefly shows how to use the join() method, though with only one example and no explanation of how it functions. I suggest reading through the explanations of concat(), merge(), and join() in the pandas docs (DataFrame.append() was deprecated and then removed in pandas 2.0 in favor of concat()). Considering how important it is to combine data sets, such as when we have multiple files that need to be concatenated or disparate tables that need to be joined, this lesson is seriously lacking and covers only the most perfect scenarios of combining data.
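Since the lesson is thin here, this is a rough sketch of how the combining methods differ, based on my own understanding rather than the course material. The order and customer tables are made up, and the mismatched customer IDs are deliberate so the join types give different results.

```python
import pandas as pd

# Hypothetical tables: two monthly order files and a customer lookup.
jan = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11]})
feb = pd.DataFrame({"order_id": [3], "customer_id": [10]})
customers = pd.DataFrame({"customer_id": [10, 12], "name": ["Ana", "Ben"]})

# concat() stacks DataFrames with the same columns end to end.
orders = pd.concat([jan, feb], ignore_index=True)

# merge() matches rows on a key column, much like a SQL join.
# An inner merge keeps only orders with a matching customer.
inner = orders.merge(customers, on="customer_id", how="inner")

# A left merge keeps every order and fills unmatched names with NaN.
left = orders.merge(customers, on="customer_id", how="left")

# join() does the same kind of matching, but on the index.
joined = orders.set_index("customer_id").join(
    customers.set_index("customer_id"), how="left"
)

# rename() changes column labels; rename_axis() names the row axis.
renamed = orders.rename(columns={"order_id": "order"}).rename_axis("row", axis="rows")
```

The how argument is where a SQL background pays off: "inner", "left", "right", and "outer" behave like their SQL counterparts.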

Summary

This course was a bit undercooked, though it is still a great introduction to pandas, which is a fairly simple library with a lot of straightforward functionality but also some more complex functionality, such as mapping and applying functions. The examples did a good job of covering the basic uses of pandas and the commonly used methods. Again, I would have liked to see more in-depth instruction on the different methods for combining data. Having a background in SQL is a good, but not necessary, prerequisite to using pandas, especially when it comes to combining data. Pandas is foundational to data analysis in Python, and this course seems like it will do a good job of helping new learners hit the ground running with the basics. These Kaggle courses so far have done a good job of teaching the basics, though more in-depth learning is necessary to understand and use these languages and libraries to their full extent.