- Comparison with SQL
- Python Pandas - Caveats & Gotchas
- Python Pandas - Sparse Data
- Python Pandas - IO Tools
- Python Pandas - Visualization
- Python Pandas - Categorical Data
- Python Pandas - Timedelta
- Python Pandas - Date Functionality
- Python Pandas - Concatenation
- Python Pandas - Merging/Joining
- Python Pandas - GroupBy
- Python Pandas - Missing Data
- Python Pandas - Aggregations
- Python Pandas - Window Functions
- Statistical Functions
- Indexing & Selecting Data
- Options & Customization
- Working with Text Data
- Python Pandas - Sorting
- Python Pandas - Iteration
- Python Pandas - Reindexing
- Function Application
- Descriptive Statistics
- Python Pandas - Basic Functionality
- Python Pandas - Panel
- Python Pandas - DataFrame
- Python Pandas - Series
- Introduction to Data Structures
- Python Pandas - Environment Setup
- Python Pandas - Introduction
- Python Pandas - Home
Python Pandas Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Python Pandas - Comparison with SQL
Since many potential Pandas users have some famiparity with SQL, this page is meant to provide some examples of how various SQL operations can be performed using pandas.
import pandas as pd url = https://raw.github.com/pandasdev/ pandas/master/pandas/tests/data/tips.csv tips=pd.read_csv(url) print tips.head()
Its output is as follows −
total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4
SELECT
In SQL, selection is done using a comma-separated pst of columns that you select (or a * to select all columns) −
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
With Pandas, column selection is done by passing a pst of column names to your DataFrame −
tips[[ total_bill , tip , smoker , time ]].head(5)
Let’s check the full program −
import pandas as pd url = https://raw.github.com/pandasdev/ pandas/master/pandas/tests/data/tips.csv tips=pd.read_csv(url) print tips[[ total_bill , tip , smoker , time ]].head(5)
Its output is as follows −
total_bill tip smoker time 0 16.99 1.01 No Dinner 1 10.34 1.66 No Dinner 2 21.01 3.50 No Dinner 3 23.68 3.31 No Dinner 4 24.59 3.61 No Dinner
Calpng the DataFrame without the pst of column names will display all columns (akin to SQL’s *).
WHERE
Filtering in SQL is done via a WHERE clause.
SELECT * FROM tips WHERE time = Dinner LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using Boolean indexing.
tips[tips[ time ] == Dinner ].head(5)
Let’s check the full program −
import pandas as pd url = https://raw.github.com/pandasdev/ pandas/master/pandas/tests/data/tips.csv tips=pd.read_csv(url) print tips[tips[ time ] == Dinner ].head(5)
Its output is as follows −
total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4
The above statement passes a Series of True/False objects to the DataFrame, returning all rows with True.
GroupBy
This operation fetches the count of records in each group throughout a dataset. For instance, a query fetching us the number of tips left by sex −
SELECT sex, count(*) FROM tips GROUP BY sex;
The Pandas equivalent would be −
tips.groupby( sex ).size()
Let’s check the full program −
import pandas as pd url = https://raw.github.com/pandasdev/ pandas/master/pandas/tests/data/tips.csv tips=pd.read_csv(url) print tips.groupby( sex ).size()
Its output is as follows −
sex Female 87 Male 157 dtype: int64
Top N rows
SQL returns the top n rows using LIMIT −
SELECT * FROM tips LIMIT 5 ;
The Pandas equivalent would be −
tips.head(5)
Let’s check the full example −
import pandas as pd url = https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv tips=pd.read_csv(url) tips = tips[[ smoker , day , time ]].head(5) print tips
Its output is as follows −
smoker day time 0 No Sun Dinner 1 No Sun Dinner 2 No Sun Dinner 3 No Sun Dinner 4 No Sun Dinner
These are the few basic operations we compared are, which we learnt, in the previous chapters of the Pandas Library.
Advertisements