Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.

If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.

As is customary, we import pandas and numpy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame called tips and assume we have a database table of the same name and structure.

In [3]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

In [4]: tips = pd.read_csv(url)

In [5]: tips.head()
Out[5]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Out[6]: 
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
3       23.68  3.31     No  Dinner
4       24.59  3.61     No  Dinner

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).

WHERE

Filtering in SQL is done via a WHERE clause.

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.

In [7]: tips[tips['time'] == 'Dinner'].head(5)
Out[7]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with True.

In [8]: is_dinner = tips['time'] == 'Dinner'

In [9]: is_dinner.value_counts()
Out[9]: 
True     176
False     68
Name: time, dtype: int64

In [10]: tips[is_dinner].head(5)

© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.22.0/comparison_with_sql.html