Remote Data Access
Warning
In pandas 0.17.0, the sub-package pandas.io.data
will be removed in favor of a separately installable pandas-datareader package. This will allow the data modules to be independently updated to your pandas installation. The API for pandas-datareader v0.1.1
is the same as in pandas v0.16.1
. (GH8961)
You should replace the imports of the following:
from pandas.io import data, wb
With:
from pandas_datareader import data, wb
Functions from pandas.io.data
and pandas.io.ga
extract data from various Internet sources into a DataFrame. Currently the following sources are supported:
- Yahoo! Finance
- Google Finance
- St.Louis FED (FRED)
- Kenneth French’s data library
- World Bank
- Google Analytics
It should be noted, that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.
Yahoo! Finance
In [1]: import pandas.io.data as web In [2]: import datetime In [3]: start = datetime.datetime(2010, 1, 1) In [4]: end = datetime.datetime(2013, 1, 27) In [5]: f = web.DataReader("F", 'yahoo', start, end) In [6]: f.ix['2010-01-04'] Out[6]: Open 1.017000e+01 High 1.028000e+01 Low 1.005000e+01 Close 1.028000e+01 Volume 6.085580e+07 Adj Close 8.755953e+00 Name: 2010-01-04 00:00:00, dtype: float64
Yahoo! Finance Options
*Experimental*
The Options
class allows the download of options data from Yahoo! Finance.
The get_all_data
method downloads and caches option data for all expiry months and provides a formatted DataFrame
with a hierarchical index, so it is easy to get to the specific option you want.
In [7]: from pandas.io.data import Options In [8]: aapl = Options('aapl', 'yahoo') In [9]: data = aapl.get_all_data() In [10]: data.iloc[0:5, 0:5] Out[10]: Last Bid Ask Chg PctChg Strike Expiry Type Symbol 10.0 2016-06-17 call AAPL160617C00010000 0.0 91.0 91.45 0.0 0.00% put AAPL160617P00010000 0.0 0.0 0.03 0.0 0.00% 12.5 2016-06-17 call AAPL160617C00012500 0.0 88.5 88.95 0.0 0.00% put AAPL160617P00012500 0.0 0.0 0.03 0.0 0.00% 15.0 2016-06-17 call AAPL160617C00015000 0.0 86.0 86.45 0.0 0.00% # Show the $100 strike puts at all expiry dates: In [11]: data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5] Out[11]: Last Bid Ask Chg PctChg Strike Expiry Type Symbol 100.0 2016-05-06 put AAPL160506P00100000 6.35 6.35 6.55 -0.62 -8.90% 2016-05-13 put AAPL160513P00100000 6.50 6.40 6.60 -1.40 -17.72% 2016-05-20 put AAPL160520P00100000 6.65 6.55 6.75 -0.47 -6.60% 2016-05-27 put AAPL160527P00100000 7.12 6.75 6.90 0.00 0.00% 2016-06-03 put AAPL160603P00100000 7.18 6.70 6.90 -0.72 -9.11% # Show the volume traded of $100 strike puts at all expiry dates: In [12]: data.loc[(100, slice(None), 'put'),'Vol'].head() Out[12]: Strike Expiry Type Symbol 100.0 2016-05-06 put AAPL160506P00100000 43 2016-05-13 put AAPL160513P00100000 31 2016-05-20 put AAPL160520P00100000 38 2016-05-27 put AAPL160527P00100000 30 2016-06-03 put AAPL160603P00100000 10 Name: Vol, dtype: int64
If you don’t want to download all the data, more specific requests can be made.
In [13]: import datetime In [14]: expiry = datetime.date(2016, 1, 1) In [15]: data = aapl.get_call_data(expiry=expiry) In [16]: data.iloc[0:5:, 0:5] Out[16]: Last Bid Ask Chg PctChg Strike Expiry Type Symbol 55.0 2016-05-06 call AAPL160506C00055000 37.74 39.15 39.55 0.0 0.00% 80.0 2016-05-06 call AAPL160506C00080000 13.75 14.25 14.50 0.0 0.00% 85.0 2016-05-06 call AAPL160506C00085000 9.25 9.20 9.45 0.4 4.52% 86.0 2016-05-06 call AAPL160506C00086000 6.91 8.15 8.50 0.0 0.00% 87.0 2016-05-06 call AAPL160506C00087000 6.35 6.70 6.90 0.0 0.00%
Note that if you call get_all_data
first, this second call will happen much faster, as the data is cached.
If a given expiry date is not available, data for the next available expiry will be returned (January 15, 2015 in the above example).
Available expiry dates can be accessed from the expiry_dates
property.
In [17]: aapl.expiry_dates Out[17]: [datetime.date(2016, 5, 6), datetime.date(2016, 5, 13), datetime.date(2016, 5, 20), datetime.date(2016, 5, 27), datetime.date(2016, 6, 3), datetime.date(2016, 6, 10), datetime.date(2016, 6, 17), datetime.date(2016, 7, 15), datetime.date(2016, 8, 19), datetime.date(2016, 10, 21), datetime.date(2017, 1, 20), datetime.date(2017, 3, 17), datetime.date(2017, 6, 16), datetime.date(2018, 1, 19)] In [18]: data = aapl.get_call_data(expiry=aapl.expiry_dates[0]) In [19]: data.iloc[0:5:, 0:5] Out[19]: Last Bid Ask Chg PctChg Strike Expiry Type Symbol 55.0 2016-05-06 call AAPL160506C00055000 37.74 39.15 39.55 0.0 0.00% 80.0 2016-05-06 call AAPL160506C00080000 13.75 14.25 14.50 0.0 0.00% 85.0 2016-05-06 call AAPL160506C00085000 9.25 9.20 9.45 0.4 4.52% 86.0 2016-05-06 call AAPL160506C00086000 6.91 8.15 8.50 0.0 0.00% 87.0 2016-05-06 call AAPL160506C00087000 6.35 6.70 6.90 0.0 0.00%
A list-like object containing dates can also be passed to the expiry parameter, returning options data for all expiry dates in the list.
In [20]: data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3]) In [21]: data.iloc[0:5:, 0:5] Out[21]: Last Bid Ask Chg PctChg Strike Expiry Type Symbol 94.0 2016-05-13 call AAPL160513C00094000 1.53 1.42 1.48 0.28 22.40% 2016-05-20 call AAPL160520C00094000 1.95 1.93 1.95 0.29 17.47% 94.5 2016-05-06 call AAPL160506C00094500 0.69 0.71 0.73 0.07 11.29% 2016-05-13 call AAPL160513C00094500 1.34 1.20 1.24 0.30 28.85% 2016-05-20 call AAPL160520C00094500 1.63 1.64 1.66 0.11 7.24%
The month
and year
parameters can be used to get all options data for a given month.
Google Finance
In [22]: import pandas.io.data as web In [23]: import datetime In [24]: start = datetime.datetime(2010, 1, 1) In [25]: end = datetime.datetime(2013, 1, 27) In [26]: f = web.DataReader("F", 'google', start, end) In [27]: f.ix['2010-01-04'] Out[27]: Open 10.17 High 10.28 Low 10.05 Close 10.28 Volume 60855796.00 Name: 2010-01-04 00:00:00, dtype: float64
FRED
In [28]: import pandas.io.data as web In [29]: import datetime In [30]: start = datetime.datetime(2010, 1, 1) In [31]: end = datetime.datetime(2013, 1, 27) In [32]: gdp=web.DataReader("GDP", "fred", start, end) In [33]: gdp.ix['2013-01-01'] Out[33]: GDP 16440.7 Name: 2013-01-01 00:00:00, dtype: float64 # Multiple series: In [34]: inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end) In [35]: inflation.head() Out[35]: CPIAUCSL CPILFESL DATE 2010-01-01 217.488 220.633 2010-02-01 217.281 220.731 2010-03-01 217.353 220.783 2010-04-01 217.403 220.822 2010-05-01 217.290 220.962
Fama/French
Dataset names are listed at Fama/French Data Library.
In [36]: import pandas.io.data as web In [37]: ip = web.DataReader("5_Industry_Portfolios", "famafrench") In [38]: ip[4].ix[192607] Out[38]: 1 Cnsmr 5.43 2 Manuf 2.73 3 HiTec 1.83 4 Hlth 1.77 5 Other 2.16 Name: 192607, dtype: float64
World Bank
pandas
users can easily access thousands of panel data series from the World Bank’s World Development Indicators by using the wb
I/O functions.
Indicators
Either from exploring the World Bank site, or using the search function included, every world bank indicator is accessible.
For example, if you wanted to compare the Gross Domestic Products per capita in constant dollars in North America, you would use the search
function:
In [1]: from pandas.io import wb In [2]: wb.search('gdp.*capita.*const').iloc[:,:2] Out[2]: id name 3242 GDPPCKD GDP per Capita, constant US$, millions 5143 NY.GDP.PCAP.KD GDP per capita (constant 2005 US$) 5145 NY.GDP.PCAP.KN GDP per capita (constant LCU) 5147 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 internation...
Then you would use the download
function to acquire the data from the World Bank’s servers:
In [3]: dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008) In [4]: print(dat) NY.GDP.PCAP.KD country year Canada 2008 36005.5004978584 2007 36182.9138439757 2006 35785.9698172849 2005 35087.8925933298 Mexico 2008 8113.10219480083 2007 8119.21298908649 2006 7961.96818458178 2005 7666.69796097264 United States 2008 43069.5819857208 2007 43635.5852068142 2006 43228.111147107 2005 42516.3934699993
The resulting dataset is a properly formatted DataFrame
with a hierarchical index, so it is easy to apply .groupby
transformations to it:
In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean() Out[6]: country Canada 35765.569188 Mexico 7965.245332 United States 43112.417952 dtype: float64
Now imagine you want to compare GDP to the share of people with cellphone contracts around the world.
In [7]: wb.search('cell.*%').iloc[:,:2] Out[7]: id name 3990 IT.CEL.SETS.FE.ZS Mobile cellular telephone users, female (% of ... 3991 IT.CEL.SETS.MA.ZS Mobile cellular telephone users, male (% of po... 4027 IT.MOB.COV.ZS Population coverage of mobile cellular telepho...
Notice that this second search was much faster than the first one because pandas
now has a cached list of available data series.
In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS'] In [14]: dat = wb.download(indicator=ind, country='all', start=2011, end=2011).dropna() In [15]: dat.columns = ['gdp', 'cellphone'] In [16]: print(dat.tail()) gdp cellphone country year Swaziland 2011 2413.952853 94.9 Tunisia 2011 3687.340170 100.0 Uganda 2011 405.332501 100.0 Zambia 2011 767.911290 62.0 Zimbabwe 2011 419.236086 72.4
Finally, we use the statsmodels
package to assess the relationship between our two variables using ordinary least squares regression. Unsurprisingly, populations in rich countries tend to use cellphones at a higher rate:
In [17]: import numpy as np In [18]: import statsmodels.formula.api as smf In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit() In [20]: print(mod.summary()) OLS Regression Results ============================================================================== Dep. Variable: cellphone R-squared: 0.297 Model: OLS Adj. R-squared: 0.274 Method: Least Squares F-statistic: 13.08 Date: Thu, 25 Jul 2013 Prob (F-statistic): 0.00105 Time: 15:24:42 Log-Likelihood: -139.16 No. Observations: 33 AIC: 282.3 Df Residuals: 31 BIC: 285.3 Df Model: 1 =============================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------- Intercept 16.5110 19.071 0.866 0.393 -22.384 55.406 np.log(gdp) 9.9333 2.747 3.616 0.001 4.331 15.535 ============================================================================== Omnibus: 36.054 Durbin-Watson: 2.071 Prob(Omnibus): 0.000 Jarque-Bera (JB): 119.133 Skew: -2.314 Prob(JB): 1.35e-26 Kurtosis: 11.077 Cond. No. 45.8 ==============================================================================
Country Codes
New in version 0.15.1.
The country
argument accepts a string or list of mixed two or three character ISO country codes, as well as dynamic World Bank exceptions to the ISO standards.
For a list of the the hard-coded country codes (used solely for error handling logic) see pandas.io.wb.country_codes
.
Problematic Country Codes & Indicators
Note
The World Bank’s country list and indicators are dynamic. As of 0.15.1, wb.download()
is more flexible. To achieve this, the warning and exception logic changed.
The world bank converts some country codes in their response, which makes error checking by pandas difficult. Retired indicators still persist in the search.
Given the new flexibility of 0.15.1, improved error handling by the user may be necessary for fringe cases.
To help identify issues:
There are at least 4 kinds of country codes:
- Standard (2/3 digit ISO) - returns data, will warn and error properly.
- Non-standard (WB Exceptions) - returns data, but will falsely warn.
- Blank - silently missing from the response.
- Bad - causes the entire response from WB to fail, always exception inducing.
There are at least 3 kinds of indicators:
- Current - Returns data.
- Retired - Appears in search results, yet won’t return data.
- Bad - Will not return data.
Use the errors
argument to control warnings and exceptions. Setting errors to ignore or warn, won’t stop failed responses. (ie, 100% bad indicators, or a single “bad” (#4 above) country code).
See docstrings for more info.
Google Analytics
The ga
module provides a wrapper for Google Analytics API to simplify retrieving traffic data. Result sets are parsed into a pandas DataFrame with a shape and data types derived from the source table.
Configuring Access to Google Analytics
The first thing you need to do is to setup accesses to Google Analytics API. Follow the steps below:
-
- In the Google Developers Console
-
- enable the Analytics API
- create a new project
- create a new Client ID for an “Installed Application” (in the “APIs & auth / Credentials section” of the newly created project)
- download it (JSON file)
-
- On your machine
-
- rename it to
client_secrets.json
- move it to the
pandas/io
module directory
- rename it to
The first time you use the read_ga()
function, a browser window will open to ask you to authentify to the Google API. Do proceed.
Using the Google Analytics API
The following will fetch users and pageviews (metrics) data per day of the week, for the first semester of 2014, from a particular property.
import pandas.io.ga as ga ga.read_ga( account_id = "2360420", profile_id = "19462946", property_id = "UA-2360420-5", metrics = ['users', 'pageviews'], dimensions = ['dayOfWeek'], start_date = "2014-01-01", end_date = "2014-08-01", index_col = 0, filters = "pagePath=~aboutus;ga:country==France", )
The only mandatory arguments are metrics,
dimensions
and start_date
. We strongly recommend that you always specify the account_id
, profile_id
and property_id
to avoid accessing the wrong data bucket in Google Analytics.
The index_col
argument indicates which dimension(s) has to be taken as index.
The filters
argument indicates the filtering to apply to the query. In the above example, the page URL has to contain aboutus
AND the visitors country has to be France.
Detailed information in the following:
- pandas & google analytics, by yhat
- Google Analytics integration in pandas, by Chang She
- Google Analytics Dimensions and Metrics Reference
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.18.1/remote_data.html