6.3. Case Study 1: Graphing Infant Mortality on a Map
Let’s take on the seemingly simple task of plotting some of the country data on a map like we did in Google Sheets earlier. We’ll see that this is one area where things are not quite as simple as they are in Sheets. But we can make it work with a bit of effort.
Altair provides us with the facility to make a blank map. But filling in the data requires a bit more work on our part.
This is a good example of learning by example, then extrapolating what you need to do based on understanding the example.
The counties data that is passed to the chart is the data needed to create and outline the map.
import pandas as pd
import altair as alt
from vega_datasets import data
counties = alt.topo_feature(data.us_10m.url, 'counties')
unemp_data = data.unemployment.url
alt.Chart(counties).mark_geoshape().project(
    type='albersUsa'
).properties(
    width=500,
    height=300
)
What about our encoding channels?! The primary data needed to draw the map using a mark_geoshape was passed to the Chart, but that is really secondary data for us. What we care about is graphing the unemployment data by county. That is in a different data frame with a column called rate.
With a geoshape, we can encode the county data using color. But there is no unemployment data in counties, so we have to use a transform_lookup to map from the way counties are identified in the geo data to our DataFrame that contains unemployment data.
unemp_data = pd.read_csv('http://vega.github.io/vega-datasets/data/unemployment.tsv',sep='\t')
unemp_data.head()
| | id | rate |
|---|---|---|
| 0 | 1001 | 0.097 |
| 1 | 1003 | 0.091 |
| 2 | 1005 | 0.134 |
| 3 | 1007 | 0.121 |
| 4 | 1009 | 0.099 |
Using the transform_lookup method, we can arrange for the id in the geographic data to be matched against the id in our unemp_data data frame. This allows us to make use of two data sources in one graph. The example below is a bit misleading, in that id is used both as the lookup and as the key in the call to LookupData. The lookup value refers to the field name in the data passed to Chart, whereas the second parameter to the LookupData call is the name of the column in the unemp_data DataFrame. It is just a coincidence that they have the same name in this example.
alt.Chart(counties).mark_geoshape(
).encode(
    color='rate:Q'
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(unemp_data, 'id', ['rate'])
).project(
    type='albersUsa'
).properties(
    width=500,
    height=300,
    title='Unemployment by County'
)
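To make the roles of the two names concrete, here is a minimal plain-Python sketch of what a lookup does conceptually. It is not how Altair implements the transform. The key name `fips` and the helper `lookup_rates` are hypothetical, chosen so that the two sides deliberately use different names, and the id 9999 is an invented county with no match. The rates are the first rows of the unemployment table shown above.

```python
# Geographic features, each identified by an 'id' field (as in the counties data).
# The id 9999 is invented to show what happens when nothing matches.
geo_features = [{'id': 1001}, {'id': 1003}, {'id': 9999}]

# Secondary data keyed by a differently named column ('fips' is hypothetical).
unemp_rows = [
    {'fips': 1001, 'rate': 0.097},
    {'fips': 1003, 'rate': 0.091},
]

def lookup_rates(features, rows, lookup, key, fields):
    """Copy `fields` from the row whose `key` equals each feature's `lookup` value."""
    index = {row[key]: row for row in rows}   # key value -> row
    joined = []
    for feature in features:
        match = index.get(feature[lookup], {})
        extra = {field: match.get(field) for field in fields}
        joined.append({**feature, **extra})
    return joined

joined = lookup_rates(geo_features, unemp_rows, lookup='id', key='fips', fields=['rate'])
for row in joined:
    print(row)
```

Here `lookup='id'` names the field on the map side and `key='fips'` names the column on the data side, mirroring the first argument to transform_lookup and the second argument to LookupData. The unmatched feature ends up with no rate.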
6.3.1. Using a Web API to get Country Codes
Can you make use of the provided example and the Altair documentation to produce a graph of the world where the countries are colored by one of the features in the data?
In this part of the project we will:
- Learn about using web APIs for data gathering
- Use a web API to get data that maps country codes to country numbers
- Learn how to add columns to a data frame using the map function, and possibly learn to use a lambda function if you’ve never used one before
Let’s make a to-do list:
- We need to add a column to our wd DataFrame that contains the numerical country id. Where can we get this data? There may be some CSV files with this information already in them, but this is a good chance to learn about a common technique used by data scientists everywhere: web APIs. API stands for Application Programming Interface. Each website will have its own convention for how you ask it for data, and the format in which the data is returned.
- Once we have the new column, we can follow the example from above to make a world map and show birthrate data.
The first step is to make use of the awesome requests module. The requests module allows us to easily communicate with websites and web services across the internet. The documentation for it is fantastic, so you should use that to learn about requests in more detail. We’ll just give you the bare bones here to get started.
The website called restcountries.eu provides an interface for us to get data from their site rather than a web page. When thinking about a web API, you have to understand how to ask it for the data you want. In this case, we will use /rest/v2/alpha/XXX. Let’s unpack that into pieces and look at what each one is telling us.
- /rest: Technically, REST stands for REpresentational State Transfer. This uses the HTTP protocol to ask for and respond with data.
- /v2: This is version 2 of this website’s protocol.
- /alpha: This tells the website that the next thing we are going to pass it is the three-letter code for the country.
- XXX: This can be any valid three-letter country code, for example “usa”.
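The pieces listed above can be assembled in code. The following is a small sketch, assuming the URL layout just described. The helper name `build_alpha_url` and the constant `BASE_URL` are made up for this illustration and are not part of any library.

```python
BASE_URL = 'https://restcountries.eu'

def build_alpha_url(country_code):
    """Assemble the endpoint URL for a three-letter country code."""
    pieces = ['rest', 'v2', 'alpha', country_code.lower()]
    return BASE_URL + '/' + '/'.join(pieces)

print(build_alpha_url('USA'))
```

Building the URL in one place makes it easy to request many countries later by changing only the final piece.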
Open a new tab in your browser and paste this URL: https://restcountries.eu/rest/v2/alpha/usa. You will see that you don’t get a web page in response, but rather some information that looks like a Python dictionary. We’ll explore that more below. We can do the same thing from a Python program using the requests library.
import requests
res = requests.get('https://restcountries.eu/rest/v2/alpha/usa')
res.status_code
200
The status code of 200 tells us that everything went fine. If you make a typo in the URL, you may see the familiar status code of 404, meaning not found.
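If you want to see what a numeric status code means without memorizing the list, Python's standard library already knows the official names. This is a minimal sketch that does not touch the network; the helper `describe_status` is a hypothetical name, and in practice you could pass it `res.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase

for code in (200, 404):
    print(code, describe_status(code))
```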
We can also look at the text that was returned.
res.text
'{"name":"United States of America","topLevelDomain":[".us"],"alpha2Code":"US","alpha3Code":"USA","callingCodes":["1"],"capital":"Washington, D.C.","altSpellings":["US","USA","United States of America"],"region":"Americas","subregion":"Northern America","population":323947000,"latlng":[38.0,-97.0],"demonym":"American","area":9629091.0,"gini":48.0,"timezones":["UTC-12:00","UTC-11:00","UTC-10:00","UTC-09:00","UTC-08:00","UTC-07:00","UTC-06:00","UTC-05:00","UTC-04:00","UTC+10:00","UTC+12:00"],"borders":["CAN","MEX"],"nativeName":"United States","numericCode":"840","currencies":[{"code":"USD","name":"United States dollar","symbol":"$"}],"languages":[{"iso639_1":"en","iso639_2":"eng","name":"English","nativeName":"English"}],"translations":{"de":"Vereinigte Staaten von Amerika","es":"Estados Unidos","fr":"États-Unis","ja":"アメリカ合衆国","it":"Stati Uniti D'America","br":"Estados Unidos","pt":"Estados Unidos","nl":"Verenigde Staten","hr":"Sjedinjene Američke Države","fa":"ایالات متحده آمریکا"},"flag":"https://restcountries.eu/data/usa.svg","regionalBlocs":[{"acronym":"NAFTA","name":"North American Free Trade Agreement","otherAcronyms":[],"otherNames":["Tratado de Libre Comercio de América del Norte","Accord de Libre-échange Nord-Américain"]}],"cioc":"USA"}'
That looks like an ugly mess! Fortunately, it’s not as bad as it seems. If you look closely at the data, you will see that it starts with a { and ends with a }. In fact, you may realize this looks a lot like a Python dictionary! If you thought that, you are correct. This is a big long string that represents a Python dictionary. Better yet, we can convert this string into an actual Python dictionary and then access the individual key-value pairs stored in the dictionary using the usual Python syntax!
The official name for the format that we saw above is JSON: JavaScript Object Notation. It’s a good acronym to know, but you don’t have to know anything about JavaScript in order to make use of JSON. You can think of the results as a Python dictionary. It can be a bit daunting at first, as there can be many keys, and JSON is often full of dictionaries of dictionaries of lists of dictionaries. But fear not: you can figure it out with a bit of experimentation.
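The conversion from string to dictionary can be tried without any network access by using the standard `json` module. The sketch below uses a shortened, hand-copied excerpt of the response text shown above; the variable names are only for illustration.

```python
import json

# A shortened excerpt of the response text shown above.
text = '{"name":"United States of America","alpha3Code":"USA","numericCode":"840","borders":["CAN","MEX"]}'

info = json.loads(text)   # JSON string -> Python dictionary

print(type(text).__name__, '->', type(info).__name__)
print(info['name'])
print(info['borders'])
```

Notice that numericCode comes back as the string '840' rather than a number, which matters if you later want to treat it as an integer.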
usa_info = res.json()
usa_info
{'name': 'United States of America',
'topLevelDomain': ['.us'],
'alpha2Code': 'US',
'alpha3Code': 'USA',
'callingCodes': ['1'],
'capital': 'Washington, D.C.',
'altSpellings': ['US', 'USA', 'United States of America'],
'region': 'Americas',
'subregion': 'Northern America',
'population': 323947000,
'latlng': [38.0, -97.0],
'demonym': 'American',
'area': 9629091.0,
'gini': 48.0,
'timezones': ['UTC-12:00',
'UTC-11:00',
'UTC-10:00',
'UTC-09:00',
'UTC-08:00',
'UTC-07:00',
'UTC-06:00',
'UTC-05:00',
'UTC-04:00',
'UTC+10:00',
'UTC+12:00'],
'borders': ['CAN', 'MEX'],
'nativeName': 'United States',
'numericCode': '840',
'currencies': [{'code': 'USD',
'name': 'United States dollar',
'symbol': '$'}],
'languages': [{'iso639_1': 'en',
'iso639_2': 'eng',
'name': 'English',
'nativeName': 'English'}],
'translations': {'de': 'Vereinigte Staaten von Amerika',
'es': 'Estados Unidos',
'fr': 'États-Unis',
'ja': 'アメリカ合衆国',
'it': "Stati Uniti D'America",
'br': 'Estados Unidos',
'pt': 'Estados Unidos',
'nl': 'Verenigde Staten',
'hr': 'Sjedinjene Američke Države',
'fa': 'ایالات متحده آمریکا'},
'flag': 'https://restcountries.eu/data/usa.svg',
'regionalBlocs': [{'acronym': 'NAFTA',
'name': 'North American Free Trade Agreement',
'otherAcronyms': [],
'otherNames': ['Tratado de Libre Comercio de América del Norte',
'Accord de Libre-échange Nord-Américain']}],
'cioc': 'USA'}
For example, timezones is a top-level key, which produces a list of the time zones in the USA.
usa_info['timezones']
['UTC-12:00',
'UTC-11:00',
'UTC-10:00',
'UTC-09:00',
'UTC-08:00',
'UTC-07:00',
'UTC-06:00',
'UTC-05:00',
'UTC-04:00',
'UTC+10:00',
'UTC+12:00']
But languages is more complicated. It also returns a list, but each element of the list is a dictionary that corresponds to one of the official languages of the country. The USA entry lists only one language, but other countries have more. For example, Malta has both Maltese and English as official languages. Notice that the two dictionaries have an identical structure: a key for the two-letter abbreviation, a key for the three-letter abbreviation, the name, and the native name.
[{'iso639_1': 'mt',
'iso639_2': 'mlt',
'name': 'Maltese',
'nativeName': 'Malti'},
{'iso639_1': 'en',
'iso639_2': 'eng',
'name': 'English',
'nativeName': 'English'}]
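Working through a list of dictionaries takes a little practice. Here is a short sketch that uses the Malta languages list copied from the output above, so it runs without calling the API; the variable names are just for illustration.

```python
# The Malta languages list, copied from the output above.
languages = [
    {'iso639_1': 'mt', 'iso639_2': 'mlt', 'name': 'Maltese', 'nativeName': 'Malti'},
    {'iso639_1': 'en', 'iso639_2': 'eng', 'name': 'English', 'nativeName': 'English'},
]

# Index into the list first, then into the dictionary.
first_name = languages[0]['name']

# Because every element has the same keys, a loop or comprehension works on all of them.
all_names = [language['name'] for language in languages]
codes_by_name = {language['name']: language['iso639_2'] for language in languages}

print(first_name)
print(all_names)
print(codes_by_name)
```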
Now that we have a really nice way to get the additional country information, let’s add the numeric country code as a new column in our wd DataFrame. We can think of adding the column as a transformation of our three-letter country code to a number. We can do this using the map function. You learned about map in the Python Review section of this book; revisit that section if you need to refresh your memory.
When we use Pandas, the difference is that we don’t pass the list as a parameter to map. Instead, map is a method of a Series, so we use the syntax df.myColumn.map(function). This applies the function we pass as a parameter to each element of the series and constructs a brand new series.
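As a small illustration of the syntax, the sketch below maps a function over a Series, once with a named function and once with a lambda. The data and the `to_upper` helper are made up for the example and have nothing to do with the wd DataFrame.

```python
import pandas as pd

codes = pd.Series(['usa', 'mlt', 'can'])

def to_upper(code):
    """Return the country code in upper case."""
    return code.upper()

# Pass the function itself (no parentheses); map calls it once per element.
upper_named = codes.map(to_upper)

# The same transformation written inline as a lambda.
upper_lambda = codes.map(lambda code: code.upper())

print(upper_lambda.tolist())
```

In both cases the original Series is left unchanged and a new Series is returned, which is why the result is usually assigned to a new column.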
For our case, we need to write a function that takes a three-letter country code as a parameter and returns the numeric code we look up, converted to an integer. Let’s call it get_num_code. You have all the details you need to write this function. Once you write this function, you can use the code below.
wd['CodeNum'] = wd.Code.map(get_num_code)
wd.head()
| | Country | Ctry | Code | CodeNum | Region | Population | Area | Pop. Density | Coastline | Net migration | ... | Phones | Arable | Crops | Other | Climate | Birthrate | Deathrate | Agriculture | Industry | Service |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Afghanistan | AFG | 4.0 | ASIA (EX. NEAR EAST) | 31056997 | 647500 | 48.0 | 0.00 | 23.06 | ... | 3.2 | 12.13 | 0.22 | 87.65 | 1.0 | 46.60 | 20.34 | 0.380 | 0.240 | 0.380 |
| 1 | Albania | Albania | ALB | 8.0 | EASTERN EUROPE | 3581655 | 28748 | 124.6 | 1.26 | -4.93 | ... | 71.2 | 21.09 | 4.42 | 74.49 | 3.0 | 15.11 | 5.22 | 0.232 | 0.188 | 0.579 |
| 2 | Algeria | Algeria | DZA | 12.0 | NORTHERN AFRICA | 32930091 | 2381740 | 13.8 | 0.04 | -0.39 | ... | 78.1 | 3.22 | 0.25 | 96.53 | 1.0 | 17.14 | 4.61 | 0.101 | 0.600 | 0.298 |
| 3 | American Samoa | American Samoa | ASM | 16.0 | OCEANIA | 57794 | 199 | 290.4 | 58.29 | -20.71 | ... | 259.5 | 10.00 | 15.00 | 75.00 | 2.0 | 22.46 | 3.27 | NaN | NaN | NaN |
| 4 | Andorra | Andorra | AND | 20.0 | WESTERN EUROPE | 71201 | 468 | 152.1 | 0.00 | 6.60 | ... | 497.2 | 2.22 | 0.00 | 97.78 | 3.0 | 8.71 | 6.25 | NaN | NaN | NaN |
5 rows × 23 columns
Warning
DataFrame Gotcha
Be careful: wd.CodeNum and wd['CodeNum'] are ALMOST always interchangeable, except when you create a new column. When you create a new column you MUST use wd['CodeNum'] = new_column_expression. If you write wd.CodeNum = new_column_expression, it will add a CodeNum attribute to the wd object rather than creating a new column. This is consistent with standard Python syntax, which allows you to add an attribute on the fly to most objects.
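You can check this behavior on a tiny DataFrame. The sketch below is only an illustration with made-up frames named `df` and `df2`; recent pandas versions emit a warning for the attribute form, which is silenced here so the output stays readable.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'Code': ['AFG', 'ALB']})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # hide the pandas warning about this pattern
    df.CodeNum = [4, 8]               # sets a plain Python attribute

attribute_made_column = 'CodeNum' in df.columns

df2 = pd.DataFrame({'Code': ['AFG', 'ALB']})
df2['CodeNum'] = [4, 8]               # creates a real column
bracket_made_column = 'CodeNum' in df2.columns

print(attribute_made_column, bracket_made_column)
```

Only the bracket form adds CodeNum to the list of columns.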
You can make a gray map of the world like this.
countries = alt.topo_feature(data.world_110m.url, 'countries')
alt.Chart(countries).mark_geoshape(
    fill='#666666',
    stroke='white'
).properties(
    width=750,
    height=450
).project('equirectangular')
You now have the information you need to take the counties example above and apply it to the world map below.
base = alt.Chart(countries).mark_geoshape(
).encode(
    tooltip='Country:N',
    color=alt.Color('Infant mortality:Q', scale=alt.Scale(scheme="plasma"))
).transform_lookup(
    # your code here
).properties(
    width=750,
    height=450
).project('equirectangular')
base
Your final result should look like this.
6.3.2. Using a Web API on Your Own
Find a web API that provides some numeric data that interests you. There is a ton of data available in the world of finance, sports, environment, travel, etc. A great place to look is at The Programmable Web. Yes, this assignment is a bit vague and open-ended, but that is part of the excitement. You get to find an API and graph some data that appeals to you, not something some author or professor picked out. You might even feel like you have awesome superpowers by the time you finish this project.
Use the web API to obtain the data. Most sites are going to provide it in JSON format similar to what we saw.
Next, create a graph of your data using Altair.
Take some time to talk about and present the data and the graph you created to the class.
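Many APIs return a JSON list of records, and one common way to get from there to something you can graph is to load the records into a DataFrame. The sketch below assumes that shape. The payload is a made-up placeholder, not real data from any API, and the field names `city` and `value` are hypothetical.

```python
import json
import pandas as pd

# Placeholder payload standing in for a real API response (values are made up).
payload = '[{"city": "A", "value": 1.5}, {"city": "B", "value": 2.0}]'

records = json.loads(payload)   # list of dictionaries
df = pd.DataFrame(records)      # one row per record, one column per key

print(df)
```

A DataFrame in this shape can be passed to alt.Chart in the same way as the data frames used earlier in this chapter.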