8.3. Merging and Tidying Data
In this section, we will learn how to merge two data frames that contain information we wish to put together. Merging creates a new data frame that combines the information of both, provided they share a common key column: one that serves as a primary key (a column with no repeated values) in one data frame and as a foreign key in the other. Then we will learn the basic principles of tidying up data and the kinds of relationships a data set can contain.
Now that we know how the file is encoded, we can read it easily.
c_codes = pd.read_csv('Data/country_codes.csv', encoding='iso-8859-1')
c_codes.head()
|  | country | code_2 | code_3 | country_code | iso_3166_2 | continent | sub_region | region_code | sub_region_code |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 4 | ISO 3166-2:AF | Asia | Southern Asia | 142.0 | 34.0 |
| 1 | Åland Islands | AX | ALA | 248 | ISO 3166-2:AX | Europe | Northern Europe | 150.0 | 154.0 |
| 2 | Albania | AL | ALB | 8 | ISO 3166-2:AL | Europe | Southern Europe | 150.0 | 39.0 |
| 3 | Algeria | DZ | DZA | 12 | ISO 3166-2:DZ | Africa | Northern Africa | 2.0 | 15.0 |
| 4 | American Samoa | AS | ASM | 16 | ISO 3166-2:AS | Oceania | Polynesia | 9.0 | 61.0 |
This DataFrame has a lot of information, and we can add all or just a bit of it to our United Nations DataFrame using the `merge` method of Pandas. Before we merge, let’s clean up the column names on the `undf` data frame and rename `country` to `code_3` to be consistent with the above.
undf.columns = ['session', 'year', 'code_3', 'text']
undf.head()
|  | session | year | code_3 | text |
|---|---|---|---|---|
| 0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... |
| 1 | 44 | 1989 | FIN | \nMay I begin by congratulating you. Sir, on ... |
| 2 | 44 | 1989 | NER | \nMr. President, it is a particular pleasure ... |
| 3 | 44 | 1989 | URY | \nDuring the debate at the fortieth session o... |
| 4 | 44 | 1989 | ZWE | I should like at the outset to express my del... |
Now, we can merge our two data frames. We will keep all the columns from the original `undf` DataFrame and add country, continent, and sub_region from the `c_codes` DataFrame. We will merge the two data frames on the `code_3` column. That is, for every row in `undf`, we will look for a row in the `c_codes` DataFrame where the values for `code_3` match. Pandas will then add the rest of the columns from the matching row in `c_codes` to the current row in `undf`.
In the `c_codes` data frame, `code_3` is the “primary key”, as no two rows have the same value for `code_3`. In the `undf` data frame, `code_3` is a “foreign key”, as we use it to look up additional information in a table where `code_3` is a primary key. More on this when we study SQL queries.
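You can check the primary key claim directly: a column can only serve as a primary key if none of its values repeat. A quick sanity check using the `is_unique` property of a Pandas Series:

c_codes.code_3.is_unique  # True: each code appears exactly once, so code_3 can act as a primary key
undf.code_3.is_unique     # False: a country appears once per speech, so code_3 is only a foreign key here

Note also that we do not have to tell `merge` which column to join on below: by default, Pandas joins on the columns the two data frames have in common, which here is just `code_3`.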
undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']])
undfe.head()
|  | session | year | code_3 | text | country | continent | sub_region |
|---|---|---|---|---|---|---|---|
| 0 | 44 | 1989 | MDV | It is indeed a pleasure for me and the member... | Maldives | Asia | Southern Asia |
| 1 | 68 | 2013 | MDV | I wish to begin by \nextending my heartfelt co... | Maldives | Asia | Southern Asia |
| 2 | 63 | 2008 | MDV | I am delivering this \nstatement on behalf of ... | Maldives | Asia | Southern Asia |
| 3 | 46 | 1991 | MDV | Allow me at the outset on behalf of the deleg... | Maldives | Asia | Southern Asia |
| 4 | 41 | 1986 | MDV | It is indeed a pleasure for me and all the mem... | Maldives | Asia | Southern Asia |
undfe[undfe.code_3 == 'EU']
| session | year | code_3 | text | country | continent | sub_region |
|---|---|---|---|---|---|---|
Wait! What? What happened to the EU?! Why did it disappear after the merge? What else may have disappeared? The reason the EU disappeared is that it is not in the `c_codes` data frame, and, as you may recall, `merge` performs an inner join by default, the equivalent of a set intersection. That is, the key must be in BOTH data frames for a row to appear in the result. We can instead do our merge using an outer join to preserve the data, then see which countries have no text and which texts have no country name.
undfe = undf.merge(c_codes[['code_3', 'country', 'continent', 'sub_region']],
                   how='outer')
undfe.head()
|  | session | year | code_3 | text | text_len | country | continent | sub_region |
|---|---|---|---|---|---|---|---|---|
| 0 | 44.0 | 1989.0 | MDV | It is indeed a pleasure for me and the member... | 3011.0 | Maldives | Asia | Southern Asia |
| 1 | 68.0 | 2013.0 | MDV | I wish to begin by \nextending my heartfelt co... | 2252.0 | Maldives | Asia | Southern Asia |
| 2 | 63.0 | 2008.0 | MDV | I am delivering this \nstatement on behalf of ... | 1909.0 | Maldives | Asia | Southern Asia |
| 3 | 46.0 | 1991.0 | MDV | Allow me at the outset on behalf of the deleg... | 2330.0 | Maldives | Asia | Southern Asia |
| 4 | 41.0 | 1986.0 | MDV | It is indeed a pleasure for me and all the mem... | 2630.0 | Maldives | Asia | Southern Asia |
Now let’s see which country names are not filled in.
undfe[undfe.country.isna()].code_3.unique()
array(['YDYE', 'CSK', 'YUG', 'DDR', 'EU'], dtype=object)
undfe[undfe.text.isna()].code_3.unique()
array(['ALA', 'ASM', 'AIA', 'ATA', 'ABW', 'BMU', 'BES', 'BVT', 'IOT',
'CYM', 'CXR', 'CCK', 'COK', 'CUW', 'FLK', 'FRO', 'GUF', 'PYF',
'ATF', 'GIB', 'GRL', 'GLP', 'GUM', 'GGY', 'HMD', 'HKG', 'IMN',
'JEY', 'MAC', 'MTQ', 'MYT', 'MSR', 'NCL', 'NIU', 'NFK', 'MNP',
'PCN', 'PRI', 'REU', 'BLM', 'SHN', 'MAF', 'SPM', 'SRB', 'SXM',
'SGS', 'SJM', 'TWN', 'TKL', 'TCA', 'UMI', 'VGB', 'VIR', 'WLF',
'ESH'], dtype=object)
undfe[undfe.text.isna()].country.unique()
array(['Åland Islands', 'American Samoa', 'Anguilla', 'Antarctica',
'Aruba', 'Bermuda', 'Bonaire, Sint Eustatius and Saba',
'Bouvet Island', 'British Indian Ocean Territory',
'Cayman Islands', 'Christmas Island', 'Cocos (Keeling) Islands',
'Cook Islands', 'Curaçao', 'Falkland Islands (Malvinas)',
'Faroe Islands', 'French Guiana', 'French Polynesia',
'French Southern Territories', 'Gibraltar', 'Greenland',
'Guadeloupe', 'Guam', 'Guernsey',
'Heard Island and McDonald Islands', 'Hong Kong', 'Isle of Man',
'Jersey', 'Macao', 'Martinique', 'Mayotte', 'Montserrat',
'New Caledonia', 'Niue', 'Norfolk Island',
'Northern Mariana Islands', 'Pitcairn', 'Puerto Rico', 'Réunion',
'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha',
'Saint Martin (French part)', 'Saint Pierre and Miquelon',
'Serbia', 'Sint Maarten (Dutch part)',
'South Georgia and the South Sandwich Islands',
'Svalbard and Jan Mayen', 'Taiwan, Province of China', 'Tokelau',
'Turks and Caicos Islands', 'United States Minor Outlying Islands',
'Virgin Islands (British)', 'Virgin Islands (U.S.)',
'Wallis and Futuna', 'Western Sahara'], dtype=object)
Fill in the country names for YDYE, CSK, YUG, DDR, and EU by hand.
undfe.loc[undfe.code_3 == 'EU', 'country'] = 'European Union'
by_country = undfe.groupby('country', as_index=False)['text'].count()
by_country.loc[by_country.text.idxmin()]
country South Sudan
text 5
Name: 161, dtype: object
c_codes[c_codes.code_2 == 'EU']
| country | code_2 | code_3 | country_code | iso_3166_2 | continent | sub_region | region_code | sub_region_code |
|---|---|---|---|---|---|---|---|---|
I suspect that EU indicates the European Union, which has a place in the UN but is not a country.
South Sudan has only spoken 5 times. Why is that? There is a very logical explanation (hint: when did South Sudan become an independent country?), and it may make you want to check out the 5 or 10 countries that have spoken the least.
But why did EU seem to disappear in the first place? When we do an inner merge and a key is missing from one of the data frames, the matching rows are not included in the final result.
len(undfe)
7406
len(undf.code_3.unique())
199
len(undfe.code_3.unique())
194
set(undf.code_3.unique()) - set(undfe.code_3.unique())
{'CSK', 'DDR', 'EU', 'YDYE', 'YUG'}
Can you figure out what each of the above stands for? Why are they not in the list presented earlier?
At this point, you may want to edit the CSV file and add the data for these countries to the file. Then, you can rerun the whole notebook and we will not lose as much data.
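If you would rather patch the names in place, the same `loc` assignment we used for the EU works for the rest. These codes are historical: YDYE is the People’s Democratic Republic of Yemen (South Yemen), CSK is Czechoslovakia, YUG is Yugoslavia, and DDR is the German Democratic Republic (East Germany). The exact name strings below are one reasonable choice:

undfe.loc[undfe.code_3 == 'YDYE', 'country'] = 'Yemen, Democratic'
undfe.loc[undfe.code_3 == 'CSK', 'country'] = 'Czechoslovakia'
undfe.loc[undfe.code_3 == 'YUG', 'country'] = 'Yugoslavia'
undfe.loc[undfe.code_3 == 'DDR', 'country'] = 'German Democratic Republic'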
8.3.1. Tidy Data
A lot of the work in data science revolves around getting data into the proper format for analysis. A lot of data comes in messy formats for many different reasons. But if we apply some basic principles from the world of database design, data modeling, and some common sense (as outlined in the Hadley Wickham paper), we can whip our data into shape. Wickham says that tidy data has the following attributes.
1. Each variable belongs in a column and contains values.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
How does our United Nations data stack up? Pretty well. We have four columns: session, year, country, and text. If we think of the text of the speech as the thing we can observe, then each row does form an observation, and session, year, and country are attributes that identify this particular observation.
Some of the common kinds of messiness that Wickham identifies include the following.

- Column headers are values, not variable names. Imagine this table if we had one row for each year and a column for each country’s text! Now that would not be tidy! (See the sketch after this list.)
- Multiple variables are stored in one column. We’ve seen this untidiness in the movie data a couple of chapters ago. We’ll revisit that very soon to deal with it correctly.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
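To make the first kind of messiness concrete, here is a minimal sketch with made-up data showing how the Pandas `melt` method gathers value-laden column headers back into a single variable:

wide = pd.DataFrame({'year': [1989, 1990],
                     'USA': ['speech a', 'speech b'],
                     'FRA': ['speech c', 'speech d']})
tidy = wide.melt(id_vars='year', var_name='code_3', value_name='text')  # one row per (year, country) speech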
Many of the problems with untidy data stem from not knowing how to handle relationships between multiple entities. Most of the time, the things we want to observe interact with other things we can observe, and trouble arises when we try to combine them into a single data frame. There are three kinds of relationships that we should consider.

- one-to-one relationships
- one-to-many relationships
- many-to-many relationships
An example of a one-to-one relationship is a person and their passport. A person can have one passport, and a given passport belongs to only one person. There is data we can collect about a person, which could be stored in one DataFrame. There is also data we can collect from a passport, such as the countries the holder has visited and the place the passport was issued, which could be stored in another DataFrame.
An example of a one-to-many relationship is a customer and the things they have ordered from Amazon. A particular customer may have ordered many things, but a given order can only belong to a single customer.
An example of a many-to-many relationship is a student and a class. A student can be enrolled in more than one class, and a class can have many students who are enrolled in it.
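In a tidy layout, a many-to-many relationship gets a table of its own, in which each row links one entity to one other entity. A tiny sketch with made-up enrollment data:

enrollments = pd.DataFrame({'student': ['Ann', 'Ann', 'Ben'],
                            'class': ['CS101', 'MATH200', 'CS101']})  # each row: one student enrolled in one class

This two-column linking shape is exactly what we will build for movies and genres in the next section.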
Whenever you see a DataFrame with a column that contains a list or a dictionary, that is a sure sign of untidiness. It is also something that can be fixed and, in the end, will make your analysis easier.
8.3.2. Tidying the Movie Genres
Let’s look at the genres column of the movies dataset. You may recall that it looks odd. Here is the result of `df.iloc[0].genres`.
"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
It looks like a list of dictionary literals, except that it is in double quotes like a string. Let’s first figure out how we can get it to be an actual list of dictionaries. Then, we’ll figure out what to do with it. Python has a nifty function called `eval` that allows you to evaluate a Python expression that is stored in a string.
eval(df.iloc[0].genres)
[{'id': 16, 'name': 'Animation'},
{'id': 35, 'name': 'Comedy'},
{'id': 10751, 'name': 'Family'}]
Even better, we can assign the result of `eval` to a variable, and then we can use the list and dictionary index syntax to access parts of the result, just like we learned about when we discussed JSON in an earlier chapter.
glist = eval(df.iloc[0].genres)
glist[1]['name']
'Comedy'
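One caution before using `eval` routinely: it will evaluate any Python expression, including harmful ones, so on data you do not fully trust the standard library’s `ast.literal_eval` is a safer choice, as it only accepts literals like the lists and dictionaries we have here:

import ast
glist = ast.literal_eval(df.iloc[0].genres)  # same result, but refuses to execute arbitrary code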
One way we could solve this is to duplicate each movie’s row once per genre, storing one genre per row, but that would mean needlessly duplicating all of the other information about our first movie three times.
A better strategy for solving this problem is to create a new DataFrame with just two columns: one containing the movie’s unique id number, and a second containing the genre. This allows you to use the `merge` method on the two data frames, but only temporarily, when you need to know the genres of a particular movie.
To construct this table, we need to iterate over all the rows of the DataFrame and gather the genres for each movie. For each genre of the movie, we will add an item to a list that contains the `imdb_id` of the movie and add an item to a list that contains the name of the genre. These two lists stay in sync with each other, so that the i-th element of each list refers to the same movie.
Here is some code you can use to construct the two lists.
movie_list = []
genre_list = []

def map_genres(row):
    # Parse the genres string into an actual list of dictionaries.
    try:
        glist = eval(row.genres)
    except Exception:
        glist = []
        print(f"bad data for {row.title}")
    # Record one (movie, genre) pair for each genre of this movie.
    for g in glist:
        movie_list.append(row.imdb_id)
        genre_list.append(g['name'])

_ = df.apply(map_genres, axis=1)
Using these two lists, construct a new DataFrame with a column for `imdb_id` and a column for `genre`.
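If you get stuck, here is one way to do it (a sketch; `genre_df` is simply the name chosen here):

genre_df = pd.DataFrame({'imdb_id': movie_list, 'genre': genre_list})  # one row per (movie, genre) pair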
Now let’s calculate the average revenue for the Comedy genre. We’ll do this in a couple of steps.

1. We will reduce the genre DataFrame so it only has the comedies left.
2. Then we will merge the movie data frame with the genres DataFrame using the `imdb_id` column.

We will be left with a DataFrame that only contains the rows for the movies that are comedies. You can think of a merge like this as being the intersection of the set of comedies and the set of all movies.
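Here is a sketch of those two steps, assuming the `genre_df` built above and a `revenue` column in the movies DataFrame:

comedies = genre_df[genre_df.genre == 'Comedy']   # step 1: keep only the comedy rows
comedy_movies = df.merge(comedies, on='imdb_id')  # step 2: the inner join keeps only movies that are comedies
comedy_movies.revenue.mean()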
Problems to work on:

- What is the total revenue for each genre?
- What is the average `vote_average` for each genre?
- What genre has the most votes?
- Use a similar process to create a data frame of collections and their movies. Which collection has the most movies?
- Again, a similar process can be used for `spoken_languages`. How many movies are there for each language? Is English the most popular movie language?