5.5. Numbers as Indices
Enough about movie budgets; it’s time to budget my time instead. Because I schedule my day to the minute, I like to be able to look up movies by their runtime, so that when I have a spare two hours and 34 minutes, I can find all the movies that would fit precisely in that time slot. (Popcorn-making time is budgeted separately.)
Before you start, here is a refresher on the index operator in Pandas.
Selecting Columns of a DataFrame
df[<string>] gets me a column and returns the Series corresponding to that column.
df[<list of strings>] gets me a bunch of columns and returns a DataFrame.
Selecting Rows of a DataFrame
df[<series/list of Boolean>] gets me the rows for which the corresponding element of the list-like thing you passed is True. However, I think this is confusing, and whenever you want to select some rows of a DataFrame you should use df.loc[].
df.loc[<series/list of Boolean>] behaves just like df[<series/list of Boolean>].
df.loc[<string>] uses the non-numeric row index and returns the row(s) for that index value.
df.loc[<string1>:<string2>] uses the non-numeric index and returns a data frame composed of the rows starting with string1 and ending with string2 (both endpoints are included).
df.loc[<list/Series of strings>] returns a data frame composed of each row from df with an index value that matches a string in the list.
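The selection forms listed above can be tried side by side on a tiny frame. This is only a sketch: the `movies` frame, its titles, and its index labels are made up for illustration and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movies dataset
movies = pd.DataFrame(
    {
        "title": ["Alpha", "Bravo", "Charlie", "Delta"],
        "runtime": [95, 154, 7, 120],
        "budget": [10, 20, 1, 15],
    },
    index=["a", "b", "c", "d"],
)

one_column = movies["title"]                # a string selects one column -> Series
two_columns = movies[["title", "runtime"]]  # a list of strings -> DataFrame

long_mask = movies["runtime"] > 100         # Boolean Series, one entry per row
long_rows = movies.loc[long_mask]           # rows where the mask is True

one_row = movies.loc["b"]                   # single label on a unique index -> Series
row_range = movies.loc["b":"d"]             # label slice, both ends included
picked = movies.loc[["a", "c"]]             # list of labels -> DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
print(list(long_rows.index))
print(list(row_range.index))
```

Note that the label slice includes "d", which is different from ordinary Python slicing, where the end is excluded.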
If you use an integer in any of the last three examples, it works just like the
string, but the index values are numeric instead. What is important (and
confusing) about this is that they use the index, not the position. So, if you
create a data frame with 4 rows of some data, it will have an index that is
created by default where the first row starts with 0, the next row is 1 and so
on. However, if you sort the data frame such that the last row becomes the first
and the first row becomes the last, using df.loc[0] on the sorted data frame
will return the last row.
If you want to be strictly positional, you should use df.iloc[0], which will
return the first row regardless of the index value. df.iloc[0:5] is the same
as doing df.head(), and df.iloc[[1, 3, 5, 7]] will return four rows: the
2nd, 4th, 6th and 8th.
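To make the difference between label and position concrete, here is a minimal sketch. The frames `small` and `numbers` and their contents are invented for this illustration.

```python
import pandas as pd

# Hypothetical four-row frame; pandas gives it the default index 0, 1, 2, 3
small = pd.DataFrame({"name": ["w", "x", "y", "z"]})

# Reverse the row order; each row keeps its original index label
flipped = small.sort_index(ascending=False)

by_label = flipped.loc[0]      # the row labelled 0, which is now physically last
by_position = flipped.iloc[0]  # the physically first row, which is labelled 3

print(by_label["name"])        # w
print(by_position["name"])     # z

# iloc slices and lists are purely positional
numbers = pd.DataFrame({"b": range(1, 12)})
same_as_head = numbers.iloc[0:5].equals(numbers.head())
every_other = numbers.iloc[[1, 3, 5, 7]]["b"].tolist()

print(same_as_head)            # True
print(every_other)             # [2, 4, 6, 8]
```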
import pandas as pd
df = pd.DataFrame({'a':list("pythonrocks"), 'b':[1,2,3,4,5,6,7,8,9,10,11]})
df = df.set_index('a')
df.loc['p':'n']
a | b
---|---
p | 1
y | 2
t | 3
h | 4
o | 5
n | 6
OK, but what if we do this:
df.loc['p':'o']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[2], line 1
----> 1 df.loc['p':'o']
File ~/.virtualenvs/build/lib/python3.11/site-packages/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
1189 maybe_callable = com.apply_if_callable(key, self.obj)
1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)
File ~/.virtualenvs/build/lib/python3.11/site-packages/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
1409 if isinstance(key, slice):
1410 self._validate_key(key, axis)
-> 1411 return self._get_slice_axis(key, axis=axis)
1412 elif com.is_bool_indexer(key):
1413 return self._getbool_axis(key, axis=axis)
File ~/.virtualenvs/build/lib/python3.11/site-packages/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
1440 return obj.copy(deep=False)
1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
1445 if isinstance(indexer, slice):
1446 return self.obj._slice(indexer, axis=axis)
File ~/.virtualenvs/build/lib/python3.11/site-packages/pandas/core/indexes/base.py:6708, in Index.slice_indexer(self, start, end, step)
6664 def slice_indexer(
6665 self,
6666 start: Hashable | None = None,
6667 end: Hashable | None = None,
6668 step: int | None = None,
6669 ) -> slice:
6670 """
6671 Compute the slice indexer for input labels and step.
6672
(...) 6706 slice(1, 3, None)
6707 """
-> 6708 start_slice, end_slice = self.slice_locs(start, end, step=step)
6710 # return a slice
6711 if not is_scalar(start_slice):
File ~/.virtualenvs/build/lib/python3.11/site-packages/pandas/core/indexes/base.py:6940, in Index.slice_locs(self, start, end, step)
6938 end_slice = None
6939 if end is not None:
-> 6940 end_slice = self.get_slice_bound(end, "right")
6941 if end_slice is None:
6942 end_slice = len(self)
File ~/.virtualenvs/build/lib/python3.11/site-packages/pandas/core/indexes/base.py:6867, in Index.get_slice_bound(self, label, side)
6865 slc = lib.maybe_booleans_to_slice(slc.view("u1"))
6866 if isinstance(slc, np.ndarray):
-> 6867 raise KeyError(
6868 f"Cannot get {side} slice bound for non-unique "
6869 f"label: {repr(original_label)}"
6870 )
6872 if isinstance(slc, slice):
6873 if side == "left":
KeyError: "Cannot get right slice bound for non-unique label: 'o'"
Pandas raises an error because there are two ‘o’s in the index, and it doesn’t know which one you mean: the first or the last? If you argue that it should use the last, consider the performance implications when the index is really large. In that case it would be very time consuming to search the whole index for the last occurrence.
On the other hand, if we sort the index, then the last instance can be found quite quickly, and with a sorted index loc will work for this example.
df = df.sort_index()
df.loc['c':'o']
a | b
---|---
c | 9
h | 4
k | 10
n | 6
o | 5
o | 8
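If you want to know ahead of time whether a label slice is likely to run into this problem, the index can be inspected before slicing. The sketch below rebuilds the same "pythonrocks" frame under a new name (`letters`, chosen here so it does not clobber `df`) and uses the `is_unique` and `is_monotonic_increasing` attributes of the index; treat it as one way to check, not the only one.

```python
import pandas as pd

# Same data as the example above, rebuilt under a different name
letters = pd.DataFrame(
    {"a": list("pythonrocks"), "b": list(range(1, 12))}
).set_index("a")

print(letters.index.is_unique)                # False: 'o' appears twice
print(letters.index.is_monotonic_increasing)  # False: the index is not sorted

# Slicing up to a duplicated label on an unsorted index raises KeyError
slice_failed = False
try:
    letters.loc["p":"o"]
except KeyError as err:
    slice_failed = True
    print("slice failed:", err)

# After sorting, the duplicated labels sit next to each other and the slice works
sorted_letters = letters.sort_index()
print(sorted_letters.index.is_monotonic_increasing)  # True
result = sorted_letters.loc["c":"o"]
print(list(result.index))
```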
5.5.1. Practice Questions
Create a Series called time_scheduler that is indexed by runtime and has the
movie’s title as its values. Note that you will need to use sort_index() in
order to be able to look up movies by their duration. Base your work on df
rather than budget_df.
While you’re at it, remove any movie that is less than 10 minutes (you can’t get into it if it’s too short) or longer than 3 hours (who’s got time for that?).
Hint: You may have to use pd.to_numeric to force the runtimes to be
numbers (instead of numbers in a string).
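As a small illustration of the hint, here is how pd.to_numeric behaves on a made-up Series of strings. The values in `raw` are invented, and `errors="coerce"` is one possible choice that turns anything unparseable into NaN; whether the real runtime column needs it depends on how the CSV was read.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as strings, plus some junk
raw = pd.Series(["90", "154", "not a number", None])

# errors="coerce" converts what it can and replaces the rest with NaN
runtimes = pd.to_numeric(raw, errors="coerce")

print(runtimes.dtype)     # float64
print(runtimes.tolist())
```

Once the values are numeric, comparisons such as `runtimes >= 10` produce a Boolean Series that can be passed to loc.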
Here is a simpler example that shows the movies that are 7 minutes long:
import pandas as pd
df = pd.read_csv("https://runestone.academy/ns/books/published/httlads/_static/movies_metadata.csv").dropna(axis=1, how='all')
time_scheduler = df.set_index('runtime')
time_scheduler = time_scheduler[['title', 'release_date']]
time_scheduler.loc[7].head()
runtime | title | release_date
---|---|---
7.0 | Balance | 1989-01-01
7.0 | Killer Bean 2: The Party | 2000-08-08
7.0 | The Employment | 2008-01-01
7.0 | Moscow Clad in Snow | 1909-04-09
7.0 | Paperman | 2012-11-02
Now let’s find all those two-hour-and-34-minute movies.
But what is the 155th shortest movie in this collection?