How to select a subset of data using lexicographic slicing in Python Pandas?

Introduction

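Pandas offers two ways to select a subset of data: by index position or by index label. In this post, I will show you how to select a subset of data using lexicographic slicing, that is, slicing a string index by alphabetical order.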
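To make the two selection styles concrete, here is a minimal sketch. The three-row frame and its titles are made up for illustration and are not part of the movie dataset used later.

```python
import pandas as pd

# Hypothetical three-row frame; titles and budgets are invented for illustration.
df = pd.DataFrame(
    {"budget": [10, 20, 30]},
    index=pd.Index(["Alpha", "Bravo", "Charlie"], name="title"),
)

# Selection by position: rows 0 and 1 (the stop position is excluded).
by_position = df.iloc[0:2]

# Selection by label: "Alpha" through "Bravo" (the stop label is included).
by_label = df.loc["Alpha":"Bravo"]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Both selections return the same two rows here, but note the difference at the end of the range: positional slices exclude the stop, label slices include it.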

Google is full of datasets. Search kaggle.com for a movies dataset. This post uses a movies dataset made available on Kaggle.

How to do it

1. Import the movies dataset with only the columns required for this example.

import pandas as pd
import numpy as np
movies = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/movies_data.csv",index_col="title",
usecols=["title","budget","vote_average","vote_count"])
movies.sample(n=5)

title	budget	vote_average	vote_count
Little Voice	0	6.6	61
Grown Ups 2	80000000	5.8	1155
The Best Years of Our Lives	2100000	7.6	143
Tusk	2800000	5.1	366
Operation Chromite	0	5.8	29

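2. I always recommend sorting the index, especially when it is made up of strings. On a huge dataset you will notice the difference once the index is sorted.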

What if I don't sort the index?

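No problem, your code will just run forever. Just kidding. If the index labels are not sorted, pandas has to walk through all the labels one by one to match your query. Imagine an Oxford dictionary without an index page: what would you do? With a sorted index you can jump straight to the label you want to extract, and the same is true for pandas.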
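As a rough illustration of that "jump straight to the label" idea, the sketch below uses Index.searchsorted, which performs a binary search on a sorted index. The four titles are a made-up example, and this only illustrates the principle rather than pandas' full internal lookup path.

```python
import pandas as pd

# Hypothetical, already sorted index of titles.
titles = pd.Index(["Avatar", "Brazil", "Casino", "Dune"])

# Binary search is only meaningful when the index is sorted.
print(titles.is_monotonic_increasing)

# Position where the label "B" would be inserted to keep the index sorted.
pos = titles.searchsorted("B")
print(pos)
```

Because "B" sorts after "Avatar" and before "Brazil", the search lands on position 1 without comparing against every label.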

Let's first check whether the index is sorted.

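# Check whether the index is sorted.
# Index.is_monotonic was deprecated in pandas 1.5 and removed in 2.0;
# is_monotonic_increasing is the equivalent check.
movies.index.is_monotonic_increasing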

False

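3. Clearly, the index is not sorted. Let's try to select the movies whose titles start with A anyway. This is roughly like writing the SQL query

select * from movies where title like 'A%'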

movies.loc["Aa":"Bb"]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4844             try:
-> 4845                 return self._searchsorted_monotonic(label, side)
   4846             except ValueError:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
   4805
-> 4806         raise ValueError("index must be monotonic increasing or decreasing")
   4807

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
in
----> 1 movies.loc["Aa": "Bb"]

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1910         if isinstance(key, slice):
   1911             self._validate_key(key, axis)
-> 1912             return self._get_slice_axis(key, axis=axis)
   1913         elif com.is_bool_indexer(key):
   1914             return self._getbool_axis(key, axis=axis)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis)
   1794
   1795         labels = obj._get_axis(axis)
-> 1796         indexer = labels.slice_indexer(
   1797             slice_obj.start, slice_obj.stop, slice_obj.step, kind=self.name
   1798         )

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_indexer(self, start, end, step, kind)
   4711         slice(1, 3)
   4712         """
-> 4713         start_slice, end_slice = self.slice_locs(start, end, step=step, kind=kind)
   4714
   4715         # return a slice

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in slice_locs(self, start, end, step, kind)
   4924         start_slice = None
   4925         if start is not None:
-> 4926             start_slice = self.get_slice_bound(start, "left", kind)
   4927         if start_slice is None:
   4928             start_slice = 0

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4846             except ValueError:
   4847                 # raise the original KeyError
-> 4848                 raise err
   4849
   4850         if isinstance(slc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_slice_bound(self, label, side, kind)
   4840         # we need to look up the label
   4841         try:
-> 4842             slc = self.get_loc(label)
   4843         except KeyError as err:
   4844             try:

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

KeyError: 'Aa'
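The failure above can be reproduced on a tiny frame. The sketch below uses three made-up titles in unsorted order (an assumption for illustration, not the real dataset), catches the KeyError, and shows a boolean mask with str.startswith as one possible fallback when the index cannot be sorted.

```python
import pandas as pd

# Hypothetical frame whose index is deliberately NOT sorted.
df = pd.DataFrame(
    {"budget": [3, 1, 2]},
    index=pd.Index(["Casino", "Avatar", "Brazil"], name="title"),
)

# "Aa" and "Bb" are not labels in the index, and the index is unsorted,
# so pandas cannot locate the slice bounds and raises KeyError.
try:
    df.loc["Aa":"Bb"]
    slice_failed = False
except KeyError:
    slice_failed = True

# A boolean mask does not need a sorted index; it checks every label.
starts_with_a = df[df.index.str.startswith("A")]

print(slice_failed)
print(starts_with_a.index.tolist())
```

The mask works on any index, but it scans all labels, which is exactly the cost that sorting the index avoids.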

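4. Sort the index in ascending order, then run the same check again. Once the index is sorted, the same slice command can take advantage of lexicographic ordering.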
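This step can be sketched as follows. I am assuming DataFrame.sort_index for the sort, and the small frame is a made-up stand-in for movies.

```python
import pandas as pd

# Hypothetical stand-in for the movies frame, with an unsorted title index.
df = pd.DataFrame(
    {"budget": [3, 1, 2]},
    index=pd.Index(["Casino", "Avatar", "Brazil"], name="title"),
)

# sort_index sorts ascending by default and returns a new frame.
df = df.sort_index()

# The same check as before now reports a sorted index.
print(df.index.is_monotonic_increasing)
```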

True

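5. Now our data is ready for lexicographic slicing. Running the same movies.loc["Aa":"Bb"] slice again selects every movie whose title sorts between "Aa" and "Bb".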

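title	budget	vote_average	vote_count
Abandon	25000000	4.6	45
Abandoned	0	5.8	27
Abduction	35000000	5.6	961
Aberdeen	0	7.0	6
About Last Night	12500000	6.0	210
...	...	...	...
Battle for the Planet of the Apes	1700000	5.5	215
Battle of the Year	20000000	5.9	88
Battle: Los Angeles	70000000	5.5	1448
Battlefield Earth	44000000	3.0	255
Battleship	209000000	5.5	2114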
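A few details of this slice are easy to miss, so here is a small sketch on a made-up, already sorted index. The titles are chosen only to show where the boundaries fall.

```python
import pandas as pd

# Hypothetical sorted index; "A.I." sorts before "Abandon" because "." < "b".
df = pd.DataFrame(
    {"budget": [1, 2, 3, 4, 5]},
    index=pd.Index(
        ["A.I.", "Abandon", "Avatar", "Battleship", "Brazil"], name="title"
    ),
)

# Neither "Aa" nor "Bb" has to exist as a label when the index is sorted;
# pandas keeps every label that sorts between the two bounds.
subset = df.loc["Aa":"Bb"]

print(subset.index.tolist())
```

"A.I." is left out because it sorts before "Aa", and "Brazil" is left out because it sorts after "Bb". So the slice is not quite the same as "all titles starting with A or B": the comparison is character by character.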

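With the index sorted in descending order instead, the first rows look like this:

title	budget	vote_average	vote_count
Æon Flux	62000000	5.4	703
xXx: State of the Union	60000000	4.7	549
xXx	70000000	5.8	1424
eXistenZ	15000000	6.7	475
[REC]²	5600000	6.4	489

Running the same "Aa" to "Bb" slice now returns only the column headers:

title	budget	vote_average	vote_count

No surprise to see an empty DataFrame: the data is sorted in reverse order, so "Aa" now comes after "Bb". Let's swap the two bounds and run it again.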

title	budget	vote_average	vote_count
B-Girl	0	5.5	7
Ayurveda: Art of Being	300000	5.5	3
Away We Go	17000000	6.7	189
Awake	86000000	6.3	395
Avengers: Age of Ultron	280000000	7.3	6767
...	...	...	...
About Last Night	12500000	6.0	210
Aberdeen	0	7.0	6
Abduction	35000000	5.6	961
Abandoned	0	5.8	27
Abandon	25000000	4.6	45
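The descending case can be sketched the same way. I am assuming sort_index(ascending=False) for the reverse sort, and the four titles are made up for illustration.

```python
import pandas as pd

# Hypothetical frame, sorted in descending order of title.
df = pd.DataFrame(
    {"budget": [1, 2, 3, 4]},
    index=pd.Index(["Abandon", "Avatar", "Battleship", "Brazil"], name="title"),
).sort_index(ascending=False)

# On a descending index, "Aa" comes after "Bb", so this range is empty.
empty = df.loc["Aa":"Bb"]

# Swapping the bounds follows the direction of the index and returns rows.
reversed_bounds = df.loc["Bb":"Aa"]

print(empty.empty)
print(reversed_bounds.index.tolist())
```

The rows come back in the index's own order, from "Battleship" down to "Abandon", which mirrors the descending output above.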
