extract
and ffill
Start with extract
and ffill
, then remove redundant rows.
df['country'] = (
df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill())
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)
data country
0 London UK
1 Wales UK
2 Liverpool UK
3 Chicago US
4 New York US
5 San Francisco US
6 Seattle US
7 Sydney Australia
8 Perth Australia
Where,
df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill()
0 UK
1 UK
2 UK
3 UK
4 US
5 US
6 US
7 US
8 US
9 Australia
10 Australia
11 Australia
Name: country, dtype: object
The pattern '(.*)\s+\(c\)'
matches strings of the form "country (c)" and extracts the country name. Anything not matching this pattern is replaced with NaN, so you can conveniently forward fill on rows.
split
with np.where
and ffill
This splits on "(c)".
u = df['data'].str.split(r'\s+\(c\)')
df['country'] = pd.Series(np.where(u.str.len() == 2, u.str[0], np.nan)).ffill()
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)
data country
0 London UK
1 Wales UK
2 Liverpool UK
3 Chicago US
4 New York US
5 San Francisco US
6 Seattle US
7 Sydney Australia
8 Perth Australia