16

I have one column containing all the data which looks something like this (values that need to be separated have a mark like (c)):

UK (c)
London
Wales
Liverpool
US (c)
Chicago
New York
San Francisco
Seattle
Australia (c)
Sydney
Perth

And I want it split into two columns looking like this:

London          UK
Wales           UK
Liverpool       UK
Chicago         US
New York        US
San Francisco   US
Seattle         US
Sydney          Australia
Perth           Australia

Question 2: What if the countries did not have a pattern like (c)?

1
  • 1
    It's a somewhat interesting string manipulation problem and by this tag's usual standards a relatively decent question with sample data and clearly specified expected output. I'm not complaining ¯\_(ツ)_/¯
    – cs95
    Commented Jun 28, 2019 at 2:45

5 Answers 5

10

Step by step with endswith and ffill + str.strip

df['country']=df.loc[df.city.str.endswith('(c)'),'city']
df.country=df.country.ffill()
df=df[df.city.ne(df.country)]
df.country=df.country.str.strip('(c)')
2
  • What if the countries did not have a pattern like (c)?
    – Tsatsa
    Commented Jun 27, 2019 at 15:17
  • 1
    @Tsatsa in that case you may need build a country list , and using isin
    – BENY
    Commented Jun 27, 2019 at 15:18
7

extract and ffill

Start with extract and ffill, then remove redundant rows.

df['country'] = (
    df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill())
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)

            data    country
0         London         UK
1          Wales         UK
2      Liverpool         UK
3        Chicago         US
4       New York         US
5  San Francisco         US
6        Seattle         US
7         Sydney  Australia
8          Perth  Australia

Where,

df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill()

0            UK
1            UK
2            UK
3            UK
4            US
5            US
6            US
7            US
8            US
9     Australia
10    Australia
11    Australia
Name: country, dtype: object

The pattern '(.*)\s+\(c\)' matches strings of the form "country (c)" and extracts the country name. Anything not matching this pattern is replaced with NaN, so you can conveniently forward fill on rows.


split with np.where and ffill

This splits on "(c)".

u = df['data'].str.split(r'\s+\(c\)')
df['country'] = pd.Series(np.where(u.str.len() == 2, u.str[0], np.nan)).ffill()

df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)

            data    country
0         London         UK
1          Wales         UK
2      Liverpool         UK
3        Chicago         US
4       New York         US
5  San Francisco         US
6        Seattle         US
7         Sydney  Australia
8          Perth  Australia
1
  • extract('(.*)\s+\(c\)') saves you from .str.strip(). Commented Jun 27, 2019 at 14:09
6

You can first use str.extract to locate the cities ending in (c) and extract the country name, and ffill to populate a new country column.

The same extracted matches can be use to locate the rows to be dropped, i.e. rows which are notna:

m = df.city.str.extract('^(.*?)(?=\(c\)$)')
ix = m[m.squeeze().notna()].index
df['country'] = m.ffill()
df.drop(ix)

            city     country
1          London         UK 
2           Wales         UK 
3       Liverpool         UK 
5         Chicago         US 
6        New York         US 
7   San Francisco         US 
8         Seattle         US 
10         Sydney  Australia 
11          Perth  Australia 
5

You can use np.where with str.contains too:

mask = df['places'].str.contains('(c)', regex = False)
df['country'] = np.where(mask, df['places'], np.nan)
df['country'] = df['country'].str.replace('\(c\)', '').ffill()
df = df[~mask]
df
            places     country
1          London         UK 
2           Wales         UK 
3       Liverpool         UK 
5         Chicago         US 
6        New York         US 
7   San Francisco         US 
8         Seattle         US 
10         Sydney  Australia 
11          Perth  Australia 

The str contains looks for (c) and if present will return True for that index. Where this condition is True, the country value will be added to the country columns

3

You could do the following:

data = ['UK (c)','London','Wales','Liverpool','US (c)','Chicago','New York','San Francisco','Seattle','Australia (c)','Sydney','Perth']
df = pd.DataFrame(data, columns = ['city'])
df['country'] = df.city.apply(lambda x : x.replace('(c)','') if '(c)' in x else None)
df.fillna(method='ffill', inplace=True)
df = df[df['city'].str.contains('\(c\)')==False]

Output

+-----+----------------+-----------+
|     |     city       |  country  |
+-----+----------------+-----------+
|  1  | London         | UK        |
|  2  | Wales          | UK        |
|  3  | Liverpool      | UK        |
|  5  | Chicago        | US        |
|  6  | New York       | US        |
|  7  | San Francisco  | US        |
|  8  | Seattle        | US        |
| 10  | Sydney         | Australia |
| 11  | Perth          | Australia |
+-----+----------------+-----------+

Not the answer you're looking for? Browse other questions tagged or ask your own question.