How do I create a new column in a dataframe from an existing column using conditions?

Question

I have one column containing all the data which looks something like this (values that need to be separated have a mark like (c)):

UK (c)
London
Wales
Liverpool
US (c)
Chicago
New York
San Francisco
Seattle
Australia (c)
Sydney
Perth

And I want it split into two columns looking like this:

London          UK
Wales           UK
Liverpool       UK
Chicago         US
New York        US
San Francisco   US
Seattle         US
Sydney          Australia
Perth           Australia

Question 2: What if the countries did not have a pattern like (c)?

It's a somewhat interesting string manipulation problem and by this tag's usual standards a relatively decent question with sample data and clearly specified expected output. I'm not complaining ¯\_(ツ)_/¯ — cs95, Commented Jun 28, 2019 at 2:45

BENY · Accepted Answer · 2019-06-27 14:16:38Z

Step by step with endswith and ffill + str.strip

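# Mark country rows (those ending in '(c)') and copy them to a new column
df['country'] = df.loc[df.city.str.endswith('(c)'), 'city']
df['country'] = df['country'].ffill()
# Drop the country header rows; .copy() avoids SettingWithCopyWarning below
df = df[df.city.ne(df.country)].copy()
# Remove the literal '(c)' suffix (str.removesuffix needs pandas >= 1.4), then trim
df['country'] = df['country'].str.removesuffix('(c)').str.strip()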
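
A caveat worth knowing when removing the marker with str.strip: the argument is treated as a set of characters to trim from both ends, not as a literal suffix, and the space before the marker is left behind. The sketch below uses a made-up lowercase value, 'china (c)', purely to show the effect; str.removesuffix assumes pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical values, chosen only to show the difference
s = pd.Series(['UK (c)', 'china (c)'])

# strip() takes a set of characters ('(', 'c', ')'), not a suffix
stripped = s.str.strip('(c)')

# Remove the literal suffix, then trim the leftover whitespace
cleaned = s.str.removesuffix('(c)').str.strip()

print(stripped.tolist())  # ['UK ', 'hina ']
print(cleaned.tolist())   # ['UK', 'china']
```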

edited Jun 27, 2019 at 14:16

answered Jun 27, 2019 at 14:10


What if the countries did not have a pattern like (c)?
– Tsatsa
Commented Jun 27, 2019 at 15:17

@Tsatsa In that case you may need to build a list of country names and use isin.
– BENY
Commented Jun 27, 2019 at 15:18
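
For the case without a marker, the isin suggestion could look roughly like the sketch below. The countries list is an assumption that you would have to supply yourself, and the shortened sample data and the names is_country and result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'city': ['UK', 'London', 'Wales', 'US', 'Chicago',
                            'Australia', 'Sydney', 'Perth']})

# Assumed lookup list of known country names (hypothetical, supply your own)
countries = ['UK', 'US', 'Australia']

# True on rows whose value is a known country
is_country = df['city'].isin(countries)

# Keep the value on country rows, NaN elsewhere, then forward fill
df['country'] = df['city'].where(is_country).ffill()

# Drop the country header rows
result = df[~is_country].reset_index(drop=True)
print(result)
```

This only works if no city shares its exact name with an entry in the list.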


cs95 · Accepted Answer · 2019-06-27 14:16:38Z

`extract` and `ffill`

Start with extract and ffill, then remove redundant rows.

df['country'] = (
    df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill())
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)

            data    country
0         London         UK
1          Wales         UK
2      Liverpool         UK
3        Chicago         US
4       New York         US
5  San Francisco         US
6        Seattle         US
7         Sydney  Australia
8          Perth  Australia

Where,

df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill()

0            UK
1            UK
2            UK
3            UK
4            US
5            US
6            US
7            US
8            US
9     Australia
10    Australia
11    Australia
Name: country, dtype: object

The pattern '(.*)\s+\(c\)' matches strings of the form "country (c)" and extracts the country name. Anything not matching this pattern is replaced with NaN, so you can conveniently forward fill on rows.
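
To see the NaN-then-forward-fill behaviour on its own, here is a self-contained sketch built from the sample data in the question. The column name data follows the snippet above; the names extracted and result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data': [
    'UK (c)', 'London', 'Wales', 'Liverpool',
    'US (c)', 'Chicago', 'New York', 'San Francisco', 'Seattle',
    'Australia (c)', 'Sydney', 'Perth',
]})

# Country rows yield the captured name; every other row yields NaN
extracted = df['data'].str.extract(r'(.*)\s+\(c\)', expand=False)

# Forward fill carries each country down to the rows beneath it
df['country'] = extracted.ffill()

# Rows where nothing was extracted are the city rows we want to keep
result = df[extracted.isna()].reset_index(drop=True)
print(result)
```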

`split` with `np.where` and `ffill`

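This splits each string on the " (c)" marker. Rows that split into two parts are country rows; all other rows get NaN and are then forward filled.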

import numpy as np
import pandas as pd
u = df['data'].str.split(r'\s+\(c\)')
df['country'] = pd.Series(np.where(u.str.len() == 2, u.str[0], np.nan), index=df.index).ffill()

df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)

            data    country
0         London         UK
1          Wales         UK
2      Liverpool         UK
3        Chicago         US
4       New York         US
5  San Francisco         US
6        Seattle         US
7         Sydney  Australia
8          Perth  Australia

extract('(.*)\s+\(c\)') saves you from .str.strip().
– Quang Hoang
Commented Jun 27, 2019 at 14:09

yatu · Accepted Answer · 2019-06-27 14:17:52Z

You can first use str.extract to locate the rows ending in (c) and extract the country name, then ffill to populate a new country column.

The same extracted matches can be used to locate the rows to be dropped, i.e. the rows where the match is not NaN (notna):

# Capture the country name from rows ending in '(c)'; other rows become NaN
m = df.city.str.extract(r'^(.*?)\s*(?=\(c\)$)', expand=False)
ix = m[m.notna()].index
df['country'] = m.ffill()
df.drop(ix)

             city    country
1          London         UK
2           Wales         UK
3       Liverpool         UK
5         Chicago         US
6        New York         US
7   San Francisco         US
8         Seattle         US
10         Sydney  Australia
11          Perth  Australia

Mohit Motwani · Accepted Answer · 2019-06-27 14:13:51Z

You can use np.where with str.contains too:

import numpy as np

mask = df['places'].str.contains('(c)', regex=False)
df['country'] = np.where(mask, df['places'], np.nan)
# regex=True is required in pandas >= 2.0, where str.replace is literal by default
df['country'] = df['country'].str.replace(r'\s*\(c\)', '', regex=True).ffill()
df = df[~mask]
df

           places    country
1          London         UK
2           Wales         UK
3       Liverpool         UK
5         Chicago         US
6        New York         US
7   San Francisco         US
8         Seattle         US
10         Sydney  Australia
11          Perth  Australia

str.contains looks for the literal (c) and returns True for each row that contains it. Where this condition is True, the row's value is copied into the country column; everywhere else the column gets NaN, which ffill then fills with the most recent country.

Sebastien D · Accepted Answer · 2019-06-27 14:23:59Z

You could do the following:

import pandas as pd

data = ['UK (c)','London','Wales','Liverpool','US (c)','Chicago','New York','San Francisco','Seattle','Australia (c)','Sydney','Perth']
df = pd.DataFrame(data, columns=['city'])
# Keep the country name on marker rows, None elsewhere
df['country'] = df.city.apply(lambda x: x.replace('(c)', '').strip() if '(c)' in x else None)
# fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent
df['country'] = df['country'].ffill()
df = df[~df['city'].str.contains('(c)', regex=False)]

Output

+-----+----------------+-----------+
|     |     city       |  country  |
+-----+----------------+-----------+
|  1  | London         | UK        |
|  2  | Wales          | UK        |
|  3  | Liverpool      | UK        |
|  5  | Chicago        | US        |
|  6  | New York       | US        |
|  7  | San Francisco  | US        |
|  8  | Seattle        | US        |
| 10  | Sydney         | Australia |
| 11  | Perth          | Australia |
+-----+----------------+-----------+
