import pandas as pd
df=pd.DataFrame({'state':['NY','CA','CT','NJ','NY','FL','FL','CA'],\
'source':['source_a','source_a','source_a','source_b','source_b','source_b',\
'source_d','source_c']})
df
To create the unique list for each of ‘state’ value, use ‘groupby’ and apply(list)
df_grouped=df.groupby('state')['source'].unique().apply(list).reset_index()
df_grouped
state | source | |
---|---|---|
0 | CA | [source_a, source_c] |
1 | CT | [source_a] |
2 | FL | [source_b, source_d] |
3 | NJ | [source_b] |
4 | NY | [source_a, source_b] |
Merge with original dataframe
result=pd.merge(df,df_grouped,on='state',how='left')
result
state | source_x | source_y | |
---|---|---|---|
0 | NY | source_a | [source_a, source_b] |
1 | CA | source_a | [source_a, source_c] |
2 | CT | source_a | [source_a] |
3 | NJ | source_b | [source_b] |
4 | NY | source_b | [source_a, source_b] |
5 | FL | source_b | [source_b, source_d] |
6 | FL | source_d | [source_b, source_d] |
7 | CA | source_c | [source_a, source_c] |
Rename ‘source_x’ and ‘source_y’ columns
result.rename(columns={'source_x':'source_original','source_y':'unique_list'},inplace=True)
result
state | source_original | unique_list | |
---|---|---|---|
0 | NY | source_a | [source_a, source_b] |
1 | CA | source_a | [source_a, source_c] |
2 | CT | source_a | [source_a] |
3 | NJ | source_b | [source_b] |
4 | NY | source_b | [source_a, source_b] |
5 | FL | source_b | [source_b, source_d] |
6 | FL | source_d | [source_b, source_d] |
7 | CA | source_c | [source_a, source_c] |