In this post, we demonstrate how to use the merge_asof() function in the Pandas library to merge two dataframes, df1 and df2, where df1 contains names and IDs for certain individuals and df2 contains historical addresses for those same individuals. The goal is to match each person in df1 with a corresponding address in df2 that is closest to a specified cutoff date.
The cutoff date can be any date, such as today or last year, and is used to determine which address from the historical data in df2 is the most relevant to use in the merge. For example, if the cutoff date is January 1, 2022, and an individual has three historical addresses, the address that is closest in time (and not exceeding) the cutoff date will be used in the merge.
To begin, the code imports the Pandas library and creates two dataframes: df1, which contains names and IDs for individuals, and df2, which contains historical addresses for those same individuals.
Next, the code converts the ‘Effective Date’ column in df2 to a datetime format using the to_datetime() function.
Then, the code sorts the rows of df2 by the ‘Effective Date’ column in ascending order using the sort_values() function. This is important because the merge_asof() function expects the data to be sorted in order to properly match rows based on the cutoff date.
Finally, the code uses the merge_asof() function to merge df1 and df2 based on the ‘ID’ column and the cutoff date. The merge_asof() function returns a new dataframe that includes the matching rows from both df1 and df2, along with any additional columns that are specified in the function.
Overall, this code demonstrates how to use the merge_asof() function in Pandas to merge two dataframes based on the closest date in time. [Further explained below]
import pandas as pd
df1
Name | ID | |
---|---|---|
0 | Jane Smith | 2 |
1 | John Doe | 1 |
2 | Johnathan Adams | 5 |
3 | Kimberly Smith | 8 |
4 | Mary Brown | 4 |
5 | Mike Jones | 3 |
6 | Sara Davis | 6 |
7 | Will Johnson | 7 |
df2
ID | Name | Address 1 | Address 2 | City | State | Country | Zip | Effective Date | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | John Doe | 123 Main St | Suite 100 | San Francisco | CA | USA | 94105 | 2016-05-02 |
1 | 1 | John Doe | 222 Post St, | NaN | San Francisco | CA | USA | 94108 | 2018-01-01 |
2 | 1 | John Doe | 456 Market St, | NaN | San Francisco | CA | USA | 94103 | 2021-03-22 |
3 | 2 | Jane Smith | 456 Elm St | Apt 4 | New York | NY | USA | 10012 | 2021-09-05 |
4 | 2 | Jane Smith | 620 Reiss Pl | Apt 2D | Bronx | NY | USA | 12345 | 2001-10-23 |
5 | 2 | Jane Smith | 3484 Fort Independence St | NaN | Glen Oaks | NY | USA | 11004 | 2008-11-15 |
6 | 3 | Mike Jones | 789 Pine St | Floor 2 | Seattle | WA | USA | 98101 | 2015-10-09 |
7 | 3 | Mike Jones | 26142A Langston Ave | NaN | New York | NY | USA | 59712 | 2020-01-28 |
8 | 3 | Mike Jones | 620 Reiss Pl | NaN | Staten Island | NY | USA | 56712 | 2020-02-09 |
9 | 3 | Mike Jones | 379 Amboy St | NaN | Rockville Centre | NY | USA | 56921 | 2020-01-24 |
10 | 4 | Mary Brown | 111 1st Ave | Apt 5 | Los Angeles | CA | USA | 90012 | 2019-08-06 |
11 | 4 | Mary Brown | 5530 99th St | NaN | White Plains | NY | USA | 15637 | 2019-01-23 |
12 | 5 | Johnathan Adams | 222 2nd St | Ste 200 | Chicago | IL | USA | 60606 | 2015-06-07 |
13 | 5 | Johnathan Adams | 477 De Mott Ave | NaN | New Hyde Park | NY | USA | 10023 | 2019-01-13 |
14 | 6 | Sara Davis | 333 3rd Ave | Floor 4 | Houston | TX | USA | 77001 | 2020-08-03 |
15 | 6 | Sara Davis | 76 Shotwell Ave | NaN | Bronx | NY | USA | 15893 | 2019-01-05 |
16 | 7 | Will Johnson | 444 4th St | Apt 6 | Philadelphia | PA | USA | 19106 | 2015-05-09 |
17 | 8 | Kimberly Smith | 555 5th Ave | Apt 7 | Phoenix | AZ | USA | 85001 | 2017-12-06 |
So for presentation purposes, let’s say the cutoff date is 1/1/2022. Notice John Doe (ID1) has three historical addresses. The one where he resided that is closest in time (and not to exceed) the cutoff date is the “456 Market St” address with and effective date of “2021-03-22”.
First, we ensure that the “Effective Date” column is in a datetime format.
df2['Effective Date']=pd.to_datetime(df2['Effective Date'])
Next, we need to sort by the Effective Date in an ascending order.
df2.sort_values(by='Effective Date',inplace=True)
df2
ID | Name | Address 1 | Address 2 | City | State | Country | Zip | Effective Date | |
---|---|---|---|---|---|---|---|---|---|
4 | 2 | Jane Smith | 620 Reiss Pl | Apt 2D | Bronx | NY | USA | 12345 | 2001-10-23 |
5 | 2 | Jane Smith | 3484 Fort Independence St | NaN | Glen Oaks | NY | USA | 11004 | 2008-11-15 |
16 | 7 | Will Johnson | 444 4th St | Apt 6 | Philadelphia | PA | USA | 19106 | 2015-05-09 |
12 | 5 | Johnathan Adams | 222 2nd St | Ste 200 | Chicago | IL | USA | 60606 | 2015-06-07 |
6 | 3 | Mike Jones | 789 Pine St | Floor 2 | Seattle | WA | USA | 98101 | 2015-10-09 |
0 | 1 | John Doe | 123 Main St | Suite 100 | San Francisco | CA | USA | 94105 | 2016-05-02 |
17 | 8 | Kimberly Smith | 555 5th Ave | Apt 7 | Phoenix | AZ | USA | 85001 | 2017-12-06 |
1 | 1 | John Doe | 222 Post St, | NaN | San Francisco | CA | USA | 94108 | 2018-01-01 |
15 | 6 | Sara Davis | 76 Shotwell Ave | NaN | Bronx | NY | USA | 15893 | 2019-01-05 |
13 | 5 | Johnathan Adams | 477 De Mott Ave | NaN | New Hyde Park | NY | USA | 10023 | 2019-01-13 |
11 | 4 | Mary Brown | 5530 99th St | NaN | White Plains | NY | USA | 15637 | 2019-01-23 |
10 | 4 | Mary Brown | 111 1st Ave | Apt 5 | Los Angeles | CA | USA | 90012 | 2019-08-06 |
9 | 3 | Mike Jones | 379 Amboy St | NaN | Rockville Centre | NY | USA | 56921 | 2020-01-24 |
7 | 3 | Mike Jones | 26142A Langston Ave | NaN | New York | NY | USA | 59712 | 2020-01-28 |
8 | 3 | Mike Jones | 620 Reiss Pl | NaN | Staten Island | NY | USA | 56712 | 2020-02-09 |
14 | 6 | Sara Davis | 333 3rd Ave | Floor 4 | Houston | TX | USA | 77001 | 2020-08-03 |
2 | 1 | John Doe | 456 Market St, | NaN | San Francisco | CA | USA | 94103 | 2021-03-22 |
3 | 2 | Jane Smith | 456 Elm St | Apt 4 | New York | NY | USA | 10012 | 2021-09-05 |
Also, we need to add the cutoff date to the first dataframe (df1) as a separate column (name it “Effective Date” to be consistent with df2
df1['Effective Date']=pd.to_datetime('1/1/2022')
df1.sort_values(by='Effective Date',inplace=True)
df1
Name | ID | Effective Date | |
---|---|---|---|
0 | Jane Smith | 2 | 2022-01-01 |
1 | John Doe | 1 | 2022-01-01 |
2 | Johnathan Adams | 5 | 2022-01-01 |
3 | Kimberly Smith | 8 | 2022-01-01 |
4 | Mary Brown | 4 | 2022-01-01 |
5 | Mike Jones | 3 | 2022-01-01 |
6 | Sara Davis | 6 | 2022-01-01 |
7 | Will Johnson | 7 | 2022-01-01 |
And now merge
merged_df=pd.merge_asof(df1,df2, on = 'Effective Date', by='ID',direction='backward')
merged_df
Name_x | ID | Effective Date | Name_y | Address 1 | Address 2 | City | State | Country | Zip | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Jane Smith | 2 | 2022-01-01 | Jane Smith | 456 Elm St | Apt 4 | New York | NY | USA | 10012 |
1 | John Doe | 1 | 2022-01-01 | John Doe | 456 Market St, | NaN | San Francisco | CA | USA | 94103 |
2 | Johnathan Adams | 5 | 2022-01-01 | Johnathan Adams | 477 De Mott Ave | NaN | New Hyde Park | NY | USA | 10023 |
3 | Kimberly Smith | 8 | 2022-01-01 | Kimberly Smith | 555 5th Ave | Apt 7 | Phoenix | AZ | USA | 85001 |
4 | Mary Brown | 4 | 2022-01-01 | Mary Brown | 111 1st Ave | Apt 5 | Los Angeles | CA | USA | 90012 |
5 | Mike Jones | 3 | 2022-01-01 | Mike Jones | 620 Reiss Pl | NaN | Staten Island | NY | USA | 56712 |
6 | Sara Davis | 6 | 2022-01-01 | Sara Davis | 333 3rd Ave | Floor 4 | Houston | TX | USA | 77001 |
7 | Will Johnson | 7 | 2022-01-01 | Will Johnson | 444 4th St | Apt 6 | Philadelphia | PA | USA | 19106 |
And we get the expected results. Note: this merge can also be performed by finding the nearest match in the future, so the closest one past the cutoff date, and similarly on the nearest. This is accomplished by changing the direction parameter to forward or nearest. Additionally documentation on this function can be accessed via the link below.
Also, to download anaconda along with its packages (python, and everything else you would need), go to this site, follow installation instructions: Download Anaconda link