Merge/Join tables on the closest date in time

In this post, we demonstrate how to use the merge_asof() function in the Pandas library to merge two dataframes, df1 and df2, where df1 contains names and IDs for certain individuals and df2 contains historical addresses for those same individuals. The goal is to match each person in df1 with a corresponding address in df2 that is closest to a specified cutoff date.

The cutoff date can be any date, such as today or last year, and is used to determine which address from the historical data in df2 is the most relevant to use in the merge. For example, if the cutoff date is January 1, 2022, and an individual has three historical addresses, the address that is closest in time (and not exceeding) the cutoff date will be used in the merge.

To begin, the code imports the Pandas library and creates two dataframes: df1, which contains names and IDs for individuals, and df2, which contains historical addresses for those same individuals.

Next, the code converts the ‘Effective Date’ column in df2 to a datetime format using the to_datetime() function.

Then, the code sorts the rows of df2 by the ‘Effective Date’ column in ascending order using the sort_values() function. This is important because the merge_asof() function expects the data to be sorted in order to properly match rows based on the cutoff date.

Finally, the code uses the merge_asof() function to merge df1 and df2 based on the ‘ID’ column and the cutoff date. The merge_asof() function returns a new dataframe that includes the matching rows from both df1 and df2, along with any additional columns that are specified in the function.

Overall, this code demonstrates how to use the merge_asof() function in Pandas to merge two dataframes based on the closest date in time. [Further explained below]

import pandas as pd

df1

	Name	ID
0	Jane Smith	2
1	John Doe	1
2	Johnathan Adams	5
3	Kimberly Smith	8
4	Mary Brown	4
5	Mike Jones	3
6	Sara Davis	6
7	Will Johnson	7

df2

	ID	Name	Address 1	Address 2	City	State	Country	Zip	Effective Date
0	1	John Doe	123 Main St	Suite 100	San Francisco	CA	USA	94105	2016-05-02
1	1	John Doe	222 Post St,	NaN	San Francisco	CA	USA	94108	2018-01-01
2	1	John Doe	456 Market St,	NaN	San Francisco	CA	USA	94103	2021-03-22
3	2	Jane Smith	456 Elm St	Apt 4	New York	NY	USA	10012	2021-09-05
4	2	Jane Smith	620 Reiss Pl	Apt 2D	Bronx	NY	USA	12345	2001-10-23
5	2	Jane Smith	3484 Fort Independence St	NaN	Glen Oaks	NY	USA	11004	2008-11-15
6	3	Mike Jones	789 Pine St	Floor 2	Seattle	WA	USA	98101	2015-10-09
7	3	Mike Jones	26142A Langston Ave	NaN	New York	NY	USA	59712	2020-01-28
8	3	Mike Jones	620 Reiss Pl	NaN	Staten Island	NY	USA	56712	2020-02-09
9	3	Mike Jones	379 Amboy St	NaN	Rockville Centre	NY	USA	56921	2020-01-24
10	4	Mary Brown	111 1st Ave	Apt 5	Los Angeles	CA	USA	90012	2019-08-06
11	4	Mary Brown	5530 99th St	NaN	White Plains	NY	USA	15637	2019-01-23
12	5	Johnathan Adams	222 2nd St	Ste 200	Chicago	IL	USA	60606	2015-06-07
13	5	Johnathan Adams	477 De Mott Ave	NaN	New Hyde Park	NY	USA	10023	2019-01-13
14	6	Sara Davis	333 3rd Ave	Floor 4	Houston	TX	USA	77001	2020-08-03
15	6	Sara Davis	76 Shotwell Ave	NaN	Bronx	NY	USA	15893	2019-01-05
16	7	Will Johnson	444 4th St	Apt 6	Philadelphia	PA	USA	19106	2015-05-09
17	8	Kimberly Smith	555 5th Ave	Apt 7	Phoenix	AZ	USA	85001	2017-12-06

So for presentation purposes, let’s say the cutoff date is 1/1/2022. Notice John Doe (ID1) has three historical addresses. The one where he resided that is closest in time (and not to exceed) the cutoff date is the “456 Market St” address with and effective date of “2021-03-22”.

First, we ensure that the “Effective Date” column is in a datetime format.

df2['Effective Date']=pd.to_datetime(df2['Effective Date'])

Next, we need to sort by the Effective Date in an ascending order.

df2.sort_values(by='Effective Date',inplace=True)

df2

	ID	Name	Address 1	Address 2	City	State	Country	Zip	Effective Date
4	2	Jane Smith	620 Reiss Pl	Apt 2D	Bronx	NY	USA	12345	2001-10-23
5	2	Jane Smith	3484 Fort Independence St	NaN	Glen Oaks	NY	USA	11004	2008-11-15
16	7	Will Johnson	444 4th St	Apt 6	Philadelphia	PA	USA	19106	2015-05-09
12	5	Johnathan Adams	222 2nd St	Ste 200	Chicago	IL	USA	60606	2015-06-07
6	3	Mike Jones	789 Pine St	Floor 2	Seattle	WA	USA	98101	2015-10-09
0	1	John Doe	123 Main St	Suite 100	San Francisco	CA	USA	94105	2016-05-02
17	8	Kimberly Smith	555 5th Ave	Apt 7	Phoenix	AZ	USA	85001	2017-12-06
1	1	John Doe	222 Post St,	NaN	San Francisco	CA	USA	94108	2018-01-01
15	6	Sara Davis	76 Shotwell Ave	NaN	Bronx	NY	USA	15893	2019-01-05
13	5	Johnathan Adams	477 De Mott Ave	NaN	New Hyde Park	NY	USA	10023	2019-01-13
11	4	Mary Brown	5530 99th St	NaN	White Plains	NY	USA	15637	2019-01-23
10	4	Mary Brown	111 1st Ave	Apt 5	Los Angeles	CA	USA	90012	2019-08-06
9	3	Mike Jones	379 Amboy St	NaN	Rockville Centre	NY	USA	56921	2020-01-24
7	3	Mike Jones	26142A Langston Ave	NaN	New York	NY	USA	59712	2020-01-28
8	3	Mike Jones	620 Reiss Pl	NaN	Staten Island	NY	USA	56712	2020-02-09
14	6	Sara Davis	333 3rd Ave	Floor 4	Houston	TX	USA	77001	2020-08-03
2	1	John Doe	456 Market St,	NaN	San Francisco	CA	USA	94103	2021-03-22
3	2	Jane Smith	456 Elm St	Apt 4	New York	NY	USA	10012	2021-09-05

Also, we need to add the cutoff date to the first dataframe (df1) as a separate column (name it “Effective Date” to be consistent with df2

df1['Effective Date']=pd.to_datetime('1/1/2022')

df1.sort_values(by='Effective Date',inplace=True)

df1

	Name	ID	Effective Date
0	Jane Smith	2	2022-01-01
1	John Doe	1	2022-01-01
2	Johnathan Adams	5	2022-01-01
3	Kimberly Smith	8	2022-01-01
4	Mary Brown	4	2022-01-01
5	Mike Jones	3	2022-01-01
6	Sara Davis	6	2022-01-01
7	Will Johnson	7	2022-01-01

And now merge

merged_df=pd.merge_asof(df1,df2, on = 'Effective Date', by='ID',direction='backward')

merged_df

	Name_x	ID	Effective Date	Name_y	Address 1	Address 2	City	State	Country	Zip
0	Jane Smith	2	2022-01-01	Jane Smith	456 Elm St	Apt 4	New York	NY	USA	10012
1	John Doe	1	2022-01-01	John Doe	456 Market St,	NaN	San Francisco	CA	USA	94103
2	Johnathan Adams	5	2022-01-01	Johnathan Adams	477 De Mott Ave	NaN	New Hyde Park	NY	USA	10023
3	Kimberly Smith	8	2022-01-01	Kimberly Smith	555 5th Ave	Apt 7	Phoenix	AZ	USA	85001
4	Mary Brown	4	2022-01-01	Mary Brown	111 1st Ave	Apt 5	Los Angeles	CA	USA	90012
5	Mike Jones	3	2022-01-01	Mike Jones	620 Reiss Pl	NaN	Staten Island	NY	USA	56712
6	Sara Davis	6	2022-01-01	Sara Davis	333 3rd Ave	Floor 4	Houston	TX	USA	77001
7	Will Johnson	7	2022-01-01	Will Johnson	444 4th St	Apt 6	Philadelphia	PA	USA	19106

And we get the expected results. Note: this merge can also be performed by finding the nearest match in the future, so the closest one past the cutoff date, and similarly on the nearest. This is accomplished by changing the direction parameter to forward or nearest. Additionally documentation on this function can be accessed via the link below.

pandas.merge_asof

Also, to download anaconda along with its packages (python, and everything else you would need), go to this site, follow installation instructions: Download Anaconda link

Merge/Join tables on the closest date in time

Join the Discussion on Discord