If you have an unstructured file with messy strings containing amounts, and you want to extract just the amounts, you might consider the following approach:
First, you will need to identify the patterns in the strings that indicate the presence of an amount. For example, you might look for strings that contain a currency symbol, such as “$” or “€”, or strings that contain numbers followed by a unit of measurement, such as “kg” or “lbs”.
Once you have identified the patterns that indicate the presence of an amount, you can use regular expressions or string manipulation techniques to extract the amounts from the strings. This may involve using functions such as re.search or re.findall in Python, or using string functions such as split or replace to extract the amounts.
After extracting the amounts, you may want to convert them to a numeric data type, such as float or integer, in order to perform calculations with them. You can do this using functions such as int() or float(), depending on the data type you desire.
Finally, you can store the extracted and converted amounts in a new column in your DataFrame, or use them for further analysis or processing as needed.
import pandas as pd
import re
df=pd.DataFrame({"Description":['A','B','C','D','E'],"Amount":['trx .11','balance 536,002.63','adj 85.85','manual adj 0.00','fx va vb 55.63']})
df
Description | Amount | |
---|---|---|
0 | A | trx .11 |
1 | B | balance 536,002.63 |
2 | C | adj 85.85 |
3 | D | manual adj 0.00 |
4 | E | fx va vb 55.63 |
match='(\$?(?<!\d)(?:\d{1,3}(?:,\d{3})*|\d{4,})?\.?\d+)'
df['Amount_extract']=df['Amount'].str.extract(match)
df
Description | Amount | Amount_extract | |
---|---|---|---|
0 | A | trx .11 | .11 |
1 | B | balance 536,002.63 | 536,002.63 |
2 | C | adj 85.85 | 85.85 |
3 | D | manual adj 0.00 | 0.00 |
4 | E | fx va vb 55.63 | 55.63 |