The Announcement That Changed Everything
CFO (in all-hands meeting): “We’re implementing a new financial system. All historical data must be migrated by year-end.”
Me: raises hand “How much historical data?”
CFO: “20 years.”
Me: “And it’s currently in…?”
CFO: “Excel. About 2,400 different files.”
Everyone: collective groan
CFO: “Budget for migration: $50,000 for consultants.”
Me (internal monologue): Or… I could learn Python and do it myself…
The Challenge
The Source Data (AKA: The Nightmare)
2,400 Excel files containing:
- Trial balances (monthly, 1995-2024)
- Investment transactions (daily, 2000-2024)
- Member contributions (monthly, 1995-2024)
- Benefit payments (monthly, 1995-2024)
- Employer remittances (monthly, 1995-2024)
File Formats:
.xls(Excel 97-2003) → 1,200 files.xlsx(Excel 2007+) → 1,000 files.xlsb(Binary) → 150 files.csv(supposedly) → 50 files (actually pipe-delimited )
File Naming Conventions:
Trial Balance 01-2024.xlsxTB_January_2024_FINAL.xlsxTB_Jan2024_v3_REALLY_FINAL.xlsxTrial_Bal_2024_01_USE_THIS_ONE.xlsxTB 202401.xlstrial balance january 2024 (2).xlsx
Translation: No consistent naming. Complete chaos.
Data Structure:
- Different column names across years
- Different sheet names
- Some months missing
- Some duplicates (which version is correct? ¯\_(ツ)_/¯)
- Formulas instead of values
- Hidden sheets with “backup data”
- Merged cells
- Colors indicating… something (documentation lost)
The Manual Approach (Estimated)
Consultants’ Proposal:
- Inventory all files → 2 weeks
- Standardize format → 8 weeks
- Consolidate by year → 12 weeks
- Data quality checks → 4 weeks
- Export to new system format → 2 weeks
- Validation → 4 weeks
Total Time: 32 weeks (8 months!)
Total Cost: $50,000
Risk: High (manual = errors)
The Python Approach
My Pitch to CFO:
“Give me 2 weeks to build an automated solution. If it works, we save $50,000 and 7 months. If it doesn’t, we hire the consultants.”
CFO: “You have 1 week.”
Me: gulp
Day 1: File Inventory & Pattern Detection
Step 1: Find All Excel Files
import os
import pandas as pd
from pathlib import Path
# Find all Excel files recursively
base_path = Path('/data/historical')
excel_files = []
for ext in ['*.xls', '*.xlsx', '*.xlsb']:
excel_files.extend(base_path.rglob(ext))
print(f"Found {len(excel_files)} Excel files")
# Create inventory
inventory = pd.DataFrame({
'File_Path': [str(f) for f in excel_files],
'File_Name': [f.name for f in excel_files],
'File_Size_MB': [f.stat().st_size / 1024 / 1024 for f in excel_files],
'Modified_Date': [pd.Timestamp.fromtimestamp(f.stat().st_mtime) for f in excel_files]
})
# Detect file type from name
inventory['File_Type'] = inventory['File_Name'].apply(lambda x:
'Trial_Balance' if 'TB' in x.upper() or 'TRIAL' in x.upper() else
'Investments' if 'INV' in x.upper() else
'Contributions' if 'CONTRIB' in x.upper() else
'Benefits' if 'BEN' in x.upper() else
'Unknown'
)
inventory.to_excel('file_inventory.xlsx', index=False)
print(inventory['File_Type'].value_counts())
Output:
Found 2,413 Excel files
File_Type
Trial_Balance 1,247
Investments 687
Contributions 289
Benefits 156
Unknown 34
Time: 5 minutes
Step 2: Detect Sheet Names
import openpyxl
# Sample 100 files to detect sheet name patterns
sample_files = inventory.sample(100)
sheet_names = []
for file_path in sample_files['File_Path']:
try:
wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
sheet_names.extend(wb.sheetnames)
wb.close()
except:
pass
# Count most common sheet names
from collections import Counter
sheet_counts = Counter(sheet_names)
print("Most common sheet names:")
for name, count in sheet_counts.most_common(20):
print(f" {name}: {count}")
Output:
Most common sheet names:
Sheet1: 234
Data: 187
Trial Balance: 156
TB: 98
2024: 67
Summary: 45
Detail: 34
...
Insight: Try “Data” first, then “Trial Balance”, then first sheet.
Day 2: Build Robust File Reader
The Universal Excel Reader
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
def read_excel_robust(file_path):
"""
Reads an Excel file, trying multiple strategies
Returns: DataFrame or None
"""
# Strategy 1: Try common sheet names
common_sheets = ['Data', 'Trial Balance', 'TB', 'Detail', 'Sheet1']
for sheet in common_sheets:
try:
df = pd.read_excel(file_path, sheet_name=sheet)
if len(df) > 0: # Not empty
return df, sheet
except:
pass
# Strategy 2: Try first sheet
try:
df = pd.read_excel(file_path, sheet_name=0)
if len(df) > 0:
return df, 'First_Sheet'
except:
pass
# Strategy 3: Try reading as CSV (for .csv mislabeled as .xls)
try:
df = pd.read_csv(file_path, encoding='latin-1')
if len(df) > 0:
return df, 'CSV'
except:
pass
return None, None
# Test on sample
results = []
for file_path in inventory['File_Path'].head(100):
df, sheet = read_excel_robust(file_path)
results.append({
'File': file_path,
'Success': df is not None,
'Sheet': sheet,
'Rows': len(df) if df is not None else 0,
'Columns': len(df.columns) if df is not None else 0
})
results_df = pd.DataFrame(results)
print(f"Success rate: {results_df['Success'].mean()*100:.1f}%")
Output:
Success rate: 94.2%
Not bad! Fix the 5.8% failures manually.
Day 3: Column Name Standardization
The Problem
1995 files: Acct_Num, Description, Debit, Credit
2005 files: Account Number, Account Description, Debit Amount, Credit Amount
2015 files: Account, Desc, DR, CR
2024 files: GL_Account, GL_Description, Debit_Amt, Credit_Amt
Goal: Map all variations to standard column names
The Solution: Fuzzy Matching
from fuzzywuzzy import fuzz
# Standard column names
standard_columns = {
'Account_Number': ['account', 'acct', 'gl_account', 'account_num', 'account number'],
'Description': ['desc', 'description', 'account_desc', 'account description', 'name'],
'Debit': ['debit', 'dr', 'debit_amt', 'debit amount', 'debit_amount'],
'Credit': ['credit', 'cr', 'credit_amt', 'credit amount', 'credit_amount'],
'Date': ['date', 'transaction_date', 'trans_date', 'posting_date'],
'Amount': ['amount', 'amt', 'balance', 'value']
}
def standardize_columns(df):
"""Map column names to standard names using fuzzy matching"""
column_mapping = {}
for col in df.columns:
col_lower = str(col).lower().strip()
# Try exact match first
for standard, variations in standard_columns.items():
if col_lower in variations:
column_mapping[col] = standard
break
# Try fuzzy match (80% threshold)
if col not in column_mapping:
best_match = None
best_score = 0
for standard, variations in standard_columns.items():
for variation in variations:
score = fuzz.ratio(col_lower, variation)
if score > best_score and score > 80:
best_score = score
best_match = standard
if best_match:
column_mapping[col] = best_match
return df.rename(columns=column_mapping)
# Test
test_df = pd.DataFrame(columns=['Acct_Num', 'Desc', 'DR', 'CR'])
standardized = standardize_columns(test_df)
print(standardized.columns.tolist())
# Output: ['Account_Number', 'Description', 'Debit', 'Credit']
Success!
Day 4: Data Extraction & Consolidation
import pandas as pd
from tqdm import tqdm # Progress bar
def process_all_files(file_type='Trial_Balance'):
"""Process all files of a given type"""
# Filter inventory
files = inventory[inventory['File_Type'] == file_type].copy()
print(f"Processing {len(files)} {file_type} files...")
all_data = []
failed_files = []
for idx, row in tqdm(files.iterrows(), total=len(files)):
try:
# Read file
df, sheet = read_excel_robust(row['File_Path'])
if df is None:
failed_files.append(row['File_Name'])
continue
# Standardize columns
df = standardize_columns(df)
# Add metadata
df['Source_File'] = row['File_Name']
df['Source_Sheet'] = sheet
df['File_Modified_Date'] = row['Modified_Date']
# Extract date from filename (if possible)
# Pattern: "TB_01_2024" or "Trial Balance January 2024"
import re
date_match = re.search(r'(\d{1,2})[_\s-]?(\d{4})', row['File_Name'])
if date_match:
month, year = date_match.groups()
df['Report_Month'] = int(month)
df['Report_Year'] = int(year)
all_data.append(df)
except Exception as e:
failed_files.append(f"{row['File_Name']}: {str(e)}")
# Consolidate
if all_data:
combined = pd.concat(all_data, ignore_index=True)
print(f"\nConsolidated {len(combined):,} rows from {len(all_data)} files")
print(f"Failed: {len(failed_files)} files")
# Export
combined.to_parquet(f'{file_type}_consolidated.parquet') # Faster than Excel!
if failed_files:
pd.DataFrame({'Failed_Files': failed_files}).to_excel(
f'{file_type}_failed.xlsx', index=False
)
return combined
else:
print("No data extracted!")
return None
# Process all trial balances
tb_data = process_all_files('Trial_Balance')
Output:
Processing 1,247 Trial_Balance files...
100%|██████████| 1247/1247 [12:34<00:00, 1.65it/s]
Consolidated 2,847,392 rows from 1,198 files
Failed: 49 files
Time: 12 minutes (for 1,247 files!)
Manually review 49 failed files (corrupt, password-protected, etc.)
Day 5: Data Quality Checks
Check #1: Completeness (Missing Months)
# Check for missing months
expected_months = pd.date_range(start='1995-01', end='2024-12', freq='MS')
actual_months = tb_data.groupby(['Report_Year', 'Report_Month']).size().reset_index()
actual_months['Period'] = pd.to_datetime(
actual_months['Report_Year'].astype(str) + '-' +
actual_months['Report_Month'].astype(str).str.zfill(2) + '-01'
)
missing_months = set(expected_months) - set(actual_months['Period'])
print(f"Missing months: {len(missing_months)}")
if missing_months:
print(sorted(missing_months))
Check #2: Duplicates
# Check for duplicate months (multiple versions)
duplicates = actual_months[actual_months.duplicated(subset=['Report_Year', 'Report_Month'], keep=False)]
if len(duplicates) > 0:
print(f"Duplicate months found: {len(duplicates)}")
print(duplicates[['Report_Year', 'Report_Month', 'Source_File']].head(20))
# Keep most recent file (by modified date)
tb_data = tb_data.sort_values('File_Modified_Date').drop_duplicates(
subset=['Report_Year', 'Report_Month', 'Account_Number'],
keep='last'
)
Check #3: Trial Balance Validation
# Verify debits = credits for each month
validation = tb_data.groupby(['Report_Year', 'Report_Month']).agg({
'Debit': 'sum',
'Credit': 'sum'
}).reset_index()
validation['Difference'] = validation['Debit'] - validation['Credit']
validation['Balances'] = abs(validation['Difference']) < 0.01 # Allow $0.01 rounding
print(f"Months in balance: {validation['Balances'].sum()} / {len(validation)}")
if not validation['Balances'].all():
print("\nMonths out of balance:")
print(validation[~validation['Balances']])
Output:
Months in balance: 358 / 360
Months out of balance:
Report_Year Report_Month Debit Credit Difference
45 1998 6 1,234,567 1,234,569 -2.00
67 2001 2 2,456,789 2,456,790 -1.00
Action: Investigate 2 months manually (likely data entry errors in source)
Day 6: Export to New System Format
Target System Requirements
Format: CSV, pipe-delimited (|), specific column order, date format YYYYMMDD
# Transform to target format
tb_export = tb_data[[
'Report_Year',
'Report_Month',
'Account_Number',
'Description',
'Debit',
'Credit'
]].copy()
# Create period field (YYYYMM)
tb_export['Period'] = (
tb_export['Report_Year'].astype(str) +
tb_export['Report_Month'].astype(str).str.zfill(2)
)
# Format amounts (2 decimal places, no commas)
tb_export['Debit'] = tb_export['Debit'].round(2)
tb_export['Credit'] = tb_export['Credit'].round(2)
# Reorder columns
tb_export = tb_export[[
'Period',
'Account_Number',
'Description',
'Debit',
'Credit'
]]
# Export as pipe-delimited CSV
tb_export.to_csv(
'trial_balance_MIGRATION.csv',
sep='|',
index=False,
header=True
)
print(f"Exported {len(tb_export):,} rows to trial_balance_MIGRATION.csv")
print(f"File size: {os.path.getsize('trial_balance_MIGRATION.csv') / 1024 / 1024:.1f} MB")
Output:
Exported 2,847,392 rows to trial_balance_MIGRATION.csv
File size: 187.3 MB
Day 7: The Presentation
Me (to CFO and finance team):
“I’ve consolidated 20 years of trial balances. 2.8 million rows from 1,247 Excel files.”
CFO: “That’s… impressive. How long did it take?”
Me: “About 6 days. Most of that was building the automation. Once it’s running, it processes all files in 12 minutes.”
Finance Director: “The consultants quoted 8 months…”
Me: “I know. Here’s the validation report. 358 of 360 months balance perfectly. These 2 need manual review—they were already wrong in the source files.”
CFO: scrolls through validation report “This is incredibly detailed. Can you do the same for investments, contributions, and benefits?”
Me: “Already done. Ran the same scripts on all file types. Total: 7.2 million rows consolidated.”
CFO: “How much did this cost?”
Me: “Zero. Well, I bought a Python course for $50.”
CFO: long pause “You saved us $50,000 and 7 months. I’m giving you a bonus. And a promotion. And I want you to train the whole team.”
Me: tries not to cry
The Results
Before Python:
- ⏰ Estimated Time: 8 months
- Estimated Cost: $50,000
- Risk: High (manual = errors)
- Stress: Off the charts
After Python:
- ⏰ Actual Time: 1 week
- Actual Cost: $50 (online course)
- Accuracy: 99.4% automated (6 files needed manual fix)
- Stress: Manageable
Metrics:
- Files Processed: 2,413
- Rows Consolidated: 7,241,039
- Time Saved: 32 weeks
- Money Saved: $50,000
- ROI: 100,000%
The Scripts I Built
1. File Inventory Script
inventory_files.py
- Recursively finds all Excel files
- Detects file type from name
- Creates inventory spreadsheet
2. Universal Excel Reader
read_excel_robust.py
- Tries multiple strategies to read files
- Handles .xls, .xlsx, .xlsb, .csv
- Returns DataFrame or None
3. Column Standardizer
standardize_columns.py
- Maps column name variations to standard names
- Uses fuzzy matching
- Handles typos and abbreviations
4. Consolidation Script
consolidate_files.py
- Processes all files of a given type
- Extracts dates from filenames
- Adds metadata (source file, sheet, modified date)
- Exports to Parquet (faster than Excel)
5. Validation Script
validate_data.py
- Checks for missing months
- Identifies duplicates
- Validates trial balance (debits = credits)
- Generates quality report
6. Export Script
export_to_new_system.py
- Transforms to target system format
- Handles special formatting requirements
- Exports as pipe-delimited CSV
Total Lines of Code: ~800 (including comments)
Reusable: Yes! Used for 4 different data types with minor tweaks.
Lessons Learned
Lesson #1: Automation Beats Manual Every Time
Manual Approach:
- Time: 8 months
- Cost: $50,000
- Errors: Unknown (probably many)
Automated Approach:
- Time: 1 week
- Cost: $50
- Errors: 6 files out of 2,413 (0.25%)
Winner: Automation
Lesson #2: Invest in Robust Error Handling
Files WILL be:
- Corrupt
- Password-protected
- In unexpected formats
- Mislabeled
- Empty
Handle gracefully, log failures, move on.
Lesson #3: Fuzzy Matching is a Lifesaver
Column names WILL vary:
Acct_Numvs.Account NumberDRvs.Debit_Amt- Typos:
Desciption(missing ‘r’)
Fuzzy matching handles all of this automatically.
Lesson #4: Validate EVERYTHING
Don’t assume source data is correct!
We found:
- 2 months with trial balances out of balance (source errors)
- 47 duplicate files (different versions, which to use?)
- 18 completely empty files
- 6 files from wrong time period (FY2013 file in FY2014 folder)
Automated validation caught all of these.
Lesson #5: Parquet > Excel for Large Datasets
Excel File:
- Size: 487 MB
- Load time: 2 minutes
- Crashes Excel on older computers
Parquet File:
- Size: 89 MB (81% smaller!)
- Load time: 3 seconds
- Works instantly
Bonus: Parquet preserves data types perfectly (no “Excel converted my account numbers to dates” issues)
The Long-Term Impact
Year 1:
- Saved $50,000 on consultants
- Completed migration in 1 week vs. 8 months
- Promoted to Senior Analyst
- 15% raise
- Became team’s “Python expert”
Year 2:
- Used same scripts for quarterly data loads (4 times/year)
- Saved 32 hours/quarter = 128 hours/year
- Automated 12 other reports using similar techniques
- Total time saved: ~400 hours/year
Year 3:
- Trained 8 team members in Python
- Team automated 30+ processes
- Finance department recognized as “most efficient” in org
- CFO presented our work at industry conference
- I got another promotion
Your Turn
Facing a data migration project?
Don’t panic. You’ve got this.
Step 1: Inventory your files Step 2: Build a robust reader Step 3: Standardize column names Step 4: Consolidate Step 5: Validate Step 6: Export
Time Investment: 1-2 weeks
Payoff: Priceless
Try It Yourself
Want the complete migration scripts? Download from GitHub
Have your own migration success story? Share it in the comments!
Join the Discussion on Discord! -
Facing a data migration challenge? Join our Discord and get help from people who’ve been there!
*“The best time to learn Python was 5 years ago. The second best time is today.” *