We noticed data-diff taking a long time on large data sets.
Example: we are diffing 2 days' worth of data from one table, 6,500,577 records in total, and data-diff took about 8,398 seconds to complete.
Expected: per this technical doc, a table with 100M rows finishes in about 20 minutes, so a table with ~6.5M rows should finish in under 2 minutes.
Actual: in the case above it takes 8,398 seconds to finish.
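For scale: 6,500,577 / 100,000,000 × 20 min ≈ 1.3 min if the runtime scales roughly linearly with row count, whereas 8,398 s is about 140 min, i.e. roughly 100x slower than that estimate.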
Here are the data-diff parameters we used:
import data_diff
from data_diff import Algorithm, connect, table_segment

# Segment definition shared by both sides of the diff
segment_kwargs = {
    "key_columns": table_info['primary_keys'],
    "extra_columns": table_info['extra_columns'],
    "table_path": (table_info["table_schema"], table_info["table_name"]),
}
# Restrict the diff to the bookmark window when one is configured
if table_info['bookmark_column']:
    segment_kwargs['update_column'] = table_info['bookmark_column']
    segment_kwargs['min_update'] = table_info['min_update']
    segment_kwargs['max_update'] = table_info['max_update']

# Source: MySQL
diff_mysql = connect(
    db_conf={
        "database": table_info['table_schema'],
        "driver": "mysql",
        "host": table_info['hostname'],
        "port": 3306,
        "password": mysql_password,
        "username": mysql_user,
    }
)
# Target: BigQuery
diff_bigquery = connect(
    db_conf={
        "driver": "bigquery",
        "bigquery_credentials": bq_credentials,
        "project": bq_project,
        "dataset": table_info['table_schema'],
    }
)

for difference in data_diff.diff_tables(
    table1=table_segment.TableSegment(diff_mysql, **segment_kwargs),
    table2=table_segment.TableSegment(diff_bigquery, **segment_kwargs),
    algorithm=Algorithm.HASHDIFF,
    threaded=False,
    auto_bisection_factor=True,
):
    ...  # loop body omitted

Table details:
Total rows to process between the min and max bookmark_column timestamps: 6,500,577
Table: {'hostname': 'xxxxxxxx.com', 'table_schema': 'aaaaaa', 'table_name': 'xxxxreport', 'bookmark_column': 'bbbb_timestamp', 'primary_keys': ('idxxxxxreport_int',), 'extra_columns': ('id1_int', 'id2_int', 'id3_int', 'metric_value'), 'min_update': '2025-05-07 23:24:37', 'max_update': '2025-05-09 23:24:37'}
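For reference, below is a minimal timing sketch of the same diff with the segmentation pinned explicitly instead of auto_bisection_factor. It reuses diff_mysql, diff_bigquery and segment_kwargs from the snippet above; threaded=True and the bisection_factor / bisection_threshold keyword arguments (mirroring the CLI's --bisection-factor / --bisection-threshold defaults of 32 and 16384) are assumptions on our side, not settings we have confirmed help.

import time

import data_diff
from data_diff import Algorithm, table_segment

# diff_mysql, diff_bigquery and segment_kwargs are the objects built in the snippet above.
start = time.perf_counter()
differences = list(
    data_diff.diff_tables(
        table1=table_segment.TableSegment(diff_mysql, **segment_kwargs),
        table2=table_segment.TableSegment(diff_bigquery, **segment_kwargs),
        algorithm=Algorithm.HASHDIFF,
        threaded=True,              # assumption: query MySQL and BigQuery concurrently
        bisection_factor=32,        # assumption: split each segment into 32 sub-segments
        bisection_threshold=16384,  # assumption: below this row count, download and compare rows directly
    )
)
elapsed = time.perf_counter() - start
print(f"{len(differences)} differing rows found in {elapsed:.1f}s")

Measuring wall time this way would let us report per-configuration numbers alongside the 8,398 s figure above.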