Skip to content

Performance issue with large tables #28

@ramyamasani

Description

@ramyamasani

Noticed data-diff taking long time for large data sets.

Ex: We have a table where we are diffing 2 days worth of data consisting of 6500577 records and data-diff took about 8398 secs to complete.

Expected: As per this technical doc a table with 100M rows will finish in 20 mins so a table with 6M rows should finish in less than 2 mins.
Actual: In the above case it is taking 8398 secs to finish.

Here are the data-diff parameters that we used.

import data_diff

 segment_kwargs = {
            "key_columns": table_info['primary_keys'],
            "extra_columns": table_info['extra_columns'],
            "table_path": (table_info["table_schema"], table_info["table_name"]),
        }
        if table_info['bookmark_column']:
                segment_kwargs['update_column'] = table_info['bookmark_column']
                segment_kwargs['min_update'] = table_info['min_update']
                segment_kwargs['max_update'] = table_info['max_update']
        

 diff_mysql = connect(
            db_conf={
                "database": table_info['table_schema'],
                "driver": "mysql",
                "host": table_info['hostname'],
                "port": 3306,
                "password": mysql_password,
                "username": mysql_user,
            }
        )

        diff_bigquery = connect(
            db_conf={
                "driver": "bigquery",
                "bigquery_credentials": bq_credentials,
                "project": bq_project,
                "dataset": table_info['table_schema'],
            }
        )

for difference in data_diff.diff_tables(
    table1=table_segment.TableSegment(diff_mysql, **segment_kwargs),
    table2=table_segment.TableSegment(diff_bigquery, **segment_kwargs),
    algorithm=Algorithm.HASHDIFF,
    threaded=False,
    auto_bisection_factor=True,
): 

Table details:
Total rows to process between min and max bookmark_column's timestamp : 6500577

Table: {'hostname': 'xxxxxxxx.com', 'table_schema': 'aaaaaa', 'table_name': 'xxxxreport', 'bookmark_column': 'bbbb_timestamp', 'primary_keys': ('idxxxxxreport_int',), 'extra_columns': ('id1_int', 'id2_int', 'id3_int', 'metric_value'), 'min_update': '2025-05-07 23:24:37', 'max_update': '2025-05-09 23:24:37'}

Metadata

Metadata

Assignees

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions