Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding query_id attr #67

Closed
wants to merge 4 commits into from
Closed

Conversation

1ng4lipt
Copy link

@1ng4lipt 1ng4lipt commented Aug 3, 2023

To identify different queries with similar query_id or initial_query_id via system.query_log table you can have an ability to set query_id by yourself using query_id param in Airflow operators you using.

It is possible by clickhouse-driver execute() method - https://clickhouse-driver.readthedocs.io/en/latest/api.html?highlight=query_id#clickhouse_driver.Client.execute_iter

This commit allows set query_id param as whatever you want, f.e. query_id = DAG_ID + TASK_ID + EXECUTION_DATE and then simply identify a query to build lineage. This can be helpful in situations when you manage equally named tables and the only way to identify their queries is to use a modfied query_id.

To identify different queries with similar query_id or initial_query_id via system.query_log table you can have an ability to set query_id by yourself using query_id param in Airflow operators you using. 

This commit allows set query_id param as whatever you want, f.e. query_id = DAG_ID + TASK_ID + EXECUTION_DATE and then simply identify a query to build lineage. This can be helpful in situations when you manage equally named tables and the only way to identify their queries is to use a modfied query_id.
Copy link
Owner

@bryzgaloff bryzgaloff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @1ng4lipt and thank you for your contribution 🙏

  1. Which clickhouse-driver version has query_id been supported starting from? We need to update our requirements constraints accordingly.

  2. What happens if two queries in ClickHouse has the same query ID? Will it result in an exception or just system.query_log will have two entries with the same IDs?

  3. To make it more comprehensive, could you please also add support to ClickHouseOperator?


    Current version of ClickHouseOperator is outdated compared to ClickHouseHook: could you please add with_column_types, types_check, and query_id to ClickHouseOperator.__init__ and propagate them to hook.run call in ClickHouseOperator.execute? BTW are there any other clickhouse-driver's execute arguments which the current implementation misses? Let's add them too maybe?

If you are unable to do it within this PR, please let me know: I will create an issue to backlog this enhancement and we may do it as a part of another PR later.

  1. Please also add tests:
    class ClickHouseOperatorTestCase(unittest.TestCase):

@ne1r0n
Copy link
Contributor

ne1r0n commented Aug 4, 2023

Hi @1ng4lipt,
Just in case, I would recommend you to look at log_comment, unless of course you have done this yet.
It looks like it does what you need, the question is how to specify it.
One of the options is just to add this to the end of your query, like SELECT .. FROM .. SETTINGS log_comment = 'my comment'

@bryzgaloff
Copy link
Owner

Hi @ne1r0n thank you for a suggestion: definitely looks useable for ClickHouse users. Speaking about our plugin, its sole purpose is to provide a wrapper for clickhouse-driver in a form of a Airflow plugin. So, we will not introduce any additional logic into the operator's implementation: just pass the arguments to the driver. A user may specify SETTINGS log_comment = … themselves as a part of query argument.

I would suggest generalize this PR to include all the existing execute's arguments or limit only to a full support of query_id as suggested by the author, and implement support for the remaining attributes in a separate PR.

@1ng4lipt
Copy link
Author

1ng4lipt commented Aug 8, 2023

Hi @bryzgaloff, @ne1r0n and thank you for a suggestions and participation in reviewing my PR

@bryzgaloff answers on your questions from here

  1. Works for 0.2.4+
  2. As you can see in image below normalized_query_hash, query_id and initial_query_id are identical for two equal queries and this is a normal state for ClickHouse. I would recommend to all users in case of building a lineage to add Airflow execution_date to query_id param to identify queries you need. Another suggestion to identify queries with identical query_id is to use max(event_time) statemant in WHERE clause.

Знімок екрана з 2023-08-07 18-37-28

  1. I'll add params shortly
  2. Could you please explain more about this point? Should I modify test by adding a params from ClickHouseOperator.__init__?
    def test(self, clickhouse_hook_mock: mock.MagicMock):

@1ng4lipt 1ng4lipt requested a review from bryzgaloff August 9, 2023 11:56
Comment on lines 18 to +23
clickhouse_conn_id: str = default_conn_name,
parameters: Optional[Dict[str, Any]] = None,
database: Optional[str] = None,
with_column_types: Optional[bool] = False,
types_check: Optional[bool] = False,
query_id: Optional[str] = None,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@1ng4lipt my idea for this PR was to add all the parameters supported by execute method to all three: hook, operator, and sensor. The execute parameters are available here: https://clickhouse-driver.readthedocs.io/en/latest/api.html#clickhouse_driver.Client.execute — could you please add them all and propagate through all the classes and the tests?

Could you please also check which earliest version of clickhouse-driver supports all the current execute method arguments?

If you are busy to add the requested, please let me know: it's ok to accept the PR at the current stage. I will just create an issue to backlog adding other the parameters. Your contribution is already valuable, thanks! 🤝

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryzgaloff I see.. Could you please create an issue for all of this so I can modify those files later?

@1ng4lipt 1ng4lipt requested a review from bryzgaloff August 15, 2023 07:05
@bryzgaloff
Copy link
Owner

Hi @1ng4lipt thank you for an update, I plan to review and merge this week. I am busy working on v1.0.0 (major release) of the plugin but to not delay query_id delivery, I will merge your contribution first.

@1ng4lipt
Copy link
Author

Hi @bryzgaloff could you please clarify a date you plan to merge my PR? I badly need this functionality on my prod env
Thank you)

@bryzgaloff
Copy link
Owner

Hi @1ng4lipt I have decided to complete my work on v1.0.0, it has query_id functionality: #68. As a quick workaround you may git pull it and install locally. The code is finished, I need to update README and release the new version. I plan to finish it this week.

@bryzgaloff
Copy link
Owner

Hi @1ng4lipt I have released https://github.com/bryzgaloff/airflow-clickhouse-plugin/releases/tag/v1.0.0 which introduces query_id among the other arguments of Client.execute. Thank you for an inspiration to do the huge refactoring and patience waiting for the release 🤝

You are in the contributors list now 😉

@bryzgaloff bryzgaloff closed this Sep 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants