-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue #692 obv dupe changes for ftd and matcher #694
Conversation
@@ -0,0 +1,247 @@ | |||
package zingg.common.core.executor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class is not an executor and should be broken down into two classes at least. they can live in a separate package zingg.common.core.obviousDupes. one that takes args, dsutil and creates filters. one that applies those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought that may be we might want to create a phase out of it in future? 100% match filter kinds? If no such plan can break it down, no problem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are talking of two different things here.
- A new possible phase for obviousDupes. If we do that, we need some kind of executor.
- Breaking down the class
The first one can always be defined later using the classes in second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -126,7 +121,7 @@ public void execute() throws ZinggClientException { | |||
ZFrame<D, R, C> obvDupePairs = getObvDupePairs(blocked); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic should move to obvious dupes. the matcher should just ask the obv dupes to add/remove dupes as necessary by passing blocks/pairs whatever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is already moved , this method delegates work to the obv dupe class. The method is kept as it gets overriden in enterprise matcher for purpose of instrumentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is getObvDupePairs method in Matcher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then we should override the new class in enterprise and introduce im and weave into the ent matcher.
Obvious dupes logic should be encapsulated in its own package and classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and checked the code. The following should not be part of the Matcher.
if (obvDupePairs != null) { long obvDupeCount = obvDupePairs.count(); LOG.debug("obvDupePairs count " + obvDupeCount); if (obvDupeCount > 0) { blocks = removeObvDupesFromBlocks(blocks); } }
The methods in the Matcher that you have defined can continue to call the obv dupes classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in commit 70f1c33, please review
* @return | ||
*/ | ||
|
||
public C getObviousDupesFilter(ZFrame<D, R, C> df1, ZFrame<D, R, C> dfToJoin, String obviousDupeString, C extraAndCond) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you define the obvious filter not through a string put through json, this code becomes much simpler. how about
"obviousDupes": [
"matchCondition": [{"column":"fname"},{"column":"lname"}]
"matchCondition":[{}]
]
You can simply define this object as part of the args. It gets populated automatically, no need to parse, trim etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will create a JSON sample for various scenarios and get it reviewed before implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in commit c2ddce9 please review
have completed all review comments of obv dupe filter in OSS |
public static final Log LOG = LogFactory.getLog(ObvDupeFilter.class); | ||
|
||
protected Arguments args; | ||
protected Context<S,D,R,C,T> context; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to pass Context or can we pass DSUtil ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also see if S and T are needed for this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in commit b29758c , please review
ZFrame<D,R,C> onlyIds = null; | ||
|
||
// loop thru the values and build a filter condition | ||
for (int i = 0; i < obviousDupes.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is tyhere a reason to loop through this one by one? I thought the filter would return the and, or conditions to join on completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to loop thru all the match conditions and within match conditions for every field name.
e.g "obviousDupes":[{ "matchCondition":[ { "fieldName":"fname" }, { "fieldName":"stNo" }, { "fieldName":"add1" } ]},{"matchCondition":[ { "fieldName":"ssn" }] }, { "matchCondition":[{"fieldName":"fname" }, {"fieldName":"stNo" }, {"fieldName":"lname" } ]} ],
so we loop thru to get 3 matchCondition and then within each matchCondition for each fieldName
forming something like
(fname = z_fname and stNo = z_stNo and add1 = z_add1) OR
(ssn = z_ssn) OR
(fname = z_fname and stNo = z_stNo and lname = z_lname)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I get that. The point I am trying to make is can we have one single Column returned for all the ors along with the ands? Eventually the query push down is anyways going to do it in one single statement. Do we need to do the unions or can we get the resuklt using one single join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is query formation.
We don't actually use it exactly like this.
In case of getting obv dupes from blocked:
- We use each and condition
- union all the dfs got from above
- do distinct
Directly doing OR was causing performance issues
For removing obv dupe from blocks
We first form a combined condition like above and than return a combined condition with NOT:
NOT (
(fname = z_fname and stNo = z_stNo and add1 = z_add1) OR
(ssn = z_ssn) OR
(fname = z_fname and stNo = z_stNo and lname = z_lname)
)
The way we make Column filter expression is independent of how we use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically we don;t use one final Column in end and we are flexible. You have asked similar question below, I hope this answers both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By keeping build of column of filter expression independent of how we use it we can use it as needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also many of the methods are private only the public/protected can be used by outside code
common/core/src/main/java/zingg/common/core/obviousdupes/ObvDupeFilter.java
Outdated
Show resolved
Hide resolved
* @param obvDupes | ||
* @return | ||
*/ | ||
public ZFrame<D,R,C> massageObvDupes(ZFrame<D,R,C> obvDupes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this belongs to the other class. It seems that the class names can be swapped - ObvDupeUtil can be renamed to ObviousDupeCondiiton or even ObvDupeFilter as it is only building the join/filter conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please tell the names of both and I will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
method moved : commit 45e9205
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObvDupeUtil is the class that applies the filters, adds and removes obv dupes
ObvDupeCondition or ObvDupeFilter is the class that reads the args vaues and creates the conditions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in commit 6db0092 , please review
ZFrame<D,R,C> onlyIds = null; | ||
|
||
// loop thru the values and build a filter condition | ||
for (int i = 0; i < obviousDupes.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I get that. The point I am trying to make is can we have one single Column returned for all the ors along with the ands? Eventually the query push down is anyways going to do it in one single statement. Do we need to do the unions or can we get the resuklt using one single join?
common/core/src/main/java/zingg/common/core/obviousdupes/ObvDupeFilter.java
Outdated
Show resolved
Hide resolved
* @param obvDupes | ||
* @return | ||
*/ | ||
public ZFrame<D,R,C> massageObvDupes(ZFrame<D,R,C> obvDupes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObvDupeUtil is the class that applies the filters, adds and removes obv dupes
ObvDupeCondition or ObvDupeFilter is the class that reads the args vaues and creates the conditions
Issue #692 obv dupe changes for ftd and matcher