Can you confirm that even if there are no additional attributes, it makes perfect sense to use a subject table with IDs only, because it ensures that events are correlated with customers? The only thing "retained" from the original data would be the sequence length, or is there anything else to be learnt?
Michael Platzer
July 10, 2023
Yes, one even needs to use a subject table to ensure that every subject is contained only once, and thus to protect privacy at the subject level (rather than the event level).
Aside from sequence length, it is the coherence across events that is being retained, i.e. you can have consistent event histories or rather erratic / random event histories. These are important statistical properties of interest, and they are retained.
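To make the ID-only subject table concrete, here is a minimal sketch in plain Python. The event rows and the `customer_id` column name are made up for illustration and are not taken from the product. It derives a subject table in which every customer appears exactly once, and counts the sequence length per customer.

```python
from collections import Counter

# Hypothetical event table: one row per event, keyed by customer_id.
events = [
    {"customer_id": "c1", "event": "login"},
    {"customer_id": "c1", "event": "purchase"},
    {"customer_id": "c2", "event": "login"},
    {"customer_id": "c1", "event": "logout"},
]

# ID-only subject table: each customer appears exactly once (first-seen order).
unique_ids = dict.fromkeys(e["customer_id"] for e in events)
subjects = [{"customer_id": cid} for cid in unique_ids]

# Sequence length per subject, i.e. the number of events per customer.
sequence_lengths = Counter(e["customer_id"] for e in events)

print(subjects)
print(dict(sequence_lengths))
```

The subject table carries no attributes at all, yet it still defines which events belong together, which is what subject-level privacy and cross-event coherence rest on.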
Reply
Tobias Hann
Mar 29, 2023
"Reference keys are synthesized as categoricals" ... I suppose that means it's basically like adding a column with the categorical variable "season" and one with the categorical variable "region", correct? And if the reference tables don't contain categorical data but numerical data? What will happen then?
Michael Platzer
Mar 30, 2023
Those columns are actually already there in the original data. Thus, they don't need to be added. They are just processed by the engine as categoricals, no matter whether those reference keys are strings or numerics.
Reply
Tobias Hann
Mar 29, 2023
"That are not being leveraged" ... you mean, these data points are not taken into consideration when we build the model, i.e. the model won't benefit from that information?
Tobias Hann
Mar 29, 2023
I guess the later comment confirms this: for best SD quality, additional columns in reference tables should be merged into a table that is actually synthesized.
Michael Platzer
Mar 30, 2023
yes, correct!
Reply
Tobias Hann
Mar 29, 2023
Will SCP only work with 2 sequential tables or also more (3, 4, ... n)?
Tasos Tsourtis
Mar 30, 2023
Yes, it now works with more than two, thanks to the core team!
Michael Platzer
Mar 30, 2023
🙌
Michael Platzer
Mar 30, 2023
So, yes, SCP can support any number of sequential tables. And currently it seems that the training and generation time is not severely impacted by taking into account several tables.
Tobias Hann
Mar 30, 2023
🔥
Reply
Tobias Hann
Mar 29, 2023
What do you mean by "denormalize"?
Tasos Tsourtis
Mar 30, 2023
In a few words, it is the act of expanding a column that holds multiple values per record into one row per value, replicating the data in the columns that hold a single value.
e.g. userID - age - orders
First two records:
user_1, age_user_1, [order_1_user_1, order_2_user_1]
user_2, age_user_2, [order_1_user_2, order_2_user_2]
is denormalized as four records:
user_1, age_user_1, order_1_user_1
user_1, age_user_1, order_2_user_1
user_2, age_user_2, order_1_user_2
user_2, age_user_2, order_2_user_2
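The same example can be written as a small Python sketch. The function name `denormalize` is just an illustrative helper, not part of any product API.

```python
def denormalize(records):
    """Expand each (user_id, age, [orders]) record into one row per order."""
    rows = []
    for user_id, age, orders in records:
        for order in orders:
            # The single-valued columns are replicated for every order.
            rows.append((user_id, age, order))
    return rows


nested = [
    ("user_1", "age_user_1", ["order_1_user_1", "order_2_user_1"]),
    ("user_2", "age_user_2", ["order_1_user_2", "order_2_user_2"]),
]

flat = denormalize(nested)
for row in flat:
    print(row)
```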
Michael Platzer
Mar 30, 2023
In that example, I meant to say that the `LOGIN` table is enriched with those two smart select columns from the `ACCOUNTS` table, as if you did a left join in SQL. Thus, every LOGIN knows the type and the date of its parent ACCOUNT. We synthesize everything together and, once done, use those generated extra fields to find an appropriate parent, i.e. one that has a matching type and date.
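A rough sketch of that idea in plain Python follows. It is not the engine's actual implementation: the column names (`account_id`, `type`, `date`), the helper names, and the random choice among matching parents are all assumptions made for illustration.

```python
import random


def enrich_logins(logins, accounts):
    """Left join: copy the parent account's type and date onto each login."""
    by_id = {a["account_id"]: a for a in accounts}
    enriched = []
    for login in logins:
        parent = by_id.get(login["account_id"], {})
        enriched.append({
            **login,
            "account_type": parent.get("type"),
            "account_date": parent.get("date"),
        })
    return enriched


def assign_parents(synthetic_logins, synthetic_accounts, rng):
    """Give each synthetic login a parent whose type and date match."""
    index = {}
    for a in synthetic_accounts:
        index.setdefault((a["type"], a["date"]), []).append(a["account_id"])
    assigned = []
    for login in synthetic_logins:
        key = (login["account_type"], login["account_date"])
        candidates = index.get(key, [])
        # Assumption: pick any matching parent; None if nothing matches.
        parent_id = rng.choice(candidates) if candidates else None
        assigned.append({**login, "account_id": parent_id})
    return assigned


accounts = [
    {"account_id": "a1", "type": "premium", "date": "2023-01-05"},
    {"account_id": "a2", "type": "basic", "date": "2023-02-10"},
]
logins = [
    {"login_id": "l1", "account_id": "a1"},
    {"login_id": "l2", "account_id": "a2"},
]
enriched = enrich_logins(logins, accounts)

# Stand-ins for generated data: logins carry the extra fields but no parent yet.
synthetic_accounts = [
    {"account_id": "s1", "type": "basic", "date": "2023-02-10"},
    {"account_id": "s2", "type": "premium", "date": "2023-01-05"},
]
synthetic_logins = [
    {"login_id": "x1", "account_type": "premium", "account_date": "2023-01-05"},
    {"login_id": "x2", "account_type": "basic", "account_date": "2023-02-10"},
]
assigned = assign_parents(synthetic_logins, synthetic_accounts, random.Random(0))
print(assigned)
```

The enrichment step is what lets each synthetic login "know" which kind of parent it expects, so the parent can be looked up after generation.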
Michael Platzer
Mar 30, 2023
@Tobias Hann
Let us know whether that helped to clarify.
Tobias Hann
Mar 31, 2023
Thanks! That makes it clear
@Michael Platzer
Reply
Matthias Funke
Apr 28, 2023
You make a point about the customer finding out "the hard way" that certain correlations are not retained. How does the product deal with these cases? I assume that the "offending" missing correlations DO NOT show up in the quality report. Is my assumption correct? If so, should we generate a proactive warning, or make it part of super user training?
Michael Platzer
Apr 28, 2023
Good question. It's on the product team to improve the communication around these (as part of UI, as part of documentation). But until we get there, it's on Sales / CX to explain these to customers.
(and yes, your assumption is correct, that the QA report doesn't show these, as it only shows correlations between a child and its parent table)
Reply
Tasos Tsourtis
Mar 29, 2023
Thank you Michi for this great effort in summarizing our work in a concise and educational way 🙏