An Explainer on Multi Table

Loom

Michael Platzer

almost 2 years ago

80 views

9 Comments

Ryan Noble

Apr 3, 2023

This slide is perfect for the PNC project!

Matthias Funke

July 10, 2023

Can you confirm that even if there are no additional attributes, it makes perfect sense to use a subject table with IDs only, because it ensures that events are correlated with customers? The only thing "retained" from the original data would be the sequence length, or is there anything else to be learnt?

Michael Platzer

July 10, 2023

Yes, one even needs to use a subject table to ensure that every subject is contained only once in the subject table, and thus to protect privacy at the subject level (rather than the event level).

Aside from sequence length, it is the coherence across events that is retained, i.e. you can have consistent event histories or rather erratic / random event histories. These are important statistical properties of interest, and they are retained.
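To make the ID-only subject table concrete, here is a minimal sketch in plain Python. The event rows, column meanings, and variable names are hypothetical and only illustrate the idea; this is not how the product builds its tables.

```python
from collections import Counter

# Hypothetical event rows: (customer_id, event_type).
# No other subject attributes exist in this example.
events = [
    ("c1", "login"),
    ("c1", "purchase"),
    ("c2", "login"),
    ("c1", "logout"),
    ("c2", "login"),
]

# Subject table with IDs only: one row per distinct ID,
# kept in order of first appearance.
subjects = list(dict.fromkeys(customer_id for customer_id, _ in events))

# Sequence length per subject: the number of events linked to each ID.
sequence_lengths = Counter(customer_id for customer_id, _ in events)

# Event histories grouped per subject. Coherence across events is a
# property of these per-subject sequences, not of single event rows.
histories = {
    subject: [event for customer_id, event in events if customer_id == subject]
    for subject in subjects
}
```

Each subject appears exactly once in `subjects`, while `sequence_lengths` and `histories` carry the two properties mentioned above: how many events a subject has, and how those events hang together.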

Tobias Hann

Mar 29, 2023

"Reference keys are synthesized as categoricals" ... I suppose that means it's basically like adding a column with the categorical variable "season" and one with the categorical variable "region", correct? And if the reference tables don't contain categorical data but numerical data? What will happen then?

Michael Platzer

Mar 30, 2023

Those columns are actually already there in the original data. Thus, they don't need to be added. They are just processed by the engine as categoricals, no matter whether those reference keys are strings or numerics.
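As a rough illustration of "processed as categoricals", the sketch below maps each distinct key to a code, regardless of whether the key is a string or a number. The function name `encode_as_categorical` and the sample keys are assumptions made for this example; the engine's actual encoding is not described in this thread.

```python
def encode_as_categorical(keys):
    """Map each distinct key to an integer code in order of first appearance.

    Numeric keys get no special treatment: their magnitude and ordering
    are ignored, exactly as for string keys.
    """
    codes = {}
    for key in keys:
        codes.setdefault(key, len(codes))
    return [codes[key] for key in keys], codes


# Hypothetical reference key columns, one string-valued and one numeric.
season_keys = ["winter", "summer", "winter"]
region_keys = [10, 30, 10]

season_codes, season_map = encode_as_categorical(season_keys)
region_codes, region_map = encode_as_categorical(region_keys)
```

Both columns end up with the same kind of representation, so a numeric key such as `30` is treated as a label rather than as a quantity three times larger than `10`.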

Tobias Hann

Mar 29, 2023

"That are not being leveraged" ... you mean, these data points are not taken into consideration when we build the model, i.e. the model won't benefit from that information?

Tobias Hann

Mar 29, 2023

I guess the later comment about merging additional columns from reference tables into a table that is actually synthesized, for best SD quality, confirms this.

Michael Platzer

Mar 30, 2023

yes, correct!

Tobias Hann

Mar 29, 2023

Will SCP only work with 2 sequential tables or also more (3, 4, ... n)?

Tasos Tsourtis

Mar 30, 2023

Yes it does now, thanks to the core team!

Michael Platzer

Mar 30, 2023

🙌

Michael Platzer

Mar 30, 2023

So, yes, SCP can support any number of sequential tables. And currently it seems that the training and generation time is not severely impacted by taking into account several tables.

Tobias Hann

Mar 30, 2023

🔥

Tobias Hann

Mar 29, 2023

What do you mean by "denormalize"?

Tasos Tsourtis

Mar 30, 2023

In a few words, it is the act of flattening a column that holds multiple values per record into one record per value, replicating the single-valued columns on each resulting record.

e.g. userID - age - orders

First two records:
user_1, age_user_1, [order_1_user_1, order_2_user_1]
user_2, age_user_2, [order_1_user_2, order_2_user_2]

is denormalized as four records:
user_1, age_user_1, order_1_user_1
user_1, age_user_1, order_2_user_1
user_2, age_user_2, order_1_user_2
user_2, age_user_2, order_2_user_2
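The same flattening can be sketched in plain Python. The field names and the age values are made up for illustration.

```python
# Two nested records: each user carries a list of orders.
# Ages are placeholder values for the example only.
users = [
    {"user_id": "user_1", "age": 34, "orders": ["order_1_user_1", "order_2_user_1"]},
    {"user_id": "user_2", "age": 51, "orders": ["order_1_user_2", "order_2_user_2"]},
]

# Denormalize: emit one flat record per order and repeat the
# single-valued user attributes on every record.
denormalized = [
    {"user_id": user["user_id"], "age": user["age"], "order": order}
    for user in users
    for order in user["orders"]
]
```

The two nested records become four flat ones, with `user_id` and `age` duplicated once per order.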

Michael Platzer

Mar 30, 2023

In that example, I meant to say that the `LOGIN` table is enriched with those two smart select columns from the `ACCOUNTS` table, as if you did a left join in SQL. Thus, every LOGIN knows the type and the date of its parent ACCOUNT. We synthesize everything together and, once done, use those generated extra fields to find an appropriate parent, i.e. one that has a matching type and date.
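A minimal sketch of that enrich-then-match idea, in plain Python. All table contents, field names (`acc_type`, `acc_date`), and the exact-match rule in `find_parent` are assumptions for illustration; the real matching logic may well be more lenient than strict equality.

```python
# Hypothetical parent and child tables.
accounts = [
    {"account_id": 1, "type": "basic", "date": "2023-01-05"},
    {"account_id": 2, "type": "premium", "date": "2023-02-10"},
]
logins = [
    {"login_id": 100, "account_id": 1},
    {"login_id": 101, "account_id": 2},
    {"login_id": 102, "account_id": 1},
]

accounts_by_id = {account["account_id"]: account for account in accounts}


def enrich(login):
    """Left-join style enrichment: copy the parent's type and date onto the login.

    A login without a known parent keeps None in the extra fields.
    """
    parent = accounts_by_id.get(login["account_id"], {})
    return {**login, "acc_type": parent.get("type"), "acc_date": parent.get("date")}


enriched_logins = [enrich(login) for login in logins]


def find_parent(generated_login, generated_accounts):
    """Return the ID of the first account whose type and date match, else None."""
    wanted = (generated_login["acc_type"], generated_login["acc_date"])
    for account in generated_accounts:
        if (account["type"], account["date"]) == wanted:
            return account["account_id"]
    return None


# Stand-in for a generated login row that carries only the extra fields.
generated_login = {"acc_type": "premium", "acc_date": "2023-02-10"}
matched_parent = find_parent(generated_login, accounts)
```

Here `accounts` doubles as the stand-in for the generated parent table, so the generated login is assigned to account `2`, the only one with a matching type and date.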

Michael Platzer

Mar 30, 2023

@Tobias Hann

Let us know whether that helped to clarify.

Tobias Hann

Mar 31, 2023

Thanks! That makes it clear

@Michael Platzer

Matthias Funke

Apr 28, 2023

You make a point about the customer finding out "the hard way" that certain correlations are not retained. How does the product deal with these cases? I assume that the "offending" missing correlations DO NOT show up in the quality report. Is my assumption correct? If so, should we generate a proactive warning, or make it part of super user training?

Michael Platzer

Apr 28, 2023

Good question. It's on the product team to improve the communication around these (as part of UI, as part of documentation). But until we get there, it's on Sales / CX to explain these to customers.

(and yes, your assumption is correct, that the QA report doesn't show these, as it only shows correlations between a child and its parent table)

Tasos Tsourtis

Mar 29, 2023

Thank you Michi for this great effort in summarizing our work in a concise and educational way 🙏

Michael Platzer

Mar 30, 2023

🙌

Felipe Calderero

July 26, 2024

Thank you Michael! 🙏