Soft Constraints for Data Management
Integrity constraints such as functional dependencies (FDs) and multivalued dependencies (MVDs) are fundamental to database schema design, query optimization, and the enforcement of data integrity. Current data-intensive applications, such as ML algorithms, process observational data that is often unnormalized, inconsistent, erroneous, and noisy. In these applications, the constraints often need to be inferred from the data and are not required to hold exactly; it suffices that they hold to a certain degree. In this work, we use information theory to quantify the degree to which a constraint is satisfied. This gives rise to two major challenges that I will cover in this talk: the implication problem for soft constraints, and the discovery of soft constraints in data. The implication problem for soft constraints asks whether a set of constraints (the antecedents) that hold in the data to a large degree implies a high degree of satisfaction of another constraint (the consequent). The implication problem has been investigated in both the database and AI literature, but only under the assumption that all constraints hold exactly; our work extends this to the case of soft constraints. Next, we address the problem of mining soft constraints from data and present an algorithm for discovering complete schemas from data. The algorithm employs pruning techniques that exploit the properties of the information-theoretic measures associated with the constraints, allowing it to scale to datasets with up to 1M tuples and up to 30 attributes.
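To make the idea of a constraint holding "to a certain degree" concrete, here is a minimal sketch in Python. The abstract does not say which information-theoretic measures are used, so the choices below are assumptions: the conditional entropy H(Y | X) of the empirical distribution for an FD X -> Y, and the conditional mutual information I(Y; Z | X) for an MVD X ->> Y | Z. Both are zero when the constraint holds exactly and grow as it is violated. The function names and the toy relation are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical Shannon entropy (in bits) of rows projected onto attrs."""
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def fd_violation(rows, lhs, rhs):
    """Assumed measure for the FD lhs -> rhs: H(rhs | lhs) = H(lhs, rhs) - H(lhs)."""
    return entropy(rows, lhs + rhs) - entropy(rows, lhs)


def mvd_violation(rows, x, y, z):
    """Assumed measure for the MVD x ->> y | z: the conditional mutual information
    I(y; z | x) = H(x, y) + H(x, z) - H(x, y, z) - H(x)."""
    return (entropy(rows, x + y) + entropy(rows, x + z)
            - entropy(rows, x + y + z) - entropy(rows, x))


# Hypothetical toy relation: the last tuple is a noisy spelling of the city,
# so the FD zip -> city holds only approximately.
rows = [
    {"zip": "98101", "city": "Seattle"},
    {"zip": "98101", "city": "Seattle"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
]

noisy_score = fd_violation(rows, ["zip"], ["city"])      # positive: FD is soft
clean_score = fd_violation(rows[:3], ["zip"], ["city"])  # zero: FD holds exactly
```

Under these assumptions, a soft constraint is one whose score stays below a chosen threshold rather than being exactly zero. The implication problem then asks whether small scores for the antecedents guarantee a small score for the consequent.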