
Commit ef9b9fa

Enable streaming (#24)
* enable streaming
* scaffolding for simpleExpr validation
* completed refactor -- tests outstanding
* refactor and enablement complete
* updated readme
* added implicit boolean
* added filter for summary report
* Update ValidatorTestSuite (#19)
* Update Validator tests with API changes.
* Add tests for implicit and explicit expression rules.
* imported outstanding spark sql functions
* Add test suite for Rules class.
* Add tests for RuleSet class.
* Add test for complex expressions on aggregates.
* Fix isGrouped bug when groupBys array is empty by default or explicitly set.
* Fix overloaded add function that merges 2 RuleSets.
* Add ignoreCase and invertMatch to ValidateStrings and ValidateNumerics rule types.
* Update documentation with latest features in categorical Rules.

Co-authored-by: Daniel Tomes [GeekSheikh] <10840635+geeksheikh@users.noreply.github.com>

* Update sbt (#23)
* simple update to build sbt
* Add scoverage.

Co-authored-by: Will Girten <will.girten@databricks.com>

* removed unused imports
* Accept expanded sequence of Rules to RuleSet Class.
* cleaning up (#30)
* cleaning up
* removed dependencies from assembly
* Fix whitespaces and special characters in Rule Names (#25)
* Parse white spaces and special characters in failure report.
* Update variable name with more meaningful name.
* Add method to remove whitespace and special characters from Rule names.
* Simplify ruleName public accessor.
* Change special character replacement to underscores.
* Update warning messages and assign private ruleName only once.
* Update demo notebook (#33)
* Update demo notebook with examples of latest features added.
* added scala demo example

Co-authored-by: Daniel Tomes [GeekSheikh] <10840635+geeksheikh@users.noreply.github.com>

* implemented new inclusive boundaries option (#32)
* implemented new inclusive boundaries option
* enhanced logic for upper and lower inclusivity
* readme updated
* Update validation logic for Bounds class. Add test case for inclusive boundary rules. (#35)

Co-authored-by: Will Girten <47335283+goodwillpunning@users.noreply.github.com>

Co-authored-by: Will Girten <47335283+goodwillpunning@users.noreply.github.com>
Co-authored-by: Will Girten <will.girten@databricks.com>
1 parent 18c0cfa commit ef9b9fa

18 files changed: +1702, -688 lines.
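For orientation, the headline additions in this commit are streaming support plus the new categorical-rule options (`invertMatch`, `ignoreCase`) and implicit boolean expression rules. A condensed sketch of the new surface, using only calls that appear in the demo notebook diff below and assuming an existing input DataFrame `df`:

```scala
import com.databricks.labs.validation._
import org.apache.spark.sql.functions._

// New invertMatch flag: fail rows whose values ARE in the list
val noBadSkus = Rule("Invalid_Skus", col("sku"), Array(9123456, 9122987), invertMatch = true)

// New ignoreCase flag: case-insensitive string lookups
val validRegions = Rule("Valid_Regions", col("region"), Array("Northeast", "Southeast"), ignoreCase = true)

// Implicit boolean rule: any Boolean Column expression
val positiveTotal = Rule("total greater than 0", col("total_amount") > 0)

// RuleSet now also accepts streaming DataFrames (see the taxi example in demo/Example.scala)
val results = RuleSet(df).add(noBadSkus).add(validRegions).add(positiveTotal).validate()
```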

README.md (+242, -72): large diff not rendered.

build.sbt (+23, -6)

```diff
@@ -2,7 +2,7 @@ name := "dataframe-rules-engine"
 
 organization := "com.databricks.labs"
 
-version := "0.1.2"
+version := "0.2.0"
 
 scalaVersion := "2.12.12"
 scalacOptions ++= Seq("-Xmax-classfile-name", "78")
@@ -23,29 +23,46 @@ publishTo := Some(
 
 libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided
 libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1" % Provided
-libraryDependencies += "org.scalactic" %% "scalactic" % "3.2.6"
 libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.6" % Test
 
+run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
+runMain in Compile := Defaults.runMainTask(fullClasspath in Compile, runner in (Compile, run)).evaluated
+
 lazy val excludes = jacocoExcludes in Test := Seq()
 
-lazy val jacoco = jacocoReportSettings in test :=JacocoReportSettings(
+lazy val jacoco = jacocoReportSettings in test := JacocoReportSettings(
   "Jacoco Scala Example Coverage Report",
   None,
-  JacocoThresholds (branch = 100),
+  JacocoThresholds(branch = 100),
   Seq(JacocoReportFormats.ScalaHTML,
     JacocoReportFormats.CSV),
   "utf-8")
 
 val jacocoSettings = Seq(jacoco)
-lazy val jse = (project in file (".")).settings(jacocoSettings: _*)
+lazy val jse = (project in file(".")).settings(jacocoSettings: _*)
 
 fork in Test := true
 javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:+CMSClassUnloadingEnabled")
 testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD")
 
 
 lazy val commonSettings = Seq(
-  version := "0.1.2",
+  version := "0.2.0",
   organization := "com.databricks.labs",
   scalaVersion := "2.12.12"
 )
+
+assemblyMergeStrategy in assembly := {
+  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
+  case x => MergeStrategy.first
+}
+assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
+
+// exclude Spark jars from the assembly (the Scala library is dropped via includeScala = false above)
+assemblyExcludedJars in assembly := {
+  val cp = (fullClasspath in assembly).value
+  cp filter { f =>
+    f.data.getName.contains("spark-core") ||
+    f.data.getName.contains("spark-sql")
+  }
+}
```
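A note on the new assembly settings: the Spark artifacts are already marked `Provided`, and the classpath filter plus `includeScala = false` keep them and the Scala library out of the fat jar. A trivial sketch of how that filename predicate behaves, using hypothetical jar names:

```scala
// Hypothetical classpath entries; the predicate mirrors the filter in build.sbt above
val jars = Seq(
  "spark-core_2.12-3.0.1.jar",
  "spark-sql_2.12-3.0.1.jar",
  "dataframe-rules-engine_2.12-0.2.0.jar"
)
val excluded = jars.filter(n => n.contains("spark-core") || n.contains("spark-sql"))
// excluded: List(spark-core_2.12-3.0.1.jar, spark-sql_2.12-3.0.1.jar)
```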

codecov.yml (+7)

```diff
@@ -0,0 +1,7 @@
+ignore:
+  - "src/test/**/*"
+  - "target/**/*"
+  - "images/**/*"
+  - "project/**/*"
+  - ".github/**/*"
+  - "src/main/scala/com/databricks/labs/validation/utils/SparkSessionWrapper.scala"
```

demo/Example.scala (+229, -71)

```diff
@@ -1,65 +1,24 @@
-package com.databricks.labs.validation
-
-import com.databricks.labs.validation.utils.{Lookups, SparkSessionWrapper}
+// Databricks notebook source
 import com.databricks.labs.validation.utils.Structures._
-import org.apache.spark.sql.Column
+import com.databricks.labs.validation._
 import org.apache.spark.sql.functions._
+import org.apache.spark.sql.{Column, DataFrame}
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC # Sample Dataset
+
+// COMMAND ----------
 
-object Example extends App with SparkSessionWrapper {
-  import spark.implicits._
-
-  /**
-   * Validation example
-   * Passing pre-built array of rules into a RuleSet and validating a non-grouped dataframe
-   */
-
-  /**
-   * Example of a proper UDF to simplify rules logic. Simplification UDFs should take in zero or many
-   * columns and return one column
-   * @param retailPrice column 1
-   * @param scanPrice column 2
-   * @return result column of applied logic
-   */
-  def getDiscountPercentage(retailPrice: Column, scanPrice: Column): Column = {
-    (retailPrice - scanPrice) / retailPrice
-  }
-
-  // Example of creating array of custom rules
-  val specializedRules = Array(
-    Rule("Reasonable_sku_counts", count(col("sku")), Bounds(lower = 20.0, upper = 200.0)),
-    Rule("Max_allowed_discount",
-      max(getDiscountPercentage(col("retail_price"), col("scan_price"))),
-      Bounds(upper = 90.0)),
-    Rule("Retail_Price_Validation", col("retail_price"), Bounds(0.0, 6.99)),
-    Rule("Unique_Skus", countDistinct("sku"), Bounds(upper = 1.0))
-  )
-
-  // It's common to generate many min/max boundaries. These can be generated easily
-  // The generator function can easily be extended or overridden to satisfy more complex requirements
-  val minMaxPriceDefs = Array(
-    MinMaxRuleDef("MinMax_Sku_Price", col("retail_price"), Bounds(0.0, 29.99)),
-    MinMaxRuleDef("MinMax_Scan_Price", col("scan_price"), Bounds(0.0, 29.99)),
-    MinMaxRuleDef("MinMax_Cost", col("cost"), Bounds(0.0, 12.0))
-  )
-
-  val minMaxPriceRules = RuleSet.generateMinMaxRules(minMaxPriceDefs: _*)
-  val someRuleSet = RuleSet(df)
-  someRuleSet.addMinMaxRules(minMaxPriceDefs: _*)
-  someRuleSet.addMinMaxRules("Retail_Price_Validation", col("retail_price"), Bounds(0.0, 6.99))
-
-
-  val catNumerics = Array(
-    Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs),
-    Rule("Valid_Skus", col("sku"), Lookups.validSkus)
-  )
-
-  val catStrings = Array(
-    Rule("Valid_Regions", col("region"), Lookups.validRegions)
-  )
-
-  //TODO - validate datetime
-  // Test, example data frame
-  val df = sc.parallelize(Seq(
+object Lookups {
+  final val validStoreIDs = Array(1001, 1002)
+  final val validRegions = Array("Northeast", "Southeast", "Midwest", "Northwest", "Southcentral", "Southwest")
+  final val validSkus = Array(123456, 122987, 123256, 173544, 163212, 365423, 168212)
+  final val invalidSkus = Array(9123456, 9122987, 9123256, 9173544, 9163212, 9365423, 9168212)
+}
+
+val df = sc.parallelize(Seq(
   ("Northwest", 1001, 123456, 9.32, 8.99, 4.23, "2020-02-01 00:00:00.000"),
   ("Northwest", 1001, 123256, 19.99, 16.49, 12.99, "2020-02-01"),
   ("Northwest", 1001, 123456, 0.99, 0.99, 0.10, "2020-02-01"),
@@ -75,19 +34,218 @@ object Example extends App with SparkSessionWrapper {
   .withColumn("create_ts", 'create_ts.cast("timestamp"))
   .withColumn("create_dt", 'create_ts.cast("date"))
 
-  // Doing the validation
-  // The validate method will return the rules report dataframe which breaks down which rules passed and which
-  // rules failed and how/why. The second return value returns a boolean to determine whether or not all tests passed
-  // val (rulesReport, passed) = RuleSet(df, Array("store_id"))
-  val (rulesReport, passed) = RuleSet(df)
-    .add(specializedRules)
-    .add(minMaxPriceRules)
-    .add(catNumerics)
-    .add(catStrings)
-    .validate(2)
+// COMMAND ----------
+
+display(df)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC # Rule Types
+// MAGIC There are several Rule types available:
+// MAGIC
+// MAGIC 1. Categorical (numerical and string) - used to validate whether row values fall in a pre-defined list of values, e.g. lookups
+// MAGIC 2. Boundaries - used to validate whether row values fall within a range of numerical values
+// MAGIC 3. Expressions - used to validate whether row values pass expressed conditions. These can be simple expressions like a Boolean column `col('valid')`, or complex, like `col('a') - col('b') > 0.0`
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC ### Example 1: Writing your first Rule
+// MAGIC Let's look at a very simple example...
+
+// COMMAND ----------
+
+// First, begin by defining your RuleSet by passing in your input DataFrame
+val myRuleSet = RuleSet(df)
+
+// Next, define a Rule that validates that the `store_id` values fall within a list of pre-defined Store Ids
+val validStoreIdsRule = Rule("Valid_Store_Ids_Rule", col("store_id"), Array(1001, 1002))
 
-  rulesReport.show(200, false)
-  // rulesReport.printSchema()
+// Finally, add the Rule to the RuleSet and validate!
+val validationResults = myRuleSet.add(validStoreIdsRule).validate()
 
+// COMMAND ----------
 
+// MAGIC %md
+// MAGIC ## Viewing the Validation Results
+// MAGIC
+// MAGIC The result of calling `validate()` on your RuleSet is two DataFrames - a complete report and a summary report.
+// MAGIC
+// MAGIC #### The completeReport
+// MAGIC The complete report is verbose and adds all rule validations to the right side of the original df
+// MAGIC passed into RuleSet. Note that if the RuleSet is grouped, the result will include the groupBy columns and all rule
+// MAGIC evaluation specs and results
+// MAGIC
+// MAGIC #### The summaryReport
+// MAGIC The summary report is meant to be just that, a summary of the failed rules. It returns only the records that
+// MAGIC failed and only the rules that failed for those records; thus, if `summaryReport.isEmpty` then all rules passed.
+
+// COMMAND ----------
+
+// Let's look at the completeReport from the example above
+display(validationResults.completeReport)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC ## Example 2: Boundaries
+// MAGIC Boundary Rules can be used to validate whether row values fall within a range of numerical values.
+// MAGIC
+// MAGIC It's quite common to generate many min/max boundaries; these can be passed as an Array of Rules.
+
+// COMMAND ----------
+
+// Let's define several Boundary Rules to apply
+val minMaxPriceDefs = Array(
+  MinMaxRuleDef("MinMax_Sku_Price", col("retail_price"), Bounds(0.0, 29.99)),
+  MinMaxRuleDef("MinMax_Scan_Price", col("scan_price"), Bounds(0.0, 29.99)),
+  MinMaxRuleDef("MinMax_Cost", col("cost"), Bounds(0.0, 12.0))
+)
+
+// Add all the Rules at once using the array of Rules
+val minMaxPriceRules = RuleSet(df).addMinMaxRules(minMaxPriceDefs: _*)
+
+// Validate rows against all the Boundary Rules
+val validationResults = minMaxPriceRules.validate()
+
+// Let's look at the failed rows this time
+display(validationResults.summaryReport)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC ## Example 3: Expressions
+// MAGIC Expressions can be used to validate whether row values pass expressed conditions.
+// MAGIC
+// MAGIC These can be simple expressions like a Boolean column `col('valid')`, or complex, like `col('a') - col('b') > 0.0`
+
+// COMMAND ----------
+
+// Ensure that each product has a distinct Product SKU
+val distinctProductsRule = Rule("Unique_Skus", countDistinct("sku"), Bounds(upper = 1.0))
+
+// Rules can even be used in conjunction with user-defined functions
+def getDiscountPercentage(retailPrice: Column, scanPrice: Column): Column = {
+  (retailPrice - scanPrice) / retailPrice
 }
+
+val maxDiscountRule = Rule("Max_allowed_discount",
+  max(getDiscountPercentage(col("retail_price"), col("scan_price"))),
+  Bounds(upper = 90.0))
+
+// Notice the builder pattern. The idea is to build up your rules and then add them to your RuleSet[s].
+// RuleSets can be combined using the RuleSet.add(ruleSet: RuleSet) method
+var productRuleSet = RuleSet(df).add(distinctProductsRule)
+  .add(maxDiscountRule)
+
+// ...or add Rules together as an Array
+val specializedProductRules = Array(distinctProductsRule, maxDiscountRule)
+productRuleSet = RuleSet(df).add(specializedProductRules: _*)
+
+val validationResults = productRuleSet.validate()
+
+display(validationResults.summaryReport)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC ### Inverting matches
+// MAGIC We can even invert the match to validate that row values do not fall in a list of values
+
+// COMMAND ----------
+
+// Invert match to ensure values are **not** in a LOV
+val invalidStoreIdsRule = Rule("Invalid_Store_Ids_Rule", col("store_id"), Array(9001, 9002, 9003), invertMatch = true)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC ### Case-sensitivity
+// MAGIC Case-sensitivity is enabled by default. However, an optional `ignoreCase` parameter can be used to apply/not apply case sensitivity to a list of String values
+
+// COMMAND ----------
+
+// Numerical categorical rules. Build a list of values to be validated against.
+val catNumerics = Array(
+  // Only allow store_ids in my validStoreIDs lookup
+  Rule("Valid_Stores", col("store_id"), Lookups.validStoreIDs),
+  // Validate against a pre-built list of skus that have been verified to be accurate
+  // Currently this is manually created for demo but can easily be created from a dataframe, etc.
+  Rule("Valid_Skus", col("sku"), Lookups.validSkus),
+  // Ensure that the skus do not match any of the invalid skus defined earlier
+  Rule("Invalid_Skus", col("sku"), Lookups.invalidSkus, invertMatch = true)
+)
+
+// Validate strings as well as numericals. They don't need to be in a separate array; it's just done here for demonstration
+val catStrings = Array(
+  // Case-sensitivity is enabled by default. However, the `ignoreCase` parameter can be used
+  // to apply/not apply case sensitivity to a list of String values
+  Rule("Valid_Regions", col("region"), Lookups.validRegions, ignoreCase = true)
+)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC # Aggregates
+// MAGIC DataFrames can be simple, or a Seq of columns can be passed in as "bys" for the DataFrame to be grouped by. <br>
+// MAGIC If the DataFrame is grouped, validations will be per group
+
+// COMMAND ----------
+
+// Grouped DataFrame
+// Let's assume we want to perform validation by some grouping of one or many columns
+val validationResults = RuleSet(df, Array("store_id"))
+  .add(specializedProductRules)
+  .add(minMaxPriceRules)
+  .add(catNumerics)
+  .add(catStrings)
+  .validate()
+
+display(validationResults.summaryReport)
+
+// COMMAND ----------
+
+// MAGIC %md
+// MAGIC ## Streaming DataFrames
+// MAGIC Rules can be applied to streaming DataFrames, as well.
+
+// COMMAND ----------
+
+val yellowTaxi = spark.readStream
+  .format("delta")
+  .option("maxBytesPerTrigger", (1024 * 1024 * 4).toString)
+  .load("/databricks-datasets/nyctaxi/tables/nyctaxi_yellow")
+
+// COMMAND ----------
+
+val validPaymentTypes = Array("Cash", "Credit")
+val rangeRules = Array(
+  MinMaxRuleDef("Pickup Longitude On Earth", 'pickup_longitude, Bounds(-180, 180)),
+  MinMaxRuleDef("Dropoff Longitude On Earth", 'dropoff_longitude, Bounds(-180, 180)),
+  MinMaxRuleDef("Pickup Latitude On Earth", 'pickup_latitude, Bounds(-90, 90)),
+  MinMaxRuleDef("Dropoff Latitude On Earth", 'dropoff_latitude, Bounds(-90, 90)),
+  MinMaxRuleDef("Realistic Passenger Count", 'passenger_count, Bounds(1, 10))
+)
+
+val taxiBaseRules = Array(
+  Rule("dropoff after pickup", (unix_timestamp('dropoff_datetime) * 1.05).cast("long") >= unix_timestamp('pickup_datetime)),
+  Rule("total is sum of parts", 'fare_amount + 'extra + 'mta_tax + 'tip_amount + 'tolls_amount, 'total_amount),
+  Rule("total greater than 0", 'total_amount > 0),
+  Rule("valid payment types", lower('payment_type), validPaymentTypes)
+)
+
+val yellowTaxiReport = RuleSet(yellowTaxi)
+  .add(taxiBaseRules: _*)
+  .addMinMaxRules(rangeRules: _*)
+  .validate()
+
+// COMMAND ----------
+
+display(
+  yellowTaxiReport.summaryReport
+)
+
+// COMMAND ----------
```
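Since `yellowTaxi` is read with `readStream`, the reports it produces are themselves streaming DataFrames, which is the point of this commit. A minimal sketch of persisting the failures continuously, assuming the summary report stays streaming when the input is streaming; both paths are hypothetical placeholders:

```scala
// Continuously sink failed records to a Delta table.
// Checkpoint and output paths below are hypothetical.
val failuresQuery = yellowTaxiReport.summaryReport.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/checkpoints/taxi_rule_failures")
  .start("/tmp/delta/taxi_rule_failures")
```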

demo/Rules_Engine_Examples.dbc (49.4 KB): binary file not shown.

demo/Rules_Engine_Examples.html (+25, -24): large diff not rendered.

project/plugins.sbt (+1)

```diff
@@ -1,3 +1,4 @@
+addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
 addSbtPlugin("com.github.sbt" % "sbt-jacoco" % "3.0.3")
 addSbtPlugin("com.github.sbt" % "sbt-pgp" % "2.1.2")
 addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "2.3")
```
