- Various dependency updates, see the GitHub log
- Test and build against Spark 3.3
- Various dependency updates
- Push jars built against Spark 3.2 by default
- Various dependency updates
- Fix issue where extracting from an empty MSSQL table with a datetime column fails
- Update various dependencies (sttp, some maven plugins)
- Fix issue with dependency resolution inside Databricks for Waimak jars
- Update various dependencies
- Drop Scala 2.11 and Spark 2 from the build
- Build against Spark 3.0, 3.1 and 3.2
- Add an experimental Scala 2.13 build
- Use Azure DevOps for all build and deploy steps
- Push Scala 2.12 jars built against Spark 3.1.2
- Update Deequ to use the new Scala 2.12 and Spark 3 version
- Publish the Deequ jar for Spark 3 builds
- Change build profiles
- Start removing Scala 2.11 builds. From here on, builds will target Spark 3+ and Scala 2.12+. Jars for Scala 2.11 after this release are unlikely.
- Update version of scalatest
- Add Spark 3 build and publish against Spark 3, including a new Databricks jar
- Fixed a bug in the SQL Server temporal extractor where it would fail to extract from empty tables.
- Optimise the SQL Server temporal extractor further to select the appropriate upper bound timestamp when selecting from a history table. This should bring in fewer rows than before.
- Fix an issue with the SQL Server temporal extractor where rows were never being removed from storage, due to a column having been added in error. Also try to improve the history query.
- Fixed a bug in the WriteAsNamedFilesAction
- Fixed a bug in the SQL Server temporal extractor, which caused extractions to not be isolated within a transaction
- Delimiter for the Deequ `genericSQLCheck` changed from a comma to a semi-colon to allow for checks which have commas in them
- `writeAsNamedFiles` now creates the destination directory before performing the write
- Refactored internal scheduler code to make it more functional
- Metadata and configuration extension functionality
- `sparkCache` actions
- `cacheAsParquet` and `sparkCache` configuration extensions
- `writeAsNamedFiles` action
- Data quality monitoring with Amazon Deequ
- Using F-bounded types on `DataFlow`
- High-level storage layer API exposed outside of actions (in addition to existing actions)
- Spark 2.4.3 now used for Scala 2.12 build (no longer experimental)
- waimak-rdbm-export and waimak-azure-table modules
- Introduced retry mechanism for `PropertyProvider`
- `HiveDBConnector` now works for complex types (e.g. `MapType`, `StructType`)
- `HiveEnv` environment cleanup no longer fails if the database does not exist
- Can now create or cleanup multiple environments at once in the `EnvironmentManager`
- An issue where multiple `HiveSparkSQLConnector` instances with different databases were interfering with each other (tables were being committed into each others' databases)
- The `errorOnUnexecutedActions` flag can now be configured in the `WaimakEnv` configuration class in the Waimak-App module
- An optional Map can now be passed to the case class configuration parser where the parser will look for additional properties not found elsewhere
- Added `withExecutor` and `execute` functions onto DataFlows, allowing DataFlows to be executed inline without needing to create an Executor explicitly and call execute
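  A minimal sketch of the inline execution style this enables. The flow-building calls follow the usual Waimak README pattern; the exact signature of the new inline `execute` is an assumption based on this entry:

  ```scala
  import com.coxautodata.waimak.dataflow.Waimak
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Before: build a flow and run it with an explicit executor
  val flow = Waimak.sparkFlow(spark)
    .openCSV("/data/input")("csv_1") // illustrative path
    .alias("csv_1", "items")
    .show("items")

  Waimak.sparkExecutor().execute(flow)

  // After this change, the same flow can be run inline
  // (method name from this entry; exact signature assumed)
  flow.execute()
  ```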
- When using the case class configuration parser, the prefix `spark.` will automatically be added to configuration keys when looking in the SparkConf. If a different prefix is required (e.g. `spark.projectname.`), this can be configured using `spark.waimak.config.sparkConfPropertyPrefix`
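  A hedged sketch of the prefix behaviour described above. The configuration keys are taken from this entry; the case class, property names and the `CaseClassConfigParser`/`SparkFlowContext` call shapes are assumptions for illustration:

  ```scala
  import com.coxautodata.waimak.configuration.CaseClassConfigParser // import path assumed
  import com.coxautodata.waimak.dataflow.spark.SparkFlowContext     // import path assumed
  import org.apache.spark.sql.SparkSession

  // Hypothetical configuration case class
  case class MyAppConfig(inputPath: String, maxRows: Int = 1000)

  // With the prefix override, the parser looks up spark.projectname.app.inputPath
  // in the SparkConf instead of spark.app.inputPath
  val spark = SparkSession.builder()
    .master("local[*]")
    .config("spark.waimak.config.sparkConfPropertyPrefix", "spark.projectname.")
    .config("spark.projectname.app.inputPath", "/data/in")
    .getOrCreate()

  // Call shape assumed: parse MyAppConfig using the key prefix "app."
  val conf = CaseClassConfigParser[MyAppConfig](SparkFlowContext(spark), "app.")
  ```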
- Parallel scheduler no longer hangs when fatal exceptions occur in actions, but fails the application instead
- Databases now set a location based on `baseDatabaseLocation/databaseName` in the experimental Waimak App Env class
- The temporary folder (if given) now gets cleaned up after a flow has been successfully executed. This behaviour can be disabled by setting the configuration property `spark.waimak.dataflow.removeTempAfterExecution` to `false`. In all cases, the directory will not be deleted if flow execution fails (see the configuration sketch below).
- Storage compaction no longer performs `hot -> cold` followed by `cold -> cold` compactions, instead compacting all of the hot regions plus the cold regions under the configured row threshold in a single pass. This reduces complexity and removes an additional round of IOPs and a Spark stage.
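  A small sketch of disabling the temporary-folder cleanup described in the first entry above, assuming the property is picked up from the Spark configuration:

  ```scala
  import org.apache.spark.sql.SparkSession

  // Keep the flow's temporary folder after a successful execution
  // (property name taken from the entry above)
  val spark = SparkSession.builder()
    .master("local[*]")
    .config("spark.waimak.dataflow.removeTempAfterExecution", "false")
    .getOrCreate()
  ```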
- Generic mechanism to provide properties when using the `CaseClassConfigParser`. An object extending the `PropertyProviderBuilder` can now be configured using the `spark.waimak.config.propertyProviderBuilderObjects` configuration parameter. The `PropertiesFilePropertyProviderBuilder` and `DatabricksSecretsPropertyProviderBuilder` implementations are provided in Waimak (see the sketch below)
- Added feature to optionally remove history from storage tables during compaction by setting the `retain_history` flag in the `AuditTableInfo` metadata. For RDBM ingestion actions, history is removed if no last updated column is provided in the metadata. This can be explicitly disabled by setting `forceRetainStorageHistory` to `Some(true)`
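  A hedged sketch of enabling the property provider mechanism described two entries above. The configuration key comes from this changelog; the fully qualified object name is an assumption about where `DatabricksSecretsPropertyProviderBuilder` lives:

  ```scala
  import org.apache.spark.sql.SparkSession

  // Resolve CaseClassConfigParser properties from Databricks secrets
  // (package path of the builder object is assumed)
  val spark = SparkSession.builder()
    .config("spark.waimak.config.propertyProviderBuilderObjects",
      "com.coxautodata.waimak.configuration.DatabricksSecretsPropertyProviderBuilder")
    .getOrCreate()
  ```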
- Added Waimak configuration parameter `spark.waimak.storage.updateMetadata` to force the update of table metadata (e.g. if the primary key or history retention has changed)
- Added new experimental module containing currently unstable features
- The force recreate mechanism for recreating tables created using a `HadoopDBConnector` has been moved to the Waimak configuration parameter `spark.waimak.metastore.forceRecreateTables`
- All paths used in DDLs submitted through a `HadoopDBConnector` now use the full FileSystem URI, including the scheme
- Bug in `TotalBytesPartitioner` and `TotalCellsPartitioner` where 0 partitions were returned
- Additional definitions of the `commit` and `writePartitionedParquet` actions have been added that now take an integer to repartition by
- Added a generic mechanism for calculating the number of files to generate during a recompaction. The default implementation (`TotalBytesPartitioner`) partitions by the estimated final output size, and an alternative implementation (`TotalCellsPartitioner`) partitions by the number of cells (numRows * numColumns)
- The URI used to create the FileSystem in the `SparkFlowContext` object is now also exposed in the object
- Breaking changes to the storage action API to simplify their use. Storage tables are now opened as Waimak actions and are stored as entities on the flow. As a result, the `getOrCreateAuditTable`, `writeToStorage` and `snapshotFromStorage` actions can now be done in the same flow
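  A rough sketch of the reworked storage API described above. The action names come from this entry, but the parameter lists and imports are illustrative assumptions rather than exact signatures:

  ```scala
  import java.sql.Timestamp
  import com.coxautodata.waimak.dataflow.Waimak
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  val storageBasePath = "/data/storage" // hypothetical location
  val now = new Timestamp(System.currentTimeMillis())

  // Open the audit table, write into storage and snapshot back out, all in one flow.
  // Parameter lists below are illustrative only.
  val flow = Waimak.sparkFlow(spark)
    .getOrCreateAuditTable(storageBasePath)("my_table")
    .writeToStorage("my_table", "last_updated", now)
    .snapshotFromStorage(storageBasePath, now)("my_table")

  Waimak.sparkExecutor().execute(flow)
  ```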
- The force recompaction flag no longer requires an open compaction window
- Test Jar source artifacts are now generated and deployed
- Parallel scheduler: By default flows will be executed by the parallel scheduler allowing multiple independent actions to be executed at the same time within the same Spark session
- Added `commit` and `push` actions used to commit to permanent storage on the flow using an implementation of `DataCommitter`. The provided `ParquetDataCommitter` implementation can be used to commit labels to a Hadoop FileSystem in Parquet, and optionally to a Hadoop-based Metastore (see the sketch below)
- Added a Hive implementation of the `HadoopDBConnector` trait used for committing DDLs for underlying parquet files. This includes an implementation of the Hive SQL dialect (trait `HiveDBConnector`) and a class that submits Hive queries via `sparkSession.sql` (`HiveSparkSQLConnector`)
- Added a new action `writeHiveManagedTable` that creates Hive tables using `saveAsTable` on the `DataFrameWriter` class
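  A hedged sketch combining the `commit`/`push` and Hive connector entries above. The class and action names come from this changelog; the import paths, constructor arguments and the builder method used to attach the connector are assumptions:

  ```scala
  import com.coxautodata.waimak.dataflow.Waimak
  import com.coxautodata.waimak.dataflow.spark.{ParquetDataCommitter, SparkFlowContext} // import paths assumed
  import com.coxautodata.waimak.metastore.HiveSparkSQLConnector                         // import path assumed
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
  val context = SparkFlowContext(spark) // constructor shape assumed

  // Commit a label, then push it as Parquet and register it in a Hive database.
  // ParquetDataCommitter/HiveSparkSQLConnector arguments are illustrative only.
  val flow = Waimak.sparkFlow(spark)
    .openCSV("/data/input")("csv_1")
    .alias("csv_1", "report")
    .commit("daily_commit")("report")
    .push("daily_commit")(
      ParquetDataCommitter("/data/output")
        .withHadoopDBConnector(HiveSparkSQLConnector(context, "reports_db"))
    )

  Waimak.sparkExecutor().execute(flow)
  ```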
- Added `SQLServerExtractor` for ingesting data from older (pre-2016) versions of SQL Server
- Added global configuration parameters for the storage layer
- Added experimental Scala 2.12 support
- An exception is now thrown by default if any actions in the flow did not run. This can be overridden with an additional boolean option to the execute method on the executor object
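  A short sketch of overriding the new behaviour. The entry only says the `execute` method takes an additional boolean option; the parameter name below reuses the `errorOnUnexecutedActions` flag named elsewhere in these notes and is an assumption:

  ```scala
  import com.coxautodata.waimak.dataflow.Waimak
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  val flow = Waimak.sparkFlow(spark) // actions added elsewhere

  // Do not fail the run if some actions could not be executed
  // (parameter name assumed)
  Waimak.sparkExecutor().execute(flow, errorOnUnexecutedActions = false)
  ```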
- An exception is now thrown if a flow attempts to register a Spark view with an invalid label name
- Renamed `SimpleSparkDataFlow` to `SparkDataFlow`
- Removed unnecessary flow context on Interceptor API
- All Impala `HadoopDBConnector` classes now take a `SparkFlowContext` object to allow them to work on filesystems other than the one given in `fs.defaultFS`
- Removed `stageAndCommitParquet` and `stageAndCommitParquetToDB` actions in favour of the new `commit` and `push` approach
- Fixed issue where reading from the storage layer can fail whilst another job is writing to the storage layer (non-compaction write) due to cached region info being overwritten
- Partitions within regions in the storage layer are now calculated on total size (`columns * rows`) instead of just number of rows. This should reduce partition sizes in the case of wide tables
- Corruption detection for region info cache in the storage layer added in the case of job failures during previous writes
- Optimised Waimak flow validator in the case of large/complex flows
- Upgraded the Scala 2.11 compiler version due to a vulnerability in earlier versions
- Allowing optional output prefix on labels when reading from the storage layer
- Support of custom properties for JDBC connections using the Metastore Utils by passing either a `Properties` object or a `Map` so they can be read securely from a `JCEKS` file
- Removed support for Spark 2.0 and Spark 2.1
- Trash deletion feature in Waimak-Storage that will clean up old region compactions stored in `.Trash`
- Interceptor actions will now show details of the actions they intercepted and the actions they intercepted them with in the Spark UI
- Single cold partitions will no longer be recompacted into themselves in the storage layer
- Azure Table uploader will now clean up thread pools to prevent exhausting system threads after being invoked multiple times
- Added optional Spark parameter `spark.waimak.fs.defaultFS` to specify the URI of the FileSystem object in the `SparkFlowContext`
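  A small sketch of the parameter described above; the key comes from this entry and the HDFS URI is a hypothetical example:

  ```scala
  import org.apache.spark.sql.SparkSession

  // Point the SparkFlowContext FileSystem at an explicit URI
  // rather than the cluster's fs.defaultFS
  val spark = SparkSession.builder()
    .config("spark.waimak.fs.defaultFS", "hdfs://namenode:8020")
    .getOrCreate()
  ```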
- Azure Table uploader now respects and uploads `null` values instead of converting them to zeroed values
- Better exception logging on failing actions during execution
- `Any` types allowed to be used by and returned from actions
- Impala queries to the same connection object now reuse connections to improve query submission performance
- Spark 2.0, 2.1 and 2.3 compatibility
- Azure Table writer hanging after API failures