
CDAP 5.1.0

@sreevatsanraman released this 12 Oct 16:23
· 63 commits to release/5.1 since this release
a089619

Summary

This release introduces a number of new features, improvements, and bug fixes to CDAP. The main highlights of the release are:

  1. Date and Time Support

    • Support for Date, Time, and Timestamp data types in the CDAP schema. This support is also available in pipeline plugins and Data Preparation directives.
  2. Plugin Requirements

    • A way for plugins to specify certain runtime requirements, and the ability to filter available plugins based on those requirements.
  3. Bootstrapping

    • A method to automatically bootstrap CDAP with a given state, such as a set of deployed apps, artifacts, namespaces, and preferences.
  4. UI Customization

    • A way to customize the display of the CDAP UI by enabling or disabling certain features.
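
The new date and time types surface in the CDAP schema as Avro-style logical types. A minimal sketch of a record schema using them (the field names are illustrative, and the exact logical-type names may vary by type width, so check the CDAP schema documentation):

```json
{
  "type": "record",
  "name": "event",
  "fields": [
    { "name": "event_date", "type": { "type": "int", "logicalType": "date" } },
    { "name": "event_time", "type": { "type": "long", "logicalType": "time-micros" } },
    { "name": "created_at", "type": { "type": "long", "logicalType": "timestamp-micros" } }
  ]
}
```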

New Features

  • Added support for Date/Time in Preparation, along with a new directive, parse-timestamp, that converts Unix timestamps (given as longs or strings) into Timestamp objects. (CDAP-14244)
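
This is not CDAP code, but the conversion that parse-timestamp performs can be sketched in Python (assuming the input is epoch seconds; the directive itself accepts both long and string forms):

```python
from datetime import datetime, timezone

def parse_timestamp(value):
    """Convert a Unix timestamp given as a long or a string into an
    aware datetime, mirroring what the parse-timestamp directive does."""
    seconds = int(value)  # accepts both long and string forms
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(parse_timestamp(1539361380))    # 2018-10-12 16:23:00+00:00
print(parse_timestamp("1539361380"))  # same result from the string form
```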

  • Added Date, Time, and Timestamp support in plugins (Wrangler, Google Cloud BigQuery, Google Cloud Spanner, Database). (CDAP-14245)

  • Added Date, Time, and Timestamp support in CDAP Schema. (CDAP-14021)

  • Added Date, Time, and Timestamp support in UI. (CDAP-14028)

  • Added Google Cloud Spanner source and sink plugins in Pipeline and Google Cloud Spanner connection in Preparation. (CDAP-14053)

  • Added Google Cloud PubSub realtime source. (CDAP-14185)

  • Added a new user onboarding tour to CDAP. (CDAP-14088)

  • Added the ability to customize UI through theme. (CDAP-13990)

  • Added a framework that can be used to bootstrap a CDAP instance. (CDAP-14022)

  • Added the ability to configure system-wide provisioner properties that can be set by admins but not by users. (CDAP-13746)

  • Added the capability for plugins to specify their requirements, and to filter available plugins based on those requirements. (CDAP-13924)

  • Added REST endpoints to query the run counts of a program. (CDAP-13975)

  • Added a REST endpoint to get the latest run record of multiple programs in a single call. (CDAP-14260)
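
A sketch of addressing the new run-count endpoint. The path below follows the usual /v3 REST conventions, but both the route and the default Sandbox port are assumptions, so consult the CDAP 5.1 REST API reference for the exact details; the batch latest-run-record endpoint is similarly expected to take a POST with a JSON list of programs.

```python
from urllib.parse import quote

# Default Sandbox router address -- an assumption; adjust for your deployment.
BASE = "http://localhost:11015/v3"

def run_count_url(namespace, app, program_type, program):
    """Build the URL for the run count of a single program
    (path layout is an assumption based on CDAP's /v3 conventions)."""
    return (f"{BASE}/namespaces/{quote(namespace)}/apps/{quote(app)}"
            f"/{quote(program_type)}/{quote(program)}/runcount")

# An HTTP GET against this URL would return the run count.
print(run_count_url("default", "MyPipeline", "workflows", "DataPipelineWorkflow"))
```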

  • Added support for Apache Spark 2.3. (CDAP-13653)

Improvements

  • Improved runtime monitoring (which fetches program states, metadata, and logs) of remotely launched programs from the CDAP Master by using dynamic port forwarding instead of HTTPS for communication. (CDAP-13566)

  • Removed duplicate classes to reduce the size of the sandbox by a couple hundred megabytes. (CDAP-13977)

  • Added cdap-env.sh to allow configuring JVM options when launching the Sandbox. (CDAP-14461)
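
A hedged sketch of what cdap-env.sh enables; the variable name and flags here are assumptions drawn from common JVM conventions, so check the comments shipped in the file itself for the variables it actually honors:

```shell
# cdap-env.sh -- sourced by the Sandbox launch scripts.
# JAVA_OPTS is an assumption; consult the file's own comments.
export JAVA_OPTS="-Xmx4096m -XX:+ExitOnOutOfMemoryError"
```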

  • Added support for bidirectional Field Level Lineage. (CDAP-14003)

  • Added capability for external datasets to record their schemas. (CDAP-14013)

  • The Dataproc provisioner will try to pick up the project id and credentials from the environment if they are not specified. (CDAP-14091)

  • The Dataproc provisioner will use internal IP addresses when CDAP is in the same network as the Dataproc cluster. (CDAP-14104)

  • Added capability to always display the current dataset schema in Field Level Lineage. (CDAP-14168)

  • Improved error handling in Preparation. (CDAP-13886)

  • Added a FileSink batch sink, FileMove action, and FileDelete action to replace their HDFS counterparts. (CDAP-14023)

  • Added a configurable JVM option to kill the CDAP process immediately on the Sandbox when an OutOfMemory error occurs. (CDAP-14097)

  • Added better trace logging for dataset service. (CDAP-14135)

  • Made the Google Cloud Storage, Google Cloud BigQuery, and Google Cloud Spanner connection properties (project id, service account keyfile path, temporary GCS bucket) optional. (CDAP-14386)

  • The Google Cloud PubSub sink now tries to create the topic while preparing for the run if the topic does not exist. (CDAP-14401)

  • Added csv, tsv, delimited, json, and blob as formats to the S3 source and sink. (CDAP-14475)

  • Added csv, tsv, delimited, json, and blob as formats to the File source. (CDAP-14321)

  • Added a button on external sources and sinks to jump to the dataset detail page. (CDAP-9048)

  • Added format and suppress query params to the program logs endpoint to match the program run logs endpoint. (CDAP-14040)

  • Made all CDAP examples compatible with Spark 2. (CDAP-14132)

  • Added worker and master disk size properties to the Dataproc provisioner. (CDAP-14220)

  • Improved operational behavior of the dataset service. (CDAP-14298)

  • Fixed the Wrangler transform to make directives optional; if none are given, the transform is a no-op. (CDAP-14372)

  • Fixed Preparation to treat files without extensions as text files. (CDAP-14397)

  • Limited the number of files shown in the S3 and Google Cloud Storage browsers to 1000. (CDAP-14398)

  • Enhanced the Google Cloud BigQuery sink to create the dataset if the specified dataset does not exist. (CDAP-14482)

  • Raised log levels for the CDAP Sandbox so that only CDAP classes log at debug level. (CDAP-14489)

Bug Fixes

  • Fixed the 'distinct' plugin to use a drop down for the list of fields and to have a button to get the output schema. (CDAP-14468)

  • Ensured that destroy() is always called for MapReduce, even if initialize() fails. (CDAP-7444)

  • Fixed a bug where the Alert Publisher would not work if there was a space in the label. (CDAP-13008)

  • Fixed a bug that caused Preparation to fail while parsing Avro files. (CDAP-13230)

  • Fixed a misleading error message about HBase classes in cloud runtimes. (CDAP-13878)

  • Fixed a bug where the metric for failed profile program runs was not getting incremented when the run failed due to provisioning errors. (CDAP-13887)

  • Fixed a bug where querying metrics by time series would return incorrect results after a certain amount of time. (CDAP-13894)

  • Fixed a bug where profile metrics were incorrect after an app was deleted. (CDAP-13959)

  • Fixed a deprovisioning bug that occurred when cluster creation failed. (CDAP-13965)

  • Fixed an error where TMS publishing was retried indefinitely if the first attempt failed. (CDAP-13988)

  • Fixed a race condition in MapReduce that can cause a deadlock. (CDAP-14076)

  • Fixed a resource leak in the preview feature. (CDAP-14098)

  • Fixed a bug that would cause the RDD versions of the dynamic Scala Spark plugins to fail. (CDAP-14107)

  • Fixed a bug where profiles were getting applied to all program types instead of only workflows. (CDAP-14154)

  • Fixed a race condition by ensuring that a program is started before starting runtime monitoring for it. (CDAP-14203)

  • Fixed the run count for pipelines in the UI to show the correct number instead of capping at 100. (CDAP-14211)

  • Fixed an issue where the Dataproc client was not being closed, resulting in verbose error logs. (CDAP-14223)

  • Fixed a bug that could cause the provisioning state of stopped program runs to be corrupted. (CDAP-14261)

  • Fixed a bug that caused Preparation to be unable to list buckets in a Google Cloud Storage connection in certain environments. (CDAP-14271)

  • Fixed a bug where the Dataproc provisioner was unable to provision a single-node cluster. (CDAP-14303)

  • Fixed a bug where Preparation could not read json or xml files on Google Cloud Storage. (CDAP-14390)

  • Fixed the Dataproc provisioner to use full API access scopes so that Google Cloud Spanner and Google Cloud PubSub are accessible by default. (CDAP-14395)

  • Fixed a bug where profile metrics were not deleted when a profile was deleted. (CDAP-14435)

Deprecated and Removed Features

  • Removed the old, buggy dynamic Spark plugins. (CDAP-14108)

  • Dropped support for MapR 4.1. (CDAP-14456)