What is Databricks?
Databricks is a fully managed cloud-based unified analytics platform built on Apache Spark. It provides a collaborative environment for data engineers, data scientists, and analysts to process big data, conduct data science, and implement machine learning workflows.
Databricks simplifies the use of Spark by offering a managed environment where users can spin up clusters, collaborate using interactive notebooks, and access various built-in libraries for analytics, machine learning, and data engineering tasks.
This platform not only streamlines the development and deployment of big data applications but also promotes an environment of teamwork and innovation, enabling organizations to extract actionable insights from their data more efficiently.
Related reading: What is Databricks?
Understanding Databricks Pricing:
Databricks pricing comprises two main components:
Instance Cost: This is the cost of the underlying compute instances on which Databricks clusters run. These costs depend on the instance types and the duration for which the instances are running.
Databricks Unit (DBU) Cost: A Databricks Unit (DBU) is a unit of processing capability per hour, billed on a per-second usage basis. The cost depends on the type of cluster and its configuration. Each operation performed on Databricks consumes a certain number of DBUs.
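To make the two components concrete, here is a back-of-the-envelope sketch of how they add up. All rates below are made-up placeholders; substitute your cloud provider's instance price and your plan's DBU rate.

```python
# Rough hourly cost estimate: instance cost + DBU cost.
# All rates are illustrative placeholders, not real Databricks prices.

def estimate_hourly_cost(num_nodes: int,
                         instance_price_per_hour: float,
                         dbu_per_node_hour: float,
                         dbu_price: float) -> float:
    """Total hourly cost = instance cost + DBU cost."""
    instance_cost = num_nodes * instance_price_per_hour
    dbu_cost = num_nodes * dbu_per_node_hour * dbu_price
    return instance_cost + dbu_cost

# Example with placeholder rates: 4 workers, $0.21/h per instance,
# 0.75 DBU per node-hour, $0.40 per DBU.
print(f"${estimate_hourly_cost(4, 0.21, 0.75, 0.40):.2f} per hour")
```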
Monitor and Analyze Performance Metrics:
Databricks does not expose every performance metric out of the box, so it is essential to set up custom configurations to gather them.
Enable Custom Metrics: To monitor performance metrics like CPU and memory usage, you need to enable custom metrics on your EC2 instances. This involves using initialization (INIT) scripts to send these metrics to AWS CloudWatch. Custom metrics provide deeper insights into cluster performance and help in making informed decisions.
Create INIT Scripts: Use INIT scripts to create custom namespaces in CloudWatch/Log Analytics for each cluster. This allows you to track performance metrics like CPU and memory usage for individual clusters. For instance, you can create an INIT script that captures metrics and sends them to CloudWatch, as sketched below. This step ensures that all necessary performance data is collected systematically.
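Below is a minimal sketch of the metric-publishing logic such an INIT script could invoke. Databricks init scripts themselves are shell scripts; the assumption here is that they run (or schedule) a small Python helper like this on each node. It requires boto3 and psutil, plus AWS credentials with CloudWatch permissions.

```python
import os
import boto3
import psutil

# DB_CLUSTER_ID is an environment variable Databricks typically exposes to
# init scripts; verify it exists on your runtime before relying on it.
cluster_id = os.environ.get("DB_CLUSTER_ID", "unknown-cluster")

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # illustrative region

# Publish CPU and memory utilization under a custom namespace per cluster.
cloudwatch.put_metric_data(
    Namespace=f"Databricks/{cluster_id}",
    MetricData=[
        {
            "MetricName": "CPUUtilization",
            "Value": psutil.cpu_percent(interval=1),
            "Unit": "Percent",
        },
        {
            "MetricName": "MemoryUtilization",
            "Value": psutil.virtual_memory().percent,
            "Unit": "Percent",
        },
    ],
)
```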
Attach INIT Scripts to Clusters: Attach the INIT scripts to the Databricks clusters. This ensures that the necessary performance metrics are collected and sent to CloudWatch/Log Analytics whenever the cluster is active. Regular monitoring of these metrics helps in identifying inefficiencies and optimizing resource usage.
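A hedged sketch of attaching an existing init script to a cluster through the Databricks REST API follows. The endpoint path, token handling, and the workspace file location are assumptions to adapt to your environment.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

payload = {
    "cluster_id": "<cluster-id>",
    # Note: clusters/edit expects the full cluster spec to be re-sent,
    # not just the fields you want to change.
    "init_scripts": [
        {"workspace": {"destination": "/Shared/init/push_metrics.sh"}}  # illustrative path
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
```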
Challenges in Databricks Cost Optimization:
Lack of Direct Performance Metrics: Until recently, Databricks did not expose performance metrics directly; they had to be gathered from the underlying compute instances. Memory metrics in particular required custom configurations to be reported to AWS CloudWatch/Log Analytics, adding another layer of complexity. This lack of direct visibility can make it challenging to optimize and manage costs effectively. In August, Databricks made these metrics available for public access.
Limited Visibility into Resource Usage: Understanding which workloads or departments are driving up the costs can be challenging, especially in multi-tenant environments. This can make it difficult to allocate costs accurately and find optimization opportunities.
Databricks Cost Optimization Best Practices:
Enable Cluster Termination Option: During cluster configuration, enable the automatic termination option. Specify the period of inactivity after which the cluster should be terminated. Once this period is exceeded without any activity, the cluster will move to a terminated state, thus saving costs associated with running idle clusters.
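As a sketch, the auto-termination period can be set when creating a cluster through the Databricks Clusters API. Field names follow the public API, but treat the runtime version, node type, and other values as illustrative assumptions.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

cluster_spec = {
    "cluster_name": "cost-aware-cluster",
    "spark_version": "13.3.x-scala2.12",   # illustrative runtime version
    "node_type_id": "Standard_D4ads_v5",   # illustrative node type
    "num_workers": 2,
    "autotermination_minutes": 30,         # terminate after 30 idle minutes
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```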
Optimize Cluster Configurations: Choosing the right configuration for the Databricks clusters is essential for cost efficiency. Consider the following:
Select Appropriate Node Types: Match the node types to your workload requirements to avoid over-provisioning resources. By selecting the most suitable instance types, you can ensure that your clusters are cost-effective and performant.
Monitor DBU Consumption: Understanding DBU consumption patterns and optimizing workloads accordingly can lead to significant cost savings; a rough analysis sketch follows this list.
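Here is a hedged sketch of analyzing DBU consumption from a billable-usage export. The column names (clusterId, dbus) mirror Databricks usage logs, but treat them as assumptions and adjust to the schema you actually receive.

```python
import pandas as pd

usage = pd.read_csv("billable_usage.csv")  # hypothetical export path

# Rank clusters by total DBUs consumed to see where optimization pays off most.
by_cluster = (
    usage.groupby("clusterId")["dbus"]
    .sum()
    .sort_values(ascending=False)
)
print(by_cluster.head(10))

# Convert to dollars with your contracted DBU rate (placeholder value here).
DBU_PRICE = 0.40
print((by_cluster * DBU_PRICE).head(10))
```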
Why CloudCADI for Databricks?
CloudCADI helps you in optimizing,
1. Autoscaling inefficiency
Though autoscaling in Databricks brings enormous benefits, it can easily inflate your cloud bill without adding any value. CloudCADI gives multiple actionable recommendations on node-resizing possibilities that can potentially reduce your Databricks costs.
Example: Instead of 5 nodes of type Standard_D4ads_v5 at $0.21/hour, you can switch to 2 nodes of type Standard_D8as_v5 and realize 20% savings.
2. Cluster-node resizing inefficiency
CloudCADI's intelligent engine analyzes anomalies (inefficient CPU and memory utilization) and gives recommendations on resizing.
Example: “Reduce the worker count from 8 to 5 for optimal usage”
Conclusion:
Optimizing costs on Databricks involves a combination of strategic configurations, attentive monitoring, and the use of best practices for the specific workloads. By implementing cluster termination policies, monitoring performance metrics, and optimizing cluster configurations, you can ensure that your Databricks environment is both cost-effective and efficient.
Want to explore CloudCADI? Call us today: Book a Demo
Author
Nandhini Kumar is a Software Engineer L2 who was part of the Databricks implementation team at CloudCADI.