2024 Databricks Fundamentals QUESTIONS WITH 100% SOLUTIONS LATEST UPDATE
What does Databricks help organizations do? - ANSWER The Databricks Lakehouse Platform enables organizations to:
- Ingest, process, and transform massive quantities and types of data
- Explore data through data science techniques, including but not limited to machine learning
- Guarantee that data available for business queries is reliable and up to date
- Provide data engineers, data scientists, and data analysts the unique tools they need to do their work
- Overcome traditional challenges associated with data science and machine learning workflows (we will explore this in detail in our next lesson)

As data practitioners work to design their organization's big data infrastructure, they often ask and need to answer questions like: - ANSWER
- Where/how will we store our big data?
- How can we process batch and stream data?
- How can we use different types of data together in our analyses (unstructured vs. structured data)?
- How can we keep track of all of the work we're doing on our big data?

Data lakehouses have the following key features: - ANSWER
- Transaction support to ensure that multiple parties can concurrently read or write data
- Data schema enforcement to ensure data integrity (writes to a table are rejected if they do not match the table's schema)
- Governance and audit mechanisms to make sure you can see how data is being used
- BI support so that BI tools can work directly on source data, which reduces data staleness
- Storage decoupled from compute, which makes it easier for your system to scale to more concurrent users and larger data sizes
- Openness: the storage formats used are open and standard, and APIs and various other tools make it easy for team members to access data directly
- Support for all data types: structured, unstructured, and semi-structured
- End-to-end streaming so that real-time reporting and real-time data can be integrated into data analytics processes just as existing data is
- Support for diverse workloads, including data engineering, data science, machine learning, and SQL analytics, all on the same data repository

Delta Lake is an open-source storage layer that brings data reliability to data lakes. When we talk about data reliability, we refer to the accuracy and completeness of your data. In other words, Delta Lake working in conjunction with a data lake is what lays the foundation for your Lakehouse. That combination guarantees that your data is what you need for your use cases via: - ANSWER
- ACID transactions, which are database transaction properties that guarantee data validity. With ACID transactions, you don't have to worry about missing data or inconsistencies in your data from interrupted or deleted operational transactions, because changes to your data are performed as if they are a single operation.
- Indexing, which allows you to get an unordered table (which might be inefficient to query) into an order that will maximize the efficiency of your queries
- Table access control lists (ACLs), or governance mechanisms that ensure that only users who should have access to data can access it
- Expectation-setting, which refers to the ability for you to configure Delta Lake based on your workload patterns and business needs

Bronze Tables - ANSWER Contain raw data ingested from a variety of sources (JSON files, RDBMS data, IoT data).

Silver Tables - ANSWER More refined tables that are queryable and ready for big data and AI projects. They are joined from various bronze tables to enrich records and to update records based on recent activity.

Gold Tables - ANSWER Provide business-level aggregates used for particular use cases.

Delta Files - ANSWER Delta Lake uses Parquet files to store a customer's data in the customer's cloud storage account. Parquet is a columnar file format that provides optimizations to speed up queries and is a far more efficient file format than CSV or JSON. The columnar storage allows you to quickly skip over non-relevant data while executing queries. Delta files leverage all of the technical capabilities of Parquet files but have an additional layer over them. This additional layer tracks data versioning and metadata, stores transaction logs to keep track of changes made to a data table or object storage directory, and provides ACID transactions.
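The idea of an "additional layer" over data files can be pictured with a small toy model. The sketch below is not the Delta Lake file format or API; the class `ToyVersionedTable`, the `part-…` file names, and the JSON log entries are all hypothetical and exist only to show how an append-only log of "add" and "remove" actions lets a reader reconstruct any earlier version of a table.

```python
import json


class ToyVersionedTable:
    """Toy model: immutable data files plus an append-only commit log.

    Hypothetical illustration only; this is NOT the Delta Lake on-disk format.
    """

    def __init__(self):
        self.files = {}  # file name -> list of row dicts (stands in for Parquet files)
        self.log = []    # one JSON string per commit (stands in for the transaction log)

    def commit(self, add=None, remove=()):
        """Record one change as a single log entry: files added and files logically removed."""
        version = len(self.log)
        added = []
        for i, rows in enumerate(add or []):
            name = f"part-{version}-{i}"
            self.files[name] = list(rows)
            added.append(name)
        entry = {"version": version, "add": added, "remove": list(remove)}
        self.log.append(json.dumps(entry))
        return version

    def live_files(self, version=None):
        """Replay the log up to `version` to find which files make up the table."""
        if version is None:
            version = len(self.log) - 1
        live = []
        for raw in self.log[: version + 1]:
            action = json.loads(raw)
            live = [name for name in live if name not in action["remove"]]
            live.extend(action["add"])
        return live

    def read(self, version=None):
        """Return the rows of the table as of `version` (latest if omitted)."""
        return [row for name in self.live_files(version) for row in self.files[name]]


table = ToyVersionedTable()
v0 = table.commit(add=[[{"id": 1, "amount": 10}]])
v1 = table.commit(add=[[{"id": 2, "amount": 25}]])
# Correct the first row: remove the old file and add its replacement in one commit.
v2 = table.commit(add=[[{"id": 1, "amount": 12}]], remove=["part-0-0"])

latest_rows = table.read()
rows_at_v1 = table.read(version=v1)
```

In this toy, the old file is never deleted by a commit, which is why reading at `v1` still returns the original amount while the latest read returns the corrected one.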
Delta Tables - ANSWER A Delta table is a collection of data kept using the Delta Lake technology and consists of three things:
- Delta files containing the data and kept in object storage
- A Delta table registered in a metastore (a metastore is simply a catalog that tracks your data's metadata, that is, data about your data)
- The Delta transaction log saved with the Delta files in object storage

Delta Optimization Engine - ANSWER Delta Engine is a high-performance query engine that provides an efficient way to process data in data lakes. Delta Engine accelerates data lake operations and supports a variety of workloads ranging from large-scale ETL processing to ad hoc, interactive queries. Many of these optimizations take place automatically; you get the benefits of these Delta Engine capabilities just by using Databricks for your data lakes. What the Delta Optimization Engine means for your business is that your data workloads run faster, so data teams can perform their work in less time.

Delta Lake Storage Layer - ANSWER When using Delta Lake, your organization stores its data in a Delta Lake Storage Layer and then accesses that data via Databricks. A key idea here is that an organization keeps all of this data in files in object storage. This is beneficial because it means your data is kept in a lower-cost and easily scalable environment.

So, how does Delta Lake address these pain points to ensure reliable, ready-to-go data? - ANSWER
- ACID Transactions
- Schema Management
- Scalable Metadata Handling
- Unified Batch and Streaming Data
- Data Versioning and Time Travel
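Two of the items above, schema management and ACID-style all-or-nothing writes, can be illustrated with a minimal pure-Python sketch. It is a toy under stated assumptions, not Delta Lake or Databricks code: `ToySchemaTable`, `SchemaMismatchError`, and the `orders` example are hypothetical names invented for this illustration.

```python
class SchemaMismatchError(ValueError):
    """Raised when a row does not match the table schema (hypothetical name)."""


class ToySchemaTable:
    """Toy in-memory table that rejects writes that do not match its schema."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []

    def _check(self, row):
        if set(row) != set(self.schema):
            raise SchemaMismatchError(
                f"columns {sorted(row)} do not match {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaMismatchError(
                    f"column {column!r} expects {expected.__name__}"
                )

    def append(self, batch):
        """Validate the whole batch first, then apply it, so a bad batch writes nothing."""
        batch = list(batch)
        for row in batch:
            self._check(row)
        self.rows.extend(batch)


orders = ToySchemaTable({"order_id": int, "amount": float})
orders.append([{"order_id": 1, "amount": 9.5}])

try:
    # The second row has a string where a float is expected, so the whole batch is rejected.
    orders.append([
        {"order_id": 2, "amount": 3.0},
        {"order_id": 3, "amount": "oops"},
    ])
    rejected = False
except SchemaMismatchError:
    rejected = True
```

Because validation happens before any row is stored, the valid row in the rejected batch is not written either; the table still holds only the first order.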
Written for
- Institution: Databricks Fundamentals
- Course: Databricks Fundamentals
Document information
- Uploaded on: January 12, 2024
- Number of pages: 4
- Written in: 2023/2024
- Type: Exam (elaborations)
- Contains: Questions & answers