DATABRICKS DATA ANALYST EXAM
QUESTIONS WITH VERIFIED
ANSWERS
When can you ingest directories of files? - Answer-When the files are the same type
and have the same schema. Databricks reads all the files and combines them into a
single table.
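The same idea can be sketched in plain Python. This is only a stand-in for what Databricks does when pointed at a directory, not a Databricks API; the function name combine_directory and the sample files are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def combine_directory(directory):
    """Read every CSV file in `directory` and combine the rows into one table.

    Raises ValueError if the files do not all share the same header (schema).
    """
    schema = None
    rows = []
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            header = reader.fieldnames
            if schema is None:
                schema = header
            elif header != schema:
                raise ValueError(f"{path.name} has schema {header}, expected {schema}")
            rows.extend(reader)
    return schema, rows


# Hypothetical sample data: two files of the same type with identical columns.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sales_1.csv").write_text("id,amount\n1,10\n2,20\n")
    Path(tmp, "sales_2.csv").write_text("id,amount\n3,30\n")
    schema, table = combine_directory(tmp)

print(schema)
print(len(table))
```

If one file had a different header, the sketch would raise an error instead of silently producing a mixed table, which is why the same-schema condition matters.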
Describe how to connect Databricks SQL to visualization tools like Tableau, Power
BI, and Looker - Answer-1. Navigate to the SQL Warehouses page. Create a SQL
warehouse or select an existing one
2. Open the Connection details tab, which lists the server hostname, HTTP path, and
JDBC/ODBC URL (for an all-purpose cluster, the same details are on the JDBC/ODBC
tab under Advanced Options)
3. Follow the instructions to download the JDBC or ODBC driver for your
visualisation tool. Configure the tool using the driver and the connection details.
Identify Databricks SQL as a complementary tool for BI partner tool workflows -
Answer-By using Databricks SQL as a complementary tool for BI partner tool
workflows, you can take advantage of the scalability and performance of the
Databricks platform while still using the familiar interface of your BI partner tool.
Describe the medallion architecture - Answer--It's a sequential data organisation and
pipeline system of progressively cleaner data
-consists of three layers: bronze, silver, and gold:
-The bronze layer contains unvalidated data in its raw state
-The silver layer represents a validated, enriched version of the data that can be
trusted for downstream analytics.
-The gold layer contains highly refined and aggregated data that powers analytics,
machine learning, and production applications.
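The three layers can be illustrated with a minimal Python sketch. The records, field names, and the functions to_silver and to_gold are hypothetical; real pipelines would use tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw, unvalidated records exactly as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "2", "region": "eu", "amount": "4.50"},
    {"order_id": "3", "region": "US", "amount": "not-a-number"},
    {"order_id": "4", "region": "US", "amount": "20.00"},
]


def to_silver(records):
    """Validate and standardize bronze records; skip rows that fail validation."""
    silver = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # the invalid row remains available in bronze only
        silver.append({
            "order_id": int(record["order_id"]),
            "region": record["region"].upper(),
            "amount": amount,
        })
    return silver


def to_gold(records):
    """Aggregate silver records into a reporting-ready total per region."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each step reads only from the layer before it, so the data gets progressively cleaner while the raw bronze records stay untouched.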
Why is the gold layer the most common layer for data analysts using Databricks
SQL? - Answer--It contains highly refined and aggregated data that powers analytics,
machine learning, and production applications
-Data shared with a customer would rarely be stored outside this level.
-Because aggregations, joins, and filtering are handled before data is written to the
gold layer, users should see low latency query performance on data in gold tables
Describe the cautions and benefits of working with streaming data - Answer-
BENEFITS:
-real-time insights
-faster decision-making
-ability to respond quickly to changing conditions
CAUTIONS:
-managing the volume and velocity of data
-ensuring quality and consistency
-requires specialised expertise and skills
Identify that the Lakehouse allows the mixing of batch and streaming workloads. -
Answer-The ability to mix batch and streaming workloads is a key advantage of the
Lakehouse, as it allows you to build real-time applications that can process data as it
arrives, while also supporting traditional batch processing for historical analysis
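As a rough illustration, the same processing logic can serve both a batch source and a streaming source. This is a plain Python sketch, not Lakehouse code; running_total and event_stream are hypothetical names, and a generator stands in for a real stream.

```python
def running_total(events):
    """Yield the cumulative amount after each event, one event at a time."""
    total = 0
    for event in events:
        total += event["amount"]
        yield total


def event_stream():
    """Hypothetical stand-in for a streaming source that yields events as they arrive."""
    for amount in (5, 7, 3):
        yield {"amount": amount}


# Batch: the full historical dataset is already available.
historical = [{"amount": 10}, {"amount": 20}]
batch_result = list(running_total(historical))

# Streaming: the same logic consumes events incrementally.
stream_result = list(running_total(event_stream()))

print(batch_result, stream_result)
```

The point of the sketch is that one transformation is written once and reused for historical analysis and for data that is still arriving.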
Describe Delta Lake as a tool for managing data files. - Answer--One of the key
features of Delta Lake is its support for ACID transactions, which ensures data is
always in a consistent state
-It's designed to be highly scalable and can handle large volumes of data
-Delta Lake provides a number of tools for managing data files, such as VACUUM
and OPTIMIZE
Describe that Delta Lake manages table metadata - Answer--Provides support for
schema evolution, meaning you can modify the table schema over time without having
to rewrite the whole table. Schema validation ensures that changes are compatible
with existing data.
-Delta Lake also provides support for managing table properties, such as the location
of the table data and the format of the data files
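Schema validation and schema evolution can be sketched with a toy table in Python. This is not how Delta Lake is implemented; the class ToyTable and its methods are hypothetical and only mirror the behaviour described above.

```python
class ToyTable:
    """Toy table that validates rows on write and lets the schema grow over time."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []

    def append(self, row):
        """Schema validation: reject rows whose columns or types do not match."""
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, value in row.items():
            if value is not None and not isinstance(value, self.schema[column]):
                raise TypeError(f"{column} expects {self.schema[column].__name__}")
        self.rows.append(dict(row))

    def add_column(self, name, column_type):
        """Schema evolution: a metadata-only change; stored rows are not rewritten."""
        if name in self.schema:
            raise ValueError(f"column {name} already exists")
        self.schema[name] = column_type

    def read(self):
        """Return rows under the current schema; older rows read None for new columns."""
        return [{column: row.get(column) for column in self.schema} for row in self.rows]


table = ToyTable({"id": int, "name": str})
table.append({"id": 1, "name": "a"})
table.add_column("email", str)
table.append({"id": 2, "name": "b", "email": "b@example.com"})
print(table.read())
```

Adding the email column changes only the schema metadata; the first row is left as written and simply reads back with no value for the new column.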
Identify that Delta Lake tables maintain history for a period of time - Answer-Each
operation that modifies a Delta Lake table creates a new table version, and you can
use the table history to audit operations, roll back a table, or query a table at a
specific point in time using time travel. You can retrieve this information with the
DESCRIBE HISTORY command.
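Versioning, history, time travel, and rollback can be sketched with a toy class in Python. This is only an illustration under simplifying assumptions (every version is kept in memory, nothing expires); the class VersionedTable is hypothetical. In Databricks SQL the corresponding features are DESCRIBE HISTORY, VERSION AS OF / TIMESTAMP AS OF, and RESTORE TABLE.

```python
import copy


class VersionedTable:
    """Toy table in which every modifying operation creates a new version."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table
        self._history = [{"version": 0, "operation": "CREATE TABLE"}]

    def _commit(self, rows, operation):
        self._versions.append(rows)
        self._history.append(
            {"version": len(self._versions) - 1, "operation": operation}
        )

    def insert(self, row):
        self._commit(self._versions[-1] + [dict(row)], "WRITE")

    def delete_where(self, predicate):
        kept = [row for row in self._versions[-1] if not predicate(row)]
        self._commit(kept, "DELETE")

    def read(self, version=None):
        """Time travel: read the latest version, or a specific earlier one."""
        index = -1 if version is None else version
        return copy.deepcopy(self._versions[index])

    def restore(self, version):
        """Roll back by committing a new version with the contents of an old one."""
        self._commit(copy.deepcopy(self._versions[version]), "RESTORE")

    def history(self):
        """Audit trail of operations, newest first."""
        return list(reversed(self._history))


orders = VersionedTable()
orders.insert({"id": 1})                        # version 1
orders.insert({"id": 2})                        # version 2
orders.delete_where(lambda row: row["id"] == 1)  # version 3
before_delete = orders.read(version=2)           # query an earlier version
orders.restore(2)                                # version 4 restores version 2
print([entry["operation"] for entry in orders.history()])
```

Note that the rollback does not erase anything: it adds a new version, so the delete and the restore both remain visible in the history.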
Describe the benefits of Delta Lake within the Lakehouse - Answer-5 MAIN
BENEFITS IN THE LAKEHOUSE:
1. ACID transactions
2. Scalable metadata handling
3. Efficient query processing
4. Schema evolution
5. Unified platform
Describe persistence and scope of tables on Databricks - Answer-There are different
scopes based on what's required:
1. Global (permanent) tables are registered in the metastore, are available across all
clusters in a workspace, and can be accessed by all users with the appropriate
permissions
2. Cluster-scoped objects, such as global temporary views, are available only within a
specific cluster and are not visible to other clusters
3. Notebook-scoped objects, such as temporary views, are available only within a
specific notebook session and are not visible to other notebooks or users
Persisting tables in a storage format allows them to be stored on disk and accessed
more efficiently, which can improve query performance and reduce query latency.
Compare and contrast the behavior of managed and unmanaged tables - Answer-
Overall, managed tables are easier to manage and optimize for performance, while
unmanaged tables are more flexible and can be faster for large datasets.
Managed tables: