Databricks Quick Flashcards
Which cluster configuration options can be customized at the time of cluster creation? -
answer Maximum number of worker nodes
Databricks Runtime Version
Restart Policy
Which Cluster: Due to the platform administrator's policies, a data engineer needs to
use a single cluster on one very large batch of files for an ETL workload. The workload
is automated, and the cluster will only be used by one workload at a time. They are part
of an organization that wants them to minimize costs when possible. - answer Multi-node
Job Cluster
A data engineer is trying to merge their development branch into the main branch for a
data project's repository.
Which of the following is a correct argument for why it is advantageous for the data
engineering team to use Databricks Repos to manage their notebooks? Select one
response. - answer Databricks Repos REST API enables the integration of data
projects into CI/CD pipelines.
A data engineer needs to run some SQL code within a Python notebook. Which of the
following will allow them to do this? Select two responses. - answer They can wrap the
SQL command in spark.sql().
They can use the %sql command at the top of the cell containing SQL code.
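As a quick illustration, a minimal sketch of both approaches in a Python notebook; the
events table and column names here are assumed for the example:
# Python cell: wrap the SQL text in spark.sql() to get back a DataFrame
result_df = spark.sql("SELECT event_name, count(*) AS event_count FROM events GROUP BY event_name")
display(result_df)

%sql
-- Separate cell: %sql on the first line switches the cell language to SQL
SELECT event_name, count(*) AS event_count FROM events GROUP BY event_name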
python command to list files in a directory - answer files = dbutils.fs.ls(DA.paths.kafka_events)
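For context, dbutils.fs.ls returns a list of FileInfo objects; a small sketch of
inspecting the listing (same directory as above):
files = dbutils.fs.ls(DA.paths.kafka_events)
for f in files:
    print(f.name, f.size)  # each FileInfo carries path, name, size, modificationTime
display(files)             # render the listing as a table in the notebook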
sql command to query a single file - answer SELECT * FROM json.`${DA.paths.kafka_events}/001.json`
how to extract text files as raw strings - answer SELECT * FROM text.`${DA.paths.kafka_events}`
spark sql command to create an external table - answer CREATE TABLE table_identifier
(col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ...)
LOCATION "path"
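A concrete, illustrative instance of the template above for a pipe-delimited CSV
directory; the table name, columns, and path are assumptions, not from the source:
CREATE TABLE sales_csv
  (order_id LONG, email STRING, purchase_revenue_in_usd DOUBLE)
USING CSV
OPTIONS (header = "true", delimiter = "|")
LOCATION "${da.paths.working_dir}/sales-csv"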
sql command to show all of the metadata associated with the table definition -
answer DESCRIBE EXTENDED <table_name>
sql command to refresh table cache - answer REFRESH TABLE <table_name>
command to extract data from an external sql db - answer CREATE TABLE <table_identifier>
USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)
SQL function that returns the schema derived from an example JSON string. -
answer schema_of_json('<json string>')
SQL function that parses a column containing a JSON string into a struct type using the
specified schema. - answer from_json(<json_string>, '<json struct schema>')
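A minimal sketch combining the two functions; the events_raw table, its value column,
and the field names are hypothetical:
-- derive a schema string from a sample JSON document
SELECT schema_of_json('{"user_id": "U1", "device": "ios"}') AS derived_schema;

-- parse the JSON string column with that schema and expand the struct
CREATE OR REPLACE TEMP VIEW parsed_events AS
  SELECT from_json(value, 'STRUCT<user_id: STRING, device: STRING>') AS json_struct
  FROM events_raw;

SELECT json_struct.* FROM parsed_events;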
sql function that separates the elements of an array into multiple rows; this creates a
new row for each element. - answer explode()
sql function that provides a count of the number of elements in an array for each row -
answer size()
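As a brief example, assuming an orders table with an ARRAY column named items (both
names are hypothetical):
SELECT order_id, size(items) AS item_count FROM orders;   -- elements per row
SELECT order_id, explode(items) AS item FROM orders;      -- one output row per element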
pivot table sql logic - answer SELECT * FROM (
  SELECT user_id user, event_name
  FROM events
) PIVOT ( count(*) FOR event_name IN (
  <pivot_columns_string>))
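Filling in the placeholder with concrete (illustrative) event names:
SELECT * FROM (
  SELECT user_id user, event_name
  FROM events
) PIVOT ( count(*) FOR event_name IN (
  'login', 'add_to_cart', 'checkout'))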
count null values in df - answer usersDF.selectExpr("count_if(email IS NULL)")
usersDF.where(col("email").isNull()).count()
sql commands to display details of a schema and a table - answer DESCRIBE SCHEMA EXTENDED <schema name>
DESCRIBE DETAIL <table name>
What's the command to create a schema (database) with a specific location? - answer CREATE SCHEMA IF NOT EXISTS ${da.schema_name}_custom_location
LOCATION '${da.paths.working_dir}/${da.schema_name}_custom_location.db';
What's the default location of a schema? - answer dbfs:/user/hive/warehouse/
A trick to remember the default location of the database/schema is UHW, i.e. User Hive
Warehouse.