Answers 100% Pass
Which in-context learning method involves creating an initial prompt that states the
task to be completed and includes a single example question with answer followed
by a second question to be answered by the LLM?
a. Hot shot
b. Zero shot
c. Few shot
d. One shot - ✔✔d. One Shot
One-shot inference provides a single example question with its answer, followed
by a second question to be answered by the LLM. Few-shot inference provides
multiple example prompts and answers, while zero-shot provides no examples at
all, only the prompt to be answered by the LLM.
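As a concrete illustration, the snippet below builds the three prompt styles as plain strings; the sentiment-classification task and the example reviews are invented for this sketch, and any of the resulting strings could be sent to an LLM completion endpoint.

```python
# Minimal sketch of the three in-context learning prompt styles.
# The sentiment task and reviews are invented for illustration.

task = "Classify the sentiment of this review as positive or negative.\n"

zero_shot = (
    task
    + "Review: The battery died after two days.\nSentiment:"
)

one_shot = (
    task
    + "Review: I love this phone, the camera is superb.\nSentiment: positive\n"
    + "Review: The battery died after two days.\nSentiment:"
)

few_shot = (
    task
    + "Review: I love this phone, the camera is superb.\nSentiment: positive\n"
    + "Review: Screen cracked within a week.\nSentiment: negative\n"
    + "Review: The battery died after two days.\nSentiment:"
)

print(one_shot)  # any of these strings can be sent to an LLM endpoint
```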
Which configuration parameter for inference can be adjusted to either increase or
decrease randomness within the model output layer?
a. Max new tokens
b. Top-k sampling
c. Temperature
d. Number of beams & beam search - ✔✔c. Temperature
Temperature scales the logits feeding the softmax layer and so controls the
randomness of the output. A lower temperature results in reduced variability,
while a higher temperature results in increased randomness of the output.
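The effect is easy to see in a small sketch. The NumPy code below is a minimal, self-contained illustration; the four-element logit vector is made up for the example.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Scale logits by 1/temperature, apply softmax, sample a token id."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, 0.1]                    # made-up token logits
print(sample_with_temperature(logits, 0.2))      # low T: near-greedy
print(sample_with_temperature(logits, 2.0))      # high T: more random
```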
Which of the following best describes the role of data parallelism in the context of
training Large Language Models (LLMs) with GPUs?
a. Data parallelism is used to increase the size of the training data by duplicating it
across multiple GPUs.
b. Data parallelism refers to a type of storage mechanism where data is stored across
multiple GPUs.
c. Data parallelism is a technique to reduce the model size so that it can fit into the
memory of a single GPU.
d. Data parallelism allows for the use of multiple GPUs to process different parts of
the same data simultaneously, speeding up training time.
- ✔✔d. Data parallelism allows for the use of multiple GPUs to process
different parts of the same data simultaneously, speeding up training time.
Data parallelism is a strategy that splits the training data across multiple GPUs. Each
GPU processes a different subset of the data simultaneously, which can greatly
speed up the overall training time.
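As a rough illustration, the PyTorch sketch below wraps a toy model in nn.DataParallel, which shards each input batch across the visible GPUs; the model shape and random inputs are invented for the example, and in practice DistributedDataParallel is generally preferred for multi-GPU training.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                 # toy model, illustrative only
if torch.cuda.device_count() > 1:
    # nn.DataParallel splits each input batch across the visible GPUs,
    # runs the forward pass on each shard in parallel, then gathers.
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = torch.randn(64, 512, device=device)
outputs = model(inputs)                    # the batch of 64 is sharded
print(outputs.shape)                       # torch.Size([64, 10])
```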
Which of the following statements about pretraining scaling laws are correct? Select
all that apply:
a. To scale our model, we need to jointly increase dataset size and model size, or they
can become a bottleneck for each other.
b. There is a relationship between model size (in number of parameters) and the
optimal number of tokens to train the model with.
c. When measuring compute budget, we can use "PetaFlops per second-Day" as a
metric.
d. You should always follow the recommended number of tokens, based on the
Chinchilla laws, to train your model. - ✔✔a, b & c
a. To scale our model, we need to jointly increase dataset size and model size, or they
can become a bottleneck for each other.
Correct
For instance, while increasing the dataset size is helpful, if we do not jointly increase
the model size, the model may not be able to capture value from the larger dataset.
b. There is a relationship between model size (in number of parameters) and the
optimal number of tokens to train the model with.
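A back-of-the-envelope sketch can make both the token relationship in (b) and the compute metric in (c) concrete. The roughly 20-tokens-per-parameter ratio is the commonly cited Chinchilla heuristic, and the 6·N·D training-FLOPs formula is a standard approximation; both are assumptions layered on top of the quiz text, not part of it.

```python
# Back-of-the-envelope sketch, not from the quiz itself.
PFLOP_S_DAY = 1e15 * 86_400                 # FLOPs in one petaFLOP/s-day

params = 70e9                               # e.g. a 70B-parameter model
optimal_tokens = 20 * params                # ~20 tokens/param (Chinchilla)
train_flops = 6 * params * optimal_tokens   # standard ~6*N*D estimate

print(f"optimal tokens: {optimal_tokens:.1e}")                    # ~1.4e12
print(f"compute budget: {train_flops / PFLOP_S_DAY:,.0f} petaFLOP/s-days")
```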