IHUMAN
The process of identifying data - -----------DESCRIPTION --------The process of identifying data begins
by determining the information you want to collect. In this step, you make decisions regarding (a)
the specific information you need; and (b) the possible sources for this data. Your goals determine
the answers to these questions.
Case study to identify data for: overview and first step - -----------DESCRIPTION --------Let's take the
example of a product company that wants to create targeted marketing campaigns based on the age
group that buys their products the most. Their goal is to design outreach that appeals most to this
segment and encourages them to influence their friends and peers to buy these
products. Based on this use case, the obvious information to identify includes, for example, the
customer profile, purchase history, location, age, education, profession, income, and marital status.
To gain even greater insight into this segment, you may also decide to
collect the customer complaint data for this segment to understand the kind of issues they face
because this could discourage them from recommending your products. To know how satisfied they
were with the resolution of their issues, you could collect the ratings from the customer service
surveys. Taking this a step further, you may want to understand how these customers talk about
your products on social media and how many of their connections engage with them in these
discussions, as measured, for example, by the likes, shares, and comments their posts receive.
Case study to identify data for: second step - -----------DESCRIPTION --------The next step in the
process is to define a plan for collecting data. You need to establish a timeframe for collecting the
data you have identified. Some of the data you need may be required on an ongoing basis and some
over a defined period of time. For collecting website visitor data, for example, you may need to have
the numbers refreshed in real time. But if you're tracking data for a specific event, you have a
definite beginning and end date for collecting the data. In this step, you can also define how much
data would be sufficient for you to reach a credible analysis. Is the volume defined by the segment,
for example, all customers within the age range of 21 to 30 years, or by a fixed size, such as a dataset of a hundred
thousand customers within the age range of 21 to 30? You can also use this step to define the
dependencies, risks, mitigation plan, and several other such factors that are relevant to your
initiative. The purpose of the plan should be to establish the clarity you need for execution.
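
The planning decisions above (timeframe, ongoing versus fixed-period collection, and volume defined by segment or by a fixed size) can be written down as a small data structure. The following is a minimal Python sketch, not a prescribed method: the class name CollectionPlan, its fields, and the sample customer records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CollectionPlan:
    """Hypothetical container for the decisions made in the planning step."""
    source: str
    start: date
    end: Optional[date]         # None means ongoing collection
    min_age: int
    max_age: int
    max_records: Optional[int]  # None means "everyone in the segment"

    def is_active(self, today: date) -> bool:
        """Return True while the collection window is open."""
        if today < self.start:
            return False
        return self.end is None or today <= self.end

    def select(self, customers):
        """Keep customers in the age segment, up to the record cap."""
        segment = [c for c in customers if self.min_age <= c["age"] <= self.max_age]
        if self.max_records is None:
            return segment
        # Taking the first N records is a simplification; a real plan
        # would more likely draw a random or stratified sample.
        return segment[: self.max_records]


# Made-up records, for illustration only.
customers = [
    {"id": 1, "age": 19},
    {"id": 2, "age": 24},
    {"id": 3, "age": 30},
    {"id": 4, "age": 41},
    {"id": 5, "age": 27},
]

# Volume defined by the segment: all customers aged 21 to 30, collected on an ongoing basis.
whole_segment = CollectionPlan("crm", date(2024, 1, 1), None, 21, 30, None)
# Volume defined by a fixed size over a defined period.
capped_sample = CollectionPlan("crm", date(2024, 1, 1), date(2024, 3, 31), 21, 30, 2)

print([c["id"] for c in whole_segment.select(customers)])  # [2, 3, 5]
print([c["id"] for c in capped_sample.select(customers)])  # [2, 3]
```

Writing the plan down in one place makes the timeframe and volume decisions explicit, which is the clarity the plan is meant to provide for execution.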
Case study to identify data for: third step - -----------DESCRIPTION --------The third step in the process
is to determine your data collection methods. In this step, you define how you will collect the data
from the sources you have identified, such as internal systems, social media sites, or third-party data providers. Your
methods will depend on the type of data, the timeframe over which you need the data, and the
volume of data. Once your plan and data collection methods are finalized, you can implement your
data collection strategy and start collecting data. Expect to update your plan as you go,
because conditions evolve as you implement it. The data you identify,
the source of that data, and the practices you employ for gathering it have implications for
quality, security, and privacy. These are not one-time considerations; they remain relevant
throughout the life cycle of the data analysis process.
Reliable data - -----------DESCRIPTION --------Working with data from disparate sources without
considering how it measures up against quality metrics can lead to failure. To be reliable,
data needs to be free of errors, accurate, complete, relevant, and accessible. You need to define the
quality traits, the metrics, and the checkpoints to ensure that your analysis is
based on quality data. You also need to watch for data governance issues such as
security, regulation, and compliance. Data governance policies and procedures relate to the
usability, integrity, and availability of data. Penalties for non-compliance can run into millions of
dollars and can hurt the credibility of not just your findings, but also your organization.
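
As an illustration of defining quality traits, metrics, and checkpoints, the sketch below counts records that fail three simple checks: completeness, plausibility of a value, and duplication. It is only one possible approach. The field names in REQUIRED_FIELDS, the age bounds, and the sample records are assumptions made for the example, not part of any standard.

```python
REQUIRED_FIELDS = ("customer_id", "age", "location")  # hypothetical field names


def quality_report(records):
    """Count records that fail simple completeness, accuracy, and duplicate checks."""
    report = {"total": len(records), "incomplete": 0, "invalid_age": 0, "duplicates": 0}
    seen_ids = set()
    for rec in records:
        # Completeness: every required field must be present and non-empty.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            report["incomplete"] += 1
        # Accuracy: age, when present, must be a plausible integer (assumed bounds).
        age = rec.get("age")
        if age is not None and not (isinstance(age, int) and 0 < age < 120):
            report["invalid_age"] += 1
        # Free of errors: the same customer should not appear twice.
        cid = rec.get("customer_id")
        if cid is not None:
            if cid in seen_ids:
                report["duplicates"] += 1
            seen_ids.add(cid)
    return report


# Made-up records, for illustration only.
records = [
    {"customer_id": 1, "age": 25, "location": "Pune"},
    {"customer_id": 2, "age": 230, "location": "Leeds"},
    {"customer_id": 2, "age": 28, "location": ""},
    {"customer_id": 3, "age": None, "location": "Austin"},
]

report = quality_report(records)
print(report)
```

A report like this could serve as a checkpoint: collection proceeds to analysis only when the failure counts stay below thresholds you have agreed on in advance.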
Data privacy - -----------DESCRIPTION --------Data you collect needs to check the boxes for
confidentiality, license for use, and compliance with mandated regulations. Checks, validations, and an
auditable trail need to be planned. Loss of trust in the data used for analysis can compromise the
process, result in suspect findings, and invite penalties.
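
One way to plan for an auditable trail is to record, for every batch of data collected, where it came from, under what license, when it was collected, and a checksum that lets you verify later that the batch has not changed. The sketch below assumes this approach and uses only the Python standard library. The function names, the entry fields, and the source and license labels are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(payload):
    """SHA-256 of the payload serialized as JSON with sorted keys."""
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def audit_entry(source, license_name, payload):
    """Build one audit-trail record for a batch of collected data."""
    return {
        "source": source,
        "license": license_name,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "record_count": len(payload),
        "sha256": _digest(payload),
    }


def is_unchanged(entry, payload):
    """Check that a batch still matches the checksum stored at collection time."""
    return _digest(payload) == entry["sha256"]


# Made-up batch and labels, for illustration only.
batch = [{"id": 1, "rating": 4}, {"id": 2, "rating": 5}]
entry = audit_entry("customer_service_survey", "internal-use-only", batch)

print(entry["record_count"])        # 2
print(is_unchanged(entry, batch))   # True
```

Storing such entries alongside the data gives reviewers something concrete to check when questions about confidentiality, licensing, or compliance arise.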
The importance of identifying the right data for analysis - -----------DESCRIPTION --------Identifying the
right data is a critical step in the data analysis process. Done right, it ensures that you can
look at a problem from multiple perspectives and that your findings are credible and reliable.
Primary data - -----------DESCRIPTION --------The term primary data refers to information obtained
directly by you from the source. This could be data from internal sources such as the
organization's CRM, HR, or workflow applications. It could also include data you gather directly
through surveys, interviews, discussions, observations, and focus groups.
Secondary data - -----------DESCRIPTION --------Secondary data refers to information retrieved from
existing sources, such as external databases, research articles, publications, training material,
Internet searches, and financial records available as public data. This could also include data collected
through externally conducted surveys, interviews, discussions, observations, and focus groups.
Third-party data - -----------DESCRIPTION --------Third-party data is data you purchase from
aggregators who collect data from various sources and combine it into comprehensive datasets
purely for the purpose of selling the data.