
Data Analysis in Research: Types & Methods


What is data analysis in research?

Definition: According to LeCompte and Schensul, research data analysis is a process researchers use to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large body of data into smaller fragments that make sense.

Three essential things occur during the data analysis process — the first is data organization. The second is data reduction, achieved through summarization and categorization, which helps find patterns and themes in the data for easy identification and linking. The third and last is data analysis itself, which researchers perform in both top-down and bottom-up fashion.

On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and data interpretation is a process representing the application of deductive and inductive logic to research data.

Why analyze data in research?

Researchers rely heavily on data as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But, what if there is no question to ask? Well! It is possible to explore data even without a problem – we call it ‘Data Mining’, which often reveals some interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience’s vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, data analysis sometimes tells the most unforeseen yet exciting stories that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data describes things once a specific value is assigned to it. For analysis, these values need to be organized, processed, and presented in a given context to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data . Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews , qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data . This type of data can be categorized, grouped, measured, calculated, or ranked. Example: age, rank, cost, length, weight, scores, etc. all come under this type of data. You can present such data in graphical formats and charts or apply statistical analysis methods to it. The Outcomes Measurement Systems (OMS) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups. However, an item included in categorical data cannot belong to more than one group. Example: a person responding to a survey by stating their living style, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data.
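As a minimal sketch of how a chi-square test works on categorical data, the statistic for a small contingency table can be computed by hand (the smoking-by-gender counts below are hypothetical):

```python
# Chi-square statistic for a contingency table of observed counts.
def chi_square(table):
    """table: list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

observed = [[30, 20],   # e.g. smokers: male, female (hypothetical)
            [20, 30]]   # non-smokers: male, female
print(round(chi_square(observed), 2))  # → 4.0
```

The statistic is then compared against a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom to decide whether the grouping variables are related.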

Learn More : Examples of Qualitative Data in Education

Data analysis in qualitative research

Data analysis in qualitative research works a little differently from numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is an involved process; hence it is typically used for exploratory research and data analysis.

Finding patterns in the qualitative data

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual: researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.
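A word-frequency pass like the one described above can be sketched in a few lines of Python; the responses and the stopword list below are hypothetical:

```python
from collections import Counter
import re

# Hypothetical open-ended survey responses.
responses = [
    "Hunger is the biggest problem; food prices keep rising.",
    "Access to food is difficult, and hunger affects children most.",
    "Clean water and food security remain daily concerns.",
]

# Tokenize the lowercased text and drop common filler words.
words = re.findall(r"[a-z']+", " ".join(responses).lower())
stopwords = {"is", "the", "and", "to", "a", "of", "in", "most", "keep"}
counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(2))  # → [('food', 3), ('hunger', 2)]
```

The most frequent terms are the candidates a researcher would highlight for further, context-aware reading.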

The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’

The scrutiny-based technique is another highly recommended  text analysis  method used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique, differentiating how a specific text is similar to or different from others.

For example: to assess the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method for analyzing polls with single-answer question types .

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations in the enormous data set.

Methods used for data analysis in qualitative research

There are several techniques to analyze the data in qualitative research, but here are some commonly used methods,

  • Content Analysis:  It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze the documented information from text, images, and sometimes from the physical items. It depends on the research questions to predict when and where to use this method.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and  surveys . Most of the time, the stories or opinions shared by people are analyzed to find answers to the research questions.
  • Discourse Analysis:  Similar to narrative analysis, discourse analysis is used to analyze the interactions with people. Nevertheless, this particular method considers the social context under which or within which the communication between the researcher and respondent takes place. In addition to that, discourse analysis also focuses on the lifestyle and day-to-day environment while deriving any conclusion.
  • Grounded Theory:  When you want to explain why a particular phenomenon happened, then using grounded theory for analyzing quality data is the best resort. Grounded theory is applied to study data about the host of similar cases occurring in different settings. When researchers are using this method, they might alter explanations or produce new ones until they arrive at some conclusion.
Choosing the right software can be tough. Whether you’re a researcher, business leader, or marketer, check out the top 10  qualitative data analysis software  for analyzing qualitative data.

Data analysis in quantitative research

Preparing data for analysis.

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent answered all the questions in an online survey, or that the interviewer asked every question devised in the questionnaire.

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers confirm that the provided data is free of such errors. They need to conduct necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses . If a survey is completed with a sample size of 1,000, the researcher might create age brackets to distinguish respondents based on their age. It then becomes easier to analyze small data buckets rather than deal with the massive data pile.
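Coding raw ages into brackets can be sketched as follows; the bracket edges and labels here are hypothetical:

```python
# Code a raw age into a bracket label (hypothetical brackets).
def code_age(age):
    brackets = [(18, 25, "18-24"), (25, 35, "25-34"),
                (35, 50, "35-49"), (50, 120, "50+")]
    for low, high, label in brackets:
        if low <= age < high:
            return label
    return "under 18"

ages = [22, 31, 47, 63, 29]          # hypothetical raw responses
coded = [code_age(a) for a in ages]  # the coded data buckets
print(coded)  # → ['18-24', '25-34', '35-49', '50+', '25-34']
```

Once coded, the analysis works on a handful of buckets instead of a thousand distinct ages.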

LEARN ABOUT: Steps in Qualitative Research

Methods used for data analysis in quantitative research

After the data is prepared for analysis, researchers can use different research and data analysis methods to derive meaningful insights. Statistical analysis is by far the most favored approach for numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. Statistical methods fall into two groups: ‘descriptive statistics,’ used to describe data, and ‘inferential statistics,’ which help in comparing data.

Descriptive statistics

This method is used to describe the basic features of versatile types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not go beyond summarizing the data; any conclusions are based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to demonstrate distribution by various points.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest points.
  • Variance and standard deviation measure how far observed scores deviate from the mean.
  • It is used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is and how strongly that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores, helping researchers identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
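Most of these descriptive measures are available in Python’s standard library; a minimal sketch over a hypothetical set of survey scores:

```python
import statistics

scores = [72, 85, 90, 85, 78, 88, 95, 85, 70, 82]  # hypothetical scores

summary = {
    "mean": statistics.mean(scores),              # central tendency
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),              # most frequent response
    "range": max(scores) - min(scores),           # dispersion
    "stdev": round(statistics.stdev(scores), 2),  # sample standard deviation
}
print(summary)
```

For measures of position, `statistics.quantiles(scores, n=4)` returns the quartile cut points in the same spirit.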

For quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are not sufficient to demonstrate the rationale behind them. Nevertheless, it is necessary to choose the research and data analysis method best suiting your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate the students’ average scores in schools. It is better to rely on descriptive statistics when the researchers intend to keep the research or outcome limited to the provided  sample  without generalizing it. For example, when you want to compare the average voting done in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100 audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected  sample  to reason that about 80–90% of people like the movie.
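The movie-theater example can be sketched as a normal-approximation confidence interval for a population proportion (the sample counts below are hypothetical):

```python
import math

# Hypothetical sample: 82 of 100 moviegoers said they liked the film.
n, liked = 100, 82
p_hat = liked / n

# 95% normal-approximation confidence interval for the true proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{p_hat:.0%} liked it; 95% CI: {low:.1%} to {high:.1%}")
```

The interval (roughly 74% to 90% here) is what justifies the "about 80–90% of people" kind of claim about the wider population.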

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis test: It’s about using sampled research data to answer the survey research questions. For example, researchers might be interested to understand whether a newly launched shade of lipstick is good or not, or whether multivitamin capsules help children perform better at games.

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental research or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables,  cross-tabulation  is used to analyze the relationship between multiple variables.  Suppose provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation helps for seamless data analysis and research by showing the number of males and females in each age category.
  • Regression analysis: To understand the strength of the relationship between two variables, researchers rarely look beyond regression analysis, the primary and most commonly used method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable, as well as one or more independent variables, and you work out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to be ascertained in an error-free random manner.
  • Frequency tables: These summarize how often each value or response occurs in the data, making it easy to compare the distribution of responses across groups or categories.
  • Analysis of variance: The statistical procedure is used for testing the degree to which two or more vary or differ in an experiment. A considerable degree of variation means research findings were significant. In many contexts, ANOVA testing and variance analysis are similar.
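As a minimal sketch of regression analysis with one independent variable, the ordinary least squares slope and intercept can be computed directly; the hours-studied vs. exam-score data below is hypothetical:

```python
# Ordinary least squares for one independent variable.
xs = [1, 2, 3, 4, 5]        # hypothetical hours studied (independent)
ys = [52, 55, 61, 65, 72]   # hypothetical exam scores (dependent)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Slope = covariance of x and y divided by variance of x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(f"score = {intercept:.1f} + {slope:.1f} * hours")  # → score = 46.0 + 5.0 * hours
```

The fitted line quantifies the impact of the independent variable: here, each additional hour of study is associated with five more points on the score.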

Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data , and should be trained to demonstrate a high standard of research practice. Ideally, researchers must possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Usually, research and data analytics projects differ by scientific discipline; therefore, getting statistical advice at the beginning of analysis helps design a survey questionnaire, select data collection methods , and choose samples.

LEARN ABOUT: Best Data Collection Tools

  • The primary aim of data research and analysis is to derive ultimate insights that are unbiased. Any mistake in collecting data, selecting an analysis method, or choosing an  audience  sample, or any bias while doing so, is likely to lead to a biased inference.
  • No amount of sophistication in research data analysis is enough to rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are not clear, a lack of clarity might mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find a way to deal with everyday challenges like outliers, missing data, data altering, data mining , or developing graphical representation.

LEARN MORE: Descriptive Research vs Correlational Research

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018 alone, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.


What is Data Analysis?

According to the federal government, data analysis is "the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data" ( Responsible Conduct in Data Management ). Important components of data analysis include searching for patterns, remaining unbiased in drawing inference from data, practicing responsible  data management , and maintaining "honest and accurate analysis" ( Responsible Conduct in Data Management ). 

In order to understand data analysis further, it can be helpful to take a step back and understand the question "What is data?". Many of us associate data with spreadsheets of numbers and values, however, data can encompass much more than that. According to the federal government, data is "The recorded factual material commonly accepted in the scientific community as necessary to validate research findings" ( OMB Circular 110 ). This broad definition can include information in many formats. 

Some examples of types of data are as follows:

  • Photographs 
  • Hand-written notes from field observation
  • Machine learning training data sets
  • Ethnographic interview transcripts
  • Sheet music
  • Scripts for plays and musicals 
  • Observations from laboratory experiments ( CMU Data 101 )

Thus, data analysis includes the processing and manipulation of these data sources in order to gain additional insight from data, answer a research question, or confirm a research hypothesis. 

Data analysis falls within the larger research data lifecycle ( University of Virginia ).

Why Analyze Data?

Through data analysis, a researcher can gain additional insight from data and draw conclusions to address the research question or hypothesis. Use of data analysis tools helps researchers understand and interpret data. 

What are the Types of Data Analysis?

Data analysis can be quantitative, qualitative, or mixed methods. 

Quantitative research typically involves numbers and "close-ended questions and responses" ( Creswell & Creswell, 2018 , p. 3). Quantitative research tests variables against objective theories, usually measured and collected on instruments and analyzed using statistical procedures ( Creswell & Creswell, 2018 , p. 4). Quantitative analysis usually uses deductive reasoning. 

Qualitative  research typically involves words and "open-ended questions and responses" ( Creswell & Creswell, 2018 , p. 3). According to Creswell & Creswell, "qualitative research is an approach for exploring and understanding the meaning individuals or groups ascribe to a social or human problem" ( 2018 , p. 4). Thus, qualitative analysis usually invokes inductive reasoning. 

Mixed methods  research uses methods from both quantitative and qualitative research approaches. Mixed methods research works under the "core assumption... that the integration of qualitative and quantitative data yields additional insight beyond the information provided by either the quantitative or qualitative data alone" ( Creswell & Creswell, 2018 , p. 4). 

  • Last Updated: Sep 4, 2024 11:49 AM
  • URL: https://guides.library.georgetown.edu/data-analysis


Encyclopedia Britannica


[Figure: Data analysis at the Armstrong Flight Research Center in Palmdale, California]

data analysis


data analysis , the process of systematically collecting, cleaning, transforming, describing, modeling, and interpreting data , generally employing statistical techniques. Data analysis is an important part of both scientific research and business, where demand has grown in recent years for data-driven decision making . Data analysis techniques are used to gain useful insights from datasets, which can then be used to make operational decisions or guide future research . With the rise of “ big data ,” the storage of vast quantities of data in large databases and data warehouses, there is increasing need to apply data analysis techniques to generate insights about volumes of data too large to be manipulated by instruments of low information-processing capacity.

Datasets are collections of information. Generally, data and datasets are themselves collected to help answer questions, make decisions, or otherwise inform reasoning. The rise of information technology has led to the generation of vast amounts of data of many kinds, such as text, pictures, videos, personal information, account data, and metadata, the last of which provide information about other data. It is common for apps and websites to collect data about how their products are used or about the people using their platforms. Consequently, there is vastly more data being collected today than at any other time in human history. A single business may track billions of interactions with millions of consumers at hundreds of locations with thousands of employees and any number of products. Analyzing that volume of data is generally only possible using specialized computational and statistical techniques.

The desire for businesses to make the best use of their data has led to the development of the field of business intelligence , which covers a variety of tools and techniques that allow businesses to perform data analysis on the information they collect.

For data to be analyzed, it must first be collected and stored. Raw data must be processed into a format that can be used for analysis and be cleaned so that errors and inconsistencies are minimized. Data can be stored in many ways, but one of the most useful is in a database . A database is a collection of interrelated data organized so that certain records (collections of data related to a single entity) can be retrieved on the basis of various criteria . The most familiar kind of database is the relational database , which stores data in tables with rows that represent records (tuples) and columns that represent fields (attributes). A query is a command that retrieves a subset of the information in the database according to certain criteria. A query may retrieve only records that meet certain criteria, or it may join fields from records across multiple tables by use of a common field.

Frequently, data from many sources is collected into large archives of data called data warehouses. The process of moving data from its original sources (such as databases) to a centralized location (generally a data warehouse) is called ETL (which stands for extract , transform , and load ).

  • The extraction step occurs when you identify and copy or export the desired data from its source, such as by running a database query to retrieve the desired records.
  • The transformation step is the process of cleaning the data so that they fit the analytical need for the data and the schema of the data warehouse. This may involve changing formats for certain fields, removing duplicate records, or renaming fields, among other processes.
  • Finally, the clean data are loaded into the data warehouse, where they may join vast amounts of historical data and data from other sources.
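The three ETL steps above can be sketched with SQLite in-memory databases standing in for the source and the warehouse; the table names, schema, and rows are all hypothetical:

```python
import sqlite3

# A hypothetical source system with raw, duplicated, text-typed data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, "10.50"), (2, "10.50"), (2, "10.50"), (3, "7.25")])

# Extract: query the desired records out of the source.
rows = source.execute("SELECT id, amount FROM orders").fetchall()

# Transform: change the amount's format to a number and drop duplicates.
clean = sorted({(rid, float(amount)) for rid, amount in rows})

# Load: insert the clean rows into the warehouse's fact table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
print(warehouse.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # → 3
```

Real pipelines add incremental loading, schema mapping, and error handling, but the extract/transform/load shape is the same.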

After data are effectively collected and cleaned, they can be analyzed with a variety of techniques. Analysis often begins with descriptive and exploratory data analysis. Descriptive data analysis uses statistics to organize and summarize data, making it easier to understand the broad qualities of the dataset. Exploratory data analysis looks for insights into the data that may arise from descriptions of distribution, central tendency, or variability for a single data field. Further relationships between data may become apparent by examining two fields together. Visualizations may be employed during analysis, such as histograms (graphs in which the length of a bar indicates a quantity) or stem-and-leaf plots (which divide data into buckets, or “stems,” with individual data points serving as “leaves” on the stem).


Data analysis frequently goes beyond descriptive analysis to predictive analysis, making predictions about the future using predictive modeling techniques. Predictive modeling uses machine learning , regression analysis methods (which mathematically calculate the relationship between an independent variable and a dependent variable), and classification techniques to identify trends and relationships among variables. Predictive analysis may involve data mining , which is the process of discovering interesting or useful patterns in large volumes of information. Data mining often involves cluster analysis , which tries to find natural groupings within data, and anomaly detection , which detects instances in data that are unusual and stand out from other patterns. It may also look for rules within datasets, strong relationships among variables in the data.

PW Skills | Blog

Data Analysis Techniques in Research – Methods, Tools & Examples


Varun Saharawat is a seasoned professional in the fields of SEO and content writing. With a profound knowledge of the intricate aspects of these disciplines, Varun has established himself as a valuable asset in the world of digital marketing and online content creation.

Data analysis techniques in research are essential because they allow researchers to derive meaningful insights from data sets to support their hypotheses or research objectives.


Data Analysis Techniques in Research : While various groups, institutions, and professionals may have diverse approaches to data analysis, a universal definition captures its essence. Data analysis involves refining, transforming, and interpreting raw data to derive actionable insights that guide informed decision-making for businesses.

A straightforward illustration of data analysis emerges when we make everyday decisions, basing our choices on past experiences or predictions of potential outcomes.



What is Data Analysis?

Data analysis is the systematic process of inspecting, cleaning, transforming, and interpreting data with the objective of discovering valuable insights and drawing meaningful conclusions. This process involves several steps:

  • Inspecting: Initial examination of data to understand its structure, quality, and completeness.
  • Cleaning: Removing errors, inconsistencies, or irrelevant information to ensure accurate analysis.
  • Transforming: Converting data into a format suitable for analysis, such as normalization or aggregation.
  • Interpreting: Analyzing the transformed data to identify patterns, trends, and relationships.
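These four steps can be illustrated with a short Python sketch; the survey records and field names below are invented for illustration:

```python
import statistics

# Hypothetical raw survey records; some entries are malformed.
raw_records = [
    {"score": "85"}, {"score": "92"}, {"score": ""},
    {"score": "78"}, {"score": "not_a_number"}, {"score": "88"},
]

def inspect(records):
    """Inspecting: count total rows and rows with a non-empty score."""
    complete = [r for r in records if r["score"].strip()]
    return {"total": len(records), "complete": len(complete)}

def clean(records):
    """Cleaning: drop rows whose score is not a valid number."""
    return [r for r in records if r["score"].strip().isdigit()]

def transform(records):
    """Transforming: convert score strings into numeric values."""
    return [int(r["score"]) for r in records]

def interpret(scores):
    """Interpreting: summarize the cleaned, transformed scores."""
    return {"mean": statistics.mean(scores), "max": max(scores)}

summary = interpret(transform(clean(raw_records)))
```

In practice each step would be more involved, but the same inspect–clean–transform–interpret shape applies.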

Types of Data Analysis Techniques in Research

Data analysis techniques in research are categorized into qualitative and quantitative methods, each with its specific approaches and tools. These techniques are instrumental in extracting meaningful insights, patterns, and relationships from data to support informed decision-making, validate hypotheses, and derive actionable recommendations. Below is an in-depth exploration of the various types of data analysis techniques commonly employed in research:

1) Qualitative Analysis:

Definition: Qualitative analysis focuses on understanding non-numerical data, such as opinions, concepts, or experiences, to derive insights into human behavior, attitudes, and perceptions.

  • Content Analysis: Examines textual data, such as interview transcripts, articles, or open-ended survey responses, to identify themes, patterns, or trends.
  • Narrative Analysis: Analyzes personal stories or narratives to understand individuals’ experiences, emotions, or perspectives.
  • Ethnographic Studies: Involves observing and analyzing cultural practices, behaviors, and norms within specific communities or settings.

2) Quantitative Analysis:

Quantitative analysis emphasizes numerical data and employs statistical methods to explore relationships, patterns, and trends. It encompasses several approaches:

Descriptive Analysis:

  • Frequency Distribution: Represents the number of occurrences of distinct values within a dataset.
  • Central Tendency: Measures such as mean, median, and mode provide insights into the central values of a dataset.
  • Dispersion: Techniques like variance and standard deviation indicate the spread or variability of data.
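These descriptive measures can be computed directly with Python's built-in statistics module; a minimal sketch using made-up grades:

```python
import statistics

grades = [70, 75, 75, 80, 85, 90, 95]

freq = {g: grades.count(g) for g in sorted(set(grades))}  # frequency distribution
mean = statistics.mean(grades)            # central tendency
median = statistics.median(grades)
mode = statistics.mode(grades)
variance = statistics.pvariance(grades)   # dispersion (population variance)
std_dev = statistics.pstdev(grades)       # population standard deviation
```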

Diagnostic Analysis:

  • Regression Analysis: Assesses the relationship between dependent and independent variables, enabling prediction or understanding causality.
  • ANOVA (Analysis of Variance): Examines differences between groups to identify significant variations or effects.

Predictive Analysis:

  • Time Series Forecasting: Uses historical data points to predict future trends or outcomes.
  • Machine Learning Algorithms: Techniques like decision trees, random forests, and neural networks predict outcomes based on patterns in data.

Prescriptive Analysis:

  • Optimization Models: Utilizes linear programming, integer programming, or other optimization techniques to identify the best solutions or strategies.
  • Simulation: Mimics real-world scenarios to evaluate various strategies or decisions and determine optimal outcomes.

Specific Techniques:

  • Monte Carlo Simulation: Models probabilistic outcomes to assess risk and uncertainty.
  • Factor Analysis: Reduces the dimensionality of data by identifying underlying factors or components.
  • Cohort Analysis: Studies specific groups or cohorts over time to understand trends, behaviors, or patterns within these groups.
  • Cluster Analysis: Classifies objects or individuals into homogeneous groups or clusters based on similarities or attributes.
  • Sentiment Analysis: Uses natural language processing and machine learning techniques to determine sentiment, emotions, or opinions from textual data.
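As one concrete illustration, a Monte Carlo simulation estimates a quantity through repeated random sampling. A minimal sketch that estimates π by sampling points in the unit square:

```python
import random

def monte_carlo_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square and
    counting the fraction that land inside the quarter circle."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / n_samples
```

The same structure (sample randomly, aggregate outcomes) underlies Monte Carlo risk assessments with more realistic models.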

Also Read: AI and Predictive Analytics: Examples, Tools, Uses, Ai Vs Predictive Analytics

Data Analysis Techniques in Research Examples

To provide a clearer understanding of how data analysis techniques are applied in research, let’s consider a hypothetical research study focused on evaluating the impact of online learning platforms on students’ academic performance.

Research Objective:

Determine if students using online learning platforms achieve higher academic performance compared to those relying solely on traditional classroom instruction.

Data Collection:

  • Quantitative Data: Academic scores (grades) of students using online platforms and those using traditional classroom methods.
  • Qualitative Data: Feedback from students regarding their learning experiences, challenges faced, and preferences.

Data Analysis Techniques Applied:

1) Descriptive Analysis:

  • Calculate the mean, median, and mode of academic scores for both groups.
  • Create frequency distributions to represent the distribution of grades in each group.

2) Diagnostic Analysis:

  • Conduct an Analysis of Variance (ANOVA) to determine if there’s a statistically significant difference in academic scores between the two groups.
  • Perform Regression Analysis to assess the relationship between the time spent on online platforms and academic performance.

3) Predictive Analysis:

  • Utilize Time Series Forecasting to predict future academic performance trends based on historical data.
  • Implement Machine Learning algorithms to develop a predictive model that identifies factors contributing to academic success on online platforms.

4) Prescriptive Analysis:

  • Apply Optimization Models to identify the optimal combination of online learning resources (e.g., video lectures, interactive quizzes) that maximize academic performance.
  • Use Simulation Techniques to evaluate different scenarios, such as varying student engagement levels with online resources, to determine the most effective strategies for improving learning outcomes.

5) Specific Techniques:

  • Conduct Factor Analysis on qualitative feedback to identify common themes or factors influencing students’ perceptions and experiences with online learning.
  • Perform Cluster Analysis to segment students based on their engagement levels, preferences, or academic outcomes, enabling targeted interventions or personalized learning strategies.
  • Apply Sentiment Analysis on textual feedback to categorize students’ sentiments as positive, negative, or neutral regarding online learning experiences.

By applying a combination of qualitative and quantitative data analysis techniques, this research example aims to provide comprehensive insights into the effectiveness of online learning platforms.

Also Read: Learning Path to Become a Data Analyst in 2024

Data Analysis Techniques in Quantitative Research

Quantitative research involves collecting numerical data to examine relationships, test hypotheses, and make predictions. Various data analysis techniques are employed to interpret and draw conclusions from quantitative data. Here are some key data analysis techniques commonly used in quantitative research:

1) Descriptive Statistics:

  • Description: Descriptive statistics are used to summarize and describe the main aspects of a dataset, such as central tendency (mean, median, mode), variability (range, variance, standard deviation), and distribution (skewness, kurtosis).
  • Applications: Summarizing data, identifying patterns, and providing initial insights into the dataset.

2) Inferential Statistics:

  • Description: Inferential statistics involve making predictions or inferences about a population based on a sample of data. This technique includes hypothesis testing, confidence intervals, t-tests, chi-square tests, analysis of variance (ANOVA), regression analysis, and correlation analysis.
  • Applications: Testing hypotheses, making predictions, and generalizing findings from a sample to a larger population.
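As a small illustration of inferential statistics, a confidence interval for a population mean can be computed from a sample; this sketch uses the normal critical value z = 1.96, which assumes a roughly 95% large-sample interval:

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate 95% confidence interval for the population mean,
    using a normal critical value (reasonable for large samples)."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se
```

For small samples, a t critical value would replace z.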

3) Regression Analysis:

  • Description: Regression analysis is a statistical technique used to model and examine the relationship between a dependent variable and one or more independent variables. Linear regression, multiple regression, logistic regression, and nonlinear regression are common types of regression analysis.
  • Applications: Predicting outcomes, identifying relationships between variables, and understanding the impact of independent variables on the dependent variable.
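A minimal sketch of simple linear regression, computing the ordinary least squares slope and intercept from first principles (real analyses typically use libraries such as statsmodels or scikit-learn):

```python
def linear_fit(x, y):
    """Ordinary least squares fit for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```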

4) Correlation Analysis:

  • Description: Correlation analysis is used to measure and assess the strength and direction of the relationship between two or more variables. The Pearson correlation coefficient, Spearman rank correlation coefficient, and Kendall’s tau are commonly used measures of correlation.
  • Applications: Identifying associations between variables and assessing the degree and nature of the relationship.
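The Pearson correlation coefficient mentioned above can be computed directly from its definition; a minimal sketch:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition:
    covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

Values near +1 indicate a strong positive linear relationship, near −1 a strong negative one, and near 0 little linear association.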

5) Factor Analysis:

  • Description: Factor analysis is a multivariate statistical technique used to identify and analyze underlying relationships or factors among a set of observed variables. It helps in reducing the dimensionality of data and identifying latent variables or constructs.
  • Applications: Identifying underlying factors or constructs, simplifying data structures, and understanding the underlying relationships among variables.

6) Time Series Analysis:

  • Description: Time series analysis involves analyzing data collected or recorded over a specific period at regular intervals to identify patterns, trends, and seasonality. Techniques such as moving averages, exponential smoothing, autoregressive integrated moving average (ARIMA), and Fourier analysis are used.
  • Applications: Forecasting future trends, analyzing seasonal patterns, and understanding time-dependent relationships in data.
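For instance, the trailing simple moving average, one of the smoothing techniques listed above, can be sketched as:

```python
def moving_average(series, window):
    """Trailing simple moving average; one value per complete window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]
```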

7) ANOVA (Analysis of Variance):

  • Description: Analysis of variance (ANOVA) is a statistical technique used to analyze and compare the means of two or more groups or treatments to determine if they are statistically different from each other. One-way ANOVA, two-way ANOVA, and MANOVA (Multivariate Analysis of Variance) are common types of ANOVA.
  • Applications: Comparing group means, testing hypotheses, and determining the effects of categorical independent variables on a continuous dependent variable.
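The one-way ANOVA F statistic is the between-group mean square divided by the within-group mean square; a minimal sketch computed from first principles (real analyses would also report the p-value, e.g., via scipy.stats):

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```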

8) Chi-Square Tests:

  • Description: Chi-square tests are non-parametric statistical tests used to assess the association between categorical variables in a contingency table. The Chi-square test of independence, goodness-of-fit test, and test of homogeneity are common chi-square tests.
  • Applications: Testing relationships between categorical variables, assessing goodness-of-fit, and evaluating independence.
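The chi-square statistic compares observed counts in a contingency table against the counts expected under independence; a minimal sketch:

```python
def chi_square_stat(table):
    """Chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

The statistic would then be compared against a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom.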

These quantitative data analysis techniques provide researchers with valuable tools and methods to analyze, interpret, and derive meaningful insights from numerical data. The selection of a specific technique often depends on the research objectives, the nature of the data, and the underlying assumptions of the statistical methods being used.

Also Read: Analysis vs. Analytics: How Are They Different?

Data Analysis Methods

Data analysis methods refer to the techniques and procedures used to analyze, interpret, and draw conclusions from data. These methods are essential for transforming raw data into meaningful insights, facilitating decision-making processes, and driving strategies across various fields. Here are some common data analysis methods:

1) Descriptive Statistics:

  • Description: Descriptive statistics summarize and organize data to provide a clear and concise overview of the dataset. Measures such as mean, median, mode, range, variance, and standard deviation are commonly used.

2) Inferential Statistics:

  • Description: Inferential statistics involve making predictions or inferences about a population based on a sample of data. Techniques such as hypothesis testing, confidence intervals, and regression analysis are used.

3) Exploratory Data Analysis (EDA):

  • Description: EDA techniques involve visually exploring and analyzing data to discover patterns, relationships, anomalies, and insights. Methods such as scatter plots, histograms, box plots, and correlation matrices are utilized.
  • Applications: Identifying trends, patterns, outliers, and relationships within the dataset.
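A quick EDA sketch: binning values and rendering an ASCII histogram with only the standard library (real EDA would typically use matplotlib or seaborn):

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Bin numeric values and render a quick ASCII histogram for EDA."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return {start: "#" * count for start, count in sorted(bins.items())}
```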

4) Predictive Analytics:

  • Description: Predictive analytics use statistical algorithms and machine learning techniques to analyze historical data and make predictions about future events or outcomes. Techniques such as regression analysis, time series forecasting, and machine learning algorithms (e.g., decision trees, random forests, neural networks) are employed.
  • Applications: Forecasting future trends, predicting outcomes, and identifying potential risks or opportunities.

5) Prescriptive Analytics:

  • Description: Prescriptive analytics involve analyzing data to recommend actions or strategies that optimize specific objectives or outcomes. Optimization techniques, simulation models, and decision-making algorithms are utilized.
  • Applications: Recommending optimal strategies, decision-making support, and resource allocation.

6) Qualitative Data Analysis:

  • Description: Qualitative data analysis involves analyzing non-numerical data, such as text, images, videos, or audio, to identify themes, patterns, and insights. Methods such as content analysis, thematic analysis, and narrative analysis are used.
  • Applications: Understanding human behavior, attitudes, perceptions, and experiences.

7) Big Data Analytics:

  • Description: Big data analytics methods are designed to analyze large volumes of structured and unstructured data to extract valuable insights. Technologies such as Hadoop, Spark, and NoSQL databases are used to process and analyze big data.
  • Applications: Analyzing large datasets, identifying trends, patterns, and insights from big data sources.

8) Text Analytics:

  • Description: Text analytics methods involve analyzing textual data, such as customer reviews, social media posts, emails, and documents, to extract meaningful information and insights. Techniques such as sentiment analysis, text mining, and natural language processing (NLP) are used.
  • Applications: Analyzing customer feedback, monitoring brand reputation, and extracting insights from textual data sources.
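As a toy illustration of sentiment analysis, a lexicon-based scorer counts positive and negative words. The mini-lexicon below is invented; production systems use large curated lexicons or trained NLP models:

```python
# Invented mini-lexicon; real systems use large curated lexicons or ML models.
POSITIVE = {"good", "great", "helpful", "excellent"}
NEGATIVE = {"bad", "poor", "confusing", "slow"}

def sentiment(text):
    """Classify text as positive, negative, or neutral by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```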

These data analysis methods are instrumental in transforming data into actionable insights, informing decision-making processes, and driving organizational success across various sectors, including business, healthcare, finance, marketing, and research. The selection of a specific method often depends on the nature of the data, the research objectives, and the analytical requirements of the project or organization.

Also Read: Quantitative Data Analysis: Types, Analysis & Examples

Data Analysis Tools

Data analysis tools are essential instruments that facilitate the process of examining, cleaning, transforming, and modeling data to uncover useful information, make informed decisions, and drive strategies. Here are some prominent data analysis tools widely used across various industries:

1) Microsoft Excel:

  • Description: A spreadsheet software that offers basic to advanced data analysis features, including pivot tables, data visualization tools, and statistical functions.
  • Applications: Data cleaning, basic statistical analysis, visualization, and reporting.

2) R Programming Language:

  • Description: An open-source programming language specifically designed for statistical computing and data visualization.
  • Applications: Advanced statistical analysis, data manipulation, visualization, and machine learning.

3) Python (with Libraries like Pandas, NumPy, Matplotlib, and Seaborn):

  • Description: A versatile programming language with libraries that support data manipulation, analysis, and visualization.
  • Applications: Data cleaning, statistical analysis, machine learning, and data visualization.

4) SPSS (Statistical Package for the Social Sciences):

  • Description: A comprehensive statistical software suite used for data analysis, data mining, and predictive analytics.
  • Applications: Descriptive statistics, hypothesis testing, regression analysis, and advanced analytics.

5) SAS (Statistical Analysis System):

  • Description: A software suite used for advanced analytics, multivariate analysis, and predictive modeling.
  • Applications: Data management, statistical analysis, predictive modeling, and business intelligence.

6) Tableau:

  • Description: A data visualization tool that allows users to create interactive and shareable dashboards and reports.
  • Applications: Data visualization, business intelligence, and interactive dashboard creation.

7) Power BI:

  • Description: A business analytics tool developed by Microsoft that provides interactive visualizations and business intelligence capabilities.
  • Applications: Data visualization, business intelligence, reporting, and dashboard creation.

8) SQL (Structured Query Language) Databases (e.g., MySQL, PostgreSQL, Microsoft SQL Server):

  • Description: Database management systems that support data storage, retrieval, and manipulation using SQL queries.
  • Applications: Data retrieval, data cleaning, data transformation, and database management.
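A minimal sketch of SQL-based retrieval and aggregation, using Python's built-in sqlite3 module with an in-memory database and a hypothetical scores table:

```python
import sqlite3

# In-memory database with a hypothetical table of student grades.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ana", 85), ("ben", 92), ("cal", 78)],
)

# Retrieval and aggregation via SQL queries.
avg_grade = conn.execute("SELECT AVG(grade) FROM scores").fetchone()[0]
top_student = conn.execute(
    "SELECT student FROM scores ORDER BY grade DESC LIMIT 1"
).fetchone()[0]
conn.close()
```

The same queries work against server-based systems such as MySQL or PostgreSQL with the appropriate driver.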

9) Apache Spark:

  • Description: A fast and general-purpose distributed computing system designed for big data processing and analytics.
  • Applications: Big data processing, machine learning, data streaming, and real-time analytics.

10) IBM SPSS Modeler:

  • Description: A data mining software application used for building predictive models and conducting advanced analytics.
  • Applications: Predictive modeling, data mining, statistical analysis, and decision optimization.

These tools serve various purposes and cater to different data analysis needs, from basic statistical analysis and data visualization to advanced analytics, machine learning, and big data processing. The choice of a specific tool often depends on the nature of the data, the complexity of the analysis, and the specific requirements of the project or organization.

Also Read: How to Analyze Survey Data: Methods & Examples

Importance of Data Analysis in Research

The importance of data analysis in research cannot be overstated; it serves as the backbone of any scientific investigation or study. Here are several key reasons why data analysis is crucial in the research process:

  • Data analysis helps ensure that the results obtained are valid and reliable. By systematically examining the data, researchers can identify any inconsistencies or anomalies that may affect the credibility of the findings.
  • Effective data analysis provides researchers with the necessary information to make informed decisions. By interpreting the collected data, researchers can draw conclusions, make predictions, or formulate recommendations based on evidence rather than intuition or guesswork.
  • Data analysis allows researchers to identify patterns, trends, and relationships within the data. This can lead to a deeper understanding of the research topic, enabling researchers to uncover insights that may not be immediately apparent.
  • In empirical research, data analysis plays a critical role in testing hypotheses. Researchers collect data to either support or refute their hypotheses, and data analysis provides the tools and techniques to evaluate these hypotheses rigorously.
  • Transparent and well-executed data analysis enhances the credibility of research findings. By clearly documenting the data analysis methods and procedures, researchers allow others to replicate the study, thereby contributing to the reproducibility of research findings.
  • In fields such as business or healthcare, data analysis helps organizations allocate resources more efficiently. By analyzing data on consumer behavior, market trends, or patient outcomes, organizations can make strategic decisions about resource allocation, budgeting, and planning.
  • In public policy and social sciences, data analysis is instrumental in developing and evaluating policies and interventions. By analyzing data on social, economic, or environmental factors, policymakers can assess the effectiveness of existing policies and inform the development of new ones.
  • Data analysis allows for continuous improvement in research methods and practices. By analyzing past research projects, identifying areas for improvement, and implementing changes based on data-driven insights, researchers can refine their approaches and enhance the quality of future research endeavors.

However, it is important to remember that mastering these techniques requires practice and continuous learning. That’s why we highly recommend the Data Analytics Course by Physics Wallah . Not only does it cover all the fundamentals of data analysis, but it also provides hands-on experience with various tools such as Excel, Python, and Tableau. Plus, if you use the “ READER ” coupon code at checkout, you can get a special discount on the course.

For Latest Tech Related Information, Join Our Official Free Telegram Group: PW Skills Telegram Group

Data Analysis Techniques in Research FAQs

What are the 5 techniques for data analysis?

The five techniques for data analysis are: Descriptive Analysis, Diagnostic Analysis, Predictive Analysis, Prescriptive Analysis, and Qualitative Analysis.

What are techniques of data analysis in research?

Techniques of data analysis in research encompass both qualitative and quantitative methods. These techniques involve processes like summarizing raw data, investigating causes of events, forecasting future outcomes, offering recommendations based on predictions, and examining non-numerical data to understand concepts or experiences.

What are the 3 methods of data analysis?

The three primary methods of data analysis are: Qualitative Analysis, Quantitative Analysis, and Mixed-Methods Analysis.

What are the four types of data analysis techniques?

The four types of data analysis techniques are: Descriptive Analysis, Diagnostic Analysis, Predictive Analysis, and Prescriptive Analysis.



Open Access

Principles for data analysis workflows

Contributed equally to this work with: Sara Stoudt, Váleri N. Vásquez

Affiliations Berkeley Institute for Data Science, University of California Berkeley, Berkeley, California, United States of America, Statistical & Data Sciences Program, Smith College, Northampton, Massachusetts, United States of America


Affiliations Berkeley Institute for Data Science, University of California Berkeley, Berkeley, California, United States of America, Energy and Resources Group, University of California Berkeley, Berkeley, California, United States of America

* E-mail: [email protected]

Affiliations Berkeley Institute for Data Science, University of California Berkeley, Berkeley, California, United States of America, Department of Molecular and Cellular Biology, University of California Berkeley, Berkeley, California, United States of America

  • Sara Stoudt, 
  • Váleri N. Vásquez, 
  • Ciera C. Martinez

PLOS

Published: March 18, 2021

  • https://doi.org/10.1371/journal.pcbi.1008770


A systematic and reproducible “workflow”—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between design principles and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students new to research and current researchers who are new to data-intensive work.

Citation: Stoudt S, Vásquez VN, Martinez CC (2021) Principles for data analysis workflows. PLoS Comput Biol 17(3): e1008770. https://doi.org/10.1371/journal.pcbi.1008770

Editor: Patricia M. Palagi, SIB Swiss Institute of Bioinformatics, SWITZERLAND

Copyright: © 2021 Stoudt et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: SS was supported by the National Physical Sciences Consortium ( https://stemfellowships.org/ ) fellowship. SS, VNV, and CCM were supported by the Gordon & Betty Moore Foundation ( https://www.moore.org/ ) (GBMF3834) and Alfred P. Sloan Foundation ( https://sloan.org/ ) (2013-10-27) as part of the Moore-Sloan Data Science Environments. CCM holds a Postdoctoral Enrichment Program Award from the Burroughs Wellcome Fund ( https://www.bwfund.org/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Both traditional science fields and the humanities are becoming increasingly data driven and computational. Researchers who may not identify as data scientists are working with large and complex data on a regular basis. A systematic and reproducible research workflow —the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of data-intensive research practice in any academic discipline. The importance and effective development of a workflow should, in turn, be a cornerstone of the data science education designed to prepare researchers across disciplinary specializations.

Data science education tends to review foundational statistical analysis methods [ 1 ] and furnish training in computational tools , software, and programming languages. In scientific fields, education and training includes a review of domain-specific methods and tools, but generally omits guidance on the coding practices relevant to developing new analysis software—a skill of growing relevance in data-intensive scientific fields [ 2 ]. Meanwhile, the holistic discussion of how to develop and pursue a research workflow is often left out of introductions to both data science and disciplinary science. Too frequently, students and academic practitioners of data-intensive research are left to learn these essential skills on their own and on the job. Guidance on the breadth of potential products that can emerge from research is also lacking. In the interest of both reproducible science (providing the necessary data and code to recreate the results) and effective career building, researchers should be primed to regularly generate outputs over the course of their workflow.

The goal of this paper is to deconstruct an academic data-intensive research project, demonstrating how both design principles and software development methods can motivate the creation and standardization of practices for reproducible data and code. The implementation of such practices generates research products that can be effectively communicated, in addition to constituting a scientific contribution. Here, “data-intensive” research is used interchangeably with “data science” in a recognition of the breadth of domain applications that draw upon computational analysis methods and workflows. (We define other terms we’ve bolded throughout this paper in Box 1 ). To be useful, let alone high impact, research analyses should be contextualized in the data processing decisions that led to their creation and accompanied by a narrative that explains why the rest of the world should be interested. One way of thinking about this is that the scientific method should be tangibly reflected, and feasibly reproducible, in any data-intensive research project.

Box 1. Terminology

This box provides definitions for terms in bold throughout the text. Terms are sorted alphabetically and cross referenced where applicable.

Agile: An iterative software development framework which adheres to the principles described in the Manifesto for Agile software development [ 35 ] (e.g., breaks up work into small increments).

Accessor function: A function that returns the value of a variable (synonymous term: getter function).

Assertion: An expression that is expected to be true at a particular point in the code.

Computational tool: May include libraries, packages, collections of functions, and/or data structures that have been consciously designed to facilitate the development and pursuit of data-intensive questions (synonymous term: software tool).

Continuous integration: Automatically running tests against newly updated code.

Gut check: Also “data gut check.” Quick, broad, and shallow testing [ 48 ] before and during data analysis. Although this is usually described in the context of software development, the concept of a data-specific gut check can include checking the dimensions of data structures after merging or assessing null values/missing values, zero values, negative values, and ranges of values to see if they make sense (synonymous words: smoke test, sanity check [ 49 ], consistency check, sniff test, soundness check).

Data-intensive research: Research that is centrally based on the analysis of data and its structural or statistical properties. May include but is not limited to research that hinges on large volumes of data or a wide variety of data types requiring computational skills to approach such research (synonymous term: data science research). “Data science” as a stand-alone term may also refer more broadly to the use of computational tools and statistical methods to gain insights from digitized information.

Data structure: A format for storing data values and definition of operations that can be applied to data of a particular type.

Defensive programming: Strategies to guard against failures or bugs in code; this includes the use of tests and assertions.
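A minimal sketch of defensive programming in Python, using assertions as a data gut check (the function, field values, and plausibility bounds here are hypothetical):

```python
def clean_ages(ages):
    """Drop missing values, then assert properties the result must satisfy."""
    cleaned = [a for a in ages if a is not None]
    # Gut-check assertions fail fast if the data violates expectations.
    assert len(cleaned) <= len(ages)
    assert all(0 <= a <= 120 for a in cleaned), "age out of plausible range"
    return cleaned
```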

Design thinking: The iterative process of defining a problem then identifying and prototyping potential solutions to that problem, with an emphasis on solutions that are empathetic to the particular needs of the target user.

Docstring: A code comment for a particular line of code that describes what a function does, as opposed to how the function performs that operation.

DOI: A digital object identifier or DOI is a unique handle, standardized by the International Organization for Standardization (ISO), that can be assigned to different types of information objects.

Extensibility: The flexibility to be extended or repurposed in a new scenario.

Function: A piece of more abstracted code that can be reused to perform the same operation on different inputs of the same type and has a standardized output [ 50 – 52 ].

Getter function: Another term for an accessor function.

Integrated Development Environment (IDE): A software application that facilitates software development and minimally consists of a source code editor, build automation tools, and a debugger.

Modularity: An ability to separate different functionality into stand-alone pieces.

Mutator method: A function used to control changes to variables. See “setter function” and “accessor function.”

Notebook: A computational or physical place to store details of a research process including decisions made.

Mechanistic code: Code used to perform a task as opposed to conduct an analysis. Examples include processing functions and plotting functions.

Overwrite: The process, intentional or accidental, of assigning new values to existing variables.

Package manager: A system used to automate the installation and configuration of software.

Pipeline: A series of programmatic processes during data analysis and data cleaning, usually linear in nature, that can be automated and usually described in the context of inputs and outputs.

Premature optimization: Focusing on details before the general scheme is decided upon.

Refactoring: A change in code, such as file renaming, to make it more organized without changing the overall output or behavior.

Replicable: A new study arrives at the same scientific findings as a previous study, collecting new data (with the same or different methods) and completing new analyses [ 53 – 55 ].

Reproducible: Authors provide all the necessary data and computer code so the analysis can be run again, recreating the results [ 53 – 55 ].

Script: A collection of code, ideally related to one particular step in the data analysis.

Setter function: A type of function that controls changes to variables. It is used to directly access and alter specific values (synonymous term: mutator method).

Serialization: The process of saving data structures, inputs and outputs, and experimental setups generally in a storable, shareable format. Serialized information can be reconstructed in different computer environments for the purpose of replicating or reproducing experiments.

Software development: A process of writing and documenting code in pursuit of an end goal, typically focused on process over analysis.

Source code editor: A program that facilitates changes to code by an author.

Technical debt: The extra work we defer by pursuing an easier, yet not ideal, solution early in the coding process.

Test-driven development: A development practice in which tests are written before code so that each change is verified against those tests to prove its functionality.

Unit test: A code test for the smallest chunk of code that is actually testable.

Version control: A way of managing changes to code or documentation that maintains a record of changes over time.

White paper: An informative, at least semiformal document that explains a particular issue but is not peer reviewed.

Workflow: The process that moves a scientific investigation from raw data to coherent research question to insightful contribution. This often involves a complex series of processes and includes a mixture of machine automation and human intervention. It is a nonlinear and iterative exercise.

Discussions of “workflow” in data science can take on many different meanings depending on the context. For example, the term “workflow” often gets conflated with the term “pipeline” in the context of software development and engineering. Pipelines are often described as a series of processes that can be programmatically defined and automated and explained in the context of inputs and outputs. However, in this paper, we offer an important distinction between pipelines and workflows: The former refers to what a computer does, for example, when a piece of software automatically runs a series of Bash or R scripts. For the purpose of this paper, a workflow describes what a researcher does to make advances on scientific questions: developing hypotheses, wrangling data, writing code, and interpreting results.

Data analysis workflows can culminate in a number of outcomes that are not restricted to the traditional products of software engineering (software tools and packages) or academia (research papers). Rather, the workflow that a researcher defines and iterates over the course of a data science project can lead to intellectual contributions as varied as novel data sets, new methodological approaches, or teaching materials in addition to the classical tools, packages, and papers. While the workflow should be designed to serve the researcher and their collaborators, maintaining a structured approach throughout the process will inform results that are replicable (see replicable versus reproducible in Box 1) and easily translated into a variety of products that furnish scientific insights for broader consumption.

In the following sections, we explain the basic principles of a constructive and productive data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases. Each phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Where relevant, we draw analogies to the realm of design thinking and software development. While the 3 phases described here are not intended to be a strict rulebook, we hope that the many references to additional resources—and suggestions for nontraditional research products—provide guidance and support for both students new to research and current researchers who are new to data-intensive work.

The Explore, Refine, Produce (ERP) workflow for data-intensive research

We partition the workflow of a data-intensive research process into 3 phases: Explore, Refine, and Produce. These phases, collectively the ERP workflow, are visually described in Fig 1A and 1B. In the Explore Phase, researchers “meet” their data: process it, interrogate it, and sift through potential solutions to a problem of interest. In the Refine Phase, researchers narrow their focus to a particularly promising approach, develop prototypes, and organize their code into a clearer narrative. The Produce Phase happens concurrently with the Explore and Refine Phases. In this phase, researchers prepare their work for broader consumption and critique.


Fig 1. (A) We deconstruct a data-intensive research project into 3 phases, visualizing this process as a tree structure. Each branch in the tree represents a decision that needs to be made about the project, such as data cleaning, refining the scope of the research, or using a particular tool or model. Throughout the natural life of a project, there are many dead ends (yellow Xs). These may include choices that do not work, such as experimentation with a tool that is ultimately not compatible with our data. Dead ends can result in informal learning or procedural fine-tuning. Some dead ends that lie beyond the scope of our current project may turn into a new project later on (open turquoise circles). Throughout the Explore and Refine Phases, we are concurrently in the Produce Phase because research products (closed turquoise circles) can arise at any point throughout the workflow. Products, regardless of the phase that generates their content, contribute to scientific understanding and advance the researcher’s career goals. Thus, the data-intensive research portfolio and corresponding academic CV can be grown at any point in the workflow. (B) The ERP workflow as a nonlinear cycle. Although the tree diagram displayed in Fig 1A accurately depicts the many choices and dead ends that a research project contains, it does not as easily reflect the nonlinearity of the process; Fig 1B’s representation aims to fill this gap. We often iterate between the Explore and Refine Phases while concurrently contributing content to the Produce Phase. The time spent in each phase can vary significantly across different types of projects. For example, hypothesis generation in the Explore Phase might be the biggest hurdle in one project, while effectively communicating a result to a broader audience in the Produce Phase might be the most challenging aspect of another project.

https://doi.org/10.1371/journal.pcbi.1008770.g001

Each phase has an immediate audience—the researcher themselves, their collaborative groups, or the public—that broadens progressively and guides priorities. Each of the 3 phases can benefit from standards that the software development community uses to streamline their code-based pipelines, as well as from principles the design community uses to generate and carry out ideas; many such practices can be adapted to help structure a data-intensive researcher’s workflow. The Explore and Refine Phases provide fodder for the concurrent Produce Phase. We hope that the potential to produce a variety of research products throughout a data-intensive research process, rather than merely at the end of a project, motivates researchers to apply the ERP workflow.

Phase 1: Explore

Data-intensive research projects typically start with a domain-specific question or a particular data set to explore [ 3 ]. There is no fixed, cross-disciplinary rule that defines the point in a workflow by which a hypothesis must be established. This paper adopts an open-minded approach concerning the timing of hypothesis generation [ 4 ], assuming that data-intensive research projects can be motivated by either an explicit, preexisting hypothesis or a new data set about which no strong preconceived assumptions or intuitions exist. The often messy Explore Phase is rarely discussed as an explicit step of the methodological process, but it is an essential component of research: It allows us to gain intuition about our data, informing future phases of the workflow. As we explore our data, we refine our research question and work toward the articulation of a well-defined problem. The following section will address how to reap the benefits of data set and problem space exploration and provide pointers on how to impose structure and reproducibility during this inherently creative phase of the research workflow.

Designing data analysis: Goals and standards of the Explore Phase

Trial and error is the hallmark of the Explore Phase (note the density of “dead ends” and decisions made in this phase in Fig 1A). In “Designerly Ways of Knowing” [ 5 ], the design process is described as a “co-evolution of solution and problem spaces.” Like designers, data-intensive researchers explore the problem space, learn about the potential structure of the solution space, and iterate between the 2 spaces. Importantly, the difficulties we encounter in this phase help us build empathy for an eventual audience beyond ourselves. It is here that we experience firsthand the challenges of processing our data set, framing domain research questions appropriate to it, and structuring the beginnings of a workflow. Documenting our trial and error helps our own work stay on track in addition to assisting future researchers facing similar challenges.

One end goal of the Explore Phase is to determine whether new questions of interest might be answered by leveraging existing software tools (either off the shelf or with minor adjustments), rather than building new computational capabilities ourselves. For example, during this phase, a common activity includes surveying the software available for our data set or problem space and estimating its utility for the unique demands of our current analysis. Through exploration, we learn about relevant computational and analysis tools while concurrently building an understanding of our data.

A second important goal of the Explore Phase is data cleaning and developing a strategy to analyze our data. This is a dynamic process that often goes hand in hand with improving our understanding of the data. During the Explore Phase, we redesign and reformat data structures, identify important variables, remove redundancies, take note of missing information, and ponder outliers in our data set. Once we have established the software tools—the programming language, data analysis packages, and a handful of the useful functions therein—that are best suited to our data and domain area, we also start putting those tools to use [ 6 ]. In addition, during the Explore Phase, we perform initial tests, build a simple model, or create some basic visualizations to better grasp the contents of our data set and check for expected outputs. Our research is underway in earnest now, and this effort will help us to identify what questions we might be able to ask of our data.

The Explore Phase is often a solo endeavor; as shown in Fig 1A, our audience is typically our current or future self. This can make navigating the phase difficult, especially for new researchers. It also complicates a third goal of this phase: documentation. In this phase, we ourselves are our only audience, and if we are not conscientious documenters, we can easily end up concluding the phase without the ability to coherently describe our research process up to that point. Record keeping in the Explore Phase is often subject to our individual style of approaching problems. Some styles work in real time, subsetting or reconfiguring data as ideas occur. More methodical styles tend to systematically plan exploratory steps, recording them before taking action. These natural tendencies impact the state of our analysis code, affecting its readability and reproducibility.

However, there are strategies—inspired by analogous software development principles—that can help set us up for success in meeting the standards of reproducibility [ 7 ] relevant to a scientifically sound research workflow. These strategies impose a semblance of order on the Explore Phase. To avoid concerns of premature optimization [ 8 ] while we are iterating during this phase, documentation is the primary goal, rather than fine-tuning the code structure and style. Documentation enables the traceability of a researcher’s workflow, such that all efforts are replicable and final outcomes are reproducible.

Analogies to software development in the Explore Phase

Documentation: Code and process.

Software engineers typically value formal documentation that is readable by software users. While the audience for our data analysis code may not be defined as a software user per se, documentation is still vital for workflow development. Documentation for data analysis workflows can come in many forms, including comments describing individual lines of code, README files orienting a reader within a code repository, descriptive commit history logs tracking the progress of code development, docstrings detailing function capabilities, and vignettes providing example applications. Documentation provides both a user manual for particular tools within a project (for example, data cleaning functions), and a reference log describing scientific research decisions and their rationale (for example, the reasons behind specific parameter choices).
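To make one of these documentation forms concrete, the sketch below shows a Python docstring for a hypothetical data cleaning function; the function name, parameters, and −999 sentinel code are illustrative assumptions, not prescribed by any particular project:

```python
def drop_sentinels(values, sentinel=-999):
    """Replace a missing-value sentinel code with None.

    The docstring records what the function does and why it exists,
    not how the list comprehension below implements it.

    Parameters
    ----------
    values : list
        Raw observations in which `sentinel` encodes a missing reading.
    sentinel : int, optional
        The code used for missing data (default: -999).

    Returns
    -------
    list
        A copy of `values` with each sentinel replaced by None.
    """
    return [None if v == sentinel else v for v in values]


print(drop_sentinels([21.5, -999, 23.0]))  # → [21.5, None, 23.0]
```

Calling `help(drop_sentinels)` then surfaces this contract to a collaborator (or our future self) without requiring them to read the implementation.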

In the Explore Phase, we may identify with the type of programmer described by Brant and colleagues as “opportunistic” [ 9 ]. This type of programmer finds it challenging to prioritize documenting and organizing code that they see as impermanent or a work in progress. “Opportunistic” programmers tend to build code using others’ tools, focusing on writing “glue” code that links preexisting components, and to iterate quickly. Hartmann and colleagues also describe this mash-up approach [ 10 ]. Rather than “opportunistic programmers,” their study focuses on “opportunistic designers.” This style of design “search[es] for bridges,” finding connections between what at first appear to be different fields. Data-intensive researchers often use existing tools to answer questions of interest; we tend to build our own only when needed.

Even if the code that is used for data exploration is not developed into a software-based final research product, the exploratory process as a whole should exist as a permanent record: Future scientists should be able to rerun our analysis and work from where we left off, beginning from raw, unprocessed data. Therefore, documenting choices and decisions we make along the way is crucial to making sure we do not forget any aspect of the analysis workflow, because each choice may ultimately impact the final results. For example, if we remove some data points from our analyses, we should know which data points we removed—and our reason for removing them—and be able to communicate those choices when we start sharing our work with others. This is an important argument against ephemerally conducting our data analysis work via the command line.

Instead of the command line, tools like a computational notebook [ 11 ] can help capture a researcher’s decision-making process in real time [ 12 ]. A computational notebook where we never delete code, and—to avoid overwriting named variables—only move forward in our document, could act as “version control designed for a 10-minute scale” that Brant and colleagues found might help the “opportunistic” programmer. More recent advances in this area include the reactive notebook [ 13 – 14 ]. Such tools assist documentation while potentially enhancing our creativity during the Explore Phase. The bare minimum documentation of our Explore Phase might therefore include such a notebook or an annotated script [ 15 ] to record all analyses that we perform and code that we write.

To go a step beyond annotated scripts or notebooks, researchers might employ a version control system such as Git. With its issues, branches, and informative commit messages, Git is another useful way to maintain a record of our trial-and-error process and track which files are progressing toward which goals of the overall project. Using Git together with a public online hosting service such as GitHub allows us to share our work with collaborators and the public in real time, if we so choose.

A researcher dedicated to conducting an even more thoroughly documented Explore Phase may take Ford’s advice and include notes that explicitly document our stream of consciousness [ 16 ]. Our notes should be able to efficiently convey what failed, what worked but was uninteresting or beyond the scope of the project, and what paths of inquiry we will continue forward with in more depth (Fig 1A). In this way, as we transition from the Explore Phase to the Refine Phase, we will have some signposts to guide our way.

Testing: Comparing expectations to output.

As Ford [ 16 ] explains, we face competing goals in the Explore Phase: We want to get results quickly, but we also want to be confident in our answers. Her strategy is to focus on documentation over tests for one-off analyses that will not form part of a larger research project. However, the complete absence of formal tests may raise a red flag for some data scientists used to the concept of test-driven development. This is a tension between the code-based work conducted in scientific research and that of software development: Tests help build confidence in analysis code and convince users that it is reliable or accurate, but tests also imply finality and take time to write that we may not be willing to allocate in the experimental Explore Phase. However, software development style tests do have useful analogs in data analysis efforts: We can think of tests, in the data analysis sense, as a way of checking whether our expectations match the reality of a piece of code’s output.

Imagine we are looking at a data set for the first time. What weird things can happen? The type of variable might not be what we expect (for example, the integer 4 instead of the float 4.0). The data set could also include unexpected aspects (for example, dates formatted as strings instead of numbers). The amount of missing data may be larger than we thought, and this missingness could be coded in a variety of ways (for example, as a NaN, NULL, or −999). Finally, the dimensions of a data frame after merging or subsetting it for data cleaning may not match our expectations. Such gaps in expectation versus reality are “silent faults” [ 17 ]. Without checking for them explicitly, we might proceed with our analysis unaware that anything is amiss and encode that error in our results.
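Several of these expectation gaps can be made audible with a few lines of code. The sketch below, in Python with hypothetical records, checks variable types, sentinel-coded missingness, and value ranges:

```python
# Hypothetical raw records: a silent fault lurks behind each check below.
rows = [
    {"id": 1, "age": "34", "income": 52000},   # age stored as a string
    {"id": 2, "age": 41, "income": -999},      # -999 encodes "missing"
    {"id": 3, "age": None, "income": 48000},
]

# Type check: is every age the numeric type we assume downstream?
bad_types = [r["id"] for r in rows if not isinstance(r["age"], int)]

# Missingness check: count values hidden behind None or sentinel codes.
n_missing = sum(1 for r in rows if r["income"] in (None, -999))

# Range check: do the numeric ages fall inside a plausible interval?
implausible = [r["id"] for r in rows
               if isinstance(r["age"], int) and not (0 <= r["age"] <= 120)]

print(bad_types, n_missing, implausible)  # → [1, 3] 1 []
```

Without such checks, the string `"34"` and the `-999` sentinel would pass silently into any downstream summary statistics.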

For these reasons, every data exploration should include quantitative and qualitative “gut checks” [ 18 ] that can help us diagnose an expectation mismatch as we go about examining and manipulating our data. We may check assumptions about data quality such as the proportion of missing values, verify that a joined data set has the expected dimensions, or ascertain the statistical distributions of well-known data categories. In this latter case, having domain knowledge can help us understand what to expect. We may want to compare 2 data sets (for example, pre- and post-processed versions) to ensure they are the same [ 19 ]; we may also evaluate diagnostic plots to assess a model’s goodness of fit. Each of the elements that gut checks help us monitor will impact the accuracy and direction of our future analyses.

We perform these manual checks to reassure ourselves that our actions at each step of data cleaning, processing, or preliminary analysis worked as expected. However, these types of checks often rely on us as researchers visually assessing output and deciding if we agree with it. As we transition to needing to convince users beyond ourselves of the correctness of our work, we may consider employing defensive programming techniques that help guard against specific mistakes. An example of defensive programming in the Julia language is the use of assertions, such as the @assert macro to validate values or function outputs. Another option includes writing “chatty functions” [ 20 ] that signal a user to pause, examine the output, and decide if they agree with it.
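As an illustrative Python analog of these techniques (the function and data here are hypothetical), the sketch below pairs a defensive assertion with a “chatty” progress message:

```python
def merge_counts(left, right):
    """Join two dictionaries on their shared keys, chattily."""
    shared = set(left) & set(right)
    merged = {k: (left[k], right[k]) for k in sorted(shared)}
    # Defensive assertion: guard against the silent fault of an empty join.
    assert merged, "join produced no rows; check the key columns"
    # "Chatty" output: report what happened so a user can pause and agree.
    dropped = (set(left) | set(right)) - shared
    print(f"merged {len(merged)} key(s); dropped {len(dropped)} unmatched key(s)")
    return merged


result = merge_counts({"a": 1, "b": 2}, {"b": 20, "c": 30})
print(result)  # → {'b': (2, 20)}
```

The assertion halts the analysis outright when expectations are violated, while the printed summary invites the softer, visual judgment described above.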

When to transition from the Explore Phase: Balancing breadth and depth

A researcher in the Explore Phase experiments with a variety of potential data configurations, analysis tools, and research directions. Not all of these may bear fruit in the form of novel questions or promising preliminary findings. Learning how to find a balance between the breadth and depth of data exploration helps us understand when to transition to the Refine Phase of data-intensive research. Specific questions to ask ourselves as we prepare to transition between the Explore Phase and the Refine Phase can be found in Box 2.

Box 2. Questions

This box provides guiding questions to assist readers in navigating through each workflow phase. Questions pertain to planning, organization, and accountability over the course of workflow iteration.

Questions to ask in the Explore Phase

  • Good: Ourselves (e.g., Code includes signposts refreshing our memory of what is happening where.)
  • Better: Our small team who has specialized knowledge about the context of the problem.
  • Best: Anyone with experience using similar tools to us.
  • Good: Dead ends marked differently than relevant and working code.
  • Better: Material connected to a handful of promising leads.
  • Best: Material connected to a clearly defined scope.
  • Good: Backed up in a second location in addition to our computer.
  • Better: Within a shared space among our team (e.g., Google Drive, Box, etc.).
  • Best: Within a version control system (e.g., GitHub) that furnishes a complete timeline of actions taken.
  • Good: Noted in a separate place from our code (e.g., a physical notebook).
  • Better: Noted in comments throughout the code itself, with expectations informally checked.
  • Best: Noted systematically throughout code as part of a narrative, with expectations formally checked.

Questions to ask in the Refine Phase

  • Who is in our team?
  • Consider career level, computational experience, and domain-specific experience.
  • How do we communicate methodology with our teammates’ skills in mind?
  • What reproducibility tools can be agreed upon?
  • How can our work be packaged into impactful research products?
  • Can we explain the same important results across different platforms (e.g., blog post in addition to white paper)?
  • How can we alert these people and make our work accessible?
  • How can we use narrative to make this clear?

Questions to ask in the Produce Phase

  • Do we have more than 1 audience?
  • What is the next step in our research?
  • Can we turn our work into more than 1 publishable product?
  • Consider products throughout the entire workflow.
  • See suggestions in the Tool development guide (Box 4).

Imposing structure at certain points throughout the Explore Phase can help to balance our wide search for solutions with our deep dives into particular options. In an analogy to the software development world, we can treat our exploratory code as a code release—the marker of a stable version of a piece of software. For example, we can take stock of the code we have written at set intervals, decide what aspects of the analysis conducted using it seem most promising, and focus our attention on more formally tuning those parts of the code. At this point, we can also note the presence of research “dead ends” and perhaps record where they fit into our thought process. Some trains of thought may not continue into the next phase or become a formal research product, but they can still contribute to our understanding of the problem or eliminate a potential solution from consideration. As the project matures, computational pipelines are established; these inform the project workflow, and tools such as Snakemake and Nextflow can be adopted to improve its flexibility and reproducibility [ 21 – 23 ]. As we make decisions about which research direction we are going to pursue, we can also adjust our file structure and organize files into directories with more informative names.

Just as Cross [ 5 ] finds that a “reasonably-structured process” leads to design success where “rigid, over-structured approaches” find less success, a balance between the formality of documentation and testing and the informality of creative discovery is key to the Explore Phase of data-intensive research. By taking inspiration from software development and adapting the principles of that arena to fit our data analysis work, we add enough structure to this phase to ease transition into the next phase of the research workflow.

Phase 2: Refine

Inevitably, we reach a point in the Explore Phase when we have acquainted ourselves with our data set, processed and cleaned it, identified interesting research questions that might be asked using it, and found the analysis tools that we prefer to apply. Having reached this important juncture, we may also wish to expand our audience from ourselves to a team of research collaborators. It is at this point that we are ready to transition to the Refine Phase. However, we should keep in mind that new insights may bring us back to the Explore Phase: Over the lifetime of a given research project, we are likely to cycle through each workflow phase multiple times.

In the Refine Phase, the extension of our target audience demands a higher standard for communicating our research decisions as well as a more formal approach to organizing our workflow and documenting and testing our code. In this section, we will discuss principles for structuring our data analysis in the Refine Phase. This phase will ultimately prepare our work for polishing into more traditional research products, including peer-reviewed academic papers.

Designing data analysis: Goals and standards of the Refine Phase

The Refine Phase encompasses many critical aspects of a data-intensive research project. Additional data cleaning may be conducted, analysis methodologies are chosen, and the final experimental design is decided upon. Experimental design may include identifying case studies for variables of interest within our data. If applicable, it is during this phase that we determine the details of simulations. Preliminary results from the Explore Phase inform how we might improve upon or scale up prototypes in the Refine Phase. Data management is essential during this phase and can be expanded to include the serialization of experimental setups. Finally, standards of reproducibility should be maintained throughout. Each of these aspects constitutes an important goal of the Refine Phase as we determine the most promising avenues for focusing our research workflow en route to the polished research products that will emerge from this phase and demand even higher reproducibility standards.

All of these goals are developed in conjunction with our research team. Therefore, decisions should be documented and communicated in a way that is reproducible and constructive within that group. Just as the solitary nature of the Explore Phase can be daunting, the collaboration that may happen in the Refine Phase brings its own set of challenges as we figure out how to best work together. Our team can be defined as the people who participate in developing the research question, preparing the data set it is applied to, coding the analysis, or interpreting the results. It might also include individuals who offer feedback about the progress of our work. In the context of academia, our team usually includes our laboratory or research group. Like most other aspects of data-intensive research, our team may evolve as the project evolves. But however we define our team, its members inform how our efforts proceed during the Refine Phase: Thus, another primary goal of the Refine Phase is establishing group-based standards for the research workflow. Specific questions to ask ourselves during this phase can be found in Box 2.

In recent years, the conversation on standards within academic data science and scientific computing has shifted from “best” practices [ 24 ] to “good enough” practices [ 25 ]. This is an important distinction when establishing team standards during the Refine Phase: Reproducibility is a spectrum [ 26 ], and collaborative work in data-intensive research carries unique demands on researchers as scholars and coworkers [ 27 ]. At this point in the research workflow, standards should be adopted according to their appropriateness for our team. This means talking among ourselves not only about scientific results, but also about the computational experimental design that led to those results and the role that each team member plays in the research workflow. Establishing methods for effective communication is therefore another important goal in the Refine Phase, as we cannot develop group-based standards for the research workflow without it.

Analogies to software development in the Refine Phase

Documentation as a driver of reproducibility.

The concept of literate programming [ 8 ] is at the core of an effective Refine Phase. This philosophy brings together code with human-readable explanations, allowing scientists to demonstrate the functionality of their code in the context of words and visualizations that describe the rationale for and results of their analysis. The computational notebooks that were useful in the Explore Phase are also applicable here, where they can assist with team-wide discussions, research development, prototyping, and idea sharing. Jupyter Notebooks [ 28 ] are agnostic to choice of programming language and so provide a good option for research teams that may be working with a diverse code base or different levels of comfort with a particular programming language. Language-specific interfaces, such as R’s RMarkdown functionality [ 29 ], Julia’s Literate.jl, and the reactive notebooks of Pluto.jl, furnish additional options for literate programming.

The same strategies that promote scientific reproducibility for traditional laboratory notebooks can be applied to the computational notebook [ 30 ]. After all, our data-intensive research workflow can be considered a sort of scientific experiment—we develop a hypothesis, query our data, support or reject our hypothesis, and state our insights. A central tenet of scientific reproducibility is recording inputs relevant to a given analysis, such as parameter choices, and explaining any calculation used to obtain them so that our outputs can later be verifiably replicated. Methodological details—for example, the decision to develop a dynamic model in continuous time versus discrete time or the choice of a specific statistical analysis over alternative options—should also be fully explained in computational notebooks developed during the Refine Phase. Domain knowledge may inform such decisions, making this an important part of proper notebook documentation; such details should also be elaborated in the final research product. Computational research descriptions in academic journals generally include a narrative relevant to their final results, but these descriptions often do not include enough methodological detail to enable replicability, much less reproducibility. However, this is changing with time [ 31 , 32 ].

As scientists, we should keep a record of the tools we use to obtain our results in addition to our methodological process. In a data-intensive research workflow, this includes documenting the specific version of any software that we used, as well as its relevant dependencies and compatibility constraints. Recording this information at the top of the computational notebook that details our data science experiment allows future researchers—including ourselves and our teams—to establish the precise computational environment that was used to run the original research analysis. Our chosen programming language may supply automated approaches for doing this, such as a package manager, simplifying matters and painlessly raising the standards of reproducibility in a research team. The unprecedented levels of reproducibility possible in modern computational environments have produced some variance in the expectations of different research communities; it behooves the research team to investigate the community-level standards applicable to our specific domain science and chosen programming language.
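In Python, for example, a few lines at the top of a notebook can capture this information automatically. The following is a minimal sketch; the package names queried are illustrative only:

```python
# Minimal sketch: record the computational environment at the top of a
# notebook or script so the analysis can later be replicated.
import sys
import platform
import importlib.metadata as md

def environment_report(packages):
    """Return a dict recording the Python version, the platform, and the
    installed version of each named package (or 'not installed')."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = md.version(name)
        except md.PackageNotFoundError:
            report[name] = "not installed"
    return report

if __name__ == "__main__":
    # List the packages this particular analysis depends on (illustrative).
    for key, value in environment_report(["numpy", "pandas"]).items():
        print(f"{key}: {value}")
```

Printing such a report in the first notebook cell, or writing it to a file alongside the analysis outputs, gives collaborators the version information they need without any manual bookkeeping.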

A notebook can include more than a deep dive into a full-fledged data science experiment. It can also involve exploring and communicating basic properties of the data, whether for purposes of training team members new to the project or for brainstorming alternative possible approaches to a piece of research. In the Exploration Phase, we have discovered characteristics of our data that we want our research team to know about, for example, outliers or unexpected distributions, and created preliminary visualizations to better understand their presence. In the Refine Phase, we may choose to improve these initial plots and reprise our data processing decisions with team members to ensure that the logic we applied still holds.

Computational notebooks can live in private or public repositories to ensure accessibility and transparency among team members. A version control system such as Git continues to be broadly useful for documentation purposes in the Refine Phase, beyond acting as a storage site for computational notebooks. Especially as our team and code base grows larger, a history of commits and pull requests helps keep track of responsibilities, coding or data issues, and general workflow.

Importantly, however, all tools have their appropriate use cases. Researchers should not develop an overt reliance on any one tool and should learn to recognize when different tools are required. For example, computational notebooks may quickly become unwieldy for certain projects and large teams, incurring technical debt in the form of duplications or overwritten variables. As our research project grows in complexity and size, or gains team members, we may want to transition to an Integrated Development Environment (IDE) or a source code editor—which interact easily with container environments like Docker and version control platforms such as GitHub—to help scale our data analysis, while retaining important properties like reproducibility.

Testing and establishing code modularity

Code in data-intensive research is generally written as a means to an end, the end being a scientific result from which researchers can draw conclusions. This stands in stark contrast to the purpose of code developed by data engineers or computer scientists, which is generally written to optimize a mechanistic function for maximum efficiency. During the Refine Phase, we may find ourselves with both analysis-relevant and mechanistic code, especially in “big data” statistical analyses or complex dynamic simulations where optimized computation becomes a concern. Keeping the immediate audience of this workflow phase, our research team, at the forefront of our mind can help us take steps to structure both mechanistic and analysis code in a useful way.

Mechanistic code, which is designed for repeated use, often employs abstractions by wrapping code into functions that apply the same action repeatedly or stringing together multiple scripts into a computational pipeline. Unit tests and so-called accessor functions or getter and setter functions that extract parameter values from data structures or set new values are examples of mechanistic code that might be included in a data-intensive research analysis. Meanwhile, code designed to gain statistical insight into distributions and code that models scientific dynamics using mathematical equations are 2 examples of analysis code. Sometimes, the line between mechanistic code and analysis code can be a blurry one. For example, we might write a looping function to sample our data set repeatedly, and that would classify as mechanistic code. But that sampling may be designed to occur according to an algorithm such as Markov Chain Monte Carlo that is directly tied to our desire to sample from a specific probability distribution; therefore, this could be labeled both analysis and mechanistic code. Keep your audience and the reproducibility of your experiment in mind when considering how to present your code.
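As a minimal sketch of this distinction (all function names here are hypothetical), an accessor function is purely mechanistic, while a bootstrap resampling loop mixes mechanistic repetition with statistical intent:

```python
import random

# Mechanistic code: an accessor ("getter") that extracts a parameter value
# from a data structure, with a default if the parameter is absent.
def get_parameter(params, name, default=None):
    """Retrieve a named parameter value from a parameter dictionary."""
    return params.get(name, default)

# Mechanistic *and* analysis code: the looping is mechanical, but its purpose,
# estimating the sampling distribution of the mean by bootstrap resampling,
# is tied directly to a statistical question.
def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return n_resamples bootstrap estimates of the mean of data."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return means
```

In a notebook, the accessor would belong with other mechanistic utilities, while the resampler and the interpretation of its output belong in the analysis narrative.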

It is common practice to wrap code that we use repeatedly into functions to increase readability and modularity while reducing the propensity for user-induced error. However, the scripts and programming notebooks so useful to establishing a narrative and documenting work in the Refine Phase are set up to be read in a linear fashion. Embedding mechanistic functions in the midst of the research narrative obscures the utility of the notebooks in telling the research story and generally clutters up the analysis with a lot of extra code. For example, if we develop a function to eliminate the redundancy of repeatedly restructuring our data to produce a particular type of plot, we do not need to showcase that function in the middle of a computational notebook analyzing the implications of the plot that is created—the point is the research implications of the image, not the code that made the plot. Then where do we keep the data-reshaping, plot-generating code?

Strategies to structure the more mechanistic aspects of our analysis can be drawn from common software development practices. As our team grows or changes, we may require the same mechanistic code. For example, the same data-reshaping, plot-generating function described earlier might be pulled into multiple computational experiments that are set up in different locations, computational notebooks, scripts, or Git branches. Therefore, a useful approach would be to start collecting those mechanistic functions into their own script or file, sometimes called “helpers” or “utils,” that acts as a supplement to the various ongoing experiments, wherever they may be conducted. This separate script or file can be referenced or “called” at the beginning of the individual data analyses. Doing so allows team members to benefit from collaborative improvements to the mechanistic code without having to reinvent the wheel themselves. It also preserves the narrative properties of team members’ analysis-centric computational notebooks or scripts while maintaining transparency in basic methodologies that ensure project-wide reproducibility. The need to begin collecting mechanistic functions into files separate from analysis code is a good indicator that it may be time for the research team to supplement computational notebooks by using a code editor or IDE for further code development.
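In Python, for instance, this pattern might look like the following sketch, where a hypothetical data-reshaping helper lives in a separate `helpers.py` file that each analysis imports (the file is written inline here only to keep the example self-contained):

```python
import os
import sys

# In practice, helpers.py would be a version-controlled file in the
# repository; we create it here so the example runs on its own.
helpers_source = '''\
def reshape_for_plot(records):
    """Data-reshaping helper: pivot a list of (group, value) pairs
    into a dict mapping each group to its list of values."""
    shaped = {}
    for group, value in records:
        shaped.setdefault(group, []).append(value)
    return shaped
'''

with open("helpers.py", "w") as fh:
    fh.write(helpers_source)

# In each analysis notebook or script, the helper is simply imported
# rather than redefined, so improvements propagate to every analysis.
sys.path.insert(0, os.getcwd())
from helpers import reshape_for_plot

shaped = reshape_for_plot([("a", 1), ("b", 2), ("a", 3)])
print(shaped)  # {'a': [1, 3], 'b': [2]}
```

The analysis file then contains only the call and the discussion of the resulting plot, keeping the narrative uncluttered while the mechanistic code remains shared and transparent.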

Testing scientific software is not always perfectly analogous to testing typical software development projects, where automated continuous integration is often employed [ 17 ]. However, as we start to modularize our code, breaking it into functions and from there into separate scripts or files that serve specific purposes, principles from software engineering become more readily applicable to our data-intensive analysis. Unit tests can now help us ensure that our mechanistic functions are working as expected, formalizing the “gut checks” that we performed in the Explore Phase. Among other applications, these tests should verify that our functions return the appropriate value, object type, or error message as needed [ 33 ]. Formal tests can also provide a more extensive investigation of how “trustworthy” the performance of a particular analysis method might be, affording us an opportunity to check the correctness of our scientific inferences. For example, we could use control data sets where we know the result of a particular analysis to make sure our analysis code is functioning as we expect. Alternatively, we could also use a regression test to compare computational outputs before and after changes in the code to make sure we haven’t introduced any unanticipated behavior.
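The following Python sketch illustrates these kinds of checks on a hypothetical analysis function; in practice such tests would live in their own file and be run by a framework such as pytest:

```python
# Hypothetical analysis helper under test: scale values to sum to 1.
def normalize(values):
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values summing to zero")
    return [v / total for v in values]

# Unit test: the function returns the appropriate value and object type.
result = normalize([1, 1, 2])
assert isinstance(result, list)
assert abs(sum(result) - 1.0) < 1e-9

# Unit test: the function raises the appropriate error message.
try:
    normalize([0, 0])
except ValueError as err:
    assert "zero" in str(err)
else:
    raise AssertionError("expected a ValueError")

# Control-data test: a case whose correct answer we know in advance.
assert normalize([2, 2]) == [0.5, 0.5]

# Regression test: compare current output against a stored reference
# produced by an earlier, trusted version of the code.
reference_output = [0.25, 0.25, 0.5]
assert normalize([1, 1, 2]) == reference_output
```

These formalized checks replace ad hoc inspection with assertions that can be rerun automatically after every change to the code.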

When to transition from the Refine Phase: Going backwards and forwards

Workflows in data science are rarely linear; it is often necessary for researchers to iterate between the Refine and Explore Phases ( Fig 1B ). For example, while our research team may decide on a computational experimental design to pursue in the Refine Phase, the scope of that design may require us to revisit decisions made during the data processing that was conducted in the Explore Phase. This might mean including additional information from supplementary data sets to help refine our hypothesis or research question. In returning to the Explore Phase, we investigate these potential new data sets and decide if it makes sense to merge them with our original data set.

Iteration between the Refine and Explore Phases is a careful balance. On the one hand, we should be careful not to allow “scope creep” to expand our problem space beyond an area where we are able to develop constructive research contributions. On the other hand, if we are too rigid about decisions made over the course of our workflow and refuse to look backwards as well as forwards, we may risk cutting ourselves off from an important part of the potential solution space.

Data-intensive researchers can once more look to principles within the software development community, such as Agile frameworks, to help guide the careful balancing act required to conduct research that is both comprehensive and able to be completed [ 34 , 35 ]. The ways a team organizes itself and documents its organization process can themselves serve as research products, which we describe further in the next phase of the workflow: the Produce Phase.

Phase 3: Produce

In the previous sections of this paper, we discussed how to progress from the exploration of raw data through the refinement of a research question and selection of an analytical methodology. We also described how the details of that workflow are guided by the breadth of the immediately relevant audience: ourselves in the Explore Phase and our research team in the Refine Phase. In the Produce Phase, it becomes time to make our data analysis camera ready for a much broader group, bringing our research results into a state that can be understood and built upon by others. This may translate to developing a variety of research products in addition to—or instead of—traditional academic outputs like peer-reviewed publications and typical software development products such as computational tools.

Beyond data analysis: Goals and standards of the Produce Phase

The main goal of the Produce Phase is to prepare our analysis to enter the public realm as a set of products ready for external use, reflection, and improvement. The Produce Phase encompasses the cleanup that happens prior to initially sharing our results with a broader community beyond our team, for example, ahead of submitting our work to peer review. It also includes the process of incorporating suggestions for improvement prior to finalization, for example, adjustments to address reviewer comments ahead of publication. The research products that emerge from a given workflow may vary in both their form and their formality—indeed, some research products, like a code base, might continually evolve without ever assuming “final” status—but each product constitutes a valuable contribution that pushes our field’s scientific boundaries in its own way.

Importantly, producing public-facing products over the course of an entire workflow ( Fig 2 ) rather than just at the end of a project can help researchers progressively build their data science research portfolios and fulfill a second goal of the Produce Phase: gaining credit, and credibility, in our domain area. This is especially relevant for junior scientists who are just starting research careers or who wish to become industry data scientists [ 3 ]. Developing polished products at several intervals along a single workflow is also instructional for the researcher themselves. Researchers who prepare their work for public assessment from the earliest phases of an analysis become acquainted with the pertinent problem and solution spaces from multiple perspectives. This additional understanding, together with the feedback that polished products generate from people outside ourselves and our immediate team, may furnish insights that improve our approach in other phases of the research workflow.


Research products can build off of content generated in either the Explore or the Refine Phase. As they did in Fig 1A , turquoise circles represent potential research products generated as the project develops. Closed circles represent research products within the scope of the current project, while open circles represent products beyond that scope. This figure emphasizes how those research products project onto a timeline and represent elements in our portfolio of work or lines on a CV. The ERP workflow emphasizes and encourages the production of research products, beyond traditional academic outputs, throughout the lifecycle of a data-intensive project rather than just at the very end.

https://doi.org/10.1371/journal.pcbi.1008770.g002

Building our data science research portfolio requires a method for tracking and attributing the many products that we might develop. One important method for tracking and attribution is the digital object identifier or DOI. It is a unique handle, standardized by the International Organization for Standardization (ISO), that can be assigned to different types of information objects. DOIs are usually connected to metadata; for example, they might include a URL pointing to where the object they are associated with can be found online. Academic researchers are used to thinking of DOIs as persistent identifiers for peer-reviewed publications. However, DOIs can also be generated for data sets, GitHub repositories, computational notebooks, teaching materials, management plans, reports, white papers, and preprints. Researchers would also be well advised to register for a unique and persistent digital identifier to be associated with their name, called an ORCID iD ( https://orcid.org ), as an additional method of tracking and attributing their personal outputs over the course of their career.

A third, longer-term goal of the Produce Phase involves establishing a researcher’s professional trajectory. Every individual needs to gauge how their compendium of research products contributes to their career and how intentional portfolio building might, in turn, drive the research that they ultimately conduct. For example, researchers who wish to work in academia might feel obliged to obtain “academic value” from less traditional research products by essentially reprising them as peer-reviewed papers. But judging a researcher’s productivity by the metric of paper authorship can alter how and even whether research is performed [ 36 ]. Increasingly, academic journals are revisiting their publishing requirements [ 37 ] and raising their standards of reproducibility. This shift is bringing the data and programming methodologies that underpin our written analyses closer to center stage. Data-intensive research, and the people who produce it, stand to benefit. Scientists—now encouraged, and even required by some academic journals, to share both data and code—can publish and receive credit as well as feedback for the multiple research products that support their publications. Questions to ask ourselves as we consider possible research products can be found in Box 2 .

Produce: Products of the Explore Phase

The old adage that one person’s trash is another’s treasure is relevant to the Explore Phase of a data science analysis: Of the many potential applications for a particular data set, there is often only time to explore a small subset. Those applications which fall outside the scope of the current analysis can nonetheless be valuable to our future selves or to others seeking to conduct their own analyses. To that end, the documentation that accompanies data exploration can furnish valuable guidance for later projects. Further, the cleaned and processed data set that emerges from the Explore Phase is itself a valuable outcome that can be assigned a DOI and rendered a formal product of this portion of the data analysis workflow, using outlets like Dryad ( http://www.datadryad.org ) and Figshare ( https://figshare.com/ ) among others.

Publicly sharing the data set, along with its metadata, is an essential component of scientific transparency and reproducibility, and it is of fundamental importance to the scientific community. Data associated with a research outcome should follow “FAIR” principles of findability, accessibility, interoperability, and reusability. Importantly, discipline-specific data standards should be followed when preparing data, whether the data are being refined for public-facing or personal use. Data-intensive researchers should familiarize themselves with the standards relevant to their field of study and recognize that meeting these standards increases the likelihood of their work being both reusable and reproducible. In addition to enabling future scientists to use the data set as it was developed, adhering to a standard also facilitates the creation of synthetic data sets for later research projects. Examples of discipline-specific data standards in the natural sciences are Darwin Core ( https://dwc.tdwg.org ) for biodiversity data and EML ( https://eml.ecoinformatics.org ) for ecological data. To maximize the utility of a publicly accessible data set, during the Produce Phase, researchers should confirm that it includes descriptive README files and field descriptions and also ensure that all abbreviations and coded entries are defined. In addition, an appropriate license should be assigned to the data set prior to publication: The license indicates whether, or under what circumstances, the data require attribution.

The Git repositories or computational notebooks that archive a data scientist’s approach, record the process of uncovering coding bugs, redundancies, or inconsistencies and note the rationale for focusing on specific aspects of the data are also useful research products in their own right. These items, which emerge from software development practices, can provide a touchstone for alternative explorations of the same data set at a later time. In addition to documenting valuable lessons learned, contributions of this kind can formally augment a data-intensive researcher’s registered body of work: Code used to actively clean data or record an Explore Phase process can be made citable by employing services like Zenodo to add a DOI to the applicable Git commit. Smaller code snippets or data excerpts can be shared—publicly or privately—using the more lightweight GitHub Gists ( https://gist.github.com/ ). Tools such as DrWatson ( https://github.com/JuliaDynamics/DrWatson.jl ) and Snakemake [ 23 ] are designed to assist researchers with organization and reproducibility and can inform the polishing process for products emerging from any phase of the analysis (see [ 22 ] for more discussion of reproducible workflow design and tools). As with data products, in the Produce Phase, researchers should license their code repositories such that other scientists know how they can use, augment, or redistribute the contents. The Produce Phase is also the time for researchers to include descriptive README files and clear guidelines for future code contributors in their repository.

Alternative mechanisms for crediting the time and talent that researchers invest in the Explore Phase include relatively informal products. For example, blog posts can detail problem space exploration for a specific research question or lessons learned about data analysis training and techniques. White papers that describe the raw data set and the steps taken to clean it, together with an explanation of why and how these decisions were taken, might constitute another such informal product. Versions of these blog posts or white papers can be uploaded to open-access websites such as arXiv.org as preprints and receive a DOI.

The familiar academic route of a peer-reviewed publication is also available for products emerging from the Explore Phase. For example, depending on the domain area of interest, journals such as Nature Scientific Data and IEEE Transactions are especially suited to papers that document the methods of data set development or simply reproduce the data set itself. Pedagogical contributions that were learned or applied over the course of a research workflow can be written up for submission to training-focused journals such as the Journal of Statistics Education . For a list of potential research product examples for the Explore Phase, see Box 3 .

Box 3. Products

Research products can be developed throughout the ERP workflow. This box helps identify some options for each phase, including products less traditional to academia. Those that can be labeled with a digital object identifier (DOI) are marked as such.

Potential Products in the Explore Phase

  • Publication of cleaned and processed data set (DOI)
  • Citable GitHub repository and/or computational notebook that shows data cleaning/processing and exploratory data analysis (e.g., Jupyter Notebook, Knitr, Literate, Pluto, etc.) (DOI)
  • GitHub Gists (e.g., particular piece of processing code)
  • White paper (e.g., explaining a data set)
  • Blog post (e.g., detailing exploratory process)
  • Teaching/training materials (e.g., data wrangling)
  • Preprint (e.g., about a data set or its creation) (DOI)
  • Peer-reviewed publication (e.g., about a curated data set) (DOI)

Potential Products in the Refine Phase

  • White paper (e.g., explaining preliminary findings)
  • Citable GitHub repository and/or computational notebook showing methodology and results (DOI)
  • Blog post (e.g., explaining findings informally)
  • Teaching/training materials (e.g., using your work as an example to teach a computational method)
  • Preprint (e.g., preliminary paper before being submitted to a journal) (DOI)
  • Peer-reviewed publication (e.g., formal description of your findings) (DOI)
  • Grant application incorporating the data management procedure
  • Methodology (e.g., writing a methods paper) (DOI)
  • Software tool (DOI): this might include a package, a library, or an interactive web application; see Box 4 for further discussion of this potential research product.

Produce: Products of the Refine Phase

In the Refine Phase, documentation and the ability to communicate both methods and results become essential to daily management of the project. Happily, the implementation of these basic practices can also provide benefits beyond the immediate team of research collaborators: They can be standardized as a Data Management Plan or Protocol (DMP). DMPs are a valuable product that can emerge from the Refine Phase as a formal version of lessons learned concerning both research and team management. This product records the strategies and approaches used to, for example, describe, share, store, analyze, and preserve data.

While DMPs are often living documents over the course of a research project, evolving dynamically with the needs or restrictions that are encountered along the way, there is great utility to codifying them either for our team’s later use or for others conducting similar projects. DMPs can also potentially be leveraged into new research grants for our team, as these protocols are now a common mandate by many funders [ 38 ]. The group discussions that contribute to developing a DMP can be difficult and encompass considerations relevant to everything from team building to research design. The outcome of these discussions is often directly tied to the constructiveness of a research team and its robustness to potential turnover [ 38 ]. Sharing these standards and lessons learned in the form of polished research products can propel a proactive discussion of data management and sharing practices within our research domain. This, in turn, bolsters the creation or enhancement of community standards beyond our team and provides training materials for those new to the field.

As with the research products that are generated by the Explore Phase, DMPs can lead to polished blog posts, training materials, white papers, and preprints that enable researchers to both spread the word about their valuable findings and be credited for their work. In addition, peer-reviewed journals are beginning to allow the publication of DMPs as a formal outcome of the data analysis workflow (e.g., Rio Journal ). Importantly, when new members join a research team, they should receive a copy of the group’s DMP. If any additional training pertinent to plans or protocols is furnished to help get new members up to speed, these materials too can be polished into research products that contribute to scientific advancement. For a list of potential research product examples for the Refine Phase, see Box 3 .

Produce: Traditional research products and scientific software

By polishing our work, we finalize and format it to receive critiques beyond ourselves and our immediate team. The scientific analysis and results that are born of the full research workflow—once documented and linked appropriately to the code and data used to conduct it—are most frequently packaged into the traditional academic research product: a peer-reviewed publication. Even this product, however, can be improved upon in terms of its reproducibility and transparency thanks to software development tools and practices. For example, papers that employ literate programming notebooks enable researchers to augment the real-time evolution of a written draft with the code that informs it. A well-kept notebook can be used to outline the motivations for a manuscript and select the figures best suited to conveying the intended narrative, because it shows the evolution of ideas and the mathematics behind each analysis along with—ideally—brief textual explanations.

Peer-reviewed papers are of primary importance to the career and reputation of academic researchers [ 39 ], but the traditional format for such publications often does not take into account essential aspects of data-intensive analysis such as computational reproducibility [ 40 ]. Where strict requirements for reproducibility are not enforced by a given journal, researchers should nonetheless compile the supporting products that made our submitted manuscript possible—including relevant code and data, as well as the documentation of our computational tools and methodologies as described in the earlier sections of this paper—into a research compendium [ 37 , 41 – 43 ]. The objective is to provide transparency to those who read or wish to replicate our academic publication and reproduce the workflow that led to our results.

In addition to peer-reviewed publications and the various alternative research products described above, some scientists may choose to revisit the scripts developed during the Explore or Refine Phases and polish that code into a traditional software development product: a computational tool, also called a software tool . A computational tool can include libraries, packages, collections of functions, or data structures designed to help with a specific class of problem. Such products might be accompanied by repository documentation or a full-fledged methodological paper that can be categorized as additional research products beyond the tool itself. Each of these items can augment a researcher’s body of citable work and contribute to advances in our domain science.

One very simple example of a tool might be an interactive web application built in RShiny ( https://shiny.rstudio.com/ ) that allows the easy exploration of cleaned data sets or demonstrates the outcomes of alternative research questions. More complex examples include a software package that builds an open-source analysis pipeline or a data structure that formally standardizes the problem space of a domain-specific research area. In all cases, the README files, docstrings, example vignettes, and appropriate licensing relevant to the Explore Phase are also a necessity for open-source software. Developers should also specify contributing guidelines for future researchers who might seek to improve or extend the capabilities of the original tool. Where applicable, the dynamic equations that inform simulations should be cited with the original scientific literature where they were derived.
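As a small illustration of these documentation habits, the sketch below shows a hypothetical package function whose docstring records parameters, return values, and a usage example, mirroring at the function level what README files and vignettes provide for the tool as a whole:

```python
def summarize(values):
    """Return summary statistics for a sequence of numbers.

    Parameters
    ----------
    values : sequence of float
        The observations to summarize. Must be non-empty.

    Returns
    -------
    dict
        A dictionary with keys "n", "mean", "min", and "max".

    Example
    -------
    >>> summarize([1.0, 2.0, 3.0])["mean"]
    2.0
    """
    if not values:
        raise ValueError("values must be non-empty")
    return {
        "n": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
```

Docstring examples written in this style can even be executed automatically (for instance, by Python's doctest module), so the documentation doubles as a lightweight test of the tool's behavior.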

The effort to translate reproducible scripts into reusable software and then to maintain the software and support users is often a massive undertaking. While the software engineering literature furnishes a rich suite of resources for researchers seeking to develop their own computational tools, this existing body of work is generally directed toward trained programmers and software engineers. The design decisions that are crucial to scientists—who are primarily interested in data analysis, experiment extensibility, and result reporting and inference—can be obscured by concepts that are either out of scope or described in overly technical jargon. Box 4 furnishes a basic guide to highlight the decision points and architectural choices relevant to creating a tool for data-intensive research. Domain scientists seeking to wade into computational tool development are well advised to review the guidelines described in Grüning and colleagues [ 2 ] in addition to more traditional software development resources and texts such as Clean Code [ 44 ], Refactoring [ 45 ], and Best Practices in Scientific Computing [ 24 ].

Box 4. Tool development guide

Creating a new software tool as the polished product of a research workflow is nontrivial. This box furnishes a series of guiding questions to help researchers think through whether tool creation is appropriate to project goals, domain science needs, and team member skill sets.

  • Does a tool in this space already exist that can be used to provide the functionality/answer the research question of interest?
  • Does it formalize our research question?
  • Does it extend/allow extension of investigative capabilities beyond the research question that our existing script was developed to ask?
  • Does creating a tool advance our personal career goals or augment a desired/necessary skill set?
  • Do we have the necessary resources to create and maintain the tool?
  • Funding (if applicable)?
  • Domain expertise?
  • Programming expertise?
  • Collaborative research partners with either time, funding, or relevant expertise?
  • Will the process of creating the new tool be valued/helpful for your career goals?
  • Should we build on an existing tool or make a new one?
  • What research area is it designed for?
  • Who is the envisioned end user? (e.g., scientist inside our domain, scientist outside our domain, policy maker, member of the public)
  • What is the goal of the end user? (e.g., analysis of raw inputs, explanation of results, creation of inputs for the next step of a larger analysis)
  • What are field norms?
  • Is it accessible (free, open source)?
  • What is the likely form and type of data input to our tool?
  • What is the desired form and type of data output from our tool?
  • Are there preexisting structures that are useful to emulate, or should we develop our own?
  • Is there an existing package that provides basic structure or building block functionalities necessary or useful for our tool, such that we do not need to reinvent the wheel?
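When the answers to these questions point toward building a tool, a common first refactoring step is to separate the analysis logic from the script that invokes it, so the logic becomes importable, testable, and extensible. The sketch below is a minimal, hypothetical illustration in Python; the `summarize` function and its statistics are placeholders, not anything prescribed by the paper.

```python
import argparse
import statistics


def summarize(values):
    """Core analysis logic: importable by other scripts and testable in isolation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


def main(argv=None):
    """Thin command-line wrapper around the reusable function."""
    parser = argparse.ArgumentParser(description="Summarize a list of numbers.")
    parser.add_argument("values", nargs="+", type=float, help="numeric observations")
    args = parser.parse_args(argv)
    result = summarize(args.values)
    print(result)
    return result


# Explicit argument list, as it would arrive from a command line:
result = main(["1", "2", "3", "4"])
```

Keeping `summarize` free of any command-line concerns is what makes the same logic reusable from a notebook, a test suite, or a larger pipeline.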

Conclusions

Defining principles for data analysis workflows is important for scientific accuracy, efficiency, and the effective communication of results, regardless of whether researchers are working alone or in a team. Establishing standards, such as for documentation and unit testing, both improves the quality of work produced by practicing data scientists and sets a proactive example for fledgling researchers to do the same. There is no single set of principles for performing data-intensive research. Each computational project carries its own context—from the scientific domain in which it is conducted, to the software and methodological analysis tools we use to pursue our research questions, to the dynamics of our particular research team. Therefore, this paper has outlined general concepts for designing a data analysis such that researchers may incorporate the aspects of the ERP workflow that work best for them. It has also put forward suggestions for specific tools to facilitate that workflow and for a selection of nontraditional research products that could emerge throughout a given data analysis project.

Aiming for full reproducibility when communicating research results is a noble pursuit, but it is imperative to understand that there is a balance between generating a complete analysis and furnishing a 100% reproducible product. Researchers have competing motivations: finishing their work in a timely fashion versus having a perfectly documented final product, while balancing how these trade-offs might strengthen their career. Despite various calls for the creation of a standard framework [7, 46], achieving complete reproducibility may go far beyond the individual researcher, encompassing everything from a culture-wide shift in expectations by consumers of scientific research products to the realistic capacities of version control software. The first of these advancements is particularly challenging and unlikely to manifest quickly across data-intensive research areas, although it is underway in a number of scientific domains [26]. By reframing what a formal research product can be—and noting that polished contributions can constitute much more than the academic publications previously held forth as the benchmark for career advancement—we motivate structural change to data analysis workflows.

In addition to amassing outputs beyond the peer-reviewed academic publication, there are an increasing number of venues for writing less traditional papers that describe or consist solely of a novel data set, a software tool, a particular methodology, or training materials. As the professional landscape for data-intensive research evolves, these novel publications and research products are extremely valuable for distinguishing applicants for academic and nonacademic jobs, grants, and teaching positions. Data scientists and researchers must possess numerous and multifaceted skills to perform scientifically robust and computationally effective data analysis. Therefore, potential research collaborators and hiring entities, both inside and outside the academy, should take into account a variety of research products, from every phase of the data analysis workflow, when evaluating the career performance of data-intensive researchers [47].

Acknowledgments

We thank the Best Practices Working Group (UC Berkeley) for the thoughtful conversations and feedback that greatly informed the content of this paper. We thank the Berkeley Institute for Data Science for hosting meetings that brought together data scientists, biologists, statisticians, computer scientists, and software engineers to discuss how data-intensive research is performed and evaluated. We especially thank Stuart Gieger (UC Berkeley) for his leadership of the Best Practices in Data Science Group and Rebecca Barter (UC Berkeley) for her helpful feedback.

  • 3. Robinson E, Nolis J. Build a Career in Data Science. Simon and Schuster; 2020.
  • 6. Terence S. An Extensive Step by Step Guide to Exploratory Data Analysis. 2020 [cited 2020 Jun 15]. https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e .
  • 13. Bostock MA. Better Way to Code—Mike Bostock—Medium. 2017 [cited 2020 Jun 15]. https://medium.com/@mbostock/a-better-way-to-code-2b1d2876a3a0 .
  • 14. van der Plas F. Pluto.jl. Github. https://github.com/fonsp/Pluto.jl .
  • 15. Best Practices for Writing R Code–Programming with R. [cited 15 Jun 2020]. https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/
  • 16. PyCon 2019. Jes Ford—Getting Started Testing in Data Science—PyCon 2019. Youtube; 5 May 2019 [cited 2020 Feb 20]. https://www.youtube.com/watch?v=0ysyWk-ox-8
  • 17. Hook D, Kelly D. Testing for trustworthiness in scientific software. 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. 2009. pp. 59–64.
  • 18. Oh J-H. Check Yo’ Data Before You Wreck Yo’ Results. In: Medium [Internet]. ACLU Tech & Analytics; 24 Jan 2020 [cited 2020 Apr 9]. https://medium.com/aclu-tech-analytics/check-yo-data-before-you-wreck-yo-results-53f0e919d0b9 .
  • 19. Gelfand S. comparing two data frames: one #rstats, many ways! | Sharla Gelfand. In: Sharla Gelfand [Internet]. Sharla Gelfand; 17 Feb 2020 [cited 2020 Apr 20]. https://sharla.party/post/comparing-two-dfs/ .
  • 20. Gelfand S. Don’t repeat yourself, talk to yourself! Repeated reporting in the R universe | Sharla Gelfand. In: Sharla Gelfand [Internet]. 30 Jan 2020 [cited 2020 Apr 20]. https://sharla.party/talk/2020-01-01-rstudio-conf/ .
  • 27. Geiger RS, Sholler D, Culich A, Martinez C, Hoces de la Guardia F, Lanusse F, et al. Challenges of Doing Data-Intensive Research in Teams, Labs, and Groups: Report from the BIDS Best Practices in Data Science Series. 2018.
  • 29. Xie Y. Dynamic Documents with R and knitr. Chapman and Hall/CRC; 2017.
  • 33. Wickham H. R Packages: Organize, Test, Document, and Share Your Code. “O’Reilly Media, Inc.”; 2015.
  • 34. Abrahamsson P, Salo O, Ronkainen J, Warsta J. Agile Software Development Methods: Review and Analysis. arXiv [cs.SE]. 2017. http://arxiv.org/abs/1709.08439 .
  • 35. Beck K, Beedle M, Van Bennekum A, Cockburn A, Cunningham W, Fowler M, et al. Manifesto for agile software development. 2001. https://moodle2019-20.ua.es/moodle/pluginfile.php/2213/mod_resource/content/2/agile-manifesto.pdf .
  • 38. Sholler D, Das D, Hoces de la Guardia F, Hoffman C, Lanusse F, Varoquaux N, et al. Best Practices for Managing Turnover in Data Science Groups, Teams, and Labs. 2019.
  • 44. Martin RC. Clean Code: A Handbook of Agile Software Craftsmanship. Pearson Education; 2009.
  • 45. Fowler M. Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional; 2018.
  • 47. Geiger RS, Cabasse C, Cullens CY, Norén L, Fiore-Gartland B, Das D, et al. Career Paths and Prospects in Academic Data Science: Report of the Moore-Sloan Data Science Environments Survey. 2018.
  • 49. Jorgensen PC, editor. About the International Software Testing Qualification Board. 1st ed. The Craft of Model-Based Testing. 1st ed. Boca Raton: Taylor & Francis, a CRC title, part of the Taylor & Francis imprint, a member of the Taylor & Francis Group, the academic division of T&F Informa, plc, [2017]: Auerbach Publications; 2017. pp. 231–240.
  • 51. Wikipedia contributors. Functional design. In: Wikipedia, The Free Encyclopedia [Internet]. 4 Feb 2020 [cited 21 Feb 2020]. https://en.wikipedia.org/w/index.php?title=Functional_design&oldid=939128138
  • 52. 7 Essential Guidelines For Functional Design—Smashing Magazine. In: Smashing Magazine [Internet]. 5 Aug 2008 [cited 21 Feb 2020]. https://www.smashingmagazine.com/2008/08/7-essential-guidelines-for-functional-design/
  • 53. Claerbout JF, Karrenbach M. Electronic documents give reproducible research a new meaning. SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists; 1992. pp. 601–604.
  • 54. Heroux MA, Barba L, Parashar M, Stodden V, Taufer M. Toward a Compatible Reproducibility Taxonomy for Computational and Computing Sciences. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States); 2018. https://www.osti.gov/biblio/1481626 .

Types of data analysis

The means by which you analyse your data are largely determined by the nature of your research question , the approach and paradigm within which your research operates, the methods used, and consequently the type of data elicited. In turn, the language and terms you use in both conducting and reporting your data analysis should reflect these.

The list below includes some of the more commonly used means of qualitative data analysis in educational research – although this is by no means exhaustive. It is also important to point out that each of the terms given below generally encompass a range of possible methods or options and there can be overlap between them. In all cases, further reading is essential to ensure that the process of data analysis is valid, transparent and appropriately systematic, and we have provided below (as well as in our further resources and tools and resources for qualitative data analysis sections) some recommendations for this.

If your research is likely to involve quantitative analysis, we recommend the books listed below.

Types of qualitative data analysis

  • Thematic analysis
  • Coding and/or content analysis
  • Concept map analysis
  • Discourse or narrative analysis
  • Grounded theory
  • Phenomenological analysis or interpretative phenomenological analysis (IPA)

Further reading and resources

As a starting point for most of these, we would recommend the relevant chapter from Part 5 of Cohen, Manion and Morrison (2018), Research Methods in Education. You may also find the following helpful:

For qualitative approaches

Savin-Baden, M. & Howell Major, C. (2013) Data analysis. In Qualitative Research: The essential guide to theory and practice . (Abingdon, Routledge, pp. 434-450).

For quantitative approaches

Bors, D. (2018) Data analysis for the social sciences (Sage, London).

Research Methods | Definitions, Types, Examples

Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design . When planning your methods, there are two key decisions you will make.

First, decide how you will collect data . Your methods depend on what type of data you need to answer your research question :

  • Qualitative vs. quantitative : Will your data take the form of words or numbers?
  • Primary vs. secondary : Will you collect original data yourself, or will you use data that has already been collected by someone else?
  • Descriptive vs. experimental : Will you take measurements of something as it is, or will you perform an experiment?

Second, decide how you will analyze the data .

  • For quantitative data, you can use statistical analysis methods to test relationships between variables.
  • For qualitative data, you can use methods such as thematic analysis to interpret patterns and meanings in the data.
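As a concrete illustration of testing a relationship between two variables, the snippet below computes a Pearson correlation coefficient from scratch in Python; the study-hours and exam-score figures are invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: hours studied vs. exam score for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]
r = pearson_r(hours, scores)
```

A value of r close to +1 suggests a strong positive linear relationship; a full analysis would also test whether the correlation is statistically significant.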

Table of contents

  • Methods for collecting data
  • Examples of data collection methods
  • Methods for analyzing data
  • Examples of data analysis methods
  • Other interesting articles
  • Frequently asked questions about research methods

Data is the information that you collect for the purposes of answering your research question . The type of data you need depends on the aims of your research.

Qualitative vs. quantitative data

Your choice of qualitative or quantitative data collection depends on the type of knowledge you want to develop.

For questions about ideas, experiences and meanings, or to study something that can’t be described numerically, collect qualitative data .

If you want to develop a more mechanistic understanding of a topic, or your research involves hypothesis testing , collect quantitative data .


You can also take a mixed methods approach , where you use both qualitative and quantitative research methods.

Primary vs. secondary research

Primary research is any original data that you collect yourself for the purposes of answering your research question (e.g. through surveys , observations and experiments ). Secondary research is data that has already been collected by other researchers (e.g. in a government census or previous scientific studies).

If you are exploring a novel research question, you’ll probably need to collect primary data . But if you want to synthesize existing knowledge, analyze historical trends, or identify patterns on a large scale, secondary data might be a better choice.


Descriptive vs. experimental data

In descriptive research , you collect data about your study subject without intervening. The validity of your research will depend on your sampling method .

In experimental research , you systematically intervene in a process and measure the outcome. The validity of your research will depend on your experimental design .

To conduct an experiment, you need to be able to vary your independent variable , precisely measure your dependent variable, and control for confounding variables . If it’s practically and ethically possible, this method is the best choice for answering questions about cause and effect.



Research methods for collecting data
Research method | Primary or secondary? | Qualitative or quantitative? | When to use
Experiment | Primary | Quantitative | To test cause-and-effect relationships.
Survey | Primary | Quantitative | To understand the general characteristics of a population.
Interview/focus group | Primary | Qualitative | To gain more in-depth understanding of a topic.
Observation | Primary | Either | To understand how something occurs in its natural setting.
Literature review | Secondary | Either | To situate your research in an existing body of work, or to evaluate trends within a research topic.
Case study | Either | Either | To gain an in-depth understanding of a specific group or context, or when you don’t have the resources for a large study.

Your data analysis methods will depend on the type of data you collect and how you prepare it for analysis.

Data can often be analyzed both quantitatively and qualitatively. For example, survey responses could be analyzed qualitatively by studying the meanings of responses or quantitatively by studying the frequencies of responses.
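The frequency-based (quantitative) reading of such responses can be sketched in a few lines of Python; the coded response labels below are hypothetical.

```python
from collections import Counter

# Hypothetical open-ended survey responses, already coded into short labels.
responses = ["price", "quality", "price", "support", "quality", "price"]

# Quantitative angle: frequency of each response category.
frequencies = Counter(responses)
most_common_code, most_common_count = frequencies.most_common(1)[0]
# A qualitative angle would instead examine the wording and context of each answer.
```

The same raw responses feed both analyses; only the questions asked of them differ.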

Qualitative analysis methods

Qualitative analysis is used to understand words, ideas, and experiences. You can use it to interpret data that was collected:

  • From open-ended surveys and interviews , literature reviews , case studies , ethnographies , and other sources that use text rather than numbers.
  • Using non-probability sampling methods .

Qualitative analysis tends to be quite flexible and relies on the researcher’s judgement, so you have to reflect carefully on your choices and assumptions and be careful to avoid research bias .
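Qualitative coding ultimately rests on researcher judgement, but a first mechanical pass can be sketched in code. The codebook themes, keyword lists, and interview excerpts below are entirely hypothetical.

```python
# A deliberately simple sketch of first-pass thematic coding.
codebook = {
    "workload": ["busy", "hours", "overtime"],
    "support": ["help", "mentor", "support"],
}

excerpts = [
    "The hours are long and overtime is common.",
    "My mentor was a huge help during the first year.",
]


def code_excerpt(text, codebook):
    """Return the set of themes whose keywords appear in the text."""
    lowered = text.lower()
    return {
        theme
        for theme, keywords in codebook.items()
        if any(word in lowered for word in keywords)
    }


coded = [code_excerpt(e, codebook) for e in excerpts]
```

A real thematic analysis would iterate on the codebook, handle ambiguous passages, and document the reasoning behind each coding decision.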

Quantitative analysis methods

Quantitative analysis uses numbers and statistics to understand frequencies, averages and correlations (in descriptive studies) or cause-and-effect relationships (in experiments).

You can use quantitative analysis to interpret data that was collected either:

  • During an experiment .
  • Using probability sampling methods .

Because the data is collected and analyzed in a statistically valid way, the results of quantitative analysis can be easily standardized and shared among researchers.

Research methods for analyzing data
Research method | Qualitative or quantitative? | When to use
Statistical analysis | Quantitative | To analyze data collected in a statistically valid manner (e.g. from experiments, surveys, and observations).
Meta-analysis | Quantitative | To statistically analyze the results of a large collection of studies. Can only be applied to studies that collected data in a statistically valid manner.
Thematic analysis | Qualitative | To analyze data collected from interviews, focus groups, or textual sources. To understand general themes in the data and how they are communicated.
Content analysis | Either | To analyze large volumes of textual or visual data collected from surveys, literature reviews, or other sources. Can be quantitative (i.e. frequencies of words) or qualitative (i.e. meanings of words).


Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

In statistics, sampling allows you to test a hypothesis about the characteristics of a population.
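A simple random draw of 100 students from a larger population can be sketched with Python's standard library; the population size and seed below are arbitrary.

```python
import random

# Hypothetical population: 5,000 student IDs.
population = list(range(1, 5001))

random.seed(42)  # fixed seed so the draw is repeatable
sample = random.sample(population, 100)  # simple random sample, without replacement
```

Because `random.sample` draws without replacement, every student appears at most once, and with a uniform generator each student has the same chance of selection.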

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
  • If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.

Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).

In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .

In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.



Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data enables smart decision-making in various application domains. In the area of data science, advanced analytics methods, including machine learning modeling, can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on “Data Science”, including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains, including business, healthcare, cybersecurity, urban and rural data science, and so on, taking into account data-driven smart computing and decision-making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data [17]. The current electronic world is thus a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [112]. These data can be structured, semi-structured, or unstructured, and they grow day by day [105]. Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze actual phenomena with data. According to Cao et al. [17], “data science is the science of data” or “data science is the study of data”, where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of “Data science” is increasing steadily, as shown in Fig. 1 according to Google Trends data over the last 5 years [36]. In addition to data science, we have also shown the popularity trends of relevant areas such as “Data analytics”, “Data mining”, “Big data”, and “Machine learning” in the figure. According to Fig. 1, the popularity indication values for these data-driven domains, particularly “Data science” and “Machine learning”, are increasing steadily. This statistical information, together with the applicability of data-driven smart decision-making in various real-world application areas, motivates us to study “Data science” and machine-learning-based “Advanced analytics” in this paper.

Fig. 1. The worldwide popularity score of data science compared with relevant areas, on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. The emphasis of advanced analytics is on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of data, while advanced analytics is a step forward, offering a deeper understanding of data and helping to analyze it at a granular level. In the field of data science, several types of analytics are popular: “Descriptive analytics”, which answers the question of what happened; “Diagnostic analytics”, which answers why it happened; “Predictive analytics”, which predicts what will happen in the future; and “Prescriptive analytics”, which prescribes what action should be taken, each discussed briefly in “Advanced analytics methods and smart computing”. Such advanced analytics and decision-making based on machine learning techniques [105], a major part of artificial intelligence (AI) [102], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to its learning capability for smart computing as well as automation [121].
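The distinction between descriptive and predictive analytics can be made concrete with a small sketch: summary statistics describe what happened, while a fitted trend extrapolates what is likely to happen next. The monthly sales figures below are invented for illustration.

```python
# Monthly sales figures (hypothetical) illustrating two analytics levels.
sales = [100, 104, 109, 115, 120, 126]

# Descriptive analytics: what happened?
mean_sales = sum(sales) / len(sales)

# Predictive analytics: what is likely to happen next?
# Fit a least-squares line y = a + b*t and extrapolate one step ahead.
n = len(sales)
ts = list(range(n))
t_mean = sum(ts) / n
b = (
    sum((t - t_mean) * (y - mean_sales) for t, y in zip(ts, sales))
    / sum((t - t_mean) ** 2 for t in ts)
)
a = mean_sales - b * t_mean
forecast = a + b * n  # predicted value for the next month
```

Diagnostic and prescriptive analytics would go further, asking why the trend exists and what action it should drive.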

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data and can thus be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from the artificial neural network, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “Advanced analytics methods and smart computing”. It is thus important to understand the principles of the various advanced analytics methods mentioned above and how to apply them in various real-world application areas. For instance, in our earlier paper, Sarker et al. [114], we discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account data science application areas and real-world problems in ten potential domains, including business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “Real-world application domains”.
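Of the methods listed above, clustering analysis lends itself to a compact illustration. The sketch below implements a minimal k-means on one-dimensional toy data; it is a didactic simplification, not the authors' implementation, and the data values are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Minimal k-means: assign each point to the nearest center,
    then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        # Recompute centers; keep a center in place if its cluster is empty.
        centers = [sum(pts) / len(pts) if pts else c for c, pts in clusters.items()]
    return sorted(centers)


# Toy data with two obvious groups, around 1.0 and 8.0.
data = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centers = kmeans_1d(data, [0.0, 10.0])
```

Production clustering would use a vetted library, handle multidimensional data, and check convergence rather than running a fixed number of iterations.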

Based on the importance of machine learning modeling for extracting useful insights from data and for data-driven smart decision-making mentioned above, in this paper, we present a comprehensive view on “Data Science”, including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytics methods from a solution perspective, and discussing their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for academia and industry professionals who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision-making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in real-world life. We also briefly discuss the concept of data science modeling, from business problems to data products and automation, to understand its applicability and to provide intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confusing. In the following, we define these terms and differentiate them with the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [ 17 ]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [ 17 ]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which has a similar meaning to several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [ 38 ], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [ 38 ]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. “Big data” is another popular term nowadays, which may change statistical and data analysis approaches as it has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [ 74 ]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [ 129 ]. Several unique features, including volume, velocity, variety, veracity, value (5Vs), and complexity, are used to understand and describe big data [ 69 ].

In terms of analytics, basic analytics provides a summary of data, whereas “Advanced analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights and to make predictions or generate recommendations, typically beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics, which can automate analytical model building [ 112 ]. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [ 38 , 115 ]. “Deep learning” is a subfield of machine learning concerning algorithms inspired by the structure and function of the human brain, called artificial neural networks [ 38 , 139 ].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [ 17 ], Cao et al. defined data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “ Understanding data science modeling ”, we briefly discuss data science modeling from a practical perspective, starting from business problems to data products, which can assist data scientists in thinking and working on a particular real-world problem domain within the area of data science and analytics.

Related Work

Several papers in the area have reviewed data science and its significance. For example, the authors in [ 19 ] identify the evolving field of data science and its importance in the broader knowledge environment, as well as some issues that differentiate data science and informatics from conventional approaches in the information sciences. Donoho [ 27 ] presents 50 years of data science, including recent commentary on data science in the mass media and on how/whether data science differs from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [ 53 ] and present a taxonomy of research themes in TGDS. Cao et al. provide a detailed survey and tutorial on the fundamental aspects of data science in [ 17 ], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [ 17 , 112 , 114 ]. The data can be in different types such as (i) Structured—that has a well-defined data structure and follows a standard order, examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—has no pre-defined format or organization, examples are sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—has elements of both the structured and unstructured data containing certain organizational properties, examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—that represents data about the data, examples are author, file type, file size, creation date and time, last modification date and time, etc. [ 38 , 105 ].
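As a small illustration of the semi-structured category above, the sketch below parses a JSON document with Python's standard library; the field names are hypothetical, chosen only to show that such a record carries organizational properties (named, nested fields) without a rigid schema:

```python
import json

# A hypothetical semi-structured record: named fields provide organization,
# but there is no fixed schema -- "tags" is optional and nesting depth varies.
raw = '''
{
  "author": "A. Example",
  "created": "2021-06-01T10:30:00",
  "tags": ["iot", "sensor"],
  "payload": {"temperature": 21.5, "unit": "C"}
}
'''

record = json.loads(raw)
print(record["payload"]["temperature"])  # access a nested field
print(record.get("tags", []))            # optional field with a default
```

The same document could not be stored directly in a fixed relational table without first flattening it, which is what distinguishes it from fully structured data.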

In the area of data science, researchers use various widely-used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ] etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “ Real-world application domains ”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in “ Background and related work ”. In this section, we briefly discuss how data science can play a significant role in the real-world business process. Figure 2 shows an example of data science modeling, starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves getting a clear understanding of the problem to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, is the behavior unrealistic/abnormal, which option should be taken, what action, etc. could be relevant questions depending on the nature of the problem. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge, which can enable organizations to enhance their decision-making process, is known as “Business Intelligence” [ 65 ]. Identifying the relevant data sources that can help to answer the formulated questions, and what kinds of actions should be taken from the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve it.
  • Understanding data: Data science is largely driven by the availability of data [ 114 ]. Thus, a sound understanding of the data is needed to build a data-driven model or system. The reason is that real-world datasets are often noisy, contain missing values, or have inconsistencies or other data issues, which need to be handled effectively [ 101 ]. To gain actionable insights, the appropriate data must be sourced and cleansed, which is fundamental to any data science engagement. For this, a data assessment that evaluates what data is available and how it aligns with the business problem could be the first step in data understanding. Several aspects, such as data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc., need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would best be needed and the best ways to acquire it.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [ 135 ]. It examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus, data exploration is typically used to figure out the gist of the data and to develop a first-step assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily exploration offers tools for creating hypotheses, generally by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [ 72 , 91 ]. Before the data is ready for modeling, it is necessary to use data summarization and visualization to audit the quality of the data and provide the information needed to process it. To ensure the quality of the data, the data pre-processing technique, which is typically the process of cleaning and transforming raw data [ 107 ] before processing and analysis, is important. It also involves reformatting information, making data corrections, and merging data sets to enrich the data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in the data and dealing with them, ensuring data quality, etc. could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “ Advanced analytics methods and smart computing ”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models, summarized in our earlier paper Sarker et al. [ 105 ], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually in a ratio such as 80:20, or use the popular k -folds data splitting method [ 38 ], to observe whether the model performs well on unseen data and to maximize model performance. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [ 38 , 115 ] are used to measure model performance, which can guide the data scientists to choose or design the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced techniques such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, to improve the ultimate data-driven model to solve a particular business problem through smart decision-making.
  • Data product and automation: A data product is typically the output of any data science activity [ 17 ]. In general terms, a data product is a data deliverable or data-enabled guide, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “ Real-world application domains ”, where various data products can play a significant role in relevant business problems to make them smart and automated.
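The split-and-evaluate step described above can be sketched in plain Python. This is a minimal illustration on a hypothetical toy dataset, not a production pipeline: an 80:20 holdout split followed by the confusion-matrix metrics (accuracy, precision, recall, f-score) mentioned in the text.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle and split a dataset into training and test subsets (e.g., 80:20)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def evaluate(y_true, y_pred):
    """Accuracy, precision, recall, and f-score from binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f-score": f_score}

# Hypothetical labelled dataset: (features, label) pairs.
dataset = [((i, i % 3), i % 2) for i in range(100)]
train, test = train_test_split(dataset)  # 80:20 split
print(len(train), len(test))             # 80 20

metrics = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```

A k-folds variant would repeat this split k times so that every example appears in the test subset exactly once, averaging the metrics across folds.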

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. An essential part of the data science process is having a deep understanding of the business problem to solve; without that, it would be much harder to gather the right data and extract the most useful information from it for decision-making. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today’s computing process smarter and more intelligent, as discussed briefly in the following section.

[Figure 2: An example of data science modeling from real-world data to data-driven system and decision making]

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “ Background and related work ”, basic analytics provides a summary of data, whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and the outcomes needed to solve the associated business problems, and then briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, and “What action should be taken?” are common and important. Based on these questions, in this paper, we categorize the analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: It is the interpretation of historical data to better understand the changes that have occurred in a business. Thus, descriptive analytics answers the question, “what happened in the past?” by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn or Facebook, etc. For instance, by analyzing trends, patterns, and anomalies in customers’ historical shopping data, it is possible to estimate the probability of a customer purchasing a product. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business, and eventually adopt more effective management strategies and business decisions.
  • Diagnostic analytics: It is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help find the root cause of the problem. For example, the human resource management department of a business organization may use diagnostic analytics to find the best applicant for a position, select them, and compare them to other similar positions to see how well they perform. In a healthcare example, it might help to figure out whether patients’ symptoms such as high fever, dry cough, headache, fatigue, etc. are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes such as to assess business risks, anticipate potential market patterns, and decide when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, “what will happen in the future?” Thus, the primary goal of predictive analytics is to identify and typically answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. Companies, for example, can use predictive analytics to minimize costs by better anticipating future demand and changing output and inventory, banks and other financial institutions to reduce fraud and risks by predicting suspicious activity, medical specialists to make effective decisions through predicting patients who are at risk of diseases, retailers to increase sales and customer satisfaction through understanding and predicting customer preferences, manufacturers to optimize production capacity through predicting maintenance requirements, and many more. Thus predictive analytics can be considered as the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, typically answering the question, “what action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations make more informed decisions that drive the most successful business results.

In summary, to clarify what happened and why it happened, both descriptive analytics and diagnostic analytics look at the past. Predictive analytics and prescriptive analytics use historical data to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we have summarized these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes in business processes and improvements. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Table 1: Various types of analytical methods with examples

Analytical method      | Data-driven model building                              | Examples
Descriptive analytics  | Answers the question, “what happened in the past?”      | Summarizing past events, e.g., sales, business data, social media usage; reporting general trends, etc.
Diagnostic analytics   | Answers the question, “why did it happen?”              | Identifying anomalies and determining causal relationships, e.g., to find out a business loss, identifying the influence of medications, etc.
Predictive analytics   | Answers the question, “what will happen in the future?” | Predicting customer preferences, recommending products, identifying possible security breaches, predicting staff and resource needs, etc.
Prescriptive analytics | Answers the question, “what action should be taken?”    | Improving business management and maintenance, improving patient care and healthcare administration, determining optimal marketing strategies, etc.
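The descriptive end of this spectrum reduces, in its simplest form, to computing aggregates over past data. The sketch below uses hypothetical monthly sales figures and only the Python standard library to answer "what happened?":

```python
from statistics import mean, median

# Hypothetical monthly sales figures (descriptive analytics: "what happened?")
monthly_sales = [120, 135, 128, 150, 160, 155]

summary = {
    "total": sum(monthly_sales),
    "mean": mean(monthly_sales),
    "median": median(monthly_sales),
    # index of the best-performing month
    "best_month": max(range(len(monthly_sales)), key=monthly_sales.__getitem__),
    # relative growth from the first to the last month
    "growth": (monthly_sales[-1] - monthly_sales[0]) / monthly_sales[0],
}
print(summary)
```

Diagnostic, predictive, and prescriptive analytics build on such summaries with progressively more modeling: correlating them with causes, extrapolating them forward, and optimizing actions against them.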

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of a machine learning-based predictive model, considering both the training and testing phases. In the following, we discuss a wide range of methods such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on within the scope of our study.

[Figure 3: A general structure of a machine learning based predictive model considering both the training and testing phases]

Regression Analysis

In data science, one of the most common statistical approaches used for predictive modeling and data mining tasks is regression [ 38 ]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict a continuous-valued output [ 105 , 117 ]. Equations 1, 2, and 3 [ 85 , 105 ] represent the simple, multiple (or multivariate), and polynomial regressions respectively, where x represents the independent variable and y is the predicted/target output mentioned above:

y = β0 + β1 x                                 (1)
y = β0 + β1 x1 + β2 x2 + ... + βn xn          (2)
y = β0 + β1 x + β2 x^2 + ... + βn x^n         (3)

Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., to find the relationship of causal influence between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem; in that case, polynomial regression performs better, though it increases model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [ 85 , 105 ] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression techniques [ 85 , 105 ] can be used to build effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
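The simple linear regression case (Eq. 1) can be fitted in closed form by ordinary least squares: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch on noise-free toy data, so the true coefficients are recovered exactly:

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = b0 + b1 * x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope b1 = covariance(x, y) / variance(x)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    b1 = cov / var
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Toy data generated from y = 2 + 3x (no noise), so OLS recovers it exactly.
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]
b0, b1 = fit_simple_regression(x, y)
print(b0, b1)  # 2.0 3.0

def predict(xi):
    return b0 + b1 * xi

print(predict(10))  # 32.0
```

Multiple and polynomial regression (Eqs. 2 and 3) generalize this to several predictors, typically solved with matrix methods rather than the scalar formulas above.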

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [ 38 ]. Spam identification, such as ‘spam’ and ‘not spam’ in email service providers, is an example of a classification problem. Several forms of classification analysis exist in the area, such as binary classification—which refers to the prediction of one of two classes; multi-class classification—which involves the prediction of one of more than two classes; and multi-label classification—a generalization of multi-class classification in which each example may be assigned more than one class label [ 105 ].

Several popular classification techniques exist to solve classification problems, such as k-nearest neighbors [ 5 ], support vector machines [ 55 ], naïve Bayes [ 49 ], adaptive boosting [ 32 ], extreme gradient boosting [ 85 ], logistic regression [ 66 ], decision trees ID3 [ 92 ] and C4.5 [ 93 ], and random forests [ 13 ]. Tree-based classification techniques, e.g., random forest considering multiple decision trees, perform better than others in many real-world cases due to their capability of producing logic rules [ 103 , 115 ]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [ 109 ], and IntrudTree [ 106 ] can be used for building effective classification or prediction models in the relevant tasks within the domain of data science and analytics.

[Figure 4: An example of a random forest structure considering multiple decision trees]

Cluster Analysis

Clustering is a form of unsupervised machine learning that is well known in many data science application areas for statistical data analysis [ 38 ]. Clustering techniques search for structure inside a dataset and, when no class labels are available in advance, group homogeneous cases together. This means that data points are similar to each other within a cluster and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [ 105 ]. Clustering is often used to gain insight into how data is distributed in a given dataset, or as a preprocessing phase for other algorithms. Data clustering, for example, helps retail businesses with understanding customer shopping behavior, planning sales campaigns, retaining consumers, detecting anomalies, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 98 , 138 , 141 ]. In our earlier paper, Sarker et al. [ 105 ], we summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical methods, model-based methods, etc. In the literature, the popular K-means [ 75 ], K-Medoids [ 84 ], CLARA [ 54 ], etc. are known as partitioning methods; DBSCAN [ 30 ], OPTICS [ 8 ], etc. are known as density-based methods; and single linkage [ 122 ], complete linkage [ 123 ], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [ 134 ] and CLIQUE [ 2 ]; model-based clustering methods, such as neural network learning [ 141 ], GMM [ 94 ], and SOM [ 18 , 104 ]; and constraint-based methods, such as COP K-means [ 131 ] and CMWK-Means [ 25 ], are used in the area. Recently, Sarker et al. [ 111 ] proposed BOTS, a hierarchical clustering method based on a bottom-up agglomerative technique for capturing users’ similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
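
A minimal sketch of the partitioning idea behind K-means follows (1-D synthetic data for brevity; real implementations handle multi-dimensional data, smarter initialization, and convergence checks):

```python
# K-means on 1-D data: alternate between assigning each point to its
# nearest centroid and recomputing each centroid as its cluster mean.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # recompute centroids, dropping any cluster that became empty
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centroids)

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans_1d(points, [0.0, 6.0]))  # -> [2.0, 11.0]
```

The two recovered centroids sit at the means of the two obvious groups; density-based methods such as DBSCAN would instead grow clusters from dense neighborhoods and need no preset number of clusters.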

Association Rule Analysis

Association rule learning is a rule-based machine learning approach, typically an unsupervised learning method, used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets for discovering interesting relationships or patterns. The main strength of association learning is its comprehensiveness, as it produces all associations that satisfy user-specified constraints such as minimum support and confidence values [ 138 ].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between items inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to adjust marketing and sales plans. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern based [ 4 , 47 , 73 ], logic-based [ 31 ], tree-based [ 39 ], fuzzy rules [ 126 ], belief rules [ 148 ], etc. Rule learning techniques such as AIS [ 3 ], Apriori [ 4 ], Apriori-TID and Apriori-Hybrid [ 4 ], FP-Tree [ 39 ], Eclat [ 144 ], and RARM [ 24 ] exist to solve the relevant business problems. Among the association rule learning techniques, Apriori [ 4 ] is the most commonly used algorithm for discovering association rules from a given dataset [ 145 ]. ABC-RuleMiner, a recent association rule learning technique proposed in our earlier paper by Sarker et al. [ 113 ], can give significant results in terms of generating non-redundant rules that can be used for smart decision-making according to human preferences, within the area of data science applications.
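
The support-counting phase that Apriori-style algorithms build on can be sketched as follows (toy baskets; a real Apriori implementation prunes candidate itemsets level by level using the anti-monotonicity of support, rather than enumerating all singletons and pairs as done here):

```python
# Find itemsets of size 1 and 2 whose support (fraction of transactions
# containing the itemset) meets a minimum threshold — the raw material
# from which association rules are then derived.
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in (1, 2):  # toy version: singletons and pairs only
        for cand in combinations(items, size):
            support = sum(set(cand) <= t for t in transactions) / len(transactions)
            if support >= min_support:
                frequent[cand] = support
    return frequent

baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'milk', 'butter'}, {'milk'}]
result = frequent_itemsets(baskets, min_support=0.5)
print(result[('bread', 'milk')])  # 2 of 4 baskets -> 0.5
```

From a frequent pair such as {bread, milk}, a rule like bread → milk is kept only if its confidence, support({bread, milk}) / support({bread}), also exceeds the user-specified threshold.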

Time-Series Analysis and Forecasting

A time series is a sequence of data points indexed in time order, typically by date or timestamp [ 111 ]. Depending on the frequency, a time series can be annual (e.g., annual budgets), quarterly (e.g., expenditures), monthly (e.g., air traffic), weekly (e.g., sales quantities), daily (e.g., weather), hourly (e.g., stock prices), minute-wise (e.g., inbound calls in a call center), or even second-wise (e.g., web traffic) in the relevant domains.

A mathematical method for dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [ 130 ] learns the behavioral trends or patterns of past data. The moving average (MA) model [ 40 ] is another simple and common form of smoothing used in time-series analysis and forecasting, which uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) model [ 12 , 120 ] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [ 12 , 120 ]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model also covers the non-stationary case. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average model with exogenous inputs (ARMAX) are also used as time-series models [ 120 ].
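
As a small illustration of the smoothing idea underlying these models (toy series; note that an MA *model* regresses on past forecast errors, whereas the simple moving average below merely averages recent observations):

```python
# Simple moving average: each output value is the mean of the last
# `window` observations, smoothing short-term noise in the series.
def moving_average(series, window):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

series = [10, 12, 11, 13, 15, 14]   # hypothetical monthly values
print(moving_average(series, 3))    # -> [11.0, 12.0, 13.0, 14.0]
```

The smoothed sequence exposes the underlying upward trend; an AR model would instead regress each value on its own lagged values, and ARIMA additionally differences the series to handle non-stationarity.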

In addition to stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [ 111 ] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Fig. 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users mentioned above [ 111 ]. The authors in [ 118 ] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting, outperforming traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any temporal-measurement domain of applied science and engineering. Thus, it covers a wide range of application areas in data science.

Fig. 5: An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics

Opinion Mining and Sentiment Analysis

Sentiment analysis, or opinion mining, is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [ 71 ]. Three basic kinds of sentiment are distinguished: positive, negative, and neutral, along with more specific feelings such as angry, happy, and sad, or interested versus not interested, etc. Depending on the problem domain, more refined sentiment categories can also be defined to evaluate the feelings of individuals in various situations.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain opinions from the public or its customers about its products and services in order to refine business policy and make better business decisions. It can thus benefit a business to understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think of a service or product before they purchase it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [ 45 ].

Several popular techniques, such as lexicon-based methods (including dictionary-based and corpus-based methods), machine learning (including supervised and unsupervised learning), deep learning, and hybrid methods, are used in sentiment analysis-related tasks [ 70 ]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, sentiment analysis incorporates the use of statistics, natural language processing (NLP), machine learning, and deep learning methods. Sentiment analysis is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus, sentiment analysis has a significant influence in many data science applications where public sentiment is involved in various real-world issues.
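
A toy lexicon-based scorer illustrates the simplest of these techniques (the word lists below are illustrative stand-ins, not a real sentiment lexicon):

```python
# Lexicon-based sentiment: count positive minus negative words and map
# the score to a polarity label. Real lexicons contain thousands of
# weighted entries and handle negation, which this sketch ignores.
POSITIVE = {'good', 'great', 'happy', 'excellent', 'love'}
NEGATIVE = {'bad', 'terrible', 'sad', 'poor', 'hate'}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'

print(sentiment('The service was great and the staff were excellent'))  # -> positive
print(sentiment('terrible product I hate it'))                          # -> negative
```

Machine learning and deep learning approaches replace the fixed word lists with features or embeddings learned from labeled examples, which is what lets them handle context, negation, and domain-specific vocabulary.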

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more [ 112 ]. Behavioral analysis aims to understand how and why consumers or users behave as they do, allowing accurate predictions of how they are likely to behave in the future. Behavioral analytics uses the large quantities of raw user event data gathered during sessions in which people use apps, games, or websites, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness. For instance, it allows advertisers to make the best offers to the right client segments at the right time. In our earlier papers, Sarker et al. [ 101 , 111 , 113 ], we have discussed how to extract users’ phone usage behavioral patterns utilizing real-life phone log data for various purposes.

In real-world scenarios, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization to achieve particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given dataset (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [ 111 ], behavioral decision tree classification [ 109 ], and behavioral association rules [ 113 ], can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [ 108 ], which takes into account recent behavioral patterns, can be effective while analyzing behavioral data, since such data may not be static and can change over time in the real world.
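
The core of a cohort-retention computation can be sketched as follows (the user records are hypothetical):

```python
# Cohort retention: group users by signup month and measure what
# fraction of a cohort is still active in a given later month.
events = [  # (user, signup_month, months_active)
    ('u1', '2021-01', {'2021-01', '2021-02'}),
    ('u2', '2021-01', {'2021-01'}),
    ('u3', '2021-02', {'2021-02', '2021-03'}),
]

def retention(events, cohort, month):
    cohort_users = [u for u, signup, _ in events if signup == cohort]
    retained = [u for u, signup, active in events
                if signup == cohort and month in active]
    return len(retained) / len(cohort_users)

print(retention(events, '2021-01', '2021-02'))  # 1 of 2 users retained -> 0.5
```

Laying such fractions out for every cohort and every month after signup produces the familiar triangular retention table used to compare cohorts over time.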

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or observations that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, or exceptions [ 63 , 114 ]. Anomaly detection techniques can flag new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraudulent or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing tasks for the removal of anomalous or inconsistent instances from real-world data collected from various sources, including user logs, devices, networks, and servers. For anomaly detection, several machine learning techniques can be used, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [ 105 ]. The exclusion of anomalous data from a dataset can also result in a statistically significant improvement in accuracy during supervised learning [ 101 ]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc. can be challenging in the process of anomaly detection. Anomaly detection is applicable in a variety of domains, such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more, and can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
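
A simple statistical baseline for outlier analysis flags points far from the mean in standard-deviation units (synthetic data; the threshold is a rule of thumb, and because a single extreme point in a small sample inflates the sample standard deviation, the attainable z-score is bounded and a threshold below 3 is used here):

```python
# Z-score outlier detection: flag points whose distance from the mean
# exceeds `threshold` sample standard deviations.
import statistics

def zscore_outliers(data, threshold=2.5):
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]  # 100 is an injected anomaly
print(zscore_outliers(data))  # -> [100]
```

Methods such as isolation forests avoid this masking effect (the outlier inflating the very statistics used to detect it) by isolating points through random partitioning instead of relying on global mean and variance.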

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [ 23 ]. It is usually used to organize variables into a small number of clusters based on their common variance, using mathematical or statistical procedures. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the nature of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [ 143 ].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex patterns by exploring the dataset, while CFA tries to confirm hypotheses and uses path analysis diagrams to represent variables and factors [ 143 ]. Factor analysis is an unsupervised machine learning technique used for dimensionality reduction. The most common methods for factor analysis are principal component analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [ 48 ]. Correlation analysis methods such as Pearson correlation, canonical correlation, etc. may also be useful in the field, as they can quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
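
The Pearson correlation mentioned above can be computed directly from its definition (toy data; correlation matrices built this way are the usual starting point for factor extraction):

```python
# Pearson correlation coefficient r = cov(x, y) / (std(x) * std(y)),
# computed from its definition; r lies in [-1, 1].
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # perfectly correlated -> 1.0
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 3))  # perfectly anti-correlated -> -1.0
```

In factor analysis, variables that correlate strongly with one another (high pairwise r) are the candidates for loading onto a shared underlying factor.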

Log Analysis

Logs are commonly used in system management, as they are often the only data available that record detailed system runtime activities or behaviors in production [ 44 ]. Log analysis can thus be considered the process of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [ 88 , 110 ], SMS logs [ 28 ], mobile app usage logs [ 124 , 149 ], notification logs [ 77 ], game logs [ 82 ], context logs [ 16 , 149 ], web logs [ 37 ], smartphone life logs [ 95 ], etc. are some examples of log data for smartphone devices. The main characteristic of such log data is that it captures users’ actual behavioral activities with their devices. Other similar log data include search logs [ 50 , 133 ], application logs [ 26 ], server logs [ 33 ], network logs [ 57 ], event logs [ 83 ], network and security logs [ 142 ], etc.

Several techniques, such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [ 105 ], can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by facilitating the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors, and Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take machine learning modeling into account can play a significant role in extracting insightful patterns from log data, which can be used for building automated and smart applications; log analysis can thus be considered a key working area in data science.
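
A toy example of the classification-and-tagging step: parsing a hypothetical server-log format with a regular expression and counting entries per severity level (the log format and lines below are invented for illustration):

```python
# Parse "timestamp LEVEL message" lines and tally occurrences per level,
# a common first step before correlation or anomaly analysis of logs.
import re
from collections import Counter

LOG_PATTERN = re.compile(r'^(\S+ \S+) (INFO|WARN|ERROR) (.*)$')

lines = [
    '2021-06-01 10:00:01 INFO service started',
    '2021-06-01 10:00:05 WARN disk usage at 85%',
    '2021-06-01 10:00:09 ERROR connection refused',
    '2021-06-01 10:00:12 ERROR connection refused',
]

levels = Counter(m.group(2) for line in lines
                 if (m := LOG_PATTERN.match(line)))
print(levels['ERROR'], levels['WARN'])  # -> 2 1
```

A sudden spike in one level's count over a time window is itself a useful anomaly signal, which is how simple tallies like this feed into the machine learning-based log analysis described above.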

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are the multi-layer perceptron (MLP) [ 85 ], the convolutional neural network (CNN or ConvNet) [ 67 ], and the long short-term memory recurrent neural network (LSTM-RNN) [ 34 ]. Fig. 6 shows a structure of an artificial neural network with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ] improve on the design of traditional artificial neural networks (ANNs) by including convolutional layers, pooling layers, and fully connected layers. CNNs are commonly used in a variety of fields, including natural language processing, speech recognition, image processing, and other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on CNNs are also used in the field.

Fig. 6: A structure of an artificial neural network modeling with multiple processing layers

In addition to CNNs, the recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning from sequential data, such as classifying, processing, and making predictions based on time-series data. Therefore, when the data is in a sequential format, such as time series or sentences, LSTM can be used, and it is widely applied in time-series analysis, natural language processing, speech recognition, and so on.

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, thereby reducing dimensionality. Another technique commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBMs) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling [ 46 ]. A deep belief network (DBN) is usually made up of unsupervised networks, such as restricted Boltzmann machines (RBMs) or autoencoders, together with a backpropagation neural network (BPNN) [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, usually the re-use of a pre-trained model on a new problem, is widely used at present because it can train deep neural networks with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article, Sarker et al. [ 104 ], we have summarized a brief discussion of the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.
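
As a minimal illustration of why multiple processing layers matter, a hand-wired two-layer network with step activations can compute XOR, which no single neuron can (the weights below are chosen by hand rather than learned by backpropagation):

```python
# Forward pass through a fixed two-layer network with step activations,
# wired to compute XOR: the hidden layer computes OR and AND of the
# inputs, and the output unit combines them as "OR and not AND".
def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # hidden unit 1: acts like OR
    h2 = step(x1 + x2 - 1.5)     # hidden unit 2: acts like AND
    return step(h1 - h2 - 0.5)   # output: OR and not AND = XOR

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # -> [0, 1, 1, 0]
```

XOR is not linearly separable, so a single neuron cannot represent it; stacking even one hidden layer resolves this, which is the basic intuition behind the expressive power of deeper architectures. In practice, of course, the weights are learned by backpropagation rather than set by hand.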

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “Data Science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policies, and every possible industry where data gets generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

  • Business or financial data science: In general, business data science can be considered the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as taking high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models predicting customer behavior and identifying patterns and trends based on historical business data, which can help companies reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In finance, historical data helps financial institutions make high-stakes business decisions and is mostly used for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the next-generation business and finance industry, particularly in terms of business automation, intelligence, and smart decision-making and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest, fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus, industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in such a revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling thus can be used to maximize production, reduce costs, and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves the extraction of actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained and analyzed from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys. In reality, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, prevent avoidable diseases, and generally improve the quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today’s methods of care delivery. Thus, health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and even better monitor the spread of diseases. Eventually, it may lead to new approaches to improving patient care, clinical expertise, diagnosis, and management.
  • IoT data science: The Internet of things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered to be the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT’s main fields of application is the smart city, which uses technology to improve city services and citizens’ living experiences. For example, using the relevant data, data science methods can be used for traffic prediction in smart cities or to estimate citizens’ total energy usage for a particular period. Deep learning-based models in data science can be built based on large-scale IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, and industry, among many others.
  • Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while surfing, or protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [ 103 , 106 ]. To generate security policy rules, association rule learning can play a significant role to build rule-based systems [ 102 ]. Deep learning-based security models can perform better when utilizing the large scale of security datasets [ 140 ]. Thus data science modeling can enable professionals in cybersecurity to be more proactive in preventing threats and reacting in real-time to active attacks, through extracting actionable insights from the security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as PCs, tablets, or smartphones [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems are all common sources of behavioral data. Behavioral data is much more than static data; it changes over time [ 108 ]. Advanced analytics of such data, including machine learning modeling, can facilitate several areas, such as predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; and detecting compromised credentials and insider threats by locating anomalous behavior, or making suggestions, etc. Overall, behavioral data science modeling typically enables making the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and IoT. In a social context, the insights extracted by analyzing human behavioral data using advanced analytics methods can be used for data-driven intelligent social services, which can be considered social data science.
  • Mobile data science: Today’s smart mobile phones are considered “next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity” [ 146 ]. In our earlier paper [ 112 ], we have shown that users’ interest in “Mobile Phones” has grown more in recent years than interest in other platforms such as “Desktop Computer”, “Laptop Computer”, or “Tablet Computer”. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, Linkedin, and Twitter, and various IoT services such as smart cities, health, and transportation services, and many others. Intelligent apps are based on the insights extracted from the relevant datasets, depending on app characteristics such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and cross-platform [ 112 ]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.
  • Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as image, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered as valuable sources of multimedia big data [ 89 ]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world’s population lives in urban areas or cities [ 80 ], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, “urban area” can refer to surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is recorded; these data are loosely categorized into personal data (e.g., household, education, employment, health, immigration, crime), proprietary data (e.g., banking, retail, online platform data), government data (e.g., citywide crime statistics or government institutions), open and public data (e.g., data.gov, ordnance survey), and organic and crowdsourced data (e.g., user-generated web data, social media, Wikipedia) [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective by extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can facilitate the efficient management of urban areas, including real-time management (e.g., traffic flow management), evidence-based planning decisions that pertain to the longer-term strategic role of forecasting for urban planning (e.g., crime prevention, public safety, and security), and framing the future (e.g., political decision-making) [ 29 ]. Overall, urban data science can contribute to government and public planning, as well as to relevant sectors including retail, financial services, mobility, health, policing, and utilities, within a data-rich urban environment through data-driven smart decision-making and policies, which lead to smart cities and improve the quality of human life.
  • Smart villages or rural data science: Rural areas, or the countryside, are the opposite of urban areas and include villages, hamlets, and agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, supporting agriculture, and fostering economic development, from a data-driven perspective, by extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data, including machine learning modeling [ 105 ], can provide new opportunities for rural communities to build insight and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [ 105 ] can help farmers enhance their decisions to adopt sustainable agriculture by utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT) and mobile technologies and devices [ 1 , 51 , 52 ]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, and other services, leading to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector in our real-world life, where the relevant data is available to analyze. To gather the right data and extract useful knowledge or actionable insights from the data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion on the above ten potential real-world application domains by taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. The “Data Scientists” typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical modeling or algorithms, or data-driven tools, or solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in “ Understanding data science modeling ”, advanced analytics methods and smart computing in “ Advanced analytics methods and smart computing ”, and real-world application areas in “ Real-world application domains ” open several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions to build data-driven products.

  • Understanding the real-world business problems and the associated data, including their nature (e.g., forms, types, size, labels), is the first challenge in data science modeling, discussed briefly in “ Understanding data science modeling ”. This means identifying, specifying, representing, and quantifying the domain-specific business problems and data according to the requirements. For a data-driven, effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured or unstructured data, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, typically the process of categorizing, tagging, or labeling raw data for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is the extraction of relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [ 101 ]. The quality and availability of the data highly impact the advanced analytics methods, including machine and deep learning modeling, discussed in “ Advanced analytics methods and smart computing ”. Thus, it is important to understand the real-world business scenario and the associated data, i.e., whether, how, and why they are insufficient, missing, or problematic, and then to extend or redevelop existing methods, such as large-scale hypothesis testing and learning under inconsistency and uncertainty, to address the complexities in the data and the business problems. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.
  • Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offer a general description of the data, while advanced analytics is a step forward, offering a deeper understanding of the data and supporting granular data analysis. Thus, understanding advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “ Advanced analytics methods and smart computing ” may not be directly applicable to achieve the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules from the data, which makes the decision-making process complex and ineffective [ 113 ]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how robust or fragile the techniques are to input data is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [ 38 , 105 ] for solving a particular business problem is needed; consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms/techniques with higher accuracy, becomes a significant challenge for future-generation data scientists.
  • The traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, the new trends are more likely to be interesting and useful for modeling and predicting the future than older ones. For example, smartphone user behavior modeling, IoT services, stock market forecasting, health or transport service, job market analysis, and other related areas where time-series and actual human interests or preferences are involved over time. Thus, rather than considering the traditional data analysis, the concept of RecencyMiner, i.e., recent pattern-based extracted insight or knowledge proposed in our earlier paper Sarker et al. [ 108 ] might be effective. Therefore, to propose the new techniques by taking into account the recent data patterns, and consequently to build a recency-based data-driven model for solving real-world problems, is another significant challenging issue in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports data science modeling discussed in “ Understanding data science modeling ”. As a result, advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal context, spatial context, social context, environmental context, etc. [ 100 ] can be used for building an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, as well as experimental evaluation, is a very important direction to effectively solve a business problem in a particular domain, as well as a big challenge for the data scientists.
  • In several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain the result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.
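The pre-processing challenge raised above (ambiguous values, missing values, and outliers) can be made concrete with a minimal sketch. The following Python example, using only the standard library and a hypothetical numeric series, imputes missing entries with the median and drops outliers using a median-absolute-deviation (MAD) rule; the specific values and the 3-MAD cutoff are illustrative assumptions, not prescriptions:

```python
import statistics

def preprocess(values):
    """Impute missing entries (None) with the median, then drop points
    far from the median using the median absolute deviation (MAD),
    which, unlike the standard deviation, is not inflated by the
    outliers themselves."""
    observed = [v for v in values if v is not None]
    center = statistics.median(observed)
    filled = [center if v is None else v for v in values]
    mad = statistics.median(abs(v - center) for v in filled)
    cutoff = 3 * 1.4826 * mad  # 1.4826 rescales MAD to stdev-like units
    return [v for v in filled if abs(v - center) <= cutoff]

raw = [12.0, 15.0, None, 14.0, 13.0, 500.0, None, 16.0]
clean = preprocess(raw)
print(clean)  # the 500.0 reading is dropped; each None becomes 14.5
```

A robust rule is used deliberately: a plain 3-standard-deviation cutoff would fail here, because the extreme reading inflates the standard deviation enough to hide itself.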

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions, within the scope of our study in the area of data science and advanced analytics. The data scientists in academia/industry and the researchers in the relevant area have the opportunity to contribute to each issue identified above and build effective data-driven models or systems, to make smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view of data science, including various types of advanced analytical methods that can be applied to enhance the intelligence and capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from the related terms used in the area, to establish the position of this paper. We have further provided a thorough study of data science modeling, with the various processing modules needed to extract actionable insights from the data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. For this, we have also summarized various types of advanced analytical methods and outcomes, as well as the machine learning modeling needed to solve the associated business problems. This study’s key contribution has thus been identified as the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. The challenges identified provide promising research opportunities in the field that can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods leads in a positive direction and can be used as a reference guide for future research and applications in the field of data science and its real-world applications by both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Data analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data. According to Shamoo and Resnik (2003), various analytic procedures “provide a way of drawing inductive inferences from data and distinguishing the signal (the phenomenon of interest) from the noise (statistical fluctuations) present in the data”.

While data analysis in qualitative research can include statistical procedures, many times analysis becomes an ongoing iterative process in which data are collected and analyzed almost simultaneously. Indeed, researchers generally analyze for patterns in observations through the entire data collection phase (Savenye, Robinson, 2004). The form of the analysis is determined by the specific qualitative approach taken (field study, ethnography, content analysis, oral history, biography, unobtrusive research) and the form of the data (field notes, documents, audiotape, videotape).

An essential component of ensuring data integrity is the accurate and appropriate analysis of research findings. Improper statistical analyses distort scientific findings, mislead casual readers (Shepard, 2002), and may negatively influence the public perception of research. Integrity issues are just as relevant to the analysis of non-statistical data.

Considerations/issues in data analysis

There are a number of issues that researchers should be cognizant of with respect to data analysis. These include:

Having the necessary skills to analyze

A tacit assumption of investigators is that they have received training sufficient to demonstrate a high standard of research practice. Unintentional ‘scientific misconduct’ is likely the result of poor instruction and follow-up. A number of studies suggest this may be the case more often than believed (Nowak, 1994; Silverman, Manson, 2003). For example, Sica found that adequate training of physicians in medical schools in the proper design, implementation and evaluation of clinical trials is “abysmally small” (Sica, cited in Nowak, 1994). Indeed, a single course in biostatistics is the most that is usually offered (Christopher Williams, cited in Nowak, 1994).

A common practice of investigators is to defer the selection of analytic procedure to a research team ‘statistician’. Ideally, investigators should have substantially more than a basic understanding of the rationale for selecting one method of analysis over another. This allows investigators to better supervise staff who conduct the data analysis process and to make informed decisions.


While methods of analysis may differ by scientific discipline, the optimal stage for determining appropriate analytic procedures occurs early in the research process and should not be an afterthought. According to Smeeton and Goda (2003), “Statistical advice should be obtained at the stage of initial planning of an investigation so that, for example, the method of sampling and design of questionnaire are appropriate”.

The chief aim of analysis is to distinguish whether an observed event reflects a true effect or a false one. Any bias occurring in the collection of the data, or in the selection of the method of analysis, will increase the likelihood of drawing a biased inference. Bias can occur when recruitment of study participants falls below the minimum number required to demonstrate statistical power, or when a follow-up period sufficient to demonstrate an effect is not maintained (Altman, 2001).
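The power concern above can be quantified with a standard normal-approximation sample size formula. The sketch below, a minimal illustration using only the Python standard library, estimates how many participants per group a two-sided comparison of two means needs for a given standardized effect size (Cohen's d); it is the common textbook approximation, not a substitute for a full power analysis:

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided test
    comparing two means, where effect_size is Cohen's d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A "medium" effect (d = 0.5) at 80% power needs about 63 per group.
print(sample_size_per_group(0.5))
```

Recruiting fewer participants than such a calculation indicates is precisely the under-powering that invites biased inference.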



When failing to demonstrate statistically different levels between treatment groups, investigators may resort to breaking down the analysis into smaller and smaller subgroups in order to find a difference. Although this practice may not be inherently unethical, such analyses should be proposed before beginning the study, even if the intent is exploratory in nature. If the study is exploratory in nature, the investigator should make this explicit so that readers understand that the research is more of a hunting expedition than primarily theory driven. Although a researcher may not have a theory-based hypothesis for testing relationships between previously untested variables, a theory will have to be developed to explain an unanticipated finding. Indeed, in exploratory science, there are no a priori hypotheses, and therefore no hypothesis tests. Although theories can often drive the processes used in the investigation of qualitative studies, many times patterns of behavior or occurrences derived from analyzed data can result in the development of new theoretical frameworks rather than being determined in advance (Savenye, Robinson, 2004).

It is conceivable that multiple statistical tests could yield a significant finding by chance alone rather than reflecting a true effect. Integrity is compromised if the investigator only reports tests with significant findings, and neglects to mention a large number of tests failing to reach significance. While access to computer-based statistical packages can facilitate application of increasingly complex analytic procedures, inappropriate uses of these packages can result in abuses as well.
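How quickly chance findings accumulate over many tests can be shown with a short calculation. The figures below assume independent tests at the conventional 0.05 level; the Bonferroni correction shown is the simplest (and most conservative) remedy:

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(alpha, n_tests):
    """Per-test threshold that keeps the familywise error rate near alpha."""
    return alpha / n_tests

# Twenty independent tests at the usual 0.05 level:
print(round(familywise_error(0.05, 20), 3))  # roughly a 64% chance
print(bonferroni(0.05, 20))                  # per-test threshold 0.0025
```

With twenty tests, reporting only the "significant" ones conceals a better-than-even chance that at least one is a false positive.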



Every field of study has developed its accepted practices for data analysis. Resnik (2000) states that it is prudent for investigators to follow these accepted norms. Resnik further states that the norms are ‘…based on two factors:

(1) the nature of the variables used (i.e., quantitative, comparative, or qualitative),

(2) assumptions about the population from which the data are drawn (i.e., random distribution, independence, sample size, etc.). If one uses unconventional norms, it is crucial to clearly state this is being done, and to show how this new and possibly unaccepted method of analysis is being used, as well as how it differs from other more traditional methods. For example, Schroder, Carey, and Vanable (2003) juxtapose their identification of new and powerful data analytic solutions developed for count data in the area of HIV contraction risk with a discussion of the limitations of commonly applied methods.




While the conventional practice is to establish a standard of acceptability for statistical significance, in certain disciplines it may also be appropriate to discuss whether attaining statistical significance has a true practical meaning, i.e., clinical significance. Jeans (1992) defines ‘clinical significance’ as “the potential for research findings to make a real and important difference to clients or clinical practice, to health status or to any other problem identified as a relevant priority for the discipline”.

Kendall and Grove (1988) define clinical significance in terms of what happens when “… troubled and disordered clients are now, after treatment, not distinguishable from a meaningful and representative non-disturbed reference group”. Thompson and Noferi (2002) suggest that readers of counseling literature should expect authors to report either practical or clinical significance indices, or both, within their research reports. Shepard (2002) questions why some authors fail to point out that the magnitude of observed changes may be too small to have any clinical or practical significance: “sometimes, a supposed change may be described in some detail, but the investigator fails to disclose that the trend is not statistically significant”.
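One common way to report practical magnitude alongside a p-value is a standardized effect size such as Cohen's d. The sketch below uses hypothetical group data; with an effect this small, a "significant" p-value alone (easily attainable at a large sample size) would overstate the finding's practical importance:

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

treated = [98, 102, 105, 95, 101]  # hypothetical post-treatment scores
control = [97, 103, 104, 96, 100]  # hypothetical control scores
print(round(cohens_d(treated, control), 3))  # a trivially small effect
```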

No amount of statistical analysis, regardless of the level of the sophistication, will correct poorly defined objective outcome measurements. Whether done unintentionally or by design, this practice increases the likelihood of clouding the interpretation of findings, thus potentially misleading readers.
The basis for this issue is the urgency of reducing the likelihood of statistical error. Common challenges include the exclusion of outliers, filling in missing data, altering or otherwise changing data, data mining, and developing graphical representations of the data (Shamoo, Resnik, 2003).


At times investigators may enhance the impression of a significant finding by determining how to present derived data (as opposed to data in its raw form), which portion of the data is shown, why, how, and to whom (Shamoo, Resnik, 2003). Nowak (1994) notes that even experts do not agree in distinguishing between analyzing and massaging data. Shamoo (1989) recommends that investigators maintain a sufficient and accurate paper trail of how data were manipulated for future review.



The integrity of data analysis can be compromised by the environment or context in which data were collected, e.g., face-to-face interviews vs. focus groups. The interaction occurring within a dyadic relationship (interviewer-interviewee) differs from the group dynamic occurring within a focus group because of the number of participants and how they react to each other’s responses. Since the data collection process can be influenced by the environment/context, researchers should take this into account when conducting data analysis.

Analyses could also be influenced by the method in which data was recorded. For example, research events could be documented by:

a. recording audio and/or video and transcribing later
b. either a researcher or self-administered survey
c. either closed-ended or open-ended surveys
d. preparing ethnographic field notes from a participant/observer
e. requesting that participants themselves take notes, compile and submit them to researchers.

While each methodology employed has rationale and advantages, issues of objectivity and subjectivity may be raised when data is analyzed.

During content analysis, staff researchers or ‘raters’ may use inconsistent strategies in analyzing text material. Some ‘raters’ may analyze comments as a whole while others may prefer to dissect text material by separating words, phrases, clauses, sentences or groups of sentences. Every effort should be made to reduce or eliminate inconsistencies between “raters” so that data integrity is not compromised.

A major challenge to data integrity could occur with the unmonitored supervision of inductive techniques. Content analysis requires raters to assign topics to text material (comments). The threat to integrity may arise when raters have received inconsistent training or bring different previous training experiences. Previous experience may affect how raters perceive the material or even the nature of the analyses to be conducted. Thus, one rater could assign topics or codes to material in a way that differs significantly from another rater. Strategies to address this include clearly stating the list of analysis procedures in the protocol manual, consistent training, and routine monitoring of raters.
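A standard check on rater consistency is a chance-corrected agreement statistic such as Cohen's kappa, which discounts the agreement two raters would reach by guessing alone. The sketch below uses two hypothetical raters' codes for six comments; the category labels and data are illustrative:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each category's marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical raters coding the same six comments:
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Here raw agreement is 5/6, but the chance-corrected kappa of about 0.67 gives a more honest picture of rater reliability.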

Researchers performing either quantitative or qualitative analyses should be aware of challenges to reliability and validity. For example, in the area of content analysis, Gottschalk (1995) identifies three factors that can affect the reliability of analyzed data: stability, reproducibility, and accuracy.

The potential for compromising data integrity arises when researchers cannot consistently demonstrate the stability, reproducibility, or accuracy of their data analysis.

According to Gottschalk (1995), the validity of a content analysis study refers to the correspondence of the categories (the classifications that raters assigned to text content) to the conclusions, and the generalizability of results to a theory (did the categories support the study’s conclusions, and were the findings adequately robust to support or be applied to a selected theoretical rationale?).



Upon coding text material for content analysis, raters must classify each code into an appropriate category of a cross-reference matrix. Relying on computer software to determine a frequency or word count can lead to inaccuracies. “One may obtain an accurate count of that word's occurrence and frequency, but not have an accurate accounting of the meaning inherent in each particular usage” (Gottschalk, 1995). Further analyses might be appropriate to discover the dimensionality of the data set or identify new meaningful underlying variables.
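Gottschalk's caution is easy to demonstrate: a frequency count is blind to meaning. In the hypothetical sentence below, both senses of "bank" are tallied as the same word:

```python
import re
from collections import Counter

text = ("She went to the bank to deposit a check, "
        "then walked along the river bank at dusk.")
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)
# The count is accurate, but it conflates the financial and
# riverside senses of "bank" into a single tally.
print(counts["bank"])
```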

Whether statistical or non-statistical methods of analyses are used, researchers should be aware of the potential for compromising data integrity. While statistical analysis is typically performed on quantitative data, there are numerous analytic procedures specifically designed for qualitative material including content, thematic, and ethnographic analysis. Regardless of whether one studies quantitative or qualitative phenomena, researchers use a variety of tools to analyze data in order to test hypotheses, discern patterns of behavior, and ultimately answer research questions. Failure to understand or acknowledge data analysis issues presented can compromise data integrity.

References:

Gottschalk, L. A. (1995). Content analysis of verbal behavior: New findings and clinical applications. Hillsdale, NJ: Lawrence Erlbaum Associates.

Jeans, M. E. (1992). Clinical significance of research: A growing concern. Canadian Journal of Nursing Research, 24, 1-4.

Kendall, P. C., & Grove, W. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147-158.

Lefort, S. (1993). The statistical versus clinical significance debate. Image, 25, 57-62.

Nowak, R. (1994). Problems in clinical trials go far beyond misconduct. Science, 264(5165), 1538-1541.

Resnik, D. (2000). Statistics, ethics, and research: An agenda for education and reform. Accountability in Research, 8, 163-188.

Schroder, K. E., Carey, M. P., & Vanable, P. A. (2003). Methodological challenges in research on sexual risk behavior: I. Item content, scaling, and data analytic options. Annals of Behavioral Medicine, 26(2), 76-103.

Shamoo, A. E. (1989). Principles of research data audit. New York: Gordon and Breach.

Shamoo, A. E., & Resnik, D. B. (2003). Responsible conduct of research. New York: Oxford University Press.

Shepard, R. J. (2002). Ethics in exercise science research. Sports Medicine, 32(3), 169-183.

Silverman, S., & Manson, M. (2003). Research on teaching in physical education doctoral dissertations: A detailed investigation of focus, method, and analysis. Journal of Teaching in Physical Education, 22(3), 280-297.

Smeeton, N., & Goda, D. (2003). Conducting and presenting social work research: Some basic statistical considerations. British Journal of Social Work, 33, 567-573.

Thompson, B., & Noferi, G. (2002). Statistical, practical, clinical: How many types of significance should be considered in counseling research? Journal of Counseling & Development, 80(4), 64-71.

 

Research Data – Types Methods and Examples

Research Data

Research data refers to any information or evidence gathered through systematic investigation or experimentation to support or refute a hypothesis or answer a research question.

It includes both primary and secondary data, and can be in various formats such as numerical, textual, audiovisual, or visual. Research data plays a critical role in scientific inquiry and is often subject to rigorous analysis, interpretation, and dissemination to advance knowledge and inform decision-making.

Types of Research Data

There are generally four types of research data:

Quantitative Data

This type of data involves the collection and analysis of numerical data. It is often gathered through surveys, experiments, or other types of structured data collection methods. Quantitative data can be analyzed using statistical techniques to identify patterns or relationships in the data.

Qualitative Data

This type of data is non-numerical and often involves the collection and analysis of words, images, or sounds. It is often gathered through methods such as interviews, focus groups, or observation. Qualitative data can be analyzed using techniques such as content analysis, thematic analysis, or discourse analysis.

Primary Data

This type of data is collected by the researcher directly from the source. It can include data gathered through surveys, experiments, interviews, or observation. Primary data is often used to answer specific research questions or to test hypotheses.

Secondary Data

This type of data is collected by someone other than the researcher. It can include data from sources such as government reports, academic journals, or industry publications. Secondary data is often used to supplement or support primary data or to provide context for a research project.

Research Data Formats

There are several formats in which research data can be collected and stored. Some common formats include:

  • Text : This format includes any type of written data, such as interview transcripts, survey responses, or open-ended questionnaire answers.
  • Numeric : This format includes any data that can be expressed as numerical values, such as measurements or counts.
  • Audio : This format includes any recorded data in an audio form, such as interviews or focus group discussions.
  • Video : This format includes any recorded data in a video form, such as observations of behavior or experimental procedures.
  • Images : This format includes any visual data, such as photographs, drawings, or scans of documents.
  • Mixed media: This format includes any combination of the above formats, such as a survey response that includes both text and numeric data, or an observation study that includes both video and audio recordings.
  • Sensor Data: This format includes data collected from various sensors or devices, such as GPS, accelerometers, or heart rate monitors.
  • Social Media Data: This format includes data collected from social media platforms, such as tweets, posts, or comments.
  • Geographic Information System (GIS) Data: This format includes data with a spatial component, such as maps or satellite imagery.
  • Machine-Readable Data : This format includes data that can be read and processed by machines, such as data in XML or JSON format.
  • Metadata: This format includes data that describes other data, such as information about the source, format, or content of a dataset.
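To make the last two formats concrete, here is a small Python sketch using a hypothetical survey dataset: the records and the metadata describing them travel together in one machine-readable JSON document that any JSON-aware tool can parse. (The dataset and field names are illustrative, not from any real study.)

```python
import json

# Hypothetical example: a small survey dataset stored as machine-readable JSON,
# bundled with metadata that describes its source, format, and content.
dataset = {
    "metadata": {
        "title": "Customer Satisfaction Survey 2024",
        "source": "online survey",
        "format": "JSON",
        "variables": ["respondent_id", "age", "satisfaction_score"],
    },
    "records": [
        {"respondent_id": 1, "age": 34, "satisfaction_score": 4},
        {"respondent_id": 2, "age": 28, "satisfaction_score": 5},
    ],
}

# Serialize to a machine-readable string that other programs can exchange.
encoded = json.dumps(dataset, indent=2)

# Any tool that reads JSON can recover both the data and its metadata.
decoded = json.loads(encoded)
print(decoded["metadata"]["title"])  # → Customer Satisfaction Survey 2024
```

Storing the metadata alongside the records in this way means the dataset stays self-describing when it is shared or archived.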

Data Collection Methods

Some common research data collection methods include:

  • Surveys : Surveys involve asking participants to answer a series of questions about a particular topic. Surveys can be conducted online, over the phone, or in person.
  • Interviews : Interviews involve asking participants a series of open-ended questions in order to gather detailed information about their experiences or perspectives. Interviews can be conducted in person, over the phone, or via video conferencing.
  • Focus groups: Focus groups involve bringing together a small group of participants to discuss a particular topic or issue in depth. The group is typically led by a moderator who asks questions and encourages discussion among the participants.
  • Observations : Observations involve watching and recording behaviors or events as they naturally occur. Observations can be conducted in person or through the use of video or audio recordings.
  • Experiments : Experiments involve manipulating one or more variables in order to measure the effect on an outcome of interest. Experiments can be conducted in a laboratory or in the field.
  • Case studies: Case studies involve conducting an in-depth analysis of a particular individual, group, or organization. Case studies typically involve gathering data from multiple sources, including interviews, observations, and document analysis.
  • Secondary data analysis: Secondary data analysis involves analyzing existing data that was collected for another purpose. Examples of secondary data sources include government records, academic research studies, and market research reports.

Analysis Methods

Some common research data analysis methods include:

  • Descriptive statistics: Descriptive statistics involve summarizing and describing the main features of a dataset, such as the mean, median, and standard deviation. Descriptive statistics are often used to provide an initial overview of the data.
  • Inferential statistics: Inferential statistics involve using statistical techniques to draw conclusions about a population based on a sample of data. Inferential statistics are often used to test hypotheses and determine the statistical significance of relationships between variables.
  • Content analysis : Content analysis involves analyzing the content of text, audio, or video data to identify patterns, themes, or other meaningful features. Content analysis is often used in qualitative research to analyze open-ended survey responses, interviews, or other types of text data.
  • Discourse analysis: Discourse analysis involves analyzing the language used in text, audio, or video data to understand how meaning is constructed and communicated. Discourse analysis is often used in qualitative research to analyze interviews, focus group discussions, or other types of text data.
  • Grounded theory : Grounded theory involves developing a theory or model based on an analysis of qualitative data. Grounded theory is often used in exploratory research to generate new insights and hypotheses.
  • Network analysis: Network analysis involves analyzing the relationships between entities, such as individuals or organizations, in a network. Network analysis is often used in social network analysis to understand the structure and dynamics of social networks.
  • Structural equation modeling: Structural equation modeling involves using statistical techniques to test complex models that include multiple variables and relationships. Structural equation modeling is often used in social science research to test theories about the relationships between variables.
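As a minimal illustration of the first two methods above, the Python sketch below computes descriptive statistics for a hypothetical set of survey scores, then takes one simple inferential step: a rough 95% confidence interval for the population mean. (The scores are made up, and the normal approximation is a simplification; a t-distribution would be more appropriate for a sample this small.)

```python
import math
import statistics

# Hypothetical sample: scores from a small survey (illustrative numbers only).
scores = [72, 85, 90, 68, 77, 85, 93, 80]

# Descriptive statistics summarize the main features of the dataset.
mean = statistics.mean(scores)      # central tendency
median = statistics.median(scores)  # middle value
stdev = statistics.stdev(scores)    # spread (sample standard deviation)

# A simple inferential step: an approximate 95% confidence interval for the
# population mean, using the normal approximation (z = 1.96).
se = stdev / math.sqrt(len(scores))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(f"mean={mean:.2f}, median={median:.1f}, stdev={stdev:.2f}")
print(f"approx. 95% CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The descriptive step characterizes the sample itself; the confidence interval is the inferential step, generalizing from the sample to the population it was drawn from.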

Purpose of Research Data

Research data serves several important purposes, including:

  • Supporting scientific discoveries : Research data provides the basis for scientific discoveries and innovations. Researchers use data to test hypotheses, develop new theories, and advance scientific knowledge in their field.
  • Validating research findings: Research data provides the evidence necessary to validate research findings. By analyzing and interpreting data, researchers can determine the statistical significance of relationships between variables and draw conclusions about the research question.
  • Informing policy decisions: Research data can be used to inform policy decisions by providing evidence about the effectiveness of different policies or interventions. Policymakers can use data to make informed decisions about how to allocate resources and address social or economic challenges.
  • Promoting transparency and accountability: Research data promotes transparency and accountability by allowing other researchers to verify and replicate research findings. Data sharing also promotes transparency by allowing others to examine the methods used to collect and analyze data.
  • Supporting education and training: Research data can be used to support education and training by providing examples of research methods, data analysis techniques, and research findings. Students and researchers can use data to learn new research skills and to develop their own research projects.

Applications of Research Data

Research data has numerous applications across various fields, including social sciences, natural sciences, engineering, and health sciences. The applications of research data can be broadly classified into the following categories:

  • Academic research: Research data is widely used in academic research to test hypotheses, develop new theories, and advance scientific knowledge. Researchers use data to explore complex relationships between variables, identify patterns, and make predictions.
  • Business and industry: Research data is used in business and industry to make informed decisions about product development, marketing, and customer engagement. Data analysis techniques such as market research, customer analytics, and financial analysis are widely used to gain insights and inform strategic decision-making.
  • Healthcare: Research data is used in healthcare to improve patient outcomes, develop new treatments, and identify health risks. Researchers use data to analyze health trends, track disease outbreaks, and develop evidence-based treatment protocols.
  • Education : Research data is used in education to improve teaching and learning outcomes. Data analysis techniques such as assessments, surveys, and evaluations are used to measure student progress, evaluate program effectiveness, and inform policy decisions.
  • Government and public policy: Research data is used in government and public policy to inform decision-making and policy development. Data analysis techniques such as demographic analysis, cost-benefit analysis, and impact evaluation are widely used to evaluate policy effectiveness, identify social or economic challenges, and develop evidence-based policy solutions.
  • Environmental management: Research data is used in environmental management to monitor environmental conditions, track changes, and identify emerging threats. Data analysis techniques such as spatial analysis, remote sensing, and modeling are used to map environmental features, monitor ecosystem health, and inform policy decisions.

Advantages of Research Data

Research data has numerous advantages, including:

  • Empirical evidence: Research data provides empirical evidence that can be used to support or refute theories, test hypotheses, and inform decision-making. This evidence-based approach helps to ensure that decisions are based on objective, measurable data rather than subjective opinions or assumptions.
  • Accuracy and reliability : Research data is typically collected using rigorous scientific methods and protocols, which helps to ensure its accuracy and reliability. Data can be validated and verified using statistical methods, which further enhances its credibility.
  • Replicability: Research data can be replicated and validated by other researchers, which helps to promote transparency and accountability in research. By making data available for others to analyze and interpret, researchers can ensure that their findings are robust and reliable.
  • Insights and discoveries : Research data can provide insights into complex relationships between variables, identify patterns and trends, and reveal new discoveries. These insights can lead to the development of new theories, treatments, and interventions that can improve outcomes in various fields.
  • Informed decision-making: Research data can inform decision-making in a range of fields, including healthcare, business, education, and public policy. Data analysis techniques can be used to identify trends, evaluate the effectiveness of interventions, and inform policy decisions.
  • Efficiency and cost-effectiveness: Research data can help to improve efficiency and cost-effectiveness by identifying areas where resources can be directed most effectively. By using data to identify the most promising approaches or interventions, researchers can optimize the use of resources and improve outcomes.

Limitations of Research Data

Research data has several limitations that researchers should be aware of, including:

  • Bias and subjectivity: Research data can be influenced by biases and subjectivity, which can affect the accuracy and reliability of the data. Researchers must take steps to minimize bias and subjectivity in data collection and analysis.
  • Incomplete data : Research data can be incomplete or missing, which can affect the validity of the findings. Researchers must ensure that data is complete and representative to ensure that their findings are reliable.
  • Limited scope: Research data may be limited in scope, which can limit the generalizability of the findings. Researchers must carefully consider the scope of their research and ensure that their findings are applicable to the broader population.
  • Data quality: Research data can be affected by issues such as measurement error, data entry errors, and missing data, which can affect the quality of the data. Researchers must ensure that data is collected and analyzed using rigorous methods to minimize these issues.
  • Ethical concerns: Research data can raise ethical concerns, particularly when it involves human subjects. Researchers must ensure that their research complies with ethical standards and protects the rights and privacy of human subjects.
  • Data security: Research data must be protected to prevent unauthorized access or use. Researchers must ensure that data is stored and transmitted securely to protect the confidentiality and integrity of the data.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer



Qualitative Data Analysis

  • Paul Mihas, University of North Carolina at Chapel Hill
  • https://doi.org/10.1093/acrefore/9780190264093.013.1195
  • Published online: 23 May 2019

Qualitative analysis—the analysis of textual, visual, or audio data—covers a spectrum from confirmation to exploration. Qualitative studies can be directed by a conceptual framework, suggesting, in part, a deductive thrust, or driven more by the data itself, suggesting an inductive process. Generic or basic qualitative research refers to an approach in which researchers are simply interested in solving a problem, effecting a change, or identifying relevant themes rather than attempting to position their work in a particular epistemological or ontological paradigm.

Other qualitative traditions include grounded theory, narrative analysis, and phenomenology. Grounded theory encompasses several approaches, including objectivist and constructivist traditions, and commonly invites researchers to theorize a process and perhaps identify its contexts and consequences. Narrative analysis is an approach that treats stories not only as representations of events but as narrative events in themselves. Researchers using this approach analyze the form and content of narrative data and examine how these elements serve the storyteller and the story. Other elements often considered include plot, genre, character, values, resolutions, and motifs. Phenomenology is an approach designed to “open up” a phenomenon and make sense of its invariant structure, its identifiable essence across all narrative accounts. In this approach, the focus is on the lived experiences of those deeply familiar with the phenomenon and how they experience the phenomenon as they are going through it, before it is categorized and conceptualized. Each tradition has its own investigative emphasis and particular tools for analysis—specific approaches to coding, memo writing, and final products, such as diagrams, matrices, and condensed reports.

  • qualitative analysis
  • basic qualitative research
  • generic qualitative research
  • grounded theory
  • phenomenology
  • narrative analysis
  • memo writing
  • qualitative approaches
  • qualitative design research methods


Printed from Oxford Research Encyclopedias, Education. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).


Accelerating Research with AI

September 27, 2024

This article focuses on how publicly available AI tools can help UX researchers in their studies. You’ll want to use your AI tools more heavily for certain stages of research projects and avoid them for others.

In This Article:

  • How AI Can Help in Research Projects
  • Planning Studies
  • Conducting Research
  • Analyzing Data
  • Reporting Research
  • AI Can’t Do Your Job for You

How AI Can Help in Research Projects

Let’s look at the tasks where AI tools are most likely to be helpful.

Planning Studies

This is (by far) where AI is currently most helpful for UX research.

  • Desk research: AI can help with starting research and gathering sources. Watch out for inaccurate information and made-up sources.
  • Ideation during planning: AI can help with generating possible research goals, method options, interview or survey questions, and usability-testing tasks. Watch out for violations of best practices.
  • Documentation: AI can help with tailoring research-related documentation for a study, including consent forms, observer instructions, notetaking instructions, and facilitation scripts. Watch out for mistakes in filling out templates provided to the system.

Desk Research

We don’t always need to spend our time on primary research. Often, we can find useful guidance in existing findings from inside or outside our own field. AI can function as an excellent desk researcher, but with some serious limitations.

AI can work as a starting point for your desk research, but not as the final source. It can assist in exploring a problem space or topic, but don’t use it as a fact-checker.


  • Always ask AI systems to cite primary sources, and then go check those sources.
  • Use tools specifically designed for information seeking (such as Perplexity or ScholarAI), but remember that no generative AI system will be free of misinformation, bias, or hallucinations .

Ideation During Planning

Planning studies requires a fair amount of ideation work — thinking up screener questions, exploring ways to set up the study, and drafting tasks. AI excels at these.

For example, while planning a diary study on NN/g’s weekly newsletter, we asked ChatGPT to come up with different questions for study participants.

Here’s the prompt we used.

Prompt: Generate 15 different questions I might ask in the daily respondent survey. Then, review the list, and choose the best 5–10 questions you think would serve my study goals best. Order those questions in a way that will make logical sense to my respondents.

Here are the first 4 questions from the list of 8 that ChatGPT offered.

ChatGPT Response:

1. Which UX newsletters did you receive today? (Starts by setting the context and identifying other newsletters for comparison.)
2. Did you open the NN/g newsletter today? Why or why not? (Moves into understanding the decision to engage or not with the NN/g newsletter.)
3. What specific elements of the NN/g newsletter caught your attention? (Dives into what worked or didn’t in capturing the respondent’s attention.)
4. How much time did you spend reading the NN/g newsletter today? (Assesses the level of engagement and time commitment.)

A decent start! These questions aren’t perfect. We’ll want to remove some, rewrite others, and add new ones. But they’re enough to get us started.  

  • Ask your AI tool to follow established best practices when generating options for tasks or questions. If you don’t like the results, you may need to explicitly list the characteristics you want the output to have.
  • Ideally, have a human research expert review your final list of ideas. If you’re an expert, that could be you. If you’re new to research, contact a more experienced researcher for guidance.

Documentation

Research planning requires some (often tedious) documentation preparation: study plans, screeners, consent forms, observer instructions, and so on. AI tools can immensely speed this process up.

However, provide your AI tool with a template to start with. Remember: AI systems can’t always tell good advice from bad. Giving them a solid starting point will help them avoid mistakes.

Let’s say we’re working with ChatGPT to produce documentation for our newsletter study. After communicating important study details (like method, goals, study design, and participant details), we might upload NN/g’s standard consent-form template and provide the following prompt.

Example Prompt: Based on the study details provided above, customize a consent form for my study. Follow the format of the attached consent-form template. Do you need any additional details before you begin?

  • When asking AI tools to complete research-related documentation, provide them with a template as a starting point.
  • Watch out for mistakes in how the system completes the documentation. For example, double-check that the correct data-collection permissions are outlined in your consent form.

Conducting Research

Current AI tools have limited usefulness during moderated studies: they are not currently capable of observing, facilitating, or analyzing usability testing.

  • Notetaking during interviews: Meeting notetakers can document conversations in real time. Watch out for misunderstandings or comments misattributed to the wrong speaker.
  • Conducting interviews: Emerging tools (e.g., Versive) can ask custom followup questions in real time. Watch out for shallow insights.
  • Notetaking, facilitating, or analyzing usability testing: AI tools can’t currently “watch” usability tests, whether live or recorded. Watch out for misleading claims about an AI’s abilities to run user tests or analyze behavioral data.

Conducting Qualitative Behavioral Studies (Like User Testing)

Behavioral data is about what users do. While current AI tools are superb at processing text, they cannot understand and interpret users’ actions or nonverbal interactions with an interface.

So, AI tools can’t observe usability testing or field studies.

While AI systems are capable of analyzing video (text and object recognition, facial-expression interpretation, etc.), we have yet to see an AI tool that can “watch” usability tests. Beyond technical limitations, observing or facilitating a usability test requires excellent contextual awareness — something that current tools lack.

As a result, don’t expect AI tools to be capable of facilitating or even proper notetaking during usability testing. Some products market themselves as having this capacity. So far, all the tools that we’ve tested that make that claim simply analyze a usability-testing transcript — not what the user actually did in the session. That is not good enough. People often say one thing but do another.

(However, AI can help with processing usability testing data after it’s been processed by a human. For example, a researcher may use an unmoderated testing tool to collect quantitative data from a usability test. That human researcher will need to check the data and video recordings to catch any problem participants. But once the data set is ready, AI can help the researcher perform statistical analysis on that data.)

Conducting Attitudinal Studies (like Interviews)

While generative AI tools can’t handle behavioral data yet, they do much better with self-reported or attitudinal data gathered through methods like interviews, diary studies, and surveys. This is because that data is language-based.

AI systems can act as a backup notetaker during interviews. General-purpose meeting notetakers (like Otter.ai) can transcribe conversations in real time and summarize primary points of discussion. However, like every other AI tool, they aren’t perfect. These products often misunderstand the context or what’s most important, and they can get confused about who’s speaking.

Generative AI tools do have great potential for conducting interviews at scale — especially structured or semistructured interviews, which follow a script, with some tailored followup questions. However, it’s unclear if these AI tools will be able to deliver the same level of critical rapport building as a human interviewer.

At the time of writing, there are only a few tools offering real-time AI-generated interviewing. (Versive and Outset are two, though we might categorize their method as more of a survey with custom followup questions than a real interview.)

  • Consider using an AI assistant for notetaking during interviews, especially if you are a UX team of one.
  • However, live notetaking isn’t necessary if you’re using an analysis tool that provides transcription (covered in the next section).

Analyzing Data

This section focuses on analyzing text-based data gathered from methods like interviews, surveys, and diary studies, as well as numerical data.

  • Transcribing and summarizing interviews: AI features transcribe conversation recordings with linked timestamps and summarize key points. Watch out for higher error rates for some languages or accents, as well as misunderstandings or omissions.
  • Cleaning and sanitizing data: Tools will prepare raw data and scrub it of any personally identifying information. Watch out for mistakes.
  • Preliminary coding and clustering of qualitative data: AI features can take a first pass through your data looking for commonalities or rough themes.
  • Assisting in quantitative analysis: AI can advise on the correct statistical procedures and conduct some steps in the analysis.

Timestamps and Summaries

Many researchers have been using AI-based video transcription for years. This feature seems to be improving, particularly in its ability to handle more languages and accents. Particularly useful are the timestamps linking the transcript text to relevant moments in the video recording.

Another AI-based feature that is becoming mainstream in research tools is transcript summarization.


  • Double-check summaries: AI-powered transcription and summarization features can misunderstand context.
  • Watch out for missing summaries for sections of the session or interview.

Cleaning and Sanitizing Data

Some research-analysis tools will scrub any personally identifying information (names, email addresses, credit card numbers, etc.) from raw data, thus helping us protect participant data while reducing work during a tedious and time-consuming step of analysis — particularly for quantitative studies with large amounts of data.

These tools do make mistakes, however. For example, in one study about research tools, an AI feature removed all tool names from an interview. The system mistakenly thought it was protecting the participant’s privacy by removing the name of the company they worked for, but in reality, it removed information about which tool the participant was reviewing.
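As a rough sketch of what such sanitization does (not how any particular tool implements it), the Python snippet below redacts two common PII patterns with regular expressions. The patterns and sample text are illustrative; real tools use far more sophisticated, often ML-based, detection.

```python
import re

# A minimal PII-scrubbing sketch, assuming simple regex patterns suffice.
# (Real research tools detect many more categories, far more robustly.)
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace detected PII with labeled placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

raw = "Contact me at jane.doe@example.com or 555-123-4567."
print(scrub(raw))  # → Contact me at [EMAIL REDACTED] or [PHONE REDACTED].
```

Even a toy scrubber like this shows why human review matters: a pattern that is too broad will redact useful content, while one that is too narrow will leak identifying details.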

Preliminary Coding and Clustering of Qualitative Data

If you have transcripts, AI can identify codes (or tags) for your data by paying attention to common language used in your data set. Not all AI-generated codes will be useful to you, but they may provide a starting point.
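To illustrate what a first pass of codes over transcript data looks like, here is a deliberately naive keyword-matching sketch in Python. The codebook and transcript lines are hypothetical, and real AI coding features are considerably more nuanced than substring matching.

```python
from collections import defaultdict

# Hypothetical codebook mapping rough codes to keywords (illustrative only).
CODEBOOK = {
    "pricing": ["price", "cost", "expensive", "cheap"],
    "usability": ["confusing", "easy", "hard to use", "intuitive"],
}

def code_transcript(lines):
    """Return {code: [matching lines]} for a list of transcript lines."""
    coded = defaultdict(list)
    for line in lines:
        lowered = line.lower()
        for code, keywords in CODEBOOK.items():
            # Tag the line with a code if any of its keywords appear.
            if any(kw in lowered for kw in keywords):
                coded[code].append(line)
    return dict(coded)

transcript = [
    "The signup flow was really confusing.",
    "I thought the price was fair.",
]
print(code_transcript(transcript))
```

A human analyst would then review, merge, and rename these rough codes; this is exactly the "initial pass" role the article describes for AI-generated codes.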


If your team takes notes in a white-boarding tool like Miro, you can use AI features to cluster stickies. The accuracy of the clustering depends on the clarity of the text in the stickies.

AI-generated clusters are rarely perfect — for example, many stickies will be grouped in an “Other” category. However, having the AI take a first pass through your items can accelerate the analysis process.

  • To get better results at this stage, provide context. Tools like Dovetail allow you to provide your research questions, which are critical for the system to return codes of any value at all.
  • Make sure your notes, stickies, or highlights are complete enough that the feature can make sense of them.

Assisting in Quantitative Analysis

AI tools can speed up quantitative analysis by either advising on the correct statistical procedures or by conducting steps in the analysis, including:

  • Handling missing or incomplete data records
  • Transforming or sanitizing raw data
  • Descriptive or inferential statistics
  • Rough sentiment analysis

They can even generate some decent data-visualization charts from your data.
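A minimal sketch of two of those steps — handling missing records, then descriptive statistics — on hypothetical task-time data (the numbers are invented for illustration):

```python
import statistics

# Hypothetical task-completion times in seconds; None marks a missing measurement.
task_times = [42.0, None, 38.5, 51.2, None, 44.7]

# Step 1: handle missing data (here, by simple listwise deletion).
complete = [t for t in task_times if t is not None]

# Step 2: descriptive statistics on the cleaned data.
print(f"n={len(complete)}, mean={statistics.mean(complete):.1f}s")  # → n=4, mean=44.1s
```

Listwise deletion is only one way to handle missing data; spot-checking whichever strategy an AI tool applies is exactly the kind of verification the tips below call for.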

  • Be sure to thoroughly spot-check any analysis or data processing you ask the AI to perform.
  • Ask the AI tool to follow data-visualization best practices when creating charts.

Limitations of AI Analysis

Never rely on AI tools to perform all your analysis for you. AI is stochastic — it can choose to pay attention to certain things but disregard others. That might mean it’ll focus on the wrong aspects of your data. It might miss, misinterpret, or even manufacture insights.

For thematic analysis , a human perspective is needed to connect the dots in ways AI (currently) can’t. When analyzing qualitative data, a good human researcher will consider contextual questions like:

  • How does this participant’s statement contrast with what else they said?
  • What’s this person’s underlying mental model?
  • How was this data collected? Did the interviewer accidentally prime the participant?
  • Might the participant have felt embarrassed to tell the truth?
  • Was this participant not a good fit for the study’s recruitment criteria?

That level of complex, context-informed consideration is beyond the capacity of current AI tools.

  • Use AI transcription, summarization, and coding features to speed up the initial steps in your analysis process.
  • Remember that AI systems can handle data from interviews, surveys, and diary studies, but they can’t observe usability testing or watch video clips like a human can.
  • Treat AI’s coding as an initial pass. A human still needs to make sense of the data and translate it into insights.
  • Don’t attempt to use AI analysis tools for usability testing.

Reporting Research

  • Drafting deliverables: AI can generate elements or first drafts of deliverables like personas or journey maps. Watch out for made-up data or details that are not rooted in research.
  • Copyediting and revision: AI can polish or shorten text. Watch out for outputs that don’t fit your audience’s communication style.
  • Summarizing research findings: In some research repositories, AI chats can summarize findings and provide quick answers. Watch out for hallucinations.

AI chatbots can be helpful writing assistants for any kind of communication, including research reports, summaries, or artifacts. Clear communication is essential for building buy-in with stakeholders. Consider asking AI systems (such as Claude, which seems particularly good at this) to help with:

  • Grammar and copyediting
  • Tailoring communication to your specific audience
  • Adjusting tone of voice
  • Avoiding UX jargon (especially for a non-UX audience)

You can also use AI tools to get started on deliverables like user personas or journey maps — as long as they’re based on real research.

Finally, we’re excited about the potential for AI tools to improve the dissemination of research findings within organizations. We’re already seeing improvements in findability in many research-repository tools, as well as in general-purpose knowledge tools like Notion.

[Screenshot: a user types a question into a Notion AI chat window.]

Rather than having to search a research repository only by keywords and then sift through a collection of tags, clips, or highlighted notes, your stakeholders can ask questions. AI tools can search your data, synthesize it, and curate a response that answers a particular question. (This is a case where an AI chatbot is actually a good solution.)

These features can support better adoption, but teams may still need to reach out to researchers for a thorough explanation of what the research means, what its limitations are, and whether further research is needed.

AI Can’t Do Your Job for You

Unless you’re comfortable with producing intern-level work, don’t expect AI tools to do your work for you. Human oversight, guidance, and review are still critical.

Current AI tools have many limitations. Like interns, they work best when you provide ample instructions, context, constraints, and corrections.

Despite the annoying need to double-check the output of AI systems, these tools still have the potential to accelerate your UX-research workflows. They are becoming increasingly important in a world where researchers often struggle to keep up with fast-paced work environments.


Related Topics

  • Artificial Intelligence Artificial Intelligence
  • Research Methods

Learn More:

what is data analysis in academic research

You're Not Too Late to Use AI

Caleb Sponheim · 3 min

what is data analysis in academic research

AI Isn't Ready for UX Design

what is data analysis in academic research

AI on Intranets: 5 Valuable Features

Anna Kaley · 3 min

Related Articles:

Leverage AI for Mock Tables and Charts When Testing Prototypes

Evan Sunwall · 13 min

When Should We Trust AI? Magic-8-Ball Thinking

Caleb Sponheim · 8 min

Prompt Controls in GenAI Chatbots: 4 Main Uses and Best Practices

Feifei Liu · 11 min

Planning Research with Generative AI

Maria Rosala · 7 min

Artificial Intelligence: Glossary

Caleb Sponheim · 5 min

Synthetic Users: If, When, and How to Use AI-Generated “Research”

Maria Rosala and Kate Moran · 14 min

Voices of Women Exposed to Honour-Based Violence: On Vulnerability, Needs and Support from Social Services

Helén Olsson, Voices of Women Exposed to Honour-Based Violence: On Vulnerability, Needs and Support from Social Services, The British Journal of Social Work , Volume 54, Issue 6, September 2024, Pages 2623–2641, https://doi.org/10.1093/bjsw/bcae044


Swedish social services play a pivotal role in preventing men’s violence against women, including honour-based violence. This type of family-based violence is based on norms that disregard human rights. Individuals growing up in an honour context have limited possibilities to choose their own paths in life. The study comprises young women aged eighteen to twenty-five who look back on their vulnerable positions and the circumstances resulting in their seeking help from social services. They also share their experiences of the support that they had received from social services. Semi-structured interviews with ten women were conducted. The data were analysed through a qualitative content analysis. The findings show that the interviewees are not a homogenous group; circumstances, needs and exposure varied considerably between participants. Professional attention is necessary when threats ensue after divorce, for example, in the case of a bride price refund. Child perspectives must be better considered in the event of a divorce. Women or families that reject standards of honour are subject to harassment in public by people aiming to coerce them into adopting honour norms. Leaving the family was facilitated when women’s fears were taken seriously and they were met with understanding and care.

Swedish social services are increasingly paying attention to children and young adults under the threat of honour-based violence (HBV). In an honour context, the status and honour of a family and social group depend on the chaste and respectable behaviour of girls and women ( Bhanbhro et al. , 2013 ; Idriss, 2017 ). Boys and men are often expected or forced to maintain the hierarchical male order and ensure it is upheld ( Gill, 2008 ). Research has increasingly drawn attention to men’s victimisations in environments where HBV is present. Men can be victims of the same violence as women because of their ‘bodies’, discovery of their sexual orientation and being forced into marriage. Further, state reforms are more focused on protecting victimised women. Victimised men are not noticed to the same extent, and their opportunities for support services are more limited ( Idriss, 2021 ).

Standards of honour involve controlling the sexuality of girls and women, as well as protecting girls’ virginity until marriage ( Lidman and Hong, 2018 ). If violations occur, the collective face must be saved and restored by punishing the norm-breaker, often through psychological or physical violence ( Sedem and Ferrer-Wreder, 2014 ). Women may be controlled through everyday restrictions, for example, by being prohibited from wearing Western clothing, having mobile phones or using social networks, or by not being allowed to choose with whom they will spend time ( Dickson, 2014 ; Björktomta, 2019 ; Strid et al. , 2021 ). Breaches of honour norms may result in deadly violence; the UNFPA (2000) estimated that around 5,000 girls and women around the world are killed each year in the name of honour by members of their families. According to global estimates, 650 million girls and women are forced into marriage ( UNICEF, 2021 ). The proportion of girls and women who have been exposed to genital mutilation is estimated to amount to 200 million ( UNICEF, 2020 ).

Honour culture also finds expression in female genital mutilation as well as child and forced marriage (cf Bhanbhro et al. , 2016 ; Gangoli et al. , 2018 ). Yet there are wide varieties of honour norms; female genital mutilation and child marriage do not necessarily form part of honour cultures. Research has also shown that honour norms are not always the motive for child or forced marriage. Poverty is a factor in many low- and middle-income countries because parents are relieved of the financial burden of their daughters. In addition, marriage may hold economic advantages in the form of the bride price the family receives when their daughter is married off ( Lee-Rife et al. , 2012 ).

Several studies have shown that HBV cannot be related to a specific religion, ethnicity or country ( Gill, 2008 ; Dickson, 2014 ; Bhanbhro et al. , 2016 ; Lidman and Hong, 2018 ). Patriarchal honour codes occur in Arabia, the Horn of Africa and South Asia. Cohen and Nisbett (1994) found that the masculinity norms of the southern US states show significant overlap with honour cultures. Patriarchal honour codes based on women’s subjugation and on tying women’s behaviour to men’s honour and status can also be found in Latin American cultures ( Dietrich and Schuett, 2013 ). The Swedish Gender Equality Agency advocates a broader perspective on HBV in relation to the EU Anti-Trafficking Directive (2011/36/EU). Financial transactions or so-called bride purchases in connection with child and forced marriages largely overlap with what is regarded as human trafficking in the EU directive. Swedish authorities have identified individuals exposed to HBV as particularly vulnerable. These may include persons, often from Eastern Europe, who are exploited for financial gain in connection with child and forced marriage; sometimes, they are forced to beg, and occasionally, some of the exposed have functional impairment ( Olsson et al. , 2022 ).

There are similarities as well as differences between intimate partner violence and HBV: both are based on hierarchical, patriarchal, gender-based power structures, but there are some differences in expression ( Gill, 2008 ; Idriss, 2017 ). HBV has a collective character where both men and women from a family or clan come together to punish a person who is considered to violate the group’s rigid gender roles. This distinction should be borne in mind when comparing individualist and collectivist approaches to family patterns and child-rearing. Kağıtçıbaşı (2017) has pointed out that both types of systems can provide children with warm, nurturing environments or could conversely be characterised by control, oppression and violence.

In Sweden, 40,000 girls and women, of which 7,400 are minors, are estimated to have been genitally mutilated before arriving in Sweden ( National Board of Health and Welfare, 2016 ). Swedish statistics indicate that 15 per cent of the approximately twenty murders annually classified as domestic violence are related to HBV ( National Board of Health and Welfare, 2018 ). A Swedish investigation including three surveys of 6,002 fifteen-year-olds in the country’s three largest cities has shown that every sixth teenager lives in an honour context ( Strid et al. , 2021 ).

It is difficult to estimate how many children and adults are exposed to violence in their families; the problem is probably largely unrecorded. Linell (2017) found that children avoid disclosing abuse because they do not know their rights, do not understand the abuse as wrong, feel too young or are afraid of the consequences of a report. Dealing with violence in close relationships is not an unusual task at Swedish social services; violence often occurs in ethnically Swedish families. It should be noted that cases where HBV is present are not as common in Swedish social services. BRÅ (2023a) states that only 10 per cent of reported crimes involving violence against children are related to HBV. Swedish statistics on crimes of honour reported to the police (2020 to November 2023) show that there were 163 reports regarding coercion of marriage and misleading marriage, forty-five reports regarding child marriage and 166 reports regarding female genital mutilation. The majority of those exposed were girls aged 13–17, and the suspected parents were relatively new arrivals to Sweden. A large proportion of the suspected parents have mental or social problems ( BRÅ, 2023b , the statistics of reported crimes).

Based on international human rights agreements specifically focused on violence against women and children, Swedish legislation for social work professionals has been clarified. Clarifications have, for instance, resulted from Sweden’s incorporation of the Convention on the Rights of the Child into law in 2020, the new criminal classification of honour-based oppression ( Swedish Government Bill, 2021/22 ), as well as revised regulations on how public authorities should address such violence ( HSLF-FS 2022:39 ).

Despite stricter legislation, research indicates that HBV is still complex and challenging to handle and remains associated with multiple dilemmas ( Olsson and Bergman, 2021 ). Some studies have shown that public authorities, by differentiating between intimate partner violence and HBV, operate according to stereotypical ideas that stigmatise and categorise people with migrant backgrounds ( Baianstovu and Strid, 2024 ). Dickson (2014) has argued that this has created uncertainty, leading to a lack of active initiatives against HBV from authorities.

The aim of the study is twofold: to examine the circumstances that lead individuals exposed to HBV to seek help from social services, and to explore their experiences of the support and relief offered by social services.

As HBV is a complex issue that affects individuals’ living conditions in many respects, scholarship on the subject is multifaceted. This is a strength, but it also makes obtaining an overview of the research field and its findings challenging. Much of this research has focused on societal power differentiations and the underlying causes of HBV. This study, by contrast, does not examine the underlying causes of HBV; it aims to shed light on an understudied area, namely how individuals exposed to violence perceive the support received from social work agencies in Sweden. To our knowledge, the specific experiences related to support from social services have not been examined thus far. The focus is on field experience rather than formal hypothesis or theory testing.

A qualitative exploratory design was chosen based on the epistemological assumption that this type of research involves narratives through which people contribute to a broader understanding of a phenomenon based on their contexts, experiences and words ( Patton, 2015 ). The study employed purposeful sampling since the stories of exceptionally knowledgeable key informants opened a window into their lives and conditions ( Patton, 2015 , p. 13).

Sample and procedures

At first, the aim was to recruit as many men as women, but social services indicated that it is highly unusual for boys or men subject to HBV to seek help or be identified via reports of concern. There are several explanations for the absence of men in this study. Studies show that boys do not seem to seek help to the same extent as girls because they do not believe that professionals understand the problem ( Strid et al. , 2021 ). Further, as long as the vulnerability of boys and men is not recognised, traditional masculine norms are allowed in which a real man is not considered a victim ( Idriss, 2021 ).

Inclusion criteria were young people aged eighteen to twenty-five who had had their first contact with social services at least six months before the interview, and who currently or previously needed support because of being subject to HBV. Clients could either have contacted social services on their own initiative or have been identified to social services via so-called reports of concern, that is, reported suspicions that a child is being neglected or abused. The exclusion criterion was clients deemed at risk of deteriorating mental health if they were to participate.

Managers from fifteen social services in mid-Sweden assisted with the recruitment of clients. The recruitment process therefore carried a conscious risk of skewed selection, as social services may choose to contact satisfied users; this was discussed with the involved social services during the study. It should be added that the managers who mediated the contact with the clients also conveyed that the study was an important contribution to knowledge, for example, in better understanding the situation of the vulnerable and in guiding operational development.

Data collection was conducted from September 2018 to June 2020 and included individual face-to-face interviews with ten women from four municipalities. Although the sample comprised only ten participants, this does not diminish the value of the women’s experiences. It is difficult to recruit participants for this type of research because the group is vulnerable and lives at risk of being exposed to threats and violence. Thus, careful security assessments are required to carry out risk-free interviews.

The included women had been in contact with social services for an average of two to six years. All the respondents were born in Africa or the Middle East. See Table 1 for an overview of why participants had contact with social services.

Table 1. An overview of the interviewees' needs and granted support.

Number of participants | Reasons for needing support | Granted support
3 | Violence and oppression in the parental home | Sheltered housing/foster family
5 | Violence and oppression in the form of forced marriage, of which three child marriages | Sheltered housing/foster family/bodyguard
2 | Violence and oppression in connection with the family’s refusal to adhere to honour norms | Counselling for one participant; foster family/sheltered housing for the other

A panel comprising four young women with experience living with HBV contributed to constructing the interview guide but did not participate in the interviews. The interviews were recorded and were 40–60 min long. The author carried out, transcribed and analysed the interviews.

The interviews addressed the following areas:

Interviewees’ situation before contacting social services.

Interviewees’ experiences of the treatment and support received from social services.

Whether interviewees still feel subject to HBV that limits their possibilities to determine their lives and future choices freely.

The study was approved by the Ethics Review Board in Uppsala, Sweden (Dnr 2018/255), and strictly adhered to guidelines on processing data (General Data Protection Regulation). Interviewees were informed of consent and were guaranteed anonymity at all stages of the study. Because the study concerns vulnerable people, great care was taken to adhere to the ethical principles guiding research in the humanities and social sciences ( Swedish Research Council, 2017 ).

Data analysis

Interview data were analysed using Qualitative Content Analysis ( Graneheim and Lundman, 2004 ; Graneheim et al. , 2017 ). The analysis involved several systematic, processual steps, facilitating the identification of similarities and differences in the material. One advantage of this logical, transparent structure is that the result is easily accessible to users and professionals. The printed transcripts were read several times to obtain an overall impression of the essence of the material. Then, meaning units were extracted based on the aim of this study. A meaning unit comprises sentences or central paragraphs related to content and context, so-called ‘red threads’ ( Graneheim et al. , 2017 ). After identifying a number of common red threads, the text was condensed, and a code was assigned to each unit. These codes were carefully analysed and resulted in several subthemes. The final phase of the analysis resulted in four main themes based on the data as a whole.
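Schematically, the stepwise coding described above (meaning units receive codes, and codes are grouped under themes) can be represented as a simple grouping operation. The sketch below is purely illustrative: the meaning units and codes are invented, and only the theme labels echo themes reported in the study. It is not part of the study's method, which was a manual, interpretive analysis.

```python
from collections import defaultdict

# Each tuple pairs an invented condensed meaning unit with an invented code.
coded_units = [
    ("not allowed to choose friends", "restricted autonomy"),
    ("every step monitored by the family", "surveillance"),
    ("felt safe in the social worker's office", "feeling cared for"),
    ("put in contact with women in the same situation", "supportive network"),
]

# The analyst's grouping of codes into themes (a manual, interpretive step).
code_to_theme = {
    "restricted autonomy": "Denied autonomy and vulnerability",
    "surveillance": "Denied autonomy and vulnerability",
    "feeling cared for": "Being cared for and offered a supportive network",
    "supportive network": "Being cared for and offered a supportive network",
}

def group_by_theme(units, mapping):
    """Group (meaning unit, code) pairs under their assigned theme."""
    themes = defaultdict(list)
    for unit, code in units:
        themes[mapping[code]].append(code)
    return dict(themes)

themes = group_by_theme(coded_units, code_to_theme)
print(themes)
```

In practice, the interpretive work lies in constructing the codes and the code-to-theme mapping, which no mechanical grouping can replace.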

Although the interview guide did not include questions about the respondents’ families, the women described their families as desperate and suffering. The interview guide also did not have specific questions on exposure to violence for two reasons: to avoid unprocessed trauma and because highlighting these violent incidents was not part of the aim of the study. The women’s stories were extensive and detailed, which could indicate an experience of a trusting and relaxed atmosphere. During the interviews, it was clear that there is a great need to relate previous experiences of violence and be allowed to tell one’s story. The results comprise four themes derived from the analysis: Denied autonomy, lack of living space and vulnerability; Feelings of non-recognition; Being cared for and offered a supportive network; and Being limited by honour norms in public spaces.

Theme 1: Denied autonomy, lack of living space and vulnerability

For us, it was school and home, school and home. I mean, we’re not allowed to be anywhere, they know the bus schedules, every step we take. They can even use a GPS to see us (R4).
It started when I wanted to ride a bicycle. How can I explain? I have to fight to be allowed to have a job and to bike (R6).
Later, it became a huge problem for me. I said if they weren’t planning on killing me before, they were planning it now (R9).

Another woman described how her child had been affected when she sought help for divorce. She did not dare assert visitation rights due to escalating threats, and the child was taken to a war-ridden country. She did not think she could pursue the matter, and the family did not allow her to see her child until many years later.

If you go to social services like I did, they will tell you that you are not a good person. Because I turned against my family and went to social services (R8).

Several women described feelings of anxiety, vulnerability, exposure and fear after deciding to break up with their families and flee. The women said that they had broken with their families only because it was a matter of life or death in an escalating emergency—he hit me every day (R5)—or because they could no longer stand the restricted living space allowed them.

Theme 2: Feelings of non-recognition

I couldn’t speak Swedish, so my friend tried to contact social services. My appointment was not that soon; it was far away (R5).
You have to try to teach people not just the language, but also about freedom…//.. teach people that there are social services (R6).
This was my worst nightmare. Maybe the social workers think that some teens want to rebel, that they’re just complaining; they don’t realise that there are people who need protection and support (R4).
A woman and child come and need help. They can’t just place them and say goodbye (R9).
Social services may think that the most important thing is that your basic physical needs are covered, like in Maslow’s hierarchy of needs, but I’d rather be in a refugee camp with people I love and with people who love me than have everything and yet nothing (R4).

Theme 3: Being cared for and offered a supportive network

One woman said that the decision to contact social services also meant breaking off family ties, and she had then been excluded from the community. This grieving process had to be planned: ‘Because you know that you will lose a lot’ (R4). Cherished family members may, for example, never be seen again. Some women talk about feelings of guilt because they have brought shame upon the family. They described how family members expressed both sadness and anger. For example, sad siblings or a crying father who, at the same time, failed to stop the oppression. Social workers had sometimes created a support network for the women, which was described as an action speaking of care. They had, for instance, assisted in establishing contacts with other women subject to HBV who were prepared to provide support: ‘She put me into contact with them. [This is] good when I’m unhappy, or when my family pressures me too much by phone. I ask my friend for advice then’ (R2).

They [social services] arranged a room where nobody could see me and parked so that I could just hop out of the car and into the doctor’s examination room (R2).
R5: Yes, they understand. It was natural … they started asking about my husband. I told them about the knives and assaults and being locked inside and prevented from seeing a doctor, so they understood everything (R5).
I wanted to stuff myself into a jar with a lid and just sit there in the social worker’s office and feel safe (R10).
I’ve received all the help I needed. We [she and social services] are still sometimes in touch. I love them; they’re like angels, and they’re always there when I’m worried (R10).
With this family, it feels like winning the lottery (R4).

Theme 4: Being limited by honour norms in public spaces

My younger sisters and I refuse to wear the veil. We are exposed and constantly mocked by our "own" in school. Our brothers are mocked for not keeping us sisters in order (R3).
Social services can protect us from our men but not from our traditions (R5).
When I married, he [her husband] paid my dad a lot of money. He wants the money back, but my dad no longer has it.

She further said that people from her country of origin quickly found out that social services had helped her to divorce. She was being pressured, and one of the men told her: ‘You are [religious affiliation], and he is your husband; you have to return’ (R1).

This study aimed to examine the circumstances that lead women exposed to HBV to seek help from social services, as well as their experiences of the support and relief offered by social services. The findings from the interviews with women exposed to HBV highlight the many ways in which various forms of abuse have marked their lives. Ultimately, it was about not being recognised as an individual with self-determination, but that also included being exposed to physical abuse, the risk of being isolated and living with the consequences of being constantly monitored. The motive for seeking help to leave the family was ultimately about trying to create a more unburdened life and escape restrictions and violence. In some cases, the escape from the family was due to a fear of being exposed to lethal violence. It appears that girls from a young age are expected to take responsibility for the family household and the family members that are part of it. Taking one's own initiative outside the family, for example, riding a bicycle, going to a public swimming hall or getting a job, was not allowed and could lead to correction and punishment.

In recent years, social services’ recognised jurisdiction to intervene and protect exposed individuals generates greater possibilities to act against family-based violence ( Olsson and Bergman, 2021 ). At the same time, this possibility is not unproblematic in an honour context. The results show that contact with authorities, particularly in family matters, is regarded as a significant contravention of standards of honour, which in several cases had led to severe threats being levelled at the women.

Regarding the women’s experience of support and help from social services, it should be emphasised that most women were satisfied with the support they had received from social services over time. Nevertheless, they also described experiences of non-recognition. The concepts of recognition, or being denied recognition, are discussed by the social philosopher Axel Honneth. Honneth (2007) argues that the self-respect of individuals or groups depends on others’ recognition of their rights and capabilities. The struggle for recognition concerns whether individuals or groups receive recognition from society, for instance, in terms of love, social and legal rights, and solidarity.

When the women described experiences of not being understood or not recognised, it was about being rejected or treated passively: for example, having to wait a long time to get in touch with a social worker, finding it easier to get help from social services as a minor than as an adult, or, as a young exposed woman, being forced to sit in a meeting with her parents and explain herself. This parental focus is not unusual; social services have been accused of incompetence for maintaining a traditional family approach. Such an approach gives parents the prerogative of formulating problems, which could deny young victims of violence legal protection and lead to an increase in serious violence ( Schlytter and Linell, 2010 ; Wikström and Ghazinour, 2010 ; Heimer et al. , 2018 ). Furthermore, the children’s needs seemed to be overlooked when mothers with children fled from families where standards of honour were followed.

Likewise, the results show that the initiation of a forced marriage also involved the prospective husband paying a so-called bride price, in this case, to the bride’s father. Authorities who encounter people subjected to coercion involving financial transactions need to make two kinds of considerations. First, in the event of a divorce, social services and the police need to carry out risk assessments that pay particular attention to the risk of future violence, as the husband or his family will probably reclaim the bride price. Secondly, to ensure that victims are recognised, these authorities need to consider carefully whether the case can also be related to human trafficking ( Olsson et al. , 2022 ).

Finally, another form of non-recognition was when the women were offered family home placements that did not give them access to supportive and empathetic relationships. Being in an exposed position and being granted half-hearted help can produce overwhelming feelings of vulnerability. Therefore, interventions that do not cover the need for support can be risky because previous research has shown that vulnerable girls and women can lose motivation and determination and return home ( Wikström and Ghazinour, 2010 ).

Being taken care of seemed to depend on the extent to which the social worker had enough knowledge to understand what needed to be dealt with in terms of protection and support. The results showed that access to other young women in the same situation was a valuable form of support. Similarly, research indicates that support from women with similar experiences strengthened their ability to handle life difficulties ( Shanthakumari et al. , 2014 ).

The experience that social workers had enough knowledge to make well-thought-out security assessments contributed to creating a sense of security. Successful placements in supportive housing not only provided a platform on which to build a new and better life; previous interviews with women exposed to honour violence showed that supportive housing also reduced the experience of having a preoccupied social worker ( Olsson et al. , 2022 ). One lesson is that social services (in case of staff turnover or high workload) should be careful to routinely establish well-thought-out risk management and treatment plans. These plans can prevent organisational weaknesses that could adversely affect the client ( Olsson et al. , 2023 ).

This study shows that most women have experiences of being exposed to and limited by standards of honour even outside the family sphere. In some cases, feeling rejected by social services reflected an experience that the organisation did not pay attention to the fact that women could feel limited when honour norms were enforced in the public local environment, for example, in public schools or in SFI (Swedish for Immigrants) classes. Other women state that despite having received help from social services, they are at the same time forced to continue a limited life, for example, when a woman was advised not to be in public environments or was ordered to return to her husband after divorce. Furthermore, it was about having expectations of oneself to dress according to tradition. Other experiences concerned brothers being mocked for not controlling their female family members, or mothers being bullied for allowing their children to live in a Western way.

Several factors contribute to girls and women from honour-contextual environments exhibiting particular vulnerability. Cater and Sjögren (2016) found that exposure to obedience-demanding violence in childhood results in children constantly having to adapt to and heed reprimands. This type of subjection to violence can negatively affect the possibility of developing one’s inner moral sense and, thereby, a coherent, separate personality with the right to one’s thoughts and feelings.

Moreover, most of the interviewed women are still in the process of integrating and are struggling to become well-functioning members of society. Arriving in a new country involves being introduced to a new culture and society, and past experiences must be integrated with the present and future ( SBU, 2018 ). While this process is ongoing, these women are also involved in a struggle for autonomy and escaping HBV, all while losing access to their primary relations. In connection with migration, psychological transition processes are characterised by increased vulnerability, leading to greater exposure ( Meleis et al., 2000 ; Kralik et al. , 2005 ). These significant vulnerability factors must be managed and heeded by social services, since such particular vulnerability affects the women’s psychological preparedness to establish free, independent lives successfully.

At the same time, previous literature has discussed the risk of reinforcing stereotypes of women subjected to violence and abuse as vulnerable, helpless victims. As Randall (2004) notes, it should be remembered that these women are actively engaged in finding strategies for survival and creating resilience against HBV. Fineman (2019) highlights the risks of creating a unique legal identity that classifies vulnerable groups as needing protection and lacking capacity. From a life course perspective, all humans are vulnerable, but the privileged can manage that vulnerability more easily. She advocates a responsive and just society, which makes visible and supports vulnerable individuals’ collective success in creating resilience ( Fineman, 2019 , p. 368).

Although this article comprised a limited number of participants, its knowledge contribution should not be diminished. The study’s practical implications can be summarised as follows. First, there are significant differences in a woman’s need for support depending on whether children are involved and whether she arrived in Sweden as a child with her family or later in life as an asylum seeker; age (minor or adult) also appeared to influence the possibility of support. Second, social services need to pay more attention to the threats that ensue after divorce, for example, when repayment of a bride price is demanded. Third, children’s perspectives must be better considered, for example, when social services neglect or fail to protect a small child whose mother seeks help to divorce, when long-term isolation in various sheltered housings leads to experiences of powerlessness, or when professionals try to arrange meetings between an exposed girl and her parents. Fourth, this study shows that women or families who reject standards of honour are subject to systematic harassment in public by people aiming to coerce them into adapting to and following those standards; when the authorities involved do not recognise their exposure to violence, the women describe their social and judicial position as one of powerlessness. Fifth, the women described the importance of being met with understanding, being cared for and having their fears taken seriously; this created feelings of security in the break-up process and provided opportunities for a free and independent life. Sixth, the interviewed women showed a great need to speak about the violence and their exposure. Professionals should be alert to the healing power of narration, which may contribute to processing and recovery.
Seventh, HBV affects whole families and has a divisive effect on relations between adults and children, leading to tragic separations. Many families in which standards of honour exist live with great psychological, social and financial stress. Society needs to pay better attention to the health conditions and needs of migrants, for example, through support networks and easily accessible parent training. Offering parents access to staff trained in conflict mediation can be valuable when a child has chosen to leave the family.

It is not possible to generalise the results from such a small sample; instead, readers should decide whether the results may be transferred to other, similar contexts. Since the study population is small, and the participants live under real threat, there is a risk of revealing their identities through detailed descriptions. This poses a methodological problem, since qualitative studies often strive to provide ‘thick’ contextual descriptions. Dependability was enhanced because the same interviewer conducted and transcribed all interviews and an interview guide with open-ended questions was used. Quotations from the interviews support the confirmability of the four main themes. The fact that only one researcher was involved in the data analysis is a limitation; however, the work was characterised by awareness of the risk of taking things for granted and by avoiding preconceived notions. To strengthen credibility, a panel comprising four young women with experience of living with HBV contributed advice throughout the research process. Another limitation is the lack of gender balance: the study would have contributed more knowledge if some young men had participated. In Sweden, it is highly unusual for boys or men subjected to HBV to seek help or be identified via reports of concern. There are many reasons for the absence of men: men are overlooked when it comes to state intervention for male victims, and they may find it challenging to come forward, although research shows that strict patriarchal norms have an oppressive influence on their lives.

I express my warmest thanks to all the participants who generously and courageously shared their stories.

This study was financed by a research grant awarded to the author by fifteen rural municipalities (social work agencies) in a county in mid-Sweden striving to counteract HBV.

The author declares no potential conflicts of interest concerning this article’s research, authorship or publication.

Baianstovu R. , Strid S. ( 2024 ) ‘ Complexities facing social work: Honor-based violence as lived reality and stereotype ’, Journal of Social Work , 1 – 19 .


Bhanbhro S. , Wassan M. R. , Shan M. A. , Talpur A. A. , Wassan A. A. ( 2013 ) ‘ Karo-Kari—the murder of honour in Sindh Pakistan: An ethnographic study ’, International Journal of Asian Social Science , 3 ( 7 ), pp. 1467 – 84 . http://shura.shu.ac.uk/7287/ (accessed February 23, 2021).

Bhanbhro S. , de Chavez A. , Lusambili A. ( 2016 ) ‘ Honour based violence as a global public health problem: A critical review of literature ’, International Journal of Human Rights in Healthcare , 9 ( 3 ), pp. 198 – 215 .

Björktomta S. B. ( 2019 ) ‘ Honor-based violence in Sweden—norms of honor and chastity ’, Journal of Family Violence , 34 ( 5 ), pp. 449 – 60 . DOI: 10.1007/s10896-019-00039-1

BRÅ, National Council of Crime Prevention. ( 2023a ) Grov Fridskränkning Mot Barn [Serious Breach of Peace against Children] (Report 2023:6), available online at: www.bra.se (accessed August 7, 2023).

BRÅ, National Council of Crime Prevention. ( 2023b ) The Statistics of Reported Crimes, available online at: https://statistik.bra.se/solwebb/action/index (accessed August 7, 2023).

Cater Å. K. , Sjögren J. ( 2016 ) ‘ Children exposed to intimate partner violence describe their experiences: A typology-based qualitative analysis ’, Child and Adolescent Social Work Journal , 33 ( 6 ), pp. 473 – 86 .

Cohen D. , Nisbett R. E. ( 1994 ) ‘ Self-protection and the culture of honour: Explaining Southern violence ’, Personality and Social Psychology Bulletin , 20 ( 5 ), pp. 551 – 67 .

Dickson P. ( 2014 ) ‘ Understanding victims of honour-based violence ’, Community Practitioner , 87 ( 7 ), pp. 30 – 3 .

Dietrich D. M. , Schuett J. M. ( 2013 ) ‘ Culture of honor and attitudes toward intimate partner violence in latinos ’, SAGE Open , 3 ( 2 ), pp. 215824401348968 – 11 .

EU Anti-Trafficking Directive ( 2011/36/EU ). Directive 2011/36/EU of the European Parliament and of the Council of 5 April 2011 on Preventing and Combating Trafficking in Human Beings and Protecting its Victims, and Replacing Council Framework Decision 2002/629/JHA, The European Parliament and the Council of the European Union, available online at: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32011L0036&from=LT (accessed August 8, 2021).

Fineman M. A. ( 2019 ) ‘ Vulnerability and social justice ’, Valparaiso University Law Review , 53 ( 2 ), pp. 341 – 69 . https://scholar.valpo.edu/vulr/vol53/iss2/2 (accessed August 10, 2022).

Gangoli G. , Gill A. , Mulvihill N. , Hester M. ( 2018 ) ‘ Perception and barriers: Reporting female genital mutilation ’, Journal of Aggression, Conflict and Peace Research , 10 ( 4 ), pp. 251 – 60 .

Gill A. ( 2008 ) ‘ Crimes of honour’ and violence against women in the UK ’, International Journal of Comparative and Applied Criminal Justice , 32 ( 2 ), pp. 243 – 63 .

Graneheim U. H. , Lundman B. ( 2004 ) ‘ Qualitative content analysis in nursing research: Concepts, procedures and measures to achieve trustworthiness ’, Nurse Education Today , 24 ( 2 ), pp. 105 – 12 .

Graneheim U. H. , Lindgren B. M. , Lundman B. ( 2017 ) ‘ Methodological challenges in qualitative content analysis: A discussion paper ’, Nurse Education Today , 56 , 29 – 34 .

Heimer M. , Näsman E. , Palme J. ( 2018 ) ‘ Vulnerable children's rights to participation, protection, and provision: The process of defining the problem in Swedish child and family welfare ’, Child & Family Social Work , 23 ( 2 ), pp. 316 – 23 .

HSLF-FS 2022:39. ( 2022 ) ‘New Regulations and General Advice on Intimate Partner Violence’, available online at: https://www.socialstyrelsen.se/globalassets/sharepoint-dokument/artikelkatalog/meddelandeblad/2022-6-8012.pdf (accessed July 15, 2022).

Honneth A. ( 2007 ) Disrespect: The Normative Foundation of Critical Theory , Cambridge , Polity Press .


Idriss M. M. ( 2017 ) ‘ Not domestic violence or cultural tradition: Is honour-based violence distinct from domestic violence ’, Journal of Social Welfare and Family Law , 39 ( 1 ), pp. 3 – 21 .

Idriss M. M. ( 2021 ) ‘ Abused by the patriarchy: Male victims, masculinity, “honor”- based abuse and forced marriages ’, Journal of Interpersonal Violence , 37 ( 13–14 ), pp. NP11905 – NP11932 .

Kağıtçıbaşı C. ( 2017 ) Family, Self, and Human Development Across Cultures. Theory and Applications , New York, NY , Routledge .

Kralik D. , Visentin K. , van Loon A. ( 2005 ) ‘ Transition: A literature review. Integrative literature reviews and meta-analyses ’, Journal of Advanced Nursing , 55 ( 3 ), pp. 320 – 9 .

Lee-Rife S. , Malhotra A. , Warner A. , McGonagle Glinski A. ( 2012 ) ‘ What works to prevent child marriage: A review of the evidence ’, Studies in Family Planning , 43 ( 4 ), pp. 287 – 303 .

Lidman S. , Hong T. ( 2018 ) ‘ “Collective violence” and honour in Finland: A survey for professionals ’, Journal of Aggression, Conflict and Peace Research , 10 ( 4 ), pp. 261 – 71 .

Linell H. ( 2017 ) ‘ The process of disclosing child abuse: A study of Swedish Social Services protection in child abuse cases ’, Child & Family Social Work , 22 ( S4 ), pp. 11 – 9 .

Meleis A. I. , Sawyer L. M. , Im E. O. , Hilfinger Messias D. K. , Schumacher K. ( 2000 ) ‘ Experiencing transitions: An emerging middle-range theory ’, Advances in Nursing Science , 23 ( 1 ), pp. 12 – 28 .

National Board of Health and Welfare . ( 2016 ) Kvinnlig Könsstympning—Ett Stöd För Hälso- Och Sjukvårdens Arbete. [Female Genital Mutilation –Support for the Work of the Health and Medical Services] , Stockholm , Socialstyrelsen . https://www.socialstyrelsen.se/ (accessed July 15, 2023).

National Board of Health and Welfare . ( 2018 ) Dödsfallsutredningar 2016–2017. [Death Investigations 2016–2017] , Stockholm , Socialstyrelsen . https://www.socialstyrelsen.se/ (accessed September 1, 2018).

Olsson H. , Bergman A. ( 2021 ) ‘ From silence to recognition: Swedish social services and the handling of honor-based violence ’, European Journal of Social Work , 25 ( 2 ), pp. 198 – 209 .

Olsson H. , Strand S. , Källvik E. ( 2022 ) ’Hedersrelaterat våld och Förtryck i Kombination med Prostitution och Människohandel—Ett Vidgat Perspektiv. [Honor-related Violence and Oppression Combined with Prostitution and Human Trafficking—A Broader Perspective]’, See English Summary. Ett Samarbete Mellan Örebro- och Karlstad Universitet och Jämställdhetsmyndigheten, available online at: https://jamstalldhetsmyndigheten.se/media/qkck4bnn/hedersrelaterat-vald-och-fortryck-i-kombination-med-prostitution-och-manniskohandel.pdf (accessed June 1, 2022).

Olsson H. , Larsson A. K. L. , Susanne J. M. ( 2023 ) ‘ Social workers’ experiences of working with partner violence ’, The British Journal of Social Work , 00 , 1 – 19 .

Patton M. Q. ( 2015 ) Qualitative Research & Evaluation Methods. Integrating Theory and Practice , 4th edn, Newbury Park, CA , SAGE .

Randall M. ( 2004 ) ‘ Domestic violence and the construction of “ideal victims”: Assaulted women’s “image problems” in law ’, Saint Louis University Public Law Review , 23 ( 1 ), pp. 121 . https://scholarship.law.slu.edu/plr/vol23/iss1/8 (accessed July 1, 2022).

SBU . ( 2018 ) Stöd till Ensamkommande Barn Och Unga—Effekter, Erfarenheter Och Upplevelser. [Support for Unaccompanied Children and Young People—Effects and Experiences] , Stockholm, Sweden , Swedish Agency for Health Technology Assessment and Assessment of Social Services (SBU ). https://www.sbu.se/294 (accessed July 1, 2022).

Schlytter A. , Linell H. ( 2010 ) ‘ Girls with honour-related problems in a comparative perspective ’, International Journal of Social Welfare , 19 ( 2 ), pp. 152 – 61 .

Sedem M. , Ferrer-Wreder L. ( 2014 ) ‘ Fear of the Loss of Honor: Implications of Honor-Based Violence for the Development of Youth and Their Families ’, Child & Youth Care Forum , 44 ( 2 ), pp. 225 – 37 . DOI: 10.1007/s10566-014-9279-5

Shanthakumari R. S. , Chandra P. S. , Riazantseva E. , Stewart D. E. ( 2014 ) ‘ Difficulties come to humans and not trees and they need to be faced’: A study on resilience among Indian women experiencing intimate partner violence ’, The International Journal of Social Psychiatry , 60 ( 7 ), pp. 703 – 10 .

Strid S. , Baianstovu R. , Enelo J. M. ( 2021 ) ‘ Inequalities, isolation, and intersectionality: A quantitative study of honour-based violence among girls and boys in metropolitan Sweden ’, Women's Studies International Forum , 88 , pp. 102518 – 9 .

Swedish Government Bill . ( 2021 ) Regeringens Proposition 2021/22:138. Ett särskilt Brott för Hedersförtryck [Honor Oppression as a Special Crime], available online at: https://www.regeringen.se/493134/contentassets/fa32ed4b32564e21b0763fba292df7b2/ett-sarskilt-brott-for-hedersfortryck-prop.-202122138.pdf (accessed September 1, 2017).

Swedish Research Council . ( 2017 ) God Forskningssed. [Good Research Practice] , Stockholm , Swedish Research Council . https://www.vr.se/english/analysis/reports/our-reports/2017-08-31-good-research-practice.html

UNFPA . ( 2000 ) The State of World Population 2000: Lives Together, Worlds Apart , New York, NY , United Nations Population Fund , available online at: https://www.unfpa.org/sites/default/files/pub-pdf/swp2000_eng.pdf (accessed July 15, 2021).

UNICEF . ( 2020 ) United Nations Children’s Fund, Female Genital Mutilation: A New Generation Calls for Ending an Old Practice , New York, NY , UNICEF . https://data.unicef.org/topic/child-protection/female-genital-mutilation/ (accessed July 15, 2021).

UNICEF . ( 2021 ) United Nations Children’s Fund, Towards Ending Child Marriage: Global Trends and Profiles of Progress , New York, NY , UNICEF , available online at: https://data.unicef.org/resources/towards-ending-child-marriage/ (accessed July 15, 2021).

Wikström E. , Ghazinour M. ( 2010 ) ‘ Swedish experience of sheltered housing and conflicting theories in use with special regards to honour related violence (HRV) ’, European Journal of Social Work , 13 ( 2 ), pp. 245 – 59 .



  • Online ISSN 1468-263X
  • Print ISSN 0045-3102
  • Copyright © 2024 British Association of Social Workers
  • Open access
  • Published: 27 September 2024

The mediating role of goal orientation in the relationship between formative assessment with academic engagement and procrastination in medical students

  • Majid Yousefi Afrashteh   ORCID: orcid.org/0000-0003-2760-7112 1 &
  • Parisa Janjani   ORCID: orcid.org/0000-0002-7394-8447 2  

BMC Medical Education volume  24 , Article number:  1036 ( 2024 )


Academic engagement and academic procrastination are two behavioral variables and are among the challenges of higher education, especially medical education. The purpose of the current research is to investigate the mediating role of goal orientation in the relationship of formative assessment with academic engagement and procrastination in Iranian medical students.

The present correlational, path-analysis study was performed on 388 students of Zanjan University of Medical Sciences in 2021, selected by convenience sampling. Four questionnaires were used to collect data: the Goal Orientation Scale (21 items), the Classroom Assessment Approaches Questionnaire (12 items), the Procrastination Assessment Scale–Students (44 items) and the Student Engagement Scale (10 items). The data were analyzed with SPSS-26 and LISREL-10.2 software.

The results of the path analysis showed that formative assessment has a significant direct effect on mastery orientation (β = 0.40), performance-approach orientation (β = 0.14), avoidance orientation (β = -0.28), academic engagement (β = 0.32) and academic procrastination (β = 0.12). Mastery orientation (β = 0.13), performance-approach orientation (β = 0.12) and avoidance orientation (β = -0.25) also have significant direct effects on academic engagement, and mastery orientation (β = -0.43), performance-approach orientation (β = -0.15) and avoidance orientation (β = 0.30) have significant direct effects on academic procrastination. These results confirm the direct hypotheses. The indirect effects of formative assessment on academic engagement (0.21) and academic procrastination (0.27) were also significant.

Based on these results, it is recommended that practitioners promote academic engagement and reduce academic procrastination by applying formative assessment and improving classroom goal orientation.


Introduction

Medical education is crucial for the advancement of healthcare systems globally [ 1 , 2 ]. Its main goal is to prepare medical professionals to deliver top-quality services throughout their careers [ 3 ]. Academic learning equips students with essential knowledge and skills [ 4 ]. However, medical students face heavy academic burdens, managing rigorous schedules, teaching content, and tasks, leading to high pressure [ 5 , 6 ]. On the other hand, rapid changes in teaching and evaluation methods worldwide underscore the need for continuous improvement [ 7 ]. Assessment and evaluation aim to enhance student learning, and the content and methodology of evaluation significantly influence the quality of that learning [ 8 ]. While Iran’s medical education system has made significant progress over the last three decades, it is crucial to emphasize the importance of social accountability within the health system and at the level of medical schools. To ensure a competent healthcare workforce, accountability for the quality of services provided and the knowledge, attitude, skills, and abilities of graduates is essential [ 9 ].

Classroom assessment is a crucial tool for enhancing the learning process [ 10 ] often referred to as “assessment for learning” rather than “assessment of learning” [ 11 ]. Formative assessment, in particular, has been linked to academic engagement [ 12 , 13 ] and procrastination [ 14 ]. According to the Assessment Reform Group, formative assessment is a process where both learners and teachers seek and interpret evidence to determine the learners’ current progress, future goals, and the best path to achieve those goals [ 15 ]. Moss and Brookhart define formative assessment as a feedback-driven process during instruction, aiming to improve learning and teaching activities and ultimately increase student achievement [ 16 ]. Wafubwa’s meta-analysis revealed that formative assessment not only improves grades but also enhances academic motivation and engagement among learners [ 17 ]. By providing students with feedback and insights into their understanding of the material [ 18 , 19 ], formative assessment serves as a predictor of outcomes [ 20 , 21 ] and guides their efforts toward self-regulated learning [ 22 ]. Academic engagement and procrastination are significant variables influencing how programs and academic tasks are executed, particularly for medical students [ 23 ].

Academic procrastination, a subtype of situational procrastination [ 24 ], refers to the deliberate delay in initiating or completing tasks related to the learning process [ 25 ]. In medical education, procrastination is a prevalent issue that hinders students’ academic progress [ 26 ]. While many studies have examined academic procrastination in various educational contexts, most research has focused on the university setting [ 24 , 27 ]. Notably, students who struggle with procrastination often express a desire to overcome this habit [ 27 ].

Student engagement has gained prominence due to the increasing pressure on students to complete their studies within specified timeframes [ 28 , 29 ]. High levels of engagement are crucial for academic performance and persistence in educational tasks and institutions [ 30 , 31 ]. Academic engagement is typically characterized as a multidimensional construct, although its definitions vary across literature. The emotional, cognitive, and behavioral aspects of student engagement have been the most studied and are considered essential factors in understanding student involvement and success [ 28 , 30 ].

Goal orientation, a motivational variable, has been frequently studied in relation to procrastination [ 32 ] and academic engagement [ 33 ]. The goal orientation theory of achievement motivation is a social cognitive theory applied in educational contexts to explain student behavior [ 34 , 35 ]. It suggests that variations in behavior are not solely due to differences in motivation levels. Goal orientation refers to the reasons or purposes behind students’ learning, and these goals influence their actions, reactions, and motivation to learn [ 34 , 36 ].

Research on academic procrastination highlights the need for systematic investigations into the negative impact of procrastination on students’ academic goal achievement and the exploration of strategies to mitigate procrastination [ 24 ]. Conversely, goal orientation can enhance achievement, particularly under challenging conditions [ 37 ]. Achievement goal behaviors have been found to independently predict academic achievement and are influenced by mediating or moderating relationships with other student behaviors [ 38 , 39 ].

One theory that can explain the theoretical framework of this research is the achievement goal theory [ 40 ], whose relevance to medical education Cook and Artino [ 41 ] have emphasized. According to this theory, students have goals for their learning and bring them to class. These goals, classified as mastery, performance and avoidance, affect how students learn and their academic results. Educators can also influence students’ learning goals by modifying the educational environment [ 42 ]. Teaching and assessment are two important tools available to the instructor in shaping the educational environment. In new approaches, assessment is considered part of education and a tool to improve teaching and learning. Therefore, the use of new assessment approaches, collectively called formative assessment, enriches the educational environment and moves learners’ goals from performance and avoidance toward mastery. According to the achievement goal theory, each goal orientation activates related behaviors. Two important learning behaviors in medical education are academic engagement and procrastination. Students who pursue performance and avoidance goals are expected to procrastinate more and engage less in the learning process, whereas students with mastery goals engage enthusiastically with the learning content and therefore participate more actively. Although theories and limited studies [ 43 , 44 ] confirm the relationship between goal orientation and academic procrastination, the mediating role of goal orientation has not yet been established. Considering this research gap, the present study was conducted with the aim of investigating the mediating role of goal orientation in the relationship between formative assessment with academic engagement and procrastination in Iranian medical students.
The hypotheses of this study were as follows: (1) formative assessment is related to academic engagement through goal orientation; (2) formative assessment is related to academic procrastination through goal orientation; and (3) the assumed model has a good fit.

Design and data collection

A correlational path analysis study was conducted on students of Zanjan University of Medical Sciences in 2022. According to Kline [ 45 ], the sample size for path analysis should ideally be 10–20 times the number of parameters. Questionnaires were distributed to 400 students, with professional interviewers encouraging participation and explaining the research objectives. Incentives, such as gift pens, were also provided. Of the 400 questionnaires distributed, 12 were excluded due to incomplete or outlier data, resulting in a final sample of 388, which exceeds Kline’s recommended minimum. The response rate was 97%. The sampling method employed was convenience sampling.
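Kline’s cases-per-parameter rule of thumb can be checked with simple arithmetic. In the sketch below the free-parameter count (25) is a hypothetical value for illustration, since the paper does not report the model’s exact parameter count; only the final sample of 388 comes from the text.

```python
# Rough check of Kline's rule of thumb: 10-20 cases per free
# model parameter. free_parameters is illustrative only; the
# paper does not report the model's exact parameter count.
free_parameters = 25
lower = 10 * free_parameters   # minimum recommended sample
upper = 20 * free_parameters   # comfortable upper target
n = 388                        # final sample after exclusions

print(lower, upper, lower <= n <= upper)  # 250 500 True
```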

The participants were given the option to choose between a paper or online version of the questionnaire. The first page of the questionnaire explained the purpose of the research, the criteria for participation, and included a consent form. After providing consent, participants answered demographic questions on the second page, and then proceeded to complete four additional questionnaires: the Classroom Assessment Approaches Questionnaire (CAAQ), the Procrastination Assessment Scale for Students (PASS), the Goal Orientation Scale, and the Student Engagement Scale.

Instruments

The Classroom Assessment Approaches Questionnaire (CAAQ) was employed to assess students’ perceptions of their teacher’s classroom assessment methods. This 12-item questionnaire, which includes questions such as “Did you receive any feedback on how you learned while studying?,” was designed by Yousefi Afrashteh et al. [ 46 ] specifically for use in Iran. A high score on the CAAQ indicates that the teacher utilizes formative assessment practices more frequently. The reliability coefficient of the questionnaire, as reported by Yousefi Afrashteh et al. [ 46 ]. was 0.72. In a separate study on Iranian medical students [ 47 ]. the Cronbach’s alpha coefficient for the CAAQ was found to be 0.78, indicating good internal consistency. In the present study, the Cronbach’s alpha coefficient for the CAAQ was also calculated to be 0.78, further supporting its reliability.

The Procrastination Assessment Scale– Students (PASS) (Solomon & Rothblum, 1984) was used to measure academic procrastination [ 48 ]. This 44-item questionnaire includes two subscales: Areas of Procrastination (AOP; 18 items) and Reasons for Procrastination (26 items). Respondents are queried about their procrastination frequency, their perception of procrastination as a problem, and their desire to reduce it. The present study utilized only the six AOP subscale items related to the degree of procrastination. Responses are rated on a Likert scale ranging from 1 (“never procrastinate/not at all a problem/do not want to decrease”) to 5 (“always procrastinate/always a problem/definitely want to decrease”). Jokar and Delavarpour confirmed the factorial structure of this questionnaire for the Iranian sample using factor analysis [ 49 ]. In the current study, the Cronbach’s alpha coefficient for internal consistency was 0.80, indicating good reliability.

Goal orientation scale (Bouffard et al. 1995) and the scale designed by Armes and Archer (1998) was employed to assess individuals’ goal preferences in academic contexts [ 50 ]. This scale evaluates three dimensions: learning goal orientation, performance goal orientation, and failure avoidance goal orientation. Respondents indicate their level of agreement with 21 statements on a 6-point Likert scale ranging from “absolutely disagree” to “absolutely agree.” The scale includes 8 items related to learning, 4 to performance, and 9 to failure avoidance. Khademi and Noshadi (2006) confirmed the validity of this questionnaire using internal consistency measures [ 51 ]. The reliability coefficients obtained were 0.83 for learning, 0.72 for performance, and 0.85 for failure avoidance. To assess reliability for the present study, the questionnaire was administered to 50 students, yielding Cronbach’s alpha values of 0.84 for learning, 0.78 for performance, and 0.83 for failure avoidance. The overall Cronbach’s alpha for this questionnaire in the current study was 0.76, indicating acceptable reliability.

The Student Engagement Scale. To assess student engagement, we utilized the Student Engagement Scale developed by Gunuc and Kuzu (2014), which has previously been employed to explore the impact of classroom technology on student engagement [ 52 ]. This scale evaluates six types of engagement, categorized under class engagement and campus engagement. The four class engagement scales, namely cognitive engagement, peer relationships, relationships with faculty members and behavioral engagement, were tailored to refer to specific courses using a frame of reference (e.g., “I feel myself as a part/member of a student group for [COURSECODE]”). The cognitive engagement scale comprised 10 items, such as “I motivate myself to learn for [COURSECODE],” and demonstrated good internal consistency (α = .89 for the official Facebook course and α = .86 for the non-official Facebook course). The peer relationships scale included six items and demonstrated good internal consistency (α = .87 for both official and non-official Facebook courses). The relationships with faculty members scale consisted of 10 items, such as “My teachers in [COURSECODE] show regard to my interests and needs,” and demonstrated good internal consistency (α = .89 for the official Facebook course and α = .92 for the non-official Facebook course). The behavioral engagement scale, consisting of four items (e.g., “I follow the rules in class for [COURSECODE]”), exhibited acceptable internal consistency (α = .75 for the official Facebook course and α = .66 for the non-official Facebook course). Beyond class engagement, the scale also included two campus engagement scales, valuing and sense of belonging, which were presented without a specific course frame of reference. 
The valuing subscale, comprising three items (e.g., “I believe university is beneficial for me”), demonstrated good internal consistency (α = .79), while the sense of belonging subscale, with eight items (e.g., “I feel myself as a part of the campus”), also showed good internal consistency (α = .88). All engagement scales were rated on a 5-point scale ranging from 1 (“strongly disagree”) to 5 (“strongly agree”), and item responses were averaged to create composite scales. In the present study, the overall Cronbach’s alpha coefficient for this engagement scale was 0.82, indicating good reliability.
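The Cronbach’s alpha coefficients quoted throughout this section follow the standard formula α = k/(k−1)·(1 − Σ item variances / total-score variance). A minimal sketch, using made-up Likert responses rather than the study’s data:

```python
# Cronbach's alpha from an items-by-respondents matrix.
# The responses below are fabricated Likert ratings (rows =
# respondents, columns = items), not data from the study.

def cronbach_alpha(rows):
    k = len(rows[0])                       # number of items
    def var(xs):                           # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([r[i] for r in rows]) for i in range(k)]
    total_var = var([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]
print(round(cronbach_alpha(responses), 2))  # 0.94
```

Items that co-vary strongly keep the sum of item variances well below the total-score variance, pushing alpha toward 1; uncorrelated items pull it toward 0.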

Statistical analysis

Descriptive and inferential statistics were utilized in the analysis of the data. In the descriptive analysis, the frequency distribution table of demographic variables, as well as the mean and standard deviation of the variables, were examined and reported. Inferential statistics involved the use of Pearson’s correlation coefficient and path analysis. The latter was performed using LISREL v10.2, while SPSS v26 (IBM) was employed for the remaining analyses. Prior to conducting the path analysis, its fundamental assumptions were verified, including the requirement of a minimum sample size of 200 participants, as recommended by Kline [ 45 ]; the sample of 388 satisfied this assumption. The skewness values of the dependent variables, as shown in Table 1, fell within the range of -1 to 1, confirming normality of their distributions. The analyzed model revealed no correlation between the errors of the endogenous variables, satisfying another assumption. Furthermore, all variables were measured on an interval scale. In addition to direct effects, this study considers several indirect effects, specifically the impact of formative assessment on academic engagement and procrastination through goal orientation. Goodness-of-fit indices were used to evaluate overall model fit and determine how well the conceptual model aligns with the data. This study utilized several indices, including the likelihood ratio chi-square (χ2), the ratio of χ2 to degrees of freedom (χ2/df), the goodness-of-fit index (GFI), the adjusted goodness-of-fit index (AGFI), the root mean square error of approximation (RMSEA) and the comparative fit index (CFI). Together, these indices provide a comprehensive assessment of model fit and of the strength of the relationships between the variables.
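The normality screen described above (sample skewness within −1 to 1) can be sketched as follows; the scores are illustrative, not the study’s data:

```python
# Screen a variable for approximate normality using the
# Fisher-Pearson skewness coefficient and the -1..1 rule of
# thumb applied in the text. Scores are fabricated examples.

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n   # population variance
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return m3 / s2 ** 1.5

scores = [40, 42, 38, 45, 41, 39, 36, 44, 40, 43]
sk = skewness(scores)
print(abs(sk) <= 1)  # True: within the -1..1 band
```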

Table 2 reports the demographic information of the participants. Of the 388 students participating in this study, 20% were under 20 years of age and 47% were in the 20–25 age group. 26% of the participants were married. 71% were undergraduate students, 21% were postgraduate students and 8% were PhD students. 51% were employed while studying. 32% of the participants studied in the School of Public Health, 37% in the School of Allied Medical Sciences and 31% in the School of Nursing and Midwifery. More details are shown in Table 2.

Table 1 shows the mean and standard deviation of the research variables. In addition, Pearson correlations are reported to determine the relationships among all variables included in the path model. The mean and standard deviation of academic well-being are 40.40 and 8.27, respectively. Academic well-being correlated 0.02 with formative assessment, 0.39 with self-efficacy, 0.28 with internal value, 0.26 with test anxiety, 0.19 with cognitive strategies, and 0.28 with self-regulation. Apart from the correlation between academic well-being and formative assessment, all correlation coefficients are significant at the 0.001 level.

The results of path analysis to investigate direct, indirect and total relationships are reported in Table  3 .

Table 3 shows the direct, indirect, and total effects for the relationships between the variables in the model. According to these results, formative assessment has a significant direct effect on mastery orientation (β = 0.28, P < 0.001), avoidance orientation (β = -0.26, P < 0.001), academic engagement (β = 0.12, P = 0.016) and academic procrastination (β = -0.29, P < 0.001). Mastery orientation (β = 0.35, P < 0.001), performance-approach orientation (β = -0.11, P = 0.013) and avoidance orientation (β = -0.12, P = 0.020) also have significant direct effects on academic engagement. Mastery orientation (β = -0.13, P = 0.006) has a significant direct effect on academic procrastination, whereas the direct effect of avoidance orientation (β = 0.03, P = 0.569) is not significant. These results largely confirm the direct-effect hypotheses. According to the results in Table 3, the indirect effects of formative assessment on academic engagement (0.12, P < 0.001) and academic procrastination (-0.09, P < 0.001) are also significant. These results confirm that goal orientation plays a mediating role in the relationship between formative assessment and both academic engagement and academic procrastination; in other words, part of the relationship between formative assessment and these outcomes occurs through changes in goal orientation.
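As a back-of-the-envelope check (not the authors' computation), an indirect effect in a path model is the sum, over mediators, of the products of the constituent standardized paths. Using the coefficients reported above for the formative assessment → engagement pathway:

```python
# Standardized direct paths taken from Table 3 of the text
paths = {
    ("formative", "mastery"): 0.28,
    ("formative", "avoidance"): -0.26,
    ("mastery", "engagement"): 0.35,
    ("avoidance", "engagement"): -0.12,
}

# Indirect effect = sum over mediators of (path into mediator) * (path out of mediator)
indirect = (
    paths[("formative", "mastery")] * paths[("mastery", "engagement")]
    + paths[("formative", "avoidance")] * paths[("avoidance", "engagement")]
)
print(f"indirect effect on engagement ≈ {indirect:.2f}")
```

The sum, 0.28 × 0.35 + (-0.26) × (-0.12) ≈ 0.13, is close to the reported indirect effect of 0.12; the small gap reflects rounding of the published coefficients.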

Standardized estimates (and t-values) for the relationships between variables are shown in Fig. 1.

Fig. 1. Standardized estimates (and t-values) for the relationships between variables

The goodness-of-fit indices reported in Table 4 show that the analyzed model has an acceptable fit (P-value = 0.60; χ2 = 0.99; df = 2; χ2/df = 0.49; RMSEA = 0.001; CFI = 0.99; AGFI = 0.99).
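To make the acceptability judgment explicit, a small sketch that screens the reported indices against conventional SEM cutoffs (χ2/df < 3, RMSEA < 0.06, CFI > 0.95, AGFI > 0.90 — common rules of thumb in the SEM literature, not thresholds stated by the authors):

```python
def fit_summary(chi2: float, df: int, rmsea: float, cfi: float, agfi: float) -> dict:
    """Screen common SEM fit indices against conventional cutoffs."""
    ratio = chi2 / df
    return {
        "chi2/df": (ratio, ratio < 3),    # ratios below ~3 indicate acceptable fit
        "RMSEA": (rmsea, rmsea < 0.06),   # close fit below 0.06
        "CFI": (cfi, cfi > 0.95),         # above 0.95 indicates good fit
        "AGFI": (agfi, agfi > 0.90),      # above 0.90 indicates good fit
    }

# Values reported in Table 4
for index, (value, ok) in fit_summary(0.99, 2, 0.001, 0.99, 0.99).items():
    print(f"{index}: {value:.3f}, acceptable: {ok}")
```

All four reported indices clear these cutoffs, which is consistent with the paper's conclusion of acceptable model fit.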

Goodness-of-fit indices are reported in Table 4.

This research investigates how formative assessment affects students’ academic engagement and procrastination, examining the role of goal orientation as a key intermediary. The findings, based on Structural Equation Modeling (SEM) analysis, show that the proposed model accurately represents the relationships between these variables, highlighting a significant connection between formative assessment, goal orientation, and both procrastination and engagement. Importantly, the analysis revealed that age and gender did not have a moderating effect on these relationships, thereby confirming the study’s initial hypotheses regarding the impact of formative assessment on student behavior.

The study’s first hypothesis, which proposed a link between formative assessment and academic procrastination via goal orientation, was supported by the findings. In essence, the results suggest that formative assessment influences procrastination indirectly through its impact on goal orientation. While this specific pathway has not been previously explored, the current study’s results align with and build upon existing research that has investigated the individual relationships within this pathway, providing new insights into the complex dynamics between formative assessment, goal orientation, and procrastination [53, 54]. According to Elliot and McGregor’s 2 × 2 model (2001), which informs the interpretation of the current results, there are four distinct types of goal orientation: mastery-approach, mastery-avoidance, performance-approach, and performance-avoidance goals [55], each characterized by distinct cognitive, behavioral, and emotional patterns. Research suggests that mastery and performance-approach goals are negatively correlated with procrastination, as they promote effective self-regulation. In contrast, performance-avoidance goals are positively linked to procrastination, as they involve self-regulatory processes that are incompatible with task engagement and instead foster avoidance behaviors [49]. According to Midgley and Urdan (2001), the structure of an individual’s goals serves as a potent cognitive framework that can significantly influence both their goal-setting and overall performance [56]. Other studies emphasize that formative assessment should focus not only on ultimate goals, but also on the methods and instruments used to achieve them.
By doing so, formative assessment becomes a vital tool for continuously improving and adapting educational programs and activities, ultimately ensuring they meet their desired outcomes [ 34 ]. Due to its dynamism and breadth [ 7 ], formative assessment can be leveraged to explore the relationship between goal orientation and academic procrastination, offering a nuanced understanding of how these factors intersect and influence one another.

The study’s second hypothesis, which posited that formative assessment is linked to academic engagement via goal orientation, was supported by the findings. This suggests that formative assessment can indeed influence academic engagement indirectly through its impact on goal orientation. While no prior research has directly examined this specific pathway, the current study’s results are consistent with existing literature on the individual relationships that comprise this path [ 57 , 58 ].

Research by Koh, Lim, and Habib (2010) found that formative assessment in Singapore has a positive impact on both teacher and student learning outcomes, largely due to the integration of professional development strategies into instructional planning [ 59 ]. Through formative assessment, teachers employ a range of assessment activities and strategies in the classroom to gain a comprehensive understanding of student learning. This information is then used to inform instruction, provide constructive feedback, and adjust teaching approaches. Students play an active role in this process, not only participating in learning activities but also using assessment data to set personal goals, make informed decisions about their learning, and develop a sense of self-efficacy in their academic pursuits.

Wuest and Fisette (2012) suggest that formative assessments serve as a valuable tool for teachers, providing insight into student learning and informing instructional planning for future lessons, thereby enabling teachers to adjust their teaching strategies and better meet the needs of their students [ 60 ]. According to Ritchhart, Church, and Morrison (2011), this educational approach empowers students to take ownership of their learning from the outset, entrusting them with responsibility for their own educational journey and fostering a sense of agency and autonomy in the learning process [ 61 ]. This approach enables students to actively construct their own understanding of the subject matter, work collaboratively with their peers, and progress towards more sophisticated knowledge and insights. As noted by Moss and Brookhart (2009), one key benefit of sharing learning objectives with students is that it allows them to engage in tasks that are explicitly aligned with those objectives, promoting a clear sense of direction and purpose in their learning [ 62 ].

According to Heritage (2008), when students are aware of the learning goals and criteria, they can transform from passive recipients of information to active participants in the learning process, taking a more engaged and invested role in their own education [ 63 ]. When introducing a new subject, it is crucial to clearly communicate the learning objectives, requirements, and criteria to students, ensuring they have a shared understanding of what is expected and what they will be working towards [ 64 , 65 ].

Taken together, the evidence suggests that formative assessment, when integrated with goal-oriented approaches, can have a profound impact on student learning, fostering increased academic engagement and transforming the learning process into a more vibrant and interactive experience.

Application of study results

Given the potential of formative assessment through goal orientation to address procrastination and enhance academic engagement, the findings of this study offer a valuable resource for policymakers and education officials at both local and national levels, providing a potential solution to improve the country’s education system and promote more effective learning outcomes. The aforementioned study offers a beacon of hope in this landscape, furnishing actionable insights that can catalyze tangible change. By identifying the nexus between reducing work procrastination and augmenting academic participation, the research underscores the pivotal role of proactive and disciplined work habits in driving student engagement. Through a comprehensive analysis of the study findings, educators and administrators in the medical education domain can glean invaluable strategies to curtail procrastination and invigorate student involvement. In addition to these insights, the study advocates for the integration of collaborative and interactive learning methodologies to enhance academic participation. By fostering an inclusive and participatory learning environment, educators can engender a sense of camaraderie and shared accountability among students, dissuading them from withdrawing into the quagmire of procrastination. Encouraging peer-to-peer interaction, group discussions, and collaborative projects can infuse the academic milieu with dynamism, propelling students to actively partake in the educational discourse.

The limitations of the study

One limitation of this study is its cross-sectional design: such studies capture information at a single point in time and have less predictive power than longitudinal studies. Also, although structural equation modelling was used, the relationships obtained are correlational rather than causal; given the statistical method and the cross-sectional design, causal interpretations are not appropriate for this type of study. In addition, the study was conducted on an Iranian sample, and given differences between educational systems, extreme caution should be exercised in generalizing these findings to other societies.

Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

CFI: Comparative fit index

AGFI: Adjusted goodness-of-fit index

RMSEA: Root mean square error of approximation

The classroom assessment approaches questionnaire

PASS: The Procrastination Assessment Scale – Students

Swanwick T. Understanding medical education. Understanding Medical Education: Evidence, Theory, and Practice. 2018:1–6.

Frank JR, Snell LS, Cate OT, Holmboe ES, Carraccio C, Swing SR, et al. Competency-based medical education: theory to practice. Med Teach. 2010;32(8):638–45.


Ferguson E, James D, Madeley L. Factors associated with success in medical school: systematic review of the literature. BMJ. 2002;324(7343):952–7.

Mansfield KJ, Peoples GE, Parker-Newlyn L, Skropeta D. Approaches to learning: does medical school attract students with the motivation to go deeper? Educ Sci. 2020;10(11):302.

Caverzagie KJ, Nousiainen MT, Ferguson PC, Ten Cate O, Ross S, Harris KA, et al. Overarching challenges to the implementation of competency-based medical education. Med Teach. 2017;39(6):588–93.

Kötter T, Wagner J, Brüheim L, Voltmer E. Perceived medical school stress of undergraduate medical students predicts academic performance: an observational study. BMC Med Educ. 2017;17(1):1–6.

Rastegar T. Evaluation at the service of education: new approaches in assessment and evaluation with emphasis on continuous assessment and dynamic and effective feedback to students in the learning process. Tehran: Ministry of Education, Cultural and precursor to training Institute[Persian]; 2003.


P. SH. Performance measurement in the teaching-learning process, Ministry of Education, Research Institute of Education. Conference on Reform Engineering in Education 2004.

Azizi F. Challenges and perspectives of medical education in Iran. The Quarterly Journal of School of Medicine, Shahid Beheshti University of Medical Sciences, Research in Medicine. 2015;39(1):1–3.

Gardner J, editor. Assessment and learning. Sage; 2012.

Voinea L. Formative assessment as assessment for learning development. Revista De Pedagogie. 2018;66(1):7–23.

Viegas C, Alves G, Lima N, editors. Formative assessment diversity to foster students engagement. 2015 International Conference on Interactive Collaborative Learning (ICL); 2015: IEEE.

Barana A, Marchisio M, Rabellino S, editors. Empowering engagement through automatic formative assessment. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC); 2019: IEEE.

Salas Vicente F, Escuder ÁV, Pérez Puig MÁ, Segovia López F. Effect on procrastination and learning of mistakes in the design of the formative and summative assessments: a case study. Educ Sci. 2021;11(8):428.

Group AR. Assessment for learning: 10 principles. Research-based principles to guide classroom practice. London: Assessment Reform Group; 2002.

Moss CM, Brookhart SM. Advancing formative assessment in every classroom: a guide for instructional leaders. ASCD; 2019.

Wafubwa RN. Role of formative assessment in improving students’ motivation, engagement, and achievement: a systematic review of literature. Int J Assess Evaluation. 2020;28(1):17–31.

Rolfe I, McPherson J. Formative assessment: how am I doing? Lancet. 1995;345(8953):837–9.

Iahad N, Dafoulas GA, Kalaitzakis E, Macaulay LA, editors. Evaluation of online assessment: The role of feedback in learner-centered e-learning. 37th Annual Hawaii International Conference on System Sciences, 2004 Proceedings of the; 2004: IEEE.

Dobson JL. The use of formative online quizzes to enhance class preparation and scores on summative exams. Adv Physiol Educ. 2008;32(4):297–302.

Rauf A, Shamim MS, Aly SM, Chundrigar T, Alam SN. Formative assessment in undergraduate medical education: concept, implementation and hurdles. J Pak Med Assoc. 2014;64(64):72–5.

Clark I. Formative assessment: Assessment is for self-regulated learning. Educational Psychol Rev. 2012;24:205–49.

Azizian F, Ramak N, Vahid F, Rezaee S, Sangani A. The effectiveness of cognitive simulation techniques group training on academic engagement and academic procrastination in nursing students. J Nurs Educ. 2020;9(2):54–62.

Moonaghi HK, Beydokhti TB. Academic procrastination and its characteristics: a narrative review. Future Med Educ J. 2017;7(2).

Steel P. The nature of procrastination: a meta-analytic and theoretical review of quintessential self-regulatory failure. Psychol Bull. 2007;133(1):65.

Hayat AA, Jahanian M, Bazrafcan L, Shokrpour N. Prevalence of academic procrastination among medical students and its relationship with their academic achievement. Shiraz E-Medical J. 2020;21(7).

Patrzek J, Grunschel C, Fries S. Academic procrastination: the perspective of university counsellors. Int J Advancement Counselling. 2012;34:185–201.

Truta C, Parv L, Topala I. Academic engagement and intention to drop out: levers for sustainability in higher education. Sustainability. 2018;10(12):4637.

Perkmann M, Salandra R, Tartari V, McKelvey M, Hughes A. Academic engagement: a review of the literature 2011–2019. Res Policy. 2021;50(1):104114.

Fredricks JA, Blumenfeld PC, Paris AH. School engagement: potential of the concept, state of the evidence. Rev Educ Res. 2004;74(1):59–109.

Fredricks JA, Filsecker M, Lawson MA. Student engagement, context, and adjustment: addressing definitional, measurement, and methodological issues. Elsevier; 2016. pp. 1–4.

Ariani DW, Susilo YS. Why do it later? Goal orientation, self-efficacy, test anxiety, on procrastination. J Educational Cult Psychol Stud (ECPS Journal). 2018(17):45–73.

Miller AL, Fassett KT, Palmer DL. Achievement goal orientation: a predictor of student engagement in higher education. Motivation Emot. 2021;45:327–44.

Kaplan A, Maehr ML. The contributions and prospects of goal orientation theory. Educational Psychol Rev. 2007;19:141–84.

Arias JdlF. Recent perspectives in the study of motivation: goal orientation theory. Electron J Res Educational Psychol. 2004;2(1):35–62.

Cumming JHC. The relationship between goal orientation and self-efficacy for exercise. J Appl Soc Psychol. 2004;34(4):747–63.

Senko C, Durik AM, Patel L, Lovejoy CM, Valentiner D. Performance-approach goal effects on achievement under low versus high challenge conditions. Learn Instruction. 2013;23:60–8.

Karlen Y, Suter F, Hirt C, Merki KM. The role of implicit theories in students’ grit, achievement goals, intrinsic and extrinsic motivation, and achievement in the context of a long-term challenging task. Learn Individual Differences. 2019;74:101757.

Lee YJ, Anderman EM. Profiles of perfectionism and their relations to educational outcomes in college students: the moderating role of achievement goals. Learn Individual Differences. 2020;77:101813.

Elliot AJ, Dweck CS. Handbook of competence and motivation. Guilford; 2013.

Cook DA, Artino AR Jr. Motivation to learn: an overview of contemporary theories. Med Educ. 2016;50(10):997–1014.

Daniels L, Daniels V. Internal medicine residents’ achievement goals and efficacy, emotions, and assessments. Can Med Educ J. 2018;9(4):e59.

Yousefi Afrashteh M, et al. The relationship between goal orientation and academic achievement: a meta-analysis. Quarterly of Educational Psychology, Allameh Tabataba’i University. 2019;15(51):71–9.

Bairami MHT, Abdullahi AA, Alaei P. Prediction of learning strategies, self-efficacy and academic progress based on goals the progress of second year high school students in Tabriz city. New Educational Ideas. 2011;7(1):65–86.

Kline RB. Structural equation modeling. New York: Guilford; 1998.

Yousefi afrashteh M SLaRA. The relationship between classroom assessment methods and students’ learning approaches and their preferences machinery. Educ Meas. 2015;5(17).

M YA. The Relationship Between formative assessment with Academic engagement and Using Metacognitive Strategies in Medical Students. educational strategies. 2019.

Solomon LJ, Rothblum ED. Academic procrastination: frequency and cognitive-behavioral correlates. J Couns Psychol. 1984;31(4):503.

Jokar B, Delavarpour M. The relationship between educational procrastination and achievement goals. J New Thoughts Educ. 2007;3(3):61–80.

Midgley C, Kaplan A, Middleton M, Maehr ML, Urdan T, Anderman LH, et al. The development and validation of scales assessing students’ achievement goal orientations. Contemp Educ Psychol. 1998;23(2):113–31.

Khademi M, Noshadi N. The relationship between goal orientation and learning self-regulation and academic achievement in Shiraz Pre-university students. J Social Hum Sci Shiraz Univ. 2006;49:63–78.

Gunuc S, Kuzu A. Student engagement scale: development, reliability and validity. Assess Evaluation High Educ. 2015;40(4):587–610.

Norouzi N, Mohammadipour M, Mehdian H. Relationship between goal orientation and academic procrastination with academic burnout with emphasis on the mediating role of academic self-regulation in nursing students. Iran J Nurs Res. 2021;16(2):69–78.

Hashemi Razini HM, Shiri SM. The relationship between the orientation of progress goals and motivational beliefs with procrastination and academic self-handicapping of students. Educational School Stud Q. 2021;3(11):2–15.

Elliot AJ, McGregor HA. A 2× 2 achievement goal framework. J Personal Soc Psychol. 2001;80(3):501.

Midgley C, Urdan T. Academic self-handicapping and achievement goals: a further examination. Contemp Educ Psychol. 2001;26(1):61–75.

Sepasi H. Investigating the impact of formative assessment on the academic progress of third grade middle school students in mathematics. J Daneshvar Rahvar. 2003(3).

Jenkins CE. The relationship between formative assessment and student engagement at Walters State Community College. East Tennessee State University; 2010.

Koh K, Lim L, Habib M, editors. Building teachers’ capacity in classroom-based formative Assessment. 36th International Association for Educational Assessment (IAEA) Annual Conference, Assessment for the Future Generations Bangkok, Thailand(August 2010) http://www.iaea.info/documents/paper_4d520f18 pdf (accessed August 2015); 2010.

Fisette JL, Franck MD. How teachers can use PE metrics for formative assessment. J Phys Educ Recreation Dance. 2012;83(5):23–34.

Ritchhart R, Church M, Morrison K. Making thinking visible: how to promote engagement, understanding, and independence for all learners. Wiley; 2011.

Brookhart S, Moss C, Long B. Promoting student ownership of learning through high-impact formative assessment practices. J MultiDisciplinary Evaluation. 2009;6(12):52–67.

Heritage M. Learning progressions: Supporting instruction and formative assessment. 2008.

Gioka O. Assessment for learning in biology lessons. J Biol Educ. 2007;41(3):113–6.

Häggström J, Boswood A, O’Grady M, Jöns O, Smith S, Swift S, et al. Longitudinal analysis of quality of life, clinical, radiographic, echocardiographic, and laboratory variables in dogs with myxomatous mitral valve disease receiving pimobendan or benazepril: the QUEST study. J Vet Intern Med. 2013;27(6):1441–51.


Acknowledgements

We sincerely thank the students of Zanjan University of Medical Sciences for their participation in this study.

Funding

The authors received no specific funding for this work.

Author information

Authors and affiliations.

Department of Psychology, Faculty of Humanities, University of Zanjan, Zanjan, Iran

Majid Yousefi Afrashteh

Cardiovascular Research Center, Health Institute, Imam Ali Hospital, Kermanshah University of Medical Sciences, Kermanshah, Iran

Parisa Janjani


Contributions

MYA conceived and designed the research; MYA collected, organized and analyzed the data; PJ and MYA wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Majid Yousefi Afrashteh .

Ethics declarations

Ethics approval and consent to participate.

Ethical approval for the study was obtained from the Ethics Committee of Kermanshah University of Medical Sciences; the ethics code allocated to this study is IR.KUMS.REC.1401.537. Written informed consent was obtained from all subjects after a clear explanation of the study objectives and an assurance of data confidentiality. All methods were performed in accordance with the relevant guidelines and regulations.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Yousefi Afrashteh, M., Janjani, P. The mediating role of goal orientation in the relationship between formative assessment with academic engagement and procrastination in medical students. BMC Med Educ 24 , 1036 (2024). https://doi.org/10.1186/s12909-024-05965-3


Received : 12 September 2023

Accepted : 28 August 2024

Published : 27 September 2024

DOI : https://doi.org/10.1186/s12909-024-05965-3


  • Goal orientation
  • Formative assessment
  • Academic engagement
  • Academic procrastination
  • Medical students

BMC Medical Education

ISSN: 1472-6920

what is data analysis in academic research

IMAGES

  1. What is Data Analysis in Research

    what is data analysis in academic research

  2. Standard statistical tools in research and data analysis

    what is data analysis in academic research

  3. Why Data Analysis is a Crucial Element in Academic Research

    what is data analysis in academic research

  4. Data Analysis in Research: Types & Methods

    what is data analysis in academic research

  5. The Best 19 Data Analysis In Research

    what is data analysis in academic research

  6. What Is the Data Analysis Process? (A Complete Guide)

    what is data analysis in academic research

VIDEO

  1. A very brief Introduction to Data Analysis (part 1)

  2. T-Curve in Ph.D. #tcurve #labtechstudio #labtech #phdadmissions

  3. Showcase of IBM SPSS Workflow for PhD Research in Public Health

  4. Data Analysis in Research

  5. Arturo Medina explains how to structure your Dissertation Research

  6. Power of Bibliometric Analysis

COMMENTS

  1. Data Analysis in Research: Types & Methods

    Definition of research in data analysis: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments, which makes sense. Three essential things occur during the data ...

  2. A practical guide to data analysis in general literature reviews

    This article is a practical guide to conducting data analysis in general literature reviews. The general literature review is a synthesis and analysis of published research on a relevant clinical issue, and is a common format for academic theses at the bachelor's and master's levels in nursing, physiotherapy, occupational therapy, public health and other related fields.

  3. Introduction to Data Analysis

    Data analysis can be quantitative, qualitative, or mixed methods. Quantitative research typically involves numbers and "close-ended questions and responses" (Creswell & Creswell, 2018, p. 3).Quantitative research tests variables against objective theories, usually measured and collected on instruments and analyzed using statistical procedures (Creswell & Creswell, 2018, p. 4).

  4. Data analysis

    data analysis, the process of systematically collecting, cleaning, transforming, describing, modeling, and interpreting data, generally employing statistical techniques. Data analysis is an important part of both scientific research and business, where demand has grown in recent years for data-driven decision making.

  5. An Overview of Data Analysis and Interpretations in Research

    Research is a scientific field which helps to generate new knowledge and solve the existing problem. So, data analysis is the cru cial part of research which makes the result of the stu dy more ...

  6. What is Data Analysis? An Expert Guide With Examples

    Data analysis is a comprehensive method of inspecting, cleansing, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It is a multifaceted process involving various techniques and methodologies to interpret data from various sources in different formats, both structured and unstructured.

  7. Data Analysis

    Data Analysis. Definition: Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, drawing conclusions, and supporting decision-making. It involves applying various statistical and computational techniques to interpret and derive insights from large datasets.

  8. Data Analysis Techniques in Research

    Data analysis techniques in research are essential because they allow researchers to derive meaningful insights from data sets to support their hypotheses or research objectives. ... Perform Regression Analysis to assess the relationship between the time spent on online platforms and academic performance. 3) Predictive Analysis:

  9. Creating a Data Analysis Plan: What to Consider When Choosing

    For those interested in conducting qualitative research, previous articles in this Research Primer series have provided information on the design and analysis of such studies. 2, 3 Information in the current article is divided into 3 main sections: an overview of terms and concepts used in data analysis, a review of common methods used to ...

  10. Introduction to Research Statistical Analysis: An Overview of the

    Introduction. Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology.

  11. Principles for data analysis workflows

    A systematic and reproducible "workflow"—the process that moves a scientific investigation from raw data to coherent research question to insightful contribution—should be a fundamental part of academic data-intensive research practice. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining 3 phases: the Explore, Refine, and Produce Phases ...

  12. (PDF) Different Types of Data Analysis; Data Analysis Methods and

    Data analysis is simply the process of converting the gathered data to meaningful information. Different techniques such as modeling to reach trends, relationships, and therefore conclusions to ...

  13. Types of data analysis

    Concept map analysis; Discourse or narrative analysis; Grounded theory; Phenomenological analysis or interpretative phenomenological analysis (IPA); Further reading and resources. As a starting point for most of these, we would recommend the relevant chapter from Part 5 of Cohen, Manion and Morrison (2018), Research Methods in Education.

  14. Different Types of Data Analysis; Data Analysis Methods and Techniques

    Different Types of Data Analysis; Data Analysis Methods and Techniques in Research Projects. Hamed Taherdoost. International Journal of Academic Research in Management, Volume 9, Issue 1, 2020, ISSN: 2296-1747, www.elvedit.com.

  15. Research Methods

    Statistical analysis (quantitative): to analyze data collected in a statistically valid manner (e.g. from experiments, surveys, and observations). Meta-analysis (quantitative): to statistically analyze the results of a large collection of studies; can only be applied to studies that collected data in a statistically valid manner. Thematic analysis ...

  16. Learning to Do Qualitative Data Analysis: A Starting Point

    Yonjoo Cho is an associate professor of Instructional Systems Technology focusing on human resource development (HRD) at Indiana University. Her research interests include action learning in organizations, international HRD, and women in leadership. She serves as an associate editor of Human Resource Development Review and served as a board member of the Academy of Human Resource Development ...

  17. Data Science and Analytics: An Overview from Data-Driven Smart

    Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods . This examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct ...
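A five-number summary is one common way to summarize a dataset's key characteristics during exploratory data analysis. A sketch using only Python's standard library, on an invented dataset:

```python
import statistics

# Five-number summary of a hypothetical dataset (values invented).
data = [4, 7, 2, 9, 3, 8, 6, 5, 1, 10]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
print(min(data), q1, q2, q3, max(data))
```

Visual methods such as box plots present exactly these five numbers graphically, which is why this summary is a typical first step before any modeling.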

  18. Considerations/issues in data analysis

    Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data. According to Shamoo and Resnik (2003) various analytic procedures "provide a way of drawing inductive inferences from data and distinguishing the signal (the phenomenon of interest) from the noise (statistical fluctuations) present ...

  19. Research Data

    Some common research data analysis methods include: Descriptive statistics: Descriptive statistics involve summarizing and describing the main features of a dataset, such as the mean, median, and standard deviation. Descriptive statistics are often used to provide an initial overview of the data. ... Academic research: Research data is widely ...
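The descriptive statistics named above (mean, median, standard deviation) can be computed directly with Python's standard library; the exam scores below are hypothetical:

```python
import statistics

# Hypothetical exam scores, invented to illustrate descriptive statistics.
scores = [72, 85, 78, 90, 85, 66, 81]

mean = statistics.mean(scores)      # central tendency
median = statistics.median(scores)  # middle value, robust to outliers
spread = statistics.stdev(scores)   # sample standard deviation

print(round(mean, 2), median, round(spread, 2))
```
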

  20. Qualitative Data Analysis

    Qualitative analysis—the analysis of textual, visual, or audio data—covers a spectrum from confirmation to exploration. Qualitative studies can be directed by a conceptual framework, suggesting, in part, a deductive thrust, or driven more by the data itself, suggesting an inductive process. Generic or basic qualitative research refers to an ...

  21. Secondary Data Analysis: Using existing data to answer new questions

    Secondary data analysis is a valuable research approach that can be used to advance knowledge across many disciplines through the use of quantitative, qualitative, or mixed methods data to answer new research questions (Polit & Beck, 2021).This research method dates to the 1960s and involves the utilization of existing or primary data, originally collected for a variety, diverse, or assorted ...

  22. PDF An Overview of Data Analysis and Interpretations in Research

    This procedure is referred to as tabulation. Thus, tabulation is the process of summarizing raw data and displaying the same in compact form (i.e., in the form of statistical tables) for further analysis. In a broader sense, tabulation is an orderly arrangement of data in columns and rows.
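Tabulation as described here, condensing raw data into a compact table, can be sketched with Python's standard library; the survey responses are invented for the example:

```python
from collections import Counter

# Hypothetical survey responses, invented for illustration.
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# Tabulate: condense the raw responses into a compact frequency table.
table = Counter(responses)
for category, count in sorted(table.items()):
    print(f"{category:<10}{count}")
```

A two-way (cross) tabulation follows the same idea with a pair of categories as the key, giving the rows-and-columns arrangement the passage describes.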

  23. Accelerating Research with AI

    But once the data set is ready, AI can help the researcher perform statistical analysis on that data. Conducting Attitudinal Studies (like Interviews): While generative AI tools can't handle behavioral data yet, they do much better with self-reported or attitudinal data gathered through methods like interviews, diary studies, and surveys ...

  24. Voices of Women Exposed to Honour-Based Violence: On ...

    Because the study concerns vulnerable people, great care was taken to adhere to the ethical principles guiding research in the humanities and social sciences (Swedish Research Council, 2017). Data analysis: Interview data were analysed using Qualitative Content Analysis (Graneheim and Lundman, 2004; Graneheim et al., 2017). The analysis involved ...

  25. The mediating role of goal orientation in the relationship between

    Background: Academic involvement and academic procrastination are two behavioral variables and are among the challenges of higher education, especially medical education. The purpose of the current research is to investigate the mediating role of goal orientation in the relationship between formative assessment and academic engagement and procrastination in Iranian medical students. Methods ...