What is data analysis? Examples and how to get started

Even with years of professional experience working with data, the term "data analysis" still sets off a panic button in my soul. And yes, when it comes to serious data analysis for your business, you'll eventually want data scientists on your side. But if you're just getting started, no panic attacks are required.

Table of contents:

  • Quick review: What is data analysis?
  • Why is data analysis important?
  • Types of data analysis (with examples)
  • Data analysis process: How to get started
  • Frequently asked questions


Data analysis is the process of examining, filtering, adapting, and modeling data to help solve problems. Data analysis helps determine what is and isn't working, so you can make the changes needed to achieve your business goals. 

Keep in mind that data analysis includes analyzing both quantitative data (e.g., profits and sales) and qualitative data (e.g., surveys and case studies) to paint the whole picture. Here are two simple examples (of a nuanced topic) to show you what I mean.

An example of quantitative data analysis is an online jewelry store owner using inventory data to forecast demand and improve reordering accuracy. The owner looks at their sales from the past six months and sees that, on average, they sold 210 gold pieces and 105 silver pieces per month, but they only kept 100 of each in stock. The next time they order inventory, they order twice as many gold pieces as silver to meet customer demand.

An example of qualitative data analysis is a fitness studio owner collecting customer feedback to improve class offerings. The studio owner sends out an open-ended survey asking customers what types of exercises they enjoy the most. The owner then performs qualitative content analysis to identify the most frequently suggested exercises and incorporates these into future workout classes.

Here's why it's worth implementing data analysis for your business:

Understand your target audience: You might think you know how to best target your audience, but are your assumptions backed by data? Data analysis can help answer questions like, "What demographics define my target audience?" or "What is my audience motivated by?"

Inform decisions: You don't need to toss and turn over a decision when the data points clearly to the answer. For instance, a restaurant could analyze which dishes on the menu are selling the most, helping them decide which ones to keep and which ones to change.

Adjust budgets: Similarly, data analysis can highlight areas in your business that are performing well and are worth investing more in, as well as areas that aren't generating enough revenue and should be cut. For example, a B2B software company might discover their product for enterprises is thriving while their small business solution lags behind. This discovery could prompt them to allocate more budget toward the enterprise product, resulting in better resource utilization.

Identify and solve problems: Let's say a cell phone manufacturer notices data showing a lot of customers returning a certain model. When they investigate, they find that model also happens to have the highest number of crashes. Once they identify and solve the technical issue, they can reduce the number of returns.

There are five main types of data analysis—with increasingly scary-sounding names. Each one serves a different purpose, so take a look to see which makes the most sense for your situation. It's ok if you can't pronounce the one you choose. 

Types of data analysis including text analysis, statistical analysis, diagnostic analysis, predictive analysis, and prescriptive analysis.

Text analysis: What is happening?

Text analysis, AKA data mining, involves pulling insights from large amounts of unstructured, text-based data sources: emails, social media, support tickets, reviews, and so on. You would use text analysis when the volume of data is too large to sift through manually.

Here are a few methods used to perform text analysis, to give you a sense of how it's different from a human reading through the text: 

Word frequency identifies the most frequently used words. For example, a restaurant monitors social media mentions and measures the frequency of positive and negative keywords like "delicious" or "expensive" to determine how customers feel about their experience. 

Language detection indicates the language of text. For example, a global software company may use language detection on support tickets to connect customers with the appropriate agent. 

Keyword extraction automatically identifies the most used terms. For example, instead of sifting through thousands of reviews, a popular brand uses a keyword extractor to summarize the words or phrases that are most relevant. 
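The word-frequency method above is easy to sketch in code. Here's a minimal Python example; the reviews and tracked keywords are invented for illustration:

```python
from collections import Counter
import re

def word_frequency(reviews, keywords):
    """Count how often each tracked keyword appears across reviews."""
    words = []
    for review in reviews:
        words.extend(re.findall(r"[a-z']+", review.lower()))
    counts = Counter(words)
    return {kw: counts[kw] for kw in keywords}

reviews = [
    "Delicious food but expensive.",
    "So delicious, we'll be back!",
    "A bit expensive for the portion size.",
]
print(word_frequency(reviews, ["delicious", "expensive"]))
# {'delicious': 2, 'expensive': 2}
```

Real text-analysis tools layer NLP on top of counts like these, but the core idea is the same: reduce a pile of text to numbers you can compare.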

Because text analysis is based on words, not numbers, it's a bit more subjective. Words can have multiple meanings, of course, and Gen Z makes things even tougher with constant coinage. Natural language processing (NLP) software will help you get the most accurate text analysis, but it's rarely as objective as numerical analysis. 

Statistical analysis: What happened?

Statistical analysis draws on past data to identify meaningful trends. It falls into two primary categories: descriptive and inferential.

Descriptive analysis

Descriptive analysis looks at numerical data and calculations to determine what happened in a business. Companies use descriptive analysis to determine customer satisfaction , track campaigns, generate reports, and evaluate performance. 

Here are a few methods used to perform descriptive analysis: 

Measures of frequency identify how frequently an event occurs. For example, a popular coffee chain sends out a survey asking customers what their favorite holiday drink is and uses measures of frequency to determine how often a particular drink is selected. 

Measures of central tendency use mean, median, and mode to identify results. For example, a dating app company might use measures of central tendency to determine the average age of its users.

Measures of dispersion measure how data is distributed across a range. For example, HR may use measures of dispersion to determine what salary to offer in a given field. 
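The measures above are all one-liners in Python's standard library. A quick sketch, using hypothetical dating-app user ages:

```python
import statistics

ages = [22, 25, 25, 28, 31, 34, 40]  # hypothetical dating-app user ages

mean = statistics.mean(ages)      # central tendency: average age
median = statistics.median(ages)  # central tendency: middle value
mode = statistics.mode(ages)      # central tendency: most common value
spread = statistics.stdev(ages)   # dispersion: how widely ages vary

print(f"mean={mean:.1f}, median={median}, mode={mode}, stdev={spread:.1f}")
```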

Inferential analysis

Inferential analysis uses a sample of data to draw conclusions about a much larger population. This type of analysis is used when the population you're interested in analyzing is very large. 

Here are a few methods used when performing inferential analysis: 

Hypothesis testing identifies which variables impact a particular topic. For example, a business uses hypothesis testing to determine if increased sales were the result of a specific marketing campaign. 

Confidence intervals indicate how accurate an estimate is likely to be. For example, a company using market research to survey customers about a new product may want to determine how confident they are that the individuals surveyed make up their target market. 

Regression analysis shows the effect of independent variables on a dependent variable. For example, a rental car company may use regression analysis to determine the relationship between wait times and number of bad reviews. 
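As a rough illustration of regression analysis, here's an ordinary-least-squares fit in plain Python. The wait times and review counts are invented for the example:

```python
def least_squares(x, y):
    """Fit y = slope*x + intercept by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: average wait time (minutes) vs. bad reviews that week
wait_minutes = [5, 10, 15, 20, 25]
bad_reviews = [1, 2, 4, 5, 8]

slope, intercept = least_squares(wait_minutes, bad_reviews)
print(f"each extra minute of waiting adds ~{slope:.2f} bad reviews")
```

A positive slope suggests longer waits are associated with more bad reviews; a statistician would also check how confident you can be in that slope before acting on it.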

Diagnostic analysis: Why did it happen?

Diagnostic analysis, also referred to as root cause analysis, uncovers the causes of certain events or results. 

Here are a few methods used to perform diagnostic analysis: 

Time-series analysis analyzes data collected over a period of time. A retail store may use time-series analysis to determine that sales increase between October and December every year. 

Data drilling uses business intelligence (BI) to show a more detailed view of data. For example, a business owner could use data drilling to see a detailed view of sales by state to determine if certain regions are driving increased sales.

Correlation analysis determines the strength of the relationship between variables. For example, a local ice cream shop may determine that as the temperature in the area rises, so do ice cream sales. 
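Correlation analysis usually boils down to computing a correlation coefficient. Here's a small Python sketch of the Pearson correlation using made-up temperature and sales figures; a value near 1 indicates a strong positive relationship:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical daily temperatures (°F) and ice cream sales (units)
temps_f = [60, 65, 70, 75, 80, 85]
ice_cream_sales = [110, 130, 155, 170, 200, 220]

r = pearson(temps_f, ice_cream_sales)
print(f"r = {r:.2f}")  # close to 1: sales rise with temperature
```

Remember the caveat from the interpretation section below: a high correlation alone doesn't prove that hot weather *causes* the extra sales.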

Predictive analysis: What is likely to happen?

Predictive analysis aims to anticipate future developments and events. By analyzing past data, companies can predict future scenarios and make strategic decisions.  

Here are a few methods used to perform predictive analysis: 

Machine learning uses AI and algorithms to predict outcomes. For example, ecommerce sites employ machine learning to recommend products that online shoppers are likely to buy based on their browsing history. 

Decision trees map out possible courses of action and outcomes. For example, a business may use a decision tree when deciding whether to downsize or expand. 

Prescriptive analysis: What action should we take?

The highest level of analysis, prescriptive analysis, aims to find the best action plan. Typically, AI tools model different outcomes to predict the best approach. While these tools serve to provide insight, they don't replace human consideration, so always use your human brain before going with the conclusion of your prescriptive analysis. Otherwise, your GPS might drive you into a lake.

Here are a few methods used to perform prescriptive analysis: 

Lead scoring is used in sales departments to assign values to leads based on their perceived interest. For example, a sales team uses lead scoring to rank leads on a scale of 1-100 depending on the actions they take (e.g., opening an email or downloading an eBook). They then prioritize the leads that are most likely to convert. 
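A rule-based lead-scoring model like the one described can be sketched in a few lines of Python. The point values and action names here are hypothetical; real scoring models are tuned to the business:

```python
# Hypothetical point values per action; real models weight these from data.
ACTION_POINTS = {
    "opened_email": 5,
    "downloaded_ebook": 15,
    "visited_pricing_page": 25,
    "requested_demo": 40,
}

def score_lead(actions):
    """Sum points for a lead's actions, capped at 100."""
    return min(100, sum(ACTION_POINTS.get(a, 0) for a in actions))

leads = {
    "lead_a": ["opened_email"],
    "lead_b": ["opened_email", "downloaded_ebook", "requested_demo"],
}

# Rank leads so the team works the hottest ones first
ranked = sorted(leads, key=lambda l: score_lead(leads[l]), reverse=True)
print(ranked)
```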

Algorithms are used in technology to perform specific tasks. For example, banks use prescriptive algorithms to monitor customers' spending and recommend that they deactivate their credit card if fraud is suspected. 

The actual analysis is just one step in a much bigger process of using data to move your business forward. Here's a quick look at all the steps you need to take to make sure you're making informed decisions. 

Circle chart with data decision, data collection, data cleaning, data analysis, data interpretation, and data visualization.

Data decision

As with almost any project, the first step is to determine what problem you're trying to solve through data analysis. 

Make sure you get specific here. For example, a food delivery service may want to understand why customers are canceling their subscriptions. But to enable the most effective data analysis, they should pose a more targeted question, such as "How can we reduce customer churn without raising costs?" 

These questions will help you determine your KPIs and what type(s) of data analysis you'll conduct , so spend time honing the question—otherwise your analysis won't provide the actionable insights you want.

Data collection

Next, collect the required data from both internal and external sources. 

Internal data comes from within your business (think CRM software, internal reports, and archives), and helps you understand your business and processes.

External data originates from outside of the company (surveys, questionnaires, public data) and helps you understand your industry and your customers. 

You'll rely heavily on software for this part of the process. Your analytics or business dashboard tool, along with reports from any other internal tools like CRMs , will give you the internal data. For external data, you'll use survey apps and other data collection tools to get the information you need.

Data cleaning

Data can be seriously misleading if it's not clean. So before you analyze, make sure you review the data you collected.  Depending on the type of data you have, cleanup will look different, but it might include: 

Removing unnecessary information 

Addressing structural errors like misspellings

Deleting duplicates

Trimming whitespace

Human checking for accuracy 

You can use your spreadsheet's cleanup suggestions to quickly and effectively clean data, but a human review is always important.
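Several of the cleanup steps above (trimming whitespace, deleting duplicates, dropping empty rows) can also be automated with a short script. A minimal Python sketch over made-up contact rows:

```python
def clean_rows(rows):
    """Trim whitespace, drop empty rows, and remove duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for row in rows:
        row = tuple(cell.strip() for cell in row)  # trim whitespace
        if not any(row) or row in seen:            # skip empty rows and duplicates
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned

raw = [
    ("Ada Lovelace ", " ada@example.com"),
    ("Ada Lovelace", "ada@example.com"),  # duplicate once trimmed
    ("", ""),                             # empty row
    ("Grace Hopper", "grace@example.com"),
]
print(clean_rows(raw))
```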

Data analysis

Now that you've compiled and cleaned the data, use one or more of the above types of data analysis to find relationships, patterns, and trends. 

Data analysis tools can speed up the data analysis process and reduce the risk of human error. Here are some examples:

Spreadsheets sort, filter, analyze, and visualize data. 

Business intelligence platforms model data and create dashboards. 

Structured query language (SQL) tools manage and extract data in relational databases. 
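To make the SQL idea concrete, here's a small self-contained example using Python's built-in sqlite3 module to answer the earlier restaurant question (which dishes sell the most). The table and figures are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (dish TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pasta", 40), ("salad", 12), ("pasta", 35), ("soup", 8)],
)

# Group and rank: best-selling dishes first
top = conn.execute(
    "SELECT dish, SUM(qty) AS total FROM sales GROUP BY dish ORDER BY total DESC"
).fetchall()
print(top)
```

The same `GROUP BY`/`ORDER BY` pattern works against a production relational database; only the connection line changes.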

Data interpretation

After you analyze the data, you'll need to go back to the original question you posed and draw conclusions from your findings. Here are some common pitfalls to avoid:

Correlation vs. causation: Just because two variables are associated doesn't mean they're necessarily related or dependent on one another. 

Confirmation bias: This occurs when you interpret data in a way that confirms your own preconceived notions. To avoid this, have multiple people interpret the data. 

Small sample size: If your sample size is too small or doesn't represent the demographics of your customers, you may get misleading results. If you run into this, consider widening your sample size to give you a more accurate representation. 

Data visualization

Last but not least, visualizing the data in the form of graphs, maps, reports, charts, and dashboards can help you explain your findings to decision-makers and stakeholders. While it's not absolutely necessary, it will help tell the story of your data in a way that everyone in the business can understand and make decisions based on. 

Automate your data collection

Data doesn't live in one place. To make sure data is where it needs to be—and isn't duplicative or conflicting—make sure all your apps talk to each other. Zapier automates the process of moving data from one place to another, so you can focus on the work that matters to move your business forward.

Need a quick summary or still have a few nagging data analysis questions? I'm here for you.

What are the five types of data analysis?

The five types of data analysis are text analysis, statistical analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Each type offers a unique lens for understanding data: text analysis provides insights into text-based content, statistical analysis focuses on numerical trends, diagnostic analysis looks into problem causes, predictive analysis deals with what may happen in the future, and prescriptive analysis gives actionable recommendations.

What is the data analysis process?

The data analysis process involves data decision, collection, cleaning, analysis, interpretation, and visualization. Every stage comes together to transform raw data into meaningful insights. Decision determines what data to collect, collection gathers the relevant information, cleaning ensures accuracy, analysis uncovers patterns, interpretation assigns meaning, and visualization presents the insights.

What is the main purpose of data analysis?

In business, the main purpose of data analysis is to uncover patterns, trends, and anomalies, and then use that information to make decisions, solve problems, and reach your business goals.

Related reading: 

How to get started with data collection and analytics at your business

How to conduct your own market research survey

Automatically find and match related data across apps

How to build an analysis assistant with ChatGPT

What can the ChatGPT data analysis chatbot do?

This article was originally published in October 2022 and has since been updated with contributions from Cecilia Gillen. The most recent update was in September 2023.


Shea Stevens

Shea is a content writer currently living in Charlotte, North Carolina. After graduating with a degree in Marketing from East Carolina University, she joined the digital marketing industry focusing on content and social media. In her free time, you can find Shea visiting her local farmers market, attending a country music concert, or planning her next adventure.


Data Analytics with R

1 Problem solving with data

1.1 Introduction

This chapter will introduce you to a general approach to solving problems and answering questions using data. Throughout the rest of the module, we will reference back to this chapter as you work your way through your own data analysis exercises.

The approach is applicable to actuaries, data scientists, general data analysts, or anyone who intends to critically analyze data and develop insights from data.

This framework, which some may refer to as The Data Science Process, includes the following five main components:

  • Data Collection
  • Data Cleaning
  • Exploratory Data Analysis
  • Model Building
  • Inference and Communication

Note that all five steps may not be applicable in every situation, but these steps should guide you as you think about how to approach each analysis you perform.

In the subsections below, we’ll dive into each of these in more detail.

1.2 Data Collection

In order to solve a problem or answer a question using data, you obviously need some data to start with. That data may already exist, or you may need to generate it (think surveys). As an actuary, your data will often come from pre-existing sources within your company: queries against databases or APIs, Excel files, text files, and so on. You may also find supplemental data online to assist you with your project.

For example, let’s say you work for a health insurance company and you are interested in determining the average drive time for your insured population to the nearest in-network primary care providers to see if it would be prudent to contract with additional doctors in the area. You would need to collect at least three pieces of data:

  • Addresses of your insured population (internal company source/database)
  • Addresses of primary care provider offices (internal company source/database)
  • Google Maps travel time API to calculate drive times between addresses (external data source)

In summary, data collection provides the fundamental pieces needed to solve your problem or answer your question.

1.3 Data Cleaning

We’ll discuss data cleaning in a little more detail in later chapters, but this phase generally refers to the process of taking the data you collected in step 1, and turning it into a usable format for your analysis. This phase can often be the most time consuming as it may involve handling missing data as well as pre-processing the data to be as error free as possible.

Where you source your data has major implications for how long this phase takes. For example, many actuaries benefit from dedicated data engineers and resources within their companies who work hard to make the data as clean as possible to use. However, if you are sourcing your data from raw files on the internet, you may find this phase to be exceptionally difficult and time intensive.

1.4 Exploratory Data Analysis

Exploratory Data Analysis , or EDA, is an entire subject itself. In short, EDA is an iterative process whereby you:

  • Generate questions about your data
  • Search for answers, patterns, and characteristics of your data by transforming, visualizing, and summarizing your data
  • Use learnings from step 2 to generate new questions and insights about your data

We’ll cover some basics of EDA in Chapter 4 on Data Manipulation and Chapter 5 on Data Visualization, but we’ll only be able to scratch the surface of this topic.

A successful EDA approach will allow you to better understand your data and the relationships between variables within your data. Sometimes, you may be able to answer your question or solve your problem after the EDA step alone. Other times, you may apply what you learned in the EDA step to help build a model for your data.
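The iterative loop above can be made concrete with a tiny example. This module works in R, but the same kind of summary is easy to sketch in any language; here it is in Python, with invented claims data. One question ("do claim sizes vary by region?") produces a summary whose answer suggests the next question:

```python
from collections import defaultdict
import statistics

# Hypothetical claims data: (region, claim_amount)
claims = [
    ("North", 1200), ("North", 900), ("South", 2500),
    ("South", 3100), ("South", 2800), ("North", 1100),
]

# Transform: group claim amounts by region
by_region = defaultdict(list)
for region, amount in claims:
    by_region[region].append(amount)

# Summarize: claim count and mean claim size per region
summary = {r: (len(v), statistics.mean(v)) for r, v in by_region.items()}
print(summary)
```

Seeing that one region's mean claim is far larger naturally generates the next EDA question: is that driven by a few large claims, or is the whole distribution shifted?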

1.5 Model Building

In this step, we build a model, often using machine learning algorithms, in an effort to make sense of our data and gain insights that can be used for decision making or communicating to an audience. Examples of models could include regression approaches, classification algorithms, tree-based models, time-series applications, neural networks, and many, many more. Later in this module, we will practice building our own models using introductory machine learning algorithms.

It’s important to note that while model building gets a lot of attention (because it’s fun to learn and apply new types of models), it typically encompasses a relatively small portion of your overall analysis from a time perspective.

It’s also important to note that building a model doesn’t have to mean applying machine learning algorithms. In fact, in actuarial science, you may find more often than not that the actuarial models you create are Microsoft Excel-based models that blend together historical data, assumptions about the business, and other factors that allow you to make projections or understand the business better.

1.6 Inference and Communication

The final phase of the framework is to use everything you’ve learned about your data up to this point to draw inferences and conclusions about the data, and to communicate those out to an audience. Your audience may be your boss, a client, or perhaps a group of actuaries at an SOA conference.

In any instance, it is critical for you to be able to condense what you’ve learned into clear and concise insights and convince your audience why your insights are important. In some cases, these insights will lend themselves to actionable next steps, or perhaps recommendations for a client. In other cases, the results will simply help you to better understand the world, or your business, and to make more informed decisions going forward.

1.7 Wrap-Up

As we conclude this chapter, take a few minutes to look at a couple alternative visualizations that others have used to describe the processes and components of performing analyses. What do they have in common?

  • Karl Rohe - Professor of Statistics at the University of Wisconsin-Madison
  • Chanin Nantasenamat - Associate Professor of Bioinformatics and Youtuber at the “Data Professor” channel

Solving Common Data Challenges

Once you know what predictive analytics solution you want to build, it’s all about the data. The reliability of predictions depends on the quality of the data used to discover variables and generate, train, and test predictive models.

In this chapter, you’ll learn how to avoid data problems that could slow your time to market. We’ll also give you some guidelines for getting the most predictive power from your data and the best performance from your models.

Find the Data You Need

You’ve chosen a business problem you want to solve. Now, what data do you need to solve it?

Some of the data may be easy to obtain, while other numbers will require more time and effort to assemble. Fortunately, there’s no need to have everything in hand from the beginning. You shouldn’t wait until everything is 100 percent right to deploy a predictive model. If you do, you’ll be lagging behind competitors or facing totally different customer requirements. Instead of a perfect predictive model, aim to develop one that’s better than what you have in place now.

Start by building a model with a few essential data elements. Put the application out there and collect user feedback while you gather additional data. Bring new data elements into your model in subsequent releases to deliver incremental performance enhancements or additional capabilities.

Choose the Right Database

You’ll need to store three types of data for your predictive projects:

  • Historical data for creating, training, and testing your models
  • New data your models will analyze to make predictions, such as customer transactions from yesterday or last week
  • Predictions (outputs) from your models

Historical data usually amounts to the biggest data volume. One terabyte of storage for historical data is a good baseline for most small and medium businesses' predictive applications. Traditional databases like SQL Server, Oracle, PostgreSQL, and MySQL usually work fine for this purpose, and they're also adequate for generating predictions on new data.

If your application is targeted at significantly larger businesses, both your development team and your customers may need to store much more historical data, perhaps as much as 100 terabytes. And if your application requires collecting very detailed data from transactions or sensors, both the historical data you use for training and testing and the new data from which you're generating predictions will rapidly multiply. In these cases, look at non-relational databases that scale vertically (by adding CPU and RAM to a node) and horizontally (by adding nodes, containers, and clusters).

Practice Database Hygiene

It’s usually best to keep the three categories of data—historical, new, and predictions—in separate databases/tables.

In some cases, however, it may be preferable to use the same database/table for both historical and new data. This can be done by creating filters to separate the data. For example, a six-month filter could be used to access historical data, with a last-day or last-week filter for new data.

Within the category of historical data, you also want to separate the data for training your model from the data for testing it. Using the same data in both the build and validation stages would be like having exam questions ahead of the exam—a great score, but the results are meaningless because there’s no evidence anything has actually been learned. By creating two sections—one for training and the other for testing—you ensure a fair evaluation because you’re measuring the accuracy of your model on data it has never seen before.
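The training/testing separation described above can be sketched as a simple shuffled split. This is a minimal Python illustration; a fixed seed keeps the split reproducible:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle and split rows so the model is evaluated on unseen data."""
    rows = rows[:]                     # don't mutate the caller's list
    random.Random(seed).shuffle(rows)  # fixed seed -> reproducible split
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))  # stand-in for 100 historical records
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

The key property is that no record appears in both halves, so the test score measures real generalization rather than memorization.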

Cleanse Your Data

Predictive models are only as good as the data they learn from. Accuracy of predictions depends on clean data. As with other business intelligence projects, the task of cleansing data has traditionally been lengthy—taking up as much as 60 percent of the time in predictive projects.

Fortunately, that’s changing. Machine learning is increasingly being used to detect and resolve two of the most common data problems:

1. Missing values

To appreciate the value of machine learning, it’s helpful to understand some of the options for correcting this problem without it. In manual data cleansing, a typical approach has been to remove all the rows with missing values. But doing so can bias the data—for example, if most of the deleted rows are for males, that could bias the data against males. Also, removing a significant portion of usable data could impair predictive accuracy.

A more sophisticated approach would be to calculate the mean of other similar records—for example, filling in a salary value based on the mean salary for customers in the same age group. The problem is that age might not be the only—or even the best—predictor of salary.

Machine learning can resolve missing values faster and more accurately than manual methods. Instead of simply calculating the mean salary for an age group, algorithms can examine more complex relationships between multiple factors—for example, age group, work experience, education, and salary—and infer missing values based on these multidimensional similarities.
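A group-mean version of this imputation, the stepping stone between a single global mean and full ML-based inference, can be sketched in a few lines of Python. The salary records are made up for illustration:

```python
def impute_salaries(records):
    """Fill missing salaries with the mean salary of the same age group."""
    # Mean salary per age group, from records where salary is known
    groups = {}
    for r in records:
        if r["salary"] is not None:
            groups.setdefault(r["age_group"], []).append(r["salary"])
    group_mean = {g: sum(v) / len(v) for g, v in groups.items()}

    # Replace each missing salary with its group's mean
    for r in records:
        if r["salary"] is None:
            r["salary"] = group_mean[r["age_group"]]
    return records

people = [
    {"age_group": "30-39", "salary": 60000},
    {"age_group": "30-39", "salary": 70000},
    {"age_group": "30-39", "salary": None},  # imputed from the group mean
]
print(impute_salaries(people)[-1]["salary"])
```

An ML-based imputer generalizes this idea: instead of one grouping variable, it infers the missing value from similarity across many variables at once.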

2. Outliers

Data points that fall significantly outside of the normal distribution for a particular variable can skew predictive models, especially when the training dataset is small. These outliers may be caused by errors in measurement, recording, or transmission of data. They could also be the unintended result of how the dataset was defined or simply be natural deviations in a population.

In any case, outliers need to be analyzed to determine if they contain valuable data or are just meaningless aberrations. Depending on which it is, there are various appropriate ways of dealing with outliers, including removal or replacement with the nearest reliable value. All these methods have traditionally required analytics expertise.

Machine learning makes detection even more critical, since results from ML-based predictive models can be significantly impacted by outliers. But machine learning also helps with detection. Feeding ML outputs to visualizations such as scatter plots is a fast, easy way to spot outliers. And ML tools can automate fixes, such as clipping values above or below specified thresholds. We expect to see more holistic, large-scale ML data cleansing solutions soon.
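Clipping values beyond a z-score threshold, one of the automated fixes mentioned above, is straightforward to sketch in Python. The sensor readings are hypothetical:

```python
import statistics

def clip_outliers(values, z_threshold=3.0):
    """Clip values more than z_threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lo, hi = mean - z_threshold * sd, mean + z_threshold * sd
    return [min(max(v, lo), hi) for v in values]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]  # 500 is a likely sensor error
clipped = clip_outliers(readings, z_threshold=2.0)
print(max(clipped) < 500)
```

Note the limitation: a single extreme value inflates the mean and standard deviation it's judged against, which is why robust methods (or a human look at the flagged points) are still worthwhile.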


Avoid Bias in Your Data and Models


Models learn to predict outcomes by identifying patterns (correlations between specific input variables and outcomes) in historical data. So if there’s bias in your data—which there typically is—there will be bias in your models. If you don’t correct for it, the predictions your models make will be inaccurate, as they reinforce and perpetuate the bias.

Watch out for these two common types of bias:

1. Data Bias

Let’s say you’re creating a model to predict the best applicants for sales jobs. If you train your model with data on your current sales team, which has historically had a majority of young white males, then your model will learn patterns based on those characteristics.

In production, the model will then score young white males higher than applicants with other characteristics. You may not be able to fix the problem by just excluding the biased variables. Other variables could also be contributing bias—for example, zip code may be a proxy for race or age.

One way to correct for data bias is through over-sampling or under-sampling. Under-sampling works best when you have a large amount of data, while over-sampling is better when data is limited.

Imagine you want your model to predict a particular outcome like customer churn or propensity to buy a product. If your historical data contains 98 percent of customers who did not have this outcome and just 2 percent who did (or vice versa), then you have a class imbalance problem. The model you train from this data may not be very accurate at predicting the 2 percent outcome because it didn’t have enough data to adequately learn the characteristics that correlate with it. You can address this imbalance by either taking more rows of data in your sample from the under-represented outcome (over-sampling to augment the 2 percent) or fewer rows of data from the over-represented outcome (under-sampling to reduce the 98 percent).
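Both techniques can be sketched in a few lines of NumPy on a toy 98/2 imbalance (the feature values are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy feature matrix: 98 rows for the majority outcome, 2 for the minority
majority = rng.normal(0.0, 1.0, size=(98, 3))
minority = rng.normal(2.0, 1.0, size=(2, 3))

# over-sampling: repeat minority rows (with replacement) until classes match
over = minority[rng.integers(0, len(minority), size=len(majority))]
balanced_over = np.vstack([majority, over])             # 98 + 98 rows

# under-sampling: keep only a random subset of majority rows instead
keep = rng.choice(len(majority), size=len(minority), replace=False)
balanced_under = np.vstack([majority[keep], minority])  # 2 + 2 rows
```

Note how under-sampling here throws away 96 of the 100 rows, which is why it suits large datasets while over-sampling suits limited ones.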

Another approach is to use the Synthetic Minority Oversampling Technique (SMOTE). It uses over-sampling, but also fills in artificial data points—which serve as additional “synthetic” instances of the under-represented class—based on the existing data points.
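In practice you would typically reach for a library such as imbalanced-learn's SMOTE implementation, but the core idea can be sketched by hand: interpolate between a minority point and one of its nearest minority-class neighbours. This is a simplified illustration, not the full algorithm:

```python
import numpy as np

def smote_sample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating each sampled
    point toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        nb = minority[rng.choice(neighbours)]
        # place the synthetic point somewhere on the segment from x to nb
        synthetic.append(x + rng.random() * (nb - x))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])
new_points = smote_sample(minority, n_new=10)
```

Because each synthetic point lies between two real minority points, the augmented class stays inside the region the real data occupies.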


2. Selection Bias

Selection bias happens when the data you use to train your model differs in some significant way from the data it analyzes in production. The resulting predictions will be inaccurate and may also be unfair.

Here is a common example of selection bias and how to avoid it:

Consider a propensity-to-purchase model built to help expand a product’s market by identifying new customers. If the training data was exclusively drawn from the current customer base, it will be biased toward current customer characteristics. As a result, it may not be very good at identifying the characteristics of other potential market segments, and will overlook opportunities. You can reduce selection bias and improve performance by enriching the data with experiments that capture data outside of your current customer base.


Validate that your model is working and establish a performance baseline


To validate that your model is working prior to launch, run it on your test data. This is historical data you’ve stored separately from the data you used to build the model. Since you already know the actual outcomes, you can measure how close the prediction gets to what happened in the real world.

Accuracy measures the total number of correctly predicted outcomes as a percentage of all outcomes.
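In code, accuracy is just the fraction of predictions that match the known outcomes; the values below are invented for illustration:

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 1 = customer actually churned
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # model output on the test data

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)  # 8 of 10 correct
```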

You don’t need a model with 100 percent accuracy. To determine if you’re ready for your initial predictive software launch and to establish a performance baseline, compare your model’s accuracy with what your beta customers were achieving without it. If your customers were making mostly judgment-based “gut” decisions, the predictive analytics model simply needs to do better than that baseline.

Let’s say the model predicts customers likely to churn, and validation shows it correctly identifies 65 percent of customers in the historical data testing sample that actually did churn. While there’s room for improvement, this accuracy may be an order of magnitude better than judgment-based approaches companies often use to segment and target customers for retention programs.

Watch for Imbalanced Data

While you don’t need 100 percent accuracy from your models, you do need to fix any problems of data imbalance that show up in validation testing.

Imagine a healthcare application that predicts if a patient needs to be screened for cancer. In the historical data you’re using to train the model, 95 percent of patients do not need to be screened for cancer, and 5 percent do. This is an imbalanced dataset, as one class (patients who don’t need screening) has a huge majority over the other.

Depending on the application, this imbalance could be a serious problem. Let’s say validation testing shows that your model has an 85 percent accuracy rate overall. That 15 percent error rate may not seem like a lot, but if the model assigns NO SCREENING for patients that actually need to be screened, it could prevent crucial early intervention treatments. The consequences associated with those wrong predictions can be disastrous.
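A quick sketch (with hypothetical numbers) shows how a seemingly good accuracy score can hide exactly this failure; recall, the share of truly positive cases the model catches, exposes it:

```python
# 95 of 100 patients don't need screening (0); 5 do (1)
actual    = [0] * 95 + [1] * 5
# a model that flags only one of the five patients who need screening
predicted = [0] * 95 + [1, 0, 0, 0, 0]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_pos / sum(actual)

# accuracy looks fine at 0.96, but recall of 0.2 reveals four missed patients
```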

One way to fix problems like this is by sampling the data to balance it. Sampling involves equalizing the number of data points in majority and minority classes. You can do that by “down sampling” (removing some data points from the majority class) or by “up sampling” (creating new synthetic data points in the minority class using statistical inference techniques based on existing data points).

Once you’ve validated a reasonable level of model accuracy and have fixed any data imbalance problems, “call it good,” as they say, and move ahead to launch. Get your predictive analytics solution to market—then work on making it better. Since you’ve already established a performance baseline without predictive analytics, you can chart progress post-launch as you make incremental improvements.

We’ll share some techniques for how to do that in the next chapter.

5 Steps on How to Approach a New Data Science Problem

Many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data but inability to transform it into actionable insights. Here's how to do it right.




Data has become the new gold. Eighty-five percent of companies are trying to be data-driven, according to last year’s survey by NewVantage Partners, and the global data science platform market is expected to reach $128.21 billion by 2022, up from $19.75 billion in 2016.

Clearly, data science is not just another buzzword with limited real-world use cases. Yet, many companies struggle to reorganize their decision making around data and implement a coherent data strategy. The problem certainly isn’t lack of data.

In the past few years alone, 90 percent of all of the world’s data has been created, and our current daily data output has reached 2.5 quintillion bytes, which is such a mind-bogglingly large number that it’s difficult to fully appreciate the break-neck pace at which we generate new data.

The real problem is the inability of companies to transform the data they have at their disposal into actionable insights that can be used to make better business decisions, stop threats, and mitigate risks.

In fact, there’s often too much data available to make a clear decision, which is why it’s crucial for companies to know how to approach a new data science problem and understand what types of questions data science can answer.

What types of questions can data science answer?

“Data science and statistics are not magic. They won’t magically fix all of a company’s problems. However, they are useful tools to help companies make more accurate decisions and automate repetitive work and choices that teams need to make,” writes Seattle Data Guy, a data-driven consulting agency.

The questions that can be answered with the help of data science fall under the following categories:

  • Identifying themes in large data sets: Which server in my server farm needs maintenance the most?
  • Identifying anomalies in large data sets: Is this combination of purchases different from what this customer has ordered in the past?
  • Predicting the likelihood of something happening: How likely is this user to click on my video?
  • Showing how things are connected to one another: What is the topic of this online article?
  • Categorizing individual data points: Is this an image of a cat or a mouse?

Of course, this is by no means a complete list of all questions that data science can answer. Even if it were, data science is evolving at such a rapid pace that it would most likely be completely outdated within a year or two from its publication.

Now that we’ve established the types of questions that can be reasonably expected to be answered with the help of data science, it’s time to lay down the steps most data scientists would take when approaching a new data science problem.

Step 1: Define the problem

First, it’s necessary to accurately define the data problem that is to be solved. The problem should be clear, concise, and measurable . Many companies are too vague when defining data problems, which makes it difficult or even impossible for data scientists to translate them into machine code.

Here are some basic characteristics of a well-defined data problem:

  • The solution to the problem is likely to have enough positive impact to justify the effort.
  • Enough data is available in a usable format.
  • Stakeholders are interested in applying data science to solve the problem.

Step 2: Decide on an approach

There are many data science algorithms that can be applied to data, and they can be roughly grouped into the following families:

  • Two-class classification: useful for any question that has just two possible answers.
  • Multi-class classification: answers a question that has multiple possible answers.
  • Anomaly detection: identifies data points that are not normal.
  • Regression: gives a real-valued answer and is useful when looking for a number instead of a class or category.
  • Multi-class classification as regression: useful for questions that occur as rankings or comparisons.
  • Two-class classification as regression: useful for binary classification problems that can also be reformulated as regression.
  • Clustering: answers questions about how data is organized by seeking to separate out a data set into intuitive chunks.
  • Dimensionality reduction: reduces the number of random variables under consideration by obtaining a set of principal variables.
  • Reinforcement learning algorithms: focus on taking action in an environment so as to maximize some notion of cumulative reward.

Step 3: Collect data

With the problem clearly defined and a suitable approach selected, it’s time to collect data. All collected data should be organized in a log along with collection dates and other helpful metadata.

It’s important to understand that collected data is seldom ready for analysis right away. Most data scientists spend much of their time on data cleaning, which includes handling missing values, removing duplicate records, and correcting incorrect values.
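Those cleaning steps can be sketched in a few lines of pandas; the customer records here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cara"],
    "country":  ["US", "US", "U.S.", "US"],
    "spend":    [120.0, 120.0, None, 80.0],
})

df = df.drop_duplicates()                               # remove duplicate records
df["country"] = df["country"].replace({"U.S.": "US"})   # correct inconsistent values
df["spend"] = df["spend"].fillna(df["spend"].median())  # handle missing values
```

After these three lines the frame has one row per customer, a single spelling of the country, and no gaps in the spend column.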

Step 4: Analyze data

The next step after data collection and cleanup is data analysis. At this stage, there’s a certain chance that the selected data science approach won’t work. This is to be expected and accounted for. Generally, it’s recommended to start with trying all the basic machine learning approaches as they have fewer parameters to alter.

There are many excellent open source data science libraries that can be used to analyze data. Most data science tools are written in Python, Java, or C++.

“Tempting as these cool toys are, for most applications the smart initial choice will be to pick a much simpler model, for example using scikit-learn and modeling techniques like simple logistic regression,” advises Francine Bennett, the CEO and co-founder of Mastodon C.
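Following that advice, a baseline two-class model in scikit-learn takes only a few lines; the dataset here is synthetic, standing in for whatever real data the problem provides:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for a real two-class dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# simple logistic regression as the baseline model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = model.score(X_test, y_test)  # held-out accuracy to beat with fancier models
```

The held-out score becomes the baseline any more complex approach has to justify itself against.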

Step 5: Interpret results

After data analysis, it’s finally time to interpret the results. The most important thing to consider is whether the original problem has been solved. You might discover that your model is working but producing subpar results. One way to deal with this is to add more data and keep retraining the model until you’re satisfied with it.

Most companies today are drowning in data. The global leaders are already using the data they generate to gain competitive advantage, and others are realizing that they must do the same or perish. While transforming an organization to become data-driven is no easy task, the reward is more than worth the effort.

The 5 steps on how to approach a new data science problem we’ve described in this article are meant to illustrate the general problem-solving mindset companies must adopt to successfully face the challenges of our current data-centric era.


19 Big Data Problems You Need to Solve


In the last two years, over 90% of the world’s data was created, and with 2.5 quintillion bytes of data generated daily, it is clear that the future is filled with more data, which can also mean more data problems.

Whilst it is clear that companies can benefit from this growth in data, executives must be cautious and aware of the challenges they will need to overcome, particularly around:

  • Collecting, storing, sharing and securing data.
  • Creating and utilising meaningful insights from their data.


Luckily, there are pragmatic solutions that companies can take to overcome their data problems and thrive in the data-driven economy.

Here’s a look at some common data problems and how you can solve them.

What is Big Data?

  • 1. Finding and Fixing Data Quality Issues
  • 2. Long Response Times from Systems
  • 3. Confusion with Big Data Tool Selection
  • 4. Real-Time Big Data Problems
  • 5. Lack of Understanding
  • 6. High Cost of Data Solutions
  • 7. Too Many Choices
  • 8. Complex Systems for Managing Data
  • 9. Security Gaps
  • 10. Low Quality and Inaccurate Data
  • 11. Compliance Hurdles
  • 12. Using Data for Meaning
  • 13. Keeping Up with Growth in Data
  • 14. Accessibility
  • 15. Pace of Technology
  • 16. Lack of Skilled Workers
  • 17. Data Integration
  • 18. Processing Large Data Sets
  • 19. Constantly Changing Data


A commonly used buzzword, big data is everywhere, which is why it’s always being talked about. Data is collected and generated from almost everything we do, be it streaming a show, submitting a form online, sending an email, reading a text, or creating a report.

To define what big data is, the four V’s are used, namely:

  • Volume: The size of data (i.e. petabytes or exabytes)
  • Velocity: The speed at which data flows
  • Veracity: The validity of the data
  • Variety: The nature of the data (structured and unstructured formats)

When dealing with data, accuracy is of the utmost importance. After all, every insight you glean will only be as good as the data it comes from. It all begins during the data collection phase: you want to be sure that you’re collecting data from the right sources at the right time.

Along with the collection of data, data quality will also depend on how you store the data. It must be made accessible in order to be analysed (this is where automation solutions come into play).

During the data lifecycle, you must also maintain data properly so that it can be used by the right team at any point in time for application. This data usage is what breeds decision-making abilities. To learn how to keep data clean and reduce data quality issues, check out this guide.

Clean and accurate data is just as important as data being accessible when you need it. If you’re using a data tool that’s slow, then by the time your data is available for use, it could be considered outdated and old.

You want data to be able to be processed immediately when you input it so you can make use of your output in a timely manner.

One way to fix long response times from your system is to ensure that data is being stored efficiently by performing data re-engineering. Or, look for a more optimised data system that’s scalable for your growing data needs.

Another challenge when dealing with big data is choosing the right big data tool for your business’ needs. Since your big data tool is geared towards reducing big data problems, it’d be a shame if it became a problem in itself!

In order to overcome this challenge, it’s best to take time performing research and not jump too quickly into a specific tool. Additionally, be sure to review what kind of support the tools you’re considering offer.

If the option exists to schedule a demo, take advantage of it because it will give you a view of how the big data solution will work specifically for your business.

As we mentioned, big data exists everywhere. As such, data is constantly changing and evolving, which thus impacts the insights you glean from it. Technically, this requires a tool that can provide up-to-date filtering and remove redundant or irrelevant data from the picture when you’re applying it.

A surefire way to overcome real-time big data issues is to deploy an automation solution that utilises artificial intelligence (AI) to process, analyse, and structure data in real-time. By doing so, you can avoid big data problems at every turn.

Companies can leverage data to boost performance in many areas. Some of the best use cases for data are to: decrease expenses, create innovation, launch new products, grow the bottom line, and increase efficiency, to name a few. Despite the benefits, companies have been slow to adopt data technology or put a plan in place for how to create a data-centric culture. In fact, according to a Gartner study, out of 196 companies surveyed, 91% say they have yet to reach a “transformational” level of maturity in their data and analytics.

Solution: One way to combat the slow adoption is to take a top-down approach for introducing and training your organisation on data usage and procedures. If your in-house team doesn’t have the resources to take this on, consider bringing in IT specialists or consultants and holding workshops to educate your organisation.

After understanding how your business will benefit most from implementing data solutions, you’re likely to find that buying and maintaining the necessary components can be expensive. Along with hardware like servers and storage to software, there also comes the cost of human resources and time.

Solution: To make the most informed decision for what kind of data solution will provide the most ROI, first consider how and why you want to use data. Then, align your reasoning with your business goals, conduct research for available solutions, and implement a strategic plan to incorporate it into your organisation.

According to psychologist Barry Schwartz, less really can be more. Coined as the “paradox of choice,” Schwartz explains how option overload can cause inaction on behalf of a buyer. Instead, by limiting a consumer’s choices, anxiety and stress can be lessened. In the world of data and data tools, the options are almost as widespread as the data itself, so it is understandably overwhelming when deciding the solution that’s right for your business, especially when it will likely affect all departments and hopefully be a long-term strategy.

Solution: Like understanding data, a good solution is to leverage the experience of your in-house expert, perhaps a CTO. If that’s not an option, hire a consultancy firm to assist in the decision-making process. Use the internet and forums to source valuable information and ask questions.

Moving from a legacy data management system and integrating a new solution comes as a challenge in itself. Furthermore, with data coming from multiple sources, and IT teams creating their own data while managing data, systems can become complex quickly.

Solution: Find a solution with a single command center, implement automation whenever possible, and ensure that it can be remotely accessed 24/7.

The importance of data security cannot go unnoticed. However, as solutions are being implemented, it’s not always easy to focus on data security with many moving pieces. Data also needs to be stored properly, which starts with encryption and constant backups.

Solution: You can take a few low effort steps to dramatically increase the security of your data, like: automate security updates, automate backups, install operating system updates (which often include better security), use firewalls, etc.

Having data is only useful when it’s accurate. Low quality data not only serves no purpose, but it also uses unnecessary storage and can harm the ability to gather insights from clean data.

A few ways that data can be considered low quality are:

  • Inconsistent formatting, which takes time to correct and can happen when the same elements are spelled differently, like “US” versus “U.S.”
  • Missing data (i.e. a first name or email address is missing from a database of contacts)
  • Inaccurate data (i.e. it’s just not the right information, or the data has not been updated)
  • Duplicate data (i.e. the data is being double counted)

If data is not maintained or recorded properly, it’s just like not having the data in the first place.

Solution: Begin by defining the necessary data you want to collect (again, align the information needed to the business goal). Cleanse data regularly and when it is collected from different sources, organise and normalise it before uploading it into any tool for analysis. Once you have your data uniform and cleansed, you can segment it for better analysis.


When collecting information, security and government regulations come into play. With the somewhat recent introduction of the General Data Protection Regulation (GDPR), it’s even more important to understand the necessary requirements for data collection and protection, as well as the implications of failing to adhere. Companies have to be compliant and careful in how they use data to segment customers, for example when deciding which customers to prioritise or focus on. This means that the data must be a representative sample of consumers, algorithms must prioritise fairness, inherent bias in the data must be understood, and big data outcomes have to be checked against traditionally applied statistical practices.

Solution: The only solution to adhere to compliance and regulation is to be informed and well-educated on the topic. There’s no way around it other than learning because in this case, ignorance is most certainly not bliss as it carries both financial and reputational risk to your business. If you are unsure of any regulations or compliance you should consult expert legal and accounting firms specialising in those rules.

You may have the data. It’s clean, accurate and organised. But, how do you use it to provide valuable insights to improve your business? Many organisations are turning to robust data analysis tools that can help assess the big picture, as well as break down the data into meaningful bits of information that can then be transformed into actionable outcomes.

Solution: Whether this means having a consistent reporting structure or a dedicated analytics team, be sure to turn your data into measurable outcomes. This means taking data and transforming it into actions for the business to take in an effort to produce wins for the company.


13. Keeping Up with Growth in Data

Like scaling a company, growing with data is a challenge. You want to make sure that you can scale your solution alongside the company’s growth, so that quality doesn’t drop and costs don’t balloon as it expands.

Solution: This is achievable by creating projections from the get-go when introducing data and data management tools. Make sure that you select a robust data solution and know in advance that it can handle the capabilities you may need down the line. Another option is to rely on support systems and internal teams to manage aspects of growth. For example, you can define milestones for your team to be aware of so that only when you reach them will you consider moving to a more sophisticated system.

Sometimes, companies silo data to one person or one department. Not only does this put immense responsibility on a select few, but it also creates a lack of accessibility throughout the organisation in departments where the data can be of use to provide a positive impact. Data silos directly inhibit the benefits of collecting data in the first place.

Solution: It sounds simple, but it’s not done enough - integrate your data. Set clear expectations and create a unified system that can handle each department’s needs. If it’s not through finding a single integrated system, consider using APIs so that data is accessible in one, centralised location.

Inventor, author and futurist Ray Kurzweil put it best when he defined the accelerating rate of change of technology. Each subsequent technological advancement builds more quickly upon the last because they evolve at each step to become more efficient and therefore can better inform what comes next. For example, just consider how rapidly cloud computing and artificial intelligence are improving.

With the rapid advancement of technology and systems, you don’t want your data tools to become outdated, especially when you’re investing time, energy and human resources into them.

Solution: While you can’t stop progression, you can prepare for it. This begins with staying informed of information technology and its new features, products and threats.


While the technological demand is high and artificial intelligence and data analysis tools are innovating swiftly, the lack of skilled workers is causing a bottleneck for many companies. The number of new, skilled graduates isn’t keeping pace with technology, and in turn, companies are asking staff to supplement this shortfall by working multiple roles.

Solution: If the solution doesn’t exist naturally, try to create it. While you can’t control how many data scientists and data analysts graduate each year, you can leverage your current workforce and provide training to instil and teach the skills you need them to have. You can also look for more powerful data tools that make the analysis work less complex, which open up recruitment to a broader pool of less specialised analysts.

Data integration consists of taking data from various sources and combining it to create valuable and usable information.

Solution: There are a few ways to go about integrating data, including the following approaches:

  • Consolidation: Combining the data from various sources in one consolidated data store
  • Propagation: Leveraging applications to copy data from one location to another
  • Federation: Using a virtual database to create a model to match data from different systems
  • Virtualisation: Viewing data in one location, but where the data is still stored separately
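As a tiny illustration of the consolidation approach, here is how customer records from two hypothetical source systems (the names and figures are invented) can be combined into a single store with pandas:

```python
import pandas as pd

# two source systems with hypothetical customer data
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "email": ["a@example.com", "b@example.com", "c@example.com"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [50, 70, 20]})

# consolidate into one store: one row per customer with total spend
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
consolidated = crm.merge(totals, on="customer_id", how="left")
```

The `how="left"` join keeps every CRM customer even when the orders system has nothing for them, which is usually what a consolidated store should do.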

Large data sets are challenging to process and make sense of. Recall the V’s of big data: volume is the amount of data, velocity is the rate at which new data is created, and variety covers the various formats that data exists in, like images, videos and text.

Solution: The solution for problems with large data sets, regardless of their exact size, has been discussed throughout this article and includes tactics performed by both human resources and technology. Steps to properly process data, regardless of its size, include ensuring data is accurate, integrating data, and developing a company culture that both understands and celebrates the usage of big data to make informed decisions.

Implementing the infrastructure and management of data cannot be a set-and-forget task. The nature of data is that it’s constantly changing. Your customers’ details and orders are always changing, as are their interactions with your company.

Solution: Incorporate data systems with advanced machine learning and interoperability in order to adapt to the constantly changing landscape of data inputs, and in turn, outputs. You can also use systems that store historic as well as new data to understand the causes and implications of the data changes and model future trends.


In today’s data-driven world, the management of your data is essential and must not be ignored. You need to be proactive in understanding and implementing data solutions that align with your business goals. By doing so, you can effectively mitigate any big data problems.

Some organisations will need to assemble a dedicated team of experts to manage their data. That being said, modern data tools offer a simple way to augment and leverage existing staff to be able to turn data into insights for the business.


Solving Problems with Data Science


Aakash Tandel , Former Data Scientist

Article Categories: #Strategy , #Data & Analytics

Posted on December 3, 2018

There is a systematic approach to solving data science problems and it begins with asking the right questions. This article covers some of the many questions we ask when solving data science problems at Viget.


A challenge that I’ve been wrestling with is the lack of a widely adopted framework or systematic approach to solving data science problems. In our analytics work at Viget, we use a framework inspired by Avinash Kaushik’s Digital Marketing and Measurement Model. We use this framework on almost every project we undertake at Viget. I believe data science could use a similar framework that organizes and structures the data science process.

As a start, I want to share the questions we like to ask when solving a data science problem. Even though some of the questions are not specific to the data science domain, they help us efficiently and effectively solve problems with data science.

Business Problem

What is the problem we are trying to solve?

That’s the most logical first step to solving any question, right? We have to be able to articulate exactly what the issue is. Start by writing down the problem without going into the specifics, such as how the data is structured or which algorithm we think could effectively solve the problem.

Then try explaining the problem to your niece or nephew, who is a freshman in high school. It is easier than explaining the problem to a third-grader, but you still can’t dive into statistical uncertainty or convolutional versus recurrent neural networks. The act of explaining the problem at a high school stats and computer science level makes your problem, and the solution, accessible to everyone within your or your client’s organization, from the junior data scientists to the Chief Legal Officer.

Clearly defining our business problem showcases how data science is used to solve real-world problems. This high-level thinking provides us with a foundation for solving the problem. Here are a few other framing questions we should think about.

  • Who are the stakeholders for this project?
  • Have we solved similar problems before?
  • Has someone else documented solutions to similar problems?
  • Can we reframe the problem in any way?

And don’t be fooled by these deceptively simple questions. Sometimes more generalized questions can be very difficult to answer. But we believe answering these framing questions is the first, and possibly most important, step in the process, because it makes the rest of the effort actionable.

Say we work at a video game company —  let’s call the company Rocinante. Our business is built on customers subscribing to our massive online multiplayer game. Users are billed monthly. We have data about users who have cancelled their subscription and those who have continued to renew month after month. Our management team wants us to analyze our customer data.

Well, as a company, Rocinante wants to be able to predict whether or not customers will cancel their subscription. We want to be able to predict which customers will churn, in order to address the core reasons why customers unsubscribe. Additionally, we need a plan to target specific customers with more proactive retention strategies.

Churn is the turnover of customers, also referred to as customer death. In a contractual setting, such as when a user signs a contract to join a gym, a customer “dies” when they cancel their gym membership. In a non-contractual setting, customer death is not observed and is more difficult to model. For example, Amazon does not know when you have decided to never again purchase Adidas products. Your death as an Amazon or Adidas customer is implied.


Possible Solutions

What are the approaches we can use to solve this problem?

There are many instances when we shouldn’t be using machine learning to solve a problem. Remember, data science is one of many tools in the toolbox. There could be a simpler, and maybe cheaper, solution out there. Maybe we could answer a question by looking at descriptive statistics around web analytics data from Google Analytics. Maybe we could solve the problem with user interviews and hear what the users think in their own words. This question aims to see if spinning up EC2 instances on Amazon Web Services is worth it. If the answer to “Is there a simple solution?” is “No,” then we can ask, “Can we use data science to solve this problem?” This yes-or-no question brings about two follow-up questions:

  • “Is the data available to solve this problem?” A data scientist without data is not a very helpful individual. Many of the data science techniques highlighted in media today, such as deep learning with artificial neural networks, require a massive amount of data. A hundred data points is unlikely to provide enough data to train and test a model. If the answer to this question is no, then we can consider acquiring more data and pipelining that data to warehouses, where it can be accessed at a later date.
  • “Who are the team members we need in order to solve this problem?” Your initial answer to this question will be, “The data scientist, of course!” But the vast majority of the problems we face at Viget can’t or shouldn’t be solved by a lone data scientist, because we are solving business problems. Our data scientists team up with UXers, designers, developers, project managers, and hardware developers to develop digital strategies, and solving data science problems is one part of that strategy. Siloing your problem and siloing your data scientists isn’t helpful for anyone.

We want to predict when a customer will unsubscribe from Rocinante’s flagship game. One simple approach would be to take the average customer lifetime (how long a gamer remains subscribed) and predict that all customers will churn after X amount of time. Say our data showed that, on average, customers churned after 72 months of subscription. Then we could predict that a new customer would churn after 72 months. We test this hypothesis on new data and learn that it is wildly inaccurate. The average customer lifetime for our previous data was 72 months, but our new batch of data had an average customer lifetime of 2 months. Users in the second batch churned much faster than those in the first. Our prediction of 72 months didn’t generalize well. Let’s try a more sophisticated approach using data science.
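That naive baseline can be sketched in a few lines; the subscription lifetimes below are made up for illustration, not Rocinante data:

```python
# Naive churn baseline: predict that every customer churns at the
# historical average subscription lifetime (in months).

def average_lifetime(lifetimes):
    """Mean subscription length of already-churned customers."""
    return sum(lifetimes) / len(lifetimes)

# First (hypothetical) batch of historical data: long-lived subscribers.
batch_1 = [70, 71, 72, 73, 74]
prediction = average_lifetime(batch_1)  # 72.0 months

# A new batch churns far faster, so the single fixed prediction
# generalizes poorly -- motivating a real model instead.
batch_2 = [1, 2, 2, 3, 2]
error = abs(prediction - average_lifetime(batch_2))  # 70.0 months off
```

The second batch is exactly the failure mode described above: one global average carries no information about how individual customers differ.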

  • Is the data available to solve this problem? The dataset contains 12,043 rows of data and 49 features. We determine that this sample of data is large enough for our use case. We don’t need to deploy Rocinante’s data engineering team for this project.
  • Who are the team members we need in order to solve this problem? Let’s talk with Rocinante’s data engineering team to learn more about their data collection process. We could learn about biases in the data from the data collectors themselves. Let’s also chat with the customer retention and acquisition team and hear about their tactics to reduce churn. Our job is to analyze data that will ultimately impact their work. Our project team will consist of the data scientist to lead the analysis, a project manager to keep the project team on task, and a UX designer to help facilitate research efforts we plan to conduct before and after the data analysis.


How do we know if we have successfully solved the problem?

At Viget, we aim to be data-informed, which means we aren’t blindly driven by our data, but we are still focused on quantifiable measures of success. Our data science problems are held to the same standard.  What are the ways in which this problem could be a success? What are the ways in which this problem could be a complete and utter failure?  We often have specific success metrics and Key Performance Indicators (KPIs) that help us answer these questions.

Our UX coworker has interviewed some of the other stakeholders at Rocinante and some of the gamers who play our game. Our team believes if our analysis is inconclusive, and we continue the status quo, the project would be a failure. The project would be a success if we are able to predict a churn risk score for each subscriber. A churn risk score, coupled with our monthly churn rate (the rate at which customers leave the subscription service per month), will be useful information. The customer acquisition team will have a better idea of how many new users they need to acquire in order to keep the number of customers the same, and how many new users they need in order to grow the customer base. 
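The acquisition arithmetic described above is simple enough to sketch directly; the subscriber count and churn rate here are hypothetical:

```python
def users_needed(customers, monthly_churn_rate, growth=0):
    """New sign-ups needed next month to offset expected churn,
    plus any desired net growth in the subscriber base."""
    churned = customers * monthly_churn_rate
    return churned + growth

# Hypothetical numbers: 10,000 subscribers, 5% monthly churn.
replacements = users_needed(10_000, 0.05)        # 500 just to stay flat
for_growth   = users_needed(10_000, 0.05, 250)   # 750 to grow by 250
```

A churn risk score refines this further, because the team can target the at-risk subscribers rather than treating churn as a uniform rate.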


Data Science-ing

What do we need to learn about the data, and what analysis do we need to conduct?

At the heart of solving a data science problem are hundreds of questions. I attempted to ask these and similar questions last year in a blog post,  Data Science Workflow . Below are some of the most crucial — they’re not the only questions you could face when solving a data science problem, but are ones that our team at Viget thinks about on nearly every data problem.

  • What do we need to learn about the data?
  • What type of exploratory data analysis do we need to conduct?
  • Where is our data coming from?
  • What is the current state of our data?
  • Is this a supervised or unsupervised learning problem?
  • Is this a regression, classification, or clustering problem?
  • What biases could our data contain?
  • What type of data cleaning do we need to do?
  • What type of feature engineering could be useful?
  • What algorithms or types of models have been proven to solve similar problems well?
  • What evaluation metric are we using for our model?
  • What is our training and testing plan?
  • How can we tweak the model to make it more accurate, increase the ROC-AUC, decrease log-loss, etc.?
  • Have we optimized the various parameters of the algorithm? Try grid search here.
  • Is this ethical?

That last question raises the conversation about ethics in data science. Unfortunately, there is no Hippocratic oath for data scientists, but that doesn’t excuse the data science industry from acting unethically. We should apply ethical considerations to our standard data science workflow. Ethics in data science deserves more than a paragraph in this article, but I wanted to highlight that we should be cognizant of it and practice only ethical data science.

Let’s get started with the analysis. It’s time to answer the data science questions. Because this is an example, the answers to these questions are entirely hypothetical.

  • We need to learn more about the time series nature of our data, as well as the format.
  • We should look into average customer lifetime durations and summary statistics around some of the features we believe could be important.
  • Our data came from login data and customer data, compiled by Rocinante’s data engineering team.
  • The data needs to be cleaned, but it is conveniently in a PostgreSQL database.
  • This is a supervised learning problem because we know which customers have churned.
  • This is a binary classification problem.
  • After conducting exploratory data analysis and speaking with the data engineering team, we do not see any biases in the data.
  • We need to reformat some of the data and use missing data imputation for features we believe are important but have some missing data points.
  • With 49 good features, we don’t believe we need to do any feature engineering.
  • We have used random forests, XGBoost, and standard logistic regressions to solve classification problems.
  • We will use ROC-AUC score as our evaluation metric.
  • We are going to use a training-test split (80% training, 20% test) to evaluate our model.
  • Let’s remove features that are statistically insignificant from our model to improve the ROC-AUC score.
  • Let’s optimize the parameters within our random forests model to improve the ROC-AUC score.
  • Our team believes we are acting ethically.
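ROC-AUC, the evaluation metric chosen above, has a concrete interpretation: the probability that a randomly chosen churner is scored higher than a randomly chosen non-churner (ties count half). A minimal pure-Python sketch of that definition, with toy scores; in practice you would reach for a library implementation such as scikit-learn’s `roc_auc_score`:

```python
from itertools import product

def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive
    outranks a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy churn predictions: higher score = higher predicted churn risk.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 0.75: one of four pos/neg pairs mis-ranked
```

A score of 0.5 means the model ranks no better than chance; 1.0 means every churner outranks every non-churner.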

This process may look deceptively linear, but data science is often a nonlinear practice. After doing all of the work in our example above, we could still end up with a model that doesn’t generalize well. It could be bad at predicting churn in new customers. Maybe we shouldn’t have assumed this problem was a binary classification problem, and instead should have used survival regression to solve it. This part of the project will be filled with experimentation, and that’s totally normal.



What is the best way to communicate and circulate our results?

Our job is typically to bring our findings to the client, explain how the process was a success or failure, and explain why. Communicating technical details to non-technical audiences is important because not all of our clients have degrees in statistics. There are four ways in which communicating technical details can be advantageous:

  • It can be used to inspire confidence that the work is thorough and multiple options have been considered.
  • It can highlight technical considerations or caveats that stakeholders and decision-makers should be aware of.  
  • It can offer resources to learn more about specific techniques applied.
  • It can provide supplemental materials to allow the findings to be replicated where possible.

We often use blog posts and articles to circulate our work. They help spread our knowledge and the lessons we learned while working on a project to peers. I encourage every data scientist to engage with the data science community by attending and speaking at meetups and conferences, publishing their work online, and extending a helping hand to other curious data scientists and analysts.

Our method of binary classification was in fact incorrect, so we ended up using survival regression to determine there are four features that impact churn: gaming platform, geographical region, days since last update, and season. Our team aggregates all of our findings into one report, detailing the specific techniques we used, caveats about the analysis, and the multiple recommendations from our team to the customer retention and acquisition team. This report is full of the nitty-gritty details that the more technical folks, such as the data engineering team, may appreciate. Our team also creates a slide deck for the less-technical audience. This deck glosses over many of the technical details of the project and focuses on recommendations for the customer retention and acquisition team.

We give a talk at a local data science meetup, going over the trials, tribulations, and triumphs of the project and sharing them with the data science community at large.


Why are we doing all of this?

I ask myself this question daily — and not in the metaphysical sense, but in the value-driven sense. Is there value in the work we have done and in the end result? I hope the answer is yes. But, let’s be honest, this is business. We don’t have three years to put together a PhD thesis-like paper. We have to move quickly and cost-effectively. Critically evaluating the value ultimately created will help you refine your approach to the next project. And, if you didn’t produce the value you’d originally hoped, then at the very least, I hope you were able to learn something and sharpen your data science skills. 

Rocinante has a better idea of how long our users will remain active on the platform based on user characteristics, and can now launch preemptive strikes in order to retain those users who look like they are about to churn. Our team eventually develops a system that alerts the customer retention and acquisition team when a user may be about to churn, and they know to reach out to that user, via email, encouraging them to try out a new feature we recently launched. Rocinante is making better data-informed decisions based on this work, and that’s great!

I hope this article will help guide your next data science project and get the wheels turning in your own mind. Maybe you will be the creator of a data science framework the world adopts! Let me know what you think about the questions, or whether I’m missing anything, in the comments below.



10 Common Data Quality Issues (And How to Solve Them)

Data quality is essential for any data-driven organization. Poor data quality results in unreliable analysis. High data quality enables actionable insights for both short-term operations and long-term planning. Identifying and correcting data quality issues can mean the difference between a successful business and a failing one. 

What data quality issues is your organization likely to encounter? Read on to discover the ten most common data quality problems—and how to solve them.

Quick Takeaways

  • Data quality issues can come from cross-system inconsistencies and human error
  • The most common data quality issues include inaccurate data, incomplete data, duplicate data, and aging data
  • Robust data quality monitoring can solve many data quality issues 

1. Inaccurate Data

Gartner says that inaccurate data costs organizations $12.9 million a year, on average. Inaccurate data is data that is simply wrong: customer addresses with the wrong ZIP codes, misspelled customer names, or entries marred by simple human errors. Whatever the cause, inaccurate data is unusable data. If you try to use it, it can throw off your entire analysis.

How can you solve the problem of inaccurate data? The first place to start is by automating data entry. The more you can automate, the fewer human errors you’ll find. 

Next, you need a robust data quality monitoring solution, such as FirstEigen’s DataBuck, to identify and isolate inaccurate data. You can then try to fix the flawed fields by comparing the inaccurate data with a known accurate dataset. If the data is still inaccurate, you’ll have to delete it to keep it from contaminating your data analysis.
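Simple field-level validation at entry or ingestion catches a surprising share of inaccuracies before they reach analysis. A minimal sketch, assuming US-style ZIP codes and hypothetical field names:

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP or ZIP+4

def validate_record(record):
    """Return a list of field-level problems found in one record."""
    problems = []
    if not ZIP_RE.match(record.get("zip", "")):
        problems.append("bad ZIP: %r" % record.get("zip"))
    if not record.get("name", "").strip():
        problems.append("missing name")
    return problems

clean = validate_record({"name": "Ada Lovelace", "zip": "60601"})  # []
dirty = validate_record({"name": "", "zip": "6061"})  # two problems found
```

Rules like these don’t replace a monitoring solution, but they stop the cheapest-to-catch errors at the door.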

2. Incomplete Data

Another common data quality issue is incomplete data: records with missing information in key fields, such as addresses with no ZIP codes, phone numbers without area codes, or demographic information without age or gender.

Incomplete data can result in flawed analysis. It can also make daily operations more problematic, as staff scurries to determine what data is missing and what it was supposed to be.

You can minimize this issue on the data entry front by requiring key fields to be completed before submission. Use systems that automatically flag and reject incomplete records when importing data from external sources. You can then try to complete any missing fields by comparing your data with another similar (and hopefully more complete data source).
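Flagging and rejecting incomplete records on import can be sketched like this; the set of required fields is an assumption for illustration:

```python
REQUIRED = {"name", "zip", "phone"}  # hypothetical key fields

def split_incomplete(records, required=REQUIRED):
    """Separate records with all required fields filled from those
    that should be flagged and held back on import."""
    complete, flagged = [], []
    for r in records:
        missing = [f for f in required if not str(r.get(f, "")).strip()]
        (flagged if missing else complete).append(r)
    return complete, flagged

rows = [
    {"name": "Ada",   "zip": "60601", "phone": "312-555-0100"},
    {"name": "Grace", "zip": "",      "phone": "617-555-0199"},  # no ZIP
]
ok, rejected = split_incomplete(rows)
```

Records in the flagged pile can then be completed against a secondary source rather than silently contaminating the dataset.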

3. Duplicate Data

When importing data from multiple sources, it’s not uncommon to end up with duplicate data. For example, if you’re importing customer lists from two sources, you may find several people who were customers of both retailers. You only want to count each customer once, which makes duplicative records a major issue.

Identifying duplicate records involves the process of “deduplication,” which uses various technologies to detect records with similar data. You can then delete all but one of the duplicate records—ideally, the one that better matches your internal schema. Even better, you may be able to merge the duplicative records, which can result in richer analysis as the two records might contain slightly different details that can complement each other.
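A minimal deduplicate-and-merge sketch, keying on a hypothetical email field and preferring non-empty values so complementary details survive the merge:

```python
def merge_records(a, b):
    """Merge two records for the same customer, preferring non-empty
    values so complementary details from both copies are kept."""
    return {k: a.get(k) or b.get(k) for k in set(a) | set(b)}

def deduplicate(records, key="email"):
    """Collapse records sharing a key, merging their fields."""
    seen = {}
    for r in records:
        k = r[key]
        seen[k] = merge_records(seen[k], r) if k in seen else r
    return list(seen.values())

# Two copies of the same customer from different source lists.
customers = [
    {"email": "ada@example.com", "name": "Ada Lovelace", "phone": ""},
    {"email": "ada@example.com", "name": "", "phone": "312-555-0100"},
]
unique = deduplicate(customers)  # one merged record
```

Real deduplication also has to handle fuzzy matches (typos, name variants), which is where dedicated matching technology earns its keep.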

4. Inconsistent Formatting

Much data can be formatted in a multitude of ways. Consider, for example, the many ways you can express a date—June 5, 2023, 6/5/2023, 6-5-23, or, in a less-structured format, the fifth of June, 2023. Different sources often use different formatting, so these inconsistencies can result in major data quality issues. 

Working with different forms of measurement can cause similar issues. If one source uses metric measurements and another feet and inches, you must settle on an internal standard and ensure that all imported data correctly converts. Using the wrong measurements can be catastrophic—as when NASA lost a $125 million Mars Climate Orbiter because the Jet Propulsion Laboratory used metric measurements and contractor Lockheed Martin Astronautics worked with the English system of feet and pounds. 

Solving this issue requires a data quality monitoring solution that profiles individual datasets and identifies these formatting issues. Once identified, it should be a simple matter of converting data from one format to another.
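Converting the date formats mentioned above to one internal standard (ISO 8601 here) can be sketched with the standard library; the list of expected formats is an assumption about what your sources emit:

```python
from datetime import datetime

# Formats we expect to see across sources (an assumed list).
KNOWN_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%m-%d-%y", "%Y-%m-%d"]

def normalize_date(text):
    """Parse a date written in any known format and re-emit it in one
    internal standard (ISO 8601). Returns None when no format matches,
    so the record can be flagged for review instead of guessed at."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

With this, "June 5, 2023", "6/5/2023", and "6-5-23" all normalize to "2023-06-05", while unparseable text is surfaced rather than silently dropped.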

5. Cross-System Inconsistencies

Inconsistent formatting is often the result of combining data from two different systems. It’s common for two otherwise-similar systems to format data differently. These cross-system inconsistencies can cause major data quality issues if not identified and rectified. 

You need to decide on one standard data format when working with data from multiple sources. All incoming data must then be converted to that format, which can require artificial intelligence (AI) and machine learning (ML) technologies to automate the matching and conversion process.

6. Unstructured Data

While much of the third-party data you ingest will not conform to your standardized formatting, that might not be the worst problem you encounter. Some of the data you ingest may not be formatted at all. 

Key differences between structured and unstructured data.


This unstructured data can contain valuable insights but doesn’t easily fit into most established systems. To convert unstructured data into structured records, use a data integration tool to identify and extract data from an unstructured dataset and convert it into a standardized format for use with your internal systems.

7. Dark Data

Hidden data, sometimes known as dark data, is data an organization collects and stores but does not actively use. IBM estimates that 80% of all data today is dark data. In many cases, it’s a wasted resource that many organizations don’t even know exists, even though it can account for more than half of an average organization’s data storage costs.

Dark data costs more to store than regularly-used data.

Dark data should either be used or deleted. Doing either requires identifying this hidden data, evaluating its usability and usefulness, and making it visible to key stakeholders in your organization.

8. Orphaned Data

Orphaned data isn’t hidden. It’s simply not readily usable. In most instances, data is orphaned when it’s not fully compatible with an existing system or not easily converted into a usable format. For example, a customer record that exists in one database but not in another could be classified as an orphan.

Data quality management software should be able to identify orphaned data. Once identified, the cause of the inconsistency can be determined and, in many instances, rectified so the orphaned data can be fully utilized.

9. Stale Data

Data does not always age well. Old data becomes stale data, which is more likely to be inaccurate. Consider customer addresses, for example. People today are increasingly mobile, meaning that addresses collected more than a few years ago are likely to reflect where customers used to live, not where they currently reside.

Regularly cull older data from your system to mitigate this issue. It’s often easier and cheaper to delete data past a certain expiration date than to deal with the data quality issues of using stale data.
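Expiration-based culling can be sketched like this; the three-year cutoff and the record fields are hypothetical policy choices, not universal rules:

```python
from datetime import date, timedelta

def cull_stale(records, as_of, max_age_days=3 * 365):
    """Keep only records verified within the last max_age_days;
    the cutoff is a per-organization policy choice."""
    cutoff = as_of - timedelta(days=max_age_days)
    return [r for r in records if r["last_verified"] >= cutoff]

addresses = [
    {"city": "Chicago", "last_verified": date(2023, 1, 15)},
    {"city": "Boston",  "last_verified": date(2015, 6, 1)},  # stale
]
fresh = cull_stale(addresses, as_of=date(2023, 6, 1))
```

In production you would archive rather than hard-delete, but the filtering logic is the same.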

10. Irrelevant Data

Many companies capture reams of data about each customer and every transaction. Not all of this data is immediately useful. Some of this data is ultimately irrelevant to the company’s business. 

Capturing and storing irrelevant data increases an organization’s security and privacy risks. It’s better to keep only the data of immediate use to your company, and to delete, or never collect in the first place, the data for which you have little or no use.

Use DataBuck to Solve Your Data Quality Issues

When you want to solve your organization’s data quality issues, turn to FirstEigen. Our DataBuck solution uses AI and ML technologies to automate more than 70% of the data monitoring process. DataBuck identifies and fixes inaccurate, incomplete, duplicate, and inconsistent data, which improves your data quality and usability. 

Contact FirstEigen today to learn more about solving data quality issues .

Check out these articles on Data Trustability, Observability, and Data Quality. 

  • 6 Key Data Quality Metrics You Should Be Tracking
  • How to Scale Your Data Quality Operations with AI and ML
  • 12 Things You Can Do to Improve Data Quality
  • How to Ensure Data Integrity During Cloud Migrations




Find the AI Approach That Fits the Problem You’re Trying to Solve

  • George Westerman,
  • Sam Ransbotham,
  • Chiara Farronato


Five questions to help leaders discover the right analytics tool for the job.

AI moves quickly, but organizations change much more slowly. What works in a lab may be wrong for your company right now. If you know the right questions to ask, you can make better decisions, regardless of how fast technology changes. You can work with your technical experts to use the right tool for the right job. Then each solution today becomes a foundation to build further innovations tomorrow. But without the right questions, you’ll be starting your journey in the wrong place.

Leaders everywhere are rightly asking about how Generative AI can benefit their businesses. However, as impressive as generative AI is, it’s only one of many advanced data science and analytics techniques. While the world is focusing on generative AI, a better approach is to understand how to use the range of available analytics tools to address your company’s needs. Which analytics tool fits the problem you’re trying to solve? And how do you avoid choosing the wrong one? You don’t need to know deep details about each analytics tool at your disposal, but you do need to know enough to envision what’s possible and to ask technical experts the right questions.

  • George Westerman is a Senior Lecturer at the MIT Sloan School of Management and founder of the Global Opportunity Forum in MIT’s Office of Open Learning.
  • Sam Ransbotham is a Professor of Business Analytics at the Boston College Carroll School of Management. He co-hosts the “Me, Myself, and AI” podcast.
  • Chiara Farronato is the Glenn and Mary Jane Creamer Associate Professor of Business Administration at Harvard Business School and co-principal investigator at the Platform Lab at Harvard’s Digital Design Institute (D^3). She is also a fellow at the National Bureau of Economic Research (NBER) and the Center for Economic Policy Research (CEPR).

Partner Center

  • International edition
  • Australia edition
  • Europe edition

Data centres

Can you solve it? The magical maths that keeps your data safe

How to protect machines against random failures

UPDATE: The solutions can be read here

I’ve temporarily moved to Berkeley, California, where I am the “science communicator in residence” at the Simons Institute , the world’s leading institute for collaborative research in theoretical computer science.

One nano-collaboration is today’s puzzle – told to me by a computer scientist at Microsoft I befriended over tea. It’s about data centres – those warehouses containing endless rows of computers that store all our data.

One problem faced by data centres is the unreliability of physical machines. Hard drives fail all the time, and when they do, all their data may be lost. How do companies like Microsoft make sure that they can recover the data from failed hard drives? The solution to the puzzle below is, in essence, the answer to this question.

An obvious strategy that a data centre could use to protect its machines from random failures is for every machine to have a duplicate. In this case, if a hard drive fails, you recover the data from the duplicate. This strategy, however, is not used because it is very inefficient. If you have 100 machines, you would need another 100 duplicates. There are better ways, as you will hopefully deduce!

The disappearing boxes

You have 100 boxes. Each box contains a single number in it, and no two boxes have the same number.

1. You are told that one of the boxes at random will be removed. But before it is removed you are given an extra box, and allowed to put a single number in it. What number do you put in the extra box that guarantees you will be able to recover the number of whichever box is removed?

2. You are told that two of the boxes at random will be removed. But before they are removed you are given two extra boxes, and allowed to put one number in each of them. What (different) numbers do you put in these two boxes that guarantee you will be able to recover the numbers of both removed boxes?

I’ll be back with the answers at 5pm UK. Meanwhile, NO SPOILERS, please discuss your favourite hard drives.


The analogy here is that each box is a hard drive, the number in the box is the data, and the removal of a box is the failure of the hard drive. With one extra hard drive, we are secure against the random failure of a single hard drive, and with two, we are secure against the failure of two. It seems magical that we can protect such a lot of information against random failures with minimal back-up.
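The numbers-in-boxes puzzle hints at the real mechanism: with bit-level data, the single "extra box" is the XOR of all the drives, as in RAID 5. A toy sketch of single-failure recovery:

```python
from functools import reduce

def parity(blocks):
    """XOR of all data blocks -- the extra 'box' a RAID-5-style
    array stores so it can survive one drive failure."""
    return reduce(lambda a, b: a ^ b, blocks)

drives = [0b1011, 0b0110, 0b1110]   # toy 4-bit "hard drives"
p = parity(drives)                  # the extra parity drive

# Drive 1 fails. XOR-ing the survivors with the parity block
# reconstructs the lost contents exactly, because each bit of
# the missing drive is whatever makes the overall parity hold.
survivors = [drives[0], drives[2]]
recovered = parity(survivors + [p])  # equals drives[1]
```

Surviving two simultaneous failures, like part 2 of the puzzle, needs a second, independent check value, which is what Reed-Solomon-style codes in RAID 6 provide.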

The field of “error-correcting codes” is a large body of beautiful theories that provide answers to questions such as how to minimise the number of machines needed to protect against random failures of hard drives. And the theories work! Data centres never lose your data because of mechanical failure.

My tea companion was Sivakanth Gopi, a Principal Researcher at Microsoft. He said: “The magic of error correcting codes allows us to build reliable systems using noisy and faulty components. Thanks to them, we can communicate with someone as far away as the ends of our solar system and store billions of terabytes of data safely in the cloud. We can forget about the noise and complexity of this world and instead enjoy its beauty.”

I’ve been setting a puzzle here on alternate Mondays since 2015. I’m always on the look-out for great puzzles. If you would like to suggest one, email me .


Problem Solving and Data Analysis

We have lots of free resources and videos to help you prepare for the SAT. These materials are for the redesigned SAT, which applies if you are taking the SAT in March 2016 or later.


Problem Solving and Data Analysis includes questions that test your ability to

  • create a representation of the problem.
  • consider the units involved.
  • pay attention to the meaning of quantities.
  • know and use different properties of mathematical operations and representations.
  • apply key principles of statistics.
  • estimate the probability of a simple or compound event.

There are many ways you can be tested, and practicing different types of questions will help you prepare for the SAT.

The following video lessons will show you how to solve a variety of problem solving and data analysis questions in different situations.

Ratio, Proportion, Units and Percentages

There will be questions on ratios. A ratio represents the proportional relationship between quantities. Fractions can be used to represent ratios.

There will also be questions involving percentages. Percent is a type of proportion that means “per 100”.

You will need to convert units when required by the question. One way to perform unit conversion is to write it out as a series of multiplication steps.
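The multiplication-step approach can be sketched briefly (the quantities are our own illustration):

```python
# Unit conversion as a chain of multiplication steps, as described
# above: each factor converts exactly one unit.
# Example: convert 90 km/h to metres per second.

speed_km_per_h = 90
speed_m_per_s = speed_km_per_h * (1000 / 1)   # kilometres -> metres
speed_m_per_s = speed_m_per_s * (1 / 3600)    # hours -> seconds
print(speed_m_per_s)  # 25.0
```

Writing each factor separately makes it easy to check that the unwanted units cancel.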


Charts, Graphs and Tables

The questions in Problem Solving and Data Analysis focus on linear, quadratic and exponential relationships, which may be represented by charts, graphs or tables. Over equal intervals, a model is linear if the difference in quantity is constant, and exponential if the ratio in quantity is constant.
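For evenly spaced inputs, the two tests can be checked directly (a small sketch; the data values are our own):

```python
# A model is linear if consecutive differences are constant,
# exponential if consecutive ratios are constant.

def constant_differences(ys):
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    return all(d == diffs[0] for d in diffs)

def constant_ratios(ys):
    ratios = [b / a for a, b in zip(ys, ys[1:])]
    return all(r == ratios[0] for r in ratios)

assert constant_differences([3, 5, 7, 9])    # linear: +2 each step
assert constant_ratios([2, 6, 18, 54])       # exponential: ×3 each step
assert not constant_ratios([3, 5, 7, 9])     # linear data fails the ratio test
```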

A line of best fit is a straight line that best represents the data on a scatterplot. It is written in the form y = mx + c, where m is the slope and c is the y-intercept.

You need to know the formulas for simple and compound interest. Simple interest: A = P(1 + rt). Compound interest: A = P(1 + r/n)^(nt), where A is the total amount, P is the principal, r is the interest rate, t is the time period and n is the number of times the interest is compounded per year.
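Both formulas translate directly into code (the amounts below are our own illustration):

```python
def simple_interest(P, r, t):
    """A = P(1 + rt)"""
    return P * (1 + r * t)

def compound_interest(P, r, t, n):
    """A = P(1 + r/n)^(nt)"""
    return P * (1 + r / n) ** (n * t)

# $1,000 at 5% annual interest for 2 years:
print(round(simple_interest(1000, 0.05, 2), 2))        # 1100.0
print(round(compound_interest(1000, 0.05, 2, 12), 2))  # compounded monthly, 1104.94
```

Note that compounding more often earns slightly more than simple interest over the same period.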

Probability measures how likely an event is. The formula to calculate the probability of an event is: Probability = (number of favorable outcomes)/(total number of possible outcomes)
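Applied to a simple event with equally likely outcomes, the formula looks like this (the die example is ours, not the text's):

```python
# P(rolling an even number on a fair six-sided die)
outcomes = list(range(1, 7))
favorable = [n for n in outcomes if n % 2 == 0]   # 2, 4, 6
probability = len(favorable) / len(outcomes)
print(probability)  # 0.5
```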


Data and Statistics

You need to know that mean, median, and mode are measures of center for a data set, while range and standard deviation are measures of spread. You will not be asked to calculate the standard deviation of a set of data, but you do need to understand that a larger standard deviation means that the values are more spread out from the mean. You may be asked to compare the standard deviation of two data sets by approximating the spread from the mean.
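These measures are easy to explore with the standard library (the two data sets are made up to show the contrast in spread):

```python
from statistics import mean, median, mode, pstdev

tight = [10, 11, 12, 11, 10, 11]   # clustered near its mean
wide = [1, 5, 22, 11, 2, 25]       # spread far from its mean

print(mean(tight), median(tight), mode(tight))  # measures of center
print(max(tight) - min(tight))                  # range, a measure of spread
# The wider data set has the larger standard deviation:
assert pstdev(wide) > pstdev(tight)
```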

You do not need to calculate the margins of error or confidence level, but you do need to know what these concepts are and how to interpret them in context. Take note that the questions in the SAT will always use 95% confidence levels. Some questions may give you the confidence level and ask you to find the value for which the interval applies. When the confidence level is kept the same, the size of the margin of error is affected by the standard deviation and the sample size. The larger the standard deviation, the larger the margin of error. The larger the sample size, the smaller the margin of error. The margin of error and confidence interval are estimates for the entire population and do not apply to values of individual objects in the populations.
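The SAT will not ask you to compute a margin of error, but the two relationships above can be made concrete with the common 95% approximation 1.96·s/√n — our illustration, not a formula the test supplies:

```python
from math import sqrt

def margin_of_error(s, n):
    """Approximate 95% margin of error for a sample mean
    (assumed formula: 1.96 * standard deviation / sqrt(sample size))."""
    return 1.96 * s / sqrt(n)

# Larger standard deviation -> larger margin of error:
assert margin_of_error(10, 100) > margin_of_error(5, 100)
# Larger sample size -> smaller margin of error:
assert margin_of_error(10, 400) < margin_of_error(10, 100)
```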

The results of a sample can be generalized to the entire population only if the subjects in the sample are selected randomly. Conclusions about cause and effect can appropriately be drawn only if the subjects are randomly assigned to treatment.



Title: GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

Abstract: Recent advancements in Large Language Models (LLMs) and Multi-Modal Models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs on solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the challenging subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.



Unit 7: Medium: Problem solving and data analysis

Ratios, rates, and proportions: medium

  • Ratios, rates, and proportions | SAT lesson
  • Ratios, rates, and proportions — Basic example
  • Ratios, rates, and proportions — Harder example

Unit conversion: medium

  • Unit conversion | Lesson
  • Units — Basic example
  • Units — Harder example

Percentages: medium

  • Percentages | Lesson
  • Percents — Basic example
  • Percents — Harder example

Center, spread, and shape of distributions: medium

  • Center, spread, and shape of distributions | Lesson
  • Center, spread, and shape of distributions — Basic example
  • Center, spread, and shape of distributions — Harder example

Data representations: medium

  • Data representations | Lesson
  • Key features of graphs — Basic example
  • Key features of graphs — Harder example

Scatterplots: medium

  • Scatterplots | Lesson
  • Scatterplots — Basic example
  • Scatterplots — Harder example

Linear and exponential growth: medium

  • Linear and exponential growth | Lesson
  • Linear and exponential growth — Basic example
  • Linear and exponential growth — Harder example

Probability and relative frequency: medium

  • Probability and relative frequency | Lesson
  • Table data — Basic example
  • Table data — Harder example

Data inferences: medium

  • Data inferences | Lesson
  • Data inferences — Basic example
  • Data inferences — Harder example

Evaluating statistical claims: medium

  • Evaluating statistical claims | Lesson
  • Data collection and conclusions — Basic example
  • Data collection and conclusions — Harder example


How Americans View the Situation at the U.S.-Mexico Border, Its Causes and Consequences

80% say the U.S. government is doing a bad job handling the migrant influx

Pew Research Center conducted this study to understand the public’s views about the large number of migrants seeking to enter the U.S. at the border with Mexico. For this analysis, we surveyed 5,140 adults from Jan. 16-21, 2024. Everyone who took part in this survey is a member of the Center’s American Trends Panel (ATP), an online survey panel that is recruited through national, random sampling of residential addresses. This way, nearly all U.S. adults have a chance of selection. The survey is weighted to be representative of the U.S. adult population by gender, race, ethnicity, partisan affiliation, education and other categories. Read more about the ATP’s methodology.

Here are the questions used for the report and its methodology .

The growing number of migrants seeking entry into the United States at its border with Mexico has strained government resources, divided Congress and emerged as a contentious issue in the 2024 presidential campaign.

Chart shows Why do Americans think there is an influx of migrants to the United States?

Americans overwhelmingly fault the government for how it has handled the migrant situation. Beyond that, however, there are deep differences – over why the migrants are coming to the U.S., proposals for addressing the situation, and even whether it should be described as a “crisis.”

Factors behind the migrant influx

Economic factors – either poor conditions in migrants’ home countries or better economic opportunities in the United States – are widely viewed as major reasons for the migrant influx.

About seven-in-ten Americans (71%), including majorities in both parties, cite better economic opportunities in the U.S. as a major reason.

There are wider partisan differences over other factors.

About two-thirds of Americans (65%) say violence in migrants’ home countries is a major reason for why a large number of immigrants have come to the border.

Democrats and Democratic-leaning independents are 30 percentage points more likely than Republicans and Republican leaners to cite this as a major reason (79% vs. 49%).

By contrast, 76% of Republicans say the belief that U.S. immigration policies will make it easy to stay in the country once they arrive is a major factor. About half as many Democrats (39%) say the same.

For more on Americans’ views of these and other reasons, visit Chapter 2.

How serious is the situation at the border?

A sizable majority of Americans (78%) say the large number of migrants seeking to enter this country at the U.S.-Mexico border is either a crisis (45%) or a major problem (32%), according to the Pew Research Center survey, conducted Jan. 16-21, 2024, among 5,140 adults.

Related: Migrant encounters at the U.S.-Mexico border hit a record high at the end of 2023.

Chart shows Border situation viewed as a ‘crisis’ by most Republicans; Democrats are more likely to call it a ‘problem’

  • Republicans are much more likely than Democrats to describe the situation as a “crisis”: 70% of Republicans say this, compared with just 22% of Democrats.
  • Democrats mostly view the situation as a major problem (44%) or minor problem (26%) for the U.S. Very few Democrats (7%) say it is not a problem.

In an open-ended question, respondents voice their concerns about the migrant influx. They point to numerous issues, including worries about how the migrants are cared for and general problems with the immigration system.

Yet two concerns come up most frequently:

  • 22% point to the economic burdens associated with the migrant influx, including the strains migrants place on social services and other government resources.
  • 22% also cite security concerns. Many of these responses focus on crime (10%), terrorism (10%) and drugs (3%).

When asked specifically about the impact of the migrant influx on crime in the United States, a majority of Americans (57%) say the large number of migrants seeking to enter the country leads to more crime. Fewer (39%) say this does not have much of an impact on crime in this country.

Republicans (85%) overwhelmingly say the migrant surge leads to increased crime in the U.S. A far smaller share of Democrats (31%) say the same; 63% of Democrats instead say it does not have much of an impact.

Government widely criticized for its handling of migrant influx

For the past several years, the federal government has gotten low ratings for its handling of the situation at the U.S.-Mexico border. (Note: The wording of this question has been modified modestly to reflect circumstances at the time).

Chart shows Only about a quarter of Democrats and even fewer Republicans say the government has done a good job dealing with large number of migrants at the border

However, the current ratings are extraordinarily low.

Just 18% say the U.S. government is doing a good job dealing with the large number of migrants at the border, while 80% say it is doing a bad job, including 45% who say it is doing a very bad job.

  • Republicans’ views are overwhelmingly negative (89% say it’s doing a bad job), as they have been since Joe Biden became president.
  • 73% of Democrats also give the government negative ratings, the highest share recorded during Biden’s presidency.

For more on Americans’ evaluations of the situation, visit Chapter 1.

Which policies could improve the border situation?

There is no single policy proposal, among the nine included on the survey, that majorities of both Republicans and Democrats say would improve the situation at the U.S.-Mexico border. There are areas of relative agreement, however.

A 60% majority of Americans say that increasing the number of immigration judges and staff in order to make decisions on asylum more quickly would make the situation better. Only 11% say it would make things worse, while 14% think it would not make much difference.

Nearly as many (56%) say creating more opportunities for people to legally immigrate to the U.S. would make the situation better.

Chart shows Most Democrats and nearly half of Republicans say boosting resources for quicker decisions on asylum cases would improve situation at Mexico border

Majorities of Democrats say each of these proposals would make the border situation better.

Republicans are less positive than Democrats; still, about 40% or more of Republicans say each would improve the situation, while far fewer say they would make things worse.

Opinions on other proposals are more polarized. For example, a 56% majority of Democrats say that adding resources to provide safe and sanitary conditions for migrants arriving in the U.S. would be a positive step forward.

Republicans not only are far less likely than Democrats to view this proposal positively, but far more say it would make the situation worse (43%) than better (17%).

Chart shows Wide partisan gaps in views of expanding border wall, providing ‘safe and sanitary conditions’ for migrants

Building or expanding a wall along the U.S.-Mexico border was among the most divisive policies of Donald Trump’s presidency. In 2019, 82% of Republicans favored expanding the border wall, compared with just 6% of Democrats.

Today, 72% of Republicans say substantially expanding the wall along the U.S. border with Mexico would make the situation better. Just 15% of Democrats concur, with most saying either it would not make much of a difference (47%) or it would make things worse (24%).

For more on Americans’ reactions to policy proposals, visit Chapter 3.




