Using AI to analyze data

Overview

In this unit, we discuss using AI tools to help with data analysis.

Goals

  • Know why and how to use AI to help with data analysis.
  • Be aware of possible confidentiality issues.

Reading

Introduction

AI has gotten pretty good at working with data. While there is no substitute for understanding your data yourself, AI can help speed up the process. R has many powerful functions that let you explore your data quickly, and combining R with AI-generated code can speed things up even more.

Confidentiality and Privacy

It is important to reiterate: If you allow the AI to “see” your data, this data might end up on the servers of the company running the AI (e.g., OpenAI, Microsoft, Google) and might be used by them for future training of their models. Therefore, be careful with what you let the AI see. If you are re-analyzing publicly available data, you shouldn’t have to worry. But if the data is in any way confidential, it might not be a good idea to allow the AI to see it.

If you are working in a setting where your company/organization has its own instance of an AI service, this confidentiality issue might be less of a concern. But if you are using public AI services, be careful.

Data analysis with no data

Ok, this sounds dumb, but it’s not that stupid. Instead of trying to feed the AI your data, you can describe your data and ask it to return code that could analyze it. For instance, you could provide a prompt like this:

Write R code to perform an exploratory data analysis of a data frame called dat. The data frame contains the continuous variables age and BMI, and the categorical variables sex and favorite color. Write code that produces a summary table, univariate plots for each variable, and a bivariate plot of age versus BMI.

Of course, you can take this further and ask the AI to first write code that generates hypothetical/synthetic data that has the structure of your real data, and then ask it to write code that analyzes it.
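For the example prompt above, the returned code might look roughly like the following. This is only a sketch in base R, not the output of any particular AI tool. The small data frame is a made-up stand-in so the code runs on its own, and the column name favorite_color is an assumption, since R column names usually avoid spaces.

```r
# Stand-in for the real data so the sketch runs; all values are made up.
dat <- data.frame(
  age = c(23, 35, 47, 52, 61, 29),
  BMI = c(21.4, 27.9, 24.3, 30.1, 26.5, 22.8),
  sex = factor(c("F", "M", "F", "M", "F", "M")),
  favorite_color = factor(c("blue", "green", "blue", "red", "green", "blue"))
)

# Summary table for all variables
summary(dat)

# Univariate plots: histograms for continuous, bar plots for categorical variables
hist(dat$age, main = "Age", xlab = "Age")
hist(dat$BMI, main = "BMI", xlab = "BMI")
barplot(table(dat$sex), main = "Sex")
barplot(table(dat$favorite_color), main = "Favorite color")

# Bivariate scatter plot of age and BMI
plot(BMI ~ age, data = dat, main = "Age and BMI")
```

To use such code on your real data, you would drop the stand-in data frame and load your own data into dat instead.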

Data analysis with synthetic data

A good solution is to generate synthetic data that looks like your real data, then ask the AI to write code to analyze this synthetic data. Once the AI gives you working code, you can take it offline and apply it to your real data.

Since synthetic (artificial/fake/simulated) data is very useful for many parts of the data analysis workflow, having synthetic data generation as part of your project is in general a good idea.
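As a minimal sketch of what synthetic data generation could look like, the code below simulates a data frame with the same variables as the example prompt earlier in this unit. The function name make_synthetic_data, the column name favorite_color, and all distribution settings (age range, BMI mean and standard deviation, category labels) are assumptions chosen for illustration. In practice you would pick values that roughly resemble your real data.

```r
# Hypothetical helper: simulate a data frame with the same columns as the
# real data. All distribution parameters below are assumptions.
make_synthetic_data <- function(n = 100, seed = 123) {
  set.seed(seed)  # fixes the random number stream so results are reproducible
  data.frame(
    age = round(runif(n, min = 18, max = 80)),
    BMI = round(rnorm(n, mean = 26, sd = 4), 1),
    sex = factor(sample(c("F", "M"), n, replace = TRUE)),
    favorite_color = factor(sample(c("blue", "green", "red"), n, replace = TRUE))
  )
}

syn_dat <- make_synthetic_data(n = 200)
str(syn_dat)
```

Because the synthetic data contains no real records, you can share it with an AI tool without the confidentiality concerns discussed above.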

Data analysis with copy and paste

A common way to interact with AI tools is through the web browser. If you have a fairly simple dataset, you can paste it directly into the prompt and ask the AI to perform some action. It’s a bit limited, but it can work and is quick and easy.
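One way to get a small dataset into a form that pastes cleanly is to print it as CSV text in the R console. The sketch below uses a made-up data frame called dat, and the limit of 20 rows is an arbitrary choice.

```r
# Small made-up example data; replace with your own (non-confidential) data.
dat <- data.frame(
  age = c(23, 35, 47),
  BMI = c(21.4, 27.9, 24.3),
  sex = c("F", "M", "F")
)

# With no file given, write.csv() prints the CSV text to the console,
# where it can be copied into the prompt.
write.csv(head(dat, 20), row.names = FALSE)
```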

Data analysis with file upload

You can upload files to some AI tools and ask them to operate on them. Assuming all is good on the confidentiality side of things, you can use this approach to get AI-generated insights into your data.

You upload the data files, and you can also include other files with potentially relevant information. Note that not all data formats are supported by all AI tools. CSV files are generally a safe bet.

You can then ask the AI to operate on the data. While you can ask it to analyze the data directly, I recommend asking it to write R code that performs the analysis. This way you can copy the code to your computer, run it yourself later, and modify it if needed.
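To make AI-written code easy to rerun and adapt, it can help to wrap it in a function that takes a file path. The sketch below is one possible shape for this. The function name summarize_csv and the temporary example file are hypothetical and only there to make the code runnable.

```r
# Hypothetical wrapper around AI-written analysis code. It takes a file path,
# so the same code can run on an uploaded copy or on a local file.
summarize_csv <- function(path) {
  dat <- read.csv(path, stringsAsFactors = TRUE)
  list(
    n_rows = nrow(dat),
    n_missing = colSums(is.na(dat)),  # number of missing values per column
    summary = summary(dat)
  )
}

# Demonstration with a temporary file holding made-up values
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c(23, 35, NA), sex = c("F", "M", "F")),
  tmp,
  row.names = FALSE
)
res <- summarize_csv(tmp)
res$n_rows
res$n_missing
```

Running the function on your own computer only requires changing the path, and you can inspect or modify each step before trusting the result.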

AI on your computer

Probably the best way to use AI tools to analyze data is to have the AI tools running on your own computer. This way you can have the AI operate directly on your local files, without having to upload them anywhere. This also allows the AI to see other parts of your project. The tools discussed in the AI Tools unit, especially those that tie in nicely with Positron, are great ways of doing this.

The DataBot extension for Positron is one such tool. It is specifically designed to perform AI-based data analysis and is certainly worth trying.

Best practices

The same tips and best practices discussed in the AI Coding unit apply. As always, being specific in your prompt helps. For instance, if you want the AI to perform very specific analyses or use specific R packages, mention this in your prompt. Alternatively, you could tell it what you want and ask it to recommend some analysis approaches or R packages, along with their advantages and disadvantages. As needed, you can ask for deeper explanations to make sure you fully understand any proposed approaches that might be new to you.

Summary

While it is important to keep confidentiality issues in mind, AI tools can be great helpers in analyzing data. Always use AI to help you understand the data, and do not let it make decisions for you, no matter how good-looking and convincing the code and plots it creates 😁.

Further Resources

  • None at this time.

Test yourself

What is the recommended approach when using AI tools on confidential data?

The unit emphasizes avoiding exposure of confidential data by using synthetic/anonymized data or keeping data local.

Why might you ask an AI tool to write analysis code instead of doing the analysis directly?

Having the AI generate code allows you to review, run, and adapt it locally.

What is one benefit of using synthetic data with AI tools?

Synthetic data lets you work safely while still getting useful code or analysis ideas.

Practice