AI tools for exploratory data analysis

Author

Andreas Handel

Modified

2024-03-20

Overview

In this unit, we discuss using AI tools to help with exploratory data analysis.

Learning Objectives

  • Know why and how to use AI to help with exploratory analysis.
  • Be aware of possible confidentiality issues.

Introduction

In an exploratory analysis, you generally want to look at many aspects of your data to get a good idea of what you have in front of you. While R has many powerful functions that let you explore your data quickly, combining R with AI generated code can speed up things even more.

Confidentiality and Privacy

It is important to re-iterate from a previous unit: If you allow the AI to “see” your data, this data might end up on the servers of the company running the AI (ChatGPT, Microsoft, Google) and might be used by them for future training of their models. Therefore, be careful with what you let the AI see. If you are re-analyzing publicly available data, you shouldn’t have to worry. But if the data is in any way confidential, it might not be a good idea to allow the AI to see it.

The best solution in that case is to generate synthetic data that looks like your real data, then ask the AI to write code to analyze this synthetic data. Once the AI gives you working code, you can take it off-line and apply it to your real data.

Since synthetic (artificial/fake/simulated) data is very useful for many parts of the data analyis workflow, it is covered in a separate module.

In the following I’m assuming that you have data that can be shared with the AI.

Exploratory data analyis (EDA) with no data

Ok, this sounds dumb, but it’s not that stupid. Instead of trying to feed the AI your data, you can ask it to generate code that does EDA on hypothetical data. For instance, you could provide a prompt like this:

Write R code to perform an exploratory data analysis of a data frame called dat. The data frame contains the continuous variables age and BMI, and the categorical variables sex and favorite color. Write code that produces a summary table, univariate plots for each variable, and a bivariate plot of age versus BMI.

When I typed the above into Bing AI in precise mode, I got this code. It’s a good start. You can take this code and modify so it can be applied to your data.

Of course, you can take this further and ask the AI to both generate the data you have in mind, then write EDA code. Basically ask it to write code along the lines of the examples from the synthetic data module and EDA code.

Exploratory data analyis (EDA) with copy and paste

Bing and the free ChatGPT version currently do not provide a way to upload data.

If you have fairly simple dataset, you can paste it directly into the prompt.

Here is an example. This dataset comes from the first example in the
Generating synthetic data with R unit. I opened the CSV file, and copied the whole thing into the prompt:

Perform an exploratory data analysis using R code of the following data set:

PASTE DATA HERE

To get a new line in your prompts, use Shift + Ctrl/Return (or whatever the equivalent is on a Mac 😁.)

I received this code from the AI. It runs and does a few nice exploratory analyses, such as writing summary tables and making a few plots.

It would of course be better if I provided much more detailed instructions regarding the types of analyses, and possible even the R packages and functions I want to have. But I think you get the idea. If the data is not too big, you can write your prompt, then copy the data underneath and get code that does EDA on the data.

Note that there are character limits. For Bing, the maximum is 4000 as of this writing. If your data is too big, it gets cut off. In the example above, the last few lines of data got cut. I think this is often not a big problem, since the data you supply isn’t the real data anyway. All you want is enough data for the AI to generate EDA code, then you take that code and apply it to your real data later, with AI turned off.

Exploratory data analyis (EDA) with file upload

The free version of ChatGPT or Bing do not allow upload of files. However, Claude does. I repeated the process from above, but instead of pasting the data, I uploaded the CSV file.

Write R code to perform an exploratory data analysis of the attached CSV file.

I received this code from the AI. It runs and produces some figures. Of course, as you learned previously, it would be better if I provided a much more specific prompt, saying exactly what kind of tables and figures I want to see as output. But you get the idea.

Claude allows you (as of this writing) to upload a maximum of 5 files with 10MB each. While that prevents you to perform EDA on very large datasets, it should be enough for many purposes, and is definitely more than the limit for the copy and paste method described above.

Copilot

I tried using Copilot to help with EDA, but it wasn’t very successful. This screenshot shows an example of me trying to get it to produce a scatterplot. It didn’t work, Copilot did not return any suggestion. It did suggest the other lines of code. I’m not sure if I’m just not using Copilot correctly, maybe you’ve got better luck. It doesn’t hurt turning it on and trying to see if it can help you. If you find it useful, let me know so I can update this.

A screenshot of some R code showing an attempt to get Copilot to create a scatterplot, with copilot not returning any suggestion.

Copilot coding attempt

chattr package

Based on the package description, it should be able to write code based on not only the content in the current file, but also by looking into variables in the current environment. So if you have a data frame called dat in your environment, you should be able to ask the AI to write code to analyze it. Unfortunately, so far I haven’t been able to get chattr to talk to OpenAI. I think this is because talking to ChatGPT via the API (which is what chattr does) requires a paid account.

Other options

The subscription-level access to OpenAI’s GPT engines gives you access to their Advanced Data Analysis tool. This allows you to upload and explore data more easily. I don’t have a subscription so I haven’t tried it yet. But if you are a heavy user of AI for data analysis, this might be worth the cost. This article describes the Advanced Data Analysis tool a bit more.

I have also found a few other tools that seem to focus on data analysis, for instance Julius, but that one only works for Python code.

Summary

I haven’t tried any of the premium/paid options yet. I assume they will be superior, but I wanted to focus on free options. Among those, I currently think for EDA these 2 are the best:

  • (Make the AI) produce code that generates synthetic data that looks like you want it to, then feed that to the AI and ask it to write code to perform EDA on the synthetic data. Once you have working code, you can apply it to your real data.

  • Use Claude to upload your (synthetic or otherwise ok to share) data and ask it to write code to perform EDA on it.

Further Resources

See the AI resource unit.