Introduction to current LLM AI tools
Overview
This unit provides an introduction to the use of Large Language Model (LLM) based Artificial Intelligence (AI) tools (such as ChatGPT) for coding and data analysis.
Learning Objectives
- Be aware of possible ways to use AI to help with data analysis.
- Understand strengths, weaknesses and dangers of using AI.
Introduction
By now, you have certainly heard of ChatGPT and similar AI-based tools. You probably have also tried them out. AI is a broad field, and there are different AI algorithms. You can potentially use some of them to analyze your data. We discussed this in a previous unit. The focus here is on AI tools that can help you in various coding and data analysis tasks.
Currently, the best tools for that are those based on Large Language Models (LLMs). For simplicity, I’m just going to call them AI tools here, but be aware that there are other AI approaches out there.
This is a very practical tutorial from a user perspective. I’ll only discuss how to use those tools. If you want to learn more about the general workings, ethical issues regarding licensing of training data and biased results, etc., take a look at the AI resources page.
Available AI tools
The field is changing rapidly. To keep up, and to have everything in one place, I made a separate unit that discusses the different AI tools.
Shortcomings and Dangers of AI Tools
For all its promises, AI is not without its problems. It still has many shortcomings and dangers. Here, I’ll discuss some of the most important ones.
I’m focusing here on the issues most relevant for our purposes. Broader, trickier topics such as bias are real, but I don’t think they are too relevant for using AI tools to help with data analysis and coding.
Privacy/Confidentiality
When you use AI to help with data analysis, you might want to show your data to the AI and ask it to do something with it. The problem is that pretty much all these systems take your data and move it to the servers of whichever company you use for processing. Often, these companies keep your data to help improve their models. If you have data that is sensitive, e.g., human data, or data that you don’t want to share, then you need to be careful not to let the AI have access to it. A good solution is to generate synthetic data that has the structure of your real data but is made up. You can then ask the AI to process this synthetic data and give you the code it generated. In a later step, you can go “offline” and use the code the AI helped you write on your real data. We’ll discuss how to generate synthetic/artificial data shortly.
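As a minimal sketch of what that could look like, here is some R code that generates made-up data (the variable names and distributions here are assumptions for illustration, not your real data structure):

```r
# Minimal sketch: generate made-up data that mimics the structure of a
# hypothetical real dataset with columns age, weight, and treatment.
set.seed(123) # makes the synthetic data itself reproducible
n <- 100
synthetic_data <- data.frame(
  age = round(runif(n, min = 18, max = 80)),
  weight = round(rnorm(n, mean = 75, sd = 12), 1),
  treatment = sample(c("placebo", "drug"), n, replace = TRUE)
)
# Share only this synthetic data with the AI, get code back, then run
# that code "offline" on your real data.
head(synthetic_data)
```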
Beyond the data, it is important to keep in mind that if you have the AI running (e.g., through the GitHub Copilot integration in RStudio), it might access not just your current file but possibly also other sources you link to, e.g., other files on your computer. If there are things you don’t want the company to “grab” and copy to their servers, then be careful about what you let the AI access.
In general, be careful what information you let the AI access and be aware that it might end up on the server of whatever company is running the AI tool.
Non-working code
While most of the AI tools have become pretty good at coding in R and other languages, at times they don’t get it right, and you might get code back that doesn’t work. AIs are known to hallucinate, i.e., make things up. For instance, it’s not uncommon for the AI to invent an R command that does not exist. So when you get your code back, you will often need to do some troubleshooting. At times, you can tell the AI that the code is not correct and ask it to fix it. This doesn’t always work; if it doesn’t, you can make the fixes by hand or try to reformulate your request.
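To make this concrete, here is a made-up example of what such a hallucination could look like (summarize_mean() is a hypothetical, non-existent function name used purely for illustration):

```r
library(dplyr)

# Hypothetical AI-generated code: summarize_mean() does not exist in
# dplyr, so this line would throw an error.
# result <- mtcars %>% summarize_mean(mpg)

# A working fix using a function that does exist:
result <- mtcars %>%
  summarize(mean_mpg = mean(mpg)) # compute the mean of the mpg column
result
```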
Reproducibility
The companies making AI tools constantly update and improve their algorithms. Further, the underlying methods often have random components. This means that if you give the same instructions to an AI tool on different occasions, the results/code you get back might differ, i.e., things might not be reproducible.
While I encourage you to keep track of any input prompts you use to have the AI generate code, note that providing those prompts does not allow someone else (or future you) to reproduce things exactly. Thus, while AI tools can be useful helpers during the data analysis process, they should not be considered part of the final workflow. That workflow should instead contain results/code (possibly generated with AI help) that can be run in a way that allows full reproducibility.
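As an illustration, a final script could look something like this sketch (the analysis step is just a placeholder), where the script runs fully on its own, without any AI tool involved:

```r
# final_analysis.R - runs fully on its own, no AI tool required
set.seed(42)                         # fix any random components

fit <- lm(mpg ~ cyl, data = mtcars)  # placeholder analysis step
summary(fit)                         # show model results

sessionInfo()                        # document R version and packages
```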
Ethics of using AI
Here I’m not talking about the fact that most AI tools were trained on data without the permission of the original data generators, or that a lot of AI algorithms produce biased (unethical?) output. Those topics are too complex to tackle here. Instead, I’m talking about the ethics of using AI to help with your data analysis, or more generally your academic/professional work.
Since these tools are so new, nobody knows yet how to properly acknowledge AI help. I’m not sure either. If I search Google for an example to help with my code and find something on StackOverflow that I use as the basis for my code, should I cite it? I sometimes do add a link to the original post in my code, partly to give credit and partly to remind myself where I got it from. But I don’t think there are clear rules on this.
Similarly, if you copy text from Wikipedia or some other source, you need to cite it. But if you read it and then repeat it in your own words, when do you need to cite it? I don’t think that’s clear. It is similar with AI. If you have full code or a large chunk of text generated by AI and you use it “as is”, you probably need to state that. But it’s more likely that the AI will give you some parts of the code or text and you write the rest. What is the rule for that? Again, I don’t think clear rules exist.
I suggest you follow the guideline of “if in doubt, cite/acknowledge”. For instance, at the beginning of some code, you can make a statement saying “part of this code was generated by ChatGPT using the following prompt” and then state the prompt you used (see the sketch below). Or if you use AI to help you interpret your data, you can state somewhere in your output (e.g., your Quarto document) that you used AI to help generate insights/text/etc. Providing this kind of information prevents you from being accused of “cheating” (in case someone thinks using AI is cheating; for this course, you are explicitly encouraged to use it), and it also helps somewhat with reproducibility (see above).
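For code, such an acknowledgment could look something like this (the wording and prompt are just examples):

```r
# Acknowledgment: parts of this code were generated by ChatGPT using
# the following prompt: "Write R code that removes rows with missing
# values from the airquality dataset and plots a histogram of Ozone."
clean_data <- na.omit(airquality) # drop rows with any missing values
hist(clean_data$Ozone)            # histogram of ozone measurements
```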
Tips for best practices
Here are some suggestions on how to use the AI tools as efficiently as possible.
- Think carefully about what exactly you want the AI to accomplish, and provide instructions that are as detailed and specific as possible. These days, this is often called prompt engineering. For instance, instead of “clean my data”, ask for something like “write R code that removes all rows with missing values from the dataset”.
- Iterate. Often, the first version you get from the AI will not be exactly what you want. You can ask the AI to rewrite/change/fix things, or you can manually edit the code and use that as input to ask the AI for further changes.
- Try different engines. E.g., if ChatGPT doesn’t give you something useful, try Bing (which is also currently powered by ChatGPT, but gives somewhat different results).
- Ask the AI to add a lot of comments to the code to explain what each line of code does.
- Break down big tasks into smaller tasks. Ask the AI to solve these smaller tasks individually and then pull things together.
- Cite as appropriate. If in doubt, include more information (e.g., prompts).
Summary
AI tools can be very useful helpers for writing code and doing your data analysis. They are not perfect, but they can potentially save you a lot of time. I recommend you try them out and see if they can help you.
Further Resources
See the AI resource unit. If you have other useful resources, I’d love to hear about them.