Complex types of data
Overview
For this unit, we will discuss different types of data and how data type influences the analysis approach.
Learning Objectives
- Be familiar with some common, complex data types
- Understand how the type and structure of data influences analysis approaches
Introduction
Our focus is on the analysis of what is sometimes called tabular or rectangular data. We have observations (individual units of data) generally as rows, and variables (outcome and predictors) as columns.
Such data is still the most common type of data, especially in biomedical and life sciences. However, more complex types of data are rapidly increasing in frequency and importance. If the data has a different structure, it will usually require different, specialized analysis approaches.
The good news is that most of the general concepts of data analysis and modeling apply to any type of data. The difference is often in the processing of the data, and the exact way data and models are combined.
Doing analyses of such complex data is beyond the main scope of the course. Nevertheless, it is worth briefly covering different important types of data. Note that some of these types are not mutually exclusive; for instance, video data is always in some sense also image and audio data, and large video collections quickly become big data.
Text
Working with and analyzing larger sections of text is becoming increasingly important. Complex and powerful AI tools (e.g., ChatGPT) have become rather good at working with text. The analysis of text often goes by the term natural language processing. Such text analysis will only grow in importance as data streams of that type keep expanding. If you are interested in doing full analyses of text data, the tidytext R package and the Text mining with R book are great resources. A short introduction to this topic is the Text Mining chapter (27) of IDS.
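As a flavor of what a tidy text analysis looks like, here is a minimal sketch using the tidytext package mentioned above; the two example sentences are made up for illustration.

```r
# Minimal tidytext sketch: tokenize text and count word frequencies
library(dplyr)
library(tidytext)

docs <- tibble(
  doc_id = 1:2,
  text = c("Natural language processing turns text into data.",
           "Word counts are a simple first analysis.")
)

docs |>
  unnest_tokens(word, text) |>          # one row per word ("tidy" format)
  anti_join(stop_words, by = "word") |> # drop common filler words
  count(word, sort = TRUE)              # word frequencies
```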
Of course, with the rise of text-based LLM AI tools and similar such offerings, this whole area of data analysis has grown very large on its own. If you want to really dig deep into this, check out some of the course and training offerings from DeepLearning.AI.
Audio
Audio recordings can contain text/speech or other sounds. In the former case, models need to be applied to transform the audio into text; after that, any text analysis tool can be used. If the audio does not contain speech, one likely wants to apply other analyses. For instance, one might want to detect whether a certain sound is present, such as a fire alarm going off. This generally requires partitioning and transforming the audio into pieces that can be turned into some form of numeric variables and analyzed. Audio has a time-series structure that needs to be taken into account when analyzing it.
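A minimal sketch of that partition-and-transform idea, assuming the tuneR package and a hypothetical recording alarm.wav (both the file name and the chosen feature are just for illustration):

```r
# Turn an audio signal into a numeric time series of simple features
library(tuneR)

wav <- readWave("alarm.wav")             # hypothetical audio file
samples <- wav@left / 2^(wav@bit - 1)    # normalize amplitudes to [-1, 1]

# Split the signal into 0.1-second windows and compute the
# root-mean-square energy of each window
window_size <- wav@samp.rate %/% 10
n_windows <- length(samples) %/% window_size
rms <- sapply(seq_len(n_windows), function(i) {
  chunk <- samples[((i - 1) * window_size + 1):(i * window_size)]
  sqrt(mean(chunk^2))
})
# `rms` is now a numeric vector with time-series structure that
# downstream models can work with
```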
Images
For analysis purposes, one can think of an image as essentially a matrix of pixels, with each pixel having color and intensity values. This large amount of (usually very structured) data needs to be processed.
Images are generally converted into multiple matrices of pixel values. For instance, one could divide an image into a 100x100 grid of pixels and assign each pixel RGB (color) values and an intensity. That means one would have 4 matrices of numeric values, each of size 100x100. One would then perform operations on those values. As you can imagine, this quickly leads to fairly large amounts of data. These days, most successful image analysis is done using some form of neural nets, which are generally considered an artificial intelligence (AI) method. (But recall that ML and AI terminology is somewhat fuzzy. The general current usage is that AI methods are a type of ML, specifically the neural-net type.)
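To make the image-as-matrices idea concrete, here is a minimal sketch in base R with simulated pixel values (a real workflow would read the values from an image file, e.g., with the magick or imager packages):

```r
# Represent a 100x100 RGB image as matrices of numeric values
set.seed(123)
width <- 100
height <- 100

# One matrix per color channel, stored together as a 3D array
img <- array(runif(height * width * 3),
             dim = c(height, width, 3),
             dimnames = list(NULL, NULL, c("red", "green", "blue")))

# A simple intensity channel, here defined as the mean of the RGB values
intensity <- (img[, , "red"] + img[, , "green"] + img[, , "blue"]) / 3

# Typical pre-processing steps are then just matrix operations
img_rescaled <- img / max(img)   # rescale all pixel values to [0, 1]
dim(img)                         # 100 100 3: a lot of numbers per image
```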
Video
Video is essentially a time series of images with audio. As such, approaches that work for image and audio analysis can also be applied to video. Of course, the time-series nature of the images makes things more complicated. Neural-net/AI-based methods are most often used.
Genetics and -omics data
The main genetic type of data is based on sequences. A lot of specialized tools exist to work with what is often fairly noisy data. Aligning sequences, comparing them, placing them on phylogenetic trees, and other such operations are so common and important that a large ecosystem of tools exists for those purposes.
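As a small taste, here is a minimal sketch of low-level sequence handling using the Bioconductor package Biostrings (the sequences are made up; the package installs via BiocManager::install("Biostrings")):

```r
# Basic operations on a set of (made-up) DNA sequences
library(Biostrings)

seqs <- DNAStringSet(c(sample1 = "ATGGCGTATTAG",
                       sample2 = "ATGGCTTACTAG"))

reverseComplement(seqs)             # a common low-level sequence operation
letterFrequency(seqs, c("G", "C"))  # per-sequence G and C counts
```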
Data that is often called -omics data (e.g., metabolomics, glycomics) is typically rectangular in structure, but has distinguishing features, such as few individuals/rows and many variables/columns. Such data needs special treatment; variable/feature reduction is a common step in the analysis workflow.
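A minimal sketch of such a feature-reduction step, using simulated wide data and base R's prcomp() for principal component analysis (just one of many possible reduction approaches):

```r
# Simulated "-omics"-style data: few rows, many columns
set.seed(42)
n_samples <- 20
n_features <- 1000
x <- matrix(rnorm(n_samples * n_features), nrow = n_samples)

# PCA compresses the 1000 features into a few components that
# capture most of the variation
pca <- prcomp(x, center = TRUE, scale. = TRUE)
summary(pca)$importance[, 1:5]  # variance explained by first components
```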
The bioconductor website is your source for (almost) all tools and resources related to genetics and omics-type data analyses in R.
Big data
This isn’t really a type of data in the same sense as the ones listed above. Still, it forms its own category simply because of the sheer amount of data involved. Some of the complex data types above, e.g., videos, quickly become big data.
While the term big data is a bit fuzzy, in general it means any dataset that doesn’t easily fit into the memory of a regular computer (or cluster) and thus needs to be analyzed using special tools. Alternatively, big data might be defined as data so big that analyzing it with standard tools (e.g., R on a regular computer) takes too long and instead requires special treatment. Of course, this also depends on the type of analysis, not only the type of data. As computers keep getting faster and tools more flexible and integrated, the label big data is a moving target.
Generally, big data is stored somewhere in a database; SQL-type databases are most common. You then want to access that database in a form that allows you to perform your analysis. There are different ways of dealing with big data, and most are general and apply independent of the programming language you use. This article describes a few general approaches and explains how they can be implemented in R. This webinar gives a bit more information and a nice description of the overall setup for big data. As you learn in that webinar, R is often used together with other software to analyze big data. A tool that is often used for big data analysis is Spark. For R, there is the sparklyr package, which allows one to nicely interface with Spark.
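A minimal sketch of the sparklyr workflow just described, assuming a local Spark installation (sparklyr's spark_install() can set one up) and using a small built-in dataset as a stand-in for actual big data:

```r
# Connect R to Spark and run a dplyr-style computation inside Spark
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # local Spark instance for testing

# Copy a small demo dataset into Spark; real big data would already
# live in the cluster or a database
cars_tbl <- copy_to(sc, mtcars, "cars")

cars_tbl |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>  # runs inside Spark
  collect()                        # bring the small result back into R

spark_disconnect(sc)
```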
In general, when you work with big data, you will have to carefully look at the data, the type of database it is stored in, and the analysis goals. Based on that, you can choose a stack of tools that allows the analysis. The Databases task view gives a good overview of different R packages for specific types of databases. You will use R for your analysis, and R will then interface with other software. This interface is usually fairly seamless.
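For instance, here is a minimal sketch of the database-access pattern, assuming the DBI and RSQLite packages and using an in-memory SQLite database as a stand-in for a real database server:

```r
# Query a database from R and pull back only the (small) result
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in database
dbWriteTable(con, "cars", mtcars)

# The aggregation happens in the database, not in R
res <- dbGetQuery(con,
  "SELECT cyl, AVG(mpg) AS mean_mpg FROM cars GROUP BY cyl")
res

dbDisconnect(con)
```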
Big data is commonly modeled using complex models, such as machine-learning or AI algorithms. The reason is that those models are very powerful but need a lot of data, so if you have big data, you can use them. That said, you can analyze big data with any model you want, including a simple GLM or similar.
Hierarchical data
This type of data is actually quite common. A hierarchy (at times also called a multi-level or nesting structure) occurs when some type of grouping is inherent in the data. For instance, if you analyze heart attacks from different hospitals, there might be systematic differences between hospitals that are worth taking into account. You can then build models that take that grouping into account. I think there is a strong case to be made that one should often start with a hierarchical modeling approach, and only drop it if one is convinced that there are no systematic differences in the data. Unfortunately, such hierarchical models are still a bit harder to implement, and often a lot harder to understand and interpret. For such models, a Bayesian framework is often very useful. A good introductory resource for hierarchical modeling, and especially how to do it with R, is Statistical Rethinking (the second half of the book). There is also a CRAN Task View that lists R packages relevant to hierarchical modeling.
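A minimal (non-Bayesian) sketch of the hospital example above, using a random-intercept model from the lme4 package (one of the packages covered by that Task View) on simulated data:

```r
# Hierarchical (mixed-effects) model: patients nested within hospitals
library(lme4)

set.seed(1)
n_hospitals <- 10
d <- data.frame(
  hospital = rep(paste0("H", 1:n_hospitals), each = 30),
  age = rnorm(300, mean = 60, sd = 10)
)
# Simulate systematic differences between hospitals
hosp_effect <- rnorm(n_hospitals, mean = 0, sd = 2)
d$outcome <- 50 + 0.3 * d$age +
  hosp_effect[as.integer(factor(d$hospital))] + rnorm(300, sd = 5)

# A random intercept per hospital captures between-hospital differences
fit <- lmer(outcome ~ age + (1 | hospital), data = d)
summary(fit)
```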
Another term you might hear is recursively nested data. While this is a type of hierarchical data, when we discuss hierarchical data we often think about nesting in terms of statistical clusters. Computer scientists and web developers often use hierarchical data for storage and data transfer purposes, and often call it recursively nested data instead.
Recursively nested data has become extremely common on the internet over the past few years, and is essentially the default data format of APIs. While statisticians often store hierarchical data in one or more rectangular datasets, recursively nested data is stored in recursive lists. For example, consider the following list of cars in Jay Leno’s Garage.
- Mazda Miata
  - Year: 1996
  - Original MSRP: $18,450
  - Color: Red
  - Convertible: True
- Chevrolet Volt
  - Year: 2011
  - Original MSRP: $41,000
  - Color: Steel Metallic
  - Interior: Black
  - Convertible: False
  - Hybrid: True
- Etc.
While we can imagine a world where this data is stored in a rectangular format (see mtcars for example), we would have to do some processing to get it into that format, and we would have to decide what to do with the fields that are not shared across all the entries. These data can also have lists nested within lists (nested within lists, to very high levels of recursion), which can be more challenging to "rectangle" (that is, convert to a rectangular format).
Data in this format is usually stored as JSON or XML, which can be read into R using the rjson or xml2 packages, among others. The new edition of R4DS has a chapter on working with hierarchical and recursively nested data (but does not cover the fitting/analysis part). The repurrrsive package, by Jenny Bryan, contains some examples of this type of data.
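A minimal sketch of reading and rectangling such data, using the jsonlite and tidyr packages (an alternative to rjson) on a small made-up version of the garage list above:

```r
# Parse nested JSON and "rectangle" it into a tibble
library(jsonlite)
library(tibble)
library(tidyr)

garage_json <- '[
  {"model": "Mazda Miata", "year": 1996, "color": "Red",
   "convertible": true},
  {"model": "Chevrolet Volt", "year": 2011, "color": "Steel Metallic",
   "interior": "Black", "convertible": false, "hybrid": true}
]'

# Parse into a nested list, then spread each car's fields into columns;
# fields missing from some entries become NA
cars <- fromJSON(garage_json, simplifyVector = FALSE)
tibble(car = cars) |>
  unnest_wider(car)
```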
Longitudinal/Time-series data
Longitudinal or time-series data are a specialized data type in which measurements are autocorrelated (i.e., each measurement is correlated with at least one measurement that came before it), and we can therefore get better predictions by using specialized models that take this data structure into account. A good resource is the free textbook Forecasting: Principles and Practice and a lot of the other work by Rob Hyndman.
A very useful set of tools for time-series work in R is the collection of packages called the tidyverts. The modeltime R package allows one to use the tidymodels framework to analyze time-series data.
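A minimal sketch of a tidyverts-style forecasting workflow, using the fable and tsibble packages with a classic built-in dataset:

```r
# Fit an ARIMA model and forecast 12 months ahead
library(tsibble)
library(fable)

passengers <- as_tsibble(AirPassengers)  # convert a base R time series

passengers |>
  model(arima = ARIMA(value)) |>  # automatic ARIMA order selection
  forecast(h = 12)                # 12-step-ahead forecasts
```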
CRAN also has a Task View for Time Series Analysis. (A Task View on CRAN is a page that collects and summarizes various R packages for a specific topic.) Another task view that deals with longitudinal/time-series data is the Survival Analysis Task View.
Note that, somewhat confusingly, longitudinal data are common in epidemiology, and despite being measured over time, “time-series methods” are not usually appropriate. Time-series and forecasting methods are mainly appropriate for data where the times are dense, which means many time points are measured somewhat close together. The term “longitudinal data” (also called panel data in some fields) is most commonly used to refer to sparse data collected over time, where each study unit is only measured a handful of times which may be spread apart. Most cohort studies therefore fall into this longitudinal framework, not into a time series framework.
Spatial data
Like time-series data, spatial data also feature autocorrelation, but typically in two dimensions (i.e., latitude and longitude). Spatiotemporal data, typically collected as repeated measurements of spatial data over time, are also somewhat common. One could also include an elevation coordinate to get 3D spatial data (or even 4D spatiotemporal data), but most analyses focus on 2D spatial autocorrelation structures.
While there is (to my knowledge) no current way to fit specific spatial models in tidymodels (i.e., no modeltime analogue), one can use spatial resampling through the spatialsample package, which provides resampling methods that take spatial autocorrelation into account.
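A minimal sketch of such spatial resampling, assuming the sf and spatialsample packages and simulated point locations (the exact interface may differ slightly across spatialsample versions):

```r
# Build spatially aware cross-validation folds
library(sf)
library(spatialsample)

set.seed(7)
pts <- st_as_sf(
  data.frame(x = runif(100), y = runif(100), outcome = rnorm(100)),
  coords = c("x", "y")
)

# Cluster nearby points into folds so spatially close observations stay
# together, reducing leakage caused by spatial autocorrelation
folds <- spatial_clustering_cv(pts, v = 5)
folds
```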
Summary
The amount and variety of data available for analysis keeps increasing. As data get more complex, the analysis approaches become more specialized and often complex as well. While this requires additional learning of specialized tools, a lot of the fundamentals covered in this course still apply.
Further Resources
- You can find a community-curated list of resources for spatial data in The Big Book of R or in the Spatial Data CRAN task view.