How to Import Data into R

Anastasios Antoniadis

R is a powerful tool that outperforms software like Microsoft Excel in statistical computing, data analysis, and visualization. It enables faster data loading, automated cleaning, and advanced statistical and predictive analysis. However, before diving into analysis, the first critical step is importing data into R efficiently. Data can come from various sources—CSV files, Excel spreadsheets, databases, web APIs, or even proprietary statistical software formats like SPSS, Stata, and SAS. Handling these different file types effectively is key to ensuring a smooth workflow.

I’ve put together a clear and comprehensive tutorial on importing data into R to simplify the process. From basic text files to more advanced SPSS and SAS formats, this guide will help you seamlessly load your data. Keep reading to learn how to import your files into R with ease!

What You’ll Learn in This Guide

In this guide, I will walk you through the best methods for importing different types of data into R, including:

  • Flat files (CSV, TXT, TSV)
  • Excel files (XLSX, XLS)
  • Structured data (JSON, XML, HTML)
  • Database connections (SQL, SQLite, MySQL, PostgreSQL)
  • Statistical software formats (SPSS, Stata, SAS)
  • MATLAB files (.mat)
  • Google Sheets (cloud-based data)
  • Financial data via quantmod
  • Large datasets

By the end of this guide, you’ll be able to:

  • Use the right package for each file type.
  • Avoid common pitfalls in data importing.
  • Optimize data imports for speed and efficiency.
  • Handle large datasets without crashing R.

No matter your level of expertise, mastering data import will streamline your workflow and allow you to focus on data analysis and visualization rather than struggling with file compatibility issues.

Let’s get started! 🚀

First Step: Package CheatSheet & Installation

In R, different packages are used to import various types of data formats. Here’s a list of commonly used packages based on the type of data source:

Base R Functions (No Extra Packages Required)

  • read.table(), read.csv(), read.delim(), read.csv2() → Read text-based data (CSV, TSV)
  • scan() → Reads data into a vector, field by field
  • file() → Opens a raw file connection for manual reading
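As a quick, minimal sketch of these base functions (the file names below are placeholders for illustration, not files from this tutorial):

```r
# read.csv(): comma-separated file with a header row
df <- read.csv("data.csv")

# scan(): read a file into a flat vector, e.g. one number per field
values <- scan("numbers.txt", what = numeric())

# file() + readLines(): open a raw connection and read lines manually
con <- file("notes.txt", open = "r")
first_lines <- readLines(con, n = 5)
close(con)
```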

Read Text, CSV, and Excel Files

| Format | Package | Function |
|---|---|---|
| CSV/TSV/Delimited Files | utils | read.csv(), read.delim() |
| CSV (Efficient) | data.table | fread() |
| CSV/TSV (Tidy Approach) | readr | read_csv(), read_tsv() |
| Excel (.xlsx, .xls) | readxl | read_excel() |
| Excel (Write Support) | writexl | write_xlsx() |
| Google Sheets | googlesheets4 | read_sheet() |

Read from Databases

| Database | Package | Function |
|---|---|---|
| MySQL, PostgreSQL, SQLite | DBI, RMySQL, RPostgreSQL, RSQLite | dbConnect(), dbReadTable() |
| Microsoft SQL Server | odbc, RODBC | odbcConnect(), sqlQuery() |
| Google BigQuery | bigrquery | dbConnect(), bq_table_download() |

Read JSON, XML, and Web Data

| Format | Package | Function |
|---|---|---|
| JSON | jsonlite | fromJSON() |
| XML | XML, xml2 | xmlParse(), read_xml() |
| Web Scraping | rvest | read_html(), html_table() |

Read Statistical & Specialized File Formats

| Format | Package | Function |
|---|---|---|
| SPSS (.sav) | haven, foreign | read_sav() |
| Stata (.dta) | haven, foreign | read_dta() |
| SAS (.sas7bdat) | haven | read_sas() |
| MATLAB (.mat) | R.matlab | readMat() |
| HDF5 | hdf5r, rhdf5 | h5read() |

Read Big Data & Parallel Processing

| Format | Package | Function |
|---|---|---|
| Large CSV | data.table | fread() |
| Apache Parquet | arrow | read_parquet() |
| Spark Data | sparklyr | spark_read_csv() |
| Hadoop | rhdfs | hdfs.read() |

Read GIS & Spatial Data

| Format | Package | Function |
|---|---|---|
| Shapefiles | sf, rgdal | st_read() |
| GeoJSON | geojsonio | geojson_read() |
| Raster Data | raster, terra | raster() |

To install multiple packages at once, pass a character vector of package names (built with c()) to install.packages():

install.packages(c('tidyverse',
                   'foreign',
                   'haven',
                   'quantmod',
                   'googlesheets4',
                   'ff',
                   'R.matlab'), dependencies = TRUE)

By including the dependencies=TRUE parameter in install.packages(), we ensure that all required dependencies are installed along with the main packages.
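Installing makes the packages available, but each new R session still has to attach them with library(). One common pattern for loading several at once (a sketch, mirroring the install list above):

```r
pkgs <- c("tidyverse", "foreign", "haven", "quantmod",
          "googlesheets4", "ff", "R.matlab")

# library() normally takes an unquoted name; character.only = TRUE lets it take a string
invisible(lapply(pkgs, library, character.only = TRUE))

# Install a package only if it is missing
if (!requireNamespace("haven", quietly = TRUE)) {
  install.packages("haven")
}
```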

Now that your environment is ready, you’re all set to start importing data into R effortlessly!

Importing Flat Files (CSV, TXT, TSV)

CSV (Comma-Separated Values) Files

CSV data in this example:

Movie_ID,Title,Genre,Director,Release_Year,IMDb_Rating,Box_Office_Millions
1,Inception,Sci-Fi,Christopher Nolan,2010,8.8,829
2,The Matrix,Sci-Fi,Lana Wachowski,1999,8.7,466
3,Interstellar,Sci-Fi,Christopher Nolan,2014,8.6,677
4,The Dark Knight,Action,Christopher Nolan,2008,9.0,1005
5,Pulp Fiction,Crime,Quentin Tarantino,1994,8.9,213

CSV files are one of the most common data formats. The readr package from tidyverse provides fast and efficient functions to read CSV files.

library(readr)

# Import CSV file
df <- read_csv("movies_dataset.csv")  # Automatically detects column types
head(df, 5)
R Movies dataset output
Image Credit: Beyond Tech Now

If your CSV file has no header row, supply column names with the col_names argument. (Because movies_dataset.csv does have a header, we also add skip = 1 here so the existing header line is not read as data.)

df <- read_csv("movies_dataset.csv", col_names = c("Movie_ID", "Title", "Genre", "Director", "Release_Year", "IMDb_Rating", "Box_Office_Millions"), skip = 1)
head(df, 5)

For base R users, you can use:

df <- read.csv("movies_dataset.csv", stringsAsFactors = FALSE)
head(df, 5)

In R, you can also use the read.table() function to import the movies_dataset.csv file. Since CSV files are comma-separated, you need to specify the separator (sep=",") and ensure that headers are read correctly (header=TRUE). Here’s how you can do it:

# Set the file path (update this if needed)
file_path <- "movies_dataset.csv"

# Read the CSV file
movies_df <- read.table(file_path, sep=",", header=TRUE, stringsAsFactors=FALSE)

# Display the first few rows
head(movies_df, 5)
  • file_path: The location of your CSV file.
  • sep=",": Specifies that the separator is a comma (important for CSV files).
  • header=TRUE: Ensures that the first row is treated as column names.
  • stringsAsFactors=FALSE: Prevents automatic conversion of character columns into factors.
R Movies dataset output

Importing TXT and TSV (Tab-Separated Values) Files

TXT data in this example:

[Intro]
Yeah, look
You ever feel like you’re trapped in a cage?
Like the whole world's watching but you’re stuck on the stage?

[Verse 1]
I was just a kid with a dream and a pen
Tryna scribble out my pain, make it bleed through again
Momma workin’ two jobs, pops gone in the wind
So I put my life in bars, let the therapy begin

[Chorus]
They don’t know what I’ve been through
Every scar, every battle that I lived through
I just fight, I just rise, I continue
Yeah, they don’t know, they don’t know

[Verse 2]
See, I came from the bottom, no silver spoon
Had to hustle just to eat, yeah, real life goon
Teachers said I wouldn’t make it, now they watch from the side
While I'm tearing up these stages with the fire inside

[Chorus]
They don’t know what I’ve been through
Every scar, every battle that I lived through
I just fight, I just rise, I continue
Yeah, they don’t know, they don’t know

[Outro]
Now they wanna see me fall, but I stand too tall
Through the pain, through the hate, yeah, I conquer it all

We will use an alternative to read.table(), the read.delim() function, to load the text file as an R dataframe. Other alternatives include read.csv(), read.csv2(), and read.delim2().

Note: By default, read.delim() separates values using a tab (sep = "\t").

Since the text file contains lyrics and has no header row, we need to specify header = FALSE so the first line is treated as data rather than as column names; each line of lyrics then becomes its own row in the dataframe.

Additionally, you can customize the dataframe using other parameters, such as the fill parameter, which adds blank fields to rows of unequal length.

# Import the lyrics file: one line of text per row, no header
# (quote = "" keeps apostrophes in the lyrics from being treated as quote characters)
df <- read.delim("lyrics.txt", header = FALSE, sep = "\n", quote = "")
head(df, 5)

# For semicolon-separated files, set the delimiter explicitly, e.g.:
# df <- read.delim("data.txt", sep = ";")
R lyrics output

Equivalently, with read.table() directly:

df <- read.table("lyrics.txt", sep = "\n", header = FALSE, quote = "", stringsAsFactors = FALSE)

Importing Excel Files (.xlsx, .xls)

Excel data in this example:

Movies Dataset Excel File
| Movie_ID | Title | Genre | Director | Release_Year | IMDb_Rating | Box_Office_Millions |
|---|---|---|---|---|---|---|
| 1 | Inception | Sci-Fi | Christopher Nolan | 2010 | 8.8 | 829 |
| 2 | The Matrix | Sci-Fi | Lana Wachowski | 1999 | 8.7 | 466 |
| 3 | Interstellar | Sci-Fi | Christopher Nolan | 2014 | 8.6 | 677 |
| 4 | The Dark Knight | Action | Christopher Nolan | 2008 | 9.0 | 1005 |
| 5 | Pulp Fiction | Crime | Quentin Tarantino | 1994 | 8.9 | 213 |
| 6 | Fight Club | Drama | David Fincher | 1999 | 8.8 | 101 |
| 7 | Forrest Gump | Drama | Robert Zemeckis | 1994 | 8.8 | 678 |
| 8 | The Shawshank Redemption | Drama | Frank Darabont | 1994 | 9.3 | 58 |
| 9 | The Godfather | Crime | Francis Ford Coppola | 1972 | 9.2 | 250 |
| 10 | Gladiator | Action | Ridley Scott | 2000 | 8.5 | 460 |

Excel files require the readxl package (which does not depend on Java).

library(readxl)

# Read Excel file
df <- read_excel("movies_dataset.xlsx")
head(df,5) 

# Specify sheet name or index
df <- read_excel("movies_dataset.xlsx", sheet = "Movies")
df <- read_excel("movies_dataset.xlsx", sheet = 1)
R Excel Movie Dataset Output

For writing data to an Excel file:

library(writexl)

write_xlsx(df, "output.xlsx")

For openxlsx (another popular package):

library(openxlsx)

df <- read.xlsx("movies_dataset.xlsx", sheet = 1)
head(df,5)

Importing JSON Files

JSON data in this example:

[
    {
        "Movie_ID":1,
        "Title":"Inception",
        "Genre":"Sci-Fi",
        "Director":"Christopher Nolan",
        "Release_Year":2010,
        "IMDb_Rating":8.8,
        "Box_Office_Millions":829
    },
    {
        "Movie_ID":2,
        "Title":"The Matrix",
        "Genre":"Sci-Fi",
        "Director":"Lana Wachowski",
        "Release_Year":1999,
        "IMDb_Rating":8.7,
        "Box_Office_Millions":466
    },
    {
        "Movie_ID":3,
        "Title":"Interstellar",
        "Genre":"Sci-Fi",
        "Director":"Christopher Nolan",
        "Release_Year":2014,
        "IMDb_Rating":8.6,
        "Box_Office_Millions":677
    },
    {
        "Movie_ID":4,
        "Title":"The Dark Knight",
        "Genre":"Action",
        "Director":"Christopher Nolan",
        "Release_Year":2008,
        "IMDb_Rating":9.0,
        "Box_Office_Millions":1005
    },
    {
        "Movie_ID":5,
        "Title":"Pulp Fiction",
        "Genre":"Crime",
        "Director":"Quentin Tarantino",
        "Release_Year":1994,
        "IMDb_Rating":8.9,
        "Box_Office_Millions":213
    },
    {
        "Movie_ID":6,
        "Title":"Fight Club",
        "Genre":"Drama",
        "Director":"David Fincher",
        "Release_Year":1999,
        "IMDb_Rating":8.8,
        "Box_Office_Millions":101
    },
    {
        "Movie_ID":7,
        "Title":"Forrest Gump",
        "Genre":"Drama",
        "Director":"Robert Zemeckis",
        "Release_Year":1994,
        "IMDb_Rating":8.8,
        "Box_Office_Millions":678
    },
    {
        "Movie_ID":8,
        "Title":"The Shawshank Redemption",
        "Genre":"Drama",
        "Director":"Frank Darabont",
        "Release_Year":1994,
        "IMDb_Rating":9.3,
        "Box_Office_Millions":58
    },
    {
        "Movie_ID":9,
        "Title":"The Godfather",
        "Genre":"Crime",
        "Director":"Francis Ford Coppola",
        "Release_Year":1972,
        "IMDb_Rating":9.2,
        "Box_Office_Millions":250
    },
    {
        "Movie_ID":10,
        "Title":"Gladiator",
        "Genre":"Action",
        "Director":"Ridley Scott",
        "Release_Year":2000,
        "IMDb_Rating":8.5,
        "Box_Office_Millions":460
    }
]

JSON files are widely used in web APIs. The jsonlite package helps read and parse JSON data into R.

library(jsonlite)

# Read JSON file
df <- fromJSON("movies_dataset.json")
print(df[2])  # second column (Title)

# Read JSON from a URL
df <- fromJSON("https://api.example.com/data.json")
Movies JSON dataset output in R - second column

To convert an R dataframe to JSON:

json_data <- toJSON(df, pretty=TRUE)
write(json_data, file = "output.json")
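Real-world API responses are often nested rather than flat like the example above. jsonlite can flatten nested records into regular columns with the flatten argument (the JSON string below is a made-up illustration):

```r
library(jsonlite)

# A nested JSON array: each movie carries a sub-object of ratings
nested <- '[{"Title":"Inception","Ratings":{"IMDb":8.8,"Metacritic":74}},
            {"Title":"The Matrix","Ratings":{"IMDb":8.7,"Metacritic":73}}]'

# flatten = TRUE turns Ratings.IMDb / Ratings.Metacritic into plain columns
df <- fromJSON(nested, flatten = TRUE)
str(df)
```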

Importing XML Files

XML data in this example:

<?xml version="1.0" ?>
<Movies>
  <Movie>
    <Movie_ID>1</Movie_ID>
    <Title>Inception</Title>
    <Genre>Sci-Fi</Genre>
    <Director>Christopher Nolan</Director>
    <Release_Year>2010</Release_Year>
    <IMDb_Rating>8.8</IMDb_Rating>
    <Box_Office_Millions>829</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>2</Movie_ID>
    <Title>The Matrix</Title>
    <Genre>Sci-Fi</Genre>
    <Director>Lana Wachowski</Director>
    <Release_Year>1999</Release_Year>
    <IMDb_Rating>8.7</IMDb_Rating>
    <Box_Office_Millions>466</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>3</Movie_ID>
    <Title>Interstellar</Title>
    <Genre>Sci-Fi</Genre>
    <Director>Christopher Nolan</Director>
    <Release_Year>2014</Release_Year>
    <IMDb_Rating>8.6</IMDb_Rating>
    <Box_Office_Millions>677</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>4</Movie_ID>
    <Title>The Dark Knight</Title>
    <Genre>Action</Genre>
    <Director>Christopher Nolan</Director>
    <Release_Year>2008</Release_Year>
    <IMDb_Rating>9.0</IMDb_Rating>
    <Box_Office_Millions>1005</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>5</Movie_ID>
    <Title>Pulp Fiction</Title>
    <Genre>Crime</Genre>
    <Director>Quentin Tarantino</Director>
    <Release_Year>1994</Release_Year>
    <IMDb_Rating>8.9</IMDb_Rating>
    <Box_Office_Millions>213</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>6</Movie_ID>
    <Title>Fight Club</Title>
    <Genre>Drama</Genre>
    <Director>David Fincher</Director>
    <Release_Year>1999</Release_Year>
    <IMDb_Rating>8.8</IMDb_Rating>
    <Box_Office_Millions>101</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>7</Movie_ID>
    <Title>Forrest Gump</Title>
    <Genre>Drama</Genre>
    <Director>Robert Zemeckis</Director>
    <Release_Year>1994</Release_Year>
    <IMDb_Rating>8.8</IMDb_Rating>
    <Box_Office_Millions>678</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>8</Movie_ID>
    <Title>The Shawshank Redemption</Title>
    <Genre>Drama</Genre>
    <Director>Frank Darabont</Director>
    <Release_Year>1994</Release_Year>
    <IMDb_Rating>9.3</IMDb_Rating>
    <Box_Office_Millions>58</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>9</Movie_ID>
    <Title>The Godfather</Title>
    <Genre>Crime</Genre>
    <Director>Francis Ford Coppola</Director>
    <Release_Year>1972</Release_Year>
    <IMDb_Rating>9.2</IMDb_Rating>
    <Box_Office_Millions>250</Box_Office_Millions>
  </Movie>
  <Movie>
    <Movie_ID>10</Movie_ID>
    <Title>Gladiator</Title>
    <Genre>Action</Genre>
    <Director>Ridley Scott</Director>
    <Release_Year>2000</Release_Year>
    <IMDb_Rating>8.5</IMDb_Rating>
    <Box_Office_Millions>460</Box_Office_Millions>
  </Movie>
</Movies>

For XML data, the xml2 package provides easy parsing.

library(xml2)

# Read XML file
xml_data <- read_xml("movies_dataset.xml")

# Extract specific nodes
nodes <- xml_find_all(xml_data, "//Movie")

# Convert XML data into a dataframe
df <- data.frame(
  Title = xml_text(xml_find_all(nodes, "Title")),
  Genre = xml_text(xml_find_all(nodes, "Genre")),
  Release_Year = as.numeric(xml_text(xml_find_all(nodes, "Release_Year")))
)
XML Movie Dataset Output in R

Importing an HTML Table from a URL

R provides several functions to read HTML tables directly from a webpage. The rvest and XML packages are commonly used for this purpose. Below, I’ll demonstrate how to extract the highest-grossing films table from the Wikipedia page.

Using rvest

The rvest package is a modern and user-friendly way to scrape web data in R.

# Install and load the rvest package
install.packages("rvest")
library(rvest)

# Define the URL
url <- "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"

# Read the HTML content
webpage <- read_html(url)

# Extract all tables from the page
tables <- html_table(webpage, fill = TRUE)

# Display the first few rows of the first table
head(tables[[1]],10)
HTML Table Data Import Output in R

Using XML

The XML package provides another way to extract tables from an HTML page.

# Install and load the XML package
install.packages("XML")
library(XML)
library(RCurl)

url <- getURL("https://en.wikipedia.org/wiki/List_of_highest-grossing_films")
tables <- readHTMLTable(url)
highest_grossing_films <- tables[[1]]  # use [[ ]] to extract the dataframe itself, not a one-element list

highest_grossing_films

Choosing the Right Table

Wikipedia pages often contain multiple tables. The extracted tables are stored in a list, so you may need to inspect them and select the one that contains the required data.

# Check how many tables were extracted
length(tables)

# View the structure of the first table
str(tables[[1]])

# Assign the correct table to a dataframe
highest_grossing_films <- tables[[1]]

This method allows you to efficiently scrape HTML tables from Wikipedia or other websites and store them in an R dataframe for further analysis.
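If you already know which table you need, rvest can also target it directly with a CSS selector instead of extracting every table on the page. The "wikitable" class used below is common on Wikipedia, but inspect the page to confirm it applies to your target table:

```r
library(rvest)

webpage <- read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films")

# Select only tables with the "wikitable" class and parse the first match
wikitables <- html_elements(webpage, "table.wikitable")
df <- html_table(wikitables[[1]])
head(df)
```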

Importing SQL Databases (MySQL, PostgreSQL, SQLite)

R provides database connections using DBI and specific database drivers.

SQLite Example

library(DBI)

# Connect to SQLite database
con <- dbConnect(RSQLite::SQLite(), "database.sqlite")

# List tables
dbListTables(con)

# Read table into R
df <- dbReadTable(con, "employees")

# Run custom SQL query
df <- dbGetQuery(con, "SELECT * FROM employees WHERE Age > 30")

# Disconnect
dbDisconnect(con)
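When a query depends on user input, prefer a parameterized query over pasting values into the SQL string; DBI supports placeholders through the params argument (the table and column names here follow the example above):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "database.sqlite")

# "?" is the SQLite placeholder; params supplies the value safely,
# avoiding SQL injection and quoting mistakes
df <- dbGetQuery(con,
                 "SELECT * FROM employees WHERE Age > ?",
                 params = list(30))

dbDisconnect(con)
```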

MySQL Example

library(RMySQL)

con <- dbConnect(MySQL(), dbname="your_db", host="localhost", user="root", password="password")

df <- dbGetQuery(con, "SELECT * FROM employees")
dbDisconnect(con)

PostgreSQL Example

library(RPostgres)

con <- dbConnect(RPostgres::Postgres(), dbname="your_db", host="localhost", user="postgres", password="password")

df <- dbGetQuery(con, "SELECT * FROM employees")
dbDisconnect(con)

Importing SPSS, Stata, and SAS Files

For statistical software formats like SPSS (.sav), Stata (.dta), and SAS (.sas7bdat), the haven and foreign packages are commonly used. You can download the .sav file I used for this example from here (it’s “Text Messages.sav”).

SPSS (.sav) Files

Using haven:

library(haven)

df <- read_sav("Text Messages.sav")
head(df,12)
SPSS data output after importing to R
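haven imports SPSS value labels as labelled vectors rather than plain R factors. If you prefer ordinary factors, convert the labelled columns explicitly (a sketch using dplyr; which columns are labelled depends on your .sav file):

```r
library(haven)
library(dplyr)

df <- read_sav("Text Messages.sav")

# Convert every labelled column to a factor using its stored value labels
df_factors <- df %>% mutate(across(where(is.labelled), as_factor))
str(df_factors)
```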

Using foreign:

library(foreign)

df <- read.spss("Text Messages.sav", to.data.frame = TRUE)
head(df,12)
SPSS data output after importing to R

Stata (.dta) Files

For this example, I used the airline.dta from here.

library(haven)
df <- read_dta("airline.dta")  # Using haven
head(df,10)
Import Stata File to R output
library(foreign)
df <- read.dta("airline.dta")  # Using foreign
head(df,10)
Import Stata File to R output with foreign

SAS (.sas7bdat) Files

For this example, I used the AP_VOTECAST_2018_DATA.sas7bdat from here (you must create an account to access it).

library(haven)
df <- read_sas("AP_VOTECAST_2018_DATA.sas7bdat")  # Using haven
head(df,10)
Import Sas7bdat file to R output

XPT (.xpt) files:

library(haven)
df <- read_xpt("data.xpt")

Importing MATLAB Files (.mat)

MATLAB files can be imported using the R.matlab package.

library(R.matlab)

# Read Matlab file
data <- readMat("movies_dataset.mat")

# Extract variables
df <- as.data.frame(data$Title)
df
Import MatLab file into R output

Importing Data from Google Sheets

Google Sheets data can be imported using the googlesheets4 package. If your Google Sheet is publicly accessible (or shared with “Anyone with the link”), you can directly use something like this:

library(googlesheets4)

sheet_url <- "https://docs.google.com/spreadsheets/d/1B-ZN2cgyKj-2R3I5EYhdLL3hkLlFRmf_jbyiMLaVHkg/edit?usp=sharing"

# Authorize and read Google Sheets
df <- read_sheet(sheet_url, sheet = "Music Codes")
df
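For a public sheet you can also tell googlesheets4 not to look for credentials at all; otherwise it may still prompt you to authenticate on first use:

```r
library(googlesheets4)

# Skip authentication entirely for publicly readable sheets
gs4_deauth()

df <- read_sheet("https://docs.google.com/spreadsheets/d/1B-ZN2cgyKj-2R3I5EYhdLL3hkLlFRmf_jbyiMLaVHkg/edit?usp=sharing")
```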

You can also read a range of cells:

library(googlesheets4)

sheet_url <- "https://docs.google.com/spreadsheets/d/1B-ZN2cgyKj-2R3I5EYhdLL3hkLlFRmf_jbyiMLaVHkg/edit?usp=sharing"

# Authorize and read Google Sheets
df <- read_sheet(sheet_url, sheet = "Music Codes", range = "A1:B20")  # Reads only cells from A1 to B20
df

If the Google Sheet is private, you’ll need to authenticate with your Google account:

library(googlesheets4)

sheet_url <- "https://docs.google.com/spreadsheets/d/1B-ZN2cgyKj-2R3I5EYhdLL3hkLlFRmf_jbyiMLaVHkg/edit?usp=sharing"

# Authenticate with Google
gs4_auth()

# Read the private sheet
df <- read_sheet(sheet_url)
df

Importing Financial Data with Quantmod

The quantmod package is a powerful financial modeling and trading framework for R. We will use it to download and load the latest trading data into a dataframe.

You can retrieve historical stock prices for Apple (AAPL) from Yahoo Finance using the getSymbols() function.

library(quantmod)

# Get Apple stock data from Yahoo Finance
getSymbols("AAPL", src = "yahoo")

# View the first few rows
head(AAPL)
Import Apple stock data to R output

To retrieve more specific historical stock data for Apple (AAPL), we will use the getSymbols() function, specifying the “from” and “to” dates along with the desired data frequency.

library(quantmod)

getSymbols("AAPL", src="yahoo", from = "2025/2/1", to = "2025/2/13", periodicity = "daily")

head(AAPL,5)
Import AAPL stock data in specific range with Quantmod output
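Once the data is loaded, quantmod’s helper functions make basic analysis straightforward; for example, computing daily returns from the closing prices and drawing a quick chart:

```r
library(quantmod)

getSymbols("AAPL", src = "yahoo")

# Cl() extracts the closing-price column; dailyReturn() computes returns from it
returns <- dailyReturn(Cl(AAPL))
head(returns)

# Plot the price series (opens a chart window)
chartSeries(AAPL, subset = "last 3 months")
```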

Importing Large Datasets into R Efficiently

When working with large datasets in R, selecting the right function can significantly impact loading time and memory usage. Below are various approaches to importing large datasets efficiently.

In this section, we look at popular functions for loading CSV files larger than 1 GB. We are using the US Accidents (2016 – 2023) dataset from Kaggle, which is about 3.06 GB in size and has 7,728,394 records.

Using read.table() with a ZIP File

Instead of extracting the ZIP file manually, we can use the unz() function to load the CSV directly from the archive. This saves time and disk space.

# Load data directly from a ZIP file
file <- unz("US Accidents.zip", "US_Accidents_March23.csv")

# Read the first 700 rows with an explicit comma separator
df <- read.table(file, header = TRUE, sep = ",", nrows = 700)

💡 Tip: By default, read.table() assumes whitespace-separated values, so always specify sep="," for CSV files.

Note: I limited the number of rows to 700 because read.table() requires a consistent tabular structure, which this dataset breaks at row 732.

Read Table on Zip File Output

Using read_csv() from readr

The readr package in R provides the read_csv() function, which supports multi-threading for faster data loading. By default, read_csv() automatically detects the number of available processor cores and uses multiple threads to read large CSV files efficiently. It uses n_max (rather than nrows) to limit the number of records read.

library(readr)

file <- unz("US Accidents.zip", "US_Accidents_March23.csv")
# Read the full dataset (can take ~1 minute)
df <- read_csv(file)

# To load only 10,000 rows:
df <- read_csv(file, n_max = 10000)

💡 Tip: To speed up loading, adjust the number of threads. Refer to the function documentation for details.

library(readr)

# Unzip the file
unzip("US Accidents.zip")
# Read the full dataset (can take ~1 minute)
df <- read_csv("US_Accidents_March23.csv", num_threads = parallel::detectCores(), n_max = 10000)
First 10000 rows with read_csv

Using ff for Memory-Efficient Loading

The ff package loads large datasets in chunks, reducing memory usage and improving performance.

library(ff)


# Load data using read.table.ffdf() for large datasets
df <- read.table.ffdf(file = "US_Accidents_March23.csv", nrows = 10000, header = TRUE, sep = ",")
                          
# Preview first 5 rows and first 10 columns
df[1:5, 1:10]

💡 Tip: The ff package is ideal for handling datasets that exceed system memory.

ff output 5 rows 10 columns

Using fread() from data.table

The fread() function from the data.table package is the fastest option for reading large datasets. It automatically detects file format and column types, requiring minimal manual configuration.

library(data.table)

# Load the first 10,000 rows efficiently
df <- fread("US_Accidents_March23.csv", sep = ",", nrows = 10000, na.strings = c("NA", "N/A", ""), stringsAsFactors = FALSE)

💡 Tip: fread() is significantly faster than read.csv() and can automatically determine the file structure.

Large Dataset Reading CheatSheet

| Method | Best For | Pros | Cons |
|---|---|---|---|
| read.table() | Standard datasets | Works with ZIP files | Slow for large datasets |
| read_csv() | Large CSV files | Faster than read.table() | Requires readr package |
| ff | Extremely large datasets | Efficient memory usage | More complex syntax |
| fread() | Best overall performance | Very fast, auto-detects format | Requires data.table package |

For small to medium datasets, read_csv() works well. For very large datasets, fread() is the best choice due to its speed and efficiency.
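To see the difference on your own machine, time each reader on the same file with system.time(); the numbers below will vary with hardware, and the file path follows the examples above (run after extracting the ZIP):

```r
library(readr)
library(data.table)

# Compare wall-clock time of the three readers on the same 100,000-row slice
system.time(df1 <- read.csv("US_Accidents_March23.csv", nrows = 100000))
system.time(df2 <- read_csv("US_Accidents_March23.csv", n_max = 100000))
system.time(df3 <- fread("US_Accidents_March23.csv", nrows = 100000))
```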

Conclusion

This guide covers the most common ways to import data into R, from simple CSV files to complex SQL databases and statistical software formats. Mastering these techniques will allow you to work seamlessly with data from multiple sources, making your analysis more efficient and powerful.

By leveraging R’s extensive package ecosystem, you can automate data imports, clean and manipulate data efficiently, and focus on analysis rather than struggling with file compatibility issues.

FAQ

1. What types of data files can I import into R?

R supports a wide range of data formats, including:

  • CSV files (.csv)
  • Excel files (.xlsx, .xls)
  • Text files (.txt, .dat)
  • R Data files (.rds, .RData)
  • JSON files (.json)
  • XML files (.xml)
  • SPSS, Stata, and SAS files (.sav, .dta, .sas7bdat)
  • SQL Databases

2. How do I import a CSV file into R?

You can use the read.csv() function:

data <- read.csv("path/to/your/file.csv", header = TRUE, stringsAsFactors = FALSE)

Alternatively, you can use readr::read_csv() for faster performance:

library(readr)
data <- read_csv("path/to/your/file.csv")

3. How do I import an Excel file into R?

Use the readxl package:

library(readxl)
data <- read_excel("path/to/your/file.xlsx", sheet = 1)

For writing Excel files, use the writexl package:

library(writexl)
write_xlsx(data, "output.xlsx")

4. How do I import a text file into R?

For space- or tab-delimited text files, use:

data <- read.table("path/to/your/file.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

For more flexibility, use readr::read_delim():

data <- read_delim("path/to/your/file.txt", delim = "\t")

5. How do I import JSON data into R?

Use the jsonlite package:

library(jsonlite)
data <- fromJSON("path/to/your/file.json")

6. How do I import data from an SQL database?

Use the DBI package for database connections:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "database.sqlite")
data <- dbGetQuery(con, "SELECT * FROM table_name")
dbDisconnect(con)

7. How do I load R-specific data files (.rds, .RData)?

For .rds files:

data <- readRDS("path/to/your/file.rds")

For .RData files:

load("path/to/your/file.RData") # Loads all objects stored in the file

8. How do I check if my data was imported correctly?

After importing, use these functions to inspect the data:

head(data)       # View first few rows
str(data)        # Check structure and data types
summary(data)    # Get summary statistics
dim(data)        # Get dimensions (rows & columns)

9. What should I do if my file is not found or there’s an error?

  • Check the file path – Ensure it’s correct and properly formatted (use file.choose() to select a file interactively).
  • Ensure the file exists – Run file.exists("path/to/file") to check.
  • Check delimiters – If data isn’t structured correctly, specify the separator explicitly (sep="," for commas, sep="\t" for tabs).
  • Handle encoding issues – Use encoding="UTF-8" if there are special characters causing problems.
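You can also wrap an import in tryCatch() so a bad file produces a clear warning instead of stopping a whole script (a minimal sketch; the path and helper name are placeholders):

```r
# Hypothetical helper: returns a dataframe, or NULL with a warning on failure
safe_read <- function(path) {
  if (!file.exists(path)) {
    warning("File not found: ", path)
    return(NULL)
  }
  tryCatch(
    read.csv(path, stringsAsFactors = FALSE),
    error = function(e) {
      warning("Failed to read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

df <- safe_read("path/to/your/file.csv")
```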

10. How do I handle missing values after importing data?

Use functions like:

sum(is.na(data))        # Count missing values
data_clean <- na.omit(data)  # Remove rows with missing values
data[is.na(data)] <- 0  # Replace missing values with 0