How to Read Data From a Website in R

Tutorial: Web Scraping in R with rvest

We'll do some web scraping in R with rvest to gather data about the weather.

The internet is ripe with data sets that you can use for your own personal projects. Sometimes you're lucky and you'll have access to an API where you can directly request the data with R. Other times, you won't be so lucky, and you won't be able to get your data in a neat format. When this happens, we need to turn to web scraping, a technique where we get the data we want to analyze by finding it in a website's HTML code.

In this tutorial, we'll cover the basics of how to do web scraping in R. We'll be scraping data on weather forecasts from the National Weather Service website and converting it into a usable format.

Web scraping opens up opportunities and gives us the tools needed to actually create data sets when we can't find the data we're looking for. And since we're using R to do the web scraping, we can simply run our code again to get an updated data set if the sites we use get updated.

Understanding a web page

Before we can start learning how to scrape a web page, we need to understand how a web page itself is structured.

From a user perspective, a web page has text, images, and links all organized in a way that is aesthetically pleasing and easy to read. But the web page itself is written in specific coding languages that are then interpreted by our web browsers. When we're web scraping, we'll need to deal with the actual contents of the web page itself: the code before it's interpreted by the browser.

The main languages used to build web pages are called Hypertext Markup Language (HTML), Cascading Style Sheets (CSS), and JavaScript. HTML gives a web page its actual structure and content. CSS gives a web page its style and look, including details like fonts and colors. JavaScript gives a web page functionality.

In this tutorial, we'll focus mostly on how to use R web scraping to read the HTML and CSS that make up a web page.

HTML

Unlike R, HTML is not a programming language. Instead, it's called a markup language: it describes the content and structure of a web page. HTML is organized using tags, which are surrounded by <> symbols. Different tags perform different functions. Together, many tags form and contain the content of a web page.

The simplest HTML document looks like this:

            <html>
            </html>

Although the above is a legitimate HTML document, it has no text or other content. If we were to save that as a .html file and open it using a web browser, we would see a blank page.

Notice that the word html is surrounded by <> brackets, which indicates that it is a tag. To add some more structure and text to this HTML document, we could add the following:

            <html>
            <head>
            </head>
            <body>
            <p>
            Here's a paragraph of text!
            </p>
            <p>
            Here's a second paragraph of text!
            </p>
            </body>
            </html>

Here we've added <head> and <body> tags, which add more structure to the document. The <p> tags are what we use in HTML to designate paragraph text.

There are many, many tags in HTML, but we won't be able to cover all of them in this tutorial. If interested, you can check out this site. The important takeaway is to know that tags have particular names (html, body, p, etc.) to make them identifiable in an HTML document.

Notice that each of the tags is "paired", in the sense that each one is accompanied by another with a similar name. That is to say, the opening <html> tag is paired with another tag, </html>, that indicates the beginning and end of the HTML document. The same applies to <body> and <p>.

This is important to recognize, because it allows tags to be nested within each other. The <body> and <head> tags are nested within <html>, and <p> is nested within <body>. This nesting gives HTML a "tree-like" structure.

This tree-like structure will inform how we look for certain tags when we're using R for web scraping, so it's important to keep it in mind. If a tag has other tags nested within it, we refer to the containing tag as the parent and each of the tags within it as the "children". If there is more than one child in a parent, the child tags are collectively referred to as "siblings". These notions of parent, child, and siblings give us an idea of the hierarchy of the tags.

CSS

Whereas HTML provides the content and structure of a web page, CSS provides information about how a web page should be styled. Without CSS, a web page is dreadfully plain. Here's a simple HTML document without CSS that demonstrates this.

When we say styling, we are referring to a wide range of things. Styling can refer to the color of particular HTML elements or their positioning. Like HTML, the scope of CSS is so big that we can't cover every possible concept in the language. If you're interested, you can learn more here.

Two concepts we do need to learn before we delve into the R web scraping code are classes and ids.

First, let's talk about classes. If we were making a website, there would often be times when we'd want similar elements of the website to look the same. For example, we might want a number of items in a list to all appear in the same color, red.

We could accomplish that by directly inserting some CSS that contains the color information into each line of text's HTML tag, like so:

            <p style="colour:cherry" >Text 1</p> <p style="color:ruby-red" >Text 2</p> <p manner="color:red" >Text 3</p>          

The style text indicates that we are trying to apply CSS to the <p> tags. Inside the quotes, we see a key-value pair "color:red". color refers to the color of the text in the <p> tags, while red describes what the color should be.

But as we can see above, we've repeated this key-value pair multiple times. That's not ideal: if we wanted to change the color of that text, we'd have to change each line one by one.

Instead of repeating this style text in all of these <p> tags, we can replace it with a class selector:

            <p course="crimson-text" >Text 1</p> <p class="cerise-text" >Text 2</p> <p class="ruddy-text" >Text 3</p>          

With the class selector, we can better indicate that these <p> tags are related in some way. In a separate CSS file, we can create the red-text class and define how it looks by writing:

            .red-text { color: red; }

Combining these two elements into a single web page will produce the same effect as the first set of red <p> tags, but it allows us to make quick changes more easily.

In this tutorial, of course, we're interested in web scraping, not building a web page. But when we're web scraping, we'll often need to select a specific class of HTML tags, so we need to understand the basics of how CSS classes work.

Similarly, we may often want to scrape specific data that's identified using an id. CSS ids are used to give a single element an identifiable name, much like how a class helps define a class of elements.

            <p id="special" >This is a special tag.</p>          

If an id is attached to an HTML tag, it makes it easier for us to identify this tag when we are performing our actual web scraping with R.

Don't worry if you don't quite understand classes and ids yet; it'll become clearer when we start manipulating the code.

There are several R libraries designed to take HTML and CSS and traverse them to look for particular tags. The library we'll use in this tutorial is rvest.

The rvest library

The rvest library, maintained by the legendary Hadley Wickham, is a library that lets users easily scrape ("harvest") data from web pages.

rvest is one of the tidyverse libraries, so it works well with the other libraries contained in the package. rvest takes inspiration from the web scraping library BeautifulSoup, which comes from Python. (Related: our BeautifulSoup Python tutorial.)

Scraping a web page in R

In order to use the rvest library, we first need to install it and import it with the library() function.

            install.packages("rvest")          
            library(rvest)          

In order to start parsing through a web page, we first need to request that data from the computer server that contains it. In rvest, the function that serves this purpose is the read_html() function.

read_html() takes in a web URL as an argument. Let's start by looking at that simple, CSS-less page from before to see how the function works.

            simple <- read_html("https://dataquestio.github.io/web-scraping-pages/simple.html")

The read_html() function returns a list object that contains the tree-like structure we discussed earlier.

            simple
            {html_document}
            <html>
            [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>A simple exa ...
            [2] <body>\n        <p>Here is some simple content for this page.</p>\n    </body>
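As a quick aside, rvest also lets us walk this tree directly, which ties back to the parent/child terminology from earlier. The following is a minimal sketch of our own (not part of the original walkthrough) using the rvest functions html_children() and html_name():

            # A minimal sketch: listing the child nodes of the parsed document.
            # html_children() returns the children of a node (here, <head> and <body>),
            # and html_name() gives each node's tag name.
            library(rvest)

            simple <- read_html("https://dataquestio.github.io/web-scraping-pages/simple.html")

            simple %>% html_children() %>% html_name()   # e.g. "head" "body"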

Let's say that we wanted to store the text contained in the single <p> tag in a variable. In order to access this text, we need to figure out how to target this particular piece of text. This is typically where CSS classes and ids can help us out, since good developers will typically make the CSS highly specific on their sites.

In this case, we have no such CSS, but we do know that the <p> tag we want to access is the only one of its kind on the page. In order to capture the text, we need to use the html_nodes() and html_text() functions respectively to search for this <p> tag and retrieve the text. The code below does this:

            simple %>% html_nodes("p") %>% html_text()
            "Here is some uncomplicated content for this page."          

The simple variable already contains the HTML we are trying to scrape, so that just leaves the task of searching for the elements that we want from it. Since we're working with the tidyverse, we can just pipe the HTML into the different functions.

We need to pass specific HTML tags or CSS classes into the html_nodes() function. We need the <p> tag, so we pass a character "p" into the function. html_nodes() also returns a list, but it returns all of the nodes in the HTML that have the particular HTML tag or CSS class/id that you gave it. A node refers to a point in the tree-like structure.
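To make the selector options concrete, here is a small sketch of our own that reuses the red-text class and the special id from the CSS section above. The HTML string is assembled purely for illustration; html_nodes() accepts a bare tag name, a .class selector, or an #id selector:

            # A minimal sketch of the selector styles html_nodes() understands.
            # The HTML string below reuses the class and id examples from the CSS section.
            library(rvest)

            page <- read_html('
              <html><body>
                <p class="red-text">Text 1</p>
                <p class="red-text">Text 2</p>
                <p id="special">This is a special tag.</p>
              </body></html>')

            page %>% html_nodes("p") %>% html_text()          # select by tag name
            page %>% html_nodes(".red-text") %>% html_text()  # select by CSS class
            page %>% html_nodes("#special") %>% html_text()   # select by CSS id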

Once we have all of these nodes, we can pass the output of html_nodes() into the html_text() function. We need to get the actual text of the <p> tag, so this function helps out with that.

These functions together form the bulk of many common web scraping tasks. In general, web scraping in R (or in any other language) boils down to the following three steps:

  • Get the HTML for the web page that you want to scrape
  • Decide what part of the page you want to read and find out what HTML/CSS you need to select it
  • Select the HTML and analyze it in the way you need (a short recap sketch follows this list)
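As a recap, here is what those three steps look like when applied to the simple demo page from earlier; this is just a consolidated sketch of the code we have already run:

            # The three web scraping steps, applied to the demo page used above.
            library(rvest)

            # 1. Get the HTML for the web page you want to scrape
            page <- read_html("https://dataquestio.github.io/web-scraping-pages/simple.html")

            # 2. Decide what part of the page you want and what HTML/CSS selects it (the <p> tag)
            paragraphs <- page %>% html_nodes("p")

            # 3. Select the HTML and analyze it the way you need (extract the text)
            paragraphs %>% html_text()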

The target web page

For this tutorial, we'll be looking at the National Weather Service website. Let's say that we're interested in creating our own weather app. We'll need the weather data itself to populate it.

Weather data is updated every day, so we'll use web scraping to get this data from the NWS website whenever we need it.

For our purposes, we'll take data from San Francisco, but each city's web page looks the same, so the same steps would work for any other city. A screenshot of the San Francisco page is shown below:

We're specifically interested in the weather predictions and the temperatures for each day. Each day has both a day forecast and a night forecast. Now that we've identified the part of the web page that we need, we can dig through the HTML to see what tags or classes we need to select to capture this particular data.

Using Chrome DevTools

Thankfully, most modern browsers have a tool that allows users to directly inspect the HTML and CSS of any web page. In Google Chrome and Firefox, they're referred to as Developer Tools, and they have similar names in other browsers. The specific tool that will be the most useful to us for this tutorial is the Inspector.

You can find the Developer Tools by looking at the upper right corner of your browser. You should be able to see "Developer Tools" if you're using Firefox, and if you're using Chrome, you can go through View -> More Tools -> Developer Tools. This will open up the Developer Tools right in your browser window:

The HTML we dealt with before was bare-bones, but most web pages you'll see in your browser are overwhelmingly complex. Developer Tools will make it easier for us to pick out the exact elements of the web page that we want to scrape and inspect the HTML.

We need to see where the temperatures are in the weather page's HTML, so we'll use the Inspect tool to look at these elements. The Inspect tool will pick out the exact HTML that we're looking for, so we don't have to look ourselves!

By clicking on the elements themselves, we can see that the seven-day forecast is contained in the following HTML. We've condensed some of it to make it more readable:

            <div id="vii-twenty-four hours-forecast-container"> <ul id="seven-24-hour interval-forecast-list" class="listing-unstyled"> <li class="forecast-tombstone"> <div class="tombstone-container"> <p grade="menstruation-proper noun">Tonight<br><br></p> <p><img src="newimages/medium/nskc.png" alt="Tonight: Clear, with a depression effectually l. Calm current of air. " title="Tonight: Articulate, with a depression around 50. Calm wind. " course="forecast-icon"></p> <p class="brusk-desc" style="acme: 54px;">Clear</p> <p class="temp temp-low">Low: 50 °F</p></div> </li> # More elements like the i in a higher place follow, one for each solar day and night </ul> </div>          

Using what we've learned

Now that we've identified what particular HTML and CSS we need to target in the web page, we can use rvest to capture it.

From the HTML above, it seems like each of the temperatures is contained in the class temp. Once we have all of these tags, we can extract the text from them.

            forecasts <- read_html("https://forecast.weather.gov/MapClick.php?lat=37.7771&lon=-122.4196#.Xl0j6BNKhTY") %>%
                html_nodes(".temp") %>%
                html_text()

            forecasts
            [1] "Depression: 51 °F" "High: 69 °F" "Low: 49 °F" "High: 69 °F" [5] "Low: 51 °F" "Loftier: 65 °F" "Low: 51 °F" "High: 60 °F" [9] "Low: 47 °F"          

With this code, forecasts is now a vector of strings corresponding to the low and high temperatures.

Now that we have the actual data we're interested in stored in an R variable, we just need to do some regular data analysis to get the vector into the format we need. For example:

            library(readr)
            parse_number(forecasts)
            [1] 51 69 49 69 51 65 51 60 47
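As one possible extension (a sketch of our own, not part of the original walkthrough), the HTML we inspected above also exposes period-name and short-desc classes, so we could scrape those too and assemble everything into a small table. This assumes each forecast period yields exactly one name, one description, and one temperature, which may not hold if the page layout changes:

            # A hedged sketch: combine the forecast pieces into one tidy table.
            # The .period-name and .short-desc classes come from the HTML inspected above;
            # this assumes every period yields one of each, so all columns have equal length.
            library(rvest)
            library(readr)
            library(tibble)

            page <- read_html("https://forecast.weather.gov/MapClick.php?lat=37.7771&lon=-122.4196#.Xl0j6BNKhTY")

            forecast_table <- tibble(
              period      = page %>% html_nodes(".period-name") %>% html_text(),
              description = page %>% html_nodes(".short-desc") %>% html_text(),
              temperature = page %>% html_nodes(".temp") %>% html_text() %>% parse_number()
            )

            forecast_table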

Next steps

The rvest library makes it easy and convenient to perform web scraping using the same techniques we would use with the tidyverse libraries.

This tutorial should give you the tools necessary to start a small web scraping project and begin exploring more advanced web scraping procedures. Some sites that are extremely compatible with web scraping are sports sites, sites with stock prices, or even news articles.

Alternatively, you could continue to expand on this project. What other elements of the forecast could you scrape for your weather app?

Ready to level up your R skills?

Our Data Analyst in R path covers all the skills you need to land a job, including:

  • Data visualization with ggplot2
  • Advanced data cleaning skills with tidyverse packages
  • Important SQL skills for R users
  • Fundamentals in statistics and probability
  • ...and much more

There's nothing to install, no prerequisites, and no schedule.
