Occasionally you’ll run into a data set that you’d like to collect that doesn’t come with an associated API. In a situation like that, you can use a gem called Nokogiri as well as Open-URI to scrape the website and parse the whole HTML doc for the information that you need.
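Of the three libraries used below, only Nokogiri is a third-party gem; open-uri and json ship with Ruby's standard library. A `gem install nokogiri` is enough, or in a Bundler project the Gemfile entry might look like this (a sketch, assuming you're using Bundler):

```ruby
# Gemfile
source 'https://rubygems.org'

gem 'nokogiri' # the only third-party dependency; open-uri and json are stdlib
```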
Once you’ve installed the required gems from above, you’ll also need to require them in your scraper model.
require 'open-uri'
require 'nokogiri'
require 'json'
Next, fetch the page with Open-URI and hand the HTML to Nokogiri to parse. I’d also suggest saving the result in a variable so that you can process it further in later steps.
doc = Nokogiri::HTML(URI.open("https://your-url-goes-here.com"))
If you were to print the doc variable from the step above, you’d see that the result is pretty much the whole unedited webpage. So where do we go from here? You’re most likely looking for a specific data set from that page. Fortunately, Nokogiri provides a number of ways to parse through the data and select exactly what you’d like.
First we need to tell Nokogiri that we’re going to search for our data by CSS selector. For instance, if you had a table on the webpage that you wanted to extract information from, you’d do it like this.
table = doc.css('table')
Now we’ve stored only the table portion of the website we’re scraping. Next, you’d write a method in your scraper that iterates through all of the table rows and collects the data you want to store. Nokogiri’s .search method allows you to do just that.
table.search('tr').each do |tr|
  cells = tr.css('td')
  data1 = cells[0].text unless cells[0].nil?
  data2 = cells[1].text unless cells[1].nil?
end
That’s just a basic example. From here, you can push the extracted elements into an array and use it to create your database entries. The Nokogiri documentation provides much more information on how to parse HTML documents to get the information you’re looking for when an API is not available.