Web Scraping with Rails: User-Submitted Link

I am trying to create a Rails app that will scrape a URL submitted in a form field. The form will only ever receive URLs from one particular site, and every page on that site has the same structure, so the scraper doesn’t need to be flexible enough to handle several sites (e.g. both Forbes and Bloomberg). I would also like to save the scraped text to the database along with the link.

I’ve found many examples of how to do this with a hard-coded URL, but I’m having trouble getting the app to accept the submitted link, grab the article text, put it in the database, and show the result.

Here’s what I have so far in my controller:

class LinksController < ApplicationController
  before_action :set_link, only: %i[ show edit update destroy ]

  def index
    @links = Link.all
  end

  def show
  end

  def new
    @link = Link.new
  end

  def edit
  end

  def create
    @link = Link.new(link_params)
    require 'open-uri'
    page = Nokogiri::HTML(open(@link))
    @article_text = page.css("div{itemprop='articleBody'}").text

    respond_to do |format|
      if @link.save
        format.html { redirect_to @link, notice: "Link was successfully created." }
        format.json { render :show, status: :created, location: @link }
      else
        format.html { render :new, status: :unprocessable_entity }
        format.json { render json: @link.errors, status: :unprocessable_entity }
      end
    end
  end

  def update
    respond_to do |format|
      if @link.update(link_params)
        format.html { redirect_to @link, notice: "Link was successfully updated." }
        format.json { render :show, status: :ok, location: @link }
      else
        format.html { render :edit, status: :unprocessable_entity }
        format.json { render json: @link.errors, status: :unprocessable_entity }
      end
    end
  end

  def destroy
    @link.destroy
    respond_to do |format|
      format.html { redirect_to links_url, notice: "Link was successfully destroyed." }
      format.json { head :no_content }
    end
  end

  private

  def set_link
    @link = Link.find(params[:id])
  end

  def link_params
    params.require(:link).permit(:web_address, :article_text)
  end
end

If I need to provide any other code, please let me know and I will update the question.

Thanks!

If you have a model named Link, does it have an attribute (say, url) to hold the address? If so, your create method could look like this:

@link = Link.new(link_params)
require 'open-uri'
page = Nokogiri::HTML(open(@link.url))
@link.article_text = page.css("div[itemprop='articleBody']").text
...

This way you accept the URL and assign it to the Link instance, then use it in the create method to scrape the page and assign the text to the article_text attribute before saving @link.

Walter

Hi @walterdavis,

Thanks for the help so far!

I have done what you have suggested, but when I submit the link I get the following error:

Invalid argument @ rb_sysopen on the line page = Nokogiri::HTML(open(@link.web_address))

Any thoughts?

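This error means that open treated its argument as a local file path rather than a URL. First check that @link actually has a valid URL in the web_address attribute: print that value out, look at it, and see what it is missing (a blank value or a missing http:// or https:// prefix would do it). Also note that on Ruby 3.0 and later, open-uri no longer patches Kernel#open, so a bare open(...) always treats its argument as a file path; call URI.open(@link.web_address) instead.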
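
One way to inspect the value before fetching is a small check like the sketch below. It uses only Ruby's standard uri library; fetchable_url? is a hypothetical helper name and the candidate strings are invented examples, not values from the app.

```ruby
require 'uri'

# Hypothetical helper: true only for absolute http(s) URLs that have a host.
def fetchable_url?(value)
  uri = URI.parse(value.to_s.strip)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # Strings that cannot be parsed as a URI at all (e.g. containing spaces).
  false
end

# Invented sample inputs: a full URL, a URL without a scheme, blank, and nil.
["https://example.com/article", "example.com/article", "", nil].each do |candidate|
  puts "#{candidate.inspect} -> #{fetchable_url?(candidate)}"
end
```

A check like this could run in the controller before the fetch, or back a model validation on web_address, so that a bad submission re-renders the form instead of raising.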

Walter