AW: Re: [Rails] browser simulator independent of web framework

Luma1 · June 18, 2017, 11:21am

I’m extracting content from some websites. Currently I parse the HTML with Nokogiri, but the relevant content is not contained in the response body of the HTTP GET request. The pages run JavaScript such as $(window).load() or $(document).ready(), which sends Ajax requests and fills in the original HTML afterwards.

So I’m looking for a library that executes JavaScript and Ajax requests automatically, just like a normal browser.

Martin

Colin_Law · June 18, 2017, 12:36pm

Understood. I don't think I can help, I'm afraid. Does the site not work with JS disabled in the browser?

Colin

Luma1 · June 18, 2017, 9:24pm

Unfortunately they rely completely on JS; nothing works without it.

Is there a tool that comes close to my use case, or a testing tool that I could use for this purpose without writing tests?

Martin

Jason_FB · June 19, 2017, 2:18pm

I think he’s scraping someone else’s site.

You obviously can’t do this with Ruby alone, as there is no headless web browser written entirely in Ruby (that would be nonsense).

If you can get phantomjs working on your production site, that’s probably the way to go. Look deep into the internals of Capybara to understand how it drives phantomjs. With phantomjs, you basically have a headless web browser and you can use Capybara’s DSL to access parts of the page, including evaluating scripts and parsing the DOM.

Just keep in mind phantomjs is an actual executable so it needs to be compiled and built for your production environment explicitly, which might be a little tricky depending on where your site is.
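To make that concrete, here is a rough sketch of using Capybara outside of a test suite to get the HTML after the page's scripts have run. It assumes the capybara and poltergeist gems are installed and a phantomjs binary is on the PATH. The helper names (build_poltergeist_session, rendered_html), the URL, and the "#content" selector are made up for illustration, so adapt them to the site in question.

```ruby
# Sketch only: assumes the capybara and poltergeist gems plus a phantomjs
# binary on the PATH. Helper names, URL and selector are hypothetical.

# Builds a standalone Capybara session (no Rails or Rack app required).
def build_poltergeist_session
  require "capybara"
  require "capybara/poltergeist" # registers the :poltergeist driver
  Capybara::Session.new(:poltergeist)
end

# Visits +url+, waits until +selector+ shows up (i.e. the Ajax requests
# have filled in the page), and returns the rendered HTML as a String.
# +session+ can be passed in, which allows a stand-in object for dry runs.
def rendered_html(url, selector, session: nil)
  owned = session.nil?
  session ||= build_poltergeist_session
  session.visit(url)
  # has_css? keeps retrying until Capybara.default_max_wait_time elapses
  unless session.has_css?(selector)
    raise "#{selector} never appeared on #{url}"
  end
  session.html
ensure
  # Shut down the phantomjs process only if this method started it.
  session.driver.quit if owned && session
end

# Possible usage, handing the result to Nokogiri as before:
#   require "nokogiri"
#   html = rendered_html("https://example.com/", "#content")
#   puts Nokogiri::HTML(html).css("#content").text
```

The returned string is ordinary HTML, so the existing Nokogiri code should be able to stay largely as it is; only the fetching step changes.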

But a little birdie told me a few months ago that the phantomjs team has decided that once Chrome has a headless mode, which I believe is forthcoming, they plan to abandon phantomjs in favor of Chrome’s headless mode. Not sure if that’s really true or when that will happen.

-Jason

walterdavis · June 19, 2017, 9:31pm

Also look at Mechanize. It's purely a scraper, so it is less test-centric, though as far as I know it does not execute JavaScript itself.

Walter

Luma1 · June 22, 2017, 9:40pm

Thanks guys… I’m trying Capybara with Poltergeist / phantomjs, following the hints from the teamcapybara/capybara repository on GitHub. I’ll post my experiences here again.

Martin