I’m not very good with the consoles in Chrome and Firefox, but I couldn’t find the text I was looking for in the page source, even though it is seemingly displayed as text (the cursor changes to a vertical line on mouse-over). I found the HTML below in the source. How does this HTML create the text that displays?
I should think that JavaScript is involved. I am sure you asked a
similar question before when you were trying to scrape a website and
couldn't find the text in the HTML.
React components run client-side, meaning the text you are looking for is inserted into the document after the page runs its <script> tags. I would take a look at the Sources tab in Chrome; you can find all the loaded scripts there.
Note there may well be successive requests back to the server to get
the data you are looking for. Look at the Network tab in the browser
developer tools and you may see the call that fetches it.
So far I’m trying to get as far as the table. The last element shown below, doc.at_css("div#j-product-description div.ui-box-body div.description-content"), gets me back the div class="description-content" element, but doc.at_css("div#j-product-description div.ui-box-body div.description-content div.origin-part") returns nil. There’s a lot inside kde:widget that I’m not including here.
It seems to me that you are going to have to identify the data source that the in-page JavaScript is using to generate the dynamic table data, and query that rather than trying to work everything out from the HTML (which is just a template for the in-page script to fill). There's probably a JSON URL somewhere that is being loaded into the page, and the script is building from that. This entire approach is pretty fraught with peril, though, because (like any scraping project, only more so) any change to the scheme that the site's developer chooses to implement will break your scraper immediately.
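To make that concrete, here is a rough sketch of the "query the JSON directly" idea. The URL is entirely made up; the real one is whatever request you find in the Network tab:

require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint; substitute the real data URL from the Network tab,
# including its full query string.
uri = URI('https://www.example.com/product/description.json?productId=12345')

response = Net::HTTP.get_response(uri)
data = JSON.parse(response.body)

# Inspect the structure first; the keys depend entirely on the site.
puts data.keys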
Following this path is going to force you to learn about how the site is working on a code level -- and to figure out how they go from data to presentation.
Another approach might be to use a headless browser on the server to construct a "real" DOM of the page, and query that. To be clear -- I do not recommend you follow this path -- I am noting it here to illustrate how ridiculous this effort will be.
One way to visualize this difference is to use the Web Inspector in Safari or Chrome to look at the differences between the raw HTML (Safari labels this tab "Resources") and the DOM (Safari calls this "Elements"). There is likely very little in common outside of the overall outline, if the page is changing as dramatically as you describe. If you hunt through the Resources tab (in Safari) you may find a link to a JSON file that is being required into the page. Loading that URL, rather than the HTML, may give you a much cleaner set of data (which you can parse directly using Ruby) rather than trying to execute JS on your server in order to construct an HTML DOM that you can parse with Nokogiri.
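You can see the same gap from the Ruby side. A small sketch (the selectors are the ones from your earlier message; the URL is a placeholder) shows the static container present in the raw HTML while the dynamically filled element is simply not there:

require 'nokogiri'
require 'open-uri'

# Placeholder URL; use the page you are actually fetching.
doc = Nokogiri::HTML(URI.open('https://www.example.com/some-product.html'))

# The static template container exists in the raw HTML the server sends...
puts doc.at_css('div#j-product-description div.ui-box-body div.description-content')

# ...but the element that the in-page script fills in later does not,
# so this returns nil, just as you saw.
p doc.at_css('div#j-product-description div.ui-box-body div.description-content div.origin-part')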
It wasn’t shown in the source, but when I expanded the element recursively in Chrome developer tools I saw the text I was looking for. So, what’s that going to be worth?
> ...
> It wasn't shown in source but when I expanded the element recursively in chrome developer tools I saw the text I was looking for So, what's that gonna be worth?
As has been said a number of times, that will be because it was filled
in by JavaScript, probably as a result of further calls to the server.
Have you done what I suggested and looked at the Network tab in the
browser developer tools? Then you will see if it fetches any further
data after the initial page fetch. Very often you will find it
fetching some JSON which will very likely contain the data you are
looking for.
Yes, there are scripts running, and when I click Response I see the data I’m looking for. The script names are HTTPS URLs ending in .do? with a lot of query string data, so what should I do?
The “.do” extension may also be a URL mapping scheme for a web application and not a file extension. For example, the Struts framework often uses the “.do” string for mapping Java servlet actions in the web.xml configuration file.
Don't worry about scripts for the moment; look for URLs that provide
data, probably XML or JSON. Surely you have used this yourself in
your Rails apps using AJAX.
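Once you have spotted the call whose Response contains your data, you can usually reproduce it yourself. A hedged sketch, with an invented .do URL and query string standing in for the real ones (some sites also check request headers, so you may need to copy those from the Network tab as well):

require 'net/http'
require 'json'
require 'uri'

# Invented URL; paste the exact one from the Network tab, query string included,
# since the parameters determine which data comes back.
uri = URI('https://www.example.com/getProductDescription.do?productId=12345&locale=en_US')

req = Net::HTTP::Get.new(uri)
# Some endpoints only respond properly to requests that look like AJAX calls.
req['X-Requested-With'] = 'XMLHttpRequest'

res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(req)
end

# If the Response tab showed JSON, parse it directly; if it is XML,
# Nokogiri::XML(res.body) will do instead.
data = JSON.parse(res.body)
p data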
You are missing the entire point of what Colin is telling you. From what you describe, you are trying to do the following:
1. Download the JS from a data source.
2. Reconstruct the DOM using a JS driver like Chrome or PhantomJS.
3. Parse the DOM with Nokogiri or similar.
4. Use the data you gather.
Colin is recommending that you download the JSON and parse it directly for the data you require. This will not require a driver of any kind; you are simply reading the data as JSON, a valid interchange format that Ruby can read directly using the standard library.
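As a final illustration of how little code that last step takes, here is a sketch with a made-up payload standing in for whatever the endpoint actually returns:

require 'json'

# Made-up response body; the real structure is whatever the site's endpoint sends.
body = '{"description":{"origin":"CN","specs":[{"name":"Material","value":"Cotton"}]}}'

data = JSON.parse(body)

puts data['description']['origin']
data['description']['specs'].each do |spec|
  puts "#{spec['name']}: #{spec['value']}"
end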