I’m not very good with the consoles in Chrome and Firefox, but I couldn’t find the text I was looking for in the page source, even though it is seemingly displayed as text (the cursor changes to a vertical line on mouse-over). I found the HTML below in the source. How does this HTML create the text that displays?
I should think that JavaScript is involved. I am sure you asked a
similar question before when you were trying to scrape a website and
couldn't find the text in the HTML.
React components run client-side, meaning the text you are looking for is inserted into the document after the page runs its <script> tags. I would take a look at the Sources tab in Chrome; you can find all the loaded scripts there.
Note there may well be successive requests back to the server to get
the data you are looking for. Look at the Network tab in the browser
developer tools and you may see the call that fetches it.
So far I’m trying to get as far as the table. The last element shown below, doc.at_css("div#j-product-description div.ui-box-body div.description-content"), gets me back the div class="description-content" element, but doc.at_css("div#j-product-description div.ui-box-body div.description-content div.origin-part") returns nil. There’s a lot inside kde:widget that I’m not including here.
It seems to me that you are going to have to identify the data source that the in-page JavaScript is using to generate the dynamic table data, and query that rather than trying to work everything out from the HTML (which is just a template for the in-page script to fill). There's probably a JSON URL somewhere that is being loaded into the page, and the script is building from that. This entire approach is pretty fraught with peril, though, because (like any scraping project, only more so) any change to the scheme that the site's developer chooses to implement will break your scraper immediately.
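To make that concrete, here is a rough sketch of the "query the JSON directly" idea. The URL is entirely made up; the real one is whatever request you find in the Network tab:

require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint; substitute the real data URL from the Network tab,
# including its full query string.
uri = URI('https://www.example.com/product/description.json?productId=12345')

response = Net::HTTP.get_response(uri)
data = JSON.parse(response.body)

# Inspect the structure first; the keys depend entirely on the site.
puts data.keys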
Following this path is going to force you to learn about how the site is working on a code level -- and to figure out how they go from data to presentation.
Another approach might be to use a headless browser on the server to construct a "real" DOM of the page, and query that. To be clear -- I do not recommend you follow this path -- I am noting it here to illustrate how ridiculous this effort will be.
One way to visualize this difference is to use the Web Inspector in Safari or Chrome to look at the differences between the raw HTML (Safari labels this tab "Resources") and the DOM (Safari calls this "Elements"). There is likely very little in common outside of the overall outline, if the page is changing as dramatically as you describe. If you hunt through the Resources tab (in Safari) you may find a link to a JSON file that is being required into the page. Loading that URL, rather than the HTML, may give you a much cleaner set of data (which you can parse directly using Ruby) rather than trying to execute JS on your server in order to construct an HTML DOM that you can parse with Nokogiri.
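You can see the same gap from the Ruby side. A small sketch (the selectors are the ones from your earlier message; the URL is a placeholder) shows the static container present in the raw HTML while the dynamically filled element is simply not there:

require 'nokogiri'
require 'open-uri'

# Placeholder URL; use the page you are actually fetching.
doc = Nokogiri::HTML(URI.open('https://www.example.com/some-product.html'))

# The static template container exists in the raw HTML the server sends...
puts doc.at_css('div#j-product-description div.ui-box-body div.description-content')

# ...but the element that the in-page script fills in later does not,
# so this returns nil, just as you saw.
p doc.at_css('div#j-product-description div.ui-box-body div.description-content div.origin-part')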
It wasn’t shown in the source, but when I expanded the element recursively in Chrome developer tools I saw the text I was looking for. So, what’s that going to be worth?
> ...
> It wasn't shown in source but when I expanded the element recursively in chrome developer tools I saw the text I was looking for So, what's that gonna be worth?
As has been said a number of times, that will be because it was filled
in by JavaScript, probably as a result of further calls to the server.
Have you done what I suggested and looked at the Network tab in the
browser developer tools? Then you will see if it fetches any further
data after the initial page fetch. Very often you will find it
fetching some JSON which will very likely contain the data you are
looking for.
Yes, there are scripts running, and when I click Response I see the data I’m looking for. The script names are HTTPS URLs ending in .do? with a lot of query string data, so what should I do?
The “.do” extension may also be a URL mapping scheme for a web application and not a file extension. For example, the Struts framework often uses the “.do” string for mapping Java servlet actions in the web.xml configuration file.
Don't worry about scripts for the moment; look for URLs that provide
data, probably XML or JSON. Surely you have used this yourself in
your Rails apps using AJAX.
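Once you have spotted the call whose Response contains your data, you can usually reproduce it yourself. A hedged sketch, with an invented .do URL and query string standing in for the real ones (some sites also check request headers, so you may need to copy those from the Network tab as well):

require 'net/http'
require 'json'
require 'uri'

# Invented URL; paste the exact one from the Network tab, query string included,
# since the parameters determine which data comes back.
uri = URI('https://www.example.com/getProductDescription.do?productId=12345&locale=en_US')

req = Net::HTTP::Get.new(uri)
# Some endpoints only respond properly to requests that look like AJAX calls.
req['X-Requested-With'] = 'XMLHttpRequest'

res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(req)
end

# If the Response tab showed JSON, parse it directly; if it is XML,
# Nokogiri::XML(res.body) will do instead.
data = JSON.parse(res.body)
p data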
You are missing the entire point of what Colin is telling you. From what you describe, you are trying to do the following:
1. Download the JS from a data source.
2. Reconstruct the DOM using a JS driver like Chrome or PhantomJS.
3. Parse the DOM with Nokogiri or similar.
4. Use the data you gather.
Colin is recommending that you download the JSON and parse it directly for the data you require. This will not require a driver of any kind; you are simply reading the data as JSON, a valid interchange format that Ruby can read directly using the standard library.
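As a final illustration of how little code that last step takes, here is a sketch with a made-up payload standing in for whatever the endpoint actually returns:

require 'json'

# Made-up response body; the real structure is whatever the site's endpoint sends.
body = '{"description":{"origin":"CN","specs":[{"name":"Material","value":"Cotton"}]}}'

data = JSON.parse(body)

puts data['description']['origin']
data['description']['specs'].each do |spec|
  puts "#{spec['name']}: #{spec['value']}"
end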