Problem with external JavaScript #45

Closed
IanStorm opened this issue Sep 26, 2017 · 9 comments

@IanStorm

Hello,

I'm trying to parse information from a website whose content creation relies on JavaScript; on external JavaScript files, to be precise. Somehow AngleSharp doesn't seem to work in that particular scenario.
I was able to create an MWE of my issue:

.html:

<!doctype html>
<html>
    <head><title>Sample</title></head>
    <body>
        <script src="test.js" type="text/javascript"></script>
    </body>
</html>

test.js:

document.title = 'Simple manipulation...';
document.write('<span class=greeting>Hello World!</span>');

The AngleSharp HtmlParser that I am using is generated as follows:

var config = Configuration.Default
    .WithDefaultLoader(setup => setup.IsResourceLoadingEnabled = true)
    .WithJavaScript();
var parser = new HtmlParser(config);

As you can see, the example is pretty close to the one from the AngleSharp wiki.

I already saw some issues that seem to be related (#35, #44, #43, #24). Still, I can't quite figure out a solution to my problem.
Can somebody please help?
Thanks in advance.

@FlorianRappl
Contributor

Any reason why you don't use BrowsingContext?

That being written - I do not see your problem ("does not work", i.e., what is the result you expect?). I guess you use (or should use) ParseAsync. The script is - in any way - executed asynchronously. Thus are you sure that you do not get any result at any time?

@IanStorm
Author

Thank you for that quick response.

> That being written - I do not see your problem ("does not work", i.e., what is the result you expect?).

Good point. I should have stated this more clearly.
I am expecting the following HTML as a result of the parsing:

<!doctype html>
<html>
    <head><title>Sample</title></head>
    <body>
        <script src="test.js" type="text/javascript"></script>
        <span class=greeting>Hello World!</span>
    </body>
</html>

Meaning that I am expecting a span tag inside the body.
But I am receiving the input HTML as output, i.e., without any span tag.

> I guess you use (or should use) ParseAsync. The script is - in any way - executed asynchronously.

I am using the ParseAsync method. I also tried with Parse. But if it's executed asynchronously in any way, that at least makes it clear why Parse did not work either.

> Thus are you sure that you do not get any result at any time?

Yes and no. Maybe "sure" is the wrong word:
"Yes", because I tried checking the content, i.e., Body.InnerHtml, of the returned document, then calling Thread.Sleep(1000), and checking it again. I even did this 10 times in a row, for a total wait time of 10 seconds.
"No", because there may be a misunderstanding and I may be doing it wrong.

> Any reason why you don't use BrowsingContext?

There's not been a specific reason for that, no. It was basically because the JavaScript example from your wiki was using the HtmlParser, and I simply started by copying the example.
However, I just tried using the BrowsingContext; the result is the same. Even if I wait for some time, the document remains unchanged.

To summarize:
Regardless of whether I use BrowsingContext or HtmlParser, the span tag is not added to the body of the returned document. This behavior is also independent of any wait time.
Finally, just to make this clear: if I put the content of the test.js file inline in the HTML, everything works fine. It only fails when the script is referenced as an external file.

@FlorianRappl
Contributor

Hi @IanStorm - thanks for the clarification.

I think everything is working alright - the major issue seems to be the resolution of test.js. It is resolved relative to the HTML path, which (as you use the HtmlParser directly) is set to about:blank (no URL given). If you use the BrowsingContext you can use the virtual response in order to set a proper path for your input document.

context.OpenAsync(res => res.Content(...).Address(...))
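
To illustrate why the document address matters, here is a minimal sketch of relative URL resolution. It uses the standard URL class in JavaScript (Node or a browser) as an analogy, not AngleSharp's own resolver; the helper name resolveScript and the address http://localhost/pages/index.html are made up for the example.

```javascript
// Hypothetical helper (not part of AngleSharp): resolve a script reference
// against the address of the document that contains it.
// Returns the absolute URL as a string, or null if resolution is impossible.
function resolveScript(src, documentAddress) {
  try {
    return new URL(src, documentAddress).href;
  } catch (err) {
    // A relative reference cannot be resolved against a base like about:blank.
    return null;
  }
}

// Document parsed from a plain string: its address is about:blank.
console.log(resolveScript('test.js', 'about:blank'));
// prints null

// Document opened with an explicit (assumed) address.
console.log(resolveScript('test.js', 'http://localhost/pages/index.html'));
// prints http://localhost/pages/test.js
```

With about:blank as the base there is nothing to resolve test.js against, which matches the behavior described above: the script is never found, so it never runs.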

@IanStorm
Author

Right after writing my response, I suddenly realized what the issue might be. I said...

> [...] and I simply started by copying the example.

I first tried using the exact code, and then extended it step by step to reflect my scenario, to see where exactly it starts failing.
What I am doing is first converting the HTML into a string and passing that as a parameter to the AngleSharp parser. That being said, how should the parser know about the external resource? As it only receives a string, it is not able to resolve the relative URL given in the HTML.
I guess that's the silly thing that I did wrong while slowly adapting the example to the target scenario.

That means, I will now verify whether this is correct, and update this issue later. I guess/hope that no further action is required from your side.

@IanStorm
Author

Oh, while writing my response, I did not see your comment, @FlorianRappl .

Yeah, your input matches what I've written.
Thank you. And sorry for bothering you; it's a silly mistake on my side.

As I've said before, I will check whether this is correct and then update you here later.

@FlorianRappl
Contributor

Alright, thanks for the info @IanStorm !

@IanStorm
Author

@FlorianRappl ,
as discussed, this is my "callback".

I can solve the issue with the external JavaScript file using your proposed solution.
So, thank you for this. 👍

However, this raises another issue (if you prefer, we can close this issue and I can open another):
This approach only works for files that can be accessed via http://…; not for ones located on my local disk, i.e., files that are accessed via file:///…. It's not a problem for the production scenario, as I am only accessing files via HTTP there. But for unit testing it seems more convenient to me to access files stored on the local machine (and therefore via file:///…).
Can you comment on this? Is this intended / by design? Do you know of any way other than simulating HTTP access in the tests as well?

@FlorianRappl
Contributor

Hm, I don't know if this was built into v0.9.9 of AngleSharp. I don't think so. But AngleSharp.Io also brings a requester that can be used with the file:// scheme. Or you can roll your own, of course (you just need to implement IRequester).
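
The general idea of a scheme-specific requester can be sketched as follows. This is a JavaScript analogy for Node, not AngleSharp's IRequester; the names fileRequester, supportsProtocol, request, and load are all hypothetical and chosen only for illustration.

```javascript
const fs = require('fs');
const { fileURLToPath } = require('url');

// Hypothetical requester for the file: scheme (not AngleSharp's API).
const fileRequester = {
  // Claims only URLs whose scheme is file:
  supportsProtocol: (protocol) => protocol === 'file:',
  // Reads the referenced file from the local disk as UTF-8 text.
  request: (url) => fs.readFileSync(fileURLToPath(url), 'utf8'),
};

// Hypothetical loader: hands the URL to the first requester that claims its scheme.
function load(address, requesters) {
  const url = new URL(address);
  const requester = requesters.find((r) => r.supportsProtocol(url.protocol));
  if (!requester) {
    throw new Error(`No requester registered for ${url.protocol}`);
  }
  return requester.request(url);
}
```

In this sketch, a loader that only has an HTTP requester registered would reject a file: URL outright, which is roughly the situation described above. Registering a requester that claims the file scheme is what makes local files loadable in tests.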

@IanStorm
Author

The IRequester did it. Now all problems are solved.
Thank you for the great support! :)

IanStorm added a commit to XElementDev/RYB that referenced this issue Sep 28, 2017
** re-implemented LoginRecognizer
** reviewed unit tests (i.e., adapted all, removed some)
** fixed LoginRecognizer_GetLoginType_ContentNotYetLoaded test
** for more information on the latter, see AngleSharp/AngleSharp.Js#45