-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: ability to add custom extractors via api #484
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @mtashley this is looking really good! I had some thoughts on small adjustments we might want to make. LMK if you have any questions!
src/extractors/add-extractor.js
Outdated
|
||
if (baseDomain) { | ||
CustomExtractors[baseDomain] = extractor; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for simplicity's sake, what do you think about making this:
CustomExtractors[hostName || baseDomain] = extractor;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went ahead and removed this logic in favor of the changes listed below.
src/extractors/add-extractor.js
Outdated
@@ -0,0 +1,20 @@ | |||
export const CustomExtractors = []; | |||
|
|||
export default function addExtractor({ hostName, baseDomain, extractor }) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this api. I think we should just be able to say addExtractor(extractor)
and get rid of the hostName
/baseDomain
, since we're already defining the domain in the extractor itself. One thing you might want to do, then, is validate that they've included a domain, since that seems like a base-requirement of a custom extractor.
Then you could either steal some code from here for putting them all together or just do it manually as you go:
https://github.com/postlight/mercury-parser/blob/master/src/extractors/all.js
This also reminds me that you'll want to be able to support supportedDomains
from the custom parser as well, which we're not doing yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestions. I updated the PR to reflect these changes. We can now pass the extractor directly into addExtractor(extractor)
💯
src/mercury.test.js
Outdated
|
||
it('is able to use custom extractors added via api', async () => { | ||
const url = | ||
'https://www.theonion.com/bird-owner-assures-guests-he-sometimes-lets-parakeet-ou-1837586071'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's interesting that you chose the onion, since it already has an extractor! (See deadspin.)
Kind of an interesting test, though, b/c based on the order we're choosing extractors in, we're saying custom extractors would overwrite defaults (which makes sense to me). You might want to keep this test, but actually make the selector for the title be "wrong", so that we are also verifying that it uses the custom extractor and not the one that already exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the tests to use a unique domain (www.sandiegouniontribune.com) and added a custom key w/ the extend
functionality to ensure the parser is using the correct extractor.
f66f025
to
6271346
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtashley This looks great! Simpler and easier to reason about!
My one additional request (sorry!): Do you think you could add the ability to pass a custom extractor via the cli? So something like:
mercury-parser [url] --add-extractor ./path/to/extractor.js
LMK if this makes no sense, or if you need any help on how to start down this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great! Just one quick adjustment and then I think we're good!
// Attempt to load custom extractor from path. | ||
let customExtractor; | ||
if (addExtractor) { | ||
customExtractor = require(addExtractor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do a try/catch here so we can provide some error message if the path doesn't exist or there some problem w/the import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @adampash, in the cli.js file, the overall functionality is already wrapped in a try/catch which spits out the error if the file cannot be imported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@adampash Thank u for this comment about my code. It make me more motivated to share small patches to make free software world a little bigger. |
Summary
This PR adds an
addExtractor
function to the Mercury parser, which allows users to feed in custom extractors at runtime.How it works
Demo URL: https://www.sandiegouniontribune.com/business/growth-development/story/2019-08-27/sdsu-mission-valley-stadium-management-firm