Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ability to add custom extractors via api #484

Merged
merged 10 commits into from
Sep 4, 2019

Conversation

mtashley
Copy link
Contributor

@mtashley mtashley commented Aug 27, 2019

Summary

This PR adds an addExtractor function to the Mercury parser, which allows users to feed in custom extractors at runtime.

How it works

Demo URL: https://www.sandiegouniontribune.com/business/growth-development/story/2019-08-27/sdsu-mission-valley-stadium-management-firm

const customExtractor = {
  domain: 'www.sandiegouniontribune.com',
  title: {
    selectors: ['h1', '.ArticlePage-headline'],
  },
  author: {
    selectors: ['.ArticlePage-authorInfo-bio-name'],
  },
  content: {
    selectors: ['article'],
  },
};

Mercury.addExtractor(customExtractor);

Copy link
Contributor

@adampash adampash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mtashley this is looking really good! I had some thoughts on small adjustments we might want to make. LMK if you have any questions!


if (baseDomain) {
CustomExtractors[baseDomain] = extractor;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for simplicity's sake, what do you think about making this:

CustomExtractors[hostName || baseDomain] = extractor;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went ahead and removed this logic in favor of the changes listed below.

@@ -0,0 +1,20 @@
export const CustomExtractors = [];

export default function addExtractor({ hostName, baseDomain, extractor }) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this api. I think we should just be able to say addExtractor(extractor) and get rid of the hostName/baseDomain, since we're already defining the domain in the extractor itself. One thing you might want to do, then, is validate that they've included a domain, since that seems like a base-requirement of a custom extractor.

Then you could either steal some code from here for putting them all together or just do it manually as you go:

https://github.com/postlight/mercury-parser/blob/master/src/extractors/all.js

This also reminds me that you'll want to be able to support supportedDomains from the custom parser as well, which we're not doing yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestions. I updated the PR to reflect these changes. We can now pass the extractor directly into addExtractor(extractor) 💯


it('is able to use custom extractors added via api', async () => {
const url =
'https://www.theonion.com/bird-owner-assures-guests-he-sometimes-lets-parakeet-ou-1837586071';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting that you chose the onion, since it already has an extractor! (See deadspin.)

Kind of an interesting test, though, b/c based on the order we're choosing extractors in, we're saying custom extractors would overwrite defaults (which makes sense to me). You might want to keep this test, but actually make the selector for the title be "wrong", so that we are also verifying that it uses the custom extractor and not the one that already exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the tests to use a unique domain (www.sandiegouniontribune.com) and added a custom key w/ the extend functionality to ensure the parser is using the correct extractor.

@mtashley mtashley force-pushed the feat-custom-extractor-by-api branch from f66f025 to 6271346 Compare August 28, 2019 16:36
@mtashley mtashley requested a review from adampash August 28, 2019 16:59
Copy link
Contributor

@adampash adampash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtashley This looks great! Simpler and easier to reason about!

My one additional request (sorry!): Do you think you could add the ability to pass a custom extractor via the cli? So something like:

mercury-parser [url] --add-extractor ./path/to/extractor.js

LMK if this makes no sense, or if you need any help on how to start down this path.

@mtashley mtashley requested a review from adampash August 29, 2019 19:48
Copy link
Contributor

@adampash adampash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! Just one quick adjustment and then I think we're good!

// Attempt to load custom extractor from path.
let customExtractor;
if (addExtractor) {
customExtractor = require(addExtractor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do a try/catch here so we can provide some error message if the path doesn't exist or there some problem w/the import.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @adampash, in the cli.js file, the overall functionality is already wrapped in a try/catch which spits out the error if the file cannot be imported.

Copy link
Contributor

@adampash adampash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mtashley mtashley merged commit e12c916 into master Sep 4, 2019
@mtashley mtashley deleted the feat-custom-extractor-by-api branch September 4, 2019 14:32
@saar
Copy link

saar commented Sep 5, 2019

Hey @mtashley this is looking really good! I had some thoughts on small adjustments we might want to make. LMK if you have any questions!

@adampash Thank u for this comment about my code. It make me more motivated to share small patches to make free software world a little bigger.
I am new to JavaScript's world and was very noob those days I wrote that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants