
Custom regular report for Stanford Medicine Center for Improvement (SMCI) #1201

Closed
peetucket opened this issue Oct 2, 2020 · 6 comments · Fixed by #1204

Comments
@peetucket
Consider producing a custom report on a regular basis that includes publications both for users currently in Profiles and for users not currently in Profiles.

Most likely it will be a once-a-month report. SMCI needs either a ready-made citation or the ability to build the citation they want. Therefore, we were going to provide them with the most recent approved publications (from the last date received) with the following:

  • First Published Date
  • DOI URL
  • Chicago Citation Format
  • APA Citation Format
  • MLA Citation Format

Additionally, we wanted to provide them with the separated fields that form the citation, in case they want to do their own formatting for the citation display on their website. That would be, for example:

  • Title
  • Article No.
  • Pages
  • DOI / DOI URL
  • Journal Title
  • Publisher
  • Year Published
  • Volume
  • Issue
  • Supplement Number
  • Authors - First name, Middle Initial, Last Name

This would also need to include each author's details, when available:

  • First Name
  • Last Name
  • Position
  • sunet id
  • profile id
  • univ id
  • email address

The thought is that, if you can pull the information for those without profiles, it may make sense to produce the full report for everyone, including the data for those with profiles (thus some of the fields noted above), so that the report comes from one combined source. Based on the current information we have, the number of researchers ranges from 94 to 130. Of these, around 33 did not have profiles or active profiles, and a number do not have full profiles with the publication import option. Also, the list includes staff members, who would not have publication import turned on by default, as it is only on by default for faculty and postdocs.

@peetucket commented Oct 2, 2020

Various options considered below.

Option A: Modify the sul_pub codebase to add this report

Modify the existing sul_pub system to add the new users so they can be harvested. Either create a custom report whose results can be output on demand via a new API call, or expect the Profiles team to pull all the publications needed per user and then run the aggregate report on their end. Assuming we cannot add the additional authors not already in the profiles system to the authors table in some way (e.g. by assigning them special/unique cap_profile_ids), this option is not advisable due to the extensive modifications needed in a production system.

Challenges:

  • our current database schema relies on links between the authors table, the publications table, and the contributions table
  • we cannot add any new rows to the "authors" table without a cap_profile_id, as that is a required field with a unique database constraint (a sketch of this constraint follows this list); users not in the profiles system will not have a cap_profile_id
  • our harvesting code depends on the authors table, so we cannot harvest publications for authors not in the authors table (e.g. from a separate table) without a lot of modifications across the codebase
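
To illustrate the constraint (a hypothetical sketch in Rails migration form, not the actual sul_pub schema), the authors table effectively enforces:

# Hypothetical sketch of the constraint described above -- not the
# actual sul_pub migration. cap_profile_id is required and unique,
# so a row for an author without a Profiles identity cannot be added.
class AddCapProfileIdConstraint < ActiveRecord::Migration[5.2]
  def change
    change_column_null :authors, :cap_profile_id, false
    add_index :authors, :cap_profile_id, unique: true
  end
end

# Author.create!(first_name: 'Jane', last_name: 'Doe')
# => raises ActiveRecord::NotNullViolation (no cap_profile_id to supply)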

Rough estimate: 2-3 weeks for a couple of developers, likely introducing new bugs in the process

Option B: Create a new system

Create a new codebase with a new database that contains only the authors of interest (whether in Profiles or not). Either create a custom report whose results can be output on demand via a new API call, or expect the Profiles team to pull all the publications needed per user via an API call and then run the aggregate report on their end. This option has the potential to be time-consuming from a development standpoint.

Challenges:

  • undesirable duplication of code from sul_pub (e.g. the client that connects to the Web of Science and runs queries)
  • need to establish a mechanism to keep the authors in the new database up to date (more API calls, or some import/update mechanism that needs to be run periodically)
  • building a new system may require a new VM or use of the existing sul_pub production systems, which may be time-consuming and require Ops support
  • we would have another system to maintain, update regularly, and support with bug fixes

Rough estimate: 3-4 weeks for a couple of developers, plus ongoing maintenance and support costs

Option C: Find a solution using existing tools

Do not modify sul_pub or create a new system; instead, investigate the use of existing tools we have access to, such as the Web of Science UI, the Dimensions UI, etc. A staff member would need to periodically run a number of queries and then manually aggregate the results into a report. While not automated, this option could be done without software developers or operations support; it could instead be done by other staff. It does require regular staff support, but documentation and training could hopefully reduce the time required after the initial ramp-up.

Challenges:

  • report must be created manually
  • would require initial investigation and documentation on how to run reports
  • would require support staff to run it on a regular schedule

Rough estimate: 1 week or so for a developer to help a non-developer (e.g. Jacob) find a solution, plus the ongoing time cost for that staff member to run the reports

Option D: Write a scripty solution

This is a combination of options B and C. It takes a coding approach to manual reports, minimizing the repetitiveness of running reports over and over again, while not going as far as creating a whole new system accessible via APIs. For example, a Ruby script could import a CSV file of authors and use the Dimensions or Web of Science API to query for and produce lists of publications for the aggregate reporting (see the sketch below). We could return a big list of publications to the Profiles team, which they could use for reporting, or we could add development time to produce the report ourselves. It still requires software development time, but less than would be needed for a full-fledged system available via API calls. It could be run from a QA server or even a developer laptop to minimize demands on Ops. It also still requires ongoing support and maintenance, but less than a full system would. Finally, it requires someone technical enough to run a script on a regular basis (though this person doesn't necessarily need to be a software developer).
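
A rough sketch of this option, assuming a hypothetical smci_authors.csv with first_name, last_name, and institution columns, and reusing the sul_pub WebOfScience client (see the query examples later in this thread):

require 'csv'

# Rough sketch only: file names and CSV columns are hypothetical.
# For each author in the input CSV, run a Web of Science name search
# and write one output row per publication found.
CSV.open('smci_publications.csv', 'w') do |out|
  out << %w[author title apa_citation]
  CSV.foreach('smci_authors.csv', headers: true) do |row|
    name = "#{row['last_name']}, #{row['first_name']}" # 'Last, First' per WoS convention
    retriever = WebOfScience.queries.search_by_name(name, [row['institution'] || 'Stanford'])
    results = retriever.next_batch.to_a
    results.concat(retriever.next_batch.to_a) while retriever.next_batch?
    results.each { |r| out << [name, r.pub_hash[:title], r.pub_hash[:apa_citation]] }
  end
end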

Rough estimate: 2 weeks or so for a developer to help a non-developer (e.g. Jacob) script a solution, plus the ongoing time cost for that staff member to run the scripts

Option E: Use ORCID/CrossRef/other systems

This makes use of an ORCID/Crossref connection to have authors keep their ORCID profiles up to date. This essentially becomes a training/support task to help authors get ORCID profiles set up and populated. It may also require development work to produce a report from all of the ORCID profiles.

Challenges:

  • authors need to have an ORCID iD, and must set one up if they don't yet have one
  • authors need to connect their ORCID and Crossref records
  • authors may need training and prompting to do this
  • some publications will not make it onto ORCID if they aren't listed in Crossref
  • it's not clear how reports would be produced from ORCID profiles (some coding against the ORCID API may still be needed; see the sketch after this list)
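
For scoping purposes, a minimal sketch of pulling the works from a public ORCID record via the ORCID public API (v3.0); the iD is a placeholder, and depending on ORCID's policy a read-public token may also need to be sent in an Authorization header:

require 'net/http'
require 'json'

# Minimal sketch: list the works on a public ORCID record.
orcid_id = '0000-0000-0000-0000' # placeholder iD
uri = URI("https://pub.orcid.org/v3.0/#{orcid_id}/works")
request = Net::HTTP::Get.new(uri, 'Accept' => 'application/json')
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
works = JSON.parse(response.body)
works['group'].each do |group|
  # each group holds one or more summaries of the same work
  puts group.dig('work-summary', 0, 'title', 'title', 'value')
end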

Rough estimate: a few days of Jacob or Peter working with other staff to find a solution and help with training

Summary

Options A & B are not advisable for various reasons. Option C is the least intrusive for the software development team. Option D may be the most practical if we choose to take this on. Option E is the best path for encouraging ORCID adoption, but the least likely to result in an actual report in a reasonable time frame.

@peetucket commented Oct 14, 2020

Additional questions for Tina:

  • just wondering: is the result of this report a list of publications that will show up on a website somewhere?
  • the idea is that we would run this custom report manually approximately once per month, using the input CSV file, and then generate an output CSV file that we could place somewhere for easy transfer
  • we presume the input is a list of authors in a CSV file, and that it would be clear which authors are in profiles and which are not (for example, authors in profiles would have a cap_profile_id column populated in the input CSV, and authors not in profiles would have their first name and last name provided)
  • we presume the output is a list of publications in a flat CSV file with the columns being the fields noted in your earlier email (a hypothetical input example follows this list)
  • note that any publications returned for authors not in Profiles would not be approved/confirmed - they would just be the results of a standard name search
  • in addition, for authors not in Profiles, we would use the standard "Stanford" search qualifier for institution along with the name provided (unless the input CSV includes a column with institution names for those authors as well)
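
To make the input assumption concrete, a hypothetical input CSV might look like this (the column names and the cap_profile_id value are illustrative only; rows with cap_profile_id populated are Profiles authors, rows without one get a name-plus-institution search):

cap_profile_id,first_name,last_name,institution
12345,Jane,Doe,
,John,Smith,Stanford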

@peetucket commented Oct 14, 2020

More possible ideas:

  1. Use the sul-pub API to pull publications for each known user
  2. Use the Dimensions API to pull publications for each unknown user (but then we need to parse the response and create formatted citations).
  3. Merge the sets (see the sketch below).
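
The merge in step 3 could be as simple as de-duplicating the combined list on DOI. A sketch, assuming both sources have already been normalized into hashes with a 'doi' key (the Dimensions response would need to be mapped into that shape first):

# Sketch: merge the two publication sets, de-duplicating on DOI.
# sul_pub_results and dimensions_results are assumed to be arrays of
# hashes that each carry a 'doi' string (hypothetical normalized shape).
# Publications without a DOI are kept distinct via object_id.
merged = (sul_pub_results + dimensions_results)
           .uniq { |pub| pub['doi']&.downcase || pub.object_id }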

OR

  1. It turns out we can run arbitrary WoS queries from within sul-pub and return formatted citations using our existing code:

query = 'AU=("TestUser,First") AND AD=("Stanford" OR "Nebraska")' # publications involving both Stanford and University of Nebraska
retriever = WebOfScience.queries.user_query(query)
results = retriever.next_batch.to_a
puts retriever.records_found     # shows total number of results found
puts retriever.records_retrieved # shows total number retrieved in the given batch
results.each { |result| puts result.pub_hash[:title] } # print all of the titles
results = retriever.next_batch.to_a if retriever.next_batch? # get the next batch if available

# look at the citations
results.each { |result| puts "#{result.pub_hash[:apa_citation]}\n\r" }

Or, even easier, by name and institution directly:

name = 'Casciotti, Karen'
institution = ['Stanford']
retriever = WebOfScience.queries.search_by_name(name, institution)
results = retriever.next_batch.to_a
puts retriever.records_found     # shows total number of results found
puts retriever.records_retrieved # shows total number retrieved in the given batch
results.each { |result| puts result.pub_hash[:title] } # print all of the titles
results = retriever.next_batch.to_a if retriever.next_batch? # get the next batch if available

# look at the citations
results.each { |result| puts "#{result.pub_hash[:apa_citation]}\n\r" }

So we could write a rake task to run two sets of queries: one against our local database for the approved publications of authors with profile IDs, and one against WoS for authors without profile IDs, merging the results (see the sketch below). This may be faster since we have the same pub_hash structure ready to go in each case.
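
A sketch of what that rake task could look like (the task name, CSV columns, and the author-to-publications lookup are illustrative assumptions, not the actual implementation; the WoS calls mirror the snippets above):

# lib/tasks/smci_report.rake -- illustrative sketch only, not the
# actual task from #1204. Task name and CSV columns are assumptions.
namespace :reports do
  desc 'Aggregate SMCI publications: local approved pubs plus WoS name searches'
  task smci: :environment do
    require 'csv'
    CSV.open('smci_report.csv', 'w') do |out|
      out << %w[author title apa_citation]
      CSV.foreach('smci_authors.csv', headers: true) do |row|
        name = "#{row['last_name']}, #{row['first_name']}"
        pub_hashes =
          if row['cap_profile_id'] # in Profiles: use locally approved publications
            author = Author.find_by(cap_profile_id: row['cap_profile_id'])
            # assumes an author-to-publications association scoped to approved pubs
            author ? author.publications.map(&:pub_hash) : []
          else # not in Profiles: standard WoS name search
            retriever = WebOfScience.queries.search_by_name(name, ['Stanford'])
            results = retriever.next_batch.to_a
            results.concat(retriever.next_batch.to_a) while retriever.next_batch?
            results.map(&:pub_hash)
          end
        pub_hashes.each { |h| out << [name, h[:title], h[:apa_citation]] }
      end
    end
  end
end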

@peetucket
See #1204 for a custom class/rake task that does the report.

@jacobthill
Here is an example of a dashboard made using Dimensions / Google BigQuery. If it meets the reporting requirements, it could be the least intrusive option for the development team, and it could provide additional views into the data (I'm not sure if those would be useful). The downside is the added cost.

@peetucket
Those dashboards in GBQ are definitely nifty, and I think they are a nice solution for some of the custom requests we've fielded in the RIALTO world. I'm not 100% sure it fulfills this particular request, though, as I believe they may be using the result of this report simply to import publications into a website, so I'm not sure they'd benefit from the nice UI in the report; they may just need a CSV. The other challenge is that they need approved publications from sul_pub for authors that have profiles, in addition to this second set of authors that are not in profiles, so there are two separate data sources. The sul_pub data requires either direct database queries on the server or using our own internal API, which can return the publications as JSON.
