Simple Naive Redactor #141

Open
jdherg opened this issue Nov 30, 2014 · 3 comments

Comments

jdherg commented Nov 30, 2014

I wrote a quick Python script that tries to "redact" a novel by obscuring names.

I initially used NLTK to identify words to redact, but wanted to see how close I could get to that result without using complex models or anything other than the standard library. As a result, the script now redacts tokens that:

  • Are longer than two characters
  • Contain only letters
  • Never appear anywhere in the text with a lowercase first letter
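
The three rules above can be sketched with only the standard library. This is a hypothetical reconstruction, not the script from the repo: the function name `redact`, the regex-based tokenization (runs of ASCII letters), and the choice to replace each letter so that word length is preserved are all assumptions made for illustration.

```python
import re

# Assumption: a "token" is a maximal run of ASCII letters, so every
# token automatically satisfies the "contains only letters" rule.
TOKEN = re.compile(r"[A-Za-z]+")


def redact(text, replacement="x"):
    """Mask tokens that look like names under the three rules above."""
    tokens = TOKEN.findall(text)

    # Every word that shows up at least once with a lowercase first letter.
    seen_lowercase = {t.lower() for t in tokens if t[0].islower()}

    def mask(match):
        word = match.group(0)
        # Redact only if longer than two characters and never seen
        # starting with a lowercase letter anywhere in the text.
        if len(word) > 2 and word.lower() not in seen_lowercase:
            # Assumption: one replacement character per letter.
            return replacement * len(word)
        return word

    return TOKEN.sub(mask, text)


if __name__ == "__main__":
    print(redact("Darcy saw the dog."))
    print(redact("Grace has grace."))
```

In this sketch, "Darcy" is masked because it never occurs as "darcy", while "Grace" survives because "grace" appears elsewhere as a common noun. A sentence-initial word that never recurs in lowercase would be masked as well.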

Like some actual document redaction, the results are a little inconsistent. For example, it does a glaringly bad job with names that are also common nouns that appear elsewhere in the text.

The repo is here and a redacted copy of Pride and Prejudice is here.

Collaborator

hugovk commented Dec 1, 2014

Heh, did you know some 19th-century writers, Austen included, actually did redact names and dates?

Author

jdherg commented Dec 2, 2014

Good catch! I ran into a similar question on MetaFilter a little while ago and I think it planted the seed for this script. I nearly used '-' instead of 'x' as my replacement character as a reference to that practice. Thanks for the reminder!

Collaborator

hugovk commented Dec 2, 2014

See also #108, which uses a different method of redacting.
