I wrote a quick Python script that tries to "redact" a novel by obscuring names.
I initially used NLTK to identify words to redact, but wanted to see how close I could get to that result without using complex models or anything other than the standard library. As a result, the script now redacts tokens that:
Are longer than two characters
Contain only letters
Never start with a lowercase letter anywhere in the text
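The three rules above can be sketched with only the standard library. This is a rough illustration, not the script from the repo: the function name `redact`, the `\w+` tokenization, the case-insensitive reading of the third rule, and the choice to keep each mask the same length as the word it replaces are all assumptions made for the example.

```python
import re

# Assumption: a "token" is any run of word characters; the real script may
# tokenize differently.
TOKEN_RE = re.compile(r"\w+")


def redact(text, mask="x"):
    """Mask tokens that look like names under the three rules above."""
    tokens = TOKEN_RE.findall(text)

    # Words that start with a lowercase letter at least once in the text,
    # stored lowercased so "The" can be matched against "the".
    seen_lowercase = {t.lower() for t in tokens if t[0].islower()}

    def replace(match):
        token = match.group(0)
        if (
            len(token) > 2                            # longer than two characters
            and token.isalpha()                       # contains only letters
            and token.lower() not in seen_lowercase   # never seen in lowercase
        ):
            # Assumption: preserve the word's length in the mask.
            return mask * len(token)
        return token

    return TOKEN_RE.sub(replace, text)


sample = "Elizabeth met Darcy in the park. Rose picked a rose."
print(redact(sample))
# xxxxxxxxx met xxxxx in the park. Rose picked a rose.
```

In the sample, "Rose" survives because "rose" also appears in lowercase, which is the kind of inconsistency the rules produce.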
Like some actual document redaction, the results are a little inconsistent. For example, it does a glaringly bad job with names that double as common nouns: if the word appears in lowercase anywhere else in the text, the third rule leaves the name unredacted.
The repo is here and a redacted copy of Pride and Prejudice is here.
Good catch! I ran into a similar question on MetaFilter a little while ago and I think it planted the seed for this script. I nearly used '-' instead of 'x' as my replacement character as a reference to that practice. Thanks for the reminder!