Simple Naive Redactor #141

jdherg · 2014-11-30T23:44:34Z

I wrote a quick Python script that tries to "redact" a novel by obscuring names.

I initially used NLTK to identify words to redact, but wanted to see how close I could get to that result without using complex models or anything other than the standard library. As a result, the script now redacts tokens that:

Are longer than two characters
Contain only letters
Never appear with a lowercase first letter anywhere in the text
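
The three rules above can be sketched with only the standard library. This is an illustrative reconstruction, not the code from the repo: the function names `find_redactable` and `redact`, the `\w+` tokenization, the case-insensitive comparison, and the choice to replace every letter with `x` are all assumptions.

```python
import re

# Assumption: a "token" is a run of word characters; the real script may tokenize differently.
TOKEN_RE = re.compile(r"\w+")


def find_redactable(text):
    """Return the set of tokens that satisfy all three rules (hypothetical helper)."""
    tokens = [t for t in TOKEN_RE.findall(text) if t.isalpha()]  # rule 2: only letters
    # Every word that shows up at least once with a lowercase first letter.
    seen_lowercase = {t.lower() for t in tokens if t[0].islower()}
    return {
        t
        for t in tokens
        if len(t) > 2                         # rule 1: longer than two characters
        and not t[0].islower()                # candidate is capitalized here...
        and t.lower() not in seen_lowercase   # rule 3: ...and never lowercase elsewhere
    }


def redact(text, fill="x"):
    """Replace every letter of each redactable token with `fill` (assumed behavior)."""
    targets = find_redactable(text)

    def replace(match):
        word = match.group(0)
        return fill * len(word) if word in targets else word

    return TOKEN_RE.sub(replace, text)


if __name__ == "__main__":
    sample = "Elizabeth walked in. The room was quiet, and the guests saw Elizabeth smile at Mr Darcy."
    print(redact(sample))
```

In this sketch "The" survives because "the" also appears in lowercase, and "Mr" survives because it is only two characters long. The same rule explains the weakness with names that double as common nouns: a name such as "Grace" would be left alone if "grace" appears anywhere else in the text.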

Like some actual document redaction, the results are a little inconsistent. For example, it does a glaringly bad job with names that also appear elsewhere in the text as common nouns.

The repo is here and a redacted copy of Pride and Prejudice is here.

hugovk · 2014-12-01T06:35:44Z

Heh, did you know some 19th-century writers, Austen included, did actually redact names and dates?

jdherg · 2014-12-02T06:42:21Z

Good catch! I ran into a similar question on MetaFilter a little while ago and I think it planted the seed for this script. I nearly used '-' instead of 'x' as my replacement character as a reference to that practice. Thanks for the reminder!

hugovk · 2014-12-02T08:39:57Z

See also #108, which uses a different method of redacting.

hugovk added the completed label Dec 1, 2014
