A while ago I was building a database of newspaper articles retrieved from LexisNexis for a research project in which I was working as a research assistant. At some point we noticed that we seemed to have a lot of duplicates in our database. I had already removed the duplicates with R so we were really surprised that those are still in there. However, after some investigation, I found that there are indeed small differences between the articles we had identified manually as duplicates in our data.