[Mayan EDMS: 1491] Duplicate document check on watch folders feature

Discussion:

Victor Zele

2017-01-31 00:22:49 UTC

We have several watch folders set up for contracts, invoices, quotes, etc.

It would be nice if Mayan would validate that a new document does not
already exist in the system, perhaps by checking it against an MD5 checksum
table of the current documents, and reject the new document if it does.

Also, for existing duplicates, it would be nice to run a cleaner on the
/opt/mayan-edms/lib/python2.7/site-packages/mayan/media/document_storage
directory of PDFs to clean them out. I can write a shell script to
check for PDF duplicates via MD5 sums, but there is no way to automate
cleaning them out of the Mayan system/DB.
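
A minimal sketch of the file-level check described above, written in Python rather than shell. It only reports files with identical content and deletes nothing. The function names are made up for illustration, and the algorithm defaults to MD5 only because that is what this thread discusses; it says nothing about how Mayan itself names or hashes stored files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(directory, algorithm="md5"):
    """Map each checksum shared by two or more files to the sorted list of their paths."""
    groups = defaultdict(list)
    for path in Path(directory).rglob("*"):
        if path.is_file():
            groups[file_checksum(path, algorithm)].append(path)
    # Checksums that occur only once are not duplicates, so drop them.
    return {
        checksum: sorted(paths)
        for checksum, paths in groups.items()
        if len(paths) > 1
    }
```

Calling find_duplicate_files() on the storage directory would list the candidate files; removing the corresponding documents from the Mayan database is a separate step that this sketch does not attempt.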

Just an idea,
Victor

Roberto Rosario

2017-02-13 22:07:25 UTC

Those are very good ideas!

There was once a duplicate search feature, but it was removed due to lack of
usage and because it ran in the foreground and could take a long time: the
checksum of each document was checked against the checksum of every other
document, so the running time grew quadratically with the number of
documents. If finding the duplicates is the first step, the second step
would be to search for those documents through the API by checksum. The
checksum field is not exposed, so that is another update to the API that
would need to be done.
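
Comparing every checksum against every other one is not the only way to find duplicates. As a rough illustration, documents can be grouped by checksum in a single pass. The (doc_id, checksum) record layout below is hypothetical and is not Mayan's data model.

```python
from collections import defaultdict


def group_by_checksum(records):
    """Group document ids by checksum in one pass over (doc_id, checksum) pairs."""
    groups = defaultdict(list)
    for doc_id, checksum in records:
        groups[checksum].append(doc_id)
    # Only checksums seen more than once identify duplicate documents.
    return {checksum: ids for checksum, ids in groups.items() if len(ids) > 1}
```

Each record is visited once, so the work grows roughly in proportion to the number of documents instead of the number of document pairs.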

Skipping duplicates from the watch folder would be less difficult, since
this is just a single query to see if the checksum already exists in
the database.
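
To make the single-query idea concrete, here is a small sketch using an in-memory SQLite database as a stand-in. The document_checksums table and all function names are hypothetical, MD5 is used only because it is the checksum mentioned in this thread, and none of this reflects Mayan's actual schema or code.

```python
import hashlib
import sqlite3


def open_checksum_store():
    """Create an in-memory store with one row per known checksum (hypothetical schema)."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE document_checksums (checksum TEXT PRIMARY KEY)"
    )
    return connection


def is_duplicate(connection, content):
    """Return True if the checksum of `content` is already recorded."""
    checksum = hashlib.md5(content).hexdigest()
    row = connection.execute(
        "SELECT 1 FROM document_checksums WHERE checksum = ? LIMIT 1",
        (checksum,),
    ).fetchone()
    return row is not None


def record_document(connection, content):
    """Record the checksum of newly accepted content."""
    checksum = hashlib.md5(content).hexdigest()
    connection.execute(
        "INSERT INTO document_checksums (checksum) VALUES (?)", (checksum,)
    )
```

A watch folder handler built along these lines would call is_duplicate() before importing a file and record_document() after accepting it. Because the checksum column is the primary key, the lookup is an indexed query rather than a scan.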

I'm updating the roadmap wiki
(https://gitlab.com/mayan-edms/mayan-edms/wikis/roadmap/) and will add
these.

Thank you!

Lin Pro

2017-10-21 03:45:33 UTC

Post by Victor Zele
I can write a shell script to check for PDF duplicates via MD5 sums, but
no way to automate cleaning them out of the Mayan system/DB.

Hi Victor,
By the way, what tools are best for checking for duplicates in this scenario?
For file-system-level duplicates there is a tool called "fdupes":
https://github.com/adrianlopezroche/fdupes

I just discovered it through an article on lxer.com.
It checks MD5 sums to see if a directory contains two or more identical
files.
It seems like a neat idea. The tool itself is 16 years old.

I wonder what your tool of choice for duplicates is.

Usage: fdupes [options] DIRECTORY...

 -r --recurse       for every directory given follow subdirectories
                    encountered within
 -R --recurse:      for each directory given after this option follow
                    subdirectories encountered within (note the ':' at
                    the end of the option, manpage for more details)
 -s --symlinks      follow symlinks
 -H --hardlinks     normally, when two or more files point to the same
                    disk area they are treated as non-duplicates; this
                    option will change this behavior
 -n --noempty       exclude zero-length files from consideration
 -A --nohidden      exclude hidden files from consideration
 -f --omitfirst     omit the first file in each set of matches
 -1 --sameline      list each set of matches on a single line
 -S --size          show size of duplicate files
 -m --summarize     summarize dupe information
 -q --quiet         hide progress indicator
 -d --delete        prompt user for files to preserve and delete all
                    others; important: under particular circumstances,
                    data may be lost when using this option together
                    with -s or --symlinks, or when specifying a
                    particular directory more than once; refer to the
                    fdupes documentation for additional information
 -N --noprompt      together with --delete, preserve the first file in
                    each set of duplicates and delete the rest without
                    prompting the user
 -I --immediate     delete duplicates as they are encountered, without
                    grouping into sets; implies --noprompt
 -p --permissions   don't consider files with different owner/group or
                    permission bits as duplicates
 -o --order=BY      select sort order for output and deleting; by file
                    modification time (BY='time'; default), status
                    change time (BY='ctime'), or filename (BY='name')
 -i --reverse       reverse order while sorting
 -v --version       display fdupes version
 -h --help          display this help message

regards
Lin