Scripts for identifying and removing duplicated keywords

Post by vlad » 01 Apr 15 20:20

I have written and attached here two scripts for identifying and removing duplicated keywords:

Filter With Duplicated Keywords filters for the images that contain duplicated keywords. Its implementation checks both the list of flat keywords stored inside XMP (dc:subject) and the keyword list inside IPTC, but my testing shows that PSU does not currently (build 2066) retrieve the list of IPTC keywords correctly: the returned list always consists of only one keyword (see ticket #2837 in Mantis). Duplicated keywords inside XMP seem to be correctly identified.
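
For readers who want to see the idea behind the filter, here is a minimal sketch in Python. It is not the code of the attached .psc script and it does not use the PSU scripting API; the function names are made up for illustration, and it assumes that the flat keyword list has already been read into a plain list of strings and that two keywords count as duplicates only when they match exactly (case-sensitive).

```python
from collections import Counter


def duplicated_keywords(keywords):
    """Return the keywords that occur more than once, in order of first appearance.

    Assumption: keywords are compared as exact, case-sensitive strings.
    """
    counts = Counter(keywords)
    seen = set()
    duplicates = []
    for keyword in keywords:
        if counts[keyword] > 1 and keyword not in seen:
            seen.add(keyword)
            duplicates.append(keyword)
    return duplicates


def has_duplicated_keywords(keywords):
    """Filter predicate: True if at least one keyword is repeated."""
    return len(set(keywords)) < len(keywords)
```

An image would pass the filter when has_duplicated_keywords returns True for its keyword list. A list that always comes back with a single entry (as reported above for IPTC) can never satisfy this test, which is why only the XMP side is useful for now.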

Remove Duplicated Keywords removes all redundant (duplicated) flat keywords from XMP (dc:subject), for a selection of images (possibly filtered by the first script). Please note that:
- the lists of hierarchical keywords and of ICS tags are not checked or touched in any way, since redundancy within the flat keyword list does not typically imply redundancy at the hierarchical level. (That is, keywords and labels with identical names can usually be distinguished by their hierarchical info.)
- the list of IPTC keywords is not currently checked or touched, but I am open to adding this once bug #2387 gets fixed. (Such an enhancement would be especially useful if the solution eventually implemented for the issue of repeatedly added IPTC keywords (described here) does not also repair the metadata of images already affected by this bug.)
- this script implements (on demand) the optimization requested in ticket #2252, but the list of assigned labels is not checked or touched. Consequently, Hert's comments regarding #2252 still apply:
"Eliminating duplicates is a sub optimisation. [...] Duplicate catalog label names typically indicate that something should change in the catalog structure."
In particular, please note that if you run the script but keep the assignment of labels with identical names, then you may still end up with duplicated keywords the next time the metadata of the affected image(s) is saved (or write-synced). If you are willing to eliminate the root cause, i.e. the duplicated labels, then you might want to use the filter script to identify all affected images and then modify or revoke the duplicated labels as appropriate. (I could probably write another script to automatically tinker with duplicated labels, either by renaming some labels or by revoking some label assignments, but I'm fairly skeptical that a "one policy fits all" approach would work there.)
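
To illustrate the last point, here is a minimal sketch in Python (again, not the attached .psc script and not the PSU API). It assumes exact, case-sensitive matching and keeps the first occurrence of each keyword. The helper flat_keywords_from_labels is a purely hypothetical model of a write-sync in which every assigned label contributes its own name as a flat keyword; how PSU really derives keywords from labels may differ.

```python
def remove_duplicated_keywords(keywords):
    """Drop repeated keywords, keeping the first occurrence and the original order."""
    # dict preserves insertion order, so fromkeys() gives an ordered de-duplication.
    return list(dict.fromkeys(keywords))


def flat_keywords_from_labels(label_paths):
    """Hypothetical write-sync model: each label contributes its leaf name."""
    return [path[-1] for path in label_paths]


# Two different labels that happen to share the name "Paris" (made-up example).
labels = [("Places", "Paris"), ("People", "Paris")]

flat = flat_keywords_from_labels(labels)      # duplicated flat keyword
cleaned = remove_duplicated_keywords(flat)    # what the removal step produces
resynced = flat_keywords_from_labels(labels)  # labels unchanged, so duplicates return
```

Under this model the cleaned list survives only until the next save, because the two labels are still assigned. Renaming or revoking one of them is what makes the fix stick.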

I hope some of you will find one or both of the scripts helpful.
Attachments
Filter With Duplicated Keywords.psc
Filters images with duplicated keywords.
(2.78 KiB)
Remove Duplicated Keywords.psc
Removes all duplicated flat keywords from a selection of images.
(3.31 KiB)