Going for a walk with the dogs: Exporting Metadata from Canon ImageWare Document Manager

Friday, October 19, 2012

Note: Canon ImageWare Document Manager (iWDM) 4.1 Workgroup Edition was used for this analysis, but I expect the details are the same or very similar for other editions and minor versions.

[Updated 2015-04-22]

ImageWare is like many document management systems sold by scanner companies. They tend to be focused on finding a place to store the mountains of scans the scanner companies' devices can produce. They can be difficult to use and have a number of limitations compared with general document management solutions. As a result, many users of these systems would like to switch to more powerful and comprehensive products like SharePoint, FileHold, or Laserfiche.

The problem with making the switch is getting the thousands of documents from the old system into the new one. In the case of ImageWare there are some complications. All the documents are stored in a proprietary volume block file format with an .IMG extension. While Canon does provide a method for exporting these in their original source format, it requires a fair bit of manual effort for anything but the simplest repository, and it does not export the documents' metadata.

The metadata is often the most important part of these documents. It is typically created using optical character recognition when the document is scanned or it is manually entered when the document is filed. Either way it is a valuable commodity that should not be lost. The good news is that the metadata is stored in a Microsoft SQL Server database. With the right technical skills the metadata can be extracted and prepared to import into a new document management system.

iWDM stores all its files in a cabinet. Each cabinet has one or more folders. The folders can be nested. Under the covers, the cabinets are stored in the file system in a folder called iW DM Cabinet. Each cabinet is stored in a sub-folder named Cabinetx, where x is a number that is incremented by the system. If your document repository is in the default location on your C drive and you have a single cabinet called Accounting, you would find it in the following location.

C:\iW DM Cabinet\Cabinet1

You can confirm the actual location and names in the cabinet properties in the iWDM user interface.

The Cabinet1 folder would have one or more sub-folders containing your files in the proprietary IMG format. At the root of the folder you will find the all-important Microsoft SQL Server database files. There should be four files with the extensions MDF or LDF. The MDF files are the database data files and the LDF files are the transaction log files. The file named iWDM_Accounting_Data.mdf is your accounting cabinet database. RM_Accounting.mdf is the database used for maintaining the full text search indexes.
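As a quick check before opening anything in a viewer, a short script can list the database and log files at the root of a cabinet folder. This is only a sketch in Python: the function name is my own, and the path in the usage comment is just the default location from the example above.

```python
from pathlib import Path


def find_cabinet_databases(cabinet_dir):
    """Return the names of the .mdf and .ldf files at the root of a cabinet folder.

    Only the top level is searched, because the database files sit at the
    root of the cabinet folder rather than in the IMG sub-folders.
    """
    found = {"mdf": [], "ldf": []}
    for entry in Path(cabinet_dir).iterdir():
        ext = entry.suffix.lower().lstrip(".")
        if entry.is_file() and ext in found:
            found[ext].append(entry.name)
    return found


# Usage (adjust the path for your installation):
# find_cabinet_databases(r"C:\iW DM Cabinet\Cabinet1")
```

If the result does not show two MDF files and two LDF files, it is worth checking the cabinet properties again before going further.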

After all that preamble we can get to the metadata, which is stored in the cabinet database. A quick view of the iWDM_Accounting_Data.mdf file with a tool like SysTools MDF Viewer will show a number of tables. The key table is Document. There is one row in this table for each document in the system. There are three key fields that will associate the database rows with the files that were exported. The first one is FolderIndex, the second is Name, and the third is Creator. The Creator is effectively the file extension such as .tif or .jpg. However, there is a special internal Canon image format with a creator value of .image. When these files are exported they will be converted into the image format you select.

Documents are stored in iWDM in a hierarchy that looks like a Windows folder structure with the cabinet at the root level. When you export a folder that structure is maintained when the documents are put into the Windows file system. The FolderIndex is the key to finding the folder the document will be in. It is a link into the Folder table. The Folder table includes the name of the folder and the tree structure. Folders have a FolderType column that can contain one of five values.

0 - Cabinet
2 - Main trash folder
4 - Hidden user trash folder
5 - Normal folder
9 - Deleted folder
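For anyone scripting against the Folder table, the codes above can be captured in a small lookup. The sketch below is in Python; the member names and the helper function are my own labels, not anything defined by iWDM.

```python
from enum import IntEnum


class FolderType(IntEnum):
    """FolderType codes from the list above (names are illustrative labels)."""
    CABINET = 0
    MAIN_TRASH = 2
    USER_TRASH = 4   # hidden per-user trash folder
    NORMAL = 5
    DELETED = 9


def is_trash_or_deleted(folder_type):
    """Return True for trash and deleted folders.

    Raises ValueError for a code that is not in the list above.
    """
    return FolderType(folder_type) in (
        FolderType.MAIN_TRASH,
        FolderType.USER_TRASH,
        FolderType.DELETED,
    )
```

Raising on an unknown code is deliberate: if another edition uses a value not listed here, it is better to find out than to silently treat it as a normal folder.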

Note that trash folders are a special case of folders. They are created automatically as needed by the system. When a document is deleted it gets moved to a corresponding trash folder. It still exists in the Document table, but it is no longer possible to export it. When the document is deleted from the trash folder it is removed from the Document table. The trash folder you see in the user interface contains a hidden trash folder for each user. When a document or folder moves to the trash folder the Location column gets changed from 0 to 2.

All that remains to match the document in ImageWare to the file that was exported is the filename. The filename in iWDM is stored in two columns. The base name is in the Name field and the file extension is in the Creator field.
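The matching rule can be written as a small helper. This is a sketch in Python under my own naming; the image_export_ext parameter stands for whatever image format was selected at export time, and the handling of a Creator value without a leading dot is an assumption rather than something I have seen in the data.

```python
def exported_filename(name, creator, image_export_ext=".tif"):
    """Build the expected exported file name from the Name and Creator columns.

    Creator normally holds the extension, for example ".tif". The special
    value ".image" marks the internal Canon format, which is converted on
    export, so the extension chosen at export time is used instead.
    """
    ext = creator.strip()
    if ext.lower() == ".image":
        ext = image_export_ext
    if ext and not ext.startswith("."):
        ext = "." + ext  # assumption: tolerate a Creator stored without the dot
    return name + ext
```

For example, exported_filename("Invoice 001", ".tif") gives "Invoice 001.tif", while a row with Creator ".image" exported as JPEG would be looked up with image_export_ext=".jpg".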

Now that we have the basic relationship between the exported file and the database we can start to find the associated metadata. There are three sources of metadata in iWDM: document properties, system index, and user index. The first two can be found in the Document table. Document properties like author or create date are available as columns of this table. System index refers to three predefined category fields. An index to the category values for each document is stored as Category1, Category2, and Category3 fields with each document. The index provides a reference to the values in the Category table.

The user index is slightly more complicated. The Document table provides no direct link to the user index. This work is done by the DocUserIndex table that provides a multi-way link to the Document table, the UserIndex table, and then to one of several user index data tables. There are eight different types of user indexes and six corresponding value tables.

User Index Data Type	Value Table	User Index Type
fixed string	FixedStringIndexValue	0
fixed maximum string	FixedStringIndexValue	1
variable string	StringIndexValue	2
date	DateIndexValue	3
integer	IntIndexValue	4
unsigned integer	IntIndexValue	5
floating point decimal	FloatIndexValue	6
boolean	BoolIndexValue	7

There is a special case when a user index has been defined as a "selectable list". In this instance the administrator predefines the possible values when the system is set up. The user can only choose from the list when they set the value of the user index. These fields have the UserIndexValueType set to 1; all other types are set to 0. The corresponding value table contains each possible value in the selectable list regardless of whether or not any documents have been assigned the value. Boolean indexes are a special case as they are always a selectable list with the values TRUE and FALSE predefined.

As an example, the following SQL query will return all string user index values for the given document:

SELECT d.Name AS [Doc Name], f.Name AS [Folder Name],
       u.Name AS [Index Name], u.UserIndexType AS [Type],
       fs.Value AS [Fstr (0-1)], s.Value AS [Str (2)]
FROM DocUserIndex di
LEFT JOIN Document d ON di.DocumentIndex = d.DocumentIndex
    AND di.FolderIndex = d.FolderIndex
LEFT JOIN Folder f ON f.FolderIndex = d.FolderIndex
LEFT JOIN UserIndex u ON u.UserIndexId = di.UserIndexId
LEFT JOIN FixedStringIndexValue fs ON di.ValueId = fs.ValueId
    AND di.UserIndexId = fs.UserIndexId
LEFT JOIN StringIndexValue s ON di.ValueId = s.ValueId
    AND di.UserIndexId = s.UserIndexId
WHERE d.Name = 'document name' AND f.Name = 'folder name'
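The query above covers only the two string value tables. A sketch of how the same join could be generated for any of the eight user index types follows, in Python. The function name is my own, and it assumes the other value tables share the ValueId, UserIndexId, and Value columns, which the query above shows only for the string tables.

```python
# User index type -> value table, taken from the table earlier in this post.
VALUE_TABLE_BY_TYPE = {
    0: "FixedStringIndexValue",  # fixed string
    1: "FixedStringIndexValue",  # fixed maximum string
    2: "StringIndexValue",       # variable string
    3: "DateIndexValue",         # date
    4: "IntIndexValue",          # integer
    5: "IntIndexValue",          # unsigned integer
    6: "FloatIndexValue",        # floating point decimal
    7: "BoolIndexValue",         # boolean
}


def value_query(user_index_type):
    """Return a T-SQL query that reads the values for one user index type.

    Assumption: every value table has ValueId, UserIndexId and Value columns.
    """
    table = VALUE_TABLE_BY_TYPE[user_index_type]  # KeyError for unknown types
    return (
        "SELECT di.DocumentIndex, di.FolderIndex, u.Name, v.Value\n"
        "FROM DocUserIndex di\n"
        "JOIN UserIndex u ON u.UserIndexId = di.UserIndexId\n"
        f"JOIN {table} v ON v.ValueId = di.ValueId\n"
        "    AND v.UserIndexId = di.UserIndexId\n"
        f"WHERE u.UserIndexType = {int(user_index_type)}"
    )
```

Running one query per type keeps each result set to a single value data type, which tends to be easier to load into an import file than one wide result with a column per value table.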

That is all that is needed to extract the metadata for each of your documents. Each comprehensive document management system has its own method of importing documents and metadata. For example, in FileHold you would create a document.xml file with the metadata and the document locations at the root document folder and import it using managed imports.

It is unfortunate that the documents are stored in a proprietary format. This makes it difficult to automate the entire process. It appears that the only change to the file is that iWDM adds a short header to the front: I first counted 54 bytes, and the update below puts it at 55. It may just be a simple matter of stripping this data off, but that is an investigation for another day. If you do export the documents using the iWDM export function you will likely need to reformat or compress the output files using a tool like Batch TIFF Resizer or TIFF Junction.

[Update]

I have taken a little closer look at the proprietary IMG files. As I suspected, removing the header (55 bytes) of the IMG will reveal the embedded file. At least this is true for one type of IMG file; it turns out there are two: modifiable and non-modifiable. The modifiable version contains a single file, and stripping the first 55 bytes off will give the original. The non-modifiable format can contain multiple files, and the 55 byte rule only works if there is exactly one file in the IMG.

I did dig into the format a little deeper and there are a few interesting titbits.

The first 32 bytes of the file appear to be the volume name: a 31 character ASCII null terminated string. This is followed by 5 bytes whose purpose is not known to me. Next we have what appears to be four 32 bit integers in little endian format: a pointer to the end of the file, the IMG file number, the length of the first embedded file, and a pointer to the IMG file number (always seems to be 41). Finally we have "VU" (hex 5655). That adds up to the 55 byte header: 32 + 5 + 16 + 2.
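The layout just described maps directly onto a struct format string. Below is a minimal sketch in Python, assuming the integers are unsigned (I have not checked) and using field names of my own choosing. It is based only on my reading of a few files, so treat it as a starting point rather than a specification.

```python
import struct
from collections import namedtuple

# 32-byte name, 5 unknown bytes, four little-endian 32-bit integers, "VU".
HEADER_FORMAT = "<32s5sIIII2s"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 32 + 5 + 16 + 2 = 55

ImgHeader = namedtuple(
    "ImgHeader",
    "volume_name unknown end_pointer file_number first_length number_pointer",
)


def parse_img_header(data):
    """Decode the leading header from the first 55 bytes of an IMG file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to hold an IMG header")
    name, unknown, end, number, length, pointer, magic = struct.unpack_from(
        HEADER_FORMAT, data)
    if magic != b"VU":
        raise ValueError("header does not end with 'VU'")
    # The name is null terminated; keep only the part before the first null.
    volume = name.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    return ImgHeader(volume, unknown, end, number, length, pointer)


def strip_header(data):
    """Return everything after the header (valid for single-file IMGs only)."""
    parse_img_header(data)  # raises if the header does not look right
    return data[HEADER_SIZE:]
```

Checking for the "VU" marker before stripping is a cheap guard against encrypted or otherwise unexpected files.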

In the case of the modifiable IMG files the image file number is translated to base 36 and used for the file name. For the non-modifiable format the image file number relates to the file number in the IMG file. The file numbers start at 0.
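The base 36 translation is easy to reproduce. Here is a short sketch in Python; the digit order (0-9 followed by A-Z), the use of upper case, and the absence of zero padding are my assumptions, so compare the result against real file names before relying on it.

```python
import string

BASE36_DIGITS = string.digits + string.ascii_uppercase  # 0-9 then A-Z


def to_base36(number):
    """Convert a non-negative integer to a base 36 string."""
    if number < 0:
        raise ValueError("file numbers are not negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))
```

Going the other way needs no helper, since Python's built-in int("10", 36) already parses base 36 strings regardless of case.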

For the non-modifiable format there is a secondary header at the start of the second and subsequent files. It starts with three 32 bit integers. The first one is the IMG file number, the second is the length of the embedded file, and the third is a pointer to the IMG file number. As before, the header ends with "VU". I have not investigated, but I suspect there is a flag in the header somewhere for deleted files.

There is an option to encrypt the IMG file. If encryption is in place, all bets are off for recovering the embedded files.

For the embedded files the format seems unchanged, with the exception of images. If the images are converted to the iWDM image format they still seem to be stored in their original format. You can check the file signatures at www.filesignatures.net or similar sites to confirm this even though you have lost the original image type in the database. Interestingly, binders are stored as self-extracting executables. When you run the binder code it extracts a PDF file with each of the images in the binder on its own page.
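The signature check can also be done in a few lines of code once the header has been stripped. This Python sketch covers only a handful of well-known signatures; the list and the labels are my own selection, and treating "MZ" as a binder assumes the self-extracting executables are ordinary Windows executables.

```python
# (leading bytes, label) pairs for a few common file signatures.
SIGNATURES = [
    (b"II*\x00", "tiff"),           # TIFF, little endian
    (b"MM\x00*", "tiff"),           # TIFF, big endian
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"BM", "bmp"),
    (b"%PDF", "pdf"),
    (b"MZ", "exe"),                 # Windows executable, possibly a binder
]


def sniff_type(data):
    """Return a label for the embedded file based on its leading bytes.

    Returns None when no known signature matches.
    """
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return None
```

A None result is a hint to look the leading bytes up by hand, or that the IMG file is encrypted.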

16 comments :

Technogetit, October 30, 2012 at 10:22 AM
Hey Russ,

Love this article as the Canon DMS is one system we are looking at.

If you had another program, a 3rd party, do you believe it would be relatively easy to import data to the Canon DMS DB directly?

We like the Canon DMS for some things and would be willing to consider it as a possible solution, however, we want to automate the data entry as much as we can. Most of our forms are hand written, but indexing data (metadata) is encoded uniquely to each document through a barcode on the document. So the hope is a program would read the barcode, append the metadata to the PDF and get stored in the DMS. The catch is, Canon's ImageWare does not work with 2d barcodes, nor can it parse a 1d barcode; it is stuck on the ideology of "one barcode, one piece of data".

There is a wonderful program I found that can do exactly what I want to input data, but does not come with a user-friendly UI for recalling the stored documents. However, it does have the flexibility to connect to any DB - sharepoint (SQL), mySQL, Oracle, Access, Excel, etc... It also has the ability to put any of the scanned or parsed data, into any table and any column of the table, in addition to relocating the scanned document to a physical location.

So, with what you have been able to decipher of the Canon system, is it possible to use a 3rd party program to load the Canon DMS?

Unknown, July 2, 2014 at 5:52 AM
Incidentally, I have done the work to get the images out of the .img files and link properly to the metadata, including their whole "tree" structure. Willing to do again if anyone needs, contact me for a quote. phil@twopoint.com

Unknown, April 6, 2015 at 7:54 AM
Russ, did you ever figure out how to work with the .Img file? I tried trimming off the first 54 bytes, but the file was still invalid.

Precursive, April 21, 2015 at 7:51 PM
Hey all, sorry for the sideways question but this is the most knowledgeable/applicable thread I've found on the general Internet. On behalf of a client, and as an ImageWare novice, my question: is there a mechanism for bulk exporting files as PDF out of ImageWare, other than selecting a bunch, right clicking, and choosing to Export from the context menu within IWDM? Also, any tips or tricks from the community for speeding up the export to PDF process? Getting between 500 Kb/min and 2 Mb/min export speeds right now (!) . Are there other threads/forums/whitepapers I might find insight in? Appreciate any pointers from seasoned ImageWare folks!

Thanks, take care,
Nick

Unknown, April 19, 2016 at 11:29 AM
I know that this thread is old, but we are working on a similar export project.

We are running iMageware Document Manager but are not able to locate the File > Export option. Can someone help with where to look for this option?

Regards
Charan