Friday, October 19, 2012

Exporting Metadata from Canon ImageWare Document Manager

Note: Canon ImageWare Document Manager (iWDM) 4.1 Workgroup Edition was used for this analysis, but I expect the details are the same or very similar for other editions and minor versions.

[Updated 2015-04-22]

ImageWare is like many document management systems sold by scanner companies. They tend to focus on finding a place to store the mountains of scans that the scanner companies' devices can produce. They can be difficult to use and have a number of limitations compared with general document management solutions. As a result, many users of these systems would like to switch to more powerful and comprehensive products like SharePoint, FileHold, or Laserfiche.

The problem with making the switch is getting the thousands of documents from the old system into the new one. In the case of ImageWare there are some complications. All the documents are stored in a proprietary volume block file format with an .IMG extension. While Canon does provide a method for exporting these in their original source format, it requires a fair bit of manual effort for anything but the simplest repository, and it does not export the documents' metadata.

The metadata is often the most important part of these documents. It is typically created using optical character recognition when the document is scanned or it is manually entered when the document is filed. Either way it is a valuable commodity that should not be lost. The good news is that the metadata is stored in a Microsoft SQL Server database. With the right technical skills the metadata can be extracted and prepared to import into a new document management system.

iWDM stores all its files in cabinets. Each cabinet has one or more folders, and the folders can be nested. Under the covers, the cabinets are stored in the file system in a folder called iW DM Cabinet. Each cabinet is stored in a sub-folder named Cabinetx, where x is a number that is incremented by the system. If your document repository is in the default location on your C drive and you have a single cabinet called Accounting, you would find it in the following location.

C:\iW DM Cabinet\Cabinet1

You can find this information in the cabinet properties in the iWDM user interface, which show the actual location and names.

The Cabinet1 folder would have one or more sub-folders containing your files in the proprietary IMG format. At the root of the folder you will find the all-important Microsoft SQL Server database files. There should be four files with the extensions MDF or LDF. The MDF files are the database data files and the LDF files are the transaction log files. The file named iWDM_Accounting_Data.mdf is your accounting cabinet database. RM_Accounting.mdf is the database used for maintaining the full text search indexes.
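As a quick check before attaching anything to SQL Server, a small script can list the database files at the root of a cabinet folder. This is only a sketch: the function name is my own, and it assumes nothing beyond the MDF and LDF extensions described above.

```python
from pathlib import Path


def find_database_files(cabinet_dir):
    """Return the .mdf and .ldf files at the root of a cabinet folder, sorted by name.

    Returns an empty list if the folder does not exist.
    """
    root = Path(cabinet_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in {".mdf", ".ldf"}
    )


# Example call for the default location used in this post:
#   find_database_files(r"C:\iW DM Cabinet\Cabinet1")
```

If the list does not contain four files, the cabinet may live somewhere other than the default location, so check the cabinet properties first.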

After all that preamble we can get to the metadata, which is stored in the cabinet database. A quick view of the iWDM_Accounting_Data.mdf file with a tool like SysTools MDF Viewer will show a number of tables. The key table is Document. There is one row in this table for each document in the system. There are three key fields that will associate the database rows with the files that were exported. The first one is FolderIndex, the second is Name, and the third is Creator. The Creator is effectively the file extension such as .tif or .jpg. However, there is a special internal Canon image format with a creator value of .image. When these files are exported they will be converted into the image format you select.

Documents are stored in iWDM in a hierarchy that looks like a Windows folder structure with the cabinet at the root level. When you export a folder that structure is maintained when the documents are put into the Windows file system. The FolderIndex is the key to finding the folder the document will be in. It is a link into the Folder table. The Folder table includes the name of the folder and the tree structure. Folders have a FolderType column that can contain one of five values.

  • 0 - Cabinet
  • 2 - Main trash folder
  • 4 - Hidden user trash folder
  • 5 - Normal folder
  • 9 - Deleted folder
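When post-processing rows from the Folder table, it can help to give these codes names. The sketch below is an assumption-laden convenience: the enum member names are mine, and treating only cabinets and normal folders as holding exportable documents is my reading of the behaviour, not something Canon documents.

```python
from enum import IntEnum


class FolderType(IntEnum):
    """FolderType codes observed in the Folder table."""
    CABINET = 0
    MAIN_TRASH = 2
    USER_TRASH = 4   # hidden per-user trash folder
    NORMAL = 5
    DELETED = 9


def is_exportable_folder(folder_type):
    """True for folder types assumed to hold live, exportable documents."""
    return FolderType(folder_type) in (FolderType.CABINET, FolderType.NORMAL)
```

An unknown code raises ValueError, which is a useful signal that another edition or version uses values not seen here.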

Note that trash folders are a special case of folders. They are created automatically as needed by the system. When a document is deleted it gets moved to a corresponding trash folder. It still exists in the Document table, but it is not possible to export it any longer. When the document is deleted from the trash folder it is removed from the Document table. The trash folder you see in the user interface contains a hidden trash folder for each user. When a document or folder moves to the trash folder the Location column gets changed from 0 to 2.

All that remains to match the document in ImageWare to the file that was exported is the filename. The filename in iWDM is stored in two columns. The base name is in the Name field and the file extension is in the Creator field.
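The filename rule can be sketched in a few lines. The function and parameter names are hypothetical, and the sketch assumes the Creator value includes the leading dot, as in the .tif and .jpg examples above. For the internal .image creator, the extension is whatever image format you chose at export time, so it is passed in.

```python
def exported_filename(name, creator, image_export_ext=".tif"):
    """Combine the Name and Creator columns into the exported file name.

    image_export_ext is the format selected in the export dialog for documents
    stored in Canon's internal image format (Creator value ".image").
    """
    ext = creator.strip()
    if ext.lower() == ".image":
        ext = image_export_ext
    if ext and not ext.startswith("."):
        ext = "." + ext  # tolerate a Creator value stored without the dot
    return name + ext
```

For example, a row with Name "Invoice 42" and Creator ".image" exported as JPEG would be matched against "Invoice 42.jpg".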

Now that we have the basic relationship between the exported file and the database we can start to find the associated metadata. There are three sources of metadata in iWDM: document properties, system index, and user index. The first two can be found in the Document table. Document properties like author or create date are available as columns of this table. System index refers to three predefined category fields. An index to the category values for each document is stored as Category1, Category2, and Category3 fields with each document. The index provides a reference to the values in the Category table.

The user index is slightly more complicated. The Document table provides no direct link to the user index. This work is done by the DocUserIndex table that provides a multi-way link to the Document table, the UserIndex table, and then to one of several user index data tables. There are eight different types of user indexes and six corresponding value tables.

User Index Data Type      Value Table              User Index Type
fixed string              FixedStringIndexValue    0
fixed maximum string      FixedStringIndexValue    1
variable string           StringIndexValue         2
date                      DateIndexValue           3
integer                   IntIndexValue            4
unsigned integer          IntIndexValue            5
floating point decimal    FloatIndexValue          6
boolean                   BoolIndexValue           7
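The same mapping, expressed as a lookup that an extraction script could use to decide which value table to read for a given UserIndexType. The dictionary and function names are my own.

```python
# UserIndexType -> (data type description, value table), as observed in iWDM 4.1.
USER_INDEX_TYPES = {
    0: ("fixed string", "FixedStringIndexValue"),
    1: ("fixed maximum string", "FixedStringIndexValue"),
    2: ("variable string", "StringIndexValue"),
    3: ("date", "DateIndexValue"),
    4: ("integer", "IntIndexValue"),
    5: ("unsigned integer", "IntIndexValue"),
    6: ("floating point decimal", "FloatIndexValue"),
    7: ("boolean", "BoolIndexValue"),
}


def value_table_for(user_index_type):
    """Return the name of the value table holding values of this user index type."""
    try:
        return USER_INDEX_TYPES[user_index_type][1]
    except KeyError:
        raise ValueError(f"unknown user index type: {user_index_type}") from None
```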

There is a special case when a user index has been defined as a "selectable list". In this instance the administrator predefines the possible values when the system is set up. The user can only choose from the list when they set the value of the user index. These fields have the UserIndexValueType set to 1; all other types are set to 0. The corresponding value table contains each possible value in the selectable list regardless of whether or not any documents have been assigned the value. Boolean indexes are a special case as they are always a selectable list with the values TRUE and FALSE predefined.

As an example, the following SQL query will return all string user index values for the given document:


-- String user index values (types 0 to 2) for one document in one folder.
SELECT d.Name AS [Doc Name], f.Name AS [Folder Name],
   u.Name AS [Index Name], u.UserIndexType AS [Type],
   fs.Value AS [Fstr (0-1)], s.Value AS [Str (2)]
FROM DocUserIndex di
   LEFT JOIN Document d ON di.DocumentIndex = d.DocumentIndex
      AND di.FolderIndex = d.FolderIndex
   LEFT JOIN Folder f ON f.FolderIndex = d.FolderIndex
   LEFT JOIN UserIndex u ON u.UserIndexId = di.UserIndexId
   -- Fixed strings (types 0 and 1) and variable strings (type 2) live in separate tables.
   LEFT JOIN FixedStringIndexValue fs ON di.ValueId = fs.ValueId
      AND di.UserIndexId = fs.UserIndexId
   LEFT JOIN StringIndexValue s ON di.ValueId = s.ValueId
      AND di.UserIndexId = s.UserIndexId
WHERE d.Name = 'document name' AND f.Name = 'folder name'
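The same pattern extends to the other value tables by adding one LEFT JOIN per table. The Python sketch below generates such a query. It assumes, without my having verified every table, that the date, integer, float and boolean value tables use the same ValueId, UserIndexId and Value columns as the two string tables. The output column aliases are my own, and the ? markers are the parameter style used by ODBC drivers such as pyodbc.

```python
# Value tables named in this post, mapped to output column aliases (aliases are mine).
VALUE_TABLES = {
    "FixedStringIndexValue": "FixedStr",   # user index types 0 and 1
    "StringIndexValue": "Str",             # type 2
    "DateIndexValue": "DateVal",           # type 3
    "IntIndexValue": "IntVal",             # types 4 and 5
    "FloatIndexValue": "FloatVal",         # type 6
    "BoolIndexValue": "BoolVal",           # type 7
}


def build_user_index_query():
    """Build a parameterised query returning every user index value of one document."""
    columns = [
        "d.Name AS DocName",
        "f.Name AS FolderName",
        "u.Name AS IndexName",
        "u.UserIndexType AS IndexType",
    ]
    joins = []
    for n, (table, alias) in enumerate(VALUE_TABLES.items()):
        t = f"v{n}"  # short table alias: v0, v1, ...
        columns.append(f"{t}.Value AS {alias}")
        # Assumption: every value table is keyed by ValueId and UserIndexId.
        joins.append(
            f"LEFT JOIN {table} {t} ON di.ValueId = {t}.ValueId "
            f"AND di.UserIndexId = {t}.UserIndexId"
        )
    return "\n".join(
        [
            "SELECT " + ",\n   ".join(columns),
            "FROM DocUserIndex di",
            "JOIN Document d ON di.DocumentIndex = d.DocumentIndex",
            "   AND di.FolderIndex = d.FolderIndex",
            "JOIN Folder f ON f.FolderIndex = d.FolderIndex",
            "JOIN UserIndex u ON u.UserIndexId = di.UserIndexId",
            *joins,
            "WHERE d.Name = ? AND f.Name = ?",
        ]
    )
```

Each result row carries one user index, with the value in whichever column matches its type and NULL in the rest. Keeping the values in separate columns avoids implicit type conversions between strings, dates and numbers.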


That is all that is needed to extract the metadata for each of your documents. Each comprehensive document management system has its own method of importing documents and metadata. For example, in FileHold you would create a document.xml file with the metadata and the document locations at the root document folder and import it using managed imports.

It is unfortunate that the documents are stored in a proprietary format. This makes it difficult to automate the entire process. It appears as if the only change to the file is that iWDM adds 54 bytes to the front of the file. It may just be a simple matter of stripping this data off, but that is an investigation for another day. If you do export the documents using the iWDM export function you will likely need to reformat or compress the output files using a tool like Batch TIFF Resizer or TIFF Junction.

[Update]

I have taken a closer look at the proprietary IMG files. As I suspected, removing the header (55 bytes) of the IMG file reveals the embedded file. At least this is true for one type of IMG file; it turns out there are two: modifiable and non-modifiable. The modifiable version contains a single file, and stripping the first 55 bytes off will give the original. The non-modifiable format can contain multiple files, and the 55 byte rule only works if there is exactly one file in the IMG.
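For the single-file case, the stripping step is trivial. A minimal sketch, assuming an unencrypted IMG that holds exactly one embedded file (the function names are hypothetical):

```python
HEADER_SIZE = 55  # header length observed on iWDM 4.1 IMG files


def strip_img_header(img_bytes):
    """Return the embedded file from an IMG that holds exactly one file."""
    if len(img_bytes) < HEADER_SIZE:
        raise ValueError("too short to be an IMG file")
    return img_bytes[HEADER_SIZE:]


def extract_single_file(img_path, out_path):
    """Read an IMG file from disk and write the embedded file to out_path."""
    with open(img_path, "rb") as src:
        payload = strip_img_header(src.read())
    with open(out_path, "wb") as dst:
        dst.write(payload)
```

Always work on a copy of the cabinet folder, never on the live repository.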

I did dig into the format a little deeper and there are a few interesting titbits.

The first 32 bytes of the file appear to be the volume name: a null-terminated ASCII string of up to 31 characters. This is followed by 5 bytes whose purpose is not known to me. Next we have what appear to be four 32-bit little-endian integers: a pointer to the end of the file, the IMG file number, the length of the first embedded file, and a pointer to the IMG file number (it always seems to be 41). Finally we have "VU" (hex 5655).
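This layout adds up to 55 bytes (32 + 5 + 16 + 2) and maps directly onto a Python struct format. The sketch below reflects my reading of the header, not a published specification. The field names are my own, and treating the integers as unsigned is an assumption.

```python
import struct
from typing import NamedTuple

# 32-byte name, 5 unknown bytes, four little-endian 32-bit integers, "VU" marker.
HEADER = struct.Struct("<32s5s4I2s")  # HEADER.size is 55


class ImgHeader(NamedTuple):
    volume_name: str
    unknown: bytes
    end_pointer: int
    file_number: int
    first_length: int
    file_number_pointer: int


def parse_img_header(data):
    """Decode the 55-byte primary header at the start of an IMG file."""
    if len(data) < HEADER.size:
        raise ValueError("too short to contain an IMG header")
    raw_name, unknown, end_ptr, file_no, length, no_ptr, marker = HEADER.unpack_from(data)
    if marker != b"VU":
        raise ValueError("header does not end with 'VU'")
    # The name is null terminated; drop the terminator and any padding after it.
    name = raw_name.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    return ImgHeader(name, unknown, end_ptr, file_no, length, no_ptr)
```

The IMG file number starts at byte offset 41 (32 + 5 + 4), which matches the value 41 stored in the fourth integer.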

In the case of the modifiable IMG files the image file number is translated to base 36 and used for the file name. For the non-modifiable format the image file number relates to the file number in the IMG file. The file numbers start at 0.
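Python can parse base 36 with int(text, 36), but has no built-in for the other direction, so a small helper is needed. The sketch assumes upper-case digits and no zero padding; I have not confirmed the exact case or padding iWDM uses for its file names.

```python
import string

DIGITS = string.digits + string.ascii_uppercase  # "0".."9" then "A".."Z"


def to_base36(number):
    """Convert a non-negative integer to its base 36 representation."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))
```

Going the other way, int("10", 36) gives 36, so a file name can be converted back to the file number stored in the header.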

For the non-modifiable format there is a secondary header at the start of the second and subsequent files. It starts with three 32 bit integers. The first one is the IMG file number, the second is the length of the embedded file, and the third is a pointer to the IMG file number. As before, the header ends with "VU". I have not investigated, but I suspect there is a flag in the header somewhere for deleted files.
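Putting the two headers together, walking a non-modifiable IMG might look like the sketch below. It rests on assumptions I have not confirmed: that the first embedded file starts right after the 55-byte primary header, that each later record follows the previous file's data immediately, and that a secondary header is exactly 14 bytes (three 32-bit integers plus "VU"). It also ignores any deleted-file flag.

```python
import struct

PRIMARY = struct.Struct("<32s5s4I2s")  # 55-byte header at the start of the IMG
SECONDARY = struct.Struct("<3I2s")     # assumed 14-byte header before later files


def iter_embedded_files(data):
    """Yield (file_number, file_bytes) for each embedded file in an IMG.

    Assumes records are stored back to back with no padding between them.
    """
    _, _, _, file_no, length, _, marker = PRIMARY.unpack_from(data, 0)
    if marker != b"VU":
        raise ValueError("primary header does not end with 'VU'")
    offset = PRIMARY.size
    while True:
        yield file_no, data[offset:offset + length]
        offset += length
        if offset + SECONDARY.size > len(data):
            break  # not enough bytes left for another header
        file_no, length, _, marker = SECONDARY.unpack_from(data, offset)
        if marker != b"VU":
            break  # not a header; stop rather than guess
        offset += SECONDARY.size
```

If the file numbers that come back do not start at 0 and count upwards, treat that as a sign the assumptions above do not hold for your files.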

There is an option to encrypt the IMG file. If encryption is in place, all bets are off for recovering the embedded files.

For the embedded files the format seems unchanged with the exception of images. If the images are converted to the iWDM image format they still seem to be stored in their original format. You can check the file signatures at www.filesignatures.net or similar sites to confirm this even though you have lost the original image type in the database. Interestingly, binders are stored as self-extracting executables. When you run the binder code it extracts a PDF file with each of the images in the binder on its own page.
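The signature check can also be automated. The sketch below covers only a handful of well-known signatures that are likely to turn up in a scanning repository. The list is my own selection and is far from complete.

```python
# (signature bytes at the start of the file, file type)
SIGNATURES = [
    (b"%PDF", "pdf"),
    (b"II*\x00", "tiff"),          # little-endian TIFF
    (b"MM\x00*", "tiff"),          # big-endian TIFF
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"BM", "bmp"),
    (b"MZ", "exe"),                # Windows executable, e.g. a binder
]


def sniff_type(payload):
    """Guess the type of an embedded file from its leading bytes, or return None."""
    for signature, kind in SIGNATURES:
        if payload.startswith(signature):
            return kind
    return None
```

A result of None just means the signature is not in this short list. Two-byte signatures such as "BM" and "MZ" can also match by coincidence, so treat those results as hints.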

16 comments :

  1. Hey Russ,

    Love this article as the Canon DMS is one system we are looking at.

    If you had another program, a 3rd party, do you believe it would be relatively easy to import data to the Canon DMS DB directly?

    We like the Canon DMS for some things and would be willing to consider it as a possible solution, however, we want to automate the data entry as much as we can. Most of our forms are hand written, but indexing data (metadata) is encoded uniquely to each document through a barcode on the document. So the hope is a program would read the barcode, append the metadata to the PDF and get stored in the DMS. The catch is, Canon's ImageWare does not work with 2d barcodes, nor can it parse a 1d barcode; it is stuck on the ideology of "one barcode, one piece of data".

    There is a wonderful program I found that can do exactly what I want to input data, but does not come with a user-friendly UI for recalling the stored documents. However, it does have the flexibility to connect to any DB - sharepoint (SQL), mySQL, Oracle, Access, Excel, etc... It also has the ability to put any of the scanned or parsed data, into any table and any column of the table, in addition to relocating the scanned document to a physical location.

    So, with what you have been able to decipher of the Canon system, is it possible to use a 3rd party program to load the Canon DMS?

    Replies
    1. Hi,

      Unfortunately I did not investigate getting information into the Canon DMS. What I did see was fairly proprietary looking, so if the Canon front end does not let you read barcodes you are probably stuck.

      The reason I did this research was for the benefit of a customer that was trying hard to get away from Canon. They had been using it for about five years and did not like it very much.

      I think the biggest problem they had was that they could never get it configured to do many of the things they wanted. As soon as they tried to do anything interesting it would crash. After a while they just stuck to the most basic usage until they could get off it. Then the problem was how they would get their documents and metadata out of it.

      If you go with a product other than Canon you can easily use a front end solution that will scan your 2D barcode and break out the metadata fields. Many other document management systems will take metadata input in an XML file from a scanning solution.

      I am curious, what is the name of the "wonderful program"?

    2. Canon does offer a front end capture application called Scan Manager. Scan Manager can collect "zone OCR" data or "bar code" recognition to release as a user index for page/document retrieval.

  3. Incidentally, I have done the work to get the images out of the .img files and link properly to the metadata, including their whole "tree" structure. Willing to do again if anyone needs, contact me for a quote. phil@twopoint.com

  4. Russ, did you ever figure out how to work with the .Img file? I tried trimming off the first 54 bytes, but the file was still invalid.

    Replies
    1. I did not investigate it in too much detail after I realized I could export all the documents without too much trouble. It did mean it was a two step process though. One to get the files out and compressed and one to get the metadata out. Fortunately it was fairly easy to link the metadata to the files that were exported. We were doing less than 100000 documents, so it was not too big a burden.

      So far, this is the only iWDM conversion we have had to do, so it was not worth looking any closer. If we were going to look any closer we would probably start with non-image formats and compare the IMG files with the original files. Actual image files might be a little tougher as iWDM has its own internal format for those. It is likely based on TIFF, but who knows. I have come across other vendors that take a standard format like TIFF and tweak it a little or a lot.

      I notice another poster above offering to do a conversion of the IMG format, though I cannot attest to the quality or efficacy of the service.

    2. Thank you for the reply. Using the information from this page, I was able to figure out how to link everything together using the file export and the database queries. I really appreciate you writing this. It is the only good information that I found on the net.

      After I was able to link everything together and get it imported into OnBase using the OnBase DIP process, a big curve ball was thrown at me. I found out that the ImageWare export does not include all of the actual files. In ImageWare 4.x files can be stored in binders. For example, a binder can be named GTE, and inside the binder, there may be a .tif and an .xls. The ImageWare file export produces a PDF that contains the information of both of those files, but not the original source documents. I searched the exported files, and those documents are nowhere to be found. The only way to get at them is to go into the ImageWare document manager, double click on the binder (which looks like a single document), then select the icon which represents the .xls, then File->"Export Source Document". That way, our analysts can get access to the actual document in its original form.

      This is where my interest in accessing the .Img comes in. I need to locate the source documents that are not included in the export.

      Thank you again. You probably saved me weeks of work by sharing your experience.

    3. It is unfortunate about binders. We were lucky not to have that problem in the source data we were working with. I am glad this post was able to save you some time.

    4. I figured out a fair bit more about the IMG files and updated the main post.

    5. Thank you. I will dig back into this.

  5. Hey all, sorry for the sideways question but this is the most knowledgeable/applicable thread I've found on the general Internet. On behalf of a client, and as an ImageWare novice, my question: is there a mechanism for bulk exporting files as PDF out of ImageWare, other than selecting a bunch, right clicking, and choosing to Export from the context menu within IWDM? Also, any tips or tricks from the community for speeding up the export to PDF process? Getting between 500 Kb/min and 2 Mb/min export speeds right now (!) . Are there other threads/forums/whitepapers I might find insight in? Appreciate any pointers from seasoned ImageWare folks!

    Thanks, take care,
    Nick

    Replies
    1. Also, for files which mysteriously fail to export to PDF, does anyone have any insight in troubleshooting? Nothing useful is logging anywhere obvious. Anyone know if there is a [verbose] log somewhere which records events relevant to export?

      Hope everyone is well,
      Nick

    2. For each cabinet the best bulk export you can do is the top level folders. Select the folder, choose File > Export > Export in this format. If you just have one folder at the top level this will not be too bad.

      You must bear in mind that there is a default limit of 1000 items per folder for the export. You could have 1000 sub-folders with 1000 items each and that is okay, but as soon as you have a folder with 1001 or more you will get a warning that only 1000 files will be exported. We divided a few folders up before we exported. You can also change the maximum number of documents to display in the system settings. I have no idea what the practical limit is for this value.

      I cannot really speak to the speed. Likely that is dependent on your document size and server performance. We exported 36 GB of files over the course of several hours. They expanded to 360 GB after they were output. Most were TIFFs, so they were highly compressed in iWDM and exported without any compression by iWDM.

      The whole reason I wrote this article was because I could not find any info on this from anywhere.

  6. I know that this thread is old, but we are working on a similar export project.

    We are running ImageWare Document Manager but are not able to locate the File > Export option. Can someone help with where to look for this option?

    Regards
    Charan
