Motivation

Transportable databases are one of the easiest and most fundamental methods of sharing InfoWorks ICM and InfoNet data. Quite often a consultant is required to submit their hydraulic models to the client as a transportable database. Unfortunately, many clients do not have access to InfoWorks or InfoNet, making it difficult to validate that the consultant has actually done the work.

Given how vitally important it is for companies to be able to check their data, it is equally important that there is a cheap and easy way to check the contents of transportable databases without requiring expensive modelling software.

Furthermore, I am no lawyer, but there may be grounds to suggest that transportable databases are even in violation of GDPR. Quoting from ICO.org.uk:

Where no specific format is in common use within your industry or sector, you should provide personal data using open formats such as CSV, XML and JSON. You may also find that these formats are the easiest for you to use when answering data portability requests.

Since 'personal data' can be stored in a transportable database, this format may be in violation of GDPR so long as it remains a proprietary format with no open specification.

For all of these reasons, I deemed it necessary to build an open source library which can read InfoWorks and InfoNet transportables. I started this project about 2.5 years ago, but soon hit a brick wall, and there things stayed until last week. Without further ado...

Transportable database structure

About a year ago I posted an article on LinkedIn about understanding the InfoWorks CS database structure. In that article I briefly covered the structure of transportable databases, and explained how, a year before writing it, I had come across an error while trying to open a corrupt transportable database:

Error extracting file Globals.Dat into ... error in zipfile (2).

At this point curiosity got the best of me and I opened the transportable database in 7-Zip to take a peek inside. Below is an interactive example of an unzipped transportable database.

When I discovered that you could unzip a transportable, I began thinking about the possibilities! So many things seemed within reach now that we could build our own transportable databases from scratch! So I began digging deeper into the database in an effort to discover what made it tick.

Investigating the raw files

When the transportable database shown above is loaded in ICM, it has the following database structure:

So somehow the file structure above is morphed into the transportable database structure. But the question is... How?

If we look into the files, we see that a transportable database item contains what appears to be a list of parent and child IDs, under #Children and #Parent keys. So let's expand these out and see what we are left with! You might also notice that some files contain an ID key; however, this ID relates directly to the number in the name of the file, not to the IDs represented in #Children and #Parent.
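To make this concrete, the #Children and #Parent lines can be pulled out with a small parser. This is a minimal sketch assuming the quoted, comma-separated `(id,id,...)` format shown in the file excerpts later in this article; `parseIdList` is a hypothetical helper of my own, not part of any official format specification.

```javascript
// Sketch: extract the ID list from a #Children or #Parent line in a
// transportable .DAT file. Assumes the `#Key,1,"(id,id,...)"` layout seen
// in the excerpts in this article; real files may vary.
function parseIdList(line) {
  const match = line.match(/"\(([\d,]*)\)"/); // grab the "(...)" payload
  if (!match) return [];
  return match[1]
    .split(',')
    .filter(s => s.length > 0)
    .map(Number);
}

parseIdList('#Children,1,"(364,620)"'); // → [364, 620]
```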

In the diagram, links represent the files which define connections. Hovering over a link will tell you which file it is defined in. However, the file which defines the link's ID is unknown.

Ultimately, a particular chain of links, e.g. the one from the Transportable Database to the Master Group:

0 -> 433 -> 4272

is split into 2 separate trees:

0   -> ??? -> 4272

and

??? -> 433 -> ???

It is possible to decipher this from the diagram because the Master Group has 25 children in both trees, and is the only object with 25 children. Thus these two must match, giving 0 -> 433 -> 4272. However, the same is not true for the children of the asset groups: all asset groups have 1 child, so this trick cannot be used there.
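The child-count trick above can be sketched in code. This is a hypothetical illustration: `findUniqueMatch` and its inputs are names of my own invention, and it only succeeds when a child count is unique, exactly as in the Master Group example.

```javascript
// Hypothetical sketch of the child-count trick: given the number of children
// recorded for each node in the two partial trees, a node can be matched
// across trees only when its child count is unique (e.g. the Master Group's
// 25 children). Asset groups, which all have 1 child, remain ambiguous.
function findUniqueMatch(countsA, countsB) {
  for (const [idA, countA] of Object.entries(countsA)) {
    const candidates = Object.entries(countsB)
      .filter(([, countB]) => countB === countA);
    if (candidates.length === 1) {
      return { treeA: idA, treeB: candidates[0][0] }; // unambiguous match
    }
  }
  return null; // every count is shared, as with the 1-child asset groups
}

// A node with a unique count of 25 matches across the trees:
findUniqueMatch({ a: 25, b: 1 }, { x: 25, y: 1 }); // → { treeA: 'a', treeB: 'x' }
```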

The IDs of the parents and children are not sufficient to produce a tree. Each node must define its own ID; only once that is defined can its parents or children be resolved.

Ultimately this means that we are missing some data. There must be some additional data stored regarding the ID of each zip item. At first, I thought it may be stored in alternate data streams attached to the zip items. But after looking into it, it appeared this was not the case.

Next I assumed that the data was stored in the Wallaby 2 file. This file contains all data related to the model/collection/distribution networks stored in the transportable database. However, I then realised that databases can be created without any of these files, so this cannot be where the data is stored.

My next idea was to look in the various blob files spread throughout the database. However, as with the Wallaby 2 file, transportable databases can be made without blob files yet still have complex structures.

At this point I stopped and proceeded on to other projects... Recently, 2 years later, I revisited the project, and just so happened to have the following tree structure:

Within AG1.DAT the children string #Children,1,"(364,620)" was present. It dawned on me that 620 was a suspiciously round number to land an ID on, and that the difference between the two was approximately 250. Taking the difference of the 2 IDs, 620 - 364, gives exactly 256. To me, this set off alarm bells, and if you've been programming for a while, you'd likely agree: 256, or 2^8, is a very interesting number to separate 2 unique IDs.

So let's assume all sequential IDs are separated by 256, and try to verify this with our first example. If we look into the data in MASG1.DAT (the master group which contains many asset groups), we see the following children IDs:

#Children,1,"(432,688,944,1200,1456,1712,2224,2480,2736,2992,3248,3504,3760,4016,4272,4528,4784,5040,5296,5552,5808,6064,6320,6576,7088)"

If we add 256 to each ID we get the next one, as required. (Note: 1968 and 6832 are missing from this ID list because those items had been deleted previously.)
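This check is easy to automate. The sketch below takes the children IDs from MASG1.DAT above and confirms that every gap between consecutive IDs is a multiple of 256 (the gaps of 512 correspond to the deleted items 1968 and 6832).

```javascript
// Children IDs from MASG1.DAT, as listed above.
const children = [
  432, 688, 944, 1200, 1456, 1712, 2224, 2480, 2736, 2992, 3248, 3504,
  3760, 4016, 4272, 4528, 4784, 5040, 5296, 5552, 5808, 6064, 6320, 6576, 7088
];

// Every consecutive gap should be a multiple of 256; gaps larger than 256
// correspond to deleted items.
const allMultiplesOf256 = children
  .slice(1)
  .every((id, i) => (id - children[i]) % 256 === 0);
// allMultiplesOf256 === true
```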


Determining the ID

So it seems 256 is always used as a separator, but how do we determine the ID of a particular file? The formula I use is as follows:

Ultimately, the ID depends on 2 parts of the file name:

AG12.DAT

The alphabetic prefix ('AG') relates to idtype: each model object type has a 1-to-1 mapping to its own numeric idtype. The numeric suffix ('12') is equal to idcumulative. So in the case of AG12.DAT the type is AG, which translates to the number 176, and our equation becomes id = idtype + 256 * idcumulative = 176 + 256 * 12 = 3248. Now that you know how transportable databases work, you can verify this for yourself!
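The formula can be sketched as a small function. Note that only the AG idtype of 176 is confirmed here; the `typeCodes` map is a hypothetical placeholder to be filled in with the other object types' codes, and `transportableId` is an illustrative name, not the library's actual API.

```javascript
// id = idtype + 256 * idcumulative.
// Only the AG (asset group) code of 176 is confirmed in this article;
// other entries would need to be discovered the same way.
const typeCodes = { AG: 176 };

function transportableId(fileName) {
  // Split e.g. "AG12.DAT" into its alphabetic prefix and numeric suffix.
  const m = fileName.match(/^([A-Z]+)(\d+)\.DAT$/i);
  if (!m) throw new Error('Unrecognised file name: ' + fileName);
  const [, type, cumulative] = m;
  return typeCodes[type.toUpperCase()] + 256 * Number(cumulative);
}

transportableId('AG12.DAT'); // → 176 + 256 * 12 = 3248
```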

Open Source JavaScript Library

I have created an open source JavaScript library which implements an algorithm allowing you to read transportable databases in the browser, and in desktop applications via Node.js. The source code is hosted on GitHub; please see the repository readme for documentation on the API. I've tried to build it in as flexible a manner as possible. Let me know how you get on with it!

I have also created a few demos to help demonstrate the library in action! I suggest looking at the Tree Viewer first!

Limitations and community contributions

Currently there are a few limitations with the library. The library itself is a basis ready for extension, and does not yet contain any type-specific objects. For example, it would be helpful if Runs related directly to the network, waste water, trade water profiles and rain events which they use for simulations. This should be easy to implement using the TransportableItem type system, but hasn't been yet.

You may have noticed that, when it comes to layer lists, the file LL1.DAT does not include data about which layers are loaded in the layer list. This data is stored in the LL1Blob.DAT binary file. I am still unaware of any format specification for reading these files, and they almost look encrypted. I have already tried decompressing them; however, it seems this cannot be done on the file as a whole. Ultimately the format of these files is still unknown to me, so if anyone has time to invest in decoding them, that would be fantastic!

Finally, the numbatdata is stored in a format the database itself describes as a "Wallaby 2 file". However, my googling has not turned up any relational databases which refer to Wallaby 2. I have noticed that these files are likely SQLite 3 based, as numerous SQLite temporary files are created while performing grid reports, for instance. Otherwise, I am not aware of any specification for these files. If anyone has some time to give it a crack, that'd be a great help!

Currently only reading from transportable databases has been implemented. Long term it would be great if we could also write to transportable databases, giving us complete control over their contents: fix them as required and download the newer version. If this becomes a requirement, TransportableItem should be given a method which returns a file collection object of pre-written files ready for import.

If you have any other suggestions for improvements to the library, feel free to raise them as an issue on GitHub. You can also fork the repository and make pull requests to get features integrated into the library faster.

So to recap here's the ToDo list:

  • Integrated type wrappers for different model objects
  • Blob parser
  • Wallaby 2 parser
  • Writing data to and downloading the new transportable database

Conclusion

All in all, I have spent a fair amount of my free time over the past few weeks building this project. I really hope it ends up being helpful to others. I think this would be a great tool for many of our clients at PickEverard!

Currently this has been entirely self-funded. If you would like specific features implemented for your own processes, then I am more than happy to come to an arrangement which works for both of us. Please feel free to get in contact!

Enjoy!

Author:
James Warren, Software Engineer
Leicester, United Kingdom