ad: cq2k-1

Bloated Log Files

Discussion in 'Logbooks & Logging Programs' started by AA6YQ, Aug 3, 2010.

Thread Status:
Not open for further replies.
ad: L-HROutlet
ad: l-rl
ad: L-MFJ
ad: Left-2
ad: L-Geochron
ad: abrind-2
ad: Left-3
  1. AA6YQ

    AA6YQ XML Subscriber QRZ Page

    There is no such thing as a linear database, and the term SQL database refers to a database that accepts queries in a language known as Structured Query Language -- not a kind of database that might be more or less likely to create bloated log files.

    What does create bloated log files is

    • the use of large fixed database fields instead of variable-length database fields
    • storing the same information in every QSO
    It's nice to be able to capture a Name that's 128 characters in length, but to define a log's NAME field as 128 fixed characters will waste a lot of space. Most modern databases offer variable-length fields that consume only as much space as is required to record the actual data. So if your QSO partner's name is Ron, only 3 bytes of storage are consumed, instead of 128.

    A significant amount of information stored in each QSO concerns you and your location -- your latitude, longitude, grid square, CQ zone, ITU zone, city, county, state, etc. Most hams operate from a single location, or from a small number of locations, so recording all that "my location" info with every QSO can waste a lot of space. To avoid this waste and to make it easier for those who do operate from multiple locations, DXKeeper lets you specify one or more "QTHs", each identified by a simple name (like the name of your town, or the name of your street). The information about each of your QTHs - latitude, longitude, grid square, CQ zone, ITU zone, city, county, state, etc. - is captured once. Each logged QSO specifies the simple name of the QTH from which it was made -- so all of the information is accessible (and exported in standard ADIF) -- it's just not needlessly replicated in every logged QSO.

    To follow up on KB1NXE's claim above, I installed HRD version 2636, and imported an ADIF file containing all 18,390 QSOs exported from my DXKeeper log. DXKeeper and HRD both use Microsoft's Jet database engine, so any difference between them is entirely attributable to database structure. Here are the results:

    DXKeeper: 22 megabytes

    HRD: 107 megabytes

    The difference is nearly a factor of 5.

    73,

    Dave, AA6YQ
     
  2. KB3TZK

    KB3TZK Ham Member QRZ Page

    Not quite. Specifically, this depends on the database engine and varies from engine to engine. In general though, a variable-length text field will require more space than just the space of the text itself because now each record must note the actual length of the field. To use your example, the record which contains "Ron" will take at least 4 bytes to record this information. And because general-purpose database engines try to support various scenarios, it is unlikely that the additional cost in reality will be only one more byte.

    A realistic comparison is as follows:

    • Fixed-length text field: the per-record cost is the total length of the field; the field length is stored in the table metadata so it does not need to be recorded in the record.
    • Variable-length text field: the per-record cost is the actual length of the field plus some overhead required to record the address and length of the field.
    "Address? What is that?" one might ask. Well, a good deal of relational database engines prefer to have records within a single table be all the same length. It helps optimize access. So variable-length fields would have a fixed-length address recorded inside the actual record, instead of their variable-length content, and then have their content stored inside a memory blob. The address points to the content.

    (This is substantially how Microsoft Jet does it. There's a 12-byte overhead in each record in addition to the space needed to record the content of the field:

    http://support.microsoft.com/kb/198660

    Reading the above page, keep in mind that TEXT is also known as MEMO.

    Edit: Actually let me amend this: the best case scenario is a 12-byte overhead. This overhead can never be eliminated if the type is TEXT. The general scenario adds an additional 2K overhead (or 4K depending on the version of the Jet engine). The devil is in the details.)

    Maybe you've looked at the internals of each software or how their databases are structured but looking from outside, there can be other explanations besides fixed- vs variable-length fields and the use of unnormalized tables. Some relevant questions:


    • Could DXKeeper be running "housekeeping" tasks more aggressively than HRD?
    • Could HRD prioritize speed of access over memory usage? In particular, indexes used to speed up access can take quite a bit of memory. Denormalizing tables, as inelegant as it may be, can also sometimes achieve this and, as you pointed out, one of the costs of denormalization is the need for more memory.
    • Could HRD be designed to allow more information to be stored per QSO? There's a cost associated with this. (I use neither software so I do not know. When I researched logging software for Linux though, I've seen quite a range of capabilities.)
    I'm not trying to defend one software over the other but if the softwares are going to be compared over technical issues, then all angles need to be considered.
     
    Last edited: Aug 3, 2010
  3. KB1NXE

    KB1NXE Ham Member QRZ Page

    I see Dave - aa6yq is back at trying to assail me again. I have him on my ignore list and sure would like him to reciprocate.

    I guess just because I said I didn't like DXLab suite and preferred HRD, I'm now persona non grata in his eyes and he will take every opportunity to debase me.

    So be it.

    Jim - KB1NXE

    FWIW, in some circles, a linear file is also referred to as a flat file.
     
  4. KB1NXE

    KB1NXE Ham Member QRZ Page

    Louis,

    I believe HRD has many more fields defined in their DB structure than 'others'. This accounts for the size difference.

    FWIW, dave, aa6yq is the author of DXLabs and is very proud of his work (as he should be). He is NOT open to criticism in my experience, however.
     
  5. KB3TZK

    KB3TZK Ham Member QRZ Page

    Ah, I see... Well, what he reports regarding how DXLabs works should be right on the money then.
     
  6. AA6YQ

    AA6YQ XML Subscriber QRZ Page

    Yes, variable-length fields incur some bookkeeping overhead, but there's till a 5X reduction in space consumption when storing a typical first name in variable-length field compared with a fixed 128-character NAME field.

    In Jet, fields of type TEXT are fixed-length, and fields of type MEMO are variable-length.

    I installed HRD, imported the ADIF file, closed HRD, and looked at the log file size in Windows Explorer. With no deletion or modification operations, no housekeeping should be required.

    HRD does index the fields used for award tracking: callsign, band, mode, continent, IOTA, WPX, etc. DXKeeper maintains separate award tracking tables in it's log file, the "cost" of which are included in the 22 mb log size.

    Both DXKeeper and HRD implement the fields defined in ADIF. DXKeeper provides 8 user-defined fields, and HRD provides 10. Both applications define some application-specific fields. So the record contents aren't identical, but they are too close to explain the 5X difference in log file sizes.
     
  7. AA6YQ

    AA6YQ XML Subscriber QRZ Page

     
  8. AA6YQ

    AA6YQ XML Subscriber QRZ Page

    This is not the case with respect to DXLab. Opening an HRD log file and a DXLab log file in Access and displaying "design view" for each will objectively demonstrate this.

    DXLab is driven by critique from it's user community. Not only am I open to it, I actively solicit it. Spend a week monitoring the DXLab Yahoo group and draw your own conclusions.
     
  9. KB3TZK

    KB3TZK Ham Member QRZ Page

    From Jet 4.0 onwards, TEXT without any specified length is a synonym for MEMO whereas TEXT with a specified length, TEXT(n), is a synonym for CHAR field of the same length, CHAR(n). This explains it pretty well:

    http://support.microsoft.com/kb/275561

    Looking at the doc for Office 2007, the relationship still holds:

    http://msdn.microsoft.com/en-us/library/bb177899(office.12).aspx

    I have no reason to believe it does not hold in Office 2010.

    So TEXT is variable or fixed depending on whether a size is specified and if no size is specified it is synonymous with MEMO. The documentation suggests that "MEMO" is a convenience for developers: the engine translates it to "TEXT".
     
  10. AA6YQ

    AA6YQ XML Subscriber QRZ Page

    The TEXT fields in HRD's schema all specify sizes, and so are truly fixed-width.

    73,

    Dave, AA6YQ
     
Thread Status:
Not open for further replies.

Share This Page

ad: M2Ant-1