HACKER Q&A
📣 LinuxBender

Why do applications log long human readable static strings?


If an application has a definition file for all possible error messages, why doesn't it include a dictionary file in the application manifest that maps a 4-digit hex code to each long string?

e.g.

Dogs barking can not fly without umbrellas: 1.1.1.1

could be

db1c: 1.1.1.1

Or if there are multiple variables %s and such, it could be db1c: 1a 1.1.1.1, 1b 1.2.3.4
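
A minimal sketch of what that could look like, assuming a simplified line format of "code: arg, arg" with positional arguments (the labeled 1a/1b slots above are left out for brevity). The names CATALOG, encode and decode are hypothetical, and the second catalog entry is invented purely for illustration:

```python
# Hypothetical message catalog: 4-digit hex code -> format string.
# Only "db1c" comes from the example above; "db1d" is made up.
CATALOG = {
    "db1c": "Dogs barking can not fly without umbrellas: {}",
    "db1d": "Connection from {} refused by {}",
}


def encode(code, *args):
    """Build the compact log line: the code, then comma-separated arguments."""
    if code not in CATALOG:
        raise KeyError(f"unknown message code: {code}")
    return f"{code}: " + ", ".join(str(a) for a in args)


def decode(line):
    """Expand a compact log line back into its human-readable form."""
    code, _, rest = line.partition(": ")
    args = rest.split(", ") if rest else []
    return CATALOG[code].format(*args)


compact = encode("db1c", "1.1.1.1")
print(compact)          # db1c: 1.1.1.1
print(decode(compact))  # Dogs barking can not fly without umbrellas: 1.1.1.1
```

Note that this naive format breaks if an argument itself contains ", ", so a real scheme would need escaping or a length-prefixed encoding.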

Does this make debugging too difficult because you have to read the logs through a script that translates the log messages? I recall IBM doing this in the 1970s for messages that had to be relayed over modems and satellite, but I don't see applications doing this today.

Why I am asking: Tens of thousands of servers sending the same strings to Splunk translates to tens of millions of dollars in Splunk licensing. Even using ELK/Logstash, it translates to a lot of disk storage and IO. Surely this must be a solved problem. It would be easy to put a dictionary file in Splunk to look up the codes.

Codifying the events would also make it easier to translate log events into multiple languages.


  👤 epc Accepted Answer ✓
My semi-informed guess: even with automation you need the message to be decipherable by human eyes.

Also: you'd now have to keep your data dictionary in sync between your version of the product and whatever third party log analyzer you're using.

It isn't impossible, but it doesn't seem to be in anyone's particular interest to solve.

I used to write error messages for IBM RACF (a security product for MVS), circa 1990-1992. Mainframe stuff (the O/S, products, etc.) emits messages and codes of various sorts. Most are inhaled by automation systems, but all are semi-legible to humans, because sometimes the automation systems throw up their digital hands.

The first hard lesson I learned is: once a message is in the system, you can never change it. You can replace it (with a completely different message identifier and text), but you can't change the typos you find (they're 20 years old at this point) or lowercase the all-uppercase messages to make them look pretty (I didn't do this, but someone briefly on my team did).

When I moved on to doing web stuff, I hacked my copy of NCSA to emit a simplified access log: timestamps were Unix epoch, I skipped the "user" fields, and error codes were in hex (i.e., not "200" but "C8" for Document OK). Thought I was a genius, saving space and all that. But every tool assumed what we now call "common log" format, and it meant that you couldn't just look at a snippet of log at a glance to see what was going on. So I ended up reverting to common log format after a couple of months.


👤 Someone
gzip can create that dictionary for you on the fly.

The advantages are that the dictionary stays up to date and that it covers non-static strings such as dates, too; the disadvantage is slightly less efficient compression.
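
A rough way to see this for yourself, using Python's standard gzip module on synthetic data. The log lines, the IP range and the helper name make_log are made up for illustration, so the sizes it prints say nothing about real logs; it is only a sketch for comparing the long-string and hex-code variants before and after compression:

```python
import gzip

MESSAGE = "Dogs barking can not fly without umbrellas: "


def make_log(prefix, count=5000):
    """Build `count` log lines that share a prefix and differ only in the IP."""
    return "".join(
        f"{prefix}10.0.{i // 256}.{i % 256}\n" for i in range(count)
    ).encode("utf-8")


raw_long = make_log(MESSAGE)     # full human-readable strings
raw_coded = make_log("db1c: ")   # same events, hex-coded by hand

gz_long = gzip.compress(raw_long)
gz_coded = gzip.compress(raw_coded)

for name, raw, packed in (
    ("long strings", raw_long, gz_long),
    ("hex codes", raw_coded, gz_coded),
):
    print(f"{name:>12}: {len(raw):>7} bytes raw, {len(packed):>6} bytes gzipped")
```

Running it on a sample of your own logs instead of the synthetic lines would give a more honest picture of how much a hand-maintained dictionary saves over plain compression.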