HACKER Q&A
📣 thiago_fm

How to Find Anomalies in JSONs?


Hello everyone, I have some JSON files (more than 1,000, possibly a stream of JSON documents) that look quite similar.

If I want to find anomalies among them, what would be the way to go? I saw that k-means isn't the best method.

I'm not particularly looking for examples that are just a little bit different from the others, but for examples that are VERY different. If you've ever done web development, you've probably received a strange error inside a JSON response instead of what you expected. I want to be able to catch that with an algorithm.

Why do I want to do that? I use a few APIs, but sometimes they change their responses or return unknown response bodies. I want my algorithm/model to detect these and show me a list of the biggest anomalies.

If I manage to do it successfully, I'll make sure it's open source. Also, if you know an easy way or an OSS solution, please share it. Hell, even just tell me what I should study! I was studying deep learning but didn't find any method I know of that I could use to make sense of this data.


  👤 rkx1 Accepted Answer ✓
If I understand correctly, the type of data contained in the files is much more important than the format. As a starting point, is there anything stopping you from reading the files in Python (pandas, for example) and doing some simple outlier detection (interquartile ranges, standard deviations)?
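
A minimal sketch of the interquartile-range idea, using only the Python standard library instead of pandas. The `price` field and the sample documents are invented for illustration; in practice you would pick whichever numeric fields your responses actually contain.

```python
import json
import statistics

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical responses; "price" is an assumed field name.
raw_docs = [
    '{"id": 1, "price": 10.0}',
    '{"id": 2, "price": 11.0}',
    '{"id": 3, "price": 9.5}',
    '{"id": 4, "price": 10.5}',
    '{"id": 5, "price": 10.2}',
    '{"id": 6, "price": 9.8}',
    '{"id": 7, "price": 250.0}',
    '{"id": 8, "price": 10.1}',
]
docs = [json.loads(s) for s in raw_docs]
prices = [d["price"] for d in docs]

# Report the ids of documents whose price falls outside the IQR fences.
for i in iqr_outliers(prices):
    print("outlier:", docs[i]["id"], prices[i])
```

This only catches anomalies in numeric values. It says nothing about responses whose structure changed, which is a separate check.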

👤 bdr
This seems like a statistics question that doesn't have much to do with JSON. A starting point: https://en.wikipedia.org/wiki/Anomaly_detection

You should probably start off trying the simplest thing that could possibly work for your use-case.


👤 jrandm
> If I want to find anomalies among them, what would be the way to go?

What is an anomalous JSON file other than a JSON file that does not meet the specification[0]?

I have never gotten a "strange error" from a JSON parser. Most JSON parsers are very specific about whatever character they dislike. I would suggest that the algorithm you're seeking is in fact whatever is giving you the error.

If you're speaking to an API returning JSON, then you should be able to determine what the API is supposed to return to you. Many times different responses contain meaning about why the response is different than expected, like HTTP status codes.
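
As a rough sketch of that point, the checks can run in order: status code first, then content type, then the parser itself. The function name `classify_response` and its three plain arguments are assumptions made for this example, not part of any HTTP library.

```python
import json

def classify_response(status, content_type, body):
    """Return "ok" or a short description of why the response looks wrong."""
    # The status code often already explains an unexpected body.
    if status >= 400:
        return f"http error {status}"
    # An HTML error page served with a 200 is a common surprise.
    if "json" not in content_type.lower():
        return f"unexpected content type: {content_type}"
    # The parser reports exactly where and why the text is not valid JSON.
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return "ok"

print(classify_response(200, "application/json", '{"a": 1}'))
print(classify_response(503, "text/html", "<html>Service Unavailable</html>"))
print(classify_response(200, "application/json", '{"a": }'))
```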

Deep learning is a tool for solving a problem. Until you have a well-defined problem, it will be difficult to apply machine learning techniques to it.

[0]: https://www.json.org/json-en.html


👤 verdverm
Cuelang may be able to help here. You can define a schema and then process all of the JSON. You can look for new or missing fields and constrain a field to a regex.

The docs are missing some of this. If you jump into slack, we'd be happy to help.
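
To illustrate the schema idea without guessing at CUE syntax, here is a plain-Python analogue, not CUE itself. The `SCHEMA` layout, the field names, and the `check` helper are all hypothetical; the point is only the three checks mentioned above: missing fields, new fields, and a regex constraint.

```python
import re

# Hypothetical schema: field name -> (expected type, optional regex for strings).
SCHEMA = {
    "id": (int, None),
    "email": (str, r"[^@\s]+@[^@\s]+\.[a-z]+"),
    "status": (str, r"active|inactive"),
}

def check(doc, schema=SCHEMA):
    """Return a list of human-readable schema violations for one document."""
    problems = []
    for field in sorted(schema.keys() - doc.keys()):
        problems.append(f"missing field: {field}")
    for field in sorted(doc.keys() - schema.keys()):
        problems.append(f"new field: {field}")
    for field in sorted(schema.keys() & doc.keys()):
        expected_type, pattern = schema[field]
        value = doc[field]
        if not isinstance(value, expected_type):
            problems.append(f"wrong type: {field}")
        elif pattern and not re.fullmatch(pattern, value):
            problems.append(f"pattern mismatch: {field}")
    return problems

good = {"id": 1, "email": "a@b.com", "status": "active"}
bad = {"id": 1, "status": "deleted", "error": "oops"}
print(check(good))
print(check(bad))
```

Sorting documents by the number of reported problems gives a crude "biggest anomalies first" list.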


👤 heavenlyblue
Use the dictdiff module between subsequent responses, then check whether the differences go beyond changes in atomic values.
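
A standard-library sketch of that idea, not the API of any particular diff module: compare two parsed responses and report only structural changes (keys added or removed, a value changing type), while ignoring leaf values that merely differ. The helper name `structural_diff` and the sample data are made up for this example.

```python
def structural_diff(old, new, path="$"):
    """List structural differences between two parsed JSON values.

    Atomic values of the same type that merely differ are ignored.
    """
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() - new.keys()):
            changes.append(f"{path}.{key}: removed")
        for key in sorted(new.keys() - old.keys()):
            changes.append(f"{path}.{key}: added")
        for key in sorted(old.keys() & new.keys()):
            changes.extend(structural_diff(old[key], new[key], f"{path}.{key}"))
    elif isinstance(old, list) and isinstance(new, list):
        # Compare elements pairwise; a length change alone is not flagged.
        for i, (a, b) in enumerate(zip(old, new)):
            changes.extend(structural_diff(a, b, f"{path}[{i}]"))
    elif type(old) is not type(new):
        changes.append(f"{path}: {type(old).__name__} -> {type(new).__name__}")
    return changes

prev = {"user": {"id": 1, "name": "ana"}, "items": [1, 2]}
curr_ok = {"user": {"id": 2, "name": "bob"}, "items": [3]}
curr_bad = {"error": "internal", "user": "unknown"}

print(structural_diff(prev, curr_ok))
print(structural_diff(prev, curr_bad))
```

An empty list means only atomic values changed; a long list is a candidate anomaly.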