Tracking data made easier!

Rohit Jain
4 min read · Mar 13, 2021


Hi there, stuck techies! Welcome to my blog, where I try to make our lives easier by showing how a tool can be built to track data, finally! 😛

“DATA IS THE NEW OIL”.

A very commonly heard statement these days! But what happens when some piece of data goes missing? 🤔

Searching logs be like :P

Yeah, something like this indeed! Manually scrolling through the logs, checking the data at each point, and sometimes making silly assumptions along the way, all with people breathing down your neck asking "Where is my data!!". This can happen pretty frequently: systems transfer millions of records per hour, and with each record being so important, a tool to track such records becomes a necessity. Searching through logs can also become a nightmare, especially for someone who is relatively new to the system. This gives rise to the need for a tool that can:

  • Track each record end to end
  • Identify any point of failure and its cause within seconds, and
  • Visualize transformations to data by each of the applications in real-time.

Event Tracker

Tracking flow of data in real-time

Wouldn’t it be great to have a tool that shows data moving from the source to the destination, via all the different applications, in an ordered manner? It would make searching so much easier!

Let’s get into the technical details of how we can build one from scratch.

Design:

Event tracker design and architecture

Each of the systems sends signals to a common Kinesis stream for every action it performs on the data. Different actions could be consuming, publishing, or transforming data, splitting one record into multiple records, or merging multiple records into one. The signals sent to the Kinesis stream follow a specified format, mainly comprising the fields below (a rough sketch of one such signal follows the list):

  • TransactionId: ID which remains constant throughout the event lifecycle
  • Previous ID
  • Next ID
  • Data, or the payload
  • Component name
  • Action performed
  • Status: success/failure
  • Current timestamp
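
To make that format concrete, here is a rough sketch of what one such signal could look like. I'm using TypeScript and made-up field names purely for illustration; the exact schema in your system will differ.

```typescript
// Hypothetical shape of a tracking signal; field names are illustrative only.
interface TrackingEvent {
  transactionId: string;     // stays constant for the whole event lifecycle
  previousId: string | null; // ID of the record this one was derived from
  nextId: string | null;     // ID of the record emitted downstream
  payload: string;           // the data itself (or a pointer/hash of it)
  componentName: string;     // which application emitted the signal
  action: "CONSUME" | "PUBLISH" | "TRANSFORM" | "SPLIT" | "MERGE";
  status: "SUCCESS" | "FAILURE";
  timestamp: string;         // ISO-8601 time at which the action happened
}

const example: TrackingEvent = {
  transactionId: "txn-42",
  previousId: null,
  nextId: "rec-7",
  payload: '{"orderId": 1001}',
  componentName: "order-ingestion-service",
  action: "PUBLISH",
  status: "SUCCESS",
  timestamp: "2021-03-13T10:15:30Z",
};
```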

A Lambda function is written to consume data from the Kinesis stream, transform it, and push it into DynamoDB. A Kinesis trigger is added to the Lambda so that it is invoked every ‘X’ seconds or for every ‘Y’ records. By adjusting the number of shards on the Kinesis stream and the parallelization factor of the Lambda, this combination can easily scale up to 5000 TPS. It also keeps costs down, since you pay only for what you use: at non-peak hours, less data triggers the Lambda fewer times, and hence the cost is lower.

Reference: Lambda with Kinesis
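
For a feel of what that Lambda could look like, here's a minimal sketch in TypeScript (Node.js runtime, AWS SDK v2). The table name, key schema, and the write-per-record approach are assumptions for illustration; a production version would batch writes and handle partial failures.

```typescript
import { KinesisStreamEvent } from "aws-lambda";
import { DynamoDB } from "aws-sdk";

const docClient = new DynamoDB.DocumentClient();
const TABLE_NAME = process.env.TABLE_NAME ?? "event-tracker"; // hypothetical table name

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    // Kinesis delivers each payload base64-encoded
    const signal = JSON.parse(
      Buffer.from(record.kinesis.data, "base64").toString("utf-8")
    );

    // One DynamoDB item per signal; a real implementation would batch these
    // writes and also create the corresponding edge item described below.
    await docClient
      .put({
        TableName: TABLE_NAME,
        Item: {
          PK: `TXN#${signal.transactionId}`, // group everything by transaction
          SK: `NODE#${signal.componentName}#${signal.timestamp}`,
          ...signal,
        },
      })
      .promise();
  }
};
```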

Data is stored in DynamoDB in the form of nodes and edges using the adjacency list design pattern, which makes it much easier for the service to consume and render as a graph in the UI application. Thinking in terms of a directed acyclic graph, a ‘node’ and an ‘edge’ are created in DynamoDB for each event received. A node represents all the details about a particular event, e.g. the component that emitted it, the data emitted, success/failure, etc., whereas an edge is the connection between two such nodes.

Representation of records as nodes and edges

Reference: Best Practices for Managing Many-to-Many Relationships — Amazon DynamoDB
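
As a concrete (and again hypothetical) example of the adjacency-list layout, one hop of a transaction could be stored as two items sharing the same partition key, so that a single query by transaction ID pulls back the whole graph:

```typescript
// Illustrative items only; key names and prefixes are assumptions,
// not the exact table design described above.
const nodeItem = {
  PK: "TXN#txn-42",
  SK: "NODE#rec-7",                  // one node item per event
  componentName: "enrichment-service",
  action: "TRANSFORM",
  status: "SUCCESS",
  payload: '{"orderId": 1001, "region": "US"}',
  timestamp: "2021-03-13T10:15:31Z",
};

const edgeItem = {
  PK: "TXN#txn-42",
  SK: "EDGE#rec-6#rec-7",            // connects the previous node to this one
  fromNode: "rec-6",
  toNode: "rec-7",
};

// Query(PK = "TXN#txn-42") returns every node and edge for the transaction,
// which the service can return to the UI as a ready-made graph.
```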

Bingo! We finally have the data in a DB! Now we just need to expose this data via a service and build a UI that shows it as clearly as we can. For building a UI like the one shown, with a directed acyclic graph, I used Dagre-D3 in ReactJS. Each node denotes the activity happening and each edge denotes the movement of data from one activity to the other. The application itself is the parent, or grouping, of all the actions happening within it. An error in processing is denoted by a red node, which makes it even easier to identify the failure.

Reference: https://codesandbox.io/s/x90v27yvjz
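
Here's a bare-bones sketch of the Dagre-D3 side, stripped of the React wrapper for brevity. The node IDs, labels, and styles are made up; in the real app this rendering would sit inside a React component fed by the service described above.

```typescript
import * as d3 from "d3";
import * as dagreD3 from "dagre-d3";

// One node per tracked event, one edge per hop between them.
const g = new dagreD3.graphlib.Graph().setGraph({ rankdir: "LR" });

g.setNode("rec-6", { label: "order-ingestion-service: PUBLISH" });
g.setNode("rec-7", { label: "enrichment-service: TRANSFORM" });
g.setNode("rec-8", {
  label: "persistence-service: FAILURE",
  style: "fill: #f88", // failed steps rendered in red for quick spotting
});
g.setEdge("rec-6", "rec-7", {});
g.setEdge("rec-7", "rec-8", {});

// Render into an existing <svg id="graph"><g></g></svg> element on the page.
const render = new dagreD3.render();
render(d3.select("svg#graph g"), g);
```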

What did we finally achieve? 🤔

  • Searching data for any particular transaction ID became so much simpler! There's no need to dig through the logs of each system to find the point of failure, which can be extremely complex when a single record is transformed and merged in various places! My team here at Intuit builds complex integrations from various sources and persists them into a single data store. From what we’ve seen with this tool, the MTTD (mean time to detect) for any missing piece of data drops tremendously, from ~30 mins to <1–2 mins!
  • The UI was built to show the data transformed at each of the ‘nodes’ when a node is clicked. This tool became a single point of entry for seeing any data transformation happening, and hence gives much more insight into what each of the systems is doing.
  • The possibilities for further development on top of the current solution are endless! Having a single place for the logs from every system makes it a very powerful solution indeed. We’ve performed analytics on top of the data in the Kinesis stream using Kinesis Data Analytics, where we can also identify, in real time, any event that does not reach its destination within a predefined SLA! Isn’t that cool? 😎

Conclusion

I hope this blog helps you think in a better direction towards building a tool that makes tracking data (and eventually life) easier. I did not go too deep into the technical details of the solution, but feel free to reach out to me if you get stuck anywhere or if you'd like me to write up the in-depth technicalities.
