Leveraging Observability Practices in Software Development
Introduction
I've been playing around with observability tools recently.
So I decided to build something and break it on purpose: a simple cottage booking app with intentionally engineered database lock issues, to see whether observability could actually help me debug the mess.
Used OpenTelemetry, Prometheus, Grafana, and Loki - the usual suspects for metrics, traces, and logs.
Code's all here if you want to see: github.com/morifky/cottage-booking-app
The Setup
OpenTelemetry does most of the heavy lifting - automatically instruments GORM calls and HTTP requests. Pretty neat.
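The post doesn't show the wiring, but with the standard contrib packages it looks roughly like this. This is a sketch, not the repo's actual code: the package paths are the stock `otelgorm` tracing plugin and `otelhttp` ones, and the DSN and service name are made up.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"gorm.io/driver/postgres"
	"gorm.io/gorm"
	"gorm.io/plugin/opentelemetry/tracing"
)

func main() {
	db, err := gorm.Open(postgres.Open("postgres://localhost/cottage"), &gorm.Config{})
	if err != nil {
		panic(err)
	}
	// Every GORM query now emits a span on the active trace.
	if err := db.Use(tracing.NewPlugin()); err != nil {
		panic(err)
	}
	_ = db

	mux := http.NewServeMux()
	// otelhttp wraps the mux so each request starts a span and
	// propagates its context down into the handlers (and GORM).
	handler := otelhttp.NewHandler(mux, "cottage-booking-app")
	http.ListenAndServe(":8080", handler)
}
```

Two lines of setup and every HTTP request and SQL query is traced, which is why "automatic" isn't an exaggeration here.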
For metrics, I'm tracking request times, booking counts, database stuff. You know, the usual things that tell you when something's going wrong.
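For a custom counter like booking counts, the OTel metrics API version looks something like this; the meter name, metric name, and `room_id` attribute are illustrative, not necessarily what the repo uses.

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// RecordBooking bumps a bookings counter, tagged by room.
// "bookings_total" ends up scrape-able from Prometheus via the collector.
func RecordBooking(ctx context.Context, roomID int) error {
	meter := otel.Meter("cottage-booking-app")
	counter, err := meter.Int64Counter("bookings_total")
	if err != nil {
		return err
	}
	counter.Add(ctx, 1, metric.WithAttributes(attribute.Int("room_id", roomID)))
	return nil
}
```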
Logs go through Zap to Loki. The nice thing is everything connects with trace IDs, so you can jump from a log line to see the full request flow.
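The trace-ID correlation boils down to a small helper; again a sketch under the assumption that the app uses the standard OTel span context API, with `WithTraceID` being a hypothetical name.

```go
package logging

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"go.uber.org/zap"
)

// WithTraceID returns a request-scoped logger carrying the current
// trace ID, so a Loki log line can be joined back to its Tempo trace.
func WithTraceID(ctx context.Context, base *zap.Logger) *zap.Logger {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return base // no active span; log without a trace_id field
	}
	return base.With(zap.String("trace_id", sc.TraceID().String()))
}
```

Every log line written through the returned logger carries a `trace_id` field, which is what makes the log-to-trace jump in Grafana work.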
How it works
App sends everything to the OpenTelemetry Collector, which routes metrics to Prometheus, logs to Loki, traces to Tempo. Grafana pulls it all together.
                     +------------------+
                     |     Grafana      |
                     | (Visualization)  |
                     +------------------+
                       ^      ^      ^
                       |      |      |
            (queries Prometheus, Loki, and Tempo)

+-----------------------+     +------------------+     +------------------+
| Cottage Booking App   |---->|  OpenTelemetry   |---->|    Prometheus    |
| (Instrumented w/ SDK) |     |    Collector     |     |    (Metrics)     |
+-----------------------+     +--------+---------+     +------------------+
                                       |
                                       |                +------------------+
                                       +--------------->|       Loki       |
                                       |                |      (Logs)      |
                                       |                +------------------+
                                       |
                                       |                +------------------+
                                       +--------------->|  Grafana Tempo   |
                                                        |     (Traces)     |
                                                        +------------------+
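The collector side of that routing is just pipeline config. A rough sketch of what it could look like; exporter names and endpoints vary by collector version and deployment, so treat these as placeholders rather than the repo's actual config.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes the collector here
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [loki]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```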
Breaking things
Here's the fun part - I wrote some code to create database lock contention on purpose:
func (br *BookingRepository) SaveWithLock(ctx context.Context, booking *models.Booking, holdDuration time.Duration) error {
	tracer := otel.Tracer("booking-repository")
	ctx, span := tracer.Start(ctx, "SaveWithLock")
	defer span.End()

	span.SetAttributes(
		attribute.Int("visitor_id", int(booking.VisitorID)),
		attribute.Int("room_id", int(booking.RoomID)),
		attribute.String("hold_duration", holdDuration.String()),
	)

	return br.db.WithContext(ctx).Transaction(func(tx *gorm.DB) error {
		// Acquire exclusive table lock
		if err := tx.Exec("LOCK TABLE bookings IN ACCESS EXCLUSIVE MODE").Error; err != nil {
			span.SetStatus(codes.Error, "Failed to acquire table lock")
			return err
		}

		// Hold the lock for the specified duration, unless the
		// request's context is cancelled first
		select {
		case <-time.After(holdDuration):
			// Continue after hold duration
		case <-ctx.Done():
			span.SetStatus(codes.Error, "Request timed out")
			return ctx.Err()
		}

		return tx.Create(booking).Error
	})
}
Without observability
This is what usually happens:
- App times out randomly
- Can't reproduce the issue
- Users are mad, you're confused
- Lots of guessing and hoping
With observability
Completely different story.
Here's how I simulate the lock contention:
Terminal 1 - grab the lock:
curl -X POST http://localhost:8080/booking/with-lock \
-H "Content-Type: application/json" \
-d '{"visitor_id":1,"room_id":1,"hold_lock_seconds":100}'
Terminal 2 - try another booking:
curl -X POST http://localhost:8080/booking \
-H "Content-Type: application/json" \
-d '{"visitor_id":2,"room_id":2}'
Now I can see exactly what's happening - when the lock was taken, how long it lasted, why other requests are waiting. No more guessing.


What I learned
Tracing changes everything
Instead of guessing what went wrong, you can see the exact sequence of events. When that database lock happened, I could watch other requests pile up in real-time.
Structured logs with trace IDs
This is probably the best part. See an error in the logs? Click the trace ID and boom - you can see the entire request flow that caused it.
Final thoughts
Observability isn't just about fixing bugs. It's about actually understanding what your system is doing. Once you have it set up properly, debugging becomes way less painful.
Definitely worth the setup time.