Streaming

3.1.5. Streaming#

Streaming refers to the real-time flow of data. Many real-time applications rely on data streaming to receive information with minimal latency, enabling them to generate up-to-date insights and results instantly. You’ve likely seen examples of this - most social media platforms use real-time data to update trending topics. That’s why these trends change rapidly; for example, when a football player does something remarkable (or controversial), a surge of related posts appears immediately. Social media platforms use streaming to capture this live behavior and update their trending leaderboards accordingly.

There are several ways to stream data. Below, we’ll explore streaming methods commonly encountered by data scientists.

3.1.5.1. Polling#

The simplest and most straightforward way to get real-time data is to continuously ask for it. For example, a system could send an API request or run an SFTP command every 5 seconds to fetch new data. This approach leverages existing protocols to simulate real-time behavior.

However, it comes with significant overhead. The client must constantly poll for updates, which can consume CPU and network resources. On the server side, handling frequent requests from multiple clients can lead to high resource usage. Additionally, if the data doesn’t change for long periods, say an hour, this method wastes substantial computational and network bandwidth on both ends. However, for less complex systems, it still has its benefits!

Let’s look at one example:

Remember the HTTP APIs example. We can write a polling system which polls every 5 seconds and gets the top most post.

import requests

url = "https://jsonplaceholder.typicode.com/posts"

import time


cnt = 0
# UNCOMMENT THIS CODE SNIPPET AND RUN

# while True:
  
#   cnt += 1
#   response = requests.get(url)
#   print(f"{response.json()[0]['title']}")
#   print()
#   time.sleep(5)

#   if cnt == 10: break # to prevent spam

3.1.5.2. Web Sockets#

WebSockets are a network protocol that power many modern internet applications. As the name suggests, they create a persistent, two-way connection between a client and a server - much like an electrical socket. Once connected, clients remain attached to the server, and the server keeps track of all active connections.

Whenever new data is available, the server can push updates directly to the clients, allowing for real-time interaction without the need for constant polling. Clients can then process the data as needed.

A popular library for working with WebSockets is Socket.io. It simplifies building both WebSocket servers and clients, and includes fallback options for environments where native WebSocket support is limited.

3.1.5.3. Message Brokers#

Message brokers are system components that are used to connect different de-coupled systems with one another. We can think of them as an extenal link between the two. As the name suggests, it is responsible for deliveriing messages from the source to the destination. Further, it is generally an asynchronous system and thus does not block the source or the reciever’s resources. It is primarily used for one-to-one connections.

A famous example of a message broker is RabbitMQ. You can connect to a RabbitMQ instance in Python using the pika library as shown below:

Install the pika package

python -m pip install pika --upgrade

Connect to the broker and start a consumer

import pika

# Replace 'localhost' with the address of your RabbitMQ broker
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare the queue (make sure it exists or create it if not)
channel.queue_declare(queue='queue_name')

# Define the callback to handle incoming messages
def callback(ch, method, properties, body):
    print(f"[x] Received {body}")

# Register the callback with the consumer
channel.basic_consume(
    queue='queue_name',
    auto_ack=True,
    on_message_callback=callback
)

# Start consuming messages
channel.start_consuming()

Once running, the consumer function (callback) will be triggered each time a new message is pushed to the queue.

3.1.5.4. Event Bus#

Event Buses are similar to message brokers but are specifically designed for broadcasting events. Producers of events don’t need to know who will consume the messages - there can be one or many consumers subscribed to a given topic on the event bus.

For consumers, each must acknowledge the messages they receive. Event buses also persist messages for a certain duration, allowing for the ability to “replay” past events if needed - useful for debugging, reprocessing, or recovering missed data.

One of the most widely used event bus technologies is Kafka. Learn more here.