Saturday 24 October 2020

How To Stream a Log File to HDFS Using Flume in a Big Data Application

In this post, big data analytics services providers explain how to stream a log file to HDFS with Flume. It introduces Flume and everything else needed to stream a log file in a big data application.

Introduction: using Flume to ingest data into HDFS

In a big data application, the raw data is essential for further analytic operations. In this blog, I will introduce Apache Flume, which helps ingest data from many sources into our HDFS so we can process it.

Flume is a sub-project of the Hadoop ecosystem that ingests log data from outside systems into Hadoop. To ingest the data, Flume runs one or more agents, and each agent has the three mandatory components below:

  • Sources receive data and send it to channels.
  • Channels queue the data while it passes between sources and sinks.
  • Sinks take the data queued in channels and move it to HDFS.
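The wiring between these three components can be sketched in a minimal agent configuration. The agent and component names below (a1, s1, c1, k1) are illustrative, not from the post; note that a source is bound with the plural `channels` property, while a sink is bound with the singular `channel`:

```properties
# A hypothetical agent a1 with one source, one channel and one sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# The source writes into the channel; the sink reads from it
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```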

Flow of Flume

Figure 1: Flow of Flume


Prerequisites

Java: JDK 1.7

Cloudera version: CDH 4.6.0

Initial steps

  1. Make sure we have a log file in our Linux system (here, /var/log/system.log).
  2. Create the configuration file for the Flume agent as shown below.
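For step 1, you can sanity-check that the log file exists and is readable before wiring up the agent. The snippet below simulates this with a stand-in file; the /tmp path and the sample log line are illustrative, not from the post:

```shell
# Append a sample line to a stand-in log file (hypothetical /tmp path)
LOG=/tmp/system.log
echo "$(date '+%b %d %H:%M:%S') myhost sample log line" >> "$LOG"

# Confirm the file exists and show its most recent line, as tail -F would
tail -n 1 "$LOG"
```

On a real system you would point this at /var/log/system.log (or your distribution's equivalent) instead of the /tmp stand-in.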


Code walk through

This configuration file collects the real-time log output of the Linux tail command from /var/log/system.log and delivers it to a destination location in HDFS.

# List the source, channel and sink components for this agent
myagent.sources = tail-source
myagent.channels = memory-channel
myagent.sinks = log-sink hdfs-sink

# Define a memory channel that buffers events between the source and the sinks
myagent.channels.memory-channel.type = memory

# Define an exec source that runs the Linux tail command on the system log file
myagent.sources.tail-source.type = exec
myagent.sources.tail-source.command = tail -F /var/log/system.log
myagent.sources.tail-source.channels = memory-channel

# Define a logger sink that reads from memory-channel and writes events to the agent log
myagent.sinks.log-sink.type = logger
myagent.sinks.log-sink.channel = memory-channel

# Define an HDFS sink that reads from memory-channel and writes plain-text data-stream files to an HDFS location
myagent.sinks.hdfs-sink.type = hdfs
myagent.sinks.hdfs-sink.channel = memory-channel
myagent.sinks.hdfs-sink.hdfs.path = hdfs:///mydata/destinationLog
myagent.sinks.hdfs-sink.hdfs.writeFormat = Text
myagent.sinks.hdfs-sink.hdfs.fileType = DataStream

Run this command to start the agent:

flume-ng agent -f /mylocalconfig.conf -n myagent 

Verify the result

We will perform some operations on our Linux system, such as creating and removing files:

vi a
rm a 

After these operations, the Linux system log will be updated, and the tail -F command will ingest those changes into the HDFS location we configured above. We can check the HDFS location to see the output:

hadoop fs -text /mydata/destinationLog/* | head -n 10
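If you want to rehearse this verification step without a cluster, the same pipeline can be simulated locally. Here /tmp/destinationLog is a hypothetical stand-in for the HDFS path, with cat in place of hadoop fs -text:

```shell
# Create a stand-in output directory with one Flume-style data file
mkdir -p /tmp/destinationLog
printf 'line %s\n' 1 2 3 > /tmp/destinationLog/FlumeData.1

# Read every file in the directory and keep only the leading lines,
# mirroring: hadoop fs -text /mydata/destinationLog/* | head -n 10
cat /tmp/destinationLog/* | head -n 2
```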

It will show the latest changes from the local Linux log file, now stored in our HDFS file.

The agenda of big data analytics services providers was to help you understand Flume and its use for streaming a log file to HDFS. For queries, kindly contact the experts.

Hope this blog helps you understand the steps to configure Flume to ingest data from other systems into our HDFS for a big data application.

Vijay is a compulsive blogger who likes to educate like-minded people on various new technologies and trends. He works with Aegis SoftTech as a software developer and has been developing software for years. Stay connected with him on Facebook and Google+.