In this post, big data analytics services providers explain how to stream log files to HDFS with Flume. It introduces Flume and everything else needed to stream log files with Flume in a big data application.
Introduction to Flume for ingesting data into HDFS
In a big data application, the raw data is the starting point for every analytic operation. In this blog, I will introduce Apache Flume, which helps ingest data from many sources into HDFS so that it can be processed.
Flume is a project in the Hadoop ecosystem that ingests log data from outside systems into Hadoop. To ingest data, Flume runs one or more agents, and each agent has three mandatory components:
- Sources receive data and send it to channels.
- Channels hold the data in a queue between sources and sinks until a sink is ready to take it.
- Sinks take the data queued in the channels and move it to HDFS.
Figure 1: Flow of Flume
Java: JDK 1.7
Cloudera version: CDH4.6.0
- Make sure there is a log file on the Linux system for Flume to read.
- Create the configuration file for the Flume agent, as shown in the code walk-through below.
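For the first prerequisite, a throwaway log file can be handy when the real system log is quiet or not readable by the user running Flume. The sketch below is only an illustration: the path /tmp/flume-demo.log and the "demo event" text are assumptions and do not come from the configuration in this post, which tails /var/log/system.log.

```shell
#!/bin/sh
# Hypothetical test log; the agent in this post tails /var/log/system.log instead.
LOG_FILE="${LOG_FILE:-/tmp/flume-demo.log}"

# Append a few timestamped lines so that `tail -F` has something to pick up.
i=1
while [ "$i" -le 5 ]; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') demo event $i" >> "$LOG_FILE"
  i=$((i + 1))
done

# Confirm the file is readable and show the lines that were just written.
[ -r "$LOG_FILE" ] && tail -n 5 "$LOG_FILE"
```

To use a file like this, the command of the exec source would have to point at it instead of the system log.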
Code walk-through
This configuration file collects the log in real time by running the tail command on /var/log/system.log and delivers it to the destination location in HDFS.
# Define an exec source on myagent that runs the Linux tail command on the system log and puts the events on memory-channel
myagent.sources.tail-source.type = exec
myagent.sources.tail-source.command = tail -F /var/log/system.log
myagent.sources.tail-source.channels = memory-channel
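# Define a logger sink that writes the events it takes from memory-channel to the agent log
# Note: each event in a channel is taken by only one sink, so two sinks on the same channel share the events; use one channel per sink if both must see every event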
myagent.sinks.log-sink.channel = memory-channel
myagent.sinks.log-sink.type = logger
# Define a sink of Flume that outputs to HDFS location with data stream file type.
myagent.sinks.hdfs-sink.channel = memory-channel
myagent.sinks.hdfs-sink.type = hdfs
myagent.sinks.hdfs-sink.hdfs.writeFormat = Text
myagent.sinks.hdfs-sink.hdfs.path = hdfs:///mydata/destinationLog
myagent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Set the channel, source and sink components for this agent, and give the channel its type
myagent.channels = memory-channel
myagent.channels.memory-channel.type = memory
myagent.sources = tail-source
myagent.sinks = log-sink hdfs-sink
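An easy mistake in a file like this is to list a component under the agent but never give it a type, or to misspell the prefix of a property so that it is not attached to the component it was meant for. The helper below is a minimal sketch of a sanity check for the first case. The function name check_types is hypothetical and is not part of Flume; it only looks for a .type line for every listed source, channel and sink.

```shell
#!/bin/sh
# check_types CONFIG AGENT
# Hypothetical helper (not part of Flume): reports every source, channel or
# sink that is listed for AGENT but has no "<agent>.<kind>.<name>.type" line.
check_types() {
  conf=$1
  agent=$2
  status=0
  for kind in sources channels sinks; do
    # Names listed on the "<agent>.<kind> = name1 name2" line
    names=$(sed -n "s/^${agent}\.${kind}[[:space:]]*=[[:space:]]*//p" "$conf")
    for name in $names; do
      if ! grep -q "^${agent}\.${kind}\.${name}\.type[[:space:]]*=" "$conf"; then
        echo "missing: ${agent}.${kind}.${name}.type"
        status=1
      fi
    done
  done
  return "$status"
}

# Example call, using the file name and agent name from this post:
# check_types /mylocalconfig.conf myagent
```

The function returns a non-zero status when something is missing, so it can be placed in front of the start command in a script.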
Run this command to start the agent:
flume-ng agent -f /mylocalconfig.conf -n myagent
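While testing, it can help to start the agent in the foreground with its log printed to the console. The sketch below uses the long forms of the same options (--conf-file for -f, --name for -n) and adds --conf and -Dflume.root.logger=INFO,console. The directory /etc/flume-ng/conf is an assumption about where the Flume configuration directory lives on a packaged install; adjust it for your system.

```shell
#!/bin/sh
# Assumed locations: adjust CONF_DIR and CONF_FILE for your installation.
CONF_DIR="${CONF_DIR:-/etc/flume-ng/conf}"      # assumption: packaged install location
CONF_FILE="${CONF_FILE:-/mylocalconfig.conf}"   # agent configuration from this post
AGENT_NAME="myagent"                            # must match the prefix used in the file

# Long-form options plus console logging, useful for debugging.
FLUME_CMD="flume-ng agent --conf $CONF_DIR --conf-file $CONF_FILE --name $AGENT_NAME -Dflume.root.logger=INFO,console"

if command -v flume-ng >/dev/null 2>&1; then
  # Runs in the foreground until interrupted with Ctrl+C.
  $FLUME_CMD
else
  echo "flume-ng not found on PATH; would run: $FLUME_CMD"
fi
```

The value given to --name has to match the agent prefix in the configuration file (myagent here); otherwise the agent starts without any of the components defined above.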
Verify the result
We will perform some operations on the Linux system, such as creating and removing files.
After these operations, the Linux system log is updated, and the tail -F command picks up the changes so that Flume ingests them into the HDFS location configured above. We can check the HDFS location to see the output:
hadoop fs -text /mydata/destinationLog/* | head -n 10
It shows the new lines from the local Linux log file as they were written to the files in HDFS.
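A slightly broader check is to list the files the sink has created and to count the lines delivered so far. The sketch below assumes the hdfs.path from the configuration above and the default file naming of the HDFS sink, where files start with FlumeData and a file that is still being written carries a .tmp suffix.

```shell
#!/bin/sh
HDFS_DIR="/mydata/destinationLog"   # hdfs.path from the agent configuration

if command -v hadoop >/dev/null 2>&1; then
  # List the files the HDFS sink has written so far.
  hadoop fs -ls "$HDFS_DIR"
  # Count the delivered lines; the glob is quoted so that HDFS expands it,
  # not the local shell.
  hadoop fs -cat "$HDFS_DIR/*" | wc -l
else
  echo "hadoop CLI not found on PATH; skipping HDFS checks for $HDFS_DIR"
fi
```

Running it a second time after more activity on the system should show a larger line count if the agent is still delivering events.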
We hope this blog helps you understand Flume and the steps to configure it to ingest data from other systems into HDFS for a big data application. For queries, kindly contact the experts.