Unix/Linux Basics for Data Engineers – Complete ETL Server Guide
Linux commands are essential for every Data Engineer working with ETL pipelines. Most production data pipelines run on Linux servers, and fluency with command-line tools is invaluable for debugging, automation, monitoring, and file processing.
1. pwd – Check Current Directory
Purpose: Shows your current working directory.
pwd
Example Output:
/home/etl_user/projects/sales_pipeline
2. ls -l – List Files with Permissions
Purpose: Displays files with detailed permissions and ownership.
ls -l
Example Output:
-rwxr-xr-- 1 etl_user data_team 2048 Mar 1 02:00 etl_job.sh
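The first column of that listing encodes the file type and three permission triads (owner, group, others). A quick sketch on a scratch file (the filename is just for illustration) reproduces the example:

```shell
# Recreate the example permissions on a scratch file
touch etl_job.sh
chmod 754 etl_job.sh       # 7 = rwx owner, 5 = r-x group, 4 = r-- others

ls -l etl_job.sh
# The first column, -rwxr-xr--, reads as:
#   "-"    regular file (d = directory, l = symlink)
#   "rwx"  owner can read, write, and execute
#   "r-x"  group can read and execute
#   "r--"  everyone else can only read
```

Adding -h (ls -lh) prints sizes in human-readable units, e.g. 2.0K instead of 2048.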
3. head – View Beginning of File
Purpose: View the first few lines of a large CSV file.
head -n 10 sales.csv
Example Output:
id,name,amount
1,John,200
2,Alice,150
4. tail – View End of Log File
Purpose: View the last lines of a log file.
tail -n 50 etl_job.log
Live Monitoring:
tail -f etl_job.log
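In practice, follow mode is often combined with grep so only problem lines appear on screen. A small sketch, using a hypothetical sample log so the non-follow variant has something to match:

```shell
# Hypothetical sample log, just for illustration
printf 'INFO: start\nERROR: Database connection timeout\nINFO: done\n' > etl_job.log

# Show only error lines from the last 50 log lines
tail -n 50 etl_job.log | grep -i "error"

# For live monitoring, pipe follow mode into grep (runs until Ctrl-C):
#   tail -f etl_job.log | grep -i "error"
```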
5. grep – Search Errors in Logs
Purpose: Search specific keywords inside files.
grep -i "error" etl_job.log
Example Output:
ERROR: Database connection timeout
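A couple of grep flags come up constantly when triaging logs: -c counts matches instead of printing them, -n shows line numbers, and -C adds surrounding context. A sketch against a hypothetical sample log:

```shell
# Hypothetical sample log, just for illustration
printf 'INFO: start\nERROR: Database connection timeout\nINFO: retry\n' > etl_job.log

# Count matching lines instead of printing them
grep -ci "error" etl_job.log        # -> 1

# Show line numbers plus one line of context around each match
grep -in -C 1 "error" etl_job.log
```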
6. wc -l – Count Records
Purpose: Count the number of lines in a file (for a CSV, this includes the header row).
wc -l sales.csv
Example Output:
100001 sales.csv
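Since wc -l counts every line, a CSV with a header reports one more than the number of data records. Skipping the header with tail -n +2 gives the record count; a sketch on a small sample file:

```shell
# Small sample CSV, matching the head example above
printf 'id,name,amount\n1,John,200\n2,Alice,150\n' > sales.csv

wc -l < sales.csv            # total lines, including the header

# Count data records only, skipping the header row
tail -n +2 sales.csv | wc -l
```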
7. awk – Perform Calculations
Purpose: Perform column-based operations; the example below sums the third column (amount) of a comma-separated file.
awk -F',' '{sum+=$3} END {print sum}' sales.csv
Example Output:
350000
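Note that the one-liner above also feeds the header row into the sum (the non-numeric "amount" contributes 0 in awk, so the total is still correct, but it is cleaner to skip it). A sketch using NR>1 to skip the header, assuming the id,name,amount layout from the head example:

```shell
# Small sample CSV, matching the head example above
printf 'id,name,amount\n1,John,200\n2,Alice,150\n' > sales.csv

# Skip the header (NR>1), then report total and average of column 3
awk -F',' 'NR>1 {sum+=$3; n++} END {print "total:", sum, "avg:", sum/n}' sales.csv
# -> total: 350 avg: 175
```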
8. chmod – Fix Permission Issues
Purpose: Grant execute permission to a script.
chmod +x etl_job.sh
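chmod also accepts numeric (octal) modes, which set all three triads at once; these are common in deployment scripts. A sketch on a scratch file:

```shell
# Scratch file standing in for an ETL script
touch etl_job.sh

# Symbolic form: add execute for everyone
chmod +x etl_job.sh

# Numeric form: owner rwx (7), group r-x (5), others no access (0)
chmod 750 etl_job.sh

ls -l etl_job.sh    # first column should read -rwxr-x---
```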
9. ps -ef – Check Running Jobs
Purpose: See running processes.
ps -ef | grep etl
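One gotcha: the grep command itself appears in the ps output, so a plain grep always "finds" something. The bracket trick avoids this. A sketch, using a dummy background sleep so there is a process to find:

```shell
# Dummy background job so there is something to search for
sleep 60 &

# '[s]leep' matches "sleep" in ps output but not the grep command itself
ps -ef | grep '[s]leep'

# In practice, search for your ETL processes the same way:
#   ps -ef | grep '[e]tl'
```

Where available, pgrep -fl etl does the same without the trick.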
10. kill -9 – Stop Stuck Job
Purpose: Forcefully terminate a process by PID. SIGKILL (-9) cannot be caught, so the process gets no chance to clean up; try a plain kill first.
kill -9 12345
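A plain kill sends SIGTERM, which lets the job close files and connections cleanly; -9 is the escalation, not the default. A sketch using a throwaway process that stands in for a stuck job:

```shell
# Throwaway process standing in for a stuck ETL job (illustrative only)
sleep 300 &
pid=$!

# Graceful first: SIGTERM lets the job flush buffers and clean up
kill "$pid"

# Escalate only if the process ignores TERM:
#   kill -9 "$pid"

wait "$pid" 2>/dev/null || true   # reap the terminated process
```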
11. crontab – Schedule ETL Job
Edit Cron:
crontab -e
Example (Run Daily at 2 AM):
0 2 * * * /home/etl_user/etl_job.sh >> job.log 2>&1
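The five leading fields select when the job runs, and the trailing redirection appends stdout to job.log while 2>&1 sends stderr to the same place. An annotated sketch of the field layout, with a few common schedule patterns:

```shell
# minute  hour  day-of-month  month  day-of-week   command
#   0      2         *          *         *        /home/etl_user/etl_job.sh >> job.log 2>&1
#
# Other common schedules:
#   */15 * * * *    every 15 minutes
#   0 0 * * 1       midnight every Monday
#   0 6 1 * *       6 AM on the first day of each month
```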
12. df -h – Check Disk Space
Purpose: Show free and used space per filesystem in human-readable units (-h).
df -h
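When df shows a filesystem filling up, du pinpoints which directories are responsible. A sketch (the /var/log path is just an example; sort -h needs GNU coreutils or a recent BSD sort):

```shell
# Free space per mounted filesystem
df -h

# Five largest items under a directory, biggest first
du -sh /var/log/* 2>/dev/null | sort -rh | head -n 5
```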
13. gzip – Compress Data Files
Purpose: Compress a file before transfer or archival. Note that gzip replaces the original file with the compressed .gz version by default.
gzip sales.csv
Result:
sales.csv.gz
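Compressed files can be inspected and restored without manual unpack/repack cycles. A sketch on a small sample file (the -k flag requires GNU gzip 1.6 or newer):

```shell
# Small sample CSV, then compress it (replaces sales.csv with sales.csv.gz)
printf 'id,name,amount\n1,John,200\n' > sales.csv
gzip -f sales.csv

# Peek at compressed data without unpacking (portable form of zcat)
gzip -dc sales.csv.gz | head -n 2

# Restore the original; -k keeps the .gz as well
gunzip -k sales.csv.gz
```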
Conclusion
Mastering these Unix/Linux commands enables Data Engineers to debug production issues, monitor ETL pipelines, validate data, and automate workflows efficiently. Strong command-line knowledge significantly improves troubleshooting speed and reliability in real-world data engineering environments.