Unix/Linux Basics for Data Engineers – Complete ETL Server Guide
Linux commands are essential for every Data Engineer working with ETL pipelines. Most production data pipelines run on Linux servers, and fluency with command-line tools is invaluable for debugging, automation, monitoring, and file processing.
1. pwd – Check Current Directory
Purpose: Shows your current working directory.
pwd
Example Output:
/home/etl_user/projects/sales_pipeline
2. ls -l – List Files with Permissions
Purpose: Displays files with detailed permissions and ownership.
ls -l
Example Output:
-rwxr-xr-- 1 etl_user data_team 2048 Mar 1 02:00 etl_job.sh
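The first column of that listing encodes the file type and three permission triads (owner, group, others). A quick sketch on a scratch file (the filename is just for illustration) reproduces the example:

```shell
# Recreate the example permissions on a scratch file
touch etl_job.sh
chmod 754 etl_job.sh       # 7 = rwx owner, 5 = r-x group, 4 = r-- others

ls -l etl_job.sh
# The first column, -rwxr-xr--, reads as:
#   "-"    regular file (d = directory, l = symlink)
#   "rwx"  owner can read, write, and execute
#   "r-x"  group can read and execute
#   "r--"  everyone else can only read
```

Adding -h (ls -lh) prints sizes in human-readable units, e.g. 2.0K instead of 2048.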
3. head – View Beginning of File
Purpose: View the first few lines of a large CSV file.
head -n 10 sales.csv
Example Output:
id,name,amount
1,John,200
2,Alice,150
4. tail – View End of Log File
Purpose: View the last lines of a log file.
tail -n 50 etl_job.log
Live Monitoring:
tail -f etl_job.log
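In practice, follow mode is often combined with grep so only problem lines appear on screen. A small sketch, using a hypothetical sample log so the non-follow variant has something to match:

```shell
# Hypothetical sample log, just for illustration
printf 'INFO: start\nERROR: Database connection timeout\nINFO: done\n' > etl_job.log

# Show only error lines from the last 50 log lines
tail -n 50 etl_job.log | grep -i "error"

# For live monitoring, pipe follow mode into grep (runs until Ctrl-C):
#   tail -f etl_job.log | grep -i "error"
```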
5. grep – Search Errors in Logs
Purpose: Search specific keywords inside files.
grep -i "error" etl_job.log
Example Output:
ERROR: Database connection timeout
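A couple of grep flags come up constantly when triaging logs: -c counts matches instead of printing them, -n shows line numbers, and -C adds surrounding context. A sketch against a hypothetical sample log:

```shell
# Hypothetical sample log, just for illustration
printf 'INFO: start\nERROR: Database connection timeout\nINFO: retry\n' > etl_job.log

# Count matching lines instead of printing them
grep -ci "error" etl_job.log        # -> 1

# Show line numbers plus one line of context around each match
grep -in -C 1 "error" etl_job.log
```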
6. wc -l – Count Records
Purpose: Count the number of lines in a file (for a CSV, this includes the header row).
wc -l sales.csv
Example Output:
100001 sales.csv
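Since wc -l counts every line, a CSV with a header reports one more than the number of data records. Skipping the header with tail -n +2 gives the record count; a sketch on a small sample file:

```shell
# Small sample CSV, matching the head example above
printf 'id,name,amount\n1,John,200\n2,Alice,150\n' > sales.csv

wc -l < sales.csv            # total lines, including the header

# Count data records only, skipping the header row
tail -n +2 sales.csv | wc -l
```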
7. awk – Perform Calculations
Purpose: Perform column-based operations; the example below sums the third column (amount) of a comma-separated file.
awk -F',' '{sum+=$3} END {print sum}' sales.csv
Example Output:
350000
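Note that the one-liner above also feeds the header row into the sum (the non-numeric "amount" contributes 0 in awk, so the total is still correct, but it is cleaner to skip it). A sketch using NR>1 to skip the header, assuming the id,name,amount layout from the head example:

```shell
# Small sample CSV, matching the head example above
printf 'id,name,amount\n1,John,200\n2,Alice,150\n' > sales.csv

# Skip the header (NR>1), then report total and average of column 3
awk -F',' 'NR>1 {sum+=$3; n++} END {print "total:", sum, "avg:", sum/n}' sales.csv
# -> total: 350 avg: 175
```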
8. chmod – Fix Permission Issues
Purpose: Grant execute permission to a script.
chmod +x etl_job.sh
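chmod also accepts numeric (octal) modes, which set all three triads at once; these are common in deployment scripts. A sketch on a scratch file:

```shell
# Scratch file standing in for an ETL script
touch etl_job.sh

# Symbolic form: add execute for everyone
chmod +x etl_job.sh

# Numeric form: owner rwx (7), group r-x (5), others no access (0)
chmod 750 etl_job.sh

ls -l etl_job.sh    # first column should read -rwxr-x---
```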
9. ps -ef – Check Running Jobs
Purpose: See running processes.
ps -ef | grep etl
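One gotcha: the grep command itself appears in the ps output, so a plain grep always "finds" something. The bracket trick avoids this. A sketch, using a dummy background sleep so there is a process to find:

```shell
# Dummy background job so there is something to search for
sleep 60 &

# '[s]leep' matches "sleep" in ps output but not the grep command itself
ps -ef | grep '[s]leep'

# In practice, search for your ETL processes the same way:
#   ps -ef | grep '[e]tl'
```

Where available, pgrep -fl etl does the same without the trick.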
10. kill -9 – Stop Stuck Job
Purpose: Forcefully terminate a process by PID. SIGKILL (-9) cannot be caught, so the process gets no chance to clean up; try a plain kill first.
kill -9 12345
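A plain kill sends SIGTERM, which lets the job close files and connections cleanly; -9 is the escalation, not the default. A sketch using a throwaway process that stands in for a stuck job:

```shell
# Throwaway process standing in for a stuck ETL job (illustrative only)
sleep 300 &
pid=$!

# Graceful first: SIGTERM lets the job flush buffers and clean up
kill "$pid"

# Escalate only if the process ignores TERM:
#   kill -9 "$pid"

wait "$pid" 2>/dev/null || true   # reap the terminated process
```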
11. crontab – Schedule ETL Job
Edit Cron:
crontab -e
Example (Run Daily at 2 AM):
0 2 * * * /home/etl_user/etl_job.sh >> job.log 2>&1
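The five leading fields select when the job runs, and the trailing redirection appends stdout to job.log while 2>&1 sends stderr to the same place. An annotated sketch of the field layout, with a few common schedule patterns:

```shell
# minute  hour  day-of-month  month  day-of-week   command
#   0      2         *          *         *        /home/etl_user/etl_job.sh >> job.log 2>&1
#
# Other common schedules:
#   */15 * * * *    every 15 minutes
#   0 0 * * 1       midnight every Monday
#   0 6 1 * *       6 AM on the first day of each month
```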
12. df -h – Check Disk Space
Purpose: Show free and used space per filesystem in human-readable units (-h).
df -h
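When df shows a filesystem filling up, du pinpoints which directories are responsible. A sketch (the /var/log path is just an example; sort -h needs GNU coreutils or a recent BSD sort):

```shell
# Free space per mounted filesystem
df -h

# Five largest items under a directory, biggest first
du -sh /var/log/* 2>/dev/null | sort -rh | head -n 5
```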
13. gzip – Compress Data Files
Purpose: Compress a file before transfer or archival. Note that gzip replaces the original file with the compressed .gz version by default.
gzip sales.csv
Result:
sales.csv.gz
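Compressed files can be inspected and restored without manual unpack/repack cycles. A sketch on a small sample file (the -k flag requires GNU gzip 1.6 or newer):

```shell
# Small sample CSV, then compress it (replaces sales.csv with sales.csv.gz)
printf 'id,name,amount\n1,John,200\n' > sales.csv
gzip -f sales.csv

# Peek at compressed data without unpacking (portable form of zcat)
gzip -dc sales.csv.gz | head -n 2

# Restore the original; -k keeps the .gz as well
gunzip -k sales.csv.gz
```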
Conclusion
Mastering these Unix/Linux commands enables Data Engineers to debug production issues, monitor ETL pipelines, validate data, and automate workflows efficiently. Strong command-line knowledge significantly improves troubleshooting speed and reliability in real-world data engineering environments.