In this article we will discuss the split and csplit commands available on Linux and Unix variants. These are two different tools that split a file into smaller pieces, either by content (csplit) or by line count, size, or number of pieces (split); the pieces can later be joined again with cat. We will cover these scenarios with examples.
1. csplit based on regex match
You can csplit files based on regex match. Below I have a sample file:
# cat my_file
<report>
<bundle>
<name>Value 2018</name>
<version>2018.03</version>
</bundle>
</report>
<report>
<bundle>
<name>Value 2019</name>
<version>2019.03</version>
</bundle>
</report>
<report>
<bundle>
<name>Value 2020</name>
<version>2020.03</version>
</bundle>
</report>
In this csplit command example, I want to split the file at every line that consists of exactly "<report>", so that every block from <report> to </report> ends up in a separate file.
# csplit my_file '/^<report>$/' '{*}'
109
109
111
Here,
my_file => input file
/^<report>$/ => pattern that matches every `<report>` line
{*} => repeat the previous pattern as many times as possible
As you see csplit command has performed csplit based on regex match into three sections:
# ls -l
total 16
-rw-r--r--. 1 root root 329 Feb 7 12:34 my_file
-rw-r--r--. 1 root root 109 Feb 7 12:38 xx00
-rw-r--r--. 1 root root 109 Feb 7 12:38 xx01
-rw-r--r--. 1 root root 111 Feb 7 12:38 xx02
To verify you can check the content of any one of the split files:
# cat xx01
<report>
<bundle>
<name>Value 2019</name>
<version>2019.03</version>
</bundle>
</report>
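If you want to reproduce this kind of split without typing the sample file by hand, the following is a minimal sketch. It assumes GNU csplit from coreutils; the temporary directory, the shortened sample data, and the `report_` prefix are my own choices for illustration. Because the very first line already matches the pattern, the piece in front of it is empty, so the sketch adds --elide-empty-files to drop it.

```shell
# Sketch only: assumes GNU csplit (coreutils). The work directory and
# the report_ prefix are arbitrary names chosen for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two <report> blocks, each starting at the beginning of a line.
cat > my_file <<'EOF'
<report>
<name>Value 2018</name>
</report>
<report>
<name>Value 2019</name>
</report>
EOF

# Split in front of every line that is exactly "<report>".
# The piece before the first line is empty and gets elided.
csplit --quiet --elide-empty-files --prefix=report_ my_file '/^<report>$/' '{*}'

# Show the first line of every piece that was created.
head -n 1 report_*
```

Each resulting file should start with its own `<report>` line.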
Let's take another csplit command example to split based on a regex match. Below I have another sample file with some functions:
# cat my_file
function main1 {
    echo "function main1"
}

function main2 {
    echo "function main2"
}

function main3 {
    echo "function main3"
}

I want to break every function into a separate file. But as you see there is an empty line after every function, so we need to add an offset value. I will use "}" as the pattern:
# csplit --elide-empty-files my_file '/^}/+2' "{*}"
43
43
43
Here,
/^}/+2 => match a line starting with "}", then apply a +2 offset so the split happens after the empty line that follows it
{*} => repeat the previous pattern as many times as possible
Verify the content of these files:
# cat xx00
function main1 {
    echo "function main1"
}

2. csplit based on pattern match
Similar to regex match, you can also csplit based on pattern match. Below I have a sample file with some content:
# cat my_file
12345
asdfg
vbn
000
4634
fghvva
000
ceqdcad
433214
000
In this csplit command example, I want to split based on a pattern match, with "000" being the pattern. To keep the line matching the pattern at the end of each piece, add a +1 offset:
# csplit --elide-empty-files my_file '/000/+1' {*}
20
16
19
Here we created three small files after split, check the content of one of the files:
# cat xx01
4634
fghvva
000
Here,
{*} => repeat the pattern until the input is exhausted
/000/+1 => match the 000 pattern and add a +1 offset so the 000 line stays at the end of the current split file instead of starting the next one
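To see what the offset changes, it can help to run the same split with and without it. The sketch below assumes GNU csplit; the `plain_` and `plus1_` prefixes and the temporary directory are hypothetical names used only for this comparison.

```shell
# Sketch only: assumes GNU csplit; the prefixes are made up for the demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 12345 asdfg vbn 000 4634 fghvva 000 ceqdcad 433214 000 > my_file

# No offset: each piece ends just before a 000 line.
csplit --quiet --elide-empty-files --prefix=plain_ my_file '/000/' '{*}'

# +1 offset: each piece ends just after a 000 line.
csplit --quiet --elide-empty-files --prefix=plus1_ my_file '/000/+1' '{*}'

head -n 1 plain_01   # prints 000: the match opens the second piece
tail -n 1 plus1_00   # prints 000: the match closes the first piece
```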
3. Suppress matched content with csplit in Linux or Unix
With the csplit command you can also suppress the matched content (pattern/string/regex). Reusing the sample file from the previous example, we will suppress the lines matching the pattern "000":
# csplit --suppress-matched --elide-empty-files my_file '/000/' {*}
16
12
15
Check the content of any file:
# cat xx00
12345
asdfg
vbn
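A quick way to confirm that the matched lines are really gone is to search the pieces for the pattern afterwards. This is only a sketch, assuming a GNU csplit recent enough to support --suppress-matched; the temporary directory is an assumption of the example.

```shell
# Sketch only: assumes GNU csplit with --suppress-matched support.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 12345 asdfg vbn 000 4634 fghvva 000 ceqdcad 433214 000 > my_file

# Split at every 000 line and leave the 000 lines out of the pieces.
csplit --quiet --suppress-matched --elide-empty-files my_file '/000/' '{*}'

# Count the lines containing 000 in each piece; every count should be 0.
grep -c 000 xx*
```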
4. Remove empty files with csplit and split command
When you split files with the csplit or split command, you may end up with empty files after the split. To avoid this you can use --elide-empty-files, as shown in the csplit command example below:
# csplit my_file '/000/+1' {*}
20
16
19
0
Here, as you can see, the fourth file is 0 bytes (empty). So re-run the command using --elide-empty-files:
# csplit --elide-empty-files my_file '/000/+1' {*}
20
16
19
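The difference is easy to check side by side. The following sketch assumes GNU csplit; the `keep_` and `drop_` prefixes are hypothetical and exist only so both runs can share one directory.

```shell
# Sketch only: assumes GNU csplit; keep_ and drop_ are made-up prefixes.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 12345 asdfg vbn 000 4634 fghvva 000 ceqdcad 433214 000 > my_file

# Without the option, the piece after the last 000 line is empty but kept.
csplit --quiet --prefix=keep_ my_file '/000/+1' '{*}'

# With the option, that empty trailing piece is not created.
csplit --quiet --elide-empty-files --prefix=drop_ my_file '/000/+1' '{*}'

ls -l keep_* drop_*
```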
5. Add prefix with csplit command
In the below example we use --prefix to add the prefix "subfile_" to all the files which are created after the split:
# csplit --elide-empty-files --prefix subfile_ my_file '/000/' {*}
16
16
19
4
Check that the names of the files created after the split have the prefix "subfile_":
# ls -l
total 20
-rw-r--r--. 1 root root 55 Feb 7 20:08 my_file
-rw-r--r--. 1 root root 16 Feb 7 21:58 subfile_00
-rw-r--r--. 1 root root 16 Feb 7 21:58 subfile_01
-rw-r--r--. 1 root root 19 Feb 7 21:58 subfile_02
-rw-r--r--. 1 root root 4 Feb 7 21:58 subfile_03
6. csplit content between multiple patterns
In the earlier csplit command examples we split on a single pattern, but we can also split the content between multiple patterns. For example, we have a file with different titles and we wish to split the content under each title into a separate part.
This is my sample file
# cat my_file
title1
content1
title2
content2
title3
content3
We provide the patterns for all the titles at which we wish to split the content:
# csplit --elide-empty-files --prefix parts my_file '/title1/' '/title2/' '/title3/'
16
16
16
Check the content of the files created after split:
# cat parts01
title2
content2
# cat parts00
title1
content1
This command reads from my_file and creates these sub files (the empty piece in front of “title1” is dropped by --elide-empty-files):
parts00 => Text in my_file before “title2”
parts01 => Text starting at “title2” and ending just before “title3”
parts02 => Text starting at “title3” to the end of the file
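The mapping between patterns and output files can be checked with a small self-contained run. This is a sketch assuming GNU csplit; the temporary directory is an assumption of the example, while the patterns and the `parts` prefix are the ones used above.

```shell
# Sketch only: assumes GNU csplit; run in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' title1 content1 title2 content2 title3 content3 > my_file

# One split point per pattern; whatever follows the last split point
# goes into the final piece.
csplit --quiet --elide-empty-files --prefix=parts my_file '/title1/' '/title2/' '/title3/'

# Print every piece with its file name as a prefix on each line.
grep -H '' parts*
```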
7. csplit add suffix
By default csplit names the files it creates xx00, xx01, and so on. Besides a custom prefix, you can also define a custom suffix.
In the below csplit command example we add the suffix ".sh" to all the files created after the split:
# csplit --elide-empty-files --prefix=subfile --suffix-format="%d.sh" my_file '/000/' "{*}"
16
16
19
4
Here,
--prefix => prefix placed at the start of the name of every file created after the split
--suffix-format => printf-style format for the numeric suffix at the end of every file name (here it also carries the ".sh" extension)
Verify your files after split:
# ls -l
total 20
-rw-r--r--. 1 root root 55 Feb 7 14:49 my_file
-rw-r--r--. 1 root root 16 Feb 7 14:49 subfile0.sh
-rw-r--r--. 1 root root 16 Feb 7 14:49 subfile1.sh
-rw-r--r--. 1 root root 19 Feb 7 14:49 subfile2.sh
-rw-r--r--. 1 root root 4 Feb 7 14:49 subfile3.sh
One more example, this time zero-padding the number in the suffix to two digits:
# csplit --elide-empty-files --prefix=subfile --suffix-format="%02d.sh" my_file '/000/' "{*}"
16
16
19
4
Now our suffix contains an additional digit, as you can check below:
# ls -l
total 20
-rw-r--r--. 1 root root 55 Feb 7 14:49 my_file
-rw-r--r--. 1 root root 16 Feb 7 14:51 subfile00.sh
-rw-r--r--. 1 root root 16 Feb 7 14:51 subfile01.sh
-rw-r--r--. 1 root root 19 Feb 7 14:51 subfile02.sh
-rw-r--r--. 1 root root 4 Feb 7 14:51 subfile03.sh
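The format string behaves like a printf conversion, so the width can be varied freely. The sketch below assumes GNU csplit and uses a throwaway directory of my own choosing; everything else mirrors the command above.

```shell
# Sketch only: assumes GNU csplit; run in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 12345 asdfg vbn 000 4634 fghvva 000 ceqdcad 433214 000 > my_file

# %02d pads the piece number to two digits; ".sh" is literal text
# appended after the number.
csplit --quiet --elide-empty-files --prefix=subfile --suffix-format='%02d.sh' my_file '/000/' '{*}'

ls subfile*
```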
8. csplit files into specific count based on pattern match
The split command can split a file into a specific number of pieces, and we will get to that later. Here we use the csplit command to split on a pattern match, but we also define how many times the pattern should be searched for in the file.
Below is my sample file:
# cat my_file
#1 before first pattern
#2 before first pattern
pattern 000
#1 before second pattern
#2 before second pattern
pattern 000
#1 before third pattern
#2 before third pattern
pattern 000
I want to split using the pattern "000" but I only wish to search for this pattern once, i.e. create two split files:
# csplit --elide-empty-files --digits 1 my_file // '/000/+1' {0}
60
122
We have used {0}, which means don't repeat the pattern match, so we only search for the pattern once and create the split files. Similarly, to search for the pattern twice and create three split files use the command below:
# csplit --elide-empty-files --digits 1 my_file // '/000/+1' {1}
60
62
60
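The effect of the repeat count can be reproduced with the sketch below. It assumes GNU csplit and a throwaway directory; it also leaves out the leading `//` argument, since this particular sample does not need it.

```shell
# Sketch only: assumes GNU csplit; run in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > my_file <<'EOF'
#1 before first pattern
#2 before first pattern
pattern 000
#1 before second pattern
#2 before second pattern
pattern 000
#1 before third pattern
#2 before third pattern
pattern 000
EOF

# {1} repeats the pattern once more, so it is applied twice in total.
# Two split points give three pieces; the third holds the rest of the file.
csplit --quiet --elide-empty-files --digits=1 my_file '/000/+1' '{1}'

wc -l xx*
```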
9. split files based on number of lines
By default the split utility breaks its input into 1,000-line sections named xaa, xab, xac, and so on. The last section might be shorter. Options can change the size of the sections and the length of the names.
Below is my sample file which has 6 lines:
# cat my_file
#1 before first pattern
#2 before first pattern
#3 before second pattern
#4 before second pattern
#5 before third pattern
#6 before third pattern
I wish to split this file after every 5th line, so each file created by the split will contain at most 5 lines:
# split --lines 5 my_file
Verify the files after split
# ls -l
total 12
-rw-r--r--. 1 root root 146 Feb 7 19:03 my_file
-rw-r--r--. 1 root root 122 Feb 7 19:04 xaa
-rw-r--r--. 1 root root 24 Feb 7 19:04 xab
As expected I have 5 lines in our first file:
# cat xaa
#1 before first pattern
#2 before first pattern
#3 before second pattern
#4 before second pattern
#5 before third pattern
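The same behaviour is easy to observe on generated input. In this sketch, which assumes GNU split, the file name `numbers.txt` and the `chunk_` prefix are hypothetical; `seq` simply produces twelve numbered lines.

```shell
# Sketch only: assumes GNU split; names are made up for the demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Twelve lines of input: 1, 2, ..., 12.
seq 1 12 > numbers.txt

# Five lines per piece; the last piece takes the two leftover lines.
split --lines=5 numbers.txt chunk_

wc -l chunk_*
```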
10. split file based on size
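Next we will split a file based on size. You can pass the size to the split command with one of these unit suffixes:
K => kibibytes (1024 bytes)
M => mebibytes (1024 x 1024 bytes)
G => gibibytes (1024 x 1024 x 1024 bytes)
With GNU split, appending B (KB, MB, GB) selects powers of 1000 instead.
In this split command example I will split the file into pieces of 1M (1048576 bytes) each: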
# split --bytes 1M my_file
Verify the files:
# ls -l
total 6040
-rw-r--r--. 1 root root 3092035 Feb 7 19:24 my_file
-rw-r--r--. 1 root root 1048576 Feb 7 19:24 xaa
-rw-r--r--. 1 root root 1048576 Feb 7 19:24 xab
-rw-r--r--. 1 root root 994883 Feb 7 19:24 xac
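To see how the unit suffix affects the piece size, the sketch below splits the same small file twice. It assumes GNU split; `blob.bin` and the `bin_` and `dec_` prefixes are hypothetical names, and the input is just 2500 zero bytes.

```shell
# Sketch only: assumes GNU split; names are made up for the demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 2500 bytes of dummy data.
head -c 2500 /dev/zero > blob.bin

# 1K means 1024 bytes per piece.
split --bytes=1K blob.bin bin_

# 1KB means 1000 bytes per piece.
split --bytes=1KB blob.bin dec_

wc -c bin_* dec_*
```

With 1K the pieces come out as 1024, 1024 and 452 bytes; with 1KB they are 1000, 1000 and 500 bytes.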
11. Add suffix or extension using split command
We showed csplit command examples earlier to add a suffix or extension; the same can be done with the split command using --additional-suffix, as shown below:
# split --additional-suffix ".ext" -b 1M my_file
Next verify the files with the suffix extension:
# ls -l
total 6080
-rw-r--r--. 1 root root 3109034 Feb 7 19:48 my_file
-rw-r--r--. 1 root root 1048576 Feb 7 19:55 xaa.ext
-rw-r--r--. 1 root root 1048576 Feb 7 19:55 xab.ext
-rw-r--r--. 1 root root 1011882 Feb 7 19:55 xac.ext
12. Add numerical suffix followed by additional suffix
Along with the extension added by --additional-suffix, we can also switch to numeric suffixes using --numeric-suffixes:
# split --numeric-suffixes --additional-suffix ".ext" -b 1M my_file
Next verify the additional suffix
# ls -l
total 9120
-rw-r--r--. 1 root root 3109034 Feb 7 19:48 my_file
-rw-r--r--. 1 root root 1048576 Feb 7 20:00 x00.ext
-rw-r--r--. 1 root root 1048576 Feb 7 20:00 x01.ext
-rw-r--r--. 1 root root 1011882 Feb 7 20:00 x02.ext
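A scaled-down version of this run is sketched below. It assumes a GNU split that supports --additional-suffix; `blob.bin` is a hypothetical 2500-byte input and the piece size is reduced to 1K so the example stays small.

```shell
# Sketch only: assumes GNU split with --additional-suffix support.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
head -c 2500 /dev/zero > blob.bin

# Numeric suffixes (00, 01, ...) replace the default aa, ab, ...;
# ".ext" is appended after the numeric part of each name.
split --numeric-suffixes --additional-suffix=.ext --bytes=1K blob.bin

ls x*.ext
```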
13. Add prefix using split command
In the earlier csplit command examples I shared the syntax to add a prefix with csplit; we can do the same thing with the split command by passing the prefix as the last argument.
In this example we add the prefix "prefix_" at the beginning of every file created after the split:
# split my_file prefix_
Verify the file names:
# ls -l
total 6096
-rw-r--r--. 1 root root 3109034 Feb 7 19:48 my_file
-rw-r--r--. 1 root root 220073 Feb 7 20:05 prefix_aa
-rw-r--r--. 1 root root 221369 Feb 7 20:05 prefix_ab
-rw-r--r--. 1 root root 200474 Feb 7 20:05 prefix_ac
14. Split based on specific count
Earlier we used csplit to split a file into a specific number of pieces based on a pattern match. With split we do not match a pattern, but we can define the number of files to be created after the split.
Here we want to split my_file into 3 parts:
# split --number 3 my_file
Verify the same:
# ls -l
total 6088
-rw-r--r--. 1 root root 3109034 Feb 7 19:48 my_file
-rw-r--r--. 1 root root 1036344 Feb 7 20:06 xaa
-rw-r--r--. 1 root root 1036344 Feb 7 20:06 xab
-rw-r--r--. 1 root root 1036346 Feb 7 20:06 xac
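Here is a small sketch of the same idea, assuming GNU split and a hypothetical 2500-byte input named `blob.bin`. How the leftover bytes are distributed when the size is not divisible by the count may differ between coreutils versions, so the sketch only checks the number of pieces and their combined size.

```shell
# Sketch only: assumes GNU split; blob.bin is a made-up input file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
head -c 2500 /dev/zero > blob.bin

# Ask for exactly three pieces, whatever their individual sizes.
split --number=3 blob.bin

# List the pieces and their sizes, then the combined size.
wc -c xa?
cat xa? | wc -c
```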
15. Join files which you had split earlier
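You can join the pieces back into a single file by concatenating them with cat. The shell expands the wildcard in sorted order, which matches the order of the suffixes, so the pieces are combined in the right sequence: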
# cat split-files-* > your_new_filename
We had split some files earlier so I will use this method to again combine all the split files to create a new one
# cat xa* > my_new_file
Now you can verify the content of your new file:
# cat my_new_file
#1 before first pattern
#2 before first pattern
#3 before second pattern
#4 before second pattern
#5 before third pattern
#6 before third pattern
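Rather than comparing the content by eye, you can let cmp confirm that the rejoined file is identical to the original. The sketch below assumes GNU split; `original.txt`, `rejoined.txt`, and the `piece_` prefix are hypothetical names.

```shell
# Sketch only: assumes GNU split; all names are made up for the demo.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A 1000-line input file.
seq 1 1000 > original.txt

# Split into pieces of 300 lines each (the last piece is shorter).
split --lines=300 original.txt piece_

# The glob expands in sorted order: piece_aa, piece_ab, ...
cat piece_* > rejoined.txt

# cmp is silent and returns success when both files are identical.
cmp original.txt rejoined.txt && echo "files are identical"
```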
Lastly I hope this article on split command and csplit command examples for different scenarios on Linux or Unix was helpful. So, let me know your suggestions and feedback using the comment section.