Detecting unique lines from a log file
I have a large log file and would like to detect patterns rather than specific lines.
For example:
/path/messages-20181116:11/15/2018 14:23:05.159|worker001|clusterm|I|userx deleted job 5018
/path/messages-20181116:11/15/2018 14:41:25.662|worker001|clusterm|I|userx deleted job 4895
/path/messages-20181116:11/15/2018 14:41:25.673|worker000|clusterm|I|userx deleted job 4890
/path/messages-20181116:11/15/2018 14:41:25.681|worker000|clusterm|I|userx deleted job 4889
11/09/2018 06:18:55.115|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1
11/09/2018 06:18:55.118|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1
11/09/2018 06:18:55.120|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1
11/09/2018 06:18:55.140|scheduler000|clusterm|P|PROF: job dispatching took 5.005 s (10 fast)
11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 1 job(s)
11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 5 job(s)
11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 3 job(s)
11/09/2018 06:18:55.145|scheduler000|clusterm|P|PROF: parallel matching 14 0438 107668
11/09/2018 06:18:55.148|scheduler000|clusterm|P|PROF: sequential matching 9 0261 8203
11/09/2018 06:18:55.561|scheduler000|clusterm|P|PROF(1776285440): job sorting :wc =0.006s
11/09/2018 06:18:55.564|scheduler000|clusterm|P|PROF(1776285440): job dispatching: wc=5.005
11/09/2018 06:18:55.561|scheduler000|clusterm|P|PROF(1776285440): job sorting : wc=0.006s
11/09/2018 06:18:55.564|scheduler000|clusterm|P|PROF(1776285440): job dispatching: wc =0.015
becomes something like the following:
/path/messages-*NUMBER*:*DATE* *TIME*|worker001|clusterm|I|userx deleted job *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job profiling(low job) of *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job dispatching took *NUMBER* s (*NUMBER* fast)
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: dispatched *NUMBER* job(s)
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: parallel matching *NUMBER* *NUMBER* *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: sequential matching *NUMBER* *NUMBER* *NUMBER*
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job sorting :wc =*NUMBER*s
*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job dispatching: wc=*NUMBER*
This would greatly reduce the number of lines and make analyzing/reading the log by eye easier. Basically, I want to detect the variable words and replace them with some placeholder symbol.
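To illustrate the kind of substitution I mean, here is a rough sketch with sed (the regular expressions, the placeholder names, and the messages.log file name are only illustrative assumptions, not something I already have working):

# Rough sketch only: mask dates, then times, then any remaining digit
# runs, and collapse the result to unique pattern lines. Note this also
# masks host suffixes such as worker001. GNU sed's -E option is assumed.
sed -E \
    -e 's#[0-9]{2}/[0-9]{2}/[0-9]{4}#*DATE*#g' \
    -e 's#[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+#*TIME*#g' \
    -e 's#[0-9]+(\.[0-9]+)?#*NUMBER*#g' \
    messages.log | sort -u

Something along these lines is roughly the output I am after; the hard part is doing it without knowing the patterns in advance.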
Tags: command-line, logs, wildcards, text

asked Nov 20 at 12:06 by user772266
What steps have you tried to take on your own? What were the results? Please include this in your question. – Panki, Nov 20 at 12:07
Have you looked at cut and uniq? – ctrl-alt-delor, Nov 20 at 12:14
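For reference, a minimal sketch of the cut-plus-uniq idea suggested above, using the '|' separator from the sample log (the field split and the messages.log file name are assumptions, not from the thread):

# Drop the leading path/timestamp field, then deduplicate what is left.
cut -d'|' -f2- messages.log | sort -u

This only collapses lines that become byte-identical after the cut, which is what the next comments push back on.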
uniq will only work with exactly matching lines; it will not work for two otherwise identical lines with different time stamps. With cut you need to read the whole log file, and you still don't know the patterns. – user772266, Nov 20 at 16:25
I did use sort -u | uniq, but it only treats exactly matching lines as equal, so if two lines differ only in the time stamp, both will be printed. – user772266, Nov 20 at 16:31
I don't understand the transformations you're expecting. Do you want the dates and times replaced by *DATE* *TIME*, or are those placeholders for real values of some sort? What makes a line non-unique? – Jeff Schaller, Nov 20 at 21:04