linux - Using grep to find difference between two big wordlists
I have a 78k-line .txt file with British words and a 5k-line .txt file with the most common British words. I want to subtract the most common words from the big list, so that I end up with a new list containing only the less common words.
I have solved my problem some other way, but I would really like to know what I am doing wrong here, since it does not work.
I have tried the following:
// To make sure the lines are trimmed:
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleaned.txt
// But this method apparently gives me two empty files.
If I run the grep without the cuts first, I get words that I know are in both files.
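As an aside, the empty files have a simple explanation: the shell truncates the redirection target before the command in the pipeline ever reads it, so `cut ... 78kfile.txt > 78kfile.txt` reads an already-empty file, and the `tac | tac` trick cannot save you because the truncation happens when the pipeline is set up. A minimal sketch of the safe pattern, using invented sample data rather than the real wordlists:

```shell
# "cmd file > file" empties the input before cmd reads it.
# Write to a temporary file instead, then rename it over the original.
printf 'alpha extra\nbravo extra\n' > 78kfile.txt   # stand-in sample data

cut -d' ' -f1 78kfile.txt > 78kfile.txt.tmp
mv 78kfile.txt.tmp 78kfile.txt

cat 78kfile.txt   # first field of each line survives
```

The temp-file-plus-rename idiom works with any filter, not just cut.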
I have also tried:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
// No luck either
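For what it is worth, comm can do this job once both inputs are sorted and have matching line endings; `comm -23` prints only the lines unique to the first file, which is closer to what is wanted than `-3`. A small sketch with throwaway sample files (names invented here):

```shell
# comm needs both inputs sorted. -2 suppresses lines unique to file2,
# -3 suppresses lines common to both, so -23 leaves lines unique to file1.
printf 'apple\nbanana\ncherry\n' > big-sorted.txt
printf 'banana\n'                > common-sorted.txt

comm -23 big-sorted.txt common-sorted.txt   # -> apple, cherry
```

Unlike grep -i, comm compares lines case-sensitively, so normalise case first if that matters.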
If anyone wants to try this for themselves, the two text files are: brit-az-sorted.txt and 5k-most-common-sorted.txt
The problem is the line endings: brit-az-sorted.txt has DOS line endings, while the other file has Unix line endings, so with grep -x the lines never match exactly (the invisible trailing carriage return is part of each line). So first we need to fix the line endings:
dos2unix < brit-az-sorted.txt > brit-az-sorted-fixed.txt
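If dos2unix is not installed, deleting the carriage returns with tr works just as well. A sketch using a tiny inline sample instead of the real file:

```shell
# DOS lines end in \r\n; grep -x compares the entire line, so the
# stray \r makes every whole-line comparison fail. Delete the \r.
printf 'colour\r\nflavour\r\n' > brit-sample.txt

tr -d '\r' < brit-sample.txt > brit-sample-fixed.txt

grep -x 'colour' brit-sample-fixed.txt   # now matches
```

You can check for the problem in the first place with `file brit-az-sorted.txt`, which reports "with CRLF line terminators" for DOS-style files.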
Now we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-az-sorted-fixed.txt > less-common.txt
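The same command can be seen working in miniature on invented sample files (the words below are illustrations, not taken from the real lists):

```shell
# -F fixed strings, -x whole-line match, -i ignore case, -v invert,
# -f read patterns from a file: keep lines of big.txt not in common.txt.
printf 'Aardvark\nbritish\nzymurgy\n' > big.txt
printf 'British\n'                    > common.txt

grep -xivFf common.txt big.txt   # -> Aardvark, zymurgy
```

Note how -i lets the lowercase "british" in the big file match the capitalised "British" pattern, so it is removed.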
I have also added the -F flag to make sure the words are interpreted as fixed strings rather than regular expressions. This speeds things up, too.
Note that the 5k-most-common-sorted.txt file contains many words that are not in brit-az-sorted.txt. For example, "British" is in the common file but not in the large file, and "aluminum" is in the common file while the large file contains only "aluminium".
What do the grep options mean, for the curious:
-f means read the patterns from a file.
-F means treat the patterns as fixed strings, not regular expressions.
-i means ignore case.
-x means match whole lines only.
-v means invert the match; in other words, print the lines that do not match any pattern.
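A tiny illustration of what -F changes, with made-up patterns: as a regular expression the dot in "a.c" matches any character, while with -F it only matches a literal dot.

```shell
printf 'a.c\nabc\n' > data.txt
printf 'a.c\n'      > pat.txt

grep -xf  pat.txt data.txt   # regex: '.' is a wildcard -> matches a.c and abc
grep -xFf pat.txt data.txt   # fixed string -> matches only a.c
```

With plain-word patterns like the 5k list the results are the same either way; -F just avoids regex interpretation and is faster on large pattern files.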