linux - Using grep to find difference between two big wordlists
I have a 78k-line .txt file with British words and a 5k-line .txt file with the most common British words. I want to subtract the most common words from the big list, so that I end up with a new list containing only the less common words.
I have solved my problem some other way, but I would really like to know what I am doing wrong here, since it does not work.
I have tried the following:
// To make sure the lines are trimmed:
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleaned.txt
// But this method apparently gives me two empty files.
If I run the grep without the cuts first, I get words that I know are in both files.
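As an aside, the empty files have a simple explanation: the shell truncates the redirection target before the command in the pipeline ever reads it, so `cut ... 78kfile.txt > 78kfile.txt` reads an already-empty file, and the `tac | tac` trick cannot save you because the truncation happens when the pipeline is set up. A minimal sketch of the safe pattern, using invented sample data rather than the real wordlists:

```shell
# "cmd file > file" empties the input before cmd reads it.
# Write to a temporary file instead, then rename it over the original.
printf 'alpha extra\nbravo extra\n' > 78kfile.txt   # stand-in sample data

cut -d' ' -f1 78kfile.txt > 78kfile.txt.tmp
mv 78kfile.txt.tmp 78kfile.txt

cat 78kfile.txt   # first field of each line survives
```

The temp-file-plus-rename idiom works with any filter, not just cut.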
I have also tried:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
// No luck either
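For what it is worth, comm can do this job once both inputs are sorted and have matching line endings; `comm -23` prints only the lines unique to the first file, which is closer to what is wanted than `-3`. A small sketch with throwaway sample files (names invented here):

```shell
# comm needs both inputs sorted. -2 suppresses lines unique to file2,
# -3 suppresses lines common to both, so -23 leaves lines unique to file1.
printf 'apple\nbanana\ncherry\n' > big-sorted.txt
printf 'banana\n'                > common-sorted.txt

comm -23 big-sorted.txt common-sorted.txt   # -> apple, cherry
```

Unlike grep -i, comm compares lines case-sensitively, so normalise case first if that matters.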
If anyone wants to try this for themselves, the two text files are: brit-az-sorted.txt and 5k-most-common-sorted.txt
The problem is the line endings: brit-az-sorted.txt has DOS line endings, while the other file has Unix line endings, so with grep -x the lines never match exactly (the invisible trailing carriage return is part of each line). So first we need to fix the line endings:
dos2unix < brit-az-sorted.txt > brit-az-sorted-fixed.txt
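If dos2unix is not installed, deleting the carriage returns with tr works just as well. A sketch using a tiny inline sample instead of the real file:

```shell
# DOS lines end in \r\n; grep -x compares the entire line, so the
# stray \r makes every whole-line comparison fail. Delete the \r.
printf 'colour\r\nflavour\r\n' > brit-sample.txt

tr -d '\r' < brit-sample.txt > brit-sample-fixed.txt

grep -x 'colour' brit-sample-fixed.txt   # now matches
```

You can check for the problem in the first place with `file brit-az-sorted.txt`, which reports "with CRLF line terminators" for DOS-style files.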
Now we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-az-sorted-fixed.txt > less-common.txt
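The same command can be seen working in miniature on invented sample files (the words below are illustrations, not taken from the real lists):

```shell
# -F fixed strings, -x whole-line match, -i ignore case, -v invert,
# -f read patterns from a file: keep lines of big.txt not in common.txt.
printf 'Aardvark\nbritish\nzymurgy\n' > big.txt
printf 'British\n'                    > common.txt

grep -xivFf common.txt big.txt   # -> Aardvark, zymurgy
```

Note how -i lets the lowercase "british" in the big file match the capitalised "British" pattern, so it is removed.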
I have also added the -F flag to make sure the words are interpreted as fixed strings rather than regular expressions. This speeds things up, too.
Note that the 5k-most-common-sorted.txt file contains many words that are not in brit-az-sorted.txt. For example, "British" is in the common file but not in the large file, and "aluminum" is in the common file while the large file contains only "aluminium".
What do the grep options mean, for the curious:
-f means read the patterns from a file.
-F means treat the patterns as fixed strings, not regular expressions.
-i means ignore case.
-x means match whole lines only.
-v means invert the match; in other words, print the lines that do not match any pattern.
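A tiny illustration of what -F changes, with made-up patterns: as a regular expression the dot in "a.c" matches any character, while with -F it only matches a literal dot.

```shell
printf 'a.c\nabc\n' > data.txt
printf 'a.c\n'      > pat.txt

grep -xf  pat.txt data.txt   # regex: '.' is a wildcard -> matches a.c and abc
grep -xFf pat.txt data.txt   # fixed string -> matches only a.c
```

With plain-word patterns like the 5k list the results are the same either way; -F just avoids regex interpretation and is faster on large pattern files.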