How to remove duplicate lines
I am trying to create a simple program that removes duplicate lines from a file. However, I am stuck. My goal is to keep exactly one copy of each duplicated line, which is different from the suggested duplicate question, so that I still have that data. I would also like the program to read from and write to the same filename. When I tried making both filenames the same, it just output an empty file.
input_file = "input.txt"
output_file = "input.txt"
seen_lines = set()
outfile = open(output_file, "w")
for line in open(input_file, "r"):
    if line not in seen_lines:
        outfile.write(line)
        seen_lines.add(line)
outfile.close()
input.txt
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Keep the change ya filthy animal
Did someone say peanut butter?
Did someone say peanut butter?
Keep the change ya filthy animal
Expected output
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?
python text-files
You open the file twice, since input_file and output_file are the same. The second time you open as read, which is where I think your problem is. So you won't be able to write.
– busybear
3 hours ago
@busybear Yes. Open your file as r+ to read and write to the file at the same time (they will both work).
– Ethan K
3 hours ago
Possible duplicate of How might I remove duplicate lines from a file?
– glennv
3 hours ago
edited 3 hours ago by Mad Physicist
asked 3 hours ago by Mark (new contributor)
6 Answers
The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file:
- Open a temp file for writing
- Process the input to the new output
- Close both files
- Move the temp file to the input file name
This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.
Here is a sample using tempfile.NamedTemporaryFile, and a with block to make sure everything is closed properly, even in case of error:
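from tempfile import NamedTemporaryFile
from shutil import move

input_file = "input.txt"
output_file = "input.txt"
seen_lines = set()

# Write the unique lines to a temporary file; the input is only read here.
with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
    for line in input:
        # Compare without the trailing newline, since the last line may lack one.
        sline = line.rstrip('\n')
        if sline not in seen_lines:
            output.write(line)
            seen_lines.add(sline)

# Both files are closed at this point, so the original can be replaced.
move(output.name, output_file)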
The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.
Note also that I'm stripping the newline from each line in the set, since the last line might not have one.
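To see why the comparison key is stripped, here is a small sketch that uses an in-memory list in place of a file. The sample lines and the helper name dedupe are made up for illustration. Without stripping, a final line that has no trailing newline does not match an earlier copy that has one:

```python
def dedupe(lines, normalize):
    """Keep the first occurrence of each line, comparing lines via normalize()."""
    seen = set()
    kept = []
    for line in lines:
        key = normalize(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Hypothetical file contents: the last line has no trailing newline.
lines = ["Pizza\n", "Peanut butter\n", "Pizza"]

# Comparing raw lines treats "Pizza\n" and "Pizza" as different.
raw = dedupe(lines, lambda s: s)

# Comparing stripped lines treats them as the same line.
stripped = dedupe(lines, lambda s: s.rstrip("\n"))

print(raw)
print(stripped)
```

With the raw comparison the final "Pizza" survives as a false non-duplicate; with the stripped comparison only the first copy is kept.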
Alt Solution
If you don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:
input_file = "input.txt"
output_file = "input.txt"
with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input)
with open(output_file, 'w') as output:
    for line in unique:
        output.write(line)
        output.write('\n')
You can compare this against
with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input.readlines())
with open(output_file, 'w') as output:
    output.write('\n'.join(unique))
The second version does the same thing, but loads and writes all at once. The only difference in the output is that it does not write a newline after the last line.
I get an error of outfile is not defined
– Mark
3 hours ago
Just a question: this way of removing duplicates is very slow if there are over 100,000 lines. Is there a better way? Also, I am still getting the same error.
– Mark
2 hours ago
@Mark. With that size, your I/O is the bottleneck. I doubt you can do much to speed it up.
– Mad Physicist
2 hours ago
@Mark. Fixed the error. It was just a typo
– Mad Physicist
2 hours ago
@Mark. I've proposed an alternative
– Mad Physicist
2 hours ago
The problem is that you're trying to write to the same file that you're reading from. You have at least three options:
Option 1
Use different filenames (e.g. input.txt and output.txt). This is, at some level, easiest.
Option 2
Read all data in from your input file, close that file, then open the file for writing.
with open('input.txt', 'r') as f:
    lines = f.readlines()
seen_lines = set()
with open('input.txt', 'w') as f:
    for line in lines:
        if line not in seen_lines:
            seen_lines.add(line)
            f.write(line)
Option 3
Open the file for both reading and writing using r+ mode. You need to be careful in this case to read the data you're going to process before writing. If you do everything in a single loop, the loop iterator may lose track.
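A minimal sketch of Option 3, assuming the whole file fits in memory. The function name dedupe_in_place is made up for illustration; seek() and truncate() are the standard file-object methods. All lines are read before anything is written, and the file is truncated at the end because the deduplicated content is shorter than the original:

```python
def dedupe_in_place(path):
    """Remove duplicate lines from the file at path, keeping first occurrences."""
    with open(path, "r+") as f:
        lines = f.readlines()   # read everything before writing anything
        f.seek(0)               # rewind to the start of the file
        seen = set()
        for line in lines:
            if line not in seen:
                seen.add(line)
                f.write(line)
        f.truncate()            # discard the leftover tail of the old content
```

Note that an error partway through the write loop can leave the file partially overwritten, which is the trade-off compared with writing to a separate file first.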
Or use r+ for reading and writing.
– Ethan K
3 hours ago
seen_lines = []  # a list, so the original line order is kept
with open('input.txt','r') as infile:
    lines=infile.readlines()
for line in lines:
    line_stripped=line.strip()
    if line_stripped not in seen_lines:
        seen_lines.append(line_stripped)
with open('input.txt','w') as outfile:
    for line in seen_lines:
        outfile.write(line)
        if line != seen_lines[-1]:
            # In text mode '\n' is translated to the platform line ending;
            # writing os.linesep here would produce '\r\r\n' on Windows.
            outfile.write('\n')
Output:
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?
This fixes the problem and is a good solution for small input files, but note that it will be quite slow (quadratic time) for large files due to the linear search through seen_lines.
– Flight Odyssey
3 hours ago
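One way to avoid the linear search, sketched here under the assumption that lines are compared after stripping whitespace as in the answer above: keep a set for membership tests alongside the list that records the order. The function name unique_in_order and the sample data are hypothetical.

```python
def unique_in_order(lines):
    """Return stripped lines with duplicates removed, in first-seen order."""
    seen = set()      # average O(1) membership test
    ordered = []      # remembers the order of first occurrences
    for line in lines:
        stripped = line.strip()
        if stripped not in seen:
            seen.add(stripped)
            ordered.append(stripped)
    return ordered


sample = ["a\n", "b\n", "a\n", "c\n", "b"]
print(unique_in_order(sample))  # ['a', 'b', 'c']
```

The set costs some extra memory, but each membership check no longer scans every line kept so far.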
When I use this code, I see Keep the change ya filthy animal twice in the output?
– Mark
3 hours ago
@Mark I tested the code and I don't see it. Can you copy the code as it is and try again? Maybe you made some unintentional mistake while typing it.
– bitto
3 hours ago
Wait, I think it's because the last line has the EOF at the end of the line, so it sees it as not a duplicate. I tested it. If the last line is a duplicate line, it always keeps it because of the EOF. Any way around this? I am on Windows, by the way.
– Mark
3 hours ago
@Mark stackoverflow.com/questions/18857352/… might help. I can't say for sure. I am on Ubuntu.
– bitto
3 hours ago
I believe this is the easiest way to do what you want:
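with open('FileName.txt', 'r+') as i:
    AllLines = i.readlines()
    i.seek(0)  # rewind so that writing starts at the top of the file
    for line in dict.fromkeys(AllLines):  # unique lines, in original order
        i.write(line)
    i.truncate()  # cut off whatever is left of the longer original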
At that point it would be much simpler to reopen for writing. If you're removing lines, there will be a tail left in the file.
– Mad Physicist
2 hours ago
Try the below code, using list comprehension with str.join and set and sorted:
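input_file = "input.txt"
output_file = "input.txt"

# Read everything first; opening the same file with "w" beforehand would truncate it.
with open(input_file, "r") as infile:
    l = [i.rstrip() for i in infile.readlines()]

# sorted(..., key=l.index) restores the original order of the unique lines.
with open(output_file, "w") as outfile:
    outfile.write('\n'.join(sorted(set(l), key=l.index)))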
Just my two cents, in case you happen to be able to use Python3. It uses:
- A reusable Path object which has a handy write_text() method.
- An OrderedDict as data structure to satisfy the constraints of uniqueness and order at once.
- A generator expression instead of Path.read_text() to save on memory.
# in-place removal of duplicate lines, while preserving order
from collections import OrderedDict
from pathlib import Path

filepath = Path("./duplicates.txt")
with filepath.open() as _file:
    # fromkeys keeps the first occurrence of each line, in order
    no_duplicates = OrderedDict.fromkeys(line.rstrip('\n') for line in _file)
filepath.write_text("\n".join(no_duplicates))
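On Python 3.7 and later a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. The helper name remove_duplicate_lines is an assumption for illustration, and unlike the snippet above this variant ends every line, including the last, with a newline:

```python
from pathlib import Path


def remove_duplicate_lines(filepath):
    """Rewrite filepath in place with duplicate lines removed, keeping order."""
    path = Path(filepath)
    with path.open() as f:
        # Keys are the lines without their newline; dict keeps first-seen order.
        unique = dict.fromkeys(line.rstrip("\n") for line in f)
    # The input file is closed by now, so it is safe to overwrite it.
    path.write_text("".join(line + "\n" for line in unique))
```

It could be called as remove_duplicate_lines("./duplicates.txt"), reusing the file name from the snippet above.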
First answer (temporary file): answered 3 hours ago by Mad Physicist, edited 2 hours ago
Second answer (list of options): answered 3 hours ago by Jonah Bishop, edited 3 hours ago
1
Or user+
for reading and writing.
– Ethan K
3 hours ago
add a comment |
1
Or user+
for reading and writing.
– Ethan K
3 hours ago
1
1
Or use
r+
for reading and writing.– Ethan K
3 hours ago
Or use
r+
for reading and writing.– Ethan K
3 hours ago
add a comment |
import os
seen_lines =
with open('input.txt','r') as infile:
lines=infile.readlines()
for line in lines:
line_stripped=line.strip()
if line_stripped not in seen_lines:
seen_lines.append(line_stripped)
with open('input.txt','w') as outfile:
for line in seen_lines:
outfile.write(line)
if line != seen_lines[-1]:
outfile.write(os.linesep)
Output:
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?
New contributor
This fixes the problem and is a good solution for small input files, but note that it will be quite slow (quadratic time) for large files due to the linear search throughseen_lines
.
– Flight Odyssey
3 hours ago
When I use this code, I seeKeep the change ya filthy animal
twice in the output?
– Mark
3 hours ago
@Mark I tested the code and i don't see it. Can you copy the code as it is and try again? may be you made some unintentional mistake while typing it.
– bitto
3 hours ago
seen_lines = []
with open('input.txt', 'r') as infile:
    lines = infile.readlines()
for line in lines:
    # strip the line ending so the last line (which may lack one) still matches earlier copies
    line_stripped = line.strip()
    if line_stripped not in seen_lines:
        seen_lines.append(line_stripped)
with open('input.txt', 'w') as outfile:
    for line in seen_lines:
        outfile.write(line)
        if line != seen_lines[-1]:
            # text mode already translates '\n' to the platform line ending
            outfile.write('\n')
Output:
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?
edited 2 hours ago
answered 3 hours ago
bitto
This fixes the problem and is a good solution for small input files, but note that it will be quite slow (quadratic time) for large files due to the linear search through seen_lines.
– Flight Odyssey
3 hours ago
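On the quadratic-time point above: a minimal sketch, assuming the lines fit in memory, that keeps a set next to the list so each membership test is constant time on average. The function name dedupe_lines and the sample data are only illustrative and are not from the answer.

```python
def dedupe_lines(lines):
    """Return the lines without duplicates, keeping first occurrences in order."""
    seen = set()    # average O(1) membership test, unlike searching a list
    unique = []     # preserves the original order
    for line in lines:
        key = line.rstrip("\r\n")  # compare without the line ending
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


sample = [
    "I really love christmas\n",
    "Keep the change ya filthy animal\n",
    "Pizza is my fav food\n",
    "Keep the change ya filthy animal\n",
    "Did someone say peanut butter?\n",
    "Did someone say peanut butter?\n",
    "Keep the change ya filthy animal\n",
]
result = dedupe_lines(sample)
print("\n".join(result))
```

The returned list can then be written back with '\n'.join(...) once the input file has been closed.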
When I use this code, I see "Keep the change ya filthy animal" twice in the output?
– Mark
3 hours ago
@Mark I tested the code and I don't see it. Can you copy the code as it is and try again? Maybe you made an unintentional mistake while typing it.
– bitto
3 hours ago
Wait, I think it's because the last line has the EOF at the end of the line, so it is not seen as a duplicate. I tested it: if the last line is a duplicate line, it is always kept because of the EOF. Is there any way around this? I am on Windows, by the way.
– Mark
3 hours ago
@Mark stackoverflow.com/questions/18857352/… might help. I can't say for sure; I am on Ubuntu.
– bitto
3 hours ago
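One plausible reading of the "EOF" symptom in the comments above is that the final line of the file has no trailing newline, so the raw string differs from earlier copies that end in '\n'. A small sketch of that effect, using io.StringIO as a stand-in for the real file (an assumption made only to keep the example self-contained):

```python
import io

# Two identical lines; only the first one ends with a newline.
fake_file = io.StringIO(
    "Keep the change ya filthy animal\nKeep the change ya filthy animal"
)
lines = list(fake_file)

raw_seen = set(lines)                                    # compares line endings too
stripped_seen = {line.rstrip("\r\n") for line in lines}  # compares content only

print(lines)
print(len(raw_seen), len(stripped_seen))
```

Comparing the stripped text instead of the raw line treats both copies as the same line, on Windows as well as on Linux.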
I believe this is the easiest way to do what you want:
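with open('FileName.txt', 'r+') as i:
    AllLines = i.readlines()
    i.seek(0)  # rewind before overwriting
    for line in AllLines:
        i.write(line)  # write to file (skip duplicate lines here)
    i.truncate()  # cut off whatever is left of the old content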
edited 3 hours ago
answered 3 hours ago
Matt Hawkins
At that point it would be much simpler to reopen for writing. If you're removing lines, there will be a tail left in the file.
– Mad Physicist
2 hours ago
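To illustrate the "tail" mentioned in this comment: a sketch, assuming Python 3.7+ and a scratch file created with tempfile, that rewrites a file through 'r+' with and without truncate(). The helper names rewrite_unique and demo are hypothetical.

```python
import os
import tempfile


def rewrite_unique(path, truncate):
    """Rewrite path in place via 'r+', keeping the first copy of each line."""
    with open(path, "r+") as f:
        lines = f.readlines()
        f.seek(0)                           # rewind to overwrite from the start
        f.writelines(dict.fromkeys(lines))  # dict keys keep first-seen order
        if truncate:
            f.truncate()                    # cut the file at the current position


def demo(truncate):
    fd, path = tempfile.mkstemp(suffix=".txt")  # scratch file for the demo
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("a\nb\na\nc\n")
        rewrite_unique(path, truncate)
        with open(path) as f:
            return f.read()
    finally:
        os.remove(path)


without_truncate = demo(truncate=False)
with_truncate = demo(truncate=True)
print(repr(without_truncate))
print(repr(with_truncate))
```

Without truncate() the end of the old content survives past the newly written lines; reopening the file with 'w' after reading, as the comment suggests, avoids the issue altogether.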
Try the below code, using a list comprehension with str.join, set and sorted:
input_file = "input.txt"
output_file = "input.txt"
# read everything first: opening the same file with "w" empties it immediately
with open(input_file, "r") as infile:
    l = [i.rstrip() for i in infile.readlines()]
with open(output_file, "w") as outfile:
    # set() drops duplicates; sorting by l.index restores first-occurrence order
    outfile.write('\n'.join(sorted(set(l), key=l.index)))
answered 2 hours ago
U9-Forward
Just my two cents, in case you happen to be able to use Python 3. It uses:
- A reusable Path object which has a handy write_text() method.
- An OrderedDict as data structure to satisfy the constraints of uniqueness and order at once.
- A generator expression instead of Path.read_text() to save on memory.
# in-place removal of duplicate lines, while preserving order
from collections import OrderedDict
from pathlib import Path
filepath = Path("./duplicates.txt")
with filepath.open() as _file:
    no_duplicates = OrderedDict.fromkeys(line.rstrip('\n') for line in _file)
filepath.write_text("\n".join(no_duplicates))
edited 59 mins ago
answered 1 hour ago
timmwagener
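A possible variation on the answer above, sketched under the assumption of Python 3.7+, where a plain dict also preserves insertion order. It additionally keeps a final newline, which the join alone drops. The function name remove_duplicate_lines and the temporary directory are only for illustration.

```python
import tempfile
from pathlib import Path


def remove_duplicate_lines(filepath: Path) -> None:
    """Rewrite filepath in place, keeping the first occurrence of each line."""
    with filepath.open() as _file:
        # a plain dict keeps insertion order, so it can stand in for OrderedDict
        no_duplicates = dict.fromkeys(line.rstrip("\n") for line in _file)
    text = "\n".join(no_duplicates)
    # end the file with a newline unless it is empty
    filepath.write_text((text + "\n") if text else "")


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "duplicates.txt"
    demo_path.write_text("x\ny\nx\n")
    remove_duplicate_lines(demo_path)
    result = demo_path.read_text()

print(repr(result))
```

The file is fully read and closed before write_text() reopens it, so the same path can safely serve as input and output.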
You open the file twice, since input_file and output_file are the same. The second time you open as read, which is where I think your problem is. So you won't be able to write.
– busybear
3 hours ago
@busybear Yes. Open your file as r+ to read and write to the file at the same time (they will both work).
– Ethan K
3 hours ago
Possible duplicate of How might I remove duplicate lines from a file?
– glennv
3 hours ago
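As a follow-up to the comments above about opening the same file twice: opening a path in "w" mode truncates it right away, so a later read of the same path finds nothing, which would explain the empty output file in the question. A small sketch using a scratch file from tempfile (the file and variable names are hypothetical):

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")  # scratch file for the demo
os.close(fd)
with open(path, "w") as f:
    f.write("line 1\nline 2\n")

outfile = open(path, "w")            # truncates the file to zero length at once
with open(path, "r") as infile:
    lines_seen = infile.readlines()  # nothing is left to read
outfile.close()
os.remove(path)

print(lines_seen)
```

Reading all lines first and only then opening the file for writing avoids the problem.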