Sourcing data format from multiple different structures [on hold]
Problem
I want to read the data into a dictionary:
person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
The data comes from different formats:
Format A.
dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
Format B.
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
Additional sources with further structures will be added in the future.
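To make the target concrete, here is a minimal sketch of normalizing both formats into the `person` dictionary with one plain function per source. The names `map_format_a`, `map_format_b`, `MAPPERS`, and `to_person` are hypothetical and only for illustration; a new source would add one function and one registry entry.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def map_format_a(raw: Record) -> Record:
    """Format A: the name arrives split into first and last name."""
    name = raw.get('name') or {}  # tolerate a missing 'name' key
    parts = [name.get('first_name'), name.get('last_name')]
    return {
        'name': ' '.join(p for p in parts if p),
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


def map_format_b(raw: Record) -> Record:
    """Format B: the name arrives as a single 'fullName' string."""
    return {
        'name': raw.get('fullName'),
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


# Hypothetical registry: maps a source label to its mapper function.
MAPPERS: Dict[str, Callable[[Record], Record]] = {
    'format_a': map_format_a,
    'format_b': map_format_b,
}


def to_person(source: str, raw: Record) -> Record:
    """Look up the mapper for the given source and apply it."""
    return MAPPERS[source](raw)
```

With the sample data above, `to_person('format_a', dict_a)` and `to_person('format_b', dict_b)` should both produce the same `person` dictionary.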
Background
For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use their Item or ItemLoader, but it's ruled out in my case.
There could potentially be 5-10 different structures from which the data will be read.
My ideas so far
One potential way to solve this is by taking advantage of polymorphism, where I create a Person class
class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected
and subclass it for each of the "data mappers" of the different data structures, e.g.
class FormatA(Person):
    def __init__(self, dict_a):
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')

class FormatB(Person):
    def __init__(self, dict_b):
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
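For comparison, the same mapping can be sketched without subclassing, using one dataclass with an alternate constructor per format. This is only an illustration under my own assumptions: `PersonRecord`, `from_format_a`, and `from_format_b` are hypothetical names, not part of Scrapy or SQLAlchemy.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class PersonRecord:
    """Hypothetical normalized record; the fields mirror the target dict."""
    name: Optional[str]
    email: Optional[str]
    age: Optional[int]
    connected: Optional[bool]

    @classmethod
    def from_format_a(cls, raw: Dict[str, Any]) -> "PersonRecord":
        # Format A nests the name under 'first_name' / 'last_name'.
        name = raw.get('name') or {}
        parts = [name.get('first_name'), name.get('last_name')]
        return cls(
            name=' '.join(p for p in parts if p),
            email=raw.get('workEmail'),
            age=raw.get('age'),
            connected=raw.get('connected'),
        )

    @classmethod
    def from_format_b(cls, raw: Dict[str, Any]) -> "PersonRecord":
        # Format B already carries the full name as one string.
        return cls(
            name=raw.get('fullName'),
            email=raw.get('workEmail'),
            age=raw.get('age'),
            connected=raw.get('connected'),
        )
```

Here `asdict(PersonRecord.from_format_a(dict_a))` gives a plain dictionary with the four target keys, which can then be unpacked as keyword arguments wherever the normalized data is needed.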
Now let's say I want to store these objects with SQLAlchemy
from sqlalchemy import Boolean, Column, Integer, String
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+

Base = declarative_base()

class Person(Base):
    __tablename__ = 'Person'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)
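So now I can unpack the attributes of a FormatA or FormatB instance to instantiate a new SQLAlchemy object like this:
# Person is the SQLAlchemy model here; session is assumed to be an open SQLAlchemy Session
person = Person(**vars(FormatA(dict_a)))
session.add(person)
session.commit()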
Question
I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:
- Is this a good solution? What could be the drawbacks and problems down the line?
- Is there a better, industry-tested solution in Python that is in use for this kind of input data mapping?
python design-patterns scrapy
New contributor
Maivel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
put on hold as off-topic by Jamal♦ 1 min ago
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Lacks concrete context: Code Review requires concrete code from a project, with sufficient context for reviewers to understand how that code is used. Pseudocode, stub code, hypothetical code, obfuscated code, and generic best practices are outside the scope of this site." – Jamal
If this question can be reworded to fit the rules in the help center, please edit the question.
asked 10 mins ago
Maivel