Sourcing data format from multiple different structures [on hold]
Problem
I want to read the data into a dictionary like this:
person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
The data comes in different formats:
Format A.
dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
Format B.
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False
}
There will be additional sources added in the future, each with its own structure.
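To make the goal concrete, here is a minimal sketch of the mapping I am after, written as plain functions rather than classes. The names `from_format_a`, `from_format_b`, `MAPPERS`, and `normalize` are hypothetical and chosen only for illustration; under this assumption, a new source would add one function and one registry entry.

```python
# Hypothetical helpers: one small function per source format, each returning
# the common `person` dictionary shape.

def from_format_a(raw):
    """Map Format A (nested name) to the common shape."""
    name = raw.get('name') or {}
    parts = [name.get('first_name'), name.get('last_name')]
    return {
        'name': ' '.join(p for p in parts if p),  # skip missing name parts
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


def from_format_b(raw):
    """Map Format B (flat fullName) to the common shape."""
    return {
        'name': raw.get('fullName'),
        'email': raw.get('workEmail'),
        'age': raw.get('age'),
        'connected': raw.get('connected'),
    }


# Registry of known formats; the keys 'a' and 'b' are illustrative labels.
MAPPERS = {'a': from_format_a, 'b': from_format_b}


def normalize(source, raw):
    """Look up the mapper for `source` and apply it to the raw record."""
    try:
        mapper = MAPPERS[source]
    except KeyError:
        raise ValueError(f'unknown source format: {source!r}') from None
    return mapper(raw)


dict_a = {
    'name': {'first_name': 'John', 'last_name': 'Doe'},
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
person = {
    'name': 'John Doe',
    'email': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}

# Both formats should end up in the same target shape.
assert normalize('a', dict_a) == person
assert normalize('b', dict_b) == person
```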
Background
For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use their Item or ItemLoader, but that is ruled out in my case.
There could potentially be 5-10 different structures from which the data will be read.
My ideas so far
One potential way to solve this is by taking advantage of polymorphism, where I create a Person class
class Person:
    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected
and subclass it for each of the "data mappers" of the different data structures, e.g.
class FormatA(Person):
    def __init__(self, dict_a):
        self.name = ' '.join([dict_a.get('name').get(key) for key in ['first_name', 'last_name']])
        self.email = dict_a.get('workEmail')
        self.age = dict_a.get('age')
        self.connected = dict_a.get('connected')

class FormatB(Person):
    def __init__(self, dict_b):
        self.name = dict_b.get('fullName')
        self.email = dict_b.get('workEmail')
        self.age = dict_b.get('age')
        self.connected = dict_b.get('connected')
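For reference, here is a self-contained variation of the idea above that can be run as-is. It differs from my snippet in one respect: each subclass calls `super().__init__`, so the base class stays the only place where attributes are set. It also tolerates a missing `name` entry in Format A, which is an assumption about how incomplete records should be handled. The sample dictionaries repeat the ones from the Problem section.

```python
class Person:
    """Common target shape shared by every source format."""

    def __init__(self, name, email, age, connected):
        self.name = name
        self.email = email
        self.age = age
        self.connected = connected


class FormatA(Person):
    """Mapper for the nested-name structure."""

    def __init__(self, dict_a):
        name = dict_a.get('name') or {}
        full_name = ' '.join(
            name.get(key, '') for key in ('first_name', 'last_name')
        ).strip()
        super().__init__(
            name=full_name,
            email=dict_a.get('workEmail'),
            age=dict_a.get('age'),
            connected=dict_a.get('connected'),
        )


class FormatB(Person):
    """Mapper for the flat fullName structure."""

    def __init__(self, dict_b):
        super().__init__(
            name=dict_b.get('fullName'),
            email=dict_b.get('workEmail'),
            age=dict_b.get('age'),
            connected=dict_b.get('connected'),
        )


dict_a = {
    'name': {'first_name': 'John', 'last_name': 'Doe'},
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}
dict_b = {
    'fullName': 'John Doe',
    'workEmail': 'johndoe@email.com',
    'age': 50,
    'connected': False,
}

# vars(obj) is the instance's __dict__; both mappers yield the same attributes.
assert vars(FormatA(dict_a)) == vars(FormatB(dict_b))
```

Note that the attributes live on instances, so it is `vars(FormatA(dict_a))` that holds the mapped values, not the `__dict__` of the class itself.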
Now let's say I want to store these objects with SQLAlchemy:
from sqlalchemy import Column, Integer, String, Boolean
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+; older versions: sqlalchemy.ext.declarative

Base = declarative_base()

class Person(Base):
    __tablename__ = 'Person'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    email = Column(String)
    age = Column(Integer)
    connected = Column(Boolean)
So now I can unpack the __dict__ of a FormatA or FormatB instance to instantiate a new SQLAlchemy object like this:
person = Person(**FormatA(dict_a).__dict__)
session.add(person)  # session: an SQLAlchemy Session, assumed to be created elsewhere
session.commit()
Question
I have limited experience in Python programming, and I'm building my first scraper application for a data engineering project that needs to scale to storing millions of Persons from tens of different structures. I was wondering:
- Is this a good solution? What could be the drawbacks and problems down the line?
- Is there a better, industry-tested solution in Python that is in use for this kind of input data mapping?
python design-patterns scrapy
put on hold as off-topic by Jamal♦ 1 min ago
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "Lacks concrete context: Code Review requires concrete code from a project, with sufficient context for reviewers to understand how that code is used. Pseudocode, stub code, hypothetical code, obfuscated code, and generic best practices are outside the scope of this site." – Jamal
If this question can be reworded to fit the rules in the help center, please edit the question.
Asked 10 mins ago by Maivel (new contributor)