What Are the Qualities and Challenges of Real Estate Data?

Real estate data available online commonly suffers from quality problems and a lack of regularity. In this blog, we show some of these problems with examples and describe the challenges of real estate data management: gathering, cleaning, and normalizing the data. We focus mainly on the United States market, although some of the challenges apply to other countries as well.

Data Sources

Many sources of real estate data are available in the US. There are well-known online resources like Redfin.com, Zillow.com, Realtor.com, and Trulia.com that started out listing “Properties for Sale”, and a few of them also carry “Properties for Rent”.

There are also resources that specialize in “Properties for Rent”, like Hotpads.com, Apartmentlist.com, Zumper.com, ForRent.com, Apartments.com, etc. While most of these websites are nationwide, there are also many niche local resources.

There is also an old and important data source known as the MLS (Multiple Listing Service). This “resource” is not a single company or website but a collection of different regional companies, websites, and data sources. In addition, there are data integrators and aggregators that try to merge and offer data from different MLSs.

Most MLSs have their own schemas, databases, and taxonomies, and there are no universal definitions of what a half vs. full bath means or what counts as usable square footage. There is no real agreement even on property addresses. In older areas, sources do not agree on fundamental facts such as the total number of bedrooms in a property or its square footage, or on the sources of those figures.

Let’s discuss the important data points, explore the present state of the data, and look at the problems we normally see.

Location or Address
Introduction

One of the main data points for any property is its location or address. A property is a tangible asset and needs a fixed address. The address is generally the “unique key” that most people want to use to match data between different datasets. Properties cannot just exist in space; they require a physical address.

Addresses in the US are expected to be reasonably standard, although there are many variations in the non-mainland states and territories. Generally, an address has Address Line 1 and 2, City, State, and Zip, and that is what the USPS considers standard. The USPS lists the elements in detail as follows.

Complete Address Elements
  • Addressee’s name or other identifier and/or firm name, where applicable.
  • Private mailbox designator and number (PMB 300 or #300).
  • Urbanization name (Puerto Rico only, ZIP Code prefixes 006 to 009, if the area is so designated).
  • Street number and name (including predirectional, suffix, and postdirectional as shown in the USPS ZIP+4 Product for the delivery address), or rural route and box number (RR 6 BOX 9), Highway Contract route and box number (HC 5 BOX 40), or Post Office Box number (PO BOX 485), as shown in the USPS ZIP+4 Product for the delivery address. (“PO Box” is used incorrectly if it precedes a private box number.)
  • Secondary address unit designator and number (such as an apartment or suite number (APT 202, STE 100)).
  • City and state (or authorized two-letter state abbreviation).
  • Correct 5-digit ZIP Code or ZIP+4 code. If a firm name is assigned a unique ZIP+4 code in the USPS ZIP+4 Product, that unique ZIP+4 code must be used in the delivery address.

An immediate and logical question that comes to mind: if the address format is so well standardized, then what is the difficulty?
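To make the question concrete, here is a minimal Python sketch of the address elements as a record with a format check. The class name `UsAddress`, its field names, and the `problems` method are our own assumptions for illustration, not a USPS or vendor API, and the check looks only at the shape of the state and ZIP fields.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# A 5-digit ZIP Code with an optional ZIP+4 extension.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
# A two-letter, upper-case state abbreviation (shape only, not a lookup).
STATE_RE = re.compile(r"[A-Z]{2}")


@dataclass(frozen=True)
class UsAddress:
    """Hypothetical record for the main address elements."""
    street: str                       # street number and name, e.g. "12 MAIN ST"
    city: str
    state: str                        # two-letter abbreviation, e.g. "MA"
    zip_code: str                     # "02134" or "02134-1234"
    unit: Optional[str] = None        # secondary unit, e.g. "APT 202"
    addressee: Optional[str] = None

    def problems(self) -> List[str]:
        """Return format problems only; this cannot tell whether the address exists."""
        found = []
        if not STATE_RE.fullmatch(self.state):
            found.append("state is not a two-letter abbreviation")
        if not ZIP_RE.fullmatch(self.zip_code):
            found.append("zip_code is not a 5-digit ZIP Code or ZIP+4 code")
        return found
```

Note that an address can pass a format check like this and still be wrong: a Boston address entered with Zip 01234 instead of 02134 is perfectly well formed.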

Let’s go through some examples of the problems seen in the real world today.

Data Entry Mistakes

Ordinary spelling and typing mistakes can appear on any line:

Incorrect Street Names – 15 Mapel St vs. 15 Maple St

Incorrect Numbers – 1234 Main St (which does not even exist) in place of 123 Main St

Incorrect City Names – Bolton vs. Boston (both exist in MA)

Incorrect Zip Codes – 02134 vs. 01234

Any of these mistakes is hard to detect and makes it hard to compare different data sets.

E.g., is 13 Mapel St the same property as 13 Maple St, or are they two different properties?

Is the data really for Bolton and not Boston, or is it the other way around?
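A quick way to see why these questions are hard is to score the pairs with a generic string similarity measure. The sketch below uses Python's standard `difflib.SequenceMatcher`; the `similarity` helper is our own name, and this is an illustration of the problem rather than a recommended matching method.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# The example pairs from the list of data entry mistakes.
pairs = [
    ("15 Mapel St", "15 Maple St"),
    ("Bolton", "Boston"),
    ("02134", "01234"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

A typo and a genuinely different value can both score high. Bolton and Boston are two real cities in MA, yet they look almost as alike as Mapel and Maple, so a similarity threshold alone cannot decide whether two records describe the same property.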

Deliberately Mistaken Data

One common reason for fake or wrong data in this industry is that real estate agents try to get around the rules of the different MLSs. MLSs have rules for their members, and some of those rules are enforced automatically by their technology systems. One of the simple rules for keeping data clean is preventing “duplicate listings”.

MLS systems run a quick, basic check on the address of every listed property, which does not allow agents to list the exact same property more than once. For various reasons, agents still want duplicate listings, so they create new, fake listings.

Let’s go through a general scenario:

In the competitive Boston property market, a unit has 2 official bedrooms and, because of the housing shortage, a former study has been “converted” into a bedroom. That might be acceptable to some renters, but others might be turned off by the small bedroom or the higher rent (more bedrooms command more rent).

Some agents create two MLS listings in such cases. Let’s assume there is only one unit for rent and that it is located at 12 Main St Unit 1 in Boston, MA.

An agent might create two marginally different MLS listings like:

12 Main St Unit 1 in Boston MA – 3 Bedrooms & 1 bath at $2500

12 Main St Unit 1A in Boston MA – 2 Bedrooms & 1 bath at $2000

This helps the realtor reach two groups of renters with different budgets: one group looking for 3 bedrooms with a higher budget (preferable for the owner and the realtor), and, as a backup listing, a group that needs 2 bedrooms at a lower budget (less desirable for the owner and the realtor, BUT a fine backup strategy if no prospective tenant needs the 3rd bedroom and the property would otherwise remain unrented).
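One way to surface listings like these for review is to group them on a deliberately loose key. The Python sketch below is a simple heuristic under our own assumptions: the record fields, the sample ids, and the helper names are hypothetical, and the key keeps only the digits of the unit so that “Unit 1” and “Unit 1A” collide.

```python
import re
from collections import defaultdict

# Hypothetical listing records; the first two mirror the scenario above.
listings = [
    {"id": "A1", "street": "12 Main St", "unit": "Unit 1", "city": "Boston", "beds": 3, "rent": 2500},
    {"id": "A2", "street": "12 Main St", "unit": "Unit 1A", "city": "Boston", "beds": 2, "rent": 2000},
    {"id": "B1", "street": "40 Elm St", "unit": "Unit 2", "city": "Boston", "beds": 1, "rent": 1800},
]


def loose_unit_key(unit: str) -> str:
    """Keep only the first run of digits, so 'Unit 1' and 'Unit 1A' share a key."""
    match = re.search(r"\d+", unit)
    return match.group(0) if match else unit.strip().lower()


def suspicious_groups(rows):
    """Return lists of listing ids that share street, city, and loose unit key."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["street"].lower(), row["city"].lower(), loose_unit_key(row["unit"]))
        groups[key].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


print(suspicious_groups(listings))
```

This only flags candidates for a human to check. A building can legitimately have both a Unit 1 and a Unit 1A, so the heuristic will also produce false positives.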

Here are some current listings from Zillow that show some of these data issues. The listings are from the same realtor, and we just can’t tell whether they are for the same property or not. The given address is not a valid one, and the property type also differs: townhouse vs. apartment. The pictures for the first listing also look off for the neighborhood: the finishes are upscale, but the image resolution is awful. It looks like one or both of these listings are fake, but they might also be real, and therein lies the real data problem.


We have seen unit numbers on Zillow in Boston like TH3L, 2OJ, etc., which are extremely unlikely in an old city with old, conventional naming conventions.

Here is an example of 4 different variations on the same property in Boston, which has 3 units called Unit 1, 2, and 3, not units U1, 2OJ, X1, or TH3L. The neighborhood name Brighton is also used interchangeably with the city name Boston, which adds to the data mismatch.

As you can see, it is not easy even for a multi-billion dollar company like Zillow to ensure the quality of its own data, let alone to clean data from disparate sources. There are inherent data management problems in the source data, and there are also people who try to game the system for personal benefit.
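The Brighton vs. Boston mismatch can be reduced with a small alias table applied before records are compared. The Python sketch below is only an illustration: the table `NEIGHBORHOOD_TO_CITY` and the function `canonical_city` are hypothetical names, and a real table would need to be far larger and maintained per state.

```python
# Hypothetical alias table: Boston neighborhoods that show up in the city field.
NEIGHBORHOOD_TO_CITY = {
    ("brighton", "MA"): "Boston",
    ("allston", "MA"): "Boston",
    ("dorchester", "MA"): "Boston",
}


def canonical_city(city: str, state: str) -> str:
    """Map a neighborhood name to its city; otherwise return the city in title case."""
    key = (city.strip().lower(), state.strip().upper())
    return NEIGHBORHOOD_TO_CITY.get(key, city.strip().title())


print(canonical_city("Brighton", "MA"))
print(canonical_city("boston", "ma"))
```

The state is part of the key on purpose, because a name that is a neighborhood in one state can be a separate city or town in another.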

Address Normalization

One of the first steps when you collect property data from multiple sources is being able to identify and compare two properties.

The data needs to be cleaned and normalized to a standard format in order to check whether a listing for rent or sale is a duplicate or a unique listing within one website, and also to tell whether it is the same listing across different websites.

One of the best approaches to address normalization and property data management is to rely on companies that have dedicated ample time and effort to making the process easier. One of these companies is Google, with its widely used Google Maps. The Google Geocoding API is a good place to begin.

The API is very easy to use (but costly at larger volumes), and once every address has been normalized, addresses can be compared reliably. E.g., in the example above, trying to match 13 Mapel St with 13 Maple St directly is not conclusive, as it might be the same property or two different properties.

Address normalization cleans up the data in most cases. If it was only a data entry error and there is no such address as 13 Mapel St, we would know that only after the normalization process. The Google API might correct the address to 13 Maple St in some cases, and in others it might not, which leaves us with the problem of correcting the address ourselves (if that is possible at all).
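As a rough sketch of what such a normalization call can look like, the Python below queries the Geocoding API's JSON endpoint using only the standard library. The function names and the injectable `fetch` parameter are our own choices, you need your own API key, and error handling is kept to a minimum. It reads the `status`, `formatted_address`, and `partial_match` fields of the response.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def fetch_json(url: str) -> dict:
    """Perform the HTTP GET; kept separate so it can be swapped out in tests."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def normalize_address(raw: str, api_key: str, fetch=fetch_json):
    """Return (formatted_address, partial_match) for the top result, or None."""
    query = urllib.parse.urlencode({"address": raw, "key": api_key})
    payload = fetch(f"{GEOCODE_URL}?{query}")
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    top = payload["results"][0]
    # partial_match is present when the geocoder did not match the request exactly.
    return top["formatted_address"], bool(top.get("partial_match", False))
```

Two raw addresses can then be compared by their normalized strings instead of their raw text. A result flagged as a partial match, or no result at all, is a signal that the address needs manual review rather than automatic merging.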

The real estate data management issues discussed here, and the examples given, are only a small subset of what we have encountered in real life. However, after reading this blog, you should better understand the existing challenges and know that there are no magical or easy solutions to this problem.

To scrape real estate data, you can contact X-Byte Enterprise Crawling or ask for a free quote!
