In: Economics
1-Why would Zillow use a data lake?
2-Explain dirty data and its impact on the business?
1. To better reduce prices, Zillow leverages OCR technologies in its ingestion method. The framework also enhances the user interface since the data can be accessed quicker.
At Zillow, ensuring data accuracy is a significant subject, public records information arrives in several different formats, and the organisation hires a data scientist whose full-time role is to ensure data consistency. To check for variations in the number of sales purchases, Zillow uses pattern analysis. At the data field level, there are also tests, searching for listings that have, for instance, 30,000 bedrooms. Zillow also flags certain kinds of sales, such since foreclosures, as the Zestimate figures do not use these deals.
The technology framework at Zillow includes Apache Spark. For real-time scoring, the business often uses Redis and Python. For cloud computing, Zillow taps AWS S3 and relies on AWS Redshift and Presto for its warehouse of data. When looking at historical details, Zillow clearly turns to Presto. Beyond the Zestimate, Zillow also provides the viewers with other figures, such as a Turbo Zestimate and a classification for "hot homes" (which estimates how quickly a home can sell). Many of these estimates are based on a measure of Zillow's Zestimate.
Via personalization and quest, Zillow has also invested in anticipating the needs of its customer users. Based about how sparse the signals are for a single user, Zillow uses distinct kinds of user vectors.
2. Dirty data which is unreliable, incomplete or contradictory. Experian estimates that corporations around the world believe that 26 percent of their data is polluted on average. This leads to tremendous damages. It currently costs the average corporation 15 to 25 percent of its income, and the US economy more than $3 trillion a year. Anybody who has had to work with dirty data knows how irritating it can be, but it can be hard to get your mind around the effect untilthe numbers are added up. It is important to consider where it comes from, how it impacts industry and how it can be dealt with, because dirty data costs too much, a sobering understatement.
Dirty data lacks integrity, which ensures that end-users who rely on that information waste extra time checking its authenticity, limiting efficiency and competitiveness further. Growing volumes of dirty documents contribute to further inaccuracies and mounting discrepancies by adding another manual method.
In addition to the lack of sales, filthy data effects corporations more insidiously. Just 16% of company executives trust the consistency that underlies their corporate decisions. When you can't count on your own records, more has to be done to improve records quality and reliability. Garbage in, garbage out.