Catherine Baudin: E-commerce Intelligence: The Art of Mining Semi-Structured Marketplaces
From Anita Borg Institute Wiki
Catherine Baudin is a Research Scientist at the eBay Research Lab. Further details can be found here.
Product suggestions are harder on eBay than on Amazon or Netflix, which only carry books and movies. eBay has a much wider variety of goods, and because it is a marketplace, anybody can come and sell or buy: established, experienced sellers and buyers as well as inexperienced ones. The key difference is that there are a hundred million items managed by many different people rather than by a single store or a few people in a store. How do you mine the data in these settings? How do you help a seller find what they are looking for?
Marketplace opportunity: there are millions of products, anybody can sell, and selling crosses borders.
Technical challenge 1: Search. Structured data from a single vendor contrasts with classifieds, which are semi-structured or unstructured text. Classifieds search is a full-text search: searching for "iPod" returns everything that has "iPod" in the title, but not all returned products are actually iPods; some are just accessories. Searching a Walmart store for "iPod" returns a list of players and a link to an iPod accessories search. The system decides that if you are searching for "iPod" you probably mean an iPod player. Moreover, it gives you options to refine the results by model, type, etc., so it has some structured knowledge of the data. In a marketplace, when the same item is listed by different sellers in different ways, the system cannot establish that they are, in fact, the same item. For example, one listing for Gucci glasses can state that the lens colour is brown while another lists it as bronze.
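The contrast between the two kinds of search can be sketched in a few lines. This is only an illustration, not eBay's or Walmart's actual search: the sample listings, the `product_type` field and both function names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    title: str
    product_type: str  # hypothetical structured field: "player" or "accessory"


# Made-up sample data for illustration only.
LISTINGS = [
    Listing("Apple iPod nano 8GB silver", "player"),
    Listing("Leather case for iPod nano", "accessory"),
    Listing("iPod classic 160GB black", "player"),
    Listing("USB charger cable iPod compatible", "accessory"),
]


def full_text_search(listings, term):
    """Classifieds-style search: match the term anywhere in the title."""
    term = term.lower()
    return [item for item in listings if term in item.title.lower()]


def structured_search(listings, term, product_type):
    """Catalogue-style search: text match, then filter on a structured field."""
    return [
        item
        for item in full_text_search(listings, term)
        if item.product_type == product_type
    ]


all_hits = full_text_search(LISTINGS, "iPod")
players = structured_search(LISTINGS, "iPod", "player")
print(len(all_hits), "full-text hits, of which", len(players), "are players")
```

The structured variant only works because every listing carries a reliable `product_type`; in a marketplace that field is often missing or inconsistent, which is the problem the rest of the talk addresses.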
Technical challenge 2: Product Recommendation. If a group of people has bought a coffee maker and a milk frother in the past, then when a new person searches for that coffee maker you can suggest that this frother is often bought with it. Moreover, if people also buy a certain brand of sunglasses with that coffee maker, you can suggest the sunglasses to a person searching for that coffee maker. However, if the system cannot tell that the coffee maker is the same model, it cannot make suggestions. So how do you extract data and features from unstructured text? You can analyze the item description to work out that "Gucci", for example, is a brand and that brown or bronze is a colour. Then how can you derive relationships between features that are close to each other, for example silver versus platinum? You have no guarantee that the seller will describe any given product in a certain way. Some people working at eBay have manually mapped the colours as they were extracted from text to help the system estimate such relationships.
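A minimal sketch of both ideas follows: pulling a brand and a normalised colour out of a free-text title, and recommending items by raw co-purchase counts. It is not eBay's method. The brand list, the hand-made colour mapping, the basket data and all function names are assumptions for illustration, and the baskets are assumed to contain item keys that have already been normalised.

```python
import re
from collections import Counter

# Hypothetical lexicons. The colour table stands in for a hand-curated mapping.
BRANDS = {"gucci", "prada"}
COLOUR_GROUPS = {
    "brown": "brown",
    "bronze": "brown",
    "silver": "silver",
    "platinum": "silver",
}


def extract_features(title):
    """Return (brand, normalised colour) found in a free-text listing title."""
    tokens = re.findall(r"[a-z]+", title.lower())
    brand = next((t for t in tokens if t in BRANDS), None)
    colour = next((COLOUR_GROUPS[t] for t in tokens if t in COLOUR_GROUPS), None)
    return brand, colour


def recommend(baskets, item, top_n=2):
    """Items most often bought together with `item`, by raw co-occurrence."""
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        if item in items:
            counts.update(items - {item})
    return [other for other, _ in counts.most_common(top_n)]


# Two differently worded listings map to the same feature pair.
first = extract_features("GUCCI sunglasses, brown lens")
second = extract_features("Authentic Gucci glasses bronze lenses")
print(first == second)

# Made-up purchase history using already-normalised item keys.
BASKETS = [
    ["coffee maker", "milk frother"],
    ["coffee maker", "milk frother", "sunglasses"],
    ["coffee maker", "sunglasses"],
    ["coffee maker", "milk frother"],
]
suggestions = recommend(BASKETS, "coffee maker")
print(suggestions)
```

The co-purchase counts are only meaningful once listings for the same model share a key, which is why the extraction step comes first.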
Technical challenge 3: Multiple Market Types. Some markets are product-based and structured; others are less structured. Games, books, movies and similar products are easy to categorize, while other items, a Furby for example, are difficult to place in a category. Items such as fashion and clothing fall in the middle. You can either try to structure the data or accept that the data will always be unstructured and deal with that.
How do you recommend a product to sellers? For someone selling high-end shampoo, suggest another product that sells well, is often searched for and is not available. You can also predict demand by examining supply levels. To find hot products, search for a given term, "Brazilian carotene" for example, and see which products with these words have sold and which have not.
Phase 1: Demand analysis. Track user searches to see which products are searched for and then bid on or purchased frequently enough.
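Demand tracking of this kind can be sketched as a count of searches per query and of how many of them led to a bid or purchase. The event log format, the thresholds and the function name below are assumptions chosen for illustration; the talk does not specify what "frequently enough" means.

```python
from collections import defaultdict

# Hypothetical event log: (search query, outcome), where outcome is
# "none", "bid" or "purchase".
EVENTS = [
    ("Littmann stethoscope", "purchase"),
    ("littmann stethoscope", "bid"),
    ("littmann stethoscope", "none"),
    ("littmann stethoscope", "purchase"),
    ("wedding cake topper", "none"),
    ("wedding cake topper", "purchase"),
    ("wedding cake topper", "none"),
    ("furby", "purchase"),
]


def demand_report(events, min_searches=3, min_conversion=0.5):
    """Return {query: conversion rate} for queries that pass both thresholds."""
    searches = defaultdict(int)
    conversions = defaultdict(int)
    for query, outcome in events:
        key = query.strip().lower()  # treat case variants as the same query
        searches[key] += 1
        if outcome in ("bid", "purchase"):
            conversions[key] += 1
    report = {}
    for key, count in searches.items():
        rate = conversions[key] / count
        if count >= min_searches and rate >= min_conversion:
            report[key] = rate
    return report


in_demand = demand_report(EVENTS)
print(in_demand)
```

Queries seen too rarely are dropped before the conversion rate is considered, so a single lucky purchase does not register as demand.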
Phase 2: Supply analysis. Separate the products returned by the query into those that are the product searched for and those that are not close enough. Use search constraints that are descriptive and specific enough: price, brand, etc. Then look for a niche. For example, of all stethoscopes, the Littmann ones sell best. Wedding cake toppers are a harder case: the system has trouble working out why certain ones do not sell even though they are priced between those that do. When a category manager finds a niche, they contact a power seller with a product suggestion.
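One way to look for a niche is to filter listings with a few constraints and then compare the share of listings that sold within each group. The sketch below does this by brand. The listing records, the price floor used as a crude filter against accessories, the 0.8 threshold and the function names are all assumptions for illustration, not figures from the talk.

```python
from collections import defaultdict

# Hypothetical listing records: (title, brand, price, sold)
LISTINGS = [
    ("Littmann Classic II stethoscope", "Littmann", 65.0, True),
    ("Littmann Cardiology III stethoscope", "Littmann", 120.0, True),
    ("Littmann Lightweight stethoscope", "Littmann", 40.0, True),
    ("Generic dual-head stethoscope", "Generic", 15.0, False),
    ("Generic stethoscope black", "Generic", 12.0, True),
    ("Generic stethoscope blue", "Generic", 14.0, False),
    ("Stethoscope name tag", "Generic", 5.0, False),
]


def sell_through_by_brand(listings, keyword, min_price=0.0):
    """Share of matching listings that sold, grouped by brand."""
    listed = defaultdict(int)
    sold = defaultdict(int)
    for title, brand, price, was_sold in listings:
        if keyword.lower() not in title.lower() or price < min_price:
            continue  # not close enough to the product searched for
        listed[brand] += 1
        if was_sold:
            sold[brand] += 1
    return {brand: sold[brand] / count for brand, count in listed.items()}


def find_niches(rates, threshold=0.8):
    """Brands whose sell-through rate meets the (assumed) threshold."""
    return sorted(brand for brand, rate in rates.items() if rate >= threshold)


rates = sell_through_by_brand(LISTINGS, "stethoscope", min_price=10.0)
niches = find_niches(rates)
print(rates, niches)
```

A rate-based view like this would not explain the wedding cake topper case, where price alone does not separate the listings that sell from those that do not.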
Mining semi-structured data is a mix of methods that try to improve on one aspect or another. You can also add more structure to the data and combine methods.
Q&A
Q: In your market, when you take a low-supply, high-demand item and suggest it to a seller, how do you avoid driving a price drop?
A: There are constraints to it and we introduce a threshold to ensure that the supply remains low enough.
Q: Do you have different algorithms for single purchase products versus multi-purchase ones? For example, a coffee machine is usually one per household while you can have multiple sweaters.
A: We do. Suggestions are based on similarity, so if you buy a sweater we recommend a similar sweater. Interestingly, at Amazon the aggregated data alone allows for very accurate suggestions.
Q: Do you look at machine learning and user interaction?
A: Classification means assigning categories. Look at example items that are already categorized and assign similar categories based on that. You can increase the quality by giving an example set of items that sell well; e.g., for Gucci glasses, use the brand as the category as well.
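Classification by example can be illustrated with a nearest-neighbour lookup over word overlap (Jaccard similarity). This is a stand-in technique chosen for brevity, not necessarily what the speaker's team uses; the labelled examples, category names and function names are hypothetical.

```python
import re

# Hypothetical, already-categorized example listings: (title, category)
EXAMPLES = [
    ("Gucci sunglasses brown lens", "Sunglasses"),
    ("Ray-Ban aviator sunglasses", "Sunglasses"),
    ("Italian espresso coffee maker", "Kitchen"),
    ("Stainless steel milk frother", "Kitchen"),
]


def tokens(text):
    """Lower-case word set of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(title, examples):
    """Assign the category of the most similar labelled example."""
    query = tokens(title)
    _, category = max(examples, key=lambda ex: jaccard(query, tokens(ex[0])))
    return category


glasses_category = classify("Gucci glasses bronze lens", EXAMPLES)
frother_category = classify("Electric milk frother", EXAMPLES)
print(glasses_category, frother_category)
```

A title that shares no words with any example still receives the category of the first example, so a real system would need a minimum-similarity cut-off.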
Q: Have you considered matching products to existing external product databases?
A: eBay bought shopping.com recently because it has a large product catalogue. But no catalogue is complete, so that is still an open problem.
Q: Have you tried to correct the error a seller makes in describing the product based on make and model?
A: You can apply such changes interactively, but the question is how much you bother the user in order to obtain correct metadata. Right now they have a choice of defining or omitting features. The more you ask the user to enter, the more constraints you put on them, so there is a limit to that.
NOTE:
The presentation slides are pending legal clearance approval and may be added at a later date.