How to properly extract unclassified parts? #14

@TLINDEN

I have the following HTML snippet:

<ul class="addetailslist">
  <li class="addetailslist--detail">
    Art<span class="addetailslist--detail--value" >
    Weitere Kinderzimmermöbel</span>
  </li>
  <li class="addetailslist--detail">
    Farbe<span class="addetailslist--detail--value" >
    Holz</span>
  </li>
  <li class="addetailslist--detail">
    Zustand<span class="addetailslist--detail--value" >
    In Ordnung</span>
  </li>
</ul>

These are 3 different attributes:

  • "Art" (en: Type) with value "Weitere Kinderzimmermöbel"
  • "Farbe" (en: Color) with value "Holz"
  • "Zustand" (en: Condition) with value "In Ordnung"

My current attempt to parse this looks like this:

type Ad struct {
  Details      []string `goquery:".addetailslist--detail--value,text"`
  [..]
}
var CONDITIONS = []string{"Neu", "Gut", "Sehr Gut", "In Ordnung"}
var COLORS = []string{"Beige", "Blau", "Braun", "Bunt", "Burgunderrot",
	"Creme", "Gelb", "Gold", "Grau", "Grün", "Holz", "Khaki", "Lavendel",
	"Lila", "Orange", "Pink", "Print", "Rot", "Schwarz", "Silber",
	"Transparent", "Türkis", "Weiß", "Sonstige"}

[..]
	for _, detail := range advertisement.Details {
		switch {
		case slices.Contains(CONDITIONS, detail):
			advertisement.Condition = detail
		case slices.Contains(COLORS, detail):
			advertisement.Color = detail
		default:
			advertisement.Type = detail
		}
	}

So, this works, kinda.

But the obvious problem is that it will fail if there are overlaps (e.g. a Type occurring as a Color) or if the site adds or removes values. I'd have to constantly monitor these lists and update my code.

As far as I understand the DOM, the attribute names "Art" or "Zustand" are just text values of the <li> elements. Of course I could write manual Go code to parse this (using a tokenizer or regexes). But look at what the string looks like if I extract the whole text of the list using goquery:".addetailslist,text":

Art
                                        Weitere Kinderzimmermöbel
                                    
                                
                                        Farbe
                                        Holz
                                    
                                
                                        Zustand
                                        In Ordnung

I could try to trim it and parse it line-wise. But how stable would that be? Any tiny change might break my code.
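To make the line-wise idea concrete, here is a minimal sketch using only the standard library. The helper name `pairLines` is hypothetical, and the sketch assumes that every label and every value ends up on its own non-empty line, which is exactly the assumption that makes this approach fragile.

```go
package main

import (
	"fmt"
	"strings"
)

// sample imitates the whitespace-heavy text shown above.
const sample = `Art
        Weitere Kinderzimmermöbel
    

        Farbe
        Holz
    

        Zustand
        In Ordnung`

// pairLines (hypothetical helper) trims every line, drops the empty ones,
// and reads the remaining lines as alternating label/value pairs.
func pairLines(text string) (map[string]string, error) {
	var lines []string
	for _, line := range strings.Split(text, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			lines = append(lines, line)
		}
	}
	// An odd count means a label or a value is missing, so refuse to guess.
	if len(lines)%2 != 0 {
		return nil, fmt.Errorf("expected label/value pairs, got %d lines", len(lines))
	}
	details := make(map[string]string, len(lines)/2)
	for i := 0; i < len(lines); i += 2 {
		details[lines[i]] = lines[i+1]
	}
	return details, nil
}

func main() {
	details, err := pairLines(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(details["Farbe"]) // Holz
}
```

This only detects a broken layout when the number of lines becomes odd. If a label and its value ever share one line, or two lines go missing at once, the pairs shift silently.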

Maybe there's a better way, do you have an idea?

Any help would be much appreciated!
Tom
