reddit Q&A – How to extract data from HTML into Excel

reddit - How to extract HTML
reddit – How to extract HTML

Here’s a question from redditor Sajano90:

… need to know how i can create a list in excel with every product by its name and price (from) www.mindfactory.de
maybe for the start its not needed to be every product but maybe all the cases or all the lcd monitors or something like this.
i think its easiest to take the url from the special categorie of the produkt, and than try to take the first page with its 20 products and find the elements with the name and the price and list them along to another in an excel table.

The answer I gave was a small adaptation of SearchBot(), the beginner web scraping tutorial on the home page.

Under Tools > References… add these reference libraries: Microsoft HTML Object Library and Microsoft Internet Controls.

Sub GetData()
 
    Dim objIE As InternetExplorer
    Dim itemEle As Object
    Dim desc As String, pt1 As String, pt2 As String, price As String
    Dim y As Integer
 
    Set objIE = New InternetExplorer
    objIE.Visible = True
 
    objIE.navigate "http://www.mindfactory.de/Hardware/Monitore+(TFT).html"
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
 
    y = 1
 
    For Each itemEle In objIE.document.getElementsByClassName("item_grid_2_cols")
        desc = itemEle.getElementsByTagName("h3")(0).innerText
        pt1 = itemEle.getElementsByClassName("prod_preis")(0).getElementsByTagName("a")(0).textContent
        pt2 = itemEle.getElementsByClassName("prod_preis")(0).getElementsByTagName("a")(1).textContent
        price = pt1 & pt2
        Sheets("Sheet1").Range("A" & y).Value = desc
        Sheets("Sheet1").Range("B" & y).Value = price
        y = y + 1
    Next
 
End Sub

 

reddit Q&A - How to extract HTML, Excel result
reddit Q&A – How to extract HTML, Excel result

Finding the solution was fun because the layout of the target website was very convoluted, requiring the price ‘dollars’ and ‘cents’ to be initially captured in two different variables.

To download this Excel VBA project: reddit – HTML scrape mindfactory.xlsm

And here’s a similar reddit Q&A: Grabbing an HTML table and inserting the cells into Excel?

 
Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

Launch VBA code macro from a GO or RUN button

To add a GO or RUN button to an Excel sheet to start a VBA macro/subroutine…

Decide where on the sheet to place the button. I’ll make some room along the top the row…

Add 'GO' or 'RUN' button to start Excel VBA macro
Add ‘GO’ or ‘RUN’ button to start Excel VBA macro

Click the Insert tab > Illustrations section > Shapes, and choose a shape…

Insert shape onto Excel sheet
Insert shape onto Excel sheet

Position the shape where and how you want it…

Position shape on Excel sheet
Position shape on Excel sheet

Right-click the shape and click Edit text. Enter GO, RUN, START, or whatever you want the button to say…”

Right-click shape and 'Edit text'
Right-click shape and ‘Edit text’

Right-click the shape and click Assign Macro…, then choose which macro will start when the shape is clicked…

Right-click and 'assign macro' to the shape
Right-click and ‘assign macro’ to the shape

Choose the macro and click OK
Choose the macro and click OK

That’s it — just click the shape whenever you want to start your macro or chain of macros…

Click shape to run your VBA macro/subroutine
Click shape to run your VBA macro/subroutine

To download ths Excel VBA project: getelementsbytagname2.xlsm

Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

How to add VBA References – Internet Controls, HTML Object Library

In order to control the Internet Explorer web browser, VBA needs 2 additional reference libraries:

Microsoft HTML Object Library   and   Microsoft Internet Controls

 
 

Here’s how to add them…

Open the VBA Editor, and click Tools > References…

Visual Basic Editor - Tools > References
Visual Basic Editor – Tools > References

Make sure these 4 References are active by default:

  • Visual Basic for Applications
  • Microsoft Excel xx.x Object Library
  • OLE Automation
  • Microsoft Office xx.x Object Library
VBA default reference libraries
VBA default reference libraries

Scroll down a tick these 2 web-related References:

  • Microsoft HTML Ojbect Library
  • Microsoft Internet Controls
Add MS HTML Object Library, MS Internet Controls
Add MS HTML Object Library, MS Internet Controls

Resulting in…

VBA - Needed references for building web bots
VBA – Needed references for building web bots
Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

VBA – Web scraping with getElementsByTagName()

We already looked at getElementByID and getElementsByClassName as VBA methods for hooking onto web page elements.

But when elements (HTML tags) on a page don’t have an ID or class, another common approach for accessing them is using getElementsByTagName.

Examples:

objIE.document.getElementsByTagName("div")

   creates an array (group) of all div elements on a page.

Debug.Print objIE.document.getElementsByTagName("p").Length

   displays number of p (paragraph) elements on a page in the console.

Debug.Print objIE.document.getElementsByTagName("p")(0).innerHTML

   displays inner HTML of 1st p element on a page.

Debug.Print objIE.document.getElementsByTagName("p")(4).textContent

   displays text content of 5th p element on a page.

For Each ele In objIE.document.getElementsByTagName("input")
If ele.Value = "Sign In" then ele.Click: Exit For
Next

   clicks an input element that has a value equal to ‘Sign In’

For Each ele In objIE.document.getElementsByTagName("a")
If InStr(ele.href, "wp-admin") > 0 then objIE.navigate ele.href: Exit For
Next

   gets the link element containing ‘wp-admin’ and navigates to it.

Let’s practice. Let’s say every morning we need to create a report which requires going to a website, copying a bunch of data, and pasting it into Excel, and we’re wanting to automate the task…

I’ll just use a list of last names as example data. Doing this manually would look like…

Example data - a huge list of last names
Example data – a huge list of last names

Paste into Excel…

Pasting data into Excel - tedious!
Pasting data into Excel – tedious!

Not too hard, but many of our ‘Excel + web data’ tasks are more involved. And if they are daily, or otherwise repititive, you could save hours per year by automating them. The Sub below performs our task within seconds … and the code is eternally reusable. Assuming this is a 5-minute task, daily, building an automation super macro would save 30 hours a year!

Paste the code below into a VBA code module (how to open a new VBA module), or download below.

Sub GrabLastNames()
 
    'dimension (set aside memory for) our variables
    Dim objIE As InternetExplorer
    Dim ele As Object
    Dim y As Integer
 
    'start a new browser instance
    Set objIE = New InternetExplorer
    'make browser visible
    objIE.Visible = True
 
    'navigate to page with needed data
    objIE.navigate "http://names.mongabay.com/most_common_surnames.htm"
    'wait for page to load
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop
 
    'we will output data to excel, starting on row 1
    y = 1
 
    'look at all the 'tr' elements in the 'table' with id 'myTable',
    'and evaluate each, one at a time, using 'ele' variable
    For Each ele In objIE.document.getElementById("myTable").getElementsByTagName("tr")
        'show the text content of 'tr' element being looked at
        Debug.Print ele.textContent
        'each 'tr' (table row) element contains 4 children ('td') elements
        'put text of 1st 'td' in col A
        Sheets("Sheet1").Range("A" & y).Value = ele.Children(0).textContent
        'put text of 2nd 'td' in col B
        Sheets("Sheet1").Range("B" & y).Value = ele.Children(1).textContent
        'put text of 3rd 'td' in col C
        Sheets("Sheet1").Range("C" & y).Value = ele.Children(2).textContent
        'put text of 4th 'td' in col D
        Sheets("Sheet1").Range("D" & y).Value = ele.Children(3).textContent
        'increment row counter by 1
        y = y + 1
    'repeat until last ele has been evaluated
    Next
 
    'save the Excel workbook
    ActiveWorkbook.Save
 
End Sub

By using getElementsByTagName (especially when getElementByID and getElementsByClassName are not an option) you can scrape data from any web page at the push of a button.

What tasks can you automate?

To download this Excel VBA project: getelementsbytagname.xlsm

Recommended next: reddit Q&A – How to extract data from HTML into Excel

Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

How to remove your name and other meta info from an Excel file

Excel saves a lot of meta data in your Workbooks by default. Even personal information (like your name!) is stored, which may not always be desirable.

Excel - document properties
Excel – document properties

The best way to remove your name and other info from a workbook is not from within Excel, but from its containing folder in Windows:

Start by right-clicking the Excel file, then click Properties

Right-click Excel file, click Properties
Right-click Excel file, click Properties

In the Properties pane, click the Details tab, then Remove Properties and Personal Information

File Property Details - remove properties and personal info
File Property Details – remove properties and personal info

Next, in the Remove Properties pane, click Remove the following properties from this file, uncheck the box next to Authors and Last saved by (and any other info you want), and click OK

Remove Properties pane
Remove Properties pane

The items you selected are now removed from the file. That is, until you open the file again! Unfortunately, you’ll have to do this each time you open and close the file, as there is no way to prevent the automatic storing of this information.

Additionally, and annoyingly, Excel (and other software) does not allow removal of certain meta data like Date created, Date modified, and Date acessed (notice below, no check boxes next to these items). The Computer name, however, only looks like its being stored; this is actually just showing the computer currently viewing the file properties…

Som file properties can't be removed
Som file properties can’t be removed
Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

Speed-write VBA code by recording a macro

Sometimes the fastest and easiest way to write a tricky bit of VBA code is to have Excel do it for you, by recording a macro, then tweaking the created code.

For example, in our article about VBA’s getElementsByTagName() method, we wrote a subroutine to automate scraping data from a table on a page by pulling content from the child elements of <tr> tags…

VBA project example
VBA project example

Now let’s say we want to sort the data and make the headings bold. I think the VBA code to make text bold is something like Range(“A1”).Font.Bold, but honestly I don’t remember the exact syntax so I’d usually just take a guess or two then turn to Google to jog my memory. Same with sorting — I’d definitely need some help with that one.

But an easier way, especially for the harder stuff, is to simply record what you want to happen in a macro, then steal the VBA code that Excel created dynamcally during recording.

Let’s try it…

Make sure Excel’s Developer tab is showing and click the Record Macro button…

'Record Macro' button in Dev tab
‘Record Macro’ button in Excel Developer tab

Give your new macro a name…

Name your new Excel macro
Name your new Excel macro

Perform the actions you want to capture code for (in this case, sorting columns)…

Sorting Excel data in 4 clicks
Sorting Excel data in 4 clicks

Click the Stop Recording macro button…

'Stop Recording' macro button in Excel's Dev tab
‘Stop Recording’ macro button in Excel Developer tab

Click the newly-created module in the Visual Basic Editor. We can see a new VBA subroutine was automatically created by Excel when we recorded our macro…

New macro code dynamically-created by Excel in new module
New macro code dynamically-created by Excel in new module

Cut & paste the above macro code into your module, then call it by name in your Subroutine…

Cut & paste macro code to your module
Cut & paste macro code to your module

Now for making the column headings bold … you give that one a try!

To download this Excel VBA project: getelementsbytagname2.xlsm

Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

How to insert a VBA code Module in Excel’s Visual Basic Editor

To insert a new VBA code module in Excel’s Visual Basic Editor, first open the Visual Basic Editor

Excel - click Visual Basic
Excel – click Visual Basic

Then click View > Project Explorer

VBE - click View, Project Explorer
VBE – click View, Project Explorer

Right-click VBAProject and click Insert > Module

VBE - click Insert, Module
VBE – click Insert, Module

A new, numbered module appears, waiting for your VBA code…

VBA - new Module1 created and waiting for VBA code
VBA – new Module1 created and waiting for VBA code
Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr

How to open Excel’s Developer tab and the Visual Basic Editor

Excel’s Developer tab does not show by default, so if you want to record macros or write macro code in VBA in the Visual Basic Editor, you’ll need to activate it

Open Excel’s File menu and click Options

Excel - Options in File menu
Excel – Options in File menu

Click Customize Ribbon

Excel - Customize Ribbon
Excel – click Customize Ribbon

Check the Developer box and click OK

Excel - check Developer box
Excel – check Developer box

Check the Developer tab…

Excel - click Developer tab
Excel – click Developer tab

To open the Visual Basic Editor, click the Visual Basic button, or shortcut keys ALT + F11.

Excel's Visual Basic Editor
Excel’s Visual Basic Editor
Facebooktwittergoogle_plusredditlinkedintumblrFacebooktwittergoogle_plusredditlinkedintumblr